Properly thorough testing of new code ideally requires data structured very much like what the code will encounter in production. The easiest solution is to clone production data into dev and test tiers. Do you or don't you?
One of the last things I worked on in my life as a captive employee in a major corporation was a data deidentification project. "Data deidentification," I hear you ask? Yup - mapping actual, production data onto fake test data of exactly the same structure (within and across multiple data inventories) but with none of the same values in a fashion that precludes reversing the mapping...no way to associate - identify - the data with a real person (customer).
This international corporation handled lots of highly confidential information, medical, financial, familial, contact, etc., about our customer base. After a merger with a multinational corporation, our new ant overlords required us to ensure that zero sensitive information that could be tied to real customers was ever loaded into IT data testbeds.
This model seems to me to presume that programmers, QA, and business unit testers are untrustworthy either in fact or in perception. I'm very trustworthy AND very cynical, both due to high self-interest.
Being able to pretend to defend against programmer malfeasance strikes me mainly as a cynical, albeit imperfectly considered, legal CYA trick intended to shield the corporation from litigation. In my view, smart programmers CAN steal information pretty much at will.
Teaching programmers not to do evil, instilling correct ethics and then trusting them, is a cultural imperative - companies that fail this deserve a dishonorable doom. However, my old shop elected to deidentify production data to populate testbeds. The process was hugely complex and expensive to build (said I as the overall system architect and the engineer of one major part). It was expensive to pass production data through a filter that guaranteed consistent cross-system reference relations with zero violations of confidentiality.
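To make the "consistent cross-system reference relations" requirement concrete, here is a minimal sketch of one common approach - deterministic keyed hashing. This is my own illustration, not the system described above: the record shapes and the `pseudonymize` helper are invented for the example. The idea is that the same real identifier always maps to the same fake token across every system, so joins still line up in the testbed, while the secret key keeps the mapping irreversible to anyone without it.

```python
import hmac
import hashlib

# Secret key held only by the deidentification pipeline; without it,
# the mapping from real to fake identifiers cannot be reproduced.
SECRET_KEY = b"rotate-me-and-keep-me-in-a-vault"

def pseudonymize(value: str, field: str) -> str:
    """Map a real value to a stable fake token.

    The same (field, value) pair always yields the same token, so a
    customer_id pseudonymized in the CRM extract matches the same
    customer_id pseudonymized in the billing extract - referential
    integrity survives, but the token reveals nothing about the original.
    """
    digest = hmac.new(SECRET_KEY, f"{field}:{value}".encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

# Two records from different systems referencing the same customer:
crm_record = {"customer_id": "C-10042", "name": "Jane Doe"}
billing_record = {"customer_id": "C-10042", "amount": 129.95}

crm_record["customer_id"] = pseudonymize(crm_record["customer_id"], "customer_id")
billing_record["customer_id"] = pseudonymize(billing_record["customer_id"], "customer_id")

# Cross-system joins still work after deidentification:
assert crm_record["customer_id"] == billing_record["customer_id"]
```

A real pipeline is far messier - format-preserving replacements for typed fields like dates and phone numbers, free-text scrubbing, key rotation policy - but the keyed-hash core is what buys "consistent everywhere, reversible nowhere."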
I get that corporate liability issues matter - lawsuits happen and should, when companies fail adequately to protect confidential data of their customers. However, I also get that you better be able to trust your developers with your reputation.
Creating solid data testbeds is a MUST. Protecting production data is a MUST. Decisions, decisions...
How do you generate test data?
Ramiro Berrelleza
Co-founder & CEO at Okteto
This model seems to me to presume that programmers, QA, and business unit testers are untrustworthy either in fact or in perception.
It seems like good practice to me to only rely on trust when it's needed. When people don't need to have access to something, don't give it to them.
I don't think you can reasonably assume that among all the employees in a big corporation, there's nobody who is curious about his ex-girlfriend's medical record, or nobody who might have opened a questionable email attachment.
Believing organizations can trust their internal employees with data is also not really in agreement with history, I feel. Various CIA spies, Snowden leaks*, Facebook, Snapchat, others ... And those are just the big ones, not the individual employees looking up their neighbours.
*(I'm personally glad Snowden leaked that data, but from an information security perspective, it's not something to use as an example.)
If that's unconvincing, it's also very likely plain illegal in Europe to give random programmers access to people's medical data. Really fast way to lose millions of euros.
TLDR: if your answer to the titular question is "yes", maybe don't post it online.
Another problem with using production data for tests, unrelated to privacy, is that if you're designing something new, there is no production data to test with.
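When there is no production data yet, seeded synthetic generation is one fallback. A sketch under my own assumptions - the record shape and field names below are invented for illustration, not anyone's actual schema. Seeding the random source makes failing tests replayable, and none of the values ever touched production.

```python
import random
import string

def synthetic_customers(n: int, seed: int = 42) -> list:
    """Generate reproducible fake customer records for a brand-new system.

    The same seed always produces the same records, so a test failure
    can be reproduced exactly on another machine.
    """
    rng = random.Random(seed)
    first_names = ["Ada", "Grace", "Alan", "Edsger", "Barbara"]
    last_names = ["Lovelace", "Hopper", "Turing", "Dijkstra", "Liskov"]
    records = []
    for i in range(n):
        name = f"{rng.choice(first_names)} {rng.choice(last_names)}"
        # Throwaway domain under .example so no real address is ever hit.
        domain = "".join(rng.choices(string.ascii_lowercase, k=8))
        records.append({
            "customer_id": f"T-{i:05d}",
            "name": name,
            "email": f"user{i}@{domain}.example",
            "balance": round(rng.uniform(0, 10_000), 2),
        })
    return records

# Same seed, same data -- deterministic across runs.
assert synthetic_customers(3) == synthetic_customers(3)
```

Hand-rolled generators like this capture the shape of the data but rarely its messiness; libraries and dedicated tools exist for richer distributions, but the reproducibility-via-seed principle is the same.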
I've always struggled with this. From one side, having production data is the best way to validate that your systems behave correctly, especially for input-intensive applications like APIs, chats, etc.
But having customer data on a developer's machine is dangerous. It's not only about developers stealing information. There are many other ways that customer data can get compromised: pasting it into a chat or a support ticket, forwarding it in an email, even getting the laptop stolen (this actually happened at a place I worked).
I think it's better to err on the side of caution. Either don't give your developers access to customer data, or invest in proper deidentification. I recently went to a meetup where the founders of Tonic.ai talked about this topic extensively. They are building Tonic to make this process as cheap as possible. I haven't used it myself, but from the demos it looked promising.