Do You Test with Cloned Production Data?

Richard Uie
·Mar 30, 2019

Thorough testing of new code ideally requires data structured very much like what the code will encounter in production. The easiest solution is to clone production data into dev and test tiers. Do you or don't you?

One of the last things I worked on in my life as a captive employee in a major corporation was a data deidentification project. "Data deidentification?" I hear you ask. Yup: mapping actual production data onto fake test data of exactly the same structure (within and across multiple data inventories) but with none of the same values, in a fashion that precludes reversing the mapping. There is no way to associate (identify) the data with a real person (a customer).
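For readers who want something concrete, here is a minimal sketch of the idea in Python. It is not the system I describe in this post; it only illustrates one common approach, a keyed hash (HMAC) that maps each value to a fake value of the same shape. The key, the `deidentify` function, and the sample tables are all hypothetical. Note the limits: this toy can produce collisions, and a character can occasionally map to itself, so it does not strictly guarantee "none of the same values."

```python
import hashlib
import hmac
import string

# Hypothetical secret; it would have to be kept out of the dev and test tiers.
SECRET_KEY = b"example-key-not-for-real-use"


def deidentify(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a value to a same-shape fake: digits stay digits, letters stay
    letters (case preserved), everything else (dashes, spaces) is kept."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    out = []
    for i, ch in enumerate(value):
        b = digest[i % len(digest)]  # reuse digest bytes for long values
        if ch in string.digits:
            out.append(string.digits[b % 10])
        elif ch in string.ascii_lowercase:
            out.append(string.ascii_lowercase[b % 26])
        elif ch in string.ascii_uppercase:
            out.append(string.ascii_uppercase[b % 26])
        else:
            out.append(ch)
    return "".join(out)


# Two hypothetical "data inventories" that share a customer key.
customers = [{"customer_id": "C-10027", "ssn": "123-45-6789"}]
claims = [{"claim_id": "K-1", "customer_id": "C-10027"}]

fake_customers = [{k: deidentify(v) for k, v in row.items()} for row in customers]
fake_claims = [{k: deidentify(v) for k, v in row.items()} for row in claims]

# The same input always yields the same output, so the join still works.
print(fake_customers[0]["customer_id"] == fake_claims[0]["customer_id"])
```

Because the mapping is deterministic under one key, a customer ID deidentified in one inventory matches the same ID deidentified in another. Without the key, the fake values cannot simply be recomputed from guesses at the originals, but small value spaces remain open to brute force if the key ever leaks.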

This international corporation handled lots of highly confidential information (medical, financial, familial, contact, etc.) about our customer base. After a merger with a multinational corporation, our new ant overlords required us to ensure that zero sensitive information that could be tied to real customers was ever loaded into IT data testbeds.

This model seems to me to presume that programmers, QA, and business unit testers are untrustworthy either in fact or in perception. I'm very trustworthy AND very cynical, both due to high self-interest.

Being able to pretend to defend against programmer malfeasance strikes me mainly as a cynical, albeit imperfectly considered, legal CYA trick intended to shield the corporation from litigation. In my view, smart programmers CAN steal information pretty much at will.

Teaching programmers not to do evil, instilling correct ethics and then trusting them, is a cultural imperative - companies that fail at this deserve a dishonorable doom. However, my old shop elected to deidentify production data to populate testbeds. The process was hugely complex and expensive to build (said I as the overall system architect and the engineer of one major part). It was expensive to pass production data through a filter that guaranteed consistent cross-system reference relations with zero violations of confidentiality.
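To give a feel for the "zero violations" side, here is a rough sketch in Python of the simplest possible check: scan the deidentified rows for any value that also appears in a sensitive field of the originals. The function name `find_leaks` and the sample rows are made up for illustration, and passing this check is necessary but nowhere near sufficient; it says nothing about whether the mapping can be reversed.

```python
def find_leaks(original_rows, deidentified_rows, sensitive_fields):
    """Return (row_index, field, value) for every deidentified value that
    exactly matches a sensitive value from the original data."""
    sensitive_values = {
        row[field]
        for row in original_rows
        for field in sensitive_fields
        if row.get(field)
    }
    leaks = []
    for idx, row in enumerate(deidentified_rows):
        for field, value in row.items():
            if value in sensitive_values:
                leaks.append((idx, field, value))
    return leaks


# Hypothetical sample: the SSN slipped through the filter unchanged.
originals = [{"name": "Ada Lovelace", "ssn": "123-45-6789"}]
testbed = [{"name": "Xyz Qrstuvwx", "ssn": "123-45-6789"}]

print(find_leaks(originals, testbed, ["name", "ssn"]))
```

A check like this has to run inside the production boundary, since it needs the real values to compare against.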

I get that corporate liability issues matter - lawsuits happen, and should, when companies fail to adequately protect their customers' confidential data. However, I also get that you had better be able to trust your developers with your reputation.

Creating solid data testbeds is a MUST. Protecting production data is a MUST. Decisions, decisions...

How do you generate test data?