Technical deep dive

Anonymizing your data in Db2 for better testing and development

Author

Chiara Colombi

October 20, 2021

IBM Db2: An Overview

The Db2 family of products developed by IBM includes some of the oldest relational database management systems (RDBMS) on the market, with development of IBM's first relational database beginning back in the 1970s. Db2 began as a strictly relational system, but over the years it has added object-relational and non-relational features such as XML, graph, and JSON support. It now also offers AI capabilities and support for multi-cloud environments.

Db2 comes in three distinct flavors, each targeting a different platform:

Db2 LUW is IBM's database server. LUW stands for "Linux, Unix, Windows," the most common operating systems the server runs on.
Db2 for z/OS was introduced in 2006, and is the version of Db2 that runs on IBM's 64-bit operating system for IBM z/Architecture mainframes.
Db2 for i was previously known as Db2 iSeries, and is IBM's integrated relational database. It runs on the IBM i platform, which IBM positions as giving companies a lower cost of ownership.

Key features of Db2 include in-depth querying, accelerated hybrid transaction analytical processing (HTAP) performance, Oracle SQL compatibility, AI and ML capabilities, and actionable compression. Though it is one of the oldest RDBMSs on the market, at the time of writing it ranks fifth among relational databases on db-engines.com, and seventh overall.

Challenges Creating Fake Data in Db2

Companies using Db2 frequently need to create mock data that mimics production data for the purposes of QA, testing, and development. Creating homegrown de-identified data using scripts might seem like an inexpensive way to get the job done, but it poses a number of challenges, particularly when working with Db2.

First and foremost, hashing test data in-house is extremely time-consuming. Depending on the size of the database, it can involve weeks or months of monotonous work. And once the work is done...it isn't done. Data is a living organism, so it needs to be refreshed constantly. If not, you run the risk of missing bugs during testing, or presenting inaccurate software performance results to your executive team.
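For a sense of what such a homegrown script involves, here is a minimal sketch of keyed hashing applied to rows that have already been pulled out of the database. It is an illustration under stated assumptions, not a recommended implementation: the table layouts, the column names, the `pseudonymize` helper, and the `SECRET_KEY` value are all hypothetical. It uses HMAC-SHA256 rather than a bare hash so that the output cannot be reproduced without the key.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would come from a secrets manager.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY, length: int = 16) -> str:
    """Return a stable, keyed pseudonym for a single value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical rows standing in for two related Db2 tables.
customers = [{"customer_id": "C-1001", "email": "ana@example.com"}]
orders = [{"order_id": 1, "customer_id": "C-1001"}]

masked_customers = [
    {
        "customer_id": pseudonymize(row["customer_id"]),
        "email": pseudonymize(row["email"]) + "@example.invalid",
    }
    for row in customers
]
masked_orders = [
    {"order_id": row["order_id"], "customer_id": pseudonymize(row["customer_id"])}
    for row in orders
]

print(masked_customers)
print(masked_orders)
```

Because the same input and key always produce the same pseudonym, the masked `customer_id` still joins across both tables. Even this small example hints at the maintenance burden: every sensitive column has to be listed by hand, and the script has to be rerun and revisited each time the data or the schema changes.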

In addition, if you're running Db2 on the mainframe, your data types may not have LUW equivalents. This means you may need to convert the data in order to format it properly. You also cannot register an object in a test data management (TDM) tool, which introduces more complexity to the test data generation process.
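As one illustration of the kind of conversion this can mean, the sketch below decodes two representations commonly associated with mainframe data: EBCDIC-encoded text and packed decimal numbers. It assumes the raw bytes come from an unloaded mainframe record and that the text uses code page 037; both are assumptions, since the actual encoding and layout depend on your system. The function names are hypothetical.

```python
from decimal import Decimal


def decode_ebcdic(raw: bytes, codepage: str = "cp037") -> str:
    """Decode EBCDIC text and drop trailing pad spaces.

    cp037 (EBCDIC US/Canada) is an assumption; other code pages are common.
    """
    return raw.decode(codepage).rstrip()


def unpack_packed_decimal(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed decimal field: two digits per byte, sign in the last nibble."""
    if not raw:
        raise ValueError("empty packed decimal field")
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign_nibble = nibbles.pop()  # the final nibble carries the sign
    if any(n > 9 for n in nibbles):
        raise ValueError("invalid digit nibble in packed decimal field")
    value = int("".join(str(n) for n in nibbles))
    if sign_nibble in (0x0B, 0x0D):  # 0xB and 0xD mark negative values
        value = -value
    return Decimal(value).scaleb(-scale)


name = decode_ebcdic(bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x40, 0x40]))
amount = unpack_packed_decimal(bytes([0x12, 0x34, 0x5C]), scale=2)
print(name, amount)
```

A real migration would need one such rule per column, driven by the source table's definition, which is part of why the conversion step adds so much effort.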

Another challenge developers continually face is the difficulty of creating test data that accurately mimics production data. This is a huge ask of someone who is attempting to execute data anonymization manually because there are so many nuances, data links, and data types to identify and replicate. This is especially difficult in databases like Db2 that allow both structured and unstructured data. While it is a gargantuan task for a single developer, it is absolutely necessary in order to render synthetic data that will perform the same as your production data.

The most important driving factor for teams attempting to generate test data based on their Db2 databases is data privacy. Creating homegrown test data opens up potential opportunities for personally identifiable information (PII) to leak into your test data. With unstructured data, data strings can contain pieces of PII, even if the field in which they are contained seems innocuous. Without a method for de-identifying personal info inside text strings or embedded tables, the risk of exposing data to groups that shouldn't have access is high.
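To see why free-text fields are risky, consider a minimal sketch of a pattern-based scan over a notes column. The patterns and the `redact` helper are hypothetical and deliberately simple: they cover only email addresses and US-style Social Security and phone numbers, and they are not a complete PII detector.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(text: str) -> tuple[str, list[str]]:
    """Replace each pattern match with a label and report which labels fired."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found


note = "Call Ana at 555-867-5309 or ana@example.com about SSN 123-45-6789."
clean, labels = redact(note)
print(clean)
print(labels)
```

The name "Ana" passes through untouched because no regular expression can reliably recognize a personal name. That gap is exactly the kind of leak that pattern-only scripts leave behind in unstructured fields.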

You can search and sort through thousands of records to identify potential PII by hand or using a script. Or you can hook your Db2 database up to Tonic.

Creating Realistic Test Data by Mimicking Data in Db2

Tonic eliminates the difficulties involved in creating test data in Db2 by sitting between your production database and lower environments to safely de-identify and generate your test data. Creating synthetic data in Db2 is simple with Tonic because it automates the most challenging tasks of the process for you:

Synthesizing data across tables while linking related columns to preserve your data’s complexity, utility, and privacy.
Identifying, obfuscating, and transforming PII so you don't have to worry about it sneaking into lower environments.
Offering dozens of customized data generators that allow you to build a model of your data that will accurately replicate your real-world data.
Running generations as often as you need, so that your data stays fresh and up to date.
Proactively protecting your sensitive data with automatic schema change alerts and differential privacy, which provides mathematical guarantees of data privacy.
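For readers unfamiliar with differential privacy, the idea behind its mathematical guarantee is often introduced through the Laplace mechanism, which adds calibrated random noise to a query result. The sketch below is a textbook illustration only and makes no claim about how Tonic implements the feature; the function names, the `epsilon` values, and the example count are all assumptions.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return a count with Laplace noise of scale sensitivity / epsilon.

    A smaller epsilon means more noise and a stronger privacy guarantee.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)  # seeded only so that the demo is repeatable
print(noisy_count(100, epsilon=1.0, rng=rng))
print(noisy_count(100, epsilon=0.1, rng=rng))
```

Adding or removing a single person changes a count by at most one, which is why the sensitivity defaults to 1. The noise hides that individual contribution while keeping the result close to the true value on average.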

Ready to equip your teams with a faster, safer way to generate test data for Db2? Get in touch with our team; we’re excited to show you what we’ve built.

Chiara Colombi

Director of Product Marketing

A bilingual wordsmith dedicated to the art of engineering with words, Chiara has over a decade of experience supporting corporate communications at multi-national companies. She once translated for the Pope; it has more overlap with translating for developers than you might think.