Data masking is a data transformation method used to protect sensitive data by replacing it with a non-sensitive substitute. Often the goal of masking data is to allow the use of realistic test or demo data for development, testing, and training purposes while protecting the privacy of the sensitive data on which it is based.
Data masking can be done in a variety of ways, both in terms of the high-level approach determined by where the data lives and how the end user needs to interact with it, and in terms of the entity-level transformations applied to de-identify the data. In this guide, we'll define both the high-level approaches to masking data and the entity-level techniques used to achieve masked data.
Types of Data Masking
Data masking can be achieved through three primary high-level approaches. These approaches differ based on where the data is stored and the functional requirements of the masked output data, and those requirements will determine the approach you take.
Static Data Masking
Static data masking is masking performed on data at rest (aka data in a data store), and the process is designed to permanently replace sensitive data with non-sensitive substitutes. This approach to masking data creates data that is read/write, an essential quality for test data for software development and QA teams. Static data masking may be performed on a traditional relational database, such as a PostgreSQL test database, on a NoSQL database like MongoDB, or on file-based data like CSV or JSON files.
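To make this concrete, here's a minimal sketch of static masking over a CSV file using only the Python standard library. The file name, column name, and hash-based pseudonym scheme are illustrative assumptions, not a prescription:

```python
# A minimal static masking sketch: rewrite a CSV at rest so the masked
# copy permanently replaces sensitive values. "customers.csv" and the
# "email" column are hypothetical.
import csv
import hashlib

def mask_email(email: str) -> str:
    # Derive a stable pseudonym from a hash so the same input always
    # maps to the same masked value (useful for joins across tables).
    digest = hashlib.sha256(email.lower().encode()).hexdigest()[:10]
    return f"user_{digest}@example.com"

with open("customers.csv", newline="") as src, \
     open("customers_masked.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["email"] = mask_email(row["email"])
        writer.writerow(row)
```

Because the masked copy is a real, persistent dataset, it can be read and written freely in development and QA environments.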
Dynamic Data Masking
Dynamic data masking is masking performed on data in transit by way of a proxy. This process for masking data is designed to leave the original at-rest data intact and unaltered. The masked data isn’t stored anywhere and is read-only, making dynamic masking appropriate for simple data access management but not a usable approach for software development and testing workflows.
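As a toy illustration, a dynamic masking layer can be sketched as a thin wrapper that masks fields while query results pass through it, leaving the stored data untouched. The field names and the partial-mask rule below are assumptions for the example:

```python
# A toy dynamic masking sketch: mask sensitive fields in transit as
# results stream through a proxy layer; the underlying store is never
# modified. Field names are hypothetical.
from typing import Iterable

SENSITIVE_FIELDS = {"ssn", "email"}

def mask_value(value: str) -> str:
    # Reveal only the last four characters, as in a typical partial mask.
    return "*" * max(len(value) - 4, 0) + value[-4:]

def masked_results(rows: Iterable[dict]) -> Iterable[dict]:
    for row in rows:
        yield {
            key: mask_value(val) if key in SENSITIVE_FIELDS else val
            for key, val in row.items()
        }

# Rows fetched from the untouched production store:
rows = [{"name": "Ada", "ssn": "123-45-6789"}]
print(list(masked_results(rows)))  # ssn surfaces as *******6789
```

Since masking happens on read, consumers only ever see masked, read-only values.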
On-the-fly Data Masking
To a certain degree, on-the-fly data masking can be thought of as a combination of the dynamic and static methods. It involves altering sensitive data in transit before it is saved to disk, with the goal of having only masked data reach a target destination.
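A rough sketch of the idea, assuming line-delimited JSON flowing from a hypothetical production extract to a staging target; the record shape and the phone-number substitute are stand-ins:

```python
# An on-the-fly masking sketch: records are masked in transit, so only
# masked data is ever written to the target destination.
import json

def mask_record(record: dict) -> dict:
    masked = dict(record)
    masked["phone"] = "555-0100"  # fixed, clearly fake substitute
    return masked

def run_pipeline(source_path: str, target_path: str) -> None:
    with open(source_path) as src, open(target_path, "w") as dst:
        for line in src:  # stream line-delimited JSON records
            record = json.loads(line)
            dst.write(json.dumps(mask_record(record)) + "\n")
```

The key property is that the unmasked source data never reaches the destination.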
How Data Masking Works
Within each of the above high-level approaches, a variety of transformation techniques can be applied to the data to achieve a masked output dataset. These techniques can be as simple as replacing existing data with random values pulled from a library, or as complex as recreating the statistics of the original values in a new, but still realistic, distribution of values. When it comes to masking data, here are a few of the more common techniques, several of which are illustrated in a short sketch after the list:
- Redaction: This involves removing or obscuring confidential information from a document or record, by replacing the original data with generic figures like x’s or, famously, blackened bars. Data redaction is one of the most well-known ways to protect data, but also arguably the least useful for maintaining realism.
- Scrambling: This technique involves taking data and rearranging it in a way that makes it difficult to read or interpret. For example, you could scramble the letters in a word or scramble the order of words in a sentence.
- Shuffling: Similar to scrambling, shuffling involves rearranging data. However, instead of rearranging characters at the field level, shuffling can involve moving the values around within a column. This ensures realism in that the original values still appear within the column, but they are no longer necessarily tied to the same records. This can be useful when working with categorical data, whose values and distribution need to be preserved.
- Substitution: This technique for masking data involves replacing sensitive data with other data that is similar in nature—think redaction but with the added value of realism. Real values are replaced with realistic values. This technique can also be configured to preserve the statistics or format of the real values. It can be highly valuable in preserving data utility for software development and testing.
- Encryption: This is among the most secure techniques for masking data. It involves converting data into a code that can only be read by someone who has the encryption key. This ensures that even if someone gains access to the data, they won't be able to read it without the key. Format-preserving encryption takes this technique one step further by ensuring that the encrypted values share the same format as the original values, to provide strong security alongside strong utility for software development and testing.
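Here's the sketch promised above: toy, standard-library illustrations of redaction, scrambling, shuffling, and substitution. Production masking tools layer on consistency and referential integrity guarantees that these one-liners deliberately omit:

```python
# Toy illustrations of four masking techniques using only the standard
# library. The sample values and replacement list are made up.
import random

def redact(value: str) -> str:
    return "x" * len(value)                    # redaction: obscure entirely

def scramble(value: str) -> str:
    chars = list(value)
    random.shuffle(chars)                      # scrambling: rearrange characters
    return "".join(chars)

def shuffle_column(values: list) -> list:
    shuffled = values[:]
    random.shuffle(shuffled)                   # shuffling: same values, new rows
    return shuffled

def substitute(value: str, replacements: list) -> str:
    return random.choice(replacements)         # substitution: discard the original,
                                               # pick a realistic stand-in

names = ["Alice", "Bob", "Carol"]
print(redact("Alice"))                         # 'xxxxx'
print(scramble("Alice"))                       # e.g. 'icAle'
print(shuffle_column(names))                   # e.g. ['Carol', 'Alice', 'Bob']
print(substitute("Alice", ["Dana", "Evan"]))   # 'Dana' or 'Evan'
```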
By identifying the best high-level masking approach for your use case and using a combination of these data masking techniques within your approach, organizations can ensure that their sensitive data is protected from unauthorized access, while also maximizing their teams’ productivity. But which teams need data masking to maximize their work? Let’s take a closer look at the use cases for masking data to better understand their goals.
Common Use Cases for Data Masking
Many teams in an organization can benefit from data masking to ensure data privacy while simplifying access to quality data to fuel productivity. Here are several common use cases for masking data in today’s companies.
Software testing and QA
In order to build a proper staging environment for testing and QA, organizations need usable test data that represents production as closely as possible. For organizations with sensitive data in production, realistic masked data is a key ingredient for building a quality staging environment. Representative data makes it much easier and more reliable for developers and testers to catch bugs in staging before they're pushed live to production, and it enables them to validate fixes in staging as well. Without representative data, a staging environment is effectively useless for complex testing, and developers and testers will find themselves validating fixes in production.
Software development
Developers typically work in their own 'sandbox' or development environment. Often, they're running their applications on their own computers and need manageable datasets to work with in order to validate their work. Both for data security and for access to representative datasets, data masking can be fundamental to equipping developers with data they can safely and effectively use for software development. When masking is paired with subsetting (scaling a database down to a targeted, coherent slice of representative data), developers can best streamline their productivity and workflows, as sketched below.
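As a simplified illustration of how subsetting keeps a slice coherent, consider a hypothetical customers table and an orders table that references it; the subset keeps only the orders whose parent customer survives the cut:

```python
# A simplified subsetting sketch: select a target slice of parent rows,
# then keep only child rows that reference them, so the subset remains
# referentially intact. Table and column names are hypothetical.
customers = [
    {"id": 1, "region": "EU"},
    {"id": 2, "region": "US"},
    {"id": 3, "region": "EU"},
]
orders = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 2},
    {"id": 12, "customer_id": 3},
]

subset_customers = [c for c in customers if c["region"] == "EU"]
kept_ids = {c["id"] for c in subset_customers}
subset_orders = [o for o in orders if o["customer_id"] in kept_ids]
# Masking would then run over the subset only, yielding a small,
# coherent, de-identified development dataset.
```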
Sales demos and customer onboarding
Since software runs on data, software demos and training also require data in order to run smoothly. It goes without saying that demos and training should not run on real-world data. But spinning up realistic demo data has become increasingly difficult as our data ecosystems and scale have grown more complex. Masking data is a powerful approach for ensuring the availability of quality, representative datasets for sales demos, employee training, and customer onboarding. By creating demo data from real-world production data, teams can craft datasets that best spotlight their products' features and capabilities. Here, too, pairing masking with subsetting enables crafting tailored datasets for specific use cases, industries, or customer journeys.
Data analytics
Data privacy regulations set limits on how long production data can be stored by an organization. In order to perform analytics over time, teams need a way to preserve their historical data that doesn't infringe on these limits. Masking data by way of tokenization removes all PII, PHI, and other sensitive data and enables compliant long-term storage, as it allows you to store de-identified data instead of real-world data. Tokenized data supports compliance with regulations like GDPR, HIPAA, and CCPA while fully preserving the data's utility for analytics.
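As a minimal sketch of what tokenization can look like, here's a keyed, deterministic token built with Python's standard library; the key handling is deliberately simplified, and a real deployment would manage keys in a secrets store:

```python
# A tokenization sketch: replace an identifier with a keyed,
# deterministic token so the same input always yields the same token,
# enabling joins and longitudinal analytics on de-identified data.
import hmac
import hashlib

SECRET_KEY = b"example-key"  # illustrative only; manage real keys securely

def tokenize(value: str) -> str:
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# The same email tokenizes identically across datasets and over time:
print(tokenize("pat@example.com") == tokenize("pat@example.com"))  # True
```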
Off-shore teams
Data masking is also used in outsourcing. Many organizations outsource business processes to third-party vendors. However, sharing sensitive data with these vendors can create a security risk. Data masking allows organizations to share masked data that vendors can safely use without putting the organization at risk.
Pros and Cons of Data Masking
When implemented effectively, masking data provides a wealth of advantages, though there are a few caveats to consider, as well. Here is an overview of data masking pros and cons:
- Pros: protects sensitive data from unauthorized access; supports compliance with regulations like GDPR, HIPAA, and CCPA; and provides realistic, safe data for development, testing, demos, and analytics.
- Cons: requires ongoing maintenance as schemas and data change over time; can compromise data realism and utility if techniques are chosen poorly; and grows more complex to implement as data ecosystems scale.
How to Implement Data Masking in Your Organization
Implementing data masking in your organization is an important step towards ensuring the safety and security of your sensitive data. Not only is data masking a best practice for data privacy, it is increasingly a legal requirement for organizations today. It is essential to implement data masking in a way that fully complies with the regulations your organization is subject to. Keep in mind:
- GDPR, which regulates the use, rights, and protection of consumer data in Europe;
- CCPA, which does the same for consumers based in California, essentially setting a baseline data privacy standard for all of the US;
- HIPAA, which regulates healthcare data in the US; and
- PCI DSS, the security standard for organizations that handle payment card data.
At the time of this writing, 11 US states had signed privacy bills into law, and another 5 US states had active bills progressing through the legislative process.
Historically, organizations have implemented data masking in a number of ways. Many organizations begin with in-house solutions for masking data. These may rely on custom scripts or freely available open source tools like Faker, which, as of 2022, was downloaded 2.4 million times per week, to patch together data masking workarounds. The in-house approach can work for simpler use cases and smaller teams early in their growth, but it quickly becomes ineffective as an organization's data grows more complex. In general, given the patched-together nature of in-house scripts, they aren't able to guarantee the same level of privacy as purpose-built tooling. What's more, they require endless maintenance as your data changes over time.
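A typical starting point looks something like the snippet below, assuming the open source Faker library. It's quick to write, but keeping masked values consistent across tables and keeping pace with schema changes is left entirely to the script's author:

```python
# A minimal Faker-based masking script of the kind many teams start
# with: swap sensitive fields for random but plausible values. The row
# and its fields are hypothetical.
from faker import Faker

fake = Faker()
Faker.seed(42)  # seed for reproducible runs

row = {"name": "Jane Doe", "email": "jane@corp.com", "city": "Boston"}
masked = {
    "name": fake.name(),
    "email": fake.email(),
    "city": fake.city(),
}
print(masked)
```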
Teams that outgrow in-house solutions often turn next to legacy test data management tools. By legacy TDM software, we mean earlier generations of data masking software. In addition to data masking, these tools may also offer database virtualization and orchestration. An important distinction of legacy TDM tools is that they often prioritize data security over data utility in their execution of data masking techniques, meaning that realistic test data is not their end goal. This is where they often fall short in providing useful masked data for software testing and development. In addition, since their technology is older, their UI and underlying architecture often reflect their age, resulting in a less user-friendly experience and more pain points when working with data at scale. Simply put, they aren't built for today's complex data pipelines and modern CI/CD workflows.
In response to the gaps of in-house solutions and legacy TDM tools, modern data platforms have entered the market, designed to better manage and scale with the complexity of today's data. Unlike legacy software, these newer solutions place a stronger emphasis on maintaining data realism to ensure data utility in software development and testing. At the same time, they incorporate more modern data security techniques like differential privacy to guarantee data privacy along the way. These platforms offer today's teams a streamlined approach to data masking, with native integrations to data stores (from SQL Server to Snowflake), modern workflow automations, and full access by way of API. They are purpose-built for enabling developer productivity by maximizing the quality and ease of access to safe, realistic test data.
When implementing data masking, it is important to consider the type of data being masked and the level of security required. It is also essential to ensure that the masking technique used does not compromise the integrity or quality of the data. Regular testing and auditing should be done to ensure that the masking technique is effective and that the sensitive data remains secure. By implementing data masking techniques, you can protect your sensitive data from unauthorized access and comply with industry regulations, ensuring the safety and security of your organization's information.
The Tonic test data platform is a modern solution built for today's engineering organizations, for their complex data ecosystems, and for the CI/CD workflows that require realistic, secure test data in order to run effectively. To learn more, explore our product pages, or connect with our team.