What is Data Masking?

A bilingual wordsmith dedicated to the art of engineering with words, Chiara has over a decade of experience supporting corporate communications at multi-national companies. She once translated for the Pope; it has more overlap with translating for developers than you might think.
Author
Chiara Colombi
November 16, 2023

Data masking is a data transformation method used to protect sensitive data by replacing it with a non-sensitive substitute. Often the goal of masking data is to allow the use of realistic test or demo data for development, testing, and training purposes while protecting the privacy of the sensitive data on which it is based.

Data masking can be done in a variety of ways, both in terms of the high-level approach, which is determined by where the data lives and how the end user needs to interact with it, and in terms of the entity-level transformations applied to de-identify the data. In this guide, we’ll define both the high-level approaches to masking data and the data masking techniques used to achieve masked data at the entity level.

Types of Data Masking

Data masking can be achieved by way of three primary high-level approaches. These approaches differ based on where the data is stored and on the functional requirements of the masked output data. Those requirements determine which approach you should take.

Static Data Masking

Static data masking is masking performed on data at rest (aka data in a data store), and the process is designed to permanently replace sensitive data with non-sensitive substitutes. This approach to masking data creates data that is read/write, an essential quality for test data for software development and QA teams. Static data masking may be performed on a traditional relational database, such as a PostgreSQL test database, in a NoSQL database like MongoDB, or on file-based data like CSV or JSON files.
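As a minimal sketch of static masking, assume a hypothetical `users` table with `id` and `email` columns in a SQLite test database. The table, the column names, and `MASKING_KEY` are illustrative assumptions rather than part of any particular tool. The stored values are overwritten in place, and the table stays fully writable afterwards:

```python
import hashlib
import hmac
import sqlite3

# Hypothetical secret; in practice it would be kept outside of source control.
MASKING_KEY = b"example-masking-key"


def mask_email(email: str) -> str:
    """Derive a stable, keyed stand-in address from the real one."""
    digest = hmac.new(MASKING_KEY, email.lower().encode(), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"


def mask_users_in_place(conn: sqlite3.Connection) -> int:
    """Permanently overwrite every email in the (hypothetical) users table."""
    rows = conn.execute("SELECT id, email FROM users").fetchall()
    conn.executemany(
        "UPDATE users SET email = ? WHERE id = ?",
        [(mask_email(email), user_id) for user_id, email in rows],
    )
    conn.commit()
    return len(rows)


# Usage: build a throwaway test database, then mask it at rest.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("ada@corp.test",), ("grace@corp.test",)],
)
masked_count = mask_users_in_place(conn)
```

Because the replacement is derived from the original with a keyed hash, the same input always maps to the same masked value, which keeps joins on that column consistent across tables.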

Dynamic Data Masking

Dynamic data masking is masking performed on data in transit by way of a proxy. This process for masking data is designed to leave the original at-rest data intact and unaltered. The masked data isn’t stored anywhere and is read-only, making dynamic masking appropriate for simple data access management but not a usable approach for software development and testing workflows.
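The idea can be sketched with a small read-time wrapper that stands in for the proxy. A real proxy would sit on the database wire protocol. Here the function `masked_query`, the `MASKERS` rules, and the `patients` table are all hypothetical names chosen for illustration:

```python
import sqlite3
from typing import Callable, Dict, Iterator

# Hypothetical per-column masking rules applied at read time.
MASKERS: Dict[str, Callable[[str], str]] = {
    "email": lambda v: "***@" + v.split("@")[-1],
    "ssn": lambda v: "***-**-" + v[-4:],
}


def masked_query(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> Iterator[dict]:
    """Run a read query and mask sensitive columns before returning rows."""
    cursor = conn.execute(sql, params)
    columns = [col[0] for col in cursor.description]
    for row in cursor:
        yield {
            name: MASKERS[name](value) if name in MASKERS and value is not None else value
            for name, value in zip(columns, row)
        }


# Usage: the caller only ever sees masked rows; the stored rows are untouched.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, email TEXT, ssn TEXT)")
conn.execute("INSERT INTO patients VALUES (1, 'ada@corp.test', '123-45-6789')")
masked_rows = list(masked_query(conn, "SELECT id, email, ssn FROM patients"))
```

Nothing is written back, so the masked rows exist only in the response, which is why this approach suits access control rather than building a writable test database.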

On-the-fly Data Masking

To a certain degree, on-the-fly data masking can be thought of as a combination of the dynamic and static methods. It involves altering sensitive data in transit before it is saved to disk, with the goal of having only masked data reach a target destination.
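A minimal sketch of the same idea, assuming the data moves as CSV between a source stream and a target stream. The function name `mask_in_transit` and the sample columns are illustrative assumptions:

```python
import csv
import io
from typing import Callable, Dict, TextIO


def mask_in_transit(source: TextIO, target: TextIO,
                    maskers: Dict[str, Callable[[str], str]]) -> int:
    """Copy CSV rows from source to target, masking each row before it is written."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    written = 0
    for row in reader:
        writer.writerow({k: maskers[k](v) if k in maskers else v for k, v in row.items()})
        written += 1
    return written


# Usage: in-memory streams stand in for the real source and destination.
source = io.StringIO("id,name,plan\n1,Ada Lovelace,pro\n2,Grace Hopper,free\n")
target = io.StringIO()
rows_written = mask_in_transit(source, target, {"name": lambda _: "REDACTED"})
```

Each row is masked before the write call, so unmasked values never land at the destination.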

How Data Masking Works

Within each of the above high-level approaches, a variety of transformation techniques can be applied to the data to achieve a masked output dataset. These techniques can be as simple as replacing existing data with random values pulled from a library, or as complex as recreating the statistics of the original values in a new, but still realistic, distribution of values. When it comes to masking data, here are a few of the more common techniques:

  • Redaction: This involves removing or obscuring confidential information from a document or record, by replacing the original data with generic figures like x’s or, famously, blackened bars. Data redaction is one of the most well-known ways to protect data, but also arguably the least useful for maintaining realism. 
  • Scrambling: This technique involves taking data and rearranging it in a way that makes it difficult to read or interpret. For example, you could scramble the letters in a word or scramble the order of a sentence. 
  • Shuffling: Similar to scrambling, shuffling involves rearranging data. However, instead of rearranging characters at the field level, shuffling can involve moving the values around within a column. This ensures realism in that the original values still appear within the column, but they are no longer necessarily tied to the same records. This can be useful when working with categorical data, whose values and distribution need to be preserved.
  • Substitution: This technique for masking data involves replacing sensitive data with other data that is similar in nature—think redaction but with the added value of realism. Real values are replaced with realistic values. This technique can also be configured to preserve the statistics or format of the real values. It can be highly valuable in preserving data utility for software development and testing.
  • Encryption: This is among the most secure techniques for masking data. It involves converting data into a code that can only be read by someone who has the encryption key. This ensures that even if someone gains access to the data, they won't be able to read it without the key. Format-preserving encryption takes this technique one step further by ensuring that the encrypted values share the same format as the original values, to provide strong security alongside strong utility for software development and testing.
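The first four techniques above can be sketched in a few lines each. This is an illustrative sketch using only the Python standard library. The function names and sample data are assumptions made for the example, not the API of any masking product:

```python
import hashlib
import hmac
import random
from typing import Dict, List, Sequence


def redact(value: str, keep_last: int = 0, fill: str = "x") -> str:
    """Redaction: overwrite characters with a filler, optionally keeping a suffix."""
    hidden = max(len(value) - keep_last, 0)
    return fill * hidden + value[hidden:]


def scramble(value: str, rng: random.Random) -> str:
    """Scrambling: rearrange the characters within a single field."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


def shuffle_column(rows: List[Dict[str, str]], column: str,
                   rng: random.Random) -> List[Dict[str, str]]:
    """Shuffling: move values around within one column, across records."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def substitute(value: str, replacements: Sequence[str], key: bytes) -> str:
    """Substitution: pick a realistic replacement, consistently for the same input."""
    digest = hmac.new(key, value.encode(), hashlib.sha256).digest()
    return replacements[int.from_bytes(digest[:8], "big") % len(replacements)]


# Usage with a seeded generator so runs are repeatable.
rng = random.Random(42)
rows = [
    {"name": "Ada", "plan": "pro"},
    {"name": "Grace", "plan": "free"},
    {"name": "Alan", "plan": "pro"},
]
shuffled = shuffle_column(rows, "plan", rng)
```

Note that shuffling keeps the column's overall distribution intact, while substitution with a keyed hash keeps the mapping consistent, so the same real value always becomes the same fake value.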
An illustrative example of format-preserving encryption
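To make the format-preserving idea concrete, here is a toy sketch built as a small Feistel network over decimal digits, with HMAC-SHA256 as the round function. It is an assumption-laden illustration only: it is not the standardized FF1 or FF3-1 algorithm, it has not been reviewed for security, and all function names are hypothetical. A real deployment would rely on a vetted implementation.

```python
import hashlib
import hmac


def _round_value(key: bytes, round_no: int, half: str, modulus: int) -> int:
    """Keyed pseudo-random number derived from the round number and one half."""
    digest = hmac.new(key, f"{round_no}:{half}".encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _check(digits: str, rounds: int) -> None:
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()) or rounds % 2:
        raise ValueError("need at least two ASCII digits and an even number of rounds")


def encrypt_digits(key: bytes, digits: str, rounds: int = 10) -> str:
    """Map a digit string to another digit string of the same length."""
    _check(digits, rounds)
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in range(rounds):
        m = u if i % 2 == 0 else v  # width of the half being replaced this round
        c = (int(a) + _round_value(key, i, b, 10 ** m)) % 10 ** m
        a, b = b, str(c).zfill(m)
    return a + b


def decrypt_digits(key: bytes, digits: str, rounds: int = 10) -> str:
    """Undo encrypt_digits by running the rounds in reverse order."""
    _check(digits, rounds)
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in reversed(range(rounds)):
        m = u if i % 2 == 0 else v
        c = (int(b) - _round_value(key, i, a, 10 ** m)) % 10 ** m
        a, b = str(c).zfill(m), a
    return a + b


def transform_formatted(transform, key: bytes, value: str) -> str:
    """Apply a digit transform to the digits only, leaving separators in place."""
    digits = iter(transform(key, "".join(ch for ch in value if ch.isdigit())))
    return "".join(next(digits) if ch.isdigit() else ch for ch in value)


key = b"demo-key"  # hypothetical key for illustration
masked_ssn = transform_formatted(encrypt_digits, key, "123-45-6789")
restored_ssn = transform_formatted(decrypt_digits, key, masked_ssn)
```

The masked value keeps the `NNN-NN-NNNN` shape, so downstream validation and column types keep working, and the key holder can recover the original.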

By identifying the best high-level masking approach for their use case and combining these data masking techniques within that approach, organizations can ensure that their sensitive data is protected from unauthorized access, while also maximizing their teams’ productivity. But which teams need data masking to do their best work? Let’s take a closer look at the use cases for masking data to better understand their goals.

Common Use Cases for Data Masking

Many teams in an organization can benefit from data masking to ensure data privacy while simplifying access to quality data to fuel productivity. Here are several common use cases for masking data in today’s companies.

Software testing and QA

In order to build a proper staging environment for testing and QA, organizations need usable test data that represents production as closely as possible. For organizations with sensitive data in production, creating realistic, masked data is a key ingredient for building a quality staging environment. Representative data makes it much easier and more reliable for developers and testers to catch bugs in staging before they’re pushed live to production. It also enables them to validate fixes in staging environments. Without representative data, a staging environment is effectively useless for complex testing, and developers and testers will find themselves validating fixes in production.

Software development

Developers typically work in their own ‘sandbox’ or development environment. Often, they’re running their applications on their own computers and need manageable datasets to work with in order to validate their work. Both to ensure data security and to ensure access to representative datasets, data masking can be fundamental for equipping developers with data they can safely and effectively use for software development. When masking is paired with subsetting—scaling a database down to a targeted, coherent slice of representative data—developers can best streamline their productivity and workflows.
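As a rough sketch of pairing subsetting with masking, assume two hypothetical in-memory tables, `customers` and `orders`, linked by `customer_id`. The function below keeps only the chosen customers and the orders that reference them, so the slice stays coherent, and it masks names along the way:

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, object]


def subset_and_mask(customers: List[Row], orders: List[Row],
                    keep_ids: Set[int]) -> Tuple[List[Row], List[Row]]:
    """Keep only the chosen customers and their orders, then mask customer names."""
    kept_customers = [
        {**c, "name": f"Customer {c['id']}"} for c in customers if c["id"] in keep_ids
    ]
    kept_orders = [o for o in orders if o["customer_id"] in keep_ids]
    return kept_customers, kept_orders


customers = [
    {"id": 1, "name": "Ada Lovelace"},
    {"id": 2, "name": "Grace Hopper"},
    {"id": 3, "name": "Alan Turing"},
]
orders = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 3},
    {"id": 12, "customer_id": 1},
]
small_customers, small_orders = subset_and_mask(customers, orders, keep_ids={1})
```

Real schemas have many more relationships to follow, but the principle is the same: every foreign key in the subset should still point at a row that made it into the subset.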

Sales demos and customer onboarding

Since software runs on data, software demos and training also require data in order to run smoothly. It goes without saying that demos and training should not run on real-world data. But spinning up realistic demo data has become increasingly difficult as our data ecosystems have grown in scale and complexity. Masking data is a powerful approach for ensuring the availability of quality, representative datasets for sales demos, employee training, and customer onboarding. By creating demo data from real-world production data, teams can craft datasets that best spotlight their products’ features and capabilities. Here, too, pairing masking with subsetting enables crafting tailored datasets for specific use cases, industries, or customer journeys.

Data analytics

Data privacy regulations set limits on the length of time production data can be stored by an organization. In order to perform analytics over time, teams need a way to preserve their historic data that doesn’t infringe on these limitations. Masking data by way of tokenization eliminates all PII/PHI/sensitive data and enables compliant long-term data storage, as it allows you to store de-identified data instead of storing real-world data. Tokenized data is compliant with regulations like GDPR, HIPAA, and CCPA, and it also fully preserves the data’s utility for data analytics.
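A minimal sketch of why tokenized data stays useful for analytics: each real value is swapped for a random token, and the same value always receives the same token, so counts and groupings survive. The `TokenVault` class is a hypothetical illustration, and whether the mapping is retained, stored separately, or discarded is a policy decision outside this sketch:

```python
import secrets
from collections import Counter
from typing import Dict


class TokenVault:
    """Maps each real value to a random token; repeated values reuse their token."""

    def __init__(self) -> None:
        self._tokens: Dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value not in self._tokens:
            self._tokens[value] = "tok_" + secrets.token_hex(8)
        return self._tokens[value]


vault = TokenVault()
visits = ["ada@corp.test", "grace@corp.test", "ada@corp.test"]
tokenized = [vault.tokenize(v) for v in visits]

# Analytics such as visits per user still work on the tokens alone.
visits_per_user = Counter(tokenized)
```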

Off-shore teams

Data masking is also used in the field of outsourcing. Many organizations outsource their business processes to third-party vendors. However, sharing sensitive data with these vendors can create a security risk. Data masking allows organizations to share masked data that is safe for vendors to use and does not put the organization at risk.

Pros and Cons of Data Masking

When implemented effectively, masking data provides a wealth of advantages, though there are a few caveats to consider as well. The lists below provide an overview of data masking pros and cons.

Pros of Data Masking
  • Ensures access to safe, useful data for a variety of teams, including software development and testing, data science, customer success, and sales functions
  • Streamlines workflows and team productivity
  • Enables global organizations to grant data access to off-shore teams
  • Protects sensitive data from unauthorized access, exposure, breach, or leakage
  • Satisfies the compliance requirements of data privacy regulations and quality standards and certifications
  • Reduces the overall risk of data breaches and cyber attacks

Cons of Data Masking
  • Not an easy solution to build in-house, especially when working with complex or highly regulated data
  • Open source solutions like Faker often aren’t adequate for today’s data
  • Given that today’s data is in constant flux, data masking requires maintenance over time; it isn’t a one-and-done solution
  • Can make it difficult to perform certain types of analysis, as masked data may not be suitable for certain types of queries or calculations

How to Implement Data Masking in Your Organization

Implementing data masking in your organization is an important step towards ensuring the safety and security of your sensitive data. Not only is data masking a best practice for data privacy, it is increasingly a legal requirement for organizations today. It is essential to implement data masking in a way that fully complies with the regulations your organization is subject to. Keep in mind: GDPR, which regulates the use, rights, and protection of consumer data in Europe; CCPA, which does the same for consumers based in California (essentially setting a baseline data privacy standard for all of the US); HIPAA, which regulates healthcare data in the US; and PCI DSS, which is a critical security standard for payment card data in the financial industry. At the time of this writing, 11 US states had signed privacy bills into law, and another 5 US states had active bills progressing through the legislative process.

State           Law                                                        Status        Effective On
California      California Consumer Privacy Act                            Passed, 2018  Jan 1, 2020
California      California Privacy Rights Act                              Passed, 2020  Jan 1, 2023
Virginia        Virginia Consumer Data Protection Act                      Passed, 2021  Jan 1, 2023
Colorado        Colorado Privacy Act                                       Passed, 2021  July 1, 2023
Connecticut     Connecticut Data Privacy Act                               Passed, 2022  July 1, 2023
Utah            Utah Consumer Privacy Act                                  Passed, 2022  Dec 31, 2023
Oregon          Oregon Consumer Privacy Act                                Passed, 2023  July 1, 2024
Texas           Texas Data Privacy and Security Act                        Passed, 2023  July 1, 2024
Montana         Montana Consumer Data Privacy Act                          Passed, 2023  Oct 1, 2024
Iowa            Iowa Consumer Data Protection Act                          Passed, 2023  Jan 1, 2025
Tennessee       Tennessee Information Protection Act                       Passed, 2023  July 1, 2025
Indiana         Indiana Consumer Data Protection Act                       Passed, 2023  Jan 1, 2026
Delaware        Delaware Personal Data Privacy Act                         In progress
Massachusetts   Multiple bills                                             In progress
New Jersey      New Jersey Disclosure and Accountability Transparency Act  In progress
North Carolina  Consumer Data Privacy Act                                  In progress
Pennsylvania    Consumer Data Protection Act                               In progress

Historically, organizations have implemented data masking in a number of ways. Many begin by building in-house solutions for masking data. These may rely on custom scripts or freely available open source tools like Faker (which, as of 2022, was downloaded 2.4 million times per week) to patch together data masking workarounds. The in-house approach can work for simpler use cases and smaller teams early in their growth, but it quickly becomes ineffective as an organization’s data becomes more complex. In general, given their patched-together nature, in-house scripts can’t guarantee the same level of privacy as purpose-built tools. What’s more, they require endless maintenance as your data changes over time.

Teams that outgrow in-house solutions often turn next to legacy test data management (TDM) tools. By legacy TDM software, we mean earlier generations of data masking software. In addition to data masking, these tools may also offer database virtualization and orchestration. An important distinction of legacy TDM tools is that they often prioritize data security over data utility in their execution of data masking techniques, meaning that realistic test data is not their end goal. This is where they often fall short in providing useful masked data for software testing and development. In addition, since their technology is older, their UI and underlying architecture often reflect their age, resulting in a less user-friendly experience and more pain points when it comes to working with data at scale. Simply put, they aren’t built to work with today’s complex data pipelines and modern CI/CD workflows.

In response to the gaps of in-house solutions and legacy TDM tools, modern data platforms have entered the market, designed to better manage and scale with the complexity of today’s data. Unlike legacy software, these newer solutions place a stronger emphasis on maintaining data realism to ensure data utility in software development and testing. At the same time, they incorporate more modern data security techniques like differential privacy to guarantee data privacy along the way. These platforms offer today’s teams a streamlined approach to data masking, with native integrations to data stores (from SQL Server to Snowflake), modern workflow automations, and full access by way of API. They are purpose-built for enabling developer productivity by maximizing the quality of, and ease of access to, safe, realistic test data.

When implementing data masking, it is important to consider the type of data being masked and the level of security required. It is also essential to ensure that the masking technique used does not compromise the integrity or quality of the data. Regular testing and auditing should be done to ensure that the masking technique is effective and that the sensitive data remains secure. By implementing data masking techniques, you can protect your sensitive data from unauthorized access and comply with industry regulations, ensuring the safety and security of your organization's information.
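One simple form of the regular testing mentioned above is a leak check that compares masked output against the original sensitive values. The sketch below is illustrative: `find_leaks` and the sample rows are hypothetical, and the check assumes masked values should never equal an original. It therefore would not suit a shuffled column, where original values are expected to reappear.

```python
from typing import Dict, List, Sequence, Tuple


def find_leaks(original_rows: List[Dict[str, str]],
               masked_rows: List[Dict[str, str]],
               sensitive_columns: Sequence[str]) -> List[Tuple[int, str]]:
    """Return (row index, column) pairs where an original sensitive value survived."""
    sensitive_values = {row[col] for row in original_rows for col in sensitive_columns}
    leaks = []
    for index, row in enumerate(masked_rows):
        for column, value in row.items():
            if value in sensitive_values:
                leaks.append((index, column))
    return leaks


original = [
    {"id": "1", "email": "ada@corp.test"},
    {"id": "2", "email": "grace@corp.test"},
]
masked = [
    {"id": "1", "email": "user_a@example.com"},
    {"id": "2", "email": "grace@corp.test"},  # deliberately left unmasked
]
leaks = find_leaks(original, masked, ["email"])
```

A check like this could run in CI after each masking job, failing the pipeline whenever the returned list is not empty.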

The Tonic test data platform is a modern solution built for today's engineering organizations, for their complex data ecosystems, and for the CI/CD workflows that require realistic, secure test data in order to run effectively. To learn more, explore our product pages, or connect with our team.

Chiara Colombi
Director of Product Marketing
