The basics of data security
No one can deny the value of data for today’s organizations. With the ongoing rise of data breaches and cyber attacks, it is increasingly essential for organizations and government agencies to protect sensitive data from unauthorized access, use, disclosure, modification, or destruction. Data security is the practice of implementing measures to ensure the confidentiality, integrity, and availability of data to the appropriate end users.
There are many techniques used in data security. In this article, we’ll focus on data privacy and two of the most popular approaches to protecting sensitive data: data masking and data tokenization. At their essence, these are both techniques for generating fake data, but they achieve it in distinct, technically complex ways, and it is essential to understand their differences in order to choose the right approach for your organization.
What is data masking?
Data masking is a data transformation method used to protect sensitive data by replacing it with a non-sensitive substitute. Often the goal of data masking is to allow the use of realistic test or demo data for development, testing, and training purposes while protecting the privacy of the sensitive data on which it is based.
Data masking can be done in a variety of ways, both in terms of the high-level approach determined by where the data lives and how the end user needs to interact with it, and in terms of the entity-level transformations applied to de-identify the data.
Briefly, the high-level approaches include:
- Static data masking: Masking performed on data at rest (aka data in a data store), which permanently replaces sensitive data with non-sensitive substitutes. This approach creates masked data that is read/write, an essential quality for test data used by software development and QA teams. Static data masking may be performed on a traditional database, such as a PostgreSQL test database, on NoSQL databases like MongoDB, or on file-based data like CSV or JSON files; a brief sketch of this approach follows the list.
- Dynamic data masking: Masking performed on data in transit by way of a proxy, which leaves the original at-rest data intact and unaltered. The masked data isn’t stored anywhere and is read-only, which makes dynamic masking unsuitable for software development and testing workflows.
- On-the-fly data masking: This can to a certain degree be thought of as a combination of the dynamic and static methods. It involves altering sensitive data in transit before it is saved to disk, with the goal of having only masked data reach a target destination.
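To make the static approach concrete, here is a minimal sketch of static masking applied to file-based data. It assumes a hypothetical customers.csv with name and email columns and uses the open source Faker library to generate realistic substitutes; it is illustrative only, not a production implementation.

```python
# A minimal sketch of static data masking on a CSV file.
# Assumes a hypothetical customers.csv with "name" and "email" columns;
# uses the open source Faker library (pip install faker) for substitutes.
import csv
from faker import Faker

fake = Faker()

with open("customers.csv", newline="") as src, \
     open("customers_masked.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Permanently replace sensitive values with realistic stand-ins.
        row["name"] = fake.name()
        row["email"] = fake.email()
        writer.writerow(row)
```

Because the masked copy is written to disk as its own artifact, it remains fully read/write for downstream development and testing, which is the defining property of the static approach.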
Within each of these high-level approaches, a variety of transformation techniques can be applied to the data. Some examples include the following; a short code sketch after the list illustrates a few of them:
- Redaction: This involves removing or obscuring confidential information from a document or record by replacing the original data with generic figures like x’s or, famously, blacked-out bars. Data redaction is one of the most well-known ways to protect data, but also arguably the least useful for maintaining realism.
- Scrambling: This technique involves taking data and rearranging it in a way that makes it difficult to read or interpret. For example, you could scramble the letters in a word or the order of the words in a sentence.
- Shuffling: Similar to scrambling, shuffling involves rearranging data. However, instead of rearranging characters at the field level, shuffling can involve moving the values around within a column. This ensures realism in that the original values still appear within the column, but they are no longer necessarily tied to the same records. This can be useful when working with categorical data, whose values and distribution need to be preserved.
- Substitution: This technique involves replacing sensitive data with other data that is similar in nature—think redaction but with the added value of realism. Real values are replaced with realistic values. This technique can also be configured to preserve the statistics or format of the real values. It can be highly valuable in preserving data utility for software development and testing.
- Encryption: This is among the most secure data masking techniques. It involves converting data into a code that can only be read by someone who has the encryption key. This ensures that even if someone gains access to the data, they won't be able to read it without the key. Format-preserving encryption takes this technique one step further by ensuring that the encrypted values share the same format as the original values, to provide strong security alongside strong utility for software development and testing.
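To illustrate how a few of these transformations differ in practice, here is a minimal, standard-library-only sketch in Python. The field values are hypothetical, and real masking tools handle consistency, referential integrity, and format preservation far more rigorously.

```python
# Illustrative sketches of a few masking transformations (hypothetical data).
import hashlib
import random

def redact(value: str) -> str:
    # Redaction: obscure every character with a generic placeholder.
    return "x" * len(value)

def scramble(value: str) -> str:
    # Scrambling: rearrange the characters within a single field.
    chars = list(value)
    random.shuffle(chars)
    return "".join(chars)

def shuffle_column(values: list[str]) -> list[str]:
    # Shuffling: rearrange values across a column, preserving the values
    # and their distribution but breaking their link to specific records.
    shuffled = values[:]
    random.shuffle(shuffled)
    return shuffled

def substitute(value: str, replacements: list[str]) -> str:
    # Substitution: deterministically map a real value to a realistic
    # stand-in, so the same input always yields the same output.
    digest = int(hashlib.sha256(value.encode()).hexdigest(), 16)
    return replacements[digest % len(replacements)]

names = ["Ada Lovelace", "Alan Turing", "Grace Hopper"]
print(redact(names[0]))        # "xxxxxxxxxxxx"
print(scramble(names[1]))      # e.g. "nrAlgTia nu"
print(shuffle_column(names))   # same names, different order
print(substitute(names[2], ["Jane Doe", "John Smith"]))
```

Encryption and format-preserving encryption are deliberately omitted from the sketch; they depend on vetted cryptographic libraries and careful key management rather than a few lines of standalone code.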
By identifying the best high-level masking approach for their use case and applying a combination of these transformation techniques within it, organizations can ensure that their sensitive data is protected from unauthorized access while also maximizing their teams’ productivity.
Pros and cons of data masking
When implemented effectively, data masking provides a wealth of advantages, including:
- ensuring access to safe, useful data for a variety of teams, including software development and testing, data science, customer success, and sales functions;
- streamlining workflows and team productivity;
- enabling global organizations to grant data access to off-shore teams;
- protecting sensitive data from unauthorized access, exposure, breach, or leakage;
- satisfying the compliance requirements of data privacy regulations, quality standards, and certifications; and,
- reducing the overall risk of data breaches and cyber attacks.
The caveats and potential disadvantages of data masking include:
- It is not an easy solution to build in-house, especially when working with complex or highly regulated data. While there are open source solutions like Faker, these often aren’t adequate for growing software teams today.
- Given that today’s data is in constant flux, data masking requires maintenance over time—it isn’t a one-and-done solution. The ideal approach includes automation to streamline this maintenance as much as possible.
- Data masking can sometimes make it difficult to perform certain types of analysis, as the masked data may not be suitable for certain types of queries or calculations.
What is data tokenization?
Tokenization is a technique used to protect sensitive data by replacing it with a non-sensitive substitute called a token. The token represents the original data, but it does not reveal any sensitive information. The goal of data tokenization is to protect sensitive data while allowing authorized users to access and process the tokenized data. Tokenization is often used in the context of analysis, when the statistics of the data are important to preserve but the values need not look like real-world values.
Data tokenization can also be performed in a way that is format-preserving. This technique preserves the format and length of the original data while replacing it with a token. It is widely used in the financial sector, e-commerce, and other industries where sensitive data is transmitted and stored. By preserving the format of the data, it ensures that the data can be easily processed and used by the systems that require it, while at the same time protecting it from unauthorized access and theft.
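As a rough illustration of the pattern described above, the sketch below generates format-preserving tokens for card-number-like values and keeps the token-to-value mapping in an in-memory dictionary standing in for a token vault. Every name here is hypothetical, and a production system would use a hardened vault or a vaultless cryptographic scheme with collision handling and access controls.

```python
# A rough sketch of vault-style, format-preserving tokenization.
# The "vault" is an in-memory dict; production systems use a hardened
# token vault or a vaultless cryptographic scheme instead.
import secrets

class TokenVault:
    def __init__(self) -> None:
        self._value_to_token: dict[str, str] = {}
        self._token_to_value: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so the same value always maps to the
        # same token, which keeps joins and aggregate analytics intact.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # Format preservation: swap each digit for a random digit,
        # leaving length and separators intact. (Collision handling
        # is omitted for brevity.)
        token = "".join(
            secrets.choice("0123456789") if ch.isdigit() else ch
            for ch in value
        )
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Only callers with access to the vault can recover the original.
        return self._token_to_value[token]

vault = TokenVault()
token = vault.tokenize("4111-1111-1111-1111")
print(token)                    # e.g. "7302-9184-5526-0473"
print(vault.detokenize(token))  # "4111-1111-1111-1111"
```

Because the token keeps the length and separator format of the original value, downstream systems can process it without modification, while only callers with access to the vault can map it back to the real value.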
Advantages and disadvantages of data tokenization
Tokenization has several advantages, including:
- enabling secure data analysis for business intelligence and data science;
- providing strong protection for sensitive data;
- helping organizations comply with data privacy regulations, including allowing them to store data long-term without breaching compliance requirements; and,
- reducing the risk of data breaches and cyber attacks.
On the flip side, tokenization also has some disadvantages worth mentioning:
- It can be a complex process that requires technical knowledge and expertise to implement effectively.
- Depending on the scale of the data in scope, data tokenization can sometimes result in reduced system performance due to the additional processing power required to manage the tokens.
- It may not be suitable for all types of data and use cases, given that it decreases the realism of the data it generates.
Data masking vs. data tokenization: the comparison
As outlined above, data masking and data tokenization are two popular techniques used in data security, but they serve different purposes. Data masking protects sensitive data while allowing the use of realistic test or demo data, whereas data tokenization protects sensitive data while allowing authorized users to access and process the tokenized data, for example in analytics.
Differences in techniques and processes
When comparing data tokenization versus data masking, their core differences can be found in the techniques and processes they use. Data masking replaces sensitive data with a non-sensitive substitute that may look as realistic as the original, while tokenization replaces sensitive data with a token that is not intended to resemble a real-world value. Data masking can be performed in an extensive variety of ways, including redaction, scrambling, substitution, and encryption, while tokenization is achieved through a narrower set of approaches, including cryptographic methods and format-preserving data tokenization.
Role in data security and regulatory compliance
Both data masking and data tokenization play a vital role in data security and regulatory compliance, as they are fundamental approaches to protecting sensitive data. They help organizations comply with data privacy regulations and standards, such as GDPR, CCPA, HIPAA, and PCI DSS. They can also be used in combination to meet the differing needs of an organization’s teams in achieving the necessary level of data protection.
Making your choice based on a use case
Choosing between data masking and tokenization depends on your organization’s specific needs and use case. If your organization needs to preserve both the privacy and the utility of its data, while granting access to software development, customer success, and sales teams, data masking is likely the better choice. For example, if your organization has a SQL database that contains PII with real-world eccentricities you need to preserve in your lower environments, data masking might solve your use case. The variety of approaches available in data masking lets you shape the masked data based on the specific qualities of each data type. It offers greater flexibility and control over the look and feel of the data, allowing you to achieve the right balance of privacy and utility, which is particularly useful when crafting test data for software development and testing.
If, on the other hand, your organization needs to protect sensitive data while allowing authorized users to access and process the protected data for use in analytics, tokenization may be a better choice. Tokenization not only allows you to preserve the statistics of your real-world data, but also provides a secure way to store your data long-term, beyond the retention limits that regulations like GDPR place on storing your original data. If long-term data storage and analytics are your goal, data tokenization is likely your solution.
The Tonic test data platform is a modern solution built for today's engineering organizations, for their complex data ecosystems, and for the CI/CD workflows that require realistic, secure test data in order to run effectively. It offers both data masking and data tokenization capabilities within its versatile approaches to data transformation. To learn more, explore our product pages, or connect with our team.