Industries that handle sensitive information—like healthcare, government, or finance—function under heavy regulations to protect consumer data. Among these regulations are de-identification requirements that dictate what, how, and when personally identifiable information (PII) should be removed, masked, or anonymized in a company’s datasets prior to use in software development and testing.
Adhering to these requirements is essential to comply with laws like HIPAA and GDPR and build trust with customers and partners. In this article, we’ll discuss what data de-identification is, the types of data you must de-identify, and how companies in several industries implement data de-identification techniques to protect data while enabling product innovation.
What are de-identified datasets?
De-identified datasets refer to sets of personal information that have identifiable details removed, replaced, or adjusted. This protects the individual's identity while still allowing software developers, AI engineers, and data scientists to use the data for testing, development and model training.
Examples of data that might be de-identified include:
- Names
- Social Security numbers
- Gender
- Zip codes
- Addresses
Data types
De-identified datasets are made up of two types of data: direct and indirect identifiers. Let’s review the differences.
Direct identifiers
Direct identifiers are exactly what they sound like: data points that tell you who a particular individual is. Examples include:
- Full name
- Social Security number
- Driver's license number
- Email address
These data points pose the greatest risk of exposure, so they are a high priority for de-identification.
Indirect identifiers
An indirect identifier could reveal who an individual is, but it often requires supplementary data to make that determination. These data types include:
- Zip codes
- Birth dates
- Ethnicity
- Occupation
To effectively minimize privacy risks, both direct and indirect identifiers should be anonymized when de-identifying data.
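As a rough illustration of treating the two identifier types differently, the sketch below drops direct identifiers outright and generalizes indirect ones. The field names, the choice of which fields count as direct or indirect, and the generalization rules are all assumptions made for this example, not a prescribed standard.

```python
from datetime import date

# Hypothetical field names; a real dataset needs its own classification review.
DIRECT_IDENTIFIERS = {"full_name", "ssn", "drivers_license", "email"}


def generalize_birth_date(value: date) -> int:
    """Keep only the year of a birth date."""
    return value.year


def generalize_zip(value: str) -> str:
    """Keep the first three digits of a zip code and blank out the rest."""
    return value[:3] + "**"


# Indirect identifiers are coarsened rather than removed (illustrative rules).
INDIRECT_GENERALIZERS = {
    "birth_date": generalize_birth_date,
    "zip_code": generalize_zip,
}


def de_identify(record: dict) -> dict:
    """Drop direct identifiers, generalize indirect ones, keep everything else."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # direct identifiers are removed entirely
        if field in INDIRECT_GENERALIZERS:
            cleaned[field] = INDIRECT_GENERALIZERS[field](value)
        else:
            cleaned[field] = value
    return cleaned


record = {
    "full_name": "Ada Example",
    "email": "ada@example.com",
    "birth_date": date(1990, 5, 17),
    "zip_code": "94107",
    "visit_count": 3,
}
print(de_identify(record))
# {'birth_date': 1990, 'zip_code': '941**', 'visit_count': 3}
```

Generalizing instead of deleting indirect identifiers keeps some analytical value (age cohorts, regions) while making each record harder to single out.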
Uses for de-identified datasets
While securing and masking personal data is important broadly speaking, there are several industries where privacy and compliance are absolutely crucial. Let’s look at a few examples.
Healthcare
Healthcare industry software developers use data to build, test, and validate products that help providers communicate, speed service delivery, and improve patient outcomes. De-identifying the datasets they use allows them to improve their products based on real-world scenarios without exposing personal information.
Finance
Companies building software for financial institutions rely on data to build products and features that improve operational efficiencies, manage compliance and risk, and predict the outcomes of investments. By de-identifying data like account numbers, transaction histories, and credit scores, the developers building these tools can pursue innovative solutions to improve the industry while protecting the identities of the individuals included in the datasets.
Research
Data is essential for research-focused entities that run studies to drive discoveries, test hypotheses, and validate findings. De-identifying this data lets them work with it without exposing the identities behind individual records, so they can conduct meaningful large-scale analysis.
Marketing & analytics
Marketing teams and data analysts collect data like email addresses, browsing histories, and purchasing records to understand customer behavior, optimize campaigns, and improve user experiences. To ethically extract insights and personalize services without exposing individual identities, they de-identify the data to hide sensitive details.
Government
Software developers building government applications and AI models need access to de-identified data to design tools that enhance policy decision-making, budget management, and public service delivery. De-identifying details like demographics and income levels ensures that individual data is secure and inaccessible so these companies can test and improve their tools without breaching privacy laws.
How to de-identify data
Numerous approaches exist to de-identify data, ranging from custom scripts to advanced software solutions, and from simple redaction to complex data synthesis. Regardless of the approach, the process involves detecting sensitive data within a dataset and either removing or altering it to obscure the sensitive information. Key methods and techniques include:
Data masking
Data masking anonymizes data with techniques like:
- Static data masking: Masks data in one direction as it is copied from production systems into non-production environments, producing realistic, consistent datasets that can be refreshed as needed.
- Dynamic data masking: Masks data at the moment of access, based on the user’s role-based permissions.
- On-the-fly data masking: Masks data as it flows between production, development, and testing environments for high efficiency.
- Statistical data masking: Masks data while retaining its statistical integrity, so analysis of the masked data remains accurate.
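To make the difference between static and dynamic masking more concrete, here is a minimal sketch of each. It assumes a keyed hash as the static transform (one of many possible choices) and a single hypothetical "admin" role for dynamic masking. The key, the token format, and the role name are invented for illustration.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key belongs in a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"


def static_mask(value: str) -> str:
    """Replace a value with a keyed-hash token.

    The same input always yields the same token, so joins across tables in the
    masked copy still line up.
    """
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "tok_" + digest[:12]


def dynamic_mask(value: str, role: str) -> str:
    """Mask at read time: privileged roles see the value, others see a partial view."""
    if role == "admin":  # hypothetical privileged role
        return value
    if len(value) <= 4:
        return "*" * len(value)  # too short to reveal any part safely
    return "*" * (len(value) - 4) + value[-4:]


print(dynamic_mask("123-45-6789", "analyst"))  # *******6789
print(dynamic_mask("123-45-6789", "admin"))    # 123-45-6789
print(static_mask("ada@example.com") == static_mask("ada@example.com"))  # True
```

In the static case the masked value is written into the non-production copy once. In the dynamic case the stored value is unchanged and the mask is applied per request.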
Safe Harbor method
HIPAA’s Safe Harbor method specifies 18 types of identifiers, such as names, phone numbers, and addresses, that must be removed for a dataset to be considered de-identified under HIPAA. It minimizes risk through the systematic removal or masking of these identifiers.
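The sketch below shows what a Safe Harbor-style pass over a single record might look like. It covers only a handful of the 18 identifier categories (a few removed fields, dates reduced to the year, and ages over 89 grouped as "90+"), and the field names are hypothetical. Treat it as an illustration of the idea, not a compliance tool.

```python
from datetime import date

# Hypothetical field names standing in for a few Safe Harbor categories.
REMOVE_FIELDS = {"name", "phone", "email", "ssn", "street_address"}


def safe_harbor_sketch(record: dict) -> dict:
    """Apply a small, illustrative subset of Safe Harbor-style rules to one record."""
    out = {}
    for field, value in record.items():
        if field in REMOVE_FIELDS:
            continue  # listed identifiers are removed outright
        if isinstance(value, date):
            out[field + "_year"] = value.year  # keep only the year of a date
        elif field == "age" and value > 89:
            out[field] = "90+"  # ages over 89 are aggregated into one category
        else:
            out[field] = value
    return out


record = {
    "name": "Ada Example",
    "phone": "555-0100",
    "admission_date": date(2023, 3, 14),
    "age": 93,
    "diagnosis": "J45",
}
print(safe_harbor_sketch(record))
# {'admission_date_year': 2023, 'age': '90+', 'diagnosis': 'J45'}
```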
Expert Determination method
HIPAA’s Expert Determination method requires working with a qualified expert who analyzes the dataset and its context to ensure that individuals cannot be identified, even indirectly. It is more flexible than the Safe Harbor method, but it requires a qualified statistician or data expert to assess and mitigate the risk of re-identification.
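One statistic an expert might use when assessing re-identification risk is k-anonymity: the size of the smallest group of records that share the same combination of indirect identifiers. HIPAA does not mandate this particular measure. The sketch below is only an example of the kind of check involved, and the field names and sample rows are invented.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group (0 for an empty dataset)."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def risky_groups(records, quasi_identifiers, k):
    """List the value combinations shared by fewer than k records."""
    sizes = group_sizes(records, quasi_identifiers)
    return [combo for combo, n in sizes.items() if n < k]


records = [
    {"zip3": "100", "birth_year": 1980, "diagnosis": "A"},
    {"zip3": "100", "birth_year": 1980, "diagnosis": "B"},
    {"zip3": "100", "birth_year": 1975, "diagnosis": "A"},
]
print(k_anonymity(records, ["zip3", "birth_year"]))      # 1
print(risky_groups(records, ["zip3", "birth_year"], 2))  # [('100', 1975)]
```

A result of 1 means at least one person is unique on those fields, which is a signal to generalize further (for example, widening birth year into a range) before release.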
De-identify your data with Tonic.ai
Integrating de-identification solutions in your software and AI development workflows is essential for maintaining privacy, meeting compliance standards, and enabling safe data usage across industries. Tonic.ai offers industry-leading platforms that generate high-fidelity, de-identified structured and unstructured data to accelerate innovation by equipping your developers with the data they need. Connect with our team to learn more.
FAQs
Data de-identification and data anonymization are synonymous umbrella terms: both refer to the process of obscuring or altering data to protect sensitive information. Many approaches and techniques fall under de-identification or anonymization, including data masking, data tokenization, and data synthesis. For more information, read this guide to data anonymization.
Under HIPAA’s Expert Determination method, experts analyze the dataset for direct and indirect identifiers and use statistics to measure and mitigate risks, evaluate access controls, and address vulnerabilities while aiming to maintain the utility of the de-identified data.
Whether zip codes can be included in a de-identified dataset depends on the type of dataset and the regulations the data is subject to. Most privacy regulations consider addresses to be personally identifiable information (PII), though there can be exceptions, depending on the purpose for which zip codes are collected (e.g. solely for shipment delivery). However, HIPAA has specific requirements for de-identifying zip codes in order to protect individuals in areas with small populations. HIPAA’s Safe Harbor method dictates that the first three digits of the zip code can be included if the geographic area formed by all zip codes sharing those three digits contains more than 20,000 people. If the area doesn’t meet this threshold, the first three digits of the code must be changed to 000.
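The Safe Harbor zip code rule can be sketched as a small lookup. The 20,000-person threshold comes from the HIPAA rule. The function name, the population table, and its figures are hypothetical stand-ins for Census-derived data you would need to supply.

```python
POPULATION_THRESHOLD = 20_000  # Safe Harbor threshold for a three-digit zip area


def safe_harbor_zip(zip_code: str, zip3_population: dict) -> str:
    """Return the three-digit prefix if its area is populous enough, else '000'.

    zip3_population maps a three-digit prefix to the combined population of all
    zip codes sharing that prefix. Unknown prefixes are treated as too small.
    """
    prefix = zip_code[:3]
    if zip3_population.get(prefix, 0) > POPULATION_THRESHOLD:
        return prefix
    return "000"


# Hypothetical population figures, for illustration only.
populations = {"941": 350_000, "036": 12_000}

print(safe_harbor_zip("94107", populations))  # 941
print(safe_harbor_zip("03601", populations))  # 000
```

Defaulting unknown prefixes to "000" errs on the side of disclosing less when the population data is incomplete.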