
Use cases for de-identified datasets

Author
Chiara Colombi
February 3, 2025

A bilingual wordsmith dedicated to the art of engineering with words, Chiara has over a decade of experience supporting corporate communications at multi-national companies. She once translated for the Pope; it has more overlap with translating for developers than you might think.

Industries that handle sensitive information—like healthcare, government, or finance—function under heavy regulations to protect consumer data. Among these regulations are de-identification requirements that dictate what, how, and when personally identifiable information (PII) should be removed, masked, or anonymized in a company’s datasets prior to use in software development and testing.

Adhering to these requirements is essential to comply with laws like HIPAA and GDPR and build trust with customers and partners. In this article, we’ll discuss what data de-identification is, the types of data you must de-identify, and how companies in several industries implement data de-identification techniques to protect data while enabling product innovation.

What are de-identified datasets?

De-identified datasets are sets of personal information from which identifiable details have been removed, replaced, or adjusted. This protects each individual's identity while still allowing software developers, AI engineers, and data scientists to use the data for testing, development, and model training.

Examples of data that might be de-identified include:

  • Names
  • Social Security numbers
  • Gender
  • Zip codes
  • Addresses

Data types

De-identified datasets are made up of two types of data: direct and indirect identifiers. Let’s review the differences.

Direct identifiers

Direct identifiers are exactly what they sound like: data points that tell you who a particular individual is. They include:

  • Full name
  • Social Security number
  • Driver's license number
  • Email address

These data points pose the highest risk of exposure, so they are the top priority for de-identification.

Indirect identifiers

An indirect identifier could reveal who an individual is, but usually requires supplementary data to make that determination. These data types include:

  • Zip codes
  • Birth dates
  • Ethnicity
  • Occupation

To effectively minimize privacy risks, both direct and indirect identifiers should be anonymized when de-identifying data.

Uses for de-identified datasets

While securing and masking personal data is important broadly speaking, there are several industries where privacy and compliance are absolutely crucial. Let’s look at a few examples.

Healthcare

Healthcare industry software developers use data to build, test, and validate products that help providers communicate, speed service delivery, and improve patient outcomes. De-identifying the datasets they use allows them to improve their products based on real-world scenarios without exposing personal information.

Finance

Companies building software for financial institutions rely on data to build products and features that improve operational efficiencies, manage compliance and risk, and predict the outcomes of investments. By de-identifying data like account numbers, transaction histories, and credit scores, the developers building these tools can pursue innovative solutions to improve the industry while protecting the identities of the individuals included in the datasets.

Research

Data is essential for research-focused entities that run studies to drive discoveries, test hypotheses, and validate findings. De-identifying this data allows them to access it without viewing individual data points so they can conduct meaningful large-scale analysis.

Marketing & analytics

Marketing teams and data analysts collect data like email addresses, browsing histories, and purchasing records to understand customer behavior, optimize campaigns, and improve user experiences. To ethically extract insights and personalize services without exposing individual identities, they de-identify the data to hide sensitive details.

Government

Software developers building government applications and AI models need access to de-identified data to design tools that enhance policy decision-making, budget management, and public service delivery. De-identifying details like demographics and income levels ensures that individual data is secure and inaccessible so these companies can test and improve their tools without breaching privacy laws.

How to de-identify data

Numerous approaches exist to de-identify data, ranging from custom scripts to advanced software solutions, and from simple redaction to complex data synthesis. Regardless of the approach, the process involves detecting sensitive data within a dataset and either removing or altering it to obscure the sensitive information. Key methods and techniques include:

Data masking

Data masking anonymizes data with techniques like:

  • Static data masking: Unidirectionally masks data from production systems into non-production environments, producing realistic, consistent datasets that can be refreshed as needed.
  • Dynamic data masking: Masks data at the moment it is accessed, based on a user’s role-based permissions.
  • On-the-fly data masking: Masks data as it flows between production, development, and testing environments for high efficiency.
  • Statistical data masking: Masks data while retaining its statistical integrity for optimal analytical accuracy.
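To make the static masking idea concrete, here is a minimal Python sketch of deterministic masking: the same input always maps to the same pseudonym, so relationships across tables stay consistent in the masked copy. The field names, salt, and pseudonym format are illustrative assumptions, not part of any Tonic.ai API.

```python
import hashlib
import re

def mask_value(value: str, salt: str = "demo-salt") -> str:
    """Deterministically map a sensitive value to a stable pseudonym."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
    return f"user_{digest}"

def mask_record(record: dict) -> dict:
    """Mask direct identifiers in a record; same input yields same output."""
    masked = dict(record)
    masked["name"] = mask_value(record["name"])
    # Replace SSN digits while preserving the format downstream tests expect.
    masked["ssn"] = re.sub(r"\d", "X", record["ssn"])
    return masked

row = {"name": "Jane Doe", "ssn": "123-45-6789", "plan": "premium"}
print(mask_record(row))
```

Because the mapping is deterministic, refreshing the masked dataset from production preserves referential consistency: "Jane Doe" becomes the same pseudonym every run.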

Safe Harbor method

HIPAA’s Safe Harbor method provides specific guidelines to remove 18 types of identifiers, such as names, phone numbers, and addresses, to ensure HIPAA compliance. It minimizes risk by focusing on the systematic removal or masking of direct identifiers.

Expert Determination method

HIPAA’s Expert Determination method requires working with a qualified expert who analyzes the dataset and its context to ensure that individuals cannot be identified, even indirectly. It is more flexible than the Safe Harbor method, but it requires a qualified statistician or data expert to assess and mitigate the risk of re-identification.

De-identify your data with Tonic.ai

Integrating de-identification solutions in your software and AI development workflows is essential for maintaining privacy, meeting compliance standards, and enabling safe data usage across industries. Tonic.ai offers industry-leading platforms that generate high-fidelity, de-identified structured and unstructured data to accelerate innovation by equipping your developers with the data they need. Connect with our team to learn more.

Make sensitive data usable for testing and development.

Unblock data access, turbocharge development, and respect data privacy as a human right.

Book a demo

FAQs

Is data de-identification the same as data anonymization?

Data de-identification and data anonymization are often used synonymously: both are umbrella terms for the process of obscuring or altering data to protect sensitive information. Many approaches and techniques fall under this umbrella, including data masking, data tokenization, and data synthesis. For more information, read this guide to data anonymization.

How does the Expert Determination method assess re-identification risk?

Experts analyze the dataset for direct and indirect identifiers and use statistical methods to measure and mitigate re-identification risk, evaluate access controls, and address vulnerabilities, all while aiming to maintain the utility of the de-identified data.

Can zip codes be included in de-identified datasets?

The inclusion of zip codes in de-identified datasets depends on the type of dataset the zip codes are in and the regulations that data is subject to. Most privacy regulations consider addresses to be personally identifiable information (PII), though there can be exceptions depending on the purpose for which zip codes are collected (e.g., solely for shipment delivery). HIPAA, however, has specific requirements for de-identifying zip codes in order to fully protect individuals in smaller populations. Its Safe Harbor method dictates that the first three digits of a zip code may be retained only if the corresponding geographic area has a high enough population; otherwise, those three digits must be changed to 000.
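The Safe Harbor zip code rule above can be sketched in a few lines of Python. The restricted-prefix set shown here is a hypothetical placeholder; the actual list of low-population three-digit ZIP areas is derived from current Census data and must be sourced separately.

```python
def safe_harbor_zip(zip_code: str, restricted_prefixes: set) -> str:
    """Truncate a ZIP code per HIPAA Safe Harbor: keep the first three
    digits only when the three-digit area is populous enough; otherwise
    replace them with '000'."""
    prefix = zip_code[:3]
    return "000" if prefix in restricted_prefixes else prefix

# Illustrative placeholder set; the real list comes from Census data.
RESTRICTED = {"036", "059", "102", "203", "890"}
print(safe_harbor_zip("10001", RESTRICTED))  # kept as "100"
print(safe_harbor_zip("03601", RESTRICTED))  # suppressed to "000"
```

Note that the five-digit zip code is never retained in full; even for populous areas, only the three-digit prefix survives de-identification.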

Chiara Colombi
Director of Product Marketing
