
Data de-identification in the healthcare industry

Author
Janice Manwiller
November 11, 2024

Healthcare organizations have access to personally identifiable information (PII) and protected health information (PHI)—data such as diagnosed conditions, appointment notes, prescribed treatments, insurance coverage, contact information, and payment details. 

Healthcare organizations need to use this data to, among other tasks, develop and test software such as patient portals, and train artificial intelligence (AI) models for systems such as automated chats.

However, healthcare data is very tightly regulated. In particular, the Health Insurance Portability and Accountability Act (HIPAA) lays out strict rules for protecting patient data privacy.

This means that before healthcare organizations can use or share this data for software and AI development, they must de-identify it.

What is data de-identification?

Let's start with a quick definition of data de-identification.

To quote our earlier guide to data de-identification:

"Data de-identification is any action taken to eliminate or modify personally identifiable information (PII) and sensitive personal data within datasets to safeguard individuals' privacy."

For example, you might strip out or obscure names, account numbers, or any other information that could identify a person or provide sensitive personal information about that person. A number of data de-identification techniques exist that we’ll define later in this article.
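As a minimal illustration of the idea, a de-identification pass over a simple record might look like the following sketch. The record and field names are invented for illustration; this isn't any particular tool's behavior.

```python
# A minimal, hypothetical sketch of de-identification: sensitive fields
# are removed or obscured, and non-identifying fields are kept.
# The record and field names are invented for illustration.
record = {
    "name": "Maria Alvarez",
    "account_number": "ACC-4417-9920",
    "diagnosis": "Type 2 diabetes",
    "visit_year": 2024,
}

de_identified = {
    "name": "[NAME]",                    # obscured
    "account_number": "[ACCOUNT]",       # obscured
    "diagnosis": record["diagnosis"],    # kept: useful for testing
    "visit_year": record["visit_year"],  # kept: a year alone is low risk
}

print(de_identified)
```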

How does the healthcare industry use de-identified data?

As in many other industries, healthcare organizations need to use de-identified data as they develop and test their software, whether it’s a customer-facing system or an internal tool for data analytics.

Developers need to use realistic data to verify that new features and functions work correctly and perform as expected.

De-identified healthcare data can also be useful for AI model training. Automated help and chat systems require many, many examples of patient questions and conversations as they learn what to ask next and how to guide patients to appropriate resources.

But real-world conversations that contain sensitive information cannot be used as training data, because of the risk that the model might leak that information when it is put into production. The sensitive information in those conversations must first be de-identified.

De-identification methods defined in HIPAA

HIPAA §164.514(a) provides the standard for determining whether health information is considered de-identified—in other words, whether it can still be used to identify an individual patient.

§164.514(b) describes the methods to use for healthcare data de-identification:

  • §164.514(b)(1) covers Expert Determination.
  • §164.514(b)(2) covers Safe Harbor.

Expert Determination

For Expert Determination, an experienced and knowledgeable expert applies statistical and scientific principles to identify data that must be protected, and to de-identify that data.

They then assess the risk that an individual patient could be identified from the data in its current de-identified state.

The expert continues to iterate over the data until they are satisfied that the risk of identification is small enough.

See our earlier blog post about Expert Determination.

Safe Harbor

The Safe Harbor method of healthcare data de-identification identifies the following specific types of values that must be removed from the data to ensure that the patient cannot be identified:

  • Names
  • All geographic subdivisions smaller than a state, including street address, city, county, and zip code
  • All dates (except year) related to an individual, such as birth dates and admission/discharge dates
  • Telephone numbers
  • Fax numbers
  • Email addresses
  • Social Security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Account numbers
  • Certificate/License numbers
  • Vehicle identifiers and serial numbers
  • Device identifiers and serial numbers
  • Web URLs
  • IP addresses
  • Biometric identifiers, including finger and voice prints
  • Full face photographic images and similar images
  • Any unique identifying number, characteristic, or code

For more information, see our earlier guide about using Safe Harbor to de-identify PHI.
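To make a few of these categories concrete, here is a minimal sketch of how the pattern-based identifier types, such as phone numbers, email addresses, Social Security numbers, and IP addresses, might be detected with regular expressions. This is not a complete or compliant implementation: names, dates, and context-dependent identifiers in free text cannot be reliably caught by simple patterns.

```python
import re

# A minimal sketch of pattern-based detection for a few of the Safe
# Harbor identifier types. Regular expressions only catch well-formed
# values; names, dates, and free-text identifiers need smarter detection.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def scrub(text: str) -> str:
    """Replace each match with a placeholder naming the identifier type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Call 555-867-5309 or email j.doe@example.com; SSN 123-45-6789."
print(scrub(note))
# Call [PHONE] or email [EMAIL]; SSN [SSN].
```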


Other types of values to be aware of

Regardless of the healthcare data de-identification method used, you must remove any information that could possibly be used, by itself or indirectly in combination with other data, to identify a patient.

The HIPAA Safe Harbor list is a great place to start, but it isn't a complete list. It doesn't include some additional types of data that have cropped up since the list was established. And it doesn't specifically mention information that, while not a unique identifier, can be used to indirectly identify a patient.

For example, when the Safe Harbor list was created, there were no such things as social media aliases.

Other examples of values that are not in the Safe Harbor list, but that might be used in combination to identify a patient and so should be considered for de-identification, include race and gender designations.

Even the name of a doctor could in some cases be used to identify the patient.
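A toy sketch can show why these quasi-identifiers matter: even with every name removed, a combination of non-unique attributes can point to exactly one record. The records below are entirely made up.

```python
# A toy illustration of indirect identification: even with names removed,
# a combination of quasi-identifiers can isolate a single record.
# The records are entirely made up.
records = [
    {"gender": "F", "race": "Asian", "doctor": "Dr. Lee"},
    {"gender": "M", "race": "Black", "doctor": "Dr. Lee"},
    {"gender": "F", "race": "Asian", "doctor": "Dr. Patel"},
]

def matching(records, **quasi_ids):
    """Return the records that match every given quasi-identifier."""
    return [r for r in records
            if all(r[key] == value for key, value in quasi_ids.items())]

hits = matching(records, gender="F", doctor="Dr. Lee")
print(len(hits))  # 1 -> this combination points to exactly one patient
```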

Healthcare data de-identification techniques

Once you determine which data you need to de-identify, the next step is to choose a technique for de-identifying it.

Here are some of the more common techniques for healthcare data de-identification:

Redaction

Redaction blocks or removes a sensitive value.

For example, in Tonic Textual, which de-identifies unstructured files, the redaction option replaces sensitive values with a generic placeholder that simply identifies the information type. For PDFs and images, redaction displays a black box over the value.
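As a minimal sketch of the idea (not how Textual works internally), redaction might replace detected spans with typed placeholders. The spans here are hard-coded stand-ins for what an upstream detector would find.

```python
# A minimal sketch of redaction: replace known sensitive spans with a
# placeholder that names the information type. The spans would come
# from an upstream detector; here they are hard-coded for illustration.
def redact(text: str, spans: list[tuple[int, int, str]]) -> str:
    # Apply right to left so earlier offsets stay valid after each edit.
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

note = "Patient John Smith reported dizziness."
print(redact(note, [(8, 18, "NAME")]))
# Patient [NAME] reported dizziness.
```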
Masking

Masking means to replace a sensitive value with another, more or less realistic, value.

For example, you might swap out a first name for another first name, scramble the characters in an insurance identifier, or adjust appointment timestamps by an hour or a week.

Tonic Structural has an extensive set of generators to perform different types of masking on various types of database and file data.

Textual also generates realistic replacements for sensitive values.
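Here is a minimal sketch of two common masking moves, a consistent name swap and a timestamp shift. The helper names and the tiny replacement pool are invented for illustration; real maskers draw from much larger pools. Making the swap deterministic means the same input always masks to the same output, which keeps relationships consistent across tables.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical replacement pool; real maskers draw from much larger ones.
FIRST_NAMES = ["Alex", "Jordan", "Sam", "Riley", "Casey"]

def mask_name(name: str) -> str:
    """Deterministically map a name to a replacement, so the same input
    always masks to the same output."""
    digest = int(hashlib.sha256(name.encode()).hexdigest(), 16)
    return FIRST_NAMES[digest % len(FIRST_NAMES)]

def shift_timestamp(ts: datetime, days: int = 7) -> datetime:
    """Shift an appointment time by a fixed offset, hiding the real date
    while preserving the intervals between events."""
    return ts + timedelta(days=days)

print(mask_name("Maria"))                            # same output every run
print(shift_timestamp(datetime(2024, 1, 3, 15, 0)))  # 2024-01-10 15:00:00
```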
Generalization

With generalization, you make a value less specific.

For example, you might specify a range instead of an exact value: 50-60 instead of 54.

Or you might truncate a procedure date to the month: January 2020 instead of January 3, 2020 at 3:00 PM.
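A small sketch of both generalizations follows; the helper names are invented, and real tools let you configure the bucket widths and truncation levels.

```python
from datetime import datetime

def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a range, e.g. 54 -> '50-59'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def truncate_to_month(ts: datetime) -> str:
    """Drop the day and time, keeping only the month and year."""
    return ts.strftime("%B %Y")

print(generalize_age(54))                              # 50-59
print(truncate_to_month(datetime(2020, 1, 3, 15, 0)))  # January 2020
```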
Synthesis

Data synthesis uses only the structure of the data, not the individual values.

It creates new data that has the same structure and statistics as the original data, but that is not based on existing records.
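As a toy illustration of the idea, the sketch below fits per-column statistics and samples brand-new values from them. Production synthesis engines model joint distributions, correlations, and schema constraints, not just single columns.

```python
import random
import statistics

# Original (sensitive) values for a single numeric column.
real_ages = [34, 41, 29, 58, 47, 52, 38]

# Fit simple per-column statistics...
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# ...then draw brand-new values from the fitted distribution.
# No synthetic value corresponds to any real patient record.
synthetic_ages = [round(random.gauss(mu, sigma)) for _ in range(7)]
print(synthetic_ages)
```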

For more information about these techniques, see our guide to data de-identification.

Benefits and challenges of de-identifying healthcare data

For developers, the main benefit derived from high-fidelity healthcare data de-identification is that realistic data makes development, testing, and model training more effective and efficient, while ensuring regulatory compliance.

Also, when data is de-identified, healthcare organizations do not need patient consent to use it.

On the other hand, de-identifying a large volume of data can be a daunting task. Searching through structured databases for sensitive values is difficult enough. At least there you usually have column names to help guide you. Finding and replacing sensitive values in an assortment of unstructured notes and files is an even greater challenge.

How can Tonic.ai help?

Healthcare organizations are legally required to protect patient information. HIPAA and other privacy regulations identify the specific types of information that must be protected. Failure to follow these regulations is a violation of patient trust, and can lead to serious consequences for both the patient and the organization.

However, healthcare organizations also need to use patient data to test and train their systems.

De-identifying all of the PII and PHI in healthcare data is a difficult challenge: the data includes a wide variety of values, in both traditional databases and in unstructured files such as appointment notes and test results.

Tonic.ai exists to help healthcare organizations meet this challenge. Tonic Structural allows you to quickly identify and mask PII and PHI in your structured and semi-structured databases, to accelerate your engineering velocity with quality, compliant test data. Tonic Textual enables you to redact and synthesize PII and PHI values across a range of unstructured file types, to generate free-text data that is safe to use in AI development and implementation.

For more information about Tonic.ai products and how to use them for healthcare data, connect with our team or start a free trial of Tonic Structural or Tonic Textual today.

FAQs

Can de-identified healthcare data be used without patient consent?

Yes, de-identified healthcare data can be used without patient consent.

Does a doctor's name need to be de-identified?

If a doctor’s name can be used to identify an individual in a dataset, either on its own or when combined with de-identified data, the doctor’s name must also be de-identified.

Do other data privacy regulations apply to healthcare data?

While HIPAA specifies patient privacy requirements at the federal level, individual states can also have data privacy regulations.

Those regulations might specify protection requirements that are more or less strict than what HIPAA requires.

When there are both federal and state regulations, the more stringent rules take precedence.

More recent regulations, such as the GDPR in Europe and the CCPA in California, include biometric information within their definition of PII. Healthcare organizations should also be aware of their requirements.

Janice Manwiller
Principal Technical Writer

Janice Manwiller is the Principal Technical Writer at Tonic.ai. She currently maintains the end-user documentation for all of the Tonic.ai products and has scripted and produced several Tonic video tutorials. She's spent most of her technical communication career designing and developing information for security-related products.
