
Synthesizing healthcare data for AI model training, with HIPAA Expert Determination

Author
Adam Kamor, PhD
January 28, 2025

For organizations deploying generative AI applications in healthcare, maintaining compliance with the strict data privacy standards of the Health Insurance Portability and Accountability Act (HIPAA) demands a higher level of rigor than in most other sectors.

In this article we’ll explore approaches to de-identifying protected health information (PHI) to ensure HIPAA compliance in AI workflows, from development to implementation.

Understanding HIPAA and healthcare data

While HIPAA does not specifically mention AI, it does apply to AI's use in healthcare contexts. For instance, if a HIPAA-covered entity, like a healthcare provider, uses PHI to train AI models, it must comply with HIPAA's Privacy and Security Rules and ensure that the PHI is properly de-identified, as outlined by the U.S. Department of Health and Human Services.

Similarly, if an AI company processes PHI on behalf of a HIPAA-covered entity, it becomes a business associate and is required to adhere to HIPAA regulations, including restrictions on using health data that has not been de-identified to train generative AI models. This ensures that sensitive health information, such as Social Security numbers, dates of service, or personal names, remains protected and secure.

HIPAA de-identification: Safe Harbor vs. Expert Determination

HIPAA recognizes the utility of healthcare information and allows for PHI to be de-identified in two different ways, as specified in §164.514(a)-(b) of the regulations: Safe Harbor and Expert Determination.

Safe Harbor: a conservative approach

The Safe Harbor method involves removing all instances of 18 specific categories of identifiers from the dataset to protect the data from re-identification. While Safe Harbor provides strong privacy guarantees, it is very prescriptive, leaving little room for nuance in determining how to de-identify PHI.

Safe Harbor can be an appropriate method when the de-identified data doesn't need to closely mirror the original data. For use cases that do require high fidelity, such as data used to train ML models, it may not be a suitable approach.
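
To make the prescriptive nature of Safe Harbor concrete, here is a minimal sketch of pattern-based scrubbing for a few of the 18 identifier categories. It is illustrative only: a compliant pipeline must address all 18 categories, and identifiers like names or geographic subdivisions can't be caught by regular expressions alone.

```python
import re

# Minimal sketch: pattern-based scrubbing for a few of the 18 Safe
# Harbor identifier categories. A real pipeline must handle all 18,
# and identifiers such as names require NER, not regular expressions.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def scrub(text: str) -> str:
    """Replace matched identifiers with typed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Reached at 555-867-5309, SSN 123-45-6789, visit on 01/28/2025."))
# -> "Reached at [PHONE], SSN [SSN], visit on [DATE]."
```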

Expert Determination: a tailored solution

Expert Determination under HIPAA takes a customized approach tailored to the specific dataset and intended use case, typically resulting in data with much higher utility. An expert, applying generally accepted statistical and scientific principles, determines an approach to de-identifying the data that ensures a very small risk of re-identification. Because the expert provides the rules and process for de-identifying the data, the data owner remains responsible for actually performing the de-identification.

By using a qualified expert to navigate the complexities of data de-identification, organizations can safeguard their ethical obligations to their patients while maintaining legal compliance. However, the requirements for Expert Determination are equally stringent, and the expert's qualifications, the rigor of the risk assessment, the methodologies employed, and the documentation of the process undertaken are all fundamental to the integrity of the process.

Key aspects of HIPAA Expert Determination

To safeguard PHI in compliance with HIPAA standards, it's necessary to understand what goes into the Expert Determination process. All of the elements outlined below can influence the privacy of PHI and compliance with HIPAA.

Expert qualifications

The first aspect of Expert Determination is the expert themselves. They must possess specialized knowledge of the statistical and scientific methodologies used for data de-identification, including deep familiarity with health privacy laws and with the statistical methods that can minimize the risk of identifying individuals from the dataset. The expert's credentials, experience in data privacy protection, and knowledge of current data security practices are central to ensuring the integrity of de-identified health information.

Risk assessment

Risk assessment, as part of the Expert Determination process, evaluates how likely it is that de-identified data could be used, on its own or combined with other reasonably available information, to re-identify the individuals behind the records. This process analyzes both the nature of the data and the context in which it will be used to determine the chances of re-identification.
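
As one illustration of how such a risk might be quantified, the sketch below applies k-anonymity over a set of assumed quasi-identifier columns. k-anonymity is just one statistical lens an expert might use; HIPAA does not prescribe a specific measure, and real assessments combine several methods with contextual analysis.

```python
import pandas as pd

# Toy risk check: k-anonymity over quasi-identifiers (columns that
# could be joined with outside data sources). Column names are assumed.
QUASI_IDENTIFIERS = ["zip3", "birth_year", "gender"]

def reidentification_risk(df: pd.DataFrame) -> pd.Series:
    """Per-record risk approximated as 1 / equivalence-class size."""
    class_size = df.groupby(QUASI_IDENTIFIERS)["zip3"].transform("size")
    return 1.0 / class_size

records = pd.DataFrame({
    "zip3":       ["021", "021", "021", "945"],
    "birth_year": [1980, 1980, 1980, 1955],
    "gender":     ["F", "F", "F", "M"],
})
print(reidentification_risk(records).max())  # 1.0: the '945' record is unique
```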

Methods and documentation

As part of the de-identification requirements, the expert conducting the analysis must rigorously document and justify their methods. They must outline which statistical techniques were used and why, along with the steps taken to reduce the risk of re-identification. The expert must attest to how the methods apply to the particular dataset and environment, demonstrating that the procedures are transparent, reproducible, and aligned with current best practices in data privacy.

Contextual factors

Expert Determination must also consider the contextual factors surrounding the data, including the environment where the data will be used, the level of access to additional data sources that could be combined with the de-identified data, and the technological landscape. The expert's evaluation has to consider how these factors could impact privacy risks for the de-identified data and take steps to mitigate them.

Ongoing monitoring and updating

Due to the ever-evolving nature of technology and methodology, de-identification protocols and risk assessments must be reviewed and updated regularly to ensure that mitigation measures remain effective against new threats. Along with their initial assessment and report, the expert needs to schedule periodic reviews of the de-identification processes and document any changes made to maintain compliance with the HIPAA privacy rule's de-identification standard.

Applications of Expert Determination in AI

Let's explore three critical use cases where Expert Determination is essential to maintain HIPAA compliance and protect patient privacy while deploying advanced AI technologies in healthcare settings. These scenarios highlight the importance of thorough data de-identification before data processing and model training.

LLM fine-tuning

When fine-tuning a Large Language Model (LLM) on data that includes sensitive identifiers, there’s a risk that Personally Identifiable Information (PII) or PHI could be inadvertently encoded into the model’s parameters. This can lead to unintended data leakage during inference, where sensitive information is revealed through the model's responses.

For example, imagine you’re developing a patient history summarization engine by fine-tuning an LLM. If identifiable details about patients aren’t removed beforehand, the model could accidentally expose portions of a patient's history to unauthorized users. This would not only breach patient privacy but could also violate HIPAA regulations.

Using Expert Determination to carefully de-identify data prior to LLM fine-tuning minimizes these risks, helping to ensure the model remains compliant while still retaining valuable context for healthcare applications.
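
As a minimal sketch of what this looks like in practice, the snippet below builds a fine-tuning dataset only from de-identified text. The file names, record fields, and the redact stand-in are all assumptions; in a real pipeline, redaction would be the NER-based process the expert approves.

```python
import json
import re

def redact(text: str) -> str:
    # Stand-in for the expert-approved de-identification step; a real
    # pipeline would use NER-based redaction, not a single regex.
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)

# Build the fine-tuning dataset from de-identified text only, so PHI
# never reaches the training loop or the model's weights. File names
# and record fields ("history", "summary") are assumed.
with open("patient_histories.jsonl") as src, open("train.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        dst.write(json.dumps({
            "prompt": redact(record["history"]),
            "completion": redact(record["summary"]),
        }) + "\n")
```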

RAG building

In Retrieval-Augmented Generation (RAG) systems, embedding or vectorizing documents that contain PII or PHI can inadvertently introduce sensitive information into the retrieval process, increasing the risk of data leakage. To mitigate this, it's crucial to remove PII and PHI from documents before they are embedded in a vector database.

For instance, if a health insurer is building a customer support chatbot powered by a RAG system, the system will need access to historical customer interactions (via phone, email, or chat) to retrieve relevant responses. To maintain privacy, identifiable details must be removed before or during the vectorization process. This de-identification is typically most effective when done before chunking, because the full document context improves Named Entity Recognition (NER) performance.

By de-identifying unstructured data at this stage, you create a safer RAG system that retrieves relevant information without compromising the privacy of individual customers, enabling a compliant and effective support tool.
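
A minimal sketch of this ordering, assuming a simple redaction stand-in and the sentence-transformers library for embeddings: the full document is de-identified first, then chunked and embedded, so no raw PHI ever reaches the vector store.

```python
from sentence_transformers import SentenceTransformer

def redact(text: str) -> str:
    # Stand-in for the expert-approved redaction step (see the
    # fine-tuning sketch above).
    return text

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-width chunking; production systems usually split on
    # sentence or section boundaries.
    return [text[i:i + size] for i in range(0, len(text), size)]

# De-identify the FULL document first, so the NER step sees complete
# context, then chunk and embed the cleaned text. The model choice
# and file name are assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")
document = open("support_transcript.txt").read()
embeddings = model.encode(chunk(redact(document)))  # vectors built from clean text only
```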

LLM or SLM building

Building foundational LLMs or Small Language Models (SLMs) with high-quality, realistic data is complex—and becomes even more challenging when sensitive information is involved. Training on data that includes PII or PHI can risk encoding this information into the model weights, leading to potential data leakage during inference, similar to the risk seen in fine-tuning.

Many organizations develop foundational models by synthesizing their own corpora of text, often drawn from customer support exchanges, email communications, or patient histories. These synthetic datasets are invaluable for creating models that can generate safe, contextually accurate text. To prevent data leakage, PII and PHI must be removed before training begins, ensuring that all synthetic output remains compliant and free from sensitive information.
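
One simple guard, sketched below with the same assumed redact stand-in, is to set aside any document in which identifiers were detected, so the detector's behavior can be spot-checked before the corpus is used for training.

```python
import re

def redact(text: str) -> str:
    # Stand-in for the expert-approved de-identification step.
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)

raw_documents = ["Note: SSN 123-45-6789 on file.", "Routine follow-up visit."]

clean_corpus, needs_review = [], []
for doc in raw_documents:
    redacted = redact(doc)
    if redacted != doc:
        needs_review.append((doc, redacted))  # identifiers found: spot-check
    clean_corpus.append(redacted)             # train only on redacted text
```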

By applying Expert Determination to de-identify training data, organizations can build robust, reliable models that retain essential context while safeguarding privacy, paving the way for trustworthy AI applications in healthcare and beyond.

Tonic.ai and Expert Determination

Tonic.ai offers solutions for generating high-quality synthetic data that mimics real-world data while safeguarding patient privacy. Using Tonic Textual, a platform built for de-identifying and synthesizing free-text data, in conjunction with Expert Determination offers a robust solution for healthcare organizations leveraging sensitive patient data in generative AI applications.

The process begins with Tonic Textual's proprietary NER models identifying PHI within unstructured data. An expert then evaluates the efficacy of these models against the relationships and underlying PHI in the source data, determining whether the redacted data complies with HIPAA regulations.

Once the expert validates the approach based on the model's quality and the data itself, Textual can be employed to effectively de-identify the data. This allows organizations to safely and securely use their data for advanced AI development on Google Cloud, all while maintaining the highest standards of privacy and regulatory compliance.
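
As a rough sketch of what this workflow can look like in code, the snippet below uses Tonic Textual's Python SDK to redact a free-text record. The endpoint, API key, and exact response fields shown here are assumptions; consult the Tonic Textual documentation for the current interface.

```python
from tonic_textual.redact_api import TextualNer  # pip install tonic-textual

# Endpoint, key, and response fields are assumptions; check the
# Tonic Textual docs for the current SDK interface.
textual = TextualNer("https://textual.tonic.ai", api_key="YOUR_API_KEY")

raw = "Patient John Smith, DOB 03/12/1961, seen at Mercy General on 01/28/2025."
response = textual.redact(raw)

print(response.redacted_text)
# e.g. "Patient [NAME_GIVEN] [NAME_FAMILY], DOB [DATE_TIME], seen at [ORGANIZATION] ..."
```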

Developing healthcare solutions with confidence

Navigating HIPAA compliance in the age of AI and LLMs requires a nuanced approach, especially as organizations strive to maintain data utility while ensuring patient privacy. Expert Determination offers a valuable pathway to achieve this balance, allowing sensitive information to be de-identified without the rigid limitations of Safe Harbor guidelines. This flexibility empowers companies to leverage healthcare data responsibly for powerful applications like LLM fine-tuning, RAG systems, and foundational model building.

By prioritizing de-identification through Expert Determination, organizations can unlock the potential of healthcare data safely and ethically. Whether you’re fine-tuning models, building robust retrieval systems, or crafting foundational datasets, preserving privacy safeguards the integrity of your AI initiatives while honoring the trust of those whose data powers these advancements. Embracing Expert Determination helps ensure that innovation and compliance go hand in hand, driving the future of healthcare technology responsibly forward.

Through partnering with Tonic.ai, innovative healthcare organizations building on Google Cloud can leverage synthetic data alongside Expert Determination to achieve both compliance and high data utility, empowering them to innovate without compromising on security or privacy.
