
Safeguarding data privacy while using LLMs

Joe Ferrara, PhD
April 22, 2024

As more and more organizations adopt artificial intelligence solutions, it has never been more essential to protect data privacy. This is particularly true when it comes to deploying Large Language Models (LLMs) in your business operations. Whether you're incorporating third-party APIs to back your LLM app, doing Retrieval-Augmented Generation (RAG), building chatbots, or fine-tuning an LLM, the risks of exposing private information to unauthorized users are significant.

The good news is there are solutions available. In this post, we explore how Tonic Textual helps secure your private data while still allowing you to harness the benefits of LLMs.

5 common data privacy issues with LLMs

One of the greatest risks of integrating LLMs into your business workflows is that they can inadvertently expose sensitive information such as personally identifiable information (PII) or protected health information (PHI). To protect user privacy and maintain compliance, let's identify how your private data can be put at risk with LLMs––and how to mitigate those risks.

#1: Risks from retrieval and prompting techniques

Techniques like Retrieval-Augmented Generation (RAG) and few-shot prompting are powerful LLM enhancement tools––but they can also inadvertently expose sensitive information, even when instructed not to. (It still isn't clear how good LLMs are at following instructions.)

For instance, imagine you have a healthcare chatbot that collects user information before connecting the user with a nurse for medical advice. The chatbot uses RAG, with medical notes included among the documents that back the retrieval step.

[Image: A chat transcript between a patient and the chatbot, with sensitive information highlighted in yellow.]

In the example above, the chatbot revealed that Joe Ferrara has sciatica and had surgery even though it was not needed to answer the user’s query. Without safeguards such as redaction or synthetic data generation, RAG and few-shot prompting can compromise privacy and violate regulatory standards.
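
To make the leak path concrete, here's a minimal sketch of how a RAG prompt is typically assembled (the `retriever` and `llm_client` objects are hypothetical stand-ins for your vector store and LLM API). Whatever PII appears in the retrieved notes is pasted straight into the model's context, and instructions alone can't guarantee it won't be repeated back.

```python
def answer_with_rag(question: str, retriever, llm_client) -> str:
    # Retrieve the most relevant documents -- e.g., raw medical notes.
    docs = retriever.search(question, top_k=3)

    # The retrieved text is pasted verbatim into the prompt, so any PII in
    # the notes (names, diagnoses, procedures) reaches the LLM and may be
    # echoed back to the user.
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Answer the user's question using only the context below.\n"
        "Do not reveal personal information.\n\n"  # instructions alone are not a reliable safeguard
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_client.complete(prompt)
```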

#2: Memorization and regurgitation of training data

When an LLM is fine-tuned or trained on sensitive data, it may inadvertently memorize specific details and regurgitate them when the user prompts it with similar language. This risk has been highlighted in recent cases like the New York Times copyright lawsuit against OpenAI.

[Image: An invoice containing sensitive data, including name, address, invoice number, date, due date, and amount due.]
Invoice that is part of the fine-tuning data

[Image: Generated text with PII highlighted: "Joseph Ferrera, on March 15, you had an invoice sent to 1234 Main St. San Francisco CA 94107 for $12,000."]
Part of the summary of March transactions for Joseph Ferrera, a different person than the person in the fine-tuning data

For instance, as in the example above, when summarizing a different customer’s financial transactions, the model can mistakenly include sensitive data from the fine-tuning data set. While these instances may be outliers, the potential for exposing real user data underscores the importance of rigorous privacy safeguards during the fine-tuning process.

#3: Data exposure through APIs and integrations

Leveraging third-party APIs, such as OpenAI's, risks exposing an organization's sensitive data during transmission. Privacy agreements usually prevent these providers from storing customer data or training on it, but even sending data to external endpoints can introduce potential vulnerabilities.

When APIs or plugins are not secured properly, data is left open to interception or misuse by malicious actors. It's essential to minimize the possibility of exposure by using secure integration methods and robust de-identification techniques.

#4: Inadequate data governance and compliance

When deploying LLMs, organizations can have a hard time meeting the stringent and fast-changing data governance standards required by laws like GDPR, HIPAA, or CCPA. Failing to anonymize data, improper data handling practices, or insufficient documentation of data usage can all result in compliance violations, fines, or reputational damage. Conversely, incorporating de-identification and anonymization tools along with transparent documentation practices can help to alleviate these concerns.

#5: Adversarial attacks and security exploits

LLMs are especially vulnerable to attacks from malicious actors. This includes prompt injection, an attack in which crafted inputs are used to deliberately manipulate the model's outputs. Training data poisoning is another concern, in which attackers introduce harmful or misleading data into the training dataset.

Poorly secured systems can also expose LLMs to unauthorized access, increasing the likelihood of sensitive data leaks. However, businesses can help mitigate these and other risks by implementing strict access controls, routine security audits, and anomaly detection systems.

How to improve data security and privacy for LLMs

The key to ensuring data security and privacy in LLMs is to strengthen safety without compromising functionality. Addressing specific vulnerabilities with the right solutions helps to mitigate risks and maintain alignment with compliance regulations. Let's look at a few specific ways to enhance your data security and protect privacy while using LLMs.

Conduct regular risk assessments

In the case of LLM security, vigilance is your first and best line of defense. Regular risk assessments are critical for identifying potential vulnerabilities and should examine how data is collected, processed, and stored, as well as potential risks associated with external integrations like APIs or user-facing applications. By proactively addressing risks in this way, businesses can ensure their AI systems remain secure and compliant with evolving data privacy regulations.

Employ secure data storage and transmission

Encrypting data both in transit and at rest helps to prevent unauthorized access during LLM workflows. Secure environments, such as private cloud infrastructure or on-premises servers, add an extra layer of protection, while regular audits and monitoring of data storage solutions further ensure compliance with data privacy standards.

Adopt role-based access controls

Restricting access based on user roles ensures only authorized personnel can interact with sensitive datasets or LLM operational parameters. This minimizes insider threats and accidental breaches while maintaining accountability through audit trails. Regularly reviewing and updating permissions keeps access control policies current and effective.

Apply differential privacy techniques

Differential privacy techniques introduce carefully calibrated statistical noise during training, typically into the model's gradient updates, to reduce the risk of memorization and ensure that individual data points in a dataset cannot be reverse-engineered from the trained model. These techniques protect user privacy by preventing sensitive data from being inadvertently revealed in model outputs.
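
As a rough illustration of the idea (a conceptual sketch, not a production differential privacy implementation), a DP-SGD-style training step clips each example's gradient and adds calibrated Gaussian noise before the model is updated; the `clip_norm` and `noise_multiplier` values below are illustrative assumptions.

```python
import numpy as np

def dp_gradient_step(per_example_grads: np.ndarray,
                     clip_norm: float = 1.0,
                     noise_multiplier: float = 1.1) -> np.ndarray:
    """Aggregate per-example gradients with clipping and Gaussian noise (DP-SGD-style sketch)."""
    clipped = []
    for grad in per_example_grads:
        # Bound each individual example's influence on the update.
        norm = np.linalg.norm(grad)
        clipped.append(grad * min(1.0, clip_norm / (norm + 1e-12)))

    # Noise scaled to the clipping bound masks any single example's contribution,
    # which is what limits memorization of individual records.
    summed = np.sum(clipped, axis=0)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
    return (summed + noise) / len(per_example_grads)
```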

Implement robust access controls

Role-based access controls (RBAC) limit internal access to those who need it and ensure user-facing applications don't expose restricted data. This minimizes the risk of both insider threats and accidental data leaks. Implementing these and other access controls also helps to reinforce customer and stakeholder trust by demonstrating a commitment to secure data handling.


Tonic Textual to the rescue

For ensuring data privacy and compliance when working with sensitive information in LLMs, Tonic Textual is the solution. Tonic Textual combines advanced de-identification techniques with synthetic data generation, using state-of-the-art proprietary named-entity recognition (NER) models so that businesses can safely leverage their data while eliminating the risks associated with exposing PII and PHI.

Tonic Textual is built to integrate seamlessly into your LLM pipeline to protect private information in your free text data. Tonic Textual’s text de-identification comes in two flavors:

  • Redaction: PII is detected and replaced with placeholder tokens.
  • Synthesis: PII is detected and replaced with realistic but fake, non-sensitive values.

Depending on your use case, you may want to use redaction or synthesis; we'll see examples of each below. Textual supports many entity types, with new types added regularly, including first name, last name, street address, city, state, zip code, phone number, date/time, and many others.
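
To make the distinction concrete, here's a toy sketch of the two modes. The `toy_redact` and `toy_synthesize` functions below are simple stand-ins written for this post, not Textual's SDK; in practice, Textual's NER models detect the entities and generate the replacement values for you.

```python
import re

# Toy stand-ins for redaction and synthesis, purely for illustration.
FAKE_NAMES = {"Joe Ferrara": "Mark Delgado"}  # assumed mapping for the demo
NAME_PATTERN = re.compile(r"Joe Ferrara")

def toy_redact(text: str) -> str:
    # Redaction: detected PII becomes a placeholder token.
    return NAME_PATTERN.sub("[NAME_GIVEN] [NAME_FAMILY]", text)

def toy_synthesize(text: str) -> str:
    # Synthesis: detected PII becomes a realistic but fake value.
    return NAME_PATTERN.sub(lambda m: FAKE_NAMES[m.group(0)], text)

note = "Joe Ferrara was seen for sciatica on March 15."
print(toy_redact(note))      # [NAME_GIVEN] [NAME_FAMILY] was seen for sciatica on March 15.
print(toy_synthesize(note))  # Mark Delgado was seen for sciatica on March 15.
```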

If you're sending private data to third-party APIs, you can use Textual to redact the text before it is sent, and then un-redact the response from the third-party API before presenting it to your end user. Because the text is redacted before it leaves your systems, the third party never receives the private data.
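
A rough sketch of that round trip is below. The `redact` and `call_api` callables are hypothetical placeholders for the Textual SDK's de-identification call and your third-party LLM client; the key point is that only redacted text ever leaves your systems, and the placeholder tokens are swapped back afterward.

```python
from typing import Callable, Dict, Tuple

def call_llm_privately(
    user_text: str,
    redact: Callable[[str], Tuple[str, Dict[str, str]]],  # returns (redacted text, token -> original value map)
    call_api: Callable[[str], str],                        # e.g., a third-party LLM completion call
) -> str:
    # 1. Strip PII before the text leaves your systems.
    redacted_text, token_map = redact(user_text)

    # 2. Only the redacted text is sent to the third party.
    redacted_reply = call_api(redacted_text)

    # 3. Re-insert the original values so the end user sees a natural reply.
    reply = redacted_reply
    for token, original in token_map.items():
        reply = reply.replace(token, original)
    return reply
```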

Similarly, if you're doing RAG or few-shot prompting, you can redact the private data in the prompt sent to the LLM to ensure that the LLM does not leak any of the PII in the retrieved context.

[Image: The chat transcript between the patient and the chatbot, with the sensitive information redacted and replaced with PII markers like [NAME_GIVEN].]
In the example from before, the LLM no longer reveals the name Joe Ferrara because the retrieved context is de-identified in the LLM prompt.
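
Revisiting the earlier RAG sketch, the only change needed is a de-identification step (a hypothetical `redact` callable here) applied to each retrieved document before it's pasted into the prompt, so the model never sees the real PII in the first place.

```python
def answer_with_private_rag(question: str, retriever, llm_client, redact) -> str:
    # Retrieve relevant documents as before.
    docs = retriever.search(question, top_k=3)

    # De-identify each retrieved note before it enters the prompt; the model
    # can't leak PII it never receives.
    context = "\n\n".join(redact(doc.text) for doc in docs)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm_client.complete(prompt)
```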

When fine-tuning an LLM, you can use Tonic Textual to synthesize the free text data used for fine-tuning, replacing the identified PII with fake PII. Synthesis makes it so that the training data looks like your normal free text data but does not contain any real PII.

[Image: The invoice from before, with the sensitive information replaced by contextually relevant synthetic data.]
Using Tonic Textual in Synthesis mode, the real PII in the original invoice is replaced with fake PII, preventing the name collision that occurs for Joseph Ferrera.

While it's still possible for the LLM to memorize and regurgitate the training data, no real PII can be revealed, only the fake PII produced by synthesis.
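
As a minimal sketch of that preprocessing step, the function below maps a hypothetical `synthesize` callable over a JSONL fine-tuning file; the `prompt` and `completion` field names are assumptions about your training format, not a requirement of Textual.

```python
import json

def synthesize_training_file(in_path: str, out_path: str, synthesize) -> None:
    """Replace real PII with fake PII in a JSONL fine-tuning file (sketch)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            # Only the free-text fields need de-identification; structure is preserved.
            record["prompt"] = synthesize(record["prompt"])
            record["completion"] = synthesize(record["completion"])
            dst.write(json.dumps(record) + "\n")
```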

Ensuring compliance with privacy regulations

Tonic Textual simplifies compliance with regulations like GDPR, HIPAA, and CCPA by offering powerful tools for anonymizing and synthesizing sensitive information. By replacing PII and PHI with realistic yet non-sensitive substitutes, Tonic Textual aligns your data processing workflows with global and industry-specific standards.

Whether you're fine-tuning LLMs, employing Retrieval-Augmented Generation (RAG), or integrating third-party APIs, Tonic Textual ensures that your data practices remain both ethical and compliant. Its ability to adapt to evolving regulatory requirements makes it an essential component of any AI-driven data strategy.

Learn more about Tonic Textual’s compliance solutions.

FAQs

How can LLMs expose sensitive data?

LLMs can unintentionally expose sensitive data through various methods, such as retrieving private information during prompts in Retrieval-Augmented Generation (RAG), leaking training data during inference, or revealing sensitive details embedded in fine-tuning datasets. Ensuring data is properly anonymized or synthesized before use can significantly reduce this risk.

How does synthetic data help protect privacy?

Synthetic data enables organizations to maintain the contextual richness and utility of their datasets without exposing sensitive information. By generating realistic but de-identified data, tools like Tonic Textual ensure compliance with privacy regulations like GDPR and HIPAA while protecting Personally Identifiable Information (PII).

What do privacy regulations require when deploying LLMs?

Compliance requires organizations to implement strict data protection measures, including anonymization, data minimization, and robust access controls. These measures not only protect user privacy but also prevent legal and financial repercussions. Tonic Textual helps streamline this process by automating data anonymization and synthesis to meet regulatory requirements effectively.

Joe Ferrara, PhD
Senior AI Scientist

Joe is a Senior Data Scientist at Tonic. He has a PhD in mathematics from UC Santa Cruz, giving him a background in complex math research. At Tonic, Joe focuses on implementing the newest developments in generating synthetic data.
