Category: Data privacy in AI

Safeguarding data privacy while using LLMs

Author: Joe Ferrara, PhD
April 22, 2024

Joe is a Senior Data Scientist at Tonic. He has a PhD in mathematics from UC Santa Cruz, giving him a background in complex math research. At Tonic, Joe focuses on implementing the newest developments in generating synthetic data.

Every company wants to leverage LLMs with their data. But in this era where data breaches and privacy concerns are rampant, protecting your private data has never been more crucial. This is particularly true when it comes to deploying Large Language Models (LLMs) in your business operations. Whether you're incorporating third-party APIs to back your LLM app, doing Retrieval-Augmented Generation (RAG), building chatbots, or fine-tuning an LLM, the risks of exposing private information to unauthorized users are significant. Using private data with LLMs can provide a gateway for malicious actors to access sensitive data. But fear not – there are solutions available. In this post, we explore how Tonic Textual helps secure your private data while still allowing you to harness the benefits of LLMs.

Understanding the privacy issues with LLMs

When you incorporate LLMs into your business operations, you could inadvertently be laying your private data bare for the world to see. Four very common ways of using LLMs can expose private information to end users: using 3rd party APIs, RAG, few shot prompting, and fine tuning.

Using 3rd party APIs, like OpenAI's API, to back your LLM application shares all data sent through the API with the 3rd party. This is a data risk in and of itself, and the 3rd party may use the shared data to train their LLMs. For many companies this particular risk is mitigated, since it's now common to have a privacy agreement with the 3rd party providing the API that states the 3rd party will not train LLMs on the shared data. The 3rd party is also often your data storage provider, so they may already hold your data in the first place; for instance, you may store your data in Azure and use the Azure OpenAI Service endpoints. But even with a secure 3rd party API, there is still the risk of leaking private information when doing RAG, few shot prompting, or fine tuning.

LLMs don't follow instructions

In RAG, private information from your data store may get inserted into the LLM prompt during the retrieval and augmentation process. If private information is inserted into the prompt, the LLM may reveal it to the end user even if the LLM is instructed not to; it still isn't clear how reliably LLMs follow such instructions. For instance, imagine you have a healthcare chatbot that collects user information before connecting the user with a nurse for medical advice, and that chatbot uses RAG with medical notes as part of the data backing the RAG process.

An image showing a chat transcript between a patient and a chatbot. Certain items of sensitive information are highlighted in yellow.

In the above example, the private information that Joe Ferrara has sciatica and had surgery for it is revealed even though it is not needed to answer the user’s query.
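Concretely, the leak happens when the retrieved text is pasted into the prompt. Here is a minimal sketch of a naive RAG prompt builder in Python; the function and prompt wording are hypothetical, not the actual chatbot's code:

```python
def build_rag_prompt(question: str, retrieved_notes: list[str]) -> str:
    # Retrieved medical notes are inserted into the prompt verbatim, so any
    # PII they contain is visible to the LLM and can end up in its answer.
    context = "\n\n".join(retrieved_notes)
    return (
        "You are a healthcare assistant. Do not reveal patient information.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Even with the "do not reveal" instruction in place, the PII is already in the prompt, and the LLM may repeat it.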

Another popular technique, closely related to RAG, is few shot prompting. In few shot prompting, you ask an LLM to do a task and include a few examples of how to do the task in the prompt. Usually those examples come from your data. If you're using few shot prompting in production with private data as the examples, the LLM may leak that private information by revealing what is in its prompt.
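As a rough sketch of how this happens, the snippet below builds a few shot prompt directly from hypothetical customer records, so the real names and amounts in those records travel along with every request:

```python
# Hypothetical customer records pulled from a production database.
examples = [
    {
        "transactions": "Jane Doe paid $450 to Dr. Patel on 2024-03-02.",
        "summary": "One medical payment of $450 in early March.",
    },
    {
        "transactions": "John Smith paid $1,200 rent to Oak St Properties on 2024-03-01.",
        "summary": "One rent payment of $1,200 at the start of March.",
    },
]

# The examples, real names and amounts included, are pasted directly into
# the prompt, so the LLM can be coaxed into repeating them.
prompt = "Summarize the customer's transactions in one sentence.\n\n"
for example in examples:
    prompt += f"Transactions: {example['transactions']}\nSummary: {example['summary']}\n\n"

# The new customer's transactions would be appended here at request time.
prompt += "Transactions: {new_customer_transactions}\nSummary:"
```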

Memorization and regurgitation of training data

When an LLM is fine tuned or trained on sensitive data, the LLM may memorize the data and regurgitate it when prompted with language similar to the training data. This memorization and regurgitation phenomenon is the basis of the New York Times copyright lawsuit against OpenAI.

For example, let’s say you’ve fine tuned an LLM to summarize a customer’s financial transactions from the past month such that the LLM uses a specific language style and format for displaying the summary in your app. If a customer using the app has PII similar to that of someone in the fine tuning data, it’s possible the person from the fine tuning data could have their private information revealed in the summary for the app user.

An image showing sensitive invoice data, including name, address, invoice number, date, due date, and amount due.
Invoice that is part of fine tuning data
An image showing the following text with PII highlighted: Joseph Ferrera, on March 15, you had an invoice sent to 1234 Main St. San Francisco CA 94107 for $12,000.
Part of the summary of March transactions for Joseph Ferrera, a different person than the person from the fine tuning data.

While a fine tuned LLM revealing private information from its fine tuning data may be an edge case, any chance of exposing a real user's private information is not a risk worth taking.

Tonic Textual to the rescue

Tonic Textual is a text de-identification tool that uses state-of-the-art proprietary named-entity recognition (NER) models to identify personally identifiable information (PII) in text. It can be used as part of your LLM pipeline to protect private information in your free text data. Textual's text de-identification comes in two flavors:

  • Redaction: PII is removed and replaced with placeholders that label the entity type.
  • Synthesis: PII is removed and replaced with realistic but fake, non-sensitive data.

Depending on your use case, you may want to use redaction or synthesis. We’ll see examples of each below. Textual supports many entity types with new types added regularly. Entity types that Textual covers include first name, last name, street address, city, state, zip code, phone number, date time, and many others.
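As a quick illustration of the difference between the two modes, here is a before-and-after sketch; the placeholder labels and fake values are illustrative assumptions, in the style of the [NAME_GIVEN] markers shown later in this post:

```python
original = "Joe Ferrara visited our San Francisco clinic on March 15."

# Redaction: PII is replaced with typed placeholders.
redacted = "[NAME_GIVEN] [NAME_FAMILY] visited our [LOCATION_CITY] clinic on [DATE_TIME]."

# Synthesis: PII is replaced with realistic but fake values.
synthesized = "Mark Olsen visited our Portland clinic on April 2."
```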

If you’re sending private data to 3rd party APIs, then you can use Textual to redact the text sent to the 3rd party, and then un-redact the response from the 3rd party API before presenting it to your end user. This prevents the 3rd party from getting the private data because you’ve redacted it before sending it to them.
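Here is a minimal sketch of that pattern in Python, wrapping a call to OpenAI's chat completions API. The redact and unredact helpers are toy stand-ins for Textual's NER-based de-identification, and the model name is just an example:

```python
from openai import OpenAI

client = OpenAI()

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Toy stand-in for Textual's redaction step: replace PII with
    placeholders and return a mapping so they can be swapped back later.
    (A real pipeline would use Textual's NER models, not string matching.)"""
    mapping = {"[NAME_GIVEN_1]": "Joe", "[NAME_FAMILY_1]": "Ferrara"}
    redacted = text.replace("Joe", "[NAME_GIVEN_1]").replace("Ferrara", "[NAME_FAMILY_1]")
    return redacted, mapping

def unredact(text: str, mapping: dict[str, str]) -> str:
    """Swap placeholders back to the original values before the response
    is shown to the end user."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

def ask_llm(user_message: str) -> str:
    redacted_message, mapping = redact(user_message)

    # Only de-identified text is ever sent to the 3rd party API.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": redacted_message}],
    )

    # Restore the original values before presenting the answer.
    return unredact(response.choices[0].message.content, mapping)
```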

Similarly, if you’re doing RAG or few shot prompting, you can redact the private data in the prompt sent to the LLM to ensure that the LLM does not leak any of the PII in the retrieved context.
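The same idea can be applied to the naive prompt builder from the earlier sketch: de-identify each retrieved chunk before it enters the prompt, reusing the hypothetical redact helper from the previous snippet:

```python
def build_safe_rag_prompt(question: str, retrieved_notes: list[str]) -> str:
    # Same prompt assembly as before, but every retrieved chunk is
    # de-identified first, so no real PII ever reaches the LLM.
    safe_notes = [redact(note)[0] for note in retrieved_notes]
    context = "\n\n".join(safe_notes)
    return (
        "You are a healthcare assistant.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```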

An image showing a chat transcript between a patient and a chatbot. The sensitive information has been redacted and replaced with PII markers like [NAME_GIVEN].
In the example from before, the LLM no longer reveals the name Joe Ferrara because the retrieved context is de-identified in the LLM prompt.

When fine tuning an LLM, you can use Tonic Textual to synthesize the free text data used for fine tuning, replacing the identified PII with fake PII. The synthesis makes it so that the training data looks like your normal free text data, but it does not contain any real PII.
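A sketch of that preprocessing step is below. The synthesize helper is a toy stand-in for Textual's synthesis mode, and the JSONL file names and prompt/completion fields are assumptions about your fine tuning format:

```python
import json

def synthesize(text: str) -> str:
    """Toy stand-in for Textual's synthesis mode: real PII is replaced with
    realistic fake values. (A real pipeline would use Textual's NER models
    to find the PII and generate consistent fake replacements.)"""
    return text.replace("Joseph Ferrera", "Mark Olsen").replace("1234 Main St", "88 Pine Ave")

# De-identify every training example before fine tuning.
with open("train.jsonl") as src, open("train_synthesized.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        record["prompt"] = synthesize(record["prompt"])
        record["completion"] = synthesize(record["completion"])
        dst.write(json.dumps(record) + "\n")
```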

An image showing the invoice data from before, but the sensitive information has been replaced with contextually relevant synthetic data.
Using Tonic Textual in Synthesis mode, the real PII in the original invoice is replaced with fake PII, preventing the name collision that occurs for Joseph Ferrera.

While it's still possible for the LLM to memorize and regurgitate the training data, no real PII can be revealed, only fake PII coming from the synthesis.

Compliance

Using Tonic Textual has the added benefit of helping you comply with relevant data privacy laws regarding how you use private data with LLMs. De-identifying data using NER models aligns with privacy regulations such as GDPR in the EU and CCPA in California, which mandate the protection of personal information. By implementing NER preprocessing steps, organizations can use LLMs more ethically and responsibly, ensuring that their operations remain within legal boundaries and maintain public trust.

Conclusion

As LLMs continue to shape the future of technology, ensuring the privacy and security of the data they process must be a top priority. By integrating Tonic Textual into your data preprocessing pipeline, you can significantly mitigate the risks associated with data privacy in LLM applications. This not only protects individuals' privacy but also ensures that your organization remains compliant with evolving data protection regulations. As we move forward, the synergy between NER models and LLMs will play a pivotal role in fostering a safer, more privacy-conscious digital landscape.

Curious to try out Textual on your own data? Sign up for a free trial and get started today.
