Sensitive data in text embeddings is recoverable

Author

Mehul Kalia

July 19, 2024

The pace of generative AI adoption in enterprises shows no signs of slowing down. Companies of all shapes and sizes are actively experimenting with generative AI tools, and many are starting to deploy these systems into production. One particularly popular implementation is Retrieval-Augmented Generation (RAG), which leverages text embeddings stored in vector databases. As more companies use their own data with AI tools, ensuring the privacy and security of the data becomes critically important.

Embeddings of text should be treated with the same privacy precautions as raw text data. In the paper Text Embeddings Reveal (Almost) As Much as Text, Morris et al. developed a model that reconstructs text from its embeddings, recovering 92% of 32-token text inputs exactly.

This finding has notable data security implications for enterprises building RAG systems on their own data: text is highly recoverable from its embeddings, so text embeddings should be treated with the same privacy safeguards as the original text. One action you can take to safeguard your private data against leakage is to redact sensitive information from the text before embedding. Tonic Textual is built to perform these redactions, eliminating the possibility of recovering PII from embeddings.

We ran an experiment using Morris et al.’s Vec2Text model to demonstrate the privacy risk of text embeddings that contain sensitive data. As we’ll show, a large percentage of sensitive data can be recovered from the text embeddings alone, posing a significant privacy risk and demonstrating the need to use a tool like Tonic Textual to protect your data before using it to build generative AI systems.

What is an embedding?

To generate accurate answers to user queries, RAG applications must first retrieve relevant information to inject into an LLM’s context. This retrieval step is an example of text search: finding the most relevant segments of text in a large store of text chunks. While this is a problem as old as the internet, the modern approach uses large transformer models to embed text into vectors in such a way that semantically similar texts result in numerically close vectors. These embeddings are typically stored in vector databases side by side with the text, which allows for efficient search and retrieval by looking for the vectors nearest to a given query. Best practices for handling sensitive data demand that the text be encrypted at rest, but embeddings are typically not encrypted, because retrieval depends on vector arithmetic with the embeddings. As we will see, this presents a security risk, as these embedding vectors can be used to reconstruct PII in the text.
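To make the retrieval step concrete, here is a minimal sketch of nearest-vector search using cosine similarity. The three-dimensional vectors, the chunk texts, and the helper names (cosine_similarity, nearest_chunks) are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database would use an index rather than a full sort.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_chunks(query_vec, store, k=1):
    """Return the k chunk texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs; the numbers are invented, not real embeddings.
store = [
    ("Refund policy for enterprise plans", [0.9, 0.1, 0.0]),
    ("Office address and phone directory", [0.1, 0.9, 0.2]),
    ("Quarterly revenue summary", [0.0, 0.2, 0.9]),
]

query = [0.2, 0.8, 0.1]  # stands in for the embedding of a user query
top = nearest_chunks(query, store, k=1)
print(top)
```

Note that the search only ever touches the vectors, which is why they usually sit unencrypted next to the text.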

Experiment

All source code can be seen in this Google Colab notebook.

Generating Strings

First, let’s generate some sample text that has mock sensitive data using GPT-4. We used text lengths of roughly 30, 100 (length of a sentence), 250 (length of a paragraph), 1000, and 5000 (length of an essay) characters.

Here’s an example of the mock text that GPT-4 generated.

Lucas Gray, an Account Manager at Acme Corp, can be reached at 313-555-0198 or lgray@acmecorp.com. He works out of their office located at 1923 3rd St, Boston, MA.

Extracting Sensitive Data (Ground Truth)

Next we’ll detect the sensitive data using Tonic Textual and store it as a list of PII strings. This is our ground truth of what PII exists in these strings before any embedding operations.

Generating Text Embeddings

A text embedding is generated for each string using text-embedding-ada-002.

Reconstructing Text From Embeddings

Using Vec2Text, an iterative model detailed in this paper, we reconstruct strings from the embeddings.

Extracting Sensitive Data From Reconstructed Strings

We can extract sensitive data from the reconstructed strings in the same manner as we did for the original strings using Tonic Textual.

Comparisons With Ground Truth

We can now compare the ground truth (PII extracted from the original generated strings) with the PII extracted from the reconstructed strings. For each group, we divided the number of exact matches of PII by the total number of PII in our ground truth to get our percentage of sensitive data recovered.
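The metric described above can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's code: recovery_rate is a hypothetical helper, matching is case-sensitive exact string equality, and the "recovered" list is invented rather than taken from an actual reconstruction.

```python
def recovery_rate(ground_truth, recovered):
    """Fraction of ground-truth PII strings that appear verbatim in `recovered`."""
    if not ground_truth:
        return 0.0
    recovered_set = set(recovered)
    matches = sum(1 for value in ground_truth if value in recovered_set)
    return matches / len(ground_truth)


# PII from the mock text shown earlier in the post.
ground_truth = [
    "Lucas Gray",
    "Acme Corp",
    "313-555-0198",
    "lgray@acmecorp.com",
    "1923 3rd St, Boston, MA",
]

# Hypothetical PII extracted from a reconstructed string (invented values):
# the phone number has two digits swapped, so it does not count as a match.
recovered = [
    "Lucas Gray",
    "Acme Corp",
    "313-555-0189",
    "1923 3rd St, Boston, MA",
]

rate = recovery_rate(ground_truth, recovered)
print(f"{rate:.0%} of ground-truth PII recovered")
```

Because only exact matches count, a near miss such as a phone number with one wrong digit scores zero, which makes this a conservative measure of leakage.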

Results

Graph 1: Percent of Sensitive Data Recovered by Text Length

Graph 2: Percent of Sensitive Data Recovered by Entity Type

The results of Graph 1 are somewhat frightening: for sentence-length text, 40% of PII can be exactly recovered using the embeddings alone. Even for essay-length text, 10% of PII can be exactly recovered.

Graph 2 shows what percentage of each type of sensitive data is recovered. Qualitative PII, like names of people, companies, and locations, is much more likely to be recovered than numeric data like phone numbers or credit cards. This intuitively makes sense: names have less variability in their structure and more contextual cues than numbers.
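A per-entity-type breakdown like the one in Graph 2 can be computed by grouping the same exact-match check by type. The sketch below assumes each ground-truth item is an (entity_type, value) pair; the type labels, the helper name recovery_by_type, and the sample values are hypothetical and are not Tonic Textual's actual output.

```python
from collections import defaultdict


def recovery_by_type(ground_truth, recovered_values):
    """Map each entity type to the fraction of its values recovered verbatim."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    recovered_set = set(recovered_values)
    for entity_type, value in ground_truth:
        totals[entity_type] += 1
        if value in recovered_set:
            hits[entity_type] += 1
    return {t: hits[t] / totals[t] for t in totals}


# Hypothetical labeled ground truth and reconstructed values (invented data).
ground_truth = [
    ("NAME", "Lucas Gray"),
    ("NAME", "Maria Lopez"),
    ("ORGANIZATION", "Acme Corp"),
    ("PHONE", "313-555-0198"),
]
recovered_values = ["Lucas Gray", "Acme Corp", "313-555-0189"]

rates = recovery_by_type(ground_truth, recovered_values)
for entity_type, rate in rates.items():
    print(f"{entity_type}: {rate:.0%}")
```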

The fact that a significant proportion of PII can be exactly recovered from embeddings alone indicates that text embeddings should be treated with the same safeguards as sensitive text. Using Tonic Textual to redact or synthesize this PII before embedding the text and ingesting it into a vector database ensures that none of this sensitive information can be recovered through vector reconstruction attacks.

The Takeaway

Text embeddings do not protect privacy, and sensitive data is highly recoverable from embeddings alone. In our experiment, we found that 40% of all sensitive data in sentence-length text embeddings can be recovered as an exact match with just a few lines of code. Even for larger texts around the length of an essay, over 10% of sensitive data in text embeddings can be recovered as an exact match.

It is crucial to redact or synthesize sensitive data before embedding it, as a first line of defense against data breaches and vector reconstruction attacks. Tonic Textual can be used to easily and automatically redact or synthesize private data in text before it is embedded, and was used in this experiment to completely eliminate the recovery of sensitive data from embeddings, achieving a 0% recovery rate for every single text.

If you’re building a RAG system, protecting the sensitive information in your data is a must, and Tonic Textual can streamline this for you. In minutes, you can create automated data pipelines that ensure your unstructured data is protected and optimized for RAG. You can sign up for a free account here, or connect with a Tonic.ai data protection specialist here.

Mehul Kalia

Data Scientist

Mehul is a Data Science intern at Tonic and is studying Computer Science at the Georgia Institute of Technology. At Tonic, Mehul works on researching the latest generative AI advances' impact on privacy and retrieval.