How to create de-identified embeddings with Tonic Textual & Pinecone

Author
Joe Ferrara, PhD
July 2, 2024

Maintaining data privacy while using powerful machine learning tools is both crucial and difficult when incorporating LLMs into business processes. Text embeddings have become ubiquitous as the semantic search layer in LLM applications, most notably in retrieval augmented generation (RAG). When using text embeddings with sensitive private data, it’s imperative to give the embeddings the same level of protection as the raw text. Recent research has shown that text embeddings reveal almost as much as the text itself. Therefore, you shouldn’t store or expose text embeddings in any way that you wouldn’t store or expose the raw text.

Tonic Textual is built for de-identifying raw text so that private information in the text is protected against data leakage and breach. The de-identified text can then be freely used in downstream applications. To protect private information stored in text embeddings, it’s essential to de-identify the text before embedding and storing it in a vector database. 

In this guide, I'll demonstrate how to de-identify and chunk text using Tonic Textual, and then easily embed these chunks and store the data in a Pinecone vector database to use for semantic search in RAG or other LLM applications.

The need for de-identified data in Pinecone

Pinecone is one of the most popular vector databases, known for its advanced search functionality and performance at enterprise scale. Tonic Textual and Pinecone help you perform semantic search on your unstructured documents by:

  • Extracting text from unstructured documents with Tonic Textual. The platform makes it simple to build automated data pipelines that extract and normalize text from your unstructured documents into a format suitable for LLM applications.
  • Chunking and de-identifying the extracted text with Tonic Textual. The platform performs smart chunking on the extracted text from your documents and uses state-of-the-art named entity recognition (NER) models to de-identify the text in the chunks. With Textual, you can take your data from messy, raw documents to chunks that are ready for embedding and ingestion into a vector database in minutes, not days.
  • Using Pinecone as a vector database to store the text embeddings of the chunks and to handle vector retrieval calculations. With Pinecone it’s easy to set up, store, and retrieve vectors in a vector database.

Together, Textual and Pinecone enable you to build scalable, performant RAG systems with the peace of mind that your private data is protected from leaks and breaches.

Let’s look at this step-by-step with some examples.

Setting up Tonic Textual

To illustrate this process, I'll work with a single PDF: the first 50 pages of the 2022 American Express 10-K report. The entire 10-K is available here. After creating an account at textual.tonic.ai, I’ll create a Pipeline and upload this PDF to it. Tonic Textual will immediately begin parsing the PDF. Once the PDF is parsed, it can be chunked with the Tonic Textual SDK, which is installed with pip install tonic-textual.

De-identifying text

Before chunking the PDF, let's discuss de-identification. Tonic Textual de-identifies text by identifying named entities and either redacting or synthesizing them. Named entities are identified using Tonic’s state-of-the-art named entity recognition (NER) models, which can detect several dozen entity types. A full list of entity types is available here. Redacting a named entity involves replacing the named entity with a placeholder that describes the type of entity redacted but does not reveal the named entity itself. Synthesizing a named entity involves replacing the named entity with a fake value of the same entity type to retain the semantic meaning of the data. This is best understood with a brief example.

If you want to de-identify the sentence “My name is Joe and I work at Tonic”:

  • Redaction: “My name is [NAME_GIVEN] and I work at [ORGANIZATION].”
  • Synthesis: “My name is Frank and I work at Best Buy.”
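To make the difference concrete, here is a minimal sketch of the two replacement modes in plain Python. It is not how Tonic Textual works internally: the entity spans are hand-labeled instead of detected by an NER model, and the FAKE_VALUES table and the deidentify helper are hypothetical names I made up for illustration.

```python
# Toy illustration of redaction vs. synthesis. The spans are hand-labeled;
# a real system would get them from an NER model.

# Hypothetical lookup of fake replacements, keyed by entity type.
FAKE_VALUES = {"NAME_GIVEN": "Frank", "ORGANIZATION": "Best Buy"}


def deidentify(text, spans, mode):
    """Replace each (start, end, label) span in `text`.

    mode="redact" inserts a [LABEL] placeholder;
    mode="synthesize" inserts a fake value of the same entity type.
    """
    if mode not in ("redact", "synthesize"):
        raise ValueError(f"unknown mode: {mode}")
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        if mode == "redact":
            pieces.append(f"[{label}]")
        else:
            pieces.append(FAKE_VALUES[label])
        cursor = end
    pieces.append(text[cursor:])  # untouched text after the last entity
    return "".join(pieces)


sentence = "My name is Joe and I work at Tonic."
# Hand-labeled entities, converted to character offsets.
labeled = [("Joe", "NAME_GIVEN"), ("Tonic", "ORGANIZATION")]
spans = [
    (sentence.index(value), sentence.index(value) + len(value), label)
    for value, label in labeled
]

print(deidentify(sentence, spans, "redact"))
print(deidentify(sentence, spans, "synthesize"))
```

Redaction keeps the entity type visible in the text, while synthesis keeps the sentence natural but introduces values that were never in the source.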

To use Tonic Textual to chunk and de-identify the American Express 10-K PDF, the first step is to determine which named entity types to identify and whether to redact or synthesize them. Because I don’t want to reveal any personally identifiable information (PII) that may occur in the PDF, I specify a set of sensitive entity types to detect.

For retrieval augmented generation (RAG), it’s crucial to redact rather than synthesize the entities. Redacting avoids inserting fake PII into the chunks, which could interfere with the retrieval process. This configuration is specified in the generator_config object in the Textual SDK:
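A configuration along these lines might look like the sketch below. The entity type names are the ones that appear in the examples in this post; the exact list you need depends on your data. I'm assuming here that generator_config is a dictionary mapping each entity type to a handling mode, and that the mode string for redaction is "Redaction"; check both against the Textual SDK documentation.

```python
# Entity types to de-identify. These labels are the ones visible in the
# examples in this post; extend the list to fit your own data.
sensitive_entity_types = [
    "NAME_GIVEN",
    "NAME_FAMILY",
    "GENDER_IDENTIFIER",
    "ORGANIZATION",
    "LOCATION",
    "DATE_TIME",
]

# Assumption: the SDK accepts a dict of entity type -> handling mode, and
# "Redaction" is the mode string for placeholder-style redaction.
generator_config = {label: "Redaction" for label in sensitive_entity_types}
```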

Chunking text

With the configuration set, it’s time to retrieve the text from the PDF and create chunks with the specified named entities redacted. First, create an API key from the Tonic Textual UI and set it as the api_key variable. Also, get the Pipeline ID for the created Pipeline and set it as the pipeline_id variable. The following code retrieves the files from the Pipeline and chunks them, redacting the specified sensitive named entity types:
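A sketch of that loop is below. I've written it against a pipeline object passed in as a parameter so that the logic is visible without network access. The method names enumerate_files and get_chunks, and the generator_config keyword, are my assumptions about the Textual SDK's pipeline interface, as is the get_pipeline_by_id call mentioned in the trailing comment; confirm them against the SDK reference for your installed version.

```python
def collect_redacted_chunks(pipeline, generator_config):
    """Gather de-identified chunks from every file in a Textual Pipeline.

    Assumes `pipeline.enumerate_files()` yields parsed files and that each
    file exposes `get_chunks(generator_config=...)` returning a list of chunks.
    """
    chunks = []
    for parsed_file in pipeline.enumerate_files():
        chunks.extend(parsed_file.get_chunks(generator_config=generator_config))
    return chunks


# Hypothetical wiring (not executed here; the names are assumptions):
#   textual = <Textual client constructed with api_key>
#   pipeline = textual.get_pipeline_by_id(pipeline_id)
#   chunks = collect_redacted_chunks(pipeline, generator_config)
```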

Here's an example of text from one of the de-identified chunks:

  • [GENDER_IDENTIFIER_93y0] [NAME_FAMILY_lCiDtD] (49) has been Group President, Commercial Services and Credit & Fraud Risk since [DATE_TIME_yIw2teH184z0]. Prior thereto, [GENDER_IDENTIFIER_l8P2] had been President, [ORGANIZATION_tjt3Xq3GiqjVdzyWELVea1] since [DATE_TIME_sskpxnXiLvehFswD]. [GENDER_IDENTIFIER_93y0] [NAME_FAMILY_lCiDtD] joined [ORGANIZATION_srnicJ1D3XhWCOI4GT] from [ORGANIZATION_Rwb7EiYx8moZC3IVR9uHF0QLVD], where [GENDER_IDENTIFIER_l8P2] served as Regional CEO, [ORGANIZATION_M1dqe2] and [LOCATION_VItlLf2lsoq1] since [DATE_TIME_7TggFwRoYKIZP1C].

And this is the original text of the same chunk, before de-identification:

  • Ms. Marrs (49) has been Group President, Commercial Services and Credit & Fraud Risk since April 2021. Prior thereto, she had been President, Commercial Services since September 2018. Ms. Marrs joined American Express from Standard Chartered Bank, where she served as Regional CEO, ASEAN and South Asia since November 2016.

The PII from the original chunk has been redacted. In the redacted text, the random character strings following the entity type (e.g., the "93y0" in "[GENDER_IDENTIFIER_93y0]") are unique identifiers for each input string, allowing for consistent redaction across documents in your pipeline. For instance, [GENDER_IDENTIFIER_93y0] appears multiple times, each time corresponding to the string "Ms."
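The consistency property is easy to picture with a small sketch. The code below is not Tonic Textual's actual scheme, which this post doesn't describe; it only shows the general idea that deriving the suffix deterministically from the input string makes the same string map to the same placeholder every time. Here I use a keyed hash, and SECRET_KEY and placeholder are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical secret; keying the hash keeps suffixes from being guessable
# by simply hashing candidate values.
SECRET_KEY = b"replace-with-a-real-secret"


def placeholder(entity_type, value, length=6):
    """Return a placeholder like [NAME_FAMILY_<suffix>] that is stable per value."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"[{entity_type}_{digest[:length]}]"


# The same input string always yields the same placeholder...
assert placeholder("GENDER_IDENTIFIER", "Ms.") == placeholder("GENDER_IDENTIFIER", "Ms.")
# ...while a different string gets its own suffix.
print(placeholder("GENDER_IDENTIFIER", "Ms."))
print(placeholder("GENDER_IDENTIFIER", "she"))
```

Stable placeholders matter for retrieval: two chunks that mention the same person or organization still share a token after redaction, so some of the link between them survives in the embeddings.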

Embedding and storing with Pinecone

The next step is to create a Pinecone database to store the embeddings of our chunks. After creating an account at pinecone.io, set your Pinecone API key as the pinecone_api_key variable. Since OpenAI’s text-embedding-3-small model will be used for embedding, the vector database dimension is 1536.
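Creating the index could look like the sketch below, using the Pinecone Python client. The index name, the cosine metric, and the serverless cloud and region are placeholder choices of mine, not values from my original setup; the dimension of 1536 is the default output size of text-embedding-3-small.

```python
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSION = 1536  # default output size of text-embedding-3-small
INDEX_NAME = "deidentified-10k"  # hypothetical index name


def get_or_create_index(pinecone_api_key, name=INDEX_NAME):
    """Return a handle to the index, creating it first if it doesn't exist."""
    # Imported lazily so this module loads without the pinecone package.
    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(api_key=pinecone_api_key)
    if name not in pc.list_indexes().names():
        pc.create_index(
            name=name,
            dimension=EMBEDDING_DIMENSION,
            metric="cosine",  # assumption: cosine similarity for retrieval
            spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # placeholders
        )
    return pc.Index(name)
```

The index dimension has to match the embedding model's output size, so if you switch models or request shortened embeddings, change EMBEDDING_DIMENSION to match.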

The chunks are embedded and upserted into the Pinecone database within the same for loop. While there are more efficient ways to chunk, embed, and load large amounts of data, this simple approach is sufficient for our example.
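A minimal version of that loop is sketched below. To keep it runnable without credentials, the embedding call and the index are passed in as parameters: embed_fn is a hypothetical callable that maps a list of strings to a list of vectors, and index is anything with a Pinecone-style upsert(vectors=...) method. I also assume each chunk has already been reduced to its de-identified text string. The batch size and the chunk-N ID format are arbitrary choices for illustration.

```python
def embed_and_upsert(chunk_texts, embed_fn, index, batch_size=50):
    """Embed de-identified chunk texts and upsert them in batches.

    Returns the number of vectors sent to the index.
    """
    total = 0
    for start in range(0, len(chunk_texts), batch_size):
        batch = chunk_texts[start:start + batch_size]
        vectors = embed_fn(batch)
        records = [
            {
                "id": f"chunk-{start + offset}",
                "values": vector,
                # Store the redacted text so query results are readable.
                "metadata": {"text": text},
            }
            for offset, (text, vector) in enumerate(zip(batch, vectors))
        ]
        index.upsert(vectors=records)
        total += len(records)
    return total


def openai_embed_fn(openai_api_key, model="text-embedding-3-small"):
    """Build an embed_fn backed by the OpenAI embeddings endpoint."""
    from openai import OpenAI  # imported lazily; requires the openai package

    client = OpenAI(api_key=openai_api_key)

    def embed(texts):
        response = client.embeddings.create(model=model, input=texts)
        return [item.embedding for item in response.data]

    return embed
```

Only the redacted text goes into the metadata, so neither the vectors nor the stored payload contain the original PII.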

Querying the database

Now we can retrieve vectors for given queries. By also using Tonic Textual to de-identify queries before they are used for vector retrieval, the queries remain private and are more likely to match the de-identified chunks. To test this out, we’ll ask a specific question about the SEC filing and see if the chunk with the correct answer is retrieved. We’ll follow the common practice in tutorials of retrieving the top 5 chunks. The following code snippet retrieves the top 5 chunks for the question “Was American Express able to retain card members during 2022?”:
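Such a query step might look like the sketch below. As before, the external services are injected: redact_fn stands in for a call to Tonic Textual that redacts the question with the same configuration used for the chunks, embed_fn maps a list of strings to vectors, and index is a Pinecone-style index. All three parameter names are hypothetical, and I assume the redacted chunk text was stored under a text metadata key at upsert time.

```python
def retrieve_top_chunks(question, redact_fn, embed_fn, index, top_k=5):
    """De-identify a question, embed it, and return the top_k matching chunks.

    Each result is a (chunk_id, similarity_score, redacted_chunk_text) tuple.
    """
    redacted_question = redact_fn(question)
    query_vector = embed_fn([redacted_question])[0]
    response = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,  # return the stored chunk text with each match
    )
    return [
        (match["id"], match["score"], match["metadata"]["text"])
        for match in response["matches"]
    ]
```

Redacting the question first means the query and the stored chunks use the same placeholder vocabulary, and the raw question never reaches the embedding provider or the vector database.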

Despite the redaction, the correct chunk is retrieved as the 4th chunk. The text from the chunk that reveals the correct answer is:

  • Net card fees increased 17 percent [DATE_TIME_mn5ovMCBUGCkrqnK], as new card acquisitions reached record levels in [DATE_TIME_9UWe7] and [ORGANIZATION_80W79ZoOwE4I0] retention remained high, demonstrating the impact of investments we have made in our premium value propositions. 

This corresponds to the original text:

  • Net card fees increased 17 percent year-over-year, as new card acquisitions reached record levels in 2022 and Card Member retention remained high, demonstrating the impact of investments we have made in our premium value propositions.

Thus, the query “Was American Express able to retain card members during 2022?” is answered with a "yes". The document and question here are for illustrative purposes, to show how easily semantic search can work with de-identified embeddings using Tonic Textual and Pinecone. Ultimately, whether you’d use this approach depends on your use case and on how sensitive your data is.

Conclusion

In this article, I demonstrated how to create de-identified embeddings using Tonic Textual and store them in a Pinecone vector database for efficient retrieval. This process enhances privacy while maintaining the utility of your data. Try it out yourself by creating a free account at textual.tonic.ai and see the benefits of de-identifying your text data.

Joe Ferrara, PhD
Staff AI Scientist
Joe is a Senior Data Scientist at Tonic. He has a PhD in mathematics from UC Santa Cruz, giving him a background in complex math research. At Tonic, Joe focuses on implementing the newest developments in generating synthetic data.
