
Tonic Textual: document redaction to de-identify PDFs and Word docs

Author
Lyon Van Voorhis
January 23, 2024

Last month, we announced the launch of Tonic Textual, our new synthetic data product for free-text data, extending Tonic.ai’s platform coverage to unstructured data. With Tonic Textual, you can automatically detect sensitive entities in free-text data using our pre-trained named entity recognition (NER) models or by training your own custom model. Tonic Textual then protects the sensitive data points by redacting them or, optionally, by synthesizing contextually relevant replacement data that maintains the realism and utility of your text. The de-identified data can be used safely within your existing data pipelines or to enable previously impractical or insecure use cases, such as training third-party large language models on your private data.

Today, we’re excited to announce a new capability for Tonic Textual: support for PDF (.pdf) and Word (.doc/.docx) files. You can now use Tonic Textual to redact and replace sensitive data in document files, expanding the ways that you can use data stored in these formats while minimizing the risk of exposing sensitive information.

In this post, I’ll discuss the challenges inherent to handling PDF data versus plain free-text data, the approach we’ve taken, and the tools and solutions we’ve delivered to help you unlock the value of your PDF data. While this post focuses on the PDF use case, all of the workflows and features described here for PDFs are supported for Word documents as well.

Challenges with protecting PDFs

Compared to a typical text file, PDFs are much more complex. Data can appear in many different formats, including images, tables, and graphs. Text within a PDF also varies widely, from handwritten notes on a patient record, to typed memos on an invoice, to columnar text in a scan of a newspaper article.

To de-identify a PDF, we need to extract the text, identify which information should be redacted based on the surrounding context, and then apply a replacement that fits the sensitive entity type being removed. For example, “Charlotte, NC” should be replaced with a location name, while “Charlotte Jones” should be replaced with a person’s name.
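As a rough illustration of type-aware replacement, here is a minimal Python sketch. It is not Textual’s implementation: the function names, the `NAME` and `LOCATION` labels, and the tiny replacement pools are all hypothetical, and it assumes that detection has already produced character spans tagged with an entity type.

```python
import hashlib

# Hypothetical replacement pools keyed by entity type (illustration only).
REPLACEMENTS = {
    "NAME": ["Dana Whitfield", "Marcus Oyelaran", "Priya Raman"],
    "LOCATION": ["Tucson, AZ", "Dayton, OH", "Salem, OR"],
}


def synthesize(value: str, entity_type: str) -> str:
    """Pick a replacement of the same entity type, stable for a given input."""
    pool = REPLACEMENTS.get(entity_type)
    if pool is None:
        # Unknown type: fall back to a labeled placeholder instead of guessing.
        return f"[{entity_type}]"
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return pool[digest[0] % len(pool)]


def replace_entities(text: str, entities) -> str:
    """Replace each (start, end, entity_type) span with a same-type value."""
    out = []
    cursor = 0
    for start, end, entity_type in sorted(entities):
        if start < cursor:
            raise ValueError("overlapping entity spans are not supported")
        out.append(text[cursor:start])  # untouched text before the entity
        out.append(synthesize(text[start:end], entity_type))
        cursor = end
    out.append(text[cursor:])  # untouched text after the last entity
    return "".join(out)
```

Given `"Charlotte Jones moved to Charlotte, NC."` with spans `(0, 15, "NAME")` and `(25, 38, "LOCATION")`, the first span is swapped for a value from the name pool and the second for a value from the location pool, even though both begin with the same word.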

Our approach to de-identifying sensitive data in PDFs

To solve this problem, we divided the process into discrete steps. To extract the text, we’re using Azure’s Document Intelligence service, which employs advanced machine learning to pull text out of images and documents, including PDFs. In addition to the text content, it also returns some structural information, such as whether the text is in a table or in its own paragraph.

With the text data in hand, we then use the same Tonic Textual NER models used for non-document text files (including custom models, if you’ve created any) to identify sensitive data within the text and to determine what type of data each value represents.
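One practical detail at this step is connecting a detected entity, which lives in a plain string, back to a position on the page. The sketch below shows one way to do that, assuming the extraction step yields words with bounding boxes and the NER step yields character offsets. The `Word` type and both helper functions are hypothetical and do not reflect Textual’s internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    box: tuple  # (x0, y0, x1, y1) in page coordinates


def join_words(words):
    """Join words with single spaces and record each word's character span."""
    parts, spans, cursor = [], [], 0
    for word in words:
        if parts:
            cursor += 1  # account for the joining space
        spans.append((cursor, cursor + len(word.text)))
        parts.append(word.text)
        cursor += len(word.text)
    return " ".join(parts), spans


def boxes_for_entity(spans, words, start, end):
    """Return the box of every word that overlaps the span [start, end)."""
    return [
        word.box
        for (w_start, w_end), word in zip(spans, words)
        if w_start < end and start < w_end
    ]
```

The joined string is what a detection model would see, and the recorded spans let an entity that covers several words, such as a full name, resolve to every box that needs to be covered.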

Once the sensitive data has been identified, we then either apply redactions (removing text data from within the PDF and replacing it with a black box) or synthesize replacement text (removing the text data, covering it with a white box, and adding contextually relevant replacement text on top).
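The difference between the two output modes can be sketched as a small planning function. This is an illustrative assumption rather than Textual’s code: `DrawOp`, `plan_ops`, and the entity dictionary keys are hypothetical names, and the sketch only plans what gets drawn over each entity. Removing the underlying text from the PDF, which the process described above also does, is outside its scope.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DrawOp:
    box: tuple                  # (x0, y0, x1, y1) region to cover
    fill: str                   # "black" for redaction, "white" for synthesis
    text: Optional[str] = None  # replacement text drawn on top, if any


def plan_ops(entities, mode: str,
             synthesize: Optional[Callable[[str, str], str]] = None):
    """Plan one cover operation per detected entity.

    entities: dicts with hypothetical keys "box", "text", and "type".
    mode: "redact" (black box) or "synthesize" (white box plus new text).
    """
    if mode not in ("redact", "synthesize"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "synthesize" and synthesize is None:
        raise ValueError("synthesize mode needs a replacement function")

    ops = []
    for entity in entities:
        if mode == "redact":
            ops.append(DrawOp(box=entity["box"], fill="black"))
        else:
            replacement = synthesize(entity["text"], entity["type"])
            ops.append(DrawOp(box=entity["box"], fill="white", text=replacement))
    return ops
```

Keeping the plan separate from the rendering means the same detected entities can produce either a redacted or a synthesized document without rerunning detection.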

Additional functionality for protecting PDFs

As with free-text data files, you can control what Textual redacts and synthesizes in your PDFs. In addition to automated detection of sensitive data in PDFs, we’ve added the ability to manually adjust Textual’s redactions, either by adding your own new redactions or by removing false positives. This provides an additional layer of safety and customization to ensure that the outputs meet your needs.
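Conceptually, a manual adjustment is a merge of three sets: what the models detected, what a reviewer added, and what a reviewer dismissed as a false positive. Here is a minimal sketch under that assumption. The function name and the tuple layout are hypothetical and are not part of any Textual API.

```python
def apply_overrides(detected, added, removed):
    """Combine automatic detections with manual review decisions.

    detected, added: iterables of (start, end, entity_type) tuples.
    removed: iterable of (start, end) spans flagged as false positives.
    Returns the final redactions, deduplicated and sorted by position.
    """
    dismissed = set(removed)
    kept = {d for d in detected if (d[0], d[1]) not in dismissed}
    return sorted(kept | set(added))
```

In this sketch, a manually added redaction is kept even if the same span was also dismissed as an automatic detection, so an explicit addition always wins over a removal.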

Suppose you need to redact thousands of documents on an ongoing basis. Manually overriding Textual’s redactions in each one would be cumbersome at scale. For forms that share the same format and require the same level of protection, you can instead create a template that teaches Textual how to redact similarly formatted PDF files, and then apply that template across multiple documents.
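To make the idea of a template concrete, here is one possible way to model it. This post does not describe how Textual stores templates, so everything in the sketch is an assumption: it treats a template as a list of labeled regions expressed as fractions of the page size, which can then be scaled to any page that shares the same layout. `Region` and `apply_template` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    label: str  # what the region holds, for example "patient_name"
    x0: float   # left edge, as a fraction of page width (0.0 to 1.0)
    y0: float   # top edge, as a fraction of page height
    x1: float   # right edge, as a fraction of page width
    y1: float   # bottom edge, as a fraction of page height


def apply_template(regions, page_width: float, page_height: float):
    """Scale each fractional region to absolute page coordinates."""
    return [
        (
            region.label,
            (
                region.x0 * page_width,
                region.y0 * page_height,
                region.x1 * page_width,
                region.y1 * page_height,
            ),
        )
        for region in regions
    ]
```

Storing regions as fractions rather than absolute coordinates is a design choice in this sketch: it lets one template apply to scans of the same form at different resolutions.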

We are constantly working on improving our named entity recognition models and offering more customization to minimize any extra effort, but we also know how important it is to offer maximum control over the redacted final product.

The takeaway

In summary, Tonic Textual can now be used with PDF files. Files are scanned, text data is extracted, and we apply the same detection models that we use in the rest of Tonic Textual to identify sensitive text information within the PDFs. We then remove the data from the document, and either replace the text with a black redaction bar or synthesize text to replace it. Finally, we offer customization, both in the form of custom trainable detection models and manual overrides of the detected sensitive data. Once Tonic Textual has done its work, you can download a redacted or de-identified version of your document that is safe for downstream use in AI development, data pipelines, and document sharing.

We’d love to hear what you think. Sign up for a free account today.

Lyon Van Voorhis
Engineering
Lyon is a senior software engineer at Tonic.ai. He is currently working full-time on Tonic Textual, with a specific focus on PDFs.
