De-identifying your unstructured data in Databricks with Tonic Textual

Author

Travis Matthews

September 20, 2024

Databricks is quickly emerging as a popular platform for developing AI systems using your company’s data. The company is well known for its advanced capabilities for running large-scale analytics and training jobs on structured data. Recent advances in transformer-based model architectures have now provided the right tools for analyzing unstructured data as well, and Databricks has answered with a host of new features and offerings that help organizations extract value from the troves of unstructured data stored in their data lakes.

Despite all these new tools at our disposal, maintaining data privacy should still be paramount. Generative AI has introduced new vectors through which sensitive data may leak, and it’s up to organizations to protect the privacy of the data and leverage it responsibly. However, unstructured data poses unique challenges for organizations aiming to protect sensitive information, namely the identification of sensitive text data, which can take many forms. Tonic Textual simplifies this process dramatically, allowing your development teams to access the data they need to build the AI systems of their dreams on Databricks-managed data.

By integrating Tonic Textual into your Databricks workflow, you can de-identify unstructured data with ease, ensuring compliance and security without compromising data utility. In this article, we’ll walk you through how Tonic Textual can enhance your Databricks environment to deliver both privacy and performance. Specifically, we’ll demonstrate how the Tonic Textual SDK can be combined with PySpark to detect and then redact or synthesize sensitive free text contained within a Databricks table.

Setting up the notebook

To begin de-identifying data we’ll need a few things:

A Databricks table containing free text we’d like to de-identify:

If you wish to follow along with this guide, you can upload this file[0] into Databricks using its file upload functionality. The commands in this guide are executed inside a Databricks notebook.

A Textual account and API Key:

After creating an account at textual.tonic.ai, create an API Key as detailed in our docs.

While Textual has a web UI, for this guide we’ll be using the Textual SDK from within a Databricks notebook. To get started, we install Textual and PySpark in our notebook, then restart the kernel to ensure the packages have been loaded correctly. We can perform these tasks by running the following commands:
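
A minimal sketch of that setup step. It assumes the SDK is published on PyPI as tonic-textual and imports as tonic_textual. The two notebook commands appear as comments because %pip and dbutils only exist inside a Databricks notebook. The helper missing_packages is a hypothetical name, used here to confirm after the restart that both packages can be imported.

```python
# Notebook cell 1 (Databricks commands, shown as comments because they are
# notebook-only and not plain Python):
#   %pip install tonic-textual pyspark
#   dbutils.library.restartPython()

# Notebook cell 2: confirm the packages are importable after the restart.
import importlib.util


def missing_packages(module_names):
    """Return the top-level modules from module_names that cannot be found."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumption: the SDK's import name is tonic_textual.
print("Missing:", missing_packages(["tonic_textual", "pyspark"]))
```

An empty list means both packages are available to the restarted Python process.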

Once installed, we instantiate the textual object with the URL of our Textual site (https://textual.tonic.ai), and the API Key we created for our Textual account:
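
One way this could look, as a sketch rather than the SDK’s definitive usage. The import path tonic_textual.api, the class name TonicTextual, and its (base_url, api_key) argument order are assumptions to check against the SDK version you installed. The environment variable name and both helper functions are hypothetical, and reading the key from the environment is simply a way to avoid pasting it into the notebook.

```python
import os

TEXTUAL_URL = "https://textual.tonic.ai"


def read_api_key(env_var="TONIC_TEXTUAL_API_KEY"):
    """Read the API key from an environment variable instead of hard-coding it."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set {env_var} to your Textual API key first.")
    return api_key


def make_textual_client(api_key, base_url=TEXTUAL_URL):
    """Build the SDK client. The SDK is imported lazily, only when this is called."""
    # Assumption: the client class is tonic_textual.api.TonicTextual and it
    # takes (base_url, api_key); verify against your installed SDK version.
    from tonic_textual.api import TonicTextual

    return TonicTextual(base_url, api_key)


# In the notebook:
#   textual = make_textual_client(read_api_key())
```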

The Tonic Textual SDK `redact` method

The Textual SDK function that we’ll use to protect our sensitive data is the textual.redact function. This function accepts text that should be protected and returns the results, including both the obfuscated text and the list of sensitive entities that were detected in the text. When obfuscating text, the default method of de-identification in the redact call is to replace the value with the entity type and an identifying hash—we call this behavior “redaction”. Here’s a quick example of the function in action, including the code used to call the function and its result:
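
A sketch of such a call, wrapped in a hypothetical helper named redact_text. It assumes the response object exposes redacted_text and de_identify_results, and that each detected entity carries label and text attributes. The sample sentence is made up for illustration, and no output is shown because it depends on the live service.

```python
def redact_text(textual, text):
    """Call the SDK's redact method and return the redacted text and entities."""
    # Assumption: the response exposes redacted_text and de_identify_results,
    # and each detected entity has label and text attributes.
    response = textual.redact(text)
    entities = [(entity.label, entity.text) for entity in response.de_identify_results]
    return response.redacted_text, entities


# In the notebook, with the textual client instantiated earlier:
#   redacted, entities = redact_text(textual, "My name is John and I live in Atlanta.")
#   print(redacted)   # sensitive values replaced by entity-type tokens
#   print(entities)   # (entity type, original value) pairs
```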

The Textual SDK also supports replacing redactions with realistic synthesized values to maintain the semantic structure and meaning of your data. We do this by specifying Synthesis mode for the entity types we wish to replace with synthetic data:
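
As a rough sketch, the per-entity setting can be built as a plain dictionary. This assumes the redact method accepts a generator_config argument mapping entity type names to a mode string, and that NAME_GIVEN and NAME_FAMILY are valid entity type names. The two helper functions are hypothetical.

```python
def synthesis_config(entity_types):
    """Map each listed entity type to Synthesis; unlisted types keep the default."""
    return {entity_type: "Synthesis" for entity_type in entity_types}


def synthesize_text(textual, text, entity_types):
    """Redact text, replacing the listed entity types with synthetic values."""
    # Assumption: redact accepts a generator_config dict of entity type -> mode.
    response = textual.redact(text, generator_config=synthesis_config(entity_types))
    return response.redacted_text


config = synthesis_config(["NAME_GIVEN", "NAME_FAMILY"])
print(config)
```

Entity types left out of the dictionary continue to use the default redaction behavior described earlier.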

Tonic Textual + PySpark

This function, when combined with PySpark, can be used to de-identify (redact or synthesize) columns within a Databricks table. We start from a dataframe of our target table, tech_demo_people. Before applying the function to that dataframe, we need to define a schema for the redact call’s return object, so that Databricks knows how to parse it.
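
A sketch of one way to describe that return object. It uses a DDL-formatted type string, which PySpark accepts as a UDF return type in place of StructType objects. The field list is an assumption about which attributes of the response are worth keeping, and the names ENTITY_FIELDS and redaction_schema_ddl are hypothetical.

```python
# Fields kept for each detected entity, as (name, Spark SQL type) pairs.
# Assumption: these mirror attributes of the entities returned by redact.
ENTITY_FIELDS = [
    ("start", "int"),
    ("end", "int"),
    ("label", "string"),
    ("text", "string"),
    ("score", "double"),
]


def redaction_schema_ddl():
    """Build a DDL type string describing the redact result as a struct."""
    # Field names are backtick-quoted so that names like `end` are never
    # mistaken for SQL keywords.
    entity_struct = "struct<" + ",".join(
        f"`{name}`:{dtype}" for name, dtype in ENTITY_FIELDS
    ) + ">"
    return (
        "struct<`redacted_text`:string,"
        f"`de_identify_results`:array<{entity_struct}>>"
    )


REDACTION_SCHEMA = redaction_schema_ddl()
print(REDACTION_SCHEMA)

# In the notebook, the target table is loaded with the built-in spark session:
#   people_df = spark.table("tech_demo_people")
```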

Viewing our de-identified results

Now that we’ve defined the schema and Databricks can process the result of the redact function, we need to create a user-defined function (UDF) that calls the redact function from PySpark.
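
A sketch of such a UDF, under a few assumptions: the response attributes are named as shown, the client object can be serialized and shipped to Spark workers, and return_type is a schema whose fields match the dictionary built here. The helper names response_to_row and make_redact_udf are hypothetical.

```python
def response_to_row(response):
    """Convert a redact response into plain dicts that Spark can map to a struct."""
    # Assumption: attribute names follow the SDK's response object.
    return {
        "redacted_text": response.redacted_text,
        "de_identify_results": [
            {
                "start": entity.start,
                "end": entity.end,
                "label": entity.label,
                "text": entity.text,
                "score": entity.score,
            }
            for entity in response.de_identify_results
        ],
    }


def make_redact_udf(textual, return_type):
    """Wrap textual.redact in a PySpark UDF with the given return type."""
    from pyspark.sql.functions import udf  # imported lazily; requires PySpark

    def redact_or_none(text):
        # Leave NULL cells as NULL instead of sending them to the service.
        if text is None:
            return None
        return response_to_row(textual.redact(text))

    # Assumption: the client can be serialized to Spark workers; if it cannot,
    # construct the client inside redact_or_none instead.
    return udf(redact_or_none, returnType=return_type)


# In the notebook:
#   redact_udf = make_redact_udf(textual, redaction_response_schema)
```

Returning plain dictionaries keeps the UDF’s output independent of how the SDK’s response class is implemented.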

When we wrap our textual.redact call in a PySpark UDF, it will be applied to our dataframe, resulting in a new dataframe with an additional column—the output of our redact call.
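
In sketch form, applying the UDF is a single withColumn call. The helper name add_redaction_column, the output column name redaction, and the free_text column name are all hypothetical; substitute the column that holds free text in your table.

```python
def add_redaction_column(df, redact_udf, text_column, output_column="redaction"):
    """Return a new dataframe with the UDF result attached as a struct column."""
    # df[text_column] selects the free-text column. withColumn leaves df
    # unchanged and returns a new dataframe that carries the extra column.
    return df.withColumn(output_column, redact_udf(df[text_column]))


# In the notebook (free_text is a hypothetical column name):
#   people_df = spark.table("tech_demo_people")
#   with_redaction_df = add_redaction_column(people_df, redact_udf, "free_text")
```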

We create a final dataframe object by parsing the intermediate dataframe for our desired values.
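
A sketch of that parsing step, assuming the UDF output lives in a struct column named redaction with fields redacted_text and de_identify_results. The helper name flatten_redaction is hypothetical. PySpark’s select accepts dotted paths into a struct, so no extra imports are needed.

```python
def flatten_redaction(df, struct_column="redaction"):
    """Promote the struct's fields to top-level columns and drop the struct."""
    return df.select(
        "*",
        f"{struct_column}.redacted_text",
        f"{struct_column}.de_identify_results",
    ).drop(struct_column)


# In the notebook (display is the Databricks notebook helper for rendering tables):
#   final_df = flatten_redaction(with_redaction_df)
#   display(final_df)
```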

Running this code will yield our results: our original table, supplemented by a column containing the redacted text and a column containing the array of de-identified entities present in the free text.

[Figure: the original table, supplemented by a column containing the redacted text and a column containing the array of de-identified entities]

The takeaway

In this post, we’ve worked through a simple example of how to de-identify your unstructured free-text data in a Databricks table using Tonic Textual. Detecting and de-identifying sensitive text is typically a thorny problem, and Tonic Textual simplifies it for text data in any form. By using Textual with your data in Databricks, you can unlock the value of your text data for sentiment analysis, data pipeline testing, training AI models, and more, all with the peace of mind that your sensitive data is protected from accidental leakage.

If you have sensitive text data in your Databricks environment, begin safeguarding it today with a free trial of Tonic Textual. Alternatively, sign up for a custom demo where we can show you the power of Tonic Textual + Databricks tailored to your use case.

For the complete Python notebook and source files, please refer to our repository on GitHub.
