Last week, we launched Tonic Textual, a sensitive data redaction platform that you can use to protect your unstructured free-text data. Textual identifies sensitive values contained in free-text files and makes it easy to redact those values or synthesize net new values to take their place. The platform enables you to create safe, shareable versions of your free-text files in which the sensitive values are fully redacted ("Michael" becomes "NAME") or are replaced with a similar value ("Michael" becomes "John").
With just a few clicks, you can embed Tonic Textual into your data and ML pipelines to produce realistic, de-identified text data that's safe to use for model training, LLMOps, and data pipeline development. With Textual, you can put your text data to work and practice responsible AI development while staying compliant with evolving data handling regulations.
In this post, we'll talk about how Textual uses trained models to identify the sensitive values in your files, and how you can create your own custom models to use in addition to Textual's collection of built-in models.
How Tonic Textual uses models
How does Tonic Textual look for sensitive values in your files? To find the values that you might want to redact or replace, Textual uses a variety of techniques, including regular expressions and trained models.
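To make the regular expression side of this concrete, here is a minimal sketch of pattern-based detection followed by either redaction or replacement. It is not Textual's implementation. The patterns, labels, and stand-in values are all assumptions made up for illustration, and regular expressions like these only suit values with a predictable shape. Values such as names depend on context, which is where trained models come in.

```python
import re

# Hypothetical patterns for illustration only; not Textual's actual rules.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

# Hypothetical stand-in values used when synthesizing instead of redacting.
REPLACEMENTS = {"EMAIL": "user@example.com", "PHONE": "555-010-0000"}


def find_spans(text):
    """Return (start, end, label) for every pattern match, in text order."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def transform(text, mode="redact"):
    """Replace each detected span with its label or with a stand-in value."""
    pieces = []
    cursor = 0
    for start, end, label in find_spans(text):
        if start < cursor:
            continue  # skip a span that overlaps one already handled
        pieces.append(text[cursor:start])
        pieces.append(label if mode == "redact" else REPLACEMENTS[label])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Reach Ana at ana@corp.io or 415-555-0199 today."
print(transform(sample, mode="redact"))
print(transform(sample, mode="synthesize"))
```

The same detected spans drive both modes, which mirrors the choice described earlier between fully redacting a value and replacing it with a similar one.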
A model recognizes a fixed set of named entities.
Each named entity represents a specific type of value, such as an identifier, name, or location.
A model entity starts with a sample set of typical values, along with templates that show how the values are used in context. For example, a model entity that identifies spoken languages pairs a list of language names with sentences in which a language name appears.
During the training process, Textual uses these values and templates to learn how to identify that type of value when it scans a file.
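One way to picture how values and templates could become training material is to fill every template with every value and record where the value lands in the resulting sentence. The sketch below does that. It is an assumption about the general technique, not a description of Textual's training process, and the entity definition, the `{value}` placeholder syntax, and the function name are hypothetical.

```python
from itertools import product

# Hypothetical entity definition, invented for illustration.
ENTITY = {
    "label": "LANGUAGE",
    "values": ["Spanish", "Mandarin", "Hindi"],
    "templates": [
        "She speaks {value} at home.",
        "The form is also available in {value}.",
    ],
}

PLACEHOLDER = "{value}"  # assumed placeholder syntax


def expand(entity):
    """Fill each template with each value and record the labeled span."""
    examples = []
    for template, value in product(entity["templates"], entity["values"]):
        start = template.index(PLACEHOLDER)  # position where the value begins
        text = template.replace(PLACEHOLDER, value, 1)
        examples.append(
            {
                "text": text,
                "start": start,
                "end": start + len(value),
                "label": entity["label"],
            }
        )
    return examples


for example in expand(ENTITY):
    span = example["text"][example["start"]:example["end"]]
    print(f'{example["label"]}: "{span}" in "{example["text"]}"')
```

Two templates and three values yield six labeled sentences, which is why even a short list of values and usage examples can give a model a useful amount of context to learn from.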
Why do you need custom models?
Tonic Textual comes with a set of built-in models that allow it to identify a wide range of value types, including names, ages, locations, and identifiers.
But what do you do when your files contain values that aren't covered by the built-in models? For example, your files might contain terms that are specific to your industry or profession. In the healthcare industry, files might contain names of conditions or diseases. Or you might assign a specific type of identifier to accounts and users.
To handle these other types of values, you can create custom models in Textual.
After you define a custom model, you can tell Textual to use that model in its analysis of the files in any dataset.
How do you create a custom model?
Let's quickly go over how you create a custom model in Tonic Textual. For more details, check out the Tonic Textual documentation.
In a new custom model, you first set up your named entities with example values. You then have Textual generate additional example values to see how well it understands what you're looking for.
The following image shows example values for an entity that contains names of diseases, with additional values generated by Textual.
Next, provide usage examples that show the entity values in context, with the entity values represented by placeholders. After you provide the first few usage examples, once again have Textual generate additional examples, to check that it correctly understands how an entity value might appear in the file text.
The following image shows some usage examples for the disease name entity, with additional examples generated by Textual.
After you save the model, Textual trains the model. The model is then ready to use in your datasets.
Adding custom models to a dataset
By default, Tonic Textual scans only for values from its built-in models. To have a dataset also use any or all of your custom models, you configure the dataset to include them. Check out our docs to learn how.
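The opt-in behavior can be sketched with a toy dataset object that always runs its built-in detectors and adds custom ones only when they are listed. Nothing here is Textual's API. The `Detector` and `Dataset` classes, the model names, and the patterns are hypothetical, and simple patterns stand in for what would be trained models in the product.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Detector:
    name: str
    pattern: re.Pattern
    label: str


# Stand-in for the built-in models: always active.
BUILT_IN = [
    Detector("built_in_email", re.compile(r"[\w.+-]+@[\w-]+\.\w+"), "EMAIL"),
]

# Stand-ins for custom models: available, but off until a dataset lists them.
CUSTOM = {
    "diseases": Detector(
        "diseases",
        re.compile(r"\b(?:asthma|diabetes|influenza)\b", re.IGNORECASE),
        "DISEASE",
    ),
    "account_ids": Detector(
        "account_ids", re.compile(r"\bACCT-\d{6}\b"), "ACCOUNT_ID"
    ),
}


@dataclass
class Dataset:
    files: List[str]
    custom_models: List[str] = field(default_factory=list)

    def detectors(self):
        """Built-in detectors plus any custom ones this dataset enables."""
        return BUILT_IN + [CUSTOM[name] for name in self.custom_models]

    def scan(self):
        """Return (label, value) pairs found across all files."""
        found = []
        for text in self.files:
            for detector in self.detectors():
                for match in detector.pattern.finditer(text):
                    found.append((detector.label, match.group()))
        return found


note = "Patient ACCT-004211 (jo@clinic.org) reports asthma."
print(Dataset([note]).scan())
print(Dataset([note], custom_models=["diseases", "account_ids"]).scan())
```

With the default configuration, only the email address is found. Listing the two custom models on the dataset adds the disease name and the account identifier to the results.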
Recap
To quickly recap, Tonic Textual uses trained models as one tool to find sensitive values. Each model is made up of one or more entities. Each entity represents a specific type of value, such as a name or identifier.
Textual comes with a set of built-in models that represent a range of value types. If your files contain other values that aren't included in the built-in models, then you can create custom models to also identify and redact or synthesize those values.
To create a custom model, you provide sample entity values and examples of how the values are used in context. Based on those examples, Tonic Textual generates additional examples to help you assess whether it understands the entity values and their usage. After you create and train a model, you can use the model in any of your Tonic Textual datasets.
To learn more about safely de-identifying sensitive data in free text, connect with our team or sign up for an account today.