The importance of high quality synthesis when creating safe training datasets

Adam Kamor, PhD
December 2, 2024

Tonic Textual is a fully featured platform and SDK for preparing sensitive data for model training. Its primary feature is its ability to identify sensitive elements in text, such as healthcare data and personal user information, so that they do not make their way into your model weights. We spend a lot of time and effort ensuring that our proprietary NER models do their best to identify all of your sensitive information.

An often under-appreciated aspect of the data redaction process is what happens after sensitive data is detected. Obviously, it cannot simply be removed from the original text, as that would destroy the utility of the data. In other words, black-box redaction will not work if your use case is model training or AI implementation. Typically, the step after detecting sensitive data is to either tokenize or synthesize it, depending on the intended use of the de-identified data. Before we continue, let me quickly define both operations as they’re performed in Textual and provide some examples.

What is data tokenization in free-text data?

When we tokenize your free-text data, we convert the original sensitive values found in that data to alphanumeric tokens. The tokens themselves are unique and can be configured such that a given sensitive value will always yield the same token. As an example, we might tokenize the following sentence as follows:

Hi, my name is Adam → Hi, my name is [NAME_GIVEN_ssYs5]

The token, ssYs5, represents ‘Adam’ and is prefixed with the entity type, in this case NAME_GIVEN. The entire result is always enclosed in square brackets.

In richer text, there are often multiple people mentioned, and in such cases our uniqueness guarantee ensures that each individual is always mapped to their own unique token value.
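To make this concrete, here is a minimal sketch of how consistent tokenization could work. This is illustrative only, not Textual's implementation; the hash-based scheme and 5-character token length are assumptions for the example, and a production system would use a keyed, reversible mapping.

```python
# Minimal sketch of consistent tokenization (illustrative only, not Textual's implementation).
# The same sensitive value always yields the same alphanumeric token,
# prefixed with its entity type and wrapped in square brackets.
import hashlib
import string

ALPHABET = string.ascii_letters + string.digits

def make_token(value: str, entity_type: str, length: int = 5) -> str:
    # Hash the value so repeated occurrences map to the same token.
    # A real system would use a keyed hash or lookup table to prevent
    # dictionary attacks and to guarantee uniqueness across values.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    token = "".join(ALPHABET[b % len(ALPHABET)] for b in digest[:length])
    return f"[{entity_type}_{token}]"

print(make_token("Adam", "NAME_GIVEN"))   # same token every time, e.g. [NAME_GIVEN_xxxxx]
print(make_token("Maria", "NAME_GIVEN"))  # a different person gets a different token
```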

What is data synthesis in free-text data?

When we synthesize your free-text data we convert the original sensitive values found in that data to realistic fake values. The goal is for the synthesized text to be indistinguishable from the original text in terms of realism.

As an example, we might synthesize a sentence as follows:

Hi, my name is Adam → Hi, my name is John

With tokenization, you are guaranteed uniqueness, meaning each value gets its own unique token. With synthesis, you are given a slightly weaker guarantee called consistency. Consistency guarantees that a given value will always be mapped to the same synthetic value. So in the above example, any instance of ‘Adam’ would always be mapped to ‘John’.
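A minimal sketch of the consistency guarantee, assuming a small hypothetical pool of replacement names (again, this is illustrative and not Textual's actual implementation):

```python
# Consistency sketch (illustrative only): a given original value is always
# mapped to the same realistic fake value.
import hashlib

FAKE_GIVEN_NAMES = ["John", "Maria", "Wei", "Fatima", "Carlos", "Aisha"]  # hypothetical pool

def consistent_synthesize(original: str) -> str:
    # Hash the original value to pick a stable index into the pool,
    # so "Adam" maps to the same fake name in every document.
    digest = hashlib.sha256(original.encode("utf-8")).hexdigest()
    return FAKE_GIVEN_NAMES[int(digest, 16) % len(FAKE_GIVEN_NAMES)]

print(consistent_synthesize("Adam"))  # e.g. "John"
print(consistent_synthesize("Adam"))  # always the same replacement
```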

Data synthesis vs. data tokenization

Data synthesis and data tokenization are both effective methods for protecting sensitive data. The decision to use one or the other is dependent upon your use case for the protected data and the data utility needs specific to that use case.

One of Textual’s primary use cases is model development on top of very sensitive data. Our customers typically come from highly regulated industries like healthcare, finance, government, and education, and we’ve found that they generally prefer to use synthesis instead of tokenization. The primary reasons are:

  1. Better privacy
  2. Higher utility

When dealing with the de-identification of sensitive data, you have to strike the correct balance between the privacy of the data and its utility, and typically these two things are inversely correlated: if you want more privacy, you typically get less utility.

That tradeoff holds most of the time, but not when de-identifying unstructured text. Synthesized text is both more realistic and more private than tokenized text.

Why? Because NER models are not perfect; they will sometimes miss sensitive data. Imagine a piece of text where you correctly identify and tokenize 99 names but miss one. The output text will contain tokens in place of every name except the one that was missed, and it’s clear to any observer that the missed name is likely real, because it’s the only one not tokenized.

When you synthesize, on the other hand, all of the names look real and an observer does not know if a given name is real or fake. To put it bluntly, you have plausible deniability.

With regard to utility, because synthetic text contains real-looking entities rather than tokens, it looks more like the text the LLM was pre-trained on. This allows for better transfer of knowledge from pre-training to the fine-tuning step.

Data synthesis quality

The quality of data synthesis has major impacts on both the privacy of the data (non-synthesized values that are missed by the model stick out) and on the realism of the training data and hence the model itself.

Let’s look at a few examples of how Textual synthesizes data to understand what a high quality data synthesis function should look like.

Date synthesis in free-text data

Dates are an especially difficult entity to synthesize. When Textual synthesizes dates, it is common to perform what we call a ‘timestamp shift’ operation: we take the date and shift it randomly by some number of days, either forwards or backwards. This is a nice technique in that it helps preserve data utility, is configurable in terms of the bounds of the shift, and can also be made consistent, so that a given date is always shifted by the same amount.
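As a rough sketch of the idea (not Textual's implementation), a consistent shift could be derived from the date itself plus a secret seed, so that the same date always moves by the same bounded number of days. The seed value and 30-day bound below are assumptions for the example.

```python
# Consistent timestamp-shift sketch (illustrative only, not Textual's implementation).
import hashlib
from datetime import date, timedelta

SECRET_SEED = "replace-with-a-secret"  # assumed; keeps the shift from being guessable

def shift_date(d: date, max_days: int = 30) -> date:
    # Derive a stable offset in [-max_days, +max_days] from the date and seed,
    # so the same input date is always shifted by the same amount.
    digest = hashlib.sha256(f"{SECRET_SEED}:{d.isoformat()}".encode("utf-8")).hexdigest()
    offset = int(digest, 16) % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)

print(shift_date(date(2024, 9, 5)))  # same shifted date on every run
```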

To properly shift the date, we have to be able to understand what point in time the date represents. This is easy to accomplish for well-formatted dates that are parse-able by common date format strings. But in many unstructured datasets you’ll find that dates are typically expressed in much more complex utterances.

As an example, imagine a transcript of a customer support phone call where someone is trying to set a date. There are a myriad of ways one might express a given date on the phone and most are not going to be found with a pre-canned list of date format strings.

To solve this problem, we developed our own transformer language model that's been fine-tuned to synthesize natural-language date times found in text. The model takes unformatted date times and translates them into a machine-parseable format, which can then be shifted by days or time for synthesis. It then translates the formatted date time back into a format similar to the original.

For instance, in the sentence "I have an appointment in September on the 5th, 2024", the date time "September on the 5th, 2024" is recognized as September 5, 2024 and synthesized to "September 10, 2024".
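Sketched end to end, the pipeline looks roughly like the following. This is a simplification: dateutil's parser stands in for Textual's fine-tuned normalization model and only copes with reasonably well-formed utterances, and the fixed 5-day shift stands in for the configurable, consistent shift described above.

```python
# Simplified date-synthesis pipeline sketch (illustrative only).
# 1. normalize the utterance to a machine-parseable date
# 2. shift it by some number of days
# 3. render it back in a similar style
from datetime import timedelta
from dateutil import parser  # third-party package: python-dateutil

def synthesize_date_mention(utterance: str, shift_days: int = 5) -> str:
    parsed = parser.parse(utterance)               # stand-in for the normalization model
    shifted = parsed + timedelta(days=shift_days)  # stand-in for the consistent shift
    return shifted.strftime("%B %d, %Y")           # render back as natural text

print(synthesize_date_mention("September 5th, 2024"))  # "September 10, 2024"
```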

A flow diagram showing how Textual synthesizes dates

Location synthesis in free-text data

Locations are another tricky entity to synthesize. Textual supports a variety of location types but for today let’s just focus on city, state, and zip code entities.

Synthesizing any of these entities in isolation is straightforward. For example, if Textual finds a string representing a US State it can easily replace it with another US state.

The tricky part is when you find combinations of 2 or more locations all in close proximity to each other in the text. For example, imagine the following sentence, taken from a call transcript:

Agent: Can I please have your address?

Customer: Sure, it’s 348 Harrison Street Greenville, sorry one second someone is at the door

Agent: Ok

Customer: Ok, I’m back. Greenville Georgia 30222.

To generate a high quality synthetic equivalent here, we need to ensure that the three entities Greenville, Georgia, and 30222 are all recognized as referencing the same location. This is accomplished by grouping all location entities within some fixed distance of each other that are likely referencing the same location.
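A minimal sketch of that grouping step, assuming each detected entity comes with character offsets and that a fixed character window (60 characters here, an arbitrary choice) is a reasonable proxy for "close proximity":

```python
# Location-grouping sketch (illustrative only, not Textual's implementation).
# Entities are (start, end, entity_type, text) tuples with character offsets.
def group_locations(entities, max_gap=60):
    groups, current = [], []
    for ent in sorted(entities, key=lambda e: e[0]):
        # Start a new group when the gap to the previous entity is too large.
        if current and ent[0] - current[-1][1] > max_gap:
            groups.append(current)
            current = []
        current.append(ent)
    if current:
        groups.append(current)
    return groups

entities = [
    (30, 40, "LOCATION_CITY", "Greenville"),
    (41, 48, "LOCATION_STATE", "Georgia"),
    (49, 54, "LOCATION_ZIP", "30222"),
]
print(group_locations(entities))  # city, state, and zip end up in the same group
```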

Once the location entities are grouped, we still need logic to create a realistic combination of city, state, and zip values. The logic used to accomplish this is complex, but it guarantees that Textual always generates valid city, state, and zip combinations.

The logic is inspired by HIPAA Safe Harbor guidelines but works well for other industries as well; a simplified sketch appears after the list.

  • US states are left unchanged. Under privacy frameworks like HIPAA, a US state is not considered sensitive because each state contains enough people that knowing the original state is not a major privacy risk.
  • A US zip code is considered safe when its last 2 digits are truncated (except for a small number of low-population zip codes designated by the US Census). As a result, we select a new zip code that shares the first 3 digits with the original, with the last 2 digits chosen at random from a pool of valid zip codes.
  • A new city is determined that has geographical overlap with the newly selected zip code.

Note: When we find a low-population zip code, we select an entirely new zip code at random from within the state. Then a city is selected which has overlap with the zip code.
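Here is a rough sketch of the zip and city selection described above. The ZIP_TO_CITY table is hypothetical sample data; a real implementation would draw from full US Census data and handle the low-population exceptions.

```python
# Zip/city synthesis sketch (illustrative only, not Textual's implementation).
import random

ZIP_TO_CITY = {  # hypothetical sample data; a real system would use census data
    "30222": ("Greenville", "Georgia"),
    "30204": ("Barnesville", "Georgia"),
    "30217": ("Franklin", "Georgia"),
}

def synthesize_zip_and_city(original_zip: str) -> tuple[str, str, str]:
    prefix = original_zip[:3]                                     # keep the first 3 digits
    candidates = [z for z in ZIP_TO_CITY if z.startswith(prefix)]
    new_zip = random.choice(candidates)                           # random valid zip sharing the prefix
    city, state = ZIP_TO_CITY[new_zip]                            # a city that overlaps the new zip
    return city, state, new_zip                                   # the state itself is left unchanged

print(synthesize_zip_and_city("30222"))  # e.g. ('Barnesville', 'Georgia', '30204')
```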

Name synthesis in free-text data

Names are easier to synthesize than dates and locations but are very important for two reasons. First, they are the most common entity we encounter, and second, names are typically highly sensitive. Failing to properly identify and synthesize a name can make re-identification easier than missing a single date time or location would.

When we synthesize names we take a few things into consideration:

  1. Gender: The presumed gender of the name, e.g. Jon is almost certainly a man’s name and should be replaced with another name that is likely associated with a man. Generally speaking, gender is an important aspect to preserve in model training data, especially in the healthcare space, to ensure realism in the end results.
  2. Capitalization: Most of the text Textual works with is not well formatted. If names are mostly lower-case we don’t want to synthesize capitalized names as they will stick out as clearly being synthesized.
  3. Consistency: It is common for a given name to appear multiple times in a document, often alongside other names. Additionally, a given name can appear across multiple documents, as well. Consistency ensures that a given name always goes to the same fake output name.

In short, Tonic Textual attempts to preserve the presumed gender of the name, will preserve the capitalization of a name (all lowercase, all uppercase, only the first letter capitalized, etc.), and finally will ensure that a given name always synthesizes to the same output.
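As a final illustration, here is a minimal sketch that combines all three considerations. The name pools and the gender label passed in are assumptions for the example; Textual's actual models infer gender and draw from far larger name lists.

```python
# Name synthesis sketch (illustrative only, not Textual's implementation).
import hashlib

NAME_POOLS = {  # hypothetical pools; a real system uses far larger lists
    "male": ["John", "Carlos", "Wei"],
    "female": ["Maria", "Aisha", "Elena"],
}

def synthesize_name(original: str, presumed_gender: str) -> str:
    pool = NAME_POOLS[presumed_gender]
    # Consistency: hash the lowercased name so every casing of "Jon" maps
    # to the same replacement, in this and every other document.
    digest = hashlib.sha256(original.lower().encode("utf-8")).hexdigest()
    fake = pool[int(digest, 16) % len(pool)]
    # Capitalization: mirror the original so the output doesn't stick out.
    if original.isupper():
        return fake.upper()
    if original.islower():
        return fake.lower()
    return fake.capitalize()

print(synthesize_name("jon", "male"))  # lowercase in, lowercase out
print(synthesize_name("Jon", "male"))  # same underlying replacement, capitalized
```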

Conclusion

High quality NER models are half the battle when it comes to creating safe, high quality training data. Once you’ve identified your sensitive entities, you still need to replace them in the training dataset without breaking the dataset’s utility.

We find that most customers prefer to synthesize their sensitive entities rather than tokenize them, due to benefits like better data quality and stronger privacy. The quality of the synthetic data produced is paramount, as low quality synthetic data hurts both model training and privacy.

To explore Tonic Textual’s capabilities and start reaping its benefits today, connect with our team for a demo or sign up for a free trial.

Adam Kamor, PhD
Co-Founder & Head of Engineering
