As product designers at Tonic.ai, we’re always looking for ways to improve how our customers de-identify data in our test data management platform Tonic Structural. Recently, we released a new feature called “Custom Sensitivity Rules” that helps customers automate the identification of unique sensitive data—information that is proprietary and sensitive for them, though not a standard example of PII or PHI.
Since Tonic.ai’s earliest days, our platform has offered sensitive data detection to streamline data de-identification. We’ll look at how Custom Sensitivity Rules expand that existing capability to further strengthen the data protection offered by Tonic Structural.
Evaluating the existing de-identification workflow
Over the last year, we’ve fine-tuned our existing Sensitivity Scan to remove false positives and improve detection rates. Given the extensive variety of our customers and their data types, it’s nearly impossible for any sensitive data detection tool to catch everything. Out of the box, Tonic Structural’s Sensitivity Scan automatically detects and flags standard, ubiquitous types of sensitive and private data, including personally identifiable information (PII) and protected health information (PHI) as defined by HIPAA’s Safe Harbor method. In Tonic lingo, we call these Sensitivity Types. They include names, birth dates, addresses, social security numbers, and geolocation data. The Scan also detects financial data like IBAN and SWIFT codes, and ICD-9 and ICD-10 medical classification codes for customers in the healthcare sector.
Once Structural identifies these Sensitivity Types, it also automatically recommends generators to safely de-identify their data, and these generators can all be applied at the click of a button. The Sensitivity Scan combined with generator recommendations gets our customers 80–90% of the way through the de-identification process. But when less common sensitive data types are present, customers have needed to manually review the remaining columns and mask any customer-specific data that the scan did not pick up. So we set out to design a feature that would help automate that last leg of the de-identification process.
What are Custom Sensitivity Rules?
Just like Structural’s OOTB Sensitivity Scan, Custom Sensitivity Rules filter data based on a database column/field name and data type. The key difference is that these rules are defined by Structural’s end users. For example, your organization might deem the products a user purchases to be sensitive. You can define a rule that flags any column named Product_name as sensitive and specifies which Generator should be recommended when it is detected.
The rules run as part of a Sensitivity Scan, so they’re triggered whenever a database migration occurs, a new workspace is created, or a scan is initiated. If columns are found that match a rule, Structural tags the results with a Sensitivity Type that matches the rule’s name and shows the rule’s Generator Preset as the recommendation. This means that these unique columns get appropriately tagged as At Risk, and the recommendation can be applied through bulk actions along with the rest of the recommended generators.
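To make the matching behavior concrete, here is a minimal sketch of how a rule like this could be evaluated during a scan. All of the names here (`SensitivityRule`, `Column`, and so on) are hypothetical and purely illustrative — they are not Tonic Structural’s actual API or internals — but the logic mirrors the description above: a rule matches on data type plus column-name matchers, and a match is tagged with the rule’s name as the Sensitivity Type along with its Generator Preset.

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    data_type: str  # a generic type, e.g. "Text" or "Integer"

@dataclass
class SensitivityRule:
    name: str                              # becomes the Sensitivity Type on a match
    data_type: str                         # generic data type the rule filters on
    name_substrings: list = field(default_factory=list)  # case-sensitive matchers
    generator_preset: str = ""             # recommended Generator Preset on a match

    def matches(self, column: Column) -> bool:
        # A column matches when its data type equals the rule's type and
        # any matcher substring appears in the column name (case-sensitive).
        return column.data_type == self.data_type and any(
            s in column.name for s in self.name_substrings
        )

def scan(columns, rules):
    """Tag each matching column as At Risk with the rule's name and preset."""
    results = []
    for col in columns:
        for rule in rules:
            if rule.matches(col):
                results.append((col.name, rule.name, rule.generator_preset))
    return results
```

In this sketch, the scan output pairs each flagged column with the Sensitivity Type and the recommended preset, which is what lets the recommendation be applied in bulk alongside the built-in ones.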
Why use Generator Presets?
Tonic Structural comes with 40+ built-in presets for generating data. We also offer Custom Presets that allow customers to re-use custom generator configurations. Tying presets into Custom Rules is a natural fit, as Custom Rules are meant to find unique data that, in most instances, needs to follow a unique format when being generated in order to ensure higher data realism in your masked test database.
Do rules work across different databases?
Sometimes customers have the same data in different databases. For example, you might store transactional data in a PostgreSQL database and then duplicate it into Spark for analytics. Let’s say you have a workspace for your development team that de-identifies PostgreSQL and another workspace for your data science team that de-identifies Spark.
We wanted to offer you the ability to make rules that you can set and forget. Once a rule is created it will automatically apply to every team’s workspace. Of course, each end-user can decide whether or not to apply the recommended generator in their workspace. The important functionality here is that all end-users are able to catch sensitive data in the same way and know the recommended generator across the organization.
The main challenge with making Custom Rules a global setting was handling different database connectors. We implemented a generic data type filter, so a rule that filters for the Text data type scans text and varchar columns in PostgreSQL and StringType and VarcharType in Spark. This means you don’t have to know the exact column data types when creating a rule, and it ensures that all workspaces can achieve the same security posture. Once a rule is created, every workspace will identify the same data as At Risk.
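The generic type filter described above can be sketched as a simple mapping from an abstract type to each connector’s native types. This is an illustrative sketch under our own assumptions — the connector names and type strings below come from the example in the text, not from Tonic Structural’s actual implementation:

```python
# Hypothetical mapping: one generic data type fans out to the
# connector-specific column types it should cover.
GENERIC_TYPE_MAP = {
    "Text": {
        "postgresql": {"text", "varchar"},
        "spark": {"StringType", "VarcharType"},
    },
    "Integer": {
        "postgresql": {"integer", "bigint"},
        "spark": {"IntegerType", "LongType"},
    },
}

def type_matches(generic_type: str, connector: str, native_type: str) -> bool:
    """True when a connector's native column type falls under the generic type."""
    return native_type in GENERIC_TYPE_MAP.get(generic_type, {}).get(connector, set())
```

With a lookup like this, a single rule filtering on Text can flag `varchar` columns in PostgreSQL and `StringType` columns in Spark alike, which is what keeps every workspace’s security posture consistent.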
Adding a preview panel
One thing we noticed with our initial release was that it was hard to get a sense of how a rule would perform. To dial in a rule, you’d have to go to a workspace and run a Sensitivity Scan, look at the results, and then go back to edit the rule, which felt cumbersome. Our latest release includes a preview panel, providing a visual feedback step, where you can test the rule against a selected workspace and fine tune it in real time. This greatly reduces the risk of creating a rule that is too stringent or too lax.
An example of the new workflow
Let’s walk through the process of identifying columns that were not flagged as At Risk by Structural’s OOTB Sensitivity Scan, creating a Sensitivity Rule, and then applying the recommended generator to mask these columns:
In our demo database, we have a handful of columns named “Number_of_Children”. Because these columns are associated with a customer, we want to de-identify them so that the information can’t be used to reverse engineer the user’s identity.
First, we create the rule, giving it a name and an optional description. The name will be used as the Sensitivity Type, so it’s best to use something concise and descriptive. In this case, we’ll use “Number of Children”.
Next, we’ll set up our filters, selecting Integer as the data type, and create a matcher that searches for column names containing the string “Number_of_Children” (case-sensitive). This rule is fairly straightforward, but if we knew there were columns with slightly different names, we could add more matchers. In this case, we’ll add a second string matcher that checks for “number_of_children” in case any columns are lowercase.
We’ll then create a new Generator Preset for the rule using Random Integer as our Generator Type, and set the generator options so that it returns a value between 0 and 5 for the number of children.
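As a rough sketch of the masking behavior this preset configures — assuming, hypothetically, that Random Integer draws uniformly from an inclusive range — the replacement value could be produced like this:

```python
import random

def random_integer_generator(low: int = 0, high: int = 5) -> int:
    """Replace a real value with a random integer in [low, high], inclusive.

    An illustrative stand-in for the Random Integer Generator Preset
    configured above; the actual preset's options and distribution
    are defined in Tonic Structural, not here.
    """
    return random.randint(low, high)

# Every masked "Number_of_Children" value lands between 0 and 5.
masked_value = random_integer_generator()
```

The point of wrapping this in a preset is reuse: every workspace that matches the rule gets the same 0–5 replacement behavior without reconfiguring the generator.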
We can then sanity check the rule by testing it against our example workspace before saving it.
Once the rule is saved, we can go into our example workspace and run a Sensitivity Scan. From here, we can review the Generator Recommendations on the Privacy Hub and see that our rule is returning a new Sensitivity Type, “Number of Children”, along with its recommended generator. Applying this recommendation will apply our Custom Preset.
What’s next for Custom Sensitivity Rules?
Rule creation is admittedly a very top-down approach to data masking: you have to know what data needs to be masked in order to create a rule for it. But our customers often spot good candidates for Sensitivity Rules when they are combing through columns in the Database View. We’re currently working on designs for rule creation from this perspective.
Overall we’re excited about the new possibilities this feature affords and hope you all are too. To learn more, connect with our team, sign up for a free trial, or visit our product docs. And once you’ve taken it for a spin, let us know what you think and what you might want us to build next.