New regulations around data privacy and an increasing awareness of the importance of protecting sensitive data are pushing companies to lock down access to their production data. Restricting access to high-quality data with which to build and test leads to a variety of issues, including making it more difficult to find bugs. In this article we’ll look at a variety of ways to populate your dev/staging environments with high-quality synthetic data that is similar to your production data.
TL;DR: You can generate events using an exponential distribution, but modeling a real-world event sequence is usually more complicated than that. Here is an approach using a sequence of exponential distributions that are fit to the data using an adaptive process. And we have an implementation in Python here.
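To make the baseline concrete, here is a minimal sketch of the simple case only: events whose gaps are drawn from a single exponential distribution with a fixed rate. It is not the adaptive approach or the implementation mentioned above; the function name `generate_event_times` and its parameters are assumptions made for illustration.

```python
import random


def generate_event_times(rate, duration, seed=None):
    """Return increasing event times in [0, duration).

    Gaps between consecutive events are drawn from an exponential
    distribution with the given rate (events per unit time).
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate)  # time of the first event
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)  # add the next exponential gap
    return times


# Roughly 2 events per second over one minute.
events = generate_event_times(rate=2.0, duration=60.0, seed=42)
print(len(events), "events generated")
```

A single fixed rate cannot reproduce bursts or quiet periods, which is why real event sequences usually need more than one exponential distribution.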
At Tonic we develop tools for data privacy, and an important tool for keeping data private is synthetic data. In this post we’ll talk about what we learned developing an algorithm for creating realistic event data.
A few days ago we made the Tonic Document Masker public, and it’s been fun watching people react to it. Let’s explain a little bit about how it works.
The Tonic Document Masker allows you to find and replace PII in any document. All you need to do is paste a piece of text containing PII and Tonic will parse the text, find PII, and then replace PII with random text, random names, or other types of context specific ‘fake’ text.
A lot goes into building meaningful synthetic datasets, and we’ll be using this blog as a medium to explore the many topics. One piece of the puzzle is the ability to identify various types of data and then generate random data specific to each type. Today, we’re releasing our API for generating random things to help others in their endeavors to create synthetic data.
Currently the API supports generating addresses, names, phone numbers, MAC and IP addresses, and Social Security numbers.
Take it for a spin at https://randomthings.
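As a rough illustration of type-specific random data, here is a small sketch that produces random IPv4 and MAC addresses with Python's standard library. It does not call or reproduce the API announced above; `random_ipv4` and `random_mac` are hypothetical helpers.

```python
import ipaddress
import random


def random_ipv4(rng):
    """Return a random IPv4 address in dotted-quad form (hypothetical helper)."""
    return str(ipaddress.IPv4Address(rng.getrandbits(32)))


def random_mac(rng):
    """Return a random MAC address as six colon-separated hex octets."""
    return ":".join(f"{rng.getrandbits(8):02x}" for _ in range(6))


rng = random.Random(7)  # seeded so the output is repeatable
ip = random_ipv4(rng)
mac = random_mac(rng)
print(ip, mac)
```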
While GDPR gets the lion’s share of the coverage, California recently passed an extremely powerful, far-reaching law, the California Consumer Privacy Act (CCPA), that will likely drive even more change than the GDPR. Technically, it’s only a California law, but it’s expected to have a much broader impact because it’s one of the first such laws in the US, and it has a very broad definition of sensitive data.
Tonic is a data company. We are building a platform to make it simple to create synthetic data that can be used in lieu of data that contains PII (or PHI). As part of our efforts, we often find it necessary to subset data. Subsetting data is the process of taking a representative sample of your data in a manner that preserves the integrity of your database, e.g., give me 5% of my users.
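The "5% of my users" example can be sketched as follows, assuming a toy schema of two in-memory tables in which each order points to a user through `user_id`. The table names, the column names, and the `subset_users` function are all hypothetical; a real database has many more relationships to follow.

```python
import random


def subset_users(users, orders, fraction, seed=None):
    """Sample a fraction of users and keep only the orders that reference them.

    Filtering the child table by the sampled parent ids is what keeps
    the subset free of dangling foreign keys.
    """
    rng = random.Random(seed)
    k = max(1, round(len(users) * fraction))
    kept_users = rng.sample(users, k)
    kept_ids = {u["id"] for u in kept_users}
    kept_orders = [o for o in orders if o["user_id"] in kept_ids]
    return kept_users, kept_orders


# Toy data: 100 users, each with exactly 3 orders.
users = [{"id": i, "name": f"user{i}"} for i in range(100)]
orders = [{"id": n, "user_id": n % 100} for n in range(300)]

sub_users, sub_orders = subset_users(users, orders, fraction=0.05, seed=0)
print(len(sub_users), "users,", len(sub_orders), "orders")
```

Sampling the parent table first and then filtering the child table is the simplest direction to go; sampling orders first would instead require pulling in every user those orders reference.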
Introducing Tonic.ai: Intelligent Synthetic Data Generation
How it all began
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness. It was 2am, and several business development engineers were sitting on-site in an otherwise empty building trying to debug some failing code. They had a large, brilliant development team in Palo Alto eager to help them, but they had no way to send the developers the data that was causing all the problems.