Loved our recent blog on data protection strategies? Today, we’ll take a deeper dive into another powerful data management strategy called subsetting. What’s subsetting, you ask? It’s the art and science of creating a coherent slice of data across databases of different types for targeted use cases without breaking referential integrity.
Data subsetting is the technique of extracting a smaller, referentially intact set of data from a production database often for use in testing and non-production environments. With Tonic Structural's subsetting feature, you can create tailored databases for specific use cases while preserving referential integrity. This allows you to efficiently allocate just the right amount and type of data to each stakeholder, ensuring compliance and reducing costs.
In this article, we’ll discuss Tonic Structural’s patented approach to database subsetting—and how you can use it to make fit-for-purpose secure databases for all of your stakeholders.
Heads-up, we will be drawing inspiration from the hit Amazon TV show Rings of Power for example stakeholders. You’ve been warned.
The first step to successful subsetting is identifying the type of data you need to work with. So let’s talk through your different data needs:
This is your one database to rule them all (or many databases if you are a microservices shop). It has all of your most sensitive, most important data and needs to be accurate and performant at all times. Let’s call this database Sauron.
These are the databases that need large swaths of the data (de-identified, of course!). This is your staging database that gets the last run of tests before sending things to production; your QA database that tests things at scale. And perhaps you have 3 of these types of databases. So, let’s call these databases your Elves.
Now, you’re a successful ring maker and you’ve got all of these customers. And to support these customers, you hire 7 customer support engineers (call them dwarves). The customer support engineers are each working on a specific customer and need a de-identified database scoped just to the customer with the bug.
As a growing ring-maker, you’ve got a small but mighty team of 9 human engineers. Each of them needs a de-identified database that can fit on their laptop so they can quickly test out their features and identify bugs before sending their code to staging. (Not to worry, these databases won’t turn your engineers into Nazgul.)
Now, that we have the use cases, let’s take a pause to briefly explain what subsetting is.
Subsetting is a feature in Tonic that enables you to create a small, representative slice of your data while maintaining referential integrity. There are two ways in which you can subset that we’ll discuss below: custom where clauses or target percentages. Returning back to our above use cases, here is a walk through for how we might execute each of them within the Tonic subsetter.
This database is your secret sauce! There is no subsetting involved here. But keep this data protected at all costs and use it only for key customer use cases.
These databases need to have statistical and size representativeness. However, that does not mean you merely want a de-identified production database. You can subset these down a little bit to reduce storage costs. For example, if you have a 150 GB production database, let’s aim to have these databases at 50 GB each, cutting your storage costs substantially. To achieve this, we might aim the target percentage of your primary table (e.g. customers) to 35%. The subsetting tool will then traverse all the appropriate relationships to ensure referential integrity is complete.
For these databases, you have two goals: make it as easy as possible to reproduce the bug while keeping storage costs to a minimum. In this case, we’d recommend using where clauses along the lines of:
For these databases, size is the primary concern. You want your developers to be able to use these representative databases on their laptops and quickly catch bugs before sending their code to the Elven databases. Specifically, maybe you want devs to run their local unit tests in less than one minute. The exact size you need will vary based on your unit tests, but for the sake of our example let’s assume that means you need your database to be less than 10 GB. So, here, let’s choose a target percentage. Given our 150 GB starting size, let’s set the target percentage on your primary table to 5%.
Now, you have all of your databases fit for purpose for their needs! All of your users have all the data they need and nothing more: Sauron is used for your most important production use cases, Elves are used for your last line of testing, Dwarves are used for debugging customer issues and your Human databases are used for developers' local testing.
With fit for purpose subsetting configurations, each stakeholder can get data at the frequency they require—and better yet, you schedule the ones that make sense so data is always ready. Additionally, you can leverage a single set of de-identification policies to protect all of them, or customize if security access levels are different (this is made possible with workspace inheritance!).
A very relevant question, we’re glad you asked. Subsetting improves your software development in the following ways:
These are just a few examples of reasons to implement subsetting!
In conclusion, there’s one ring to rule them all when it comes to smaller and more efficient databases—and it’s called subsetting with Tonic.
Want to learn more about how subsetting can give your team forbidden power over all the lands and peoples of your database? Check out eBay’s comprehensive case study on how Tonic transformed their database—or book a demo today to take our platform for a spin yourself.