How to create database subsets for Ephemeral environments

Author

Travis Matthews

April 18, 2024

We recently discussed how subsetting and Tonic Ephemeral are a powerful combination.

Tonic Structural’s patented approach to subsetting allows you to shrink the size of your database while developing and testing new features, isolating only the entries that you wish to keep while maintaining the referential integrity of the data—aka, maintaining the very relationships that make data useful. This powerful approach to crafting targeted test data offers many advantages. Data subsets are also ideal for use in fast-tracking the creation of ephemeral test databases with Tonic Ephemeral.

Here’s how you can use the combination of Structural’s subsetting and Ephemeral’s database technologies to streamline and accelerate your developer velocity.

Initial setup

When using Tonic Structural, the first step in generating synthetic data is to create and configure a workspace. Log into Structural and navigate to the workspace settings page. Select Create Workspace, then Create a New Workspace. Fill out the Source Settings section with the details of your production database (or you can use the sample database provided in Structural), giving Structural the ability to connect to your source data.

In the Destination Settings section, select the option to output Structural’s data generation to an Ephemeral Database. Selecting the Ephemeral Database option allows Structural and Ephemeral to coordinate the creation and snapshotting of a test database.

A screenshot of the Structural UI, showing where to select "Ephemeral Database" in the Destination Settings.

Under advanced settings, you can further specify the behavior of your data generation. After a data generation is completed, your ephemeral database has a snapshot taken (persisting the database for future use), and the running database is deactivated. You can also opt to keep this database active if you wish to immediately use it. In addition, you can customize the name of your snapshot, as well as increase the storage size of your database’s underlying storage volume (if you foresee adding additional data into the database at a future point). Once satisfied with your workspace settings, click Save Workspace to persist your configuration options.

Subsetting your data

Tonic Structural’s subsetter allows users to generate de-identified data from subsets of production data. It can also be used solely for the purpose of shrinking a database, subsetting the data without performing any de-identification along the way. While we’ve discussed subsetting previously in other blog posts (see here and here), today this capability will be used for the purposes of shrinking: to isolate a desired amount of production data, while still preserving the relationships contained in the data. This will ensure that the containers of synthetic data will be small enough to be performant and easy to manage, but complex enough to capture the data needed to fully reproduce issues, or demonstrate the capabilities of the applications this data will empower.

There are several ways to begin identifying what data should be included in a subset; ultimately this comes down to what purpose this subset should serve, and what data would serve that purpose. For the purposes of this how-to, we’ll use Structural’s sample database so you can follow along the workflow required to shrink our database size while keeping a representative sample of our data for use as a staging environment for software developers. To begin, navigate to the Subsetting tab, and then select the Graph View to view your tables and the relationships between them.

A screenshot of the Structural UI, showing the subsetter's Graph View

Inspecting this, it’s clear that the retail_sales table is both the most populous table, and also one whose relationships are at the heart of the data model. Both these facts make it a good candidate to become the subsetter’s target table. Select this table as a target table, and set the subsetter to shrink the data down to 5% of the rows at random.

A screenshot of the Structural UI, showing how to configure subsetting on the target table.

At other times (when reproducing a bug for example), you may choose a different row selection criteria—instead using a where clause and selecting only those retail_sales entries which are associated with the bug under investigation. More tips for configuring the subsetter can be found in our docs. After this subsetting is configured, you’ll see which tables will be included in your subsetting output.

As an additional step, you may wish to de-identify some or all of the data in your source database. Tonic offers powerful and flexible tools to both detect sensitive data and de-identify it while preserving the relationships contained within your data. For more information, our docs contain several sections on how sensitive data may be detected, and ways it may be de-identified.

Once these steps are finished, you’re ready to create your test database and snapshot—in the upper-right corner, click the Generate Data button. You can follow the progress of your data generation on the Jobs page.

A screenshot of the Structural UI, showing the data generation job details.

‍

Streamline data provisioning to turbocharge developer productivity.

Stop losing time provisioning and maintaining databases yourself. Spin up fully populated test databases in seconds with Tonic Ephemeral.

Book a demo

Spinning up databases in Tonic Ephemeral

Once your data generation has finished, you can view the results in Tonic Ephemeral. Navigating to the Snapshots page in Ephemeral, you can see the new snapshot that your data generation has just produced.

A screenshot of the Ephemeral UI, showing the available Snapshots.

Navigating to our Databases page, we can see the database where Structural's data generation outputted our data—this database was deactivated once the data generation was completed, as we can tell by the grayed out ‘deactivated’ status.

A screenshot of the Ephemeral UI, showing the available Databases.

To connect to this data, reactivate the database by selecting the power button on the right hand side, then click the green database button to view the connection details.

A screenshot of the database connection details in the Ephemeral UI.

To connect to this database, you’ll need to upload an SSH Key to Ephemeral, to coordinate access to our SSH Bastion. This can be done by navigating to the settings section: in the top right-hand corner, click the profile picture icon, then the settings button. Once uploaded, you can enter the connection details above to connect to your database.

After connecting, you can run a query to find the size of your new database:

And you’ll see the result of Structural’s subsetting: from 334MB in production, to 10MB in our ephemeral database.

Conclusion

To wrap up, the combination of Tonic Structural’s subsetter and Tonic Ephemeral provides rapid access to high quality test data. Subsetting efficiently reduces database size while maintaining data integrity, facilitating faster creation of ephemeral test databases with Tonic Ephemeral. This combination can greatly improve testing efficiency and workflow. Sign up for a free trial of Structural or Ephemeral to get started using both today.

Travis Matthews

Engineering

Continue with the next guide in this series

How to create database subsets for Ephemeral environments

Initial setup

Subsetting your data

Spinning up databases in Tonic Ephemeral

Conclusion

Related Guides

Integrating Tonic Structural into your data refresh and CI/CD pipelines

Maintaining data relationships in Structural generation output

Guide to test data management

Make your sensitive data usable for testing and development.