Testing and quality assurance (QA) are vital steps in the software development process. They ensure that software products work as intended and bring value to their users.
However, accurate and meaningful test results require high-quality data, which can be hard to source. In this guide, we'll provide an overview of test data and how you can use Tonic.ai’s solutions to create high-quality datasets for testing, QA, and development.
What is test data?
Let's start with a quick definition of test data.
Test data is the data used to exercise a software application and verify that it functions correctly, performs well, and is secure.
For example, high-quality, reliable test data enables you to quickly identify bugs and then verify conclusively that those bugs are fixed.
How test data is used in QA, testing, and development
How do QA, testing, and development use test data?
1. Ensure application stability
One use of test data is to verify that an application continues to work as expected.
Every release needs to go through rounds of testing to verify that the new release does not introduce new bugs to the application.
The release process needs access to a reliable and consistent dataset for testing and QA purposes.
2. Optimize application performance
Test data is also used to check system performance.
Organizations can use test datasets of varying sizes and complexity to make sure that an application and the system that runs it are performant enough to handle the expected workload.
3. Validate features and bug fixes
Probably the most obvious use of test data is for feature development and bug fixes.
As they work on a new feature, developers use test data to verify that new pages and API endpoints work correctly.
Similarly, before they merge the code to fix a bug, developers use test data to verify that the fix actually corrects the issue.
Options for sourcing a dataset for testing and QA
Where does test data come from? There are basically two options:
- Manual dataset creation that uses scripts and extract, transform, and load (ETL) tools
- Automated dataset creation that uses test data management and data synthesis tools
Creating datasets manually
Manual dataset creation involves using scripts and ETL tools to create datasets for testing and QA as they are needed.
Before you create a dataset, you need to have a firm understanding of the data structure and the specific types of records and values that you need for testing.
You also need to be able to recreate a dataset at the start of each round of testing, to ensure consistent results.
Manual dataset creation works for small datasets, but it does not scale well.
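To make this concrete, here's a minimal sketch of what the manual approach often looks like: a small script with a fixed random seed so the exact same dataset can be recreated for each round of testing. The table, columns, and value ranges here are all hypothetical.

```python
import csv
import random

# Fixed seed so every test run can recreate the exact same dataset.
random.seed(42)

FIRST_NAMES = ["Ava", "Ben", "Chloe", "Dev", "Elena"]
PLANS = ["free", "starter", "pro"]

def make_customers(n: int) -> list[dict]:
    """Generate n hypothetical customer rows for testing."""
    return [
        {
            "id": i,
            "name": random.choice(FIRST_NAMES),
            "plan": random.choice(PLANS),
            "monthly_spend": round(random.uniform(0, 500), 2),
        }
        for i in range(1, n + 1)
    ]

with open("customers.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "plan", "monthly_spend"])
    writer.writeheader()
    writer.writerows(make_customers(100))
```

Even this simple script hints at the maintenance burden: every schema change, new edge case, or new table means more hand-written code to keep in sync.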
Creating datasets automatically
For larger databases, and to better support iterative development and testing, the preferred option is to use a test data management tool to automatically create realistic and repeatable test data at scale.
Production data can be one source for these tools. However, production data is likely to contain highly sensitive values that must be protected, to maintain privacy and preserve compliance with privacy guidelines and regulations.
To address this, you can use your test data management tool to create synthesized datasets that contain de-identified versions of production data.
For example, with Tonic Structural, you can quickly identify and replace sensitive values in production data, then generate secure, realistic datasets that are safe for developers to use. The platform’s patented database subsetter allows you to generate cohesive datasets in a variety of sizes.
And with Tonic Ephemeral, you can use that output as the basis for any number of temporary databases that developers can spin up as needed.
Overcoming test data challenges
Let's go over some of the main challenges to obtaining high-quality datasets for testing and QA, and see how Tonic.ai can help to overcome them.
Sensitive data
One challenge is that the best source of realistic data is your actual production data. However, production data — sales transactions, patient records, and so on — contains sensitive values.
To protect individual privacy and, equally importantly, to remain compliant with privacy regulations such as HIPAA and GDPR, you cannot simply hand over raw production data to developers.
Tonic Structural automatically identifies a wide range of sensitive values such as names, locations, account identifiers, and birthdates. You can also configure custom rules to identify sensitive values that might be specific to your organization or industry.
You then configure Structural to replace those values before generating de-identified output data that is safe for developers to view and use.
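Structural's detection and replacement are built in, but the underlying idea can be sketched in a few lines: scan values against known patterns and swap any matches for safe placeholders. The two patterns below are simplified assumptions for illustration, not Structural's actual rules.

```python
import re

# Simplified patterns for two common kinds of sensitive values.
# Real detection covers many more types (names, locations, birthdates, ...).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def de_identify(text: str) -> str:
    """Replace values that match a sensitive pattern with a placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(de_identify("Contact jane@example.com, SSN 123-45-6789"))
# -> Contact <email>, SSN <ssn>
```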
Inconsistent data
Development, testing, and QA are iterative processes. You make a code change, run a test, fix issues in the original change, then test again — over and over until the new or updated feature works exactly as expected.
To confirm that your fix is valid, you want to test against the same data you used when you found the bug. Without that consistency, it can be hard to tell whether a test passes because the issue is fixed, or because the new dataset happens to lack the one value that triggered the bug.
With Tonic Structural, you can generate the same set of data over and over. And if you generate output to a Tonic Ephemeral snapshot, you can use that snapshot to spin up an identical database for every round of testing.
Structural's consistency feature ensures that a given source value always gets the same replacement value. You can also configure Structural to use the exact same values across data generations and databases.
Structural also maintains relationships between tables. The primary key value of a patient in the patients table is replicated in the foreign key values that identify the same patient in the records for procedures and invoices. And those relationships are maintained whether you generate the entire database or a subset.
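One common way to implement this kind of consistency (shown here as a generic sketch, not Structural's internal approach) is a keyed, deterministic mapping: the same source value always produces the same replacement, so a replaced primary key still lines up with every foreign key that references it. The table data below is hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # hypothetical key; store it apart from the data

def consistent_id(value: str) -> str:
    """Deterministically map a source ID to a replacement ID.
    The same input always yields the same output, across tables and runs."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return f"pt_{digest[:10]}"

patients = [{"id": "P001", "name": "Jane Doe"}]
procedures = [{"procedure": "MRI", "patient_id": "P001"}]

# Apply the same mapping to the primary key and every foreign key,
# so joins between the de-identified tables still resolve.
for patient in patients:
    patient["id"] = consistent_id(patient["id"])
for proc in procedures:
    proc["patient_id"] = consistent_id(proc["patient_id"])

assert procedures[0]["patient_id"] == patients[0]["id"]
```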
Inaccurate or incomplete data
Accurate testing also requires complete and accurate data. It's not enough to just de-identify datasets for testing and QA if the replacement values don't make sense. Tables full of null values might be secure, but they won't be of much use.
You also don't want to lose the relationships between columns and among tables.
Tonic Structural generators can replace sensitive data with realistic values. For example, replacement dates and timestamps use the same format as the original and can be configured to be within a reasonable range. You can also get realistic replacements for items such as names and cities.
Other Structural features also contribute to accurate and complete data. Column linking ensures that related columns such as cities and states remain in sync. And consistency guarantees the same replacement value every single time.
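To see what realistic, linked replacements look like in practice, here's a rough sketch using the open-source Faker library; Structural's generators are their own implementation, so treat this purely as an analogy. Names are realistic, replacement dates stay within a plausible range, and city and state are drawn together as a pair so they can never fall out of sync.

```python
import random
from datetime import date

from faker import Faker  # pip install faker

fake = Faker()
Faker.seed(7)    # deterministic output across runs
random.seed(7)

# Linked columns: pick city and state together so they always match.
CITY_STATE_PAIRS = [
    ("Minneapolis", "MN"),
    ("Austin", "TX"),
    ("Portland", "OR"),
]

def replacement_row() -> dict:
    city, state = random.choice(CITY_STATE_PAIRS)
    return {
        "name": fake.name(),  # realistic full name
        # Replacement date keeps the original column's type and stays
        # within a plausible range.
        "visit_date": fake.date_between(date(2023, 1, 1), date(2023, 12, 31)),
        "city": city,
        "state": state,
    }

print(replacement_row())
```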
Structural subsetting produces datasets that are referentially intact, regardless of the target records. Each subset preserves all of the relationships among the tables. For example, a subset might target 5 percent of sales records (plus the relevant customers, products, and vendors) or all outpatient procedures in Minneapolis (plus the relevant doctors, patients, and insurance plans).
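The mechanics behind a referentially intact subset can be sketched in plain Python: sample the target table, then follow the foreign keys to pull in every related row. Structural's subsetter does this across an entire schema automatically; the tiny tables below are hypothetical.

```python
import random

random.seed(1)

sales = [{"id": i, "customer_id": i % 4, "product_id": i % 3} for i in range(100)]
customers = [{"id": i, "name": f"cust-{i}"} for i in range(4)]
products = [{"id": i, "sku": f"sku-{i}"} for i in range(3)]

# 1. Sample the target: 5 percent of sales records.
subset_sales = random.sample(sales, k=len(sales) // 20)

# 2. Follow the foreign keys so the subset stays referentially intact.
needed_customers = {s["customer_id"] for s in subset_sales}
needed_products = {s["product_id"] for s in subset_sales}

subset_customers = [c for c in customers if c["id"] in needed_customers]
subset_products = [p for p in products if p["id"] in needed_products]

# Every sale in the subset can still join to its customer and product.
customer_ids = {c["id"] for c in subset_customers}
assert all(s["customer_id"] in customer_ids for s in subset_sales)
```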
Conclusion
To ensure that a software application works correctly and performs well, development, testing, and QA need reliable access to high-quality testing data. To support iterative development and testing, the test data must be consistent, complete, and accurate.
Tonic Structural and Tonic Ephemeral allow you to de-identify your production data and produce realistic and secure datasets for testing and QA. Features such as linking, consistency, and subsetting ensure that data is realistic and preserves the complex relationships among data columns and tables.
From Ephemeral, you can create and recreate an identical database to support each round of validation and testing.
To learn more about Structural de-identification and Ephemeral temporary databases, connect with our team today.