
How data quality issues can slow down product development

Author
Janice Manwiller
April 16, 2025

Product development requires high-quality data, whether it's for:

  • Developers who test new and updated code
  • A continuous release process that runs regular integration tests
  • Data scientists who train AI models

Issues with data quality can bring these processes to a near standstill. Without high-quality data, you cannot be confident that your testing produces accurate results or that your training produces reliable models.

In this guide, we'll go over some of the more common issues with data quality, and provide steps that you can take to address them.

Top data quality issues

What prevents you from having the best-quality data for testing, development, or model training? Here are some of the more common data quality issues.

  • Inaccessible: Data cannot be used because of privacy issues.
  • Insufficient: Data is too small to produce a realistic model.
  • Outdated, stale, or old: Data does not reflect the latest schema changes and use cases.
  • Inaccurate or incomplete: Data does not cover all scenarios and edge cases.
  • Inconsistent: Data lacks referential integrity.
  • Lacks context and meaning: Data does not contain realistic values for model training.

Inaccessible data

Your organization likely has a treasure trove of production data—transaction records, patient transcripts, support chats—that would be perfect to use for software testing and AI model development. It includes a range of use cases and accurately reflects the data complexity.

However, in many, if not most, cases, your production data contains highly sensitive information that cannot be exposed to developers. Personally identifiable information (PII) and protected health information (PHI) must be kept private, both to address basic privacy concerns and to ensure compliance with regulations such as HIPAA and GDPR.

Not being able to use highly sensitive production data feeds into some of the other issues to follow.

Insufficient data

Training a data model requires a large set of realistic data. A dataset that is too small does not produce accurate or realistic results.

Similarly, testing code against a too-small dataset might not uncover issues, especially performance issues, that would be revealed in tests against larger data.

Outdated, stale, or old data

Generating data for testing is not a one-and-done process. You cannot create a test dataset and then never touch it again.

Schemas change and new use cases arise.

If test data is not refreshed regularly, it eventually stops producing accurate test results and cannot uncover issues tied to more recent changes in the underlying data structure.

Inaccurate or incomplete data

Test data needs to cover a wide range of scenarios, including edge cases that might not turn up frequently in real data, but that do need to be addressed during development.

It also needs to reflect the actual complexity of real-world data.

For example, data that is generated from a script is likely to be more of a happy-path dataset, with none of the outliers and intricacies that might be found in the original data.

Incomplete data means incomplete testing.

Inconsistent data

To get the most meaningful results, any test data needs to maintain the same relationships that are in the original production data.

For example, a script might generate a dataset with hundreds of sales records, dozens of products, and several vendors, but if the data does not tie the sales to the products and the products to the vendors, it is of little value. You can't test whether the system brings up the correct product or vendor if those relationships are not there to begin with.
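
As a rough illustration of that point, here's a minimal Python sketch (the tables and column names are invented for this example) that generates sales, products, and vendors whose keys actually resolve, and then checks that referential integrity holds:

    import random

    # Invented tables for the sales example above; names are illustrative only.
    vendors = [{"vendor_id": v, "name": f"Vendor {v}"} for v in range(1, 6)]
    products = [
        {"product_id": p, "vendor_id": random.choice(vendors)["vendor_id"]}
        for p in range(1, 51)
    ]
    sales = [
        {"sale_id": s, "product_id": random.choice(products)["product_id"]}
        for s in range(1, 501)
    ]

    # Referential integrity check: every sale must point at a real product,
    # and every product at a real vendor. Naively scripted data often fails this.
    product_ids = {p["product_id"] for p in products}
    vendor_ids = {v["vendor_id"] for v in vendors}
    assert all(s["product_id"] in product_ids for s in sales)
    assert all(p["vendor_id"] in vendor_ids for p in products)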

Data that lacks context or meaning

Similar to inconsistent data, data without context or meaning provides little value, especially for model training.

This can happen with scripted data. For example, a script might produce the same records over and over, without capturing the ebb and flow of real-world activity.

It can also happen when you redact data. For example, a redaction process might replace every first name in a set of patient notes with John, null out all of the symptoms, and replace each age with a random integer. The de-identified data is safe, but it is devoid of meaning and cannot produce a useful model.
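
To make that concrete, here is a minimal sketch of that kind of blanket redaction (the field names are hypothetical). The output is private, but nearly everything a model could learn from is gone:

    import random

    note = {"first_name": "Maria", "age": 78, "symptoms": "joint pain, stiffness"}

    # Blanket redaction: every name becomes "John", symptoms are nulled out,
    # and the age becomes an arbitrary integer unrelated to the original.
    redacted = {
        "first_name": "John",
        "age": random.randint(0, 99),
        "symptoms": None,
    }

    # The record is now safe to share, but thousands of notes redacted this way
    # all look alike, so a model trained on them learns nothing useful.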


How to solve data quality issues

Now that we've gone over some of the data quality pitfalls, let's look at some ways to address those issues and create meaningful, useful data for testing and model training.

Here is how each technique maps to the issues it helps to address:

  • Data de-identification: Inaccessible data, insufficient data, incomplete data
  • Consistency: Inconsistent data, lack of context and meaning
  • Synthesize data in context: Lack of context and meaning
  • Automated data refresh: Outdated data
  • Automated data provisioning: Outdated data, insufficient data

Data de-identification

While privacy issues might prevent you from using your production data as-is, you can use de-identification to swap out sensitive values with realistic replacements.

This produces a complete, safe-to-use dataset that is based on and reflects real-world data.
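
As a rough sketch of the idea, using the open-source Faker library (the record layout is invented for this example), each sensitive value is swapped for a realistic stand-in of the same kind:

    from faker import Faker

    fake = Faker()

    production_row = {
        "name": "Maria Gonzalez",
        "email": "maria.g@example.com",
        "city": "Portland",
    }

    # Replace each sensitive field with a realistic value of the same type,
    # so the row stays useful for testing without exposing a real person.
    deidentified_row = {
        "name": fake.name(),
        "email": fake.email(),
        "city": fake.city(),
    }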

Consistency

When you generate data for testing and model training, make sure to preserve the original data relationships, to ensure referential integrity and produce a useful, meaningful set of data.

Consistency also helps to preserve the overall shape of the data—the proportion of records that have specific values.
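
One common way to get this kind of consistency is to derive each replacement deterministically from the original value, so the same input always maps to the same output across tables and runs. A minimal sketch, again with invented column names and the Faker library:

    import hashlib
    from faker import Faker

    def consistent_name(original: str) -> str:
        """Same original value -> same replacement, everywhere it appears."""
        seed = int(hashlib.sha256(original.encode()).hexdigest(), 16) % (2**32)
        fake = Faker()
        fake.seed_instance(seed)
        return fake.name()

    # "Maria Gonzalez" gets the same pseudonym in both tables, so joins on the
    # customer column still line up after de-identification.
    customers = [{"customer": consistent_name("Maria Gonzalez")}]
    orders = [{"customer": consistent_name("Maria Gonzalez"), "total": 42.50}]
    assert customers[0]["customer"] == orders[0]["customer"]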

Synthesize data in context

When you de-identify unstructured data such as text files and PDFs, de-identify the data in context.

Replace each value in a way that does not remove the original meaning of the text.
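
In practice that usually means detecting each sensitive entity and substituting a value of the same type, rather than blanking it out. A toy sketch (real pipelines detect the entities with named entity recognition; here the replacements are hand-picked):

    # A toy, hand-labeled example; real pipelines detect entities automatically.
    note = "Maria, 78, reports joint pain that worsens in the morning."

    replacements = {
        "Maria": "Susan",  # another plausible first name
        "78": "81",        # an age in a similar clinical range
    }

    deidentified = note
    for original, stand_in in replacements.items():
        deidentified = deidentified.replace(original, stand_in)

    # -> "Susan, 81, reports joint pain that worsens in the morning."
    # The note no longer identifies the patient, but its clinical meaning survives.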

Automated data refresh

Once you have a de-identified set of data, you should automatically refresh that data to reflect added records and changes to the data schema.

The automated refresh should include de-identification of new records and fields.
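
In practice, that usually looks like a scheduled job that picks up whatever changed since the last run. A bare-bones sketch using SQLite, with an invented schema and a placeholder de-identification step; a real setup would run this from cron or a pipeline scheduler:

    import sqlite3

    def deidentify_and_copy(row: tuple) -> None:
        # Placeholder: swap sensitive fields for realistic values
        # (see the de-identification sketch earlier) and write the row
        # to the test database.
        ...

    def refresh_test_data(conn: sqlite3.Connection, last_run: str) -> int:
        """De-identify only the rows added or changed since the previous run."""
        rows = conn.execute(
            "SELECT id, name, email FROM patients WHERE updated_at > ?",
            (last_run,),
        ).fetchall()
        for row in rows:
            deidentify_and_copy(row)
        return len(rows)

    # Run nightly (for example, from cron), then record the new timestamp so
    # the next pass only processes records added after this one.
    # count = refresh_test_data(sqlite3.connect("prod_replica.db"), "2025-04-01T00:00:00Z")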

Automated data provisioning

As you de-identify and refresh the data, you can automatically create new and updated datasets, then make those datasets available to developers and data scientists.

This ensures that they always have access to the latest de-identified data to use for development and model training.
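
A simple sketch of the provisioning step, assuming the refresh job above writes its output to a snapshot file: each developer or CI job gets its own copy of the latest de-identified snapshot, so nobody works against stale data.

    import shutil
    from datetime import date
    from pathlib import Path

    # Hypothetical location where the refresh job writes the latest snapshot.
    SNAPSHOT = Path("snapshots/deidentified_latest.db")

    def provision_dataset(consumer: str) -> Path:
        """Hand a developer or CI job its own copy of the newest snapshot."""
        target = Path(f"provisioned/{consumer}_{date.today()}.db")
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(SNAPSHOT, target)
        return target

    # Example: a CI pipeline requests a fresh database before integration tests.
    # db_path = provision_dataset("integration-tests")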

Using Tonic.ai solutions to produce high-quality data

You can use Tonic Structural, Textual, and Ephemeral to produce high-quality data for development, testing, and AI model training.

Tonic Structural

Tonic Structural produces de-identified versions of databases or text-based files. The Structural generators produce replacement values for sensitive data such as names, locations, identifiers, and much more.

Structural features such as consistency and linking ensure that the de-identified data maintains its original shape and preserves all of the data relationships.

Structural subsetting allows you to produce complete datasets in a variety of sizes. A subset can focus on a specific area of your data. Each subset maintains referential integrity.

You can also set Structural data generation to run automatically on a regular schedule, ensuring that new data is de-identified as it is added and that your test datasets stay in sync with production.

Tonic Textual

Tonic Textual redacts free-text data in a variety of file types, including Word files, PDFs, and images. The files can come from a local system, or can be pulled from folders in a cloud storage location.

You can view a summary of the sensitive values that Textual detects, and configure how to replace each type of value. As it replaces each value, Textual maintains consistency across your data transformations.

For cloud storage files, you can regularly run the redaction in order to de-identify files that were added since the most recent run. For example, you can redact new patient note entries as they are added to a folder.

Tonic Ephemeral

Tonic Ephemeral provisions temporary databases for software development and testing that expire based on usage or a specific time frame.

One option for Structural data de-identification is to generate a data snapshot in Ephemeral. You can then use that snapshot to spin up any number of databases to use for development and testing.

Recap

High-quality data is key to high-quality software and AI models. Issues such as inaccessible, incomplete, inconsistent, or obsolete data increase the risk that serious issues escape detection during development and testing, or that a model does not accurately reflect its real-world sources.

To address these issues, you can de-identify data to remove sensitive values, taking care to ensure that the resulting output is realistic and maintains data integrity. Automatic updates and data provisioning ensure that high-quality data is always at your developers' fingertips.

Tonic Structural, Textual, and Ephemeral provide features that make it easy for you to generate and maintain high-quality data for software development, testing, and model training.

To learn more about Structural de-identification, Textual file redaction, and Ephemeral temporary databases, connect with our team today.

FAQs

Why do data quality issues slow down testing and development?

Complete and accurate testing requires consistent access to high-quality data.

When there are data quality issues such as incomplete, inconsistent, or obsolete data, it increases the risk that an issue escapes detection during testing. This is particularly true for edge cases, which might not be covered by script-produced data.

Why does AI model training need large volumes of high-quality data?

An effective data model requires large volumes of data that reflect the full spectrum of values. For example, to generate a model for a patient chat, you need to have data from patients of all ages from different locations and with different conditions and symptoms.

When the data provided to the model is insufficient, the model itself becomes less and less useful. For example, if you feed a patient model data that only contains information about young people who have head colds, the resulting tool will not be able to assist elderly patients who have arthritis.

How does automation help to keep test data up to date?

Data and data structures can change over time. New records are added, and data schemas change.

With automation, you can ensure that new records are de-identified as they are added, to make even more data available for development, testing, and training. 

Automated processes can also pick up changes to the data structure, to automatically produce new datasets that are completely up-to-date, so that tests can cover every possible use case.

Janice Manwiller
Principal Technical Writer

Janice Manwiller is the Principal Technical Writer at Tonic.ai. She currently maintains the end-user documentation for all of the Tonic.ai products and has scripted and produced several Tonic video tutorials. She's spent most of her technical communication career designing and developing information for security-related products.
