Product development requires high-quality data. Whether it's for:
- Developers who test new and updated code
- A continuous release process that runs regular integration tests
- Data scientists who train data models
Issues with data quality can bring these processes to a near standstill. Without high-quality data, you cannot be confident that your testing and model generation are producing accurate results and high-quality models.
In this guide, we'll go over some of the more common issues with data quality, and provide steps that you can take to address them.
Top data quality issues
What prevents you from having the best-quality data for testing, development, or model training? Here are some of the more common data quality issues.
Inaccessible data
Your organization likely has a treasure trove of production data—transaction records, patient transcripts, support chats—that would be perfect to use for software testing and AI model development. It includes a range of use cases and accurately reflects the data complexity.
However, in many, if not most, cases, your production data contains highly sensitive information that cannot be exposed to developers. Personally identifiable information (PII) and protected health information (PHI) must be kept private, both to protect individual privacy and to comply with regulations such as HIPAA and GDPR.
Being unable to use production data because of its sensitivity feeds into several of the other issues that follow.
Insufficient data
Training a data model requires a large set of realistic data. A dataset that is too small does not produce accurate or realistic results.
Similarly, testing code against a too-small dataset might not uncover issues, especially performance issues, that would be revealed in tests against larger data.
Outdated, stale, or old data
Generating data for testing is not a one-and-done process. You cannot create a test dataset and then never touch it again.
Schemas change and new use cases arise.
If test data is not refreshed regularly, it eventually stops producing accurate test results and cannot uncover issues tied to more recent changes in the underlying data structure.
Inaccurate or incomplete data
Test data needs to cover a wide range of scenarios, including edge cases that might not turn up frequently in real data, but that do need to be addressed during development.
It also needs to reflect the actual complexity of real-world data.
For example, data that is generated from a script is likely to be more of a happy-path dataset, with none of the outliers and intricacies that might be found in the original data.
Incomplete data means incomplete testing.
Inconsistent data
To get the most meaningful results, any test data needs to maintain the same relationships that are in the original production data.
For example, a script might generate a dataset with hundreds of sales records, tens of products, and several vendors, but if the data does not tie the sales to the products and the products to the vendors, it is of little value. You can't test whether the system brings up the correct product or vendor if those relationships are not there to begin with.
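To make the pitfall concrete, here is a minimal sketch (the tables, column names, and ID ranges are hypothetical) of a script that generates each table independently. The foreign keys it produces point nowhere, so joins that work against production return nothing in test:

```python
import random

# Hypothetical scripted test data: each table is generated independently.
vendors = [{"vendor_id": i, "name": f"Vendor {i}"} for i in range(5)]
products = [{"product_id": i, "vendor_id": random.randint(100, 999)}  # matches no vendor
            for i in range(50)]
sales = [{"sale_id": i, "product_id": random.randint(1000, 9999)}     # matches no product
         for i in range(500)]

# A join that works against production finds nothing here.
vendor_ids = {v["vendor_id"] for v in vendors}
orphaned = [p for p in products if p["vendor_id"] not in vendor_ids]
print(f"{len(orphaned)} of {len(products)} products reference a vendor that does not exist")
```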
Data that lacks context or meaning
Similar to inconsistent data, data without context or meaning provides little value, especially for model training.
This can happen with scripted data. For example, a script might produce the same records over and over, without reflecting the ebb and flow of real-world activity.
It can also happen when you redact data too aggressively. For example, if you redact patient notes by replacing every first name with John, nulling out all of the symptoms, and swapping ages for random integers, the de-identified data is safe, but it is devoid of meaning and cannot produce a useful model.
How to solve data quality issues
Now that we've gone over some of the data quality pitfalls, let's look at some ways to address those issues and create meaningful, useful data for testing and model training.
Data de-identification
While privacy issues might prevent you from using your production data as-is, you can use de-identification to swap out sensitive values for realistic replacements.
This produces a complete, safe-to-use dataset that is based on and reflects real-world data.
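As a rough illustration of the idea, and not a description of how any particular product implements it, the snippet below uses the open-source Faker library to swap sensitive fields for realistic stand-ins; the record layout is hypothetical:

```python
from faker import Faker

fake = Faker()

def deidentify(record: dict) -> dict:
    """Replace sensitive fields with realistic fake values; keep everything else."""
    return {
        **record,
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
    }

production_row = {"id": 42, "name": "Jane Doe", "email": "jane@example.com",
                  "address": "12 Main St", "order_total": 118.40}
print(deidentify(production_row))
```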
Consistency
When you generate data for testing and model training, preserve the original data relationships to maintain referential integrity and produce a useful, meaningful dataset.
Consistency also helps to preserve the overall shape of the data—the proportion of records that have specific values.
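One common way to get this kind of consistency, sketched here with Faker and a hypothetical customer name, is to derive the replacement from a hash of the original value, so the same input maps to the same output in every table and on every run:

```python
import hashlib
from faker import Faker

def consistent_fake_name(original: str) -> str:
    """Derive the replacement from a hash of the original value,
    so the same input always yields the same output."""
    seed = int(hashlib.sha256(original.encode()).hexdigest(), 16) % (2**32)
    fake = Faker()
    fake.seed_instance(seed)
    return fake.name()

# The same original value gets the same replacement in the customers table,
# the orders table, and any future refresh of the dataset.
print(consistent_fake_name("Jane Doe"))
print(consistent_fake_name("Jane Doe"))  # identical to the line above
```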
Synthesize data in context
When you de-identify unstructured data such as text files and PDFs, de-identify the data in context.
Replace each value in a way that does not remove the original meaning of the text.
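The toy sketch below shows the idea. It assumes the sensitive spans have already been detected (production systems typically use NER models for that step) and swaps each one for a realistic value of the same type, leaving the clinical meaning of the note intact:

```python
from faker import Faker

fake = Faker()

note = "Jane Doe, 67, reports joint pain in both knees."
# Assume an upstream detection step found these spans and their entity types.
detected = [("Jane Doe", "NAME"), ("67", "AGE")]

# Replace each entity with a realistic value of the same type;
# keeping the age in a nearby range preserves the note's context.
replacements = {"NAME": lambda: fake.name(),
                "AGE": lambda: str(fake.random_int(min=60, max=75))}

for text, entity_type in detected:
    note = note.replace(text, replacements[entity_type]())

# The symptoms and their context survive; only the identifying details change.
print(note)
```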
Automated data refresh
Once you have a de-identified set of data, you should automatically refresh that data to reflect added records and changes to the data schema.
The automated refresh should include de-identification of new records and fields.
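In practice this is usually an incremental job on a schedule. The sketch below covers new records only, assumes a hypothetical customers table with an updated_at column, and uses in-memory SQLite to stand in for the production and test databases; a scheduler such as cron or a CI pipeline would run something like this nightly:

```python
import sqlite3
from datetime import datetime, timezone
from faker import Faker

fake = Faker()

DDL = "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT, updated_at TEXT)"

def refresh(source: sqlite3.Connection, target: sqlite3.Connection, last_run: str) -> str:
    """Copy rows added or changed since the previous run, de-identifying them on the way."""
    rows = source.execute(
        "SELECT id, name, email, updated_at FROM customers WHERE updated_at > ?",
        (last_run,),
    ).fetchall()
    for row_id, _name, _email, updated_at in rows:
        target.execute(
            "INSERT OR REPLACE INTO customers VALUES (?, ?, ?, ?)",
            (row_id, fake.name(), fake.email(), updated_at),
        )
    target.commit()
    # The value returned here becomes last_run for the next scheduled cycle.
    return datetime.now(timezone.utc).isoformat()

# Minimal demo with in-memory databases standing in for production and test.
source, target = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
source.execute(DDL)
target.execute(DDL)
source.execute("INSERT INTO customers VALUES (1, 'Jane Doe', 'jane@example.com', '2024-05-01T12:00:00+00:00')")
source.commit()
last_run = refresh(source, target, last_run="1970-01-01T00:00:00+00:00")
print(target.execute("SELECT * FROM customers").fetchall())
```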
Automated data provisioning
As you de-identify and refresh the data, you can automatically create new and updated datasets, then make those datasets available to developers and data scientists.
This ensures that they always have access to the latest de-identified data to use for development and model training.
Using Tonic.ai solutions to produce high-quality data
You can use Tonic Structural, Textual, and Ephemeral to produce high-quality data for development, testing, and AI model training.
Tonic Structural
Tonic Structural produces de-identified versions of databases or text-based files. The Structural generators produce replacement values for sensitive data such as names, locations, identifiers, and much more.
Structural features such as consistency and linking ensure that the de-identified data maintains its original shape and preserves all of the data relationships.
Structural subsetting allows you to produce complete datasets in a variety of sizes. A subset can focus on a specific area of your data. Each subset maintains referential integrity.
You can also configure Structural data generation to run automatically on a regular schedule, so that new data is de-identified as it is added and your test datasets stay in sync with production.
Tonic Textual
Tonic Textual redacts free-text data in a variety of file types, including Word files, PDFs, and images. The files can come from a local system, or can be pulled from folders in a cloud storage location.
You can view a summary of the sensitive values that Textual detects, and configure how to replace each type of value. As it replaces each value, Textual maintains consistency across your data transformations.
For cloud storage files, you can regularly run the redaction in order to de-identify files that were added since the most recent run. For example, you can redact new patient note entries as they are added to a folder.
Tonic Ephemeral
Tonic Ephemeral provisions temporary databases for software development and testing that expire based on usage or a specific time frame.
One option for Structural data de-identification is to generate a data snapshot in Ephemeral. You can then use that snapshot to spin up any number of databases to use for development and testing.
Recap
High-quality data is key to high-quality software and AI models. Issues such as inaccessible, incomplete, inconsistent, or obsolete data increase the risk that serious issues escape detection during development and testing, or that a model does not accurately reflect its real-world sources.
To address these issues, you can de-identify data to remove sensitive values, taking care to ensure that the resulting output is realistic and maintains data integrity. Automatic updates and data provisioning ensure that high-quality data is always at your developers' fingertips.
Tonic Structural, Textual, and Ephemeral provide features that make it easy for you to generate and maintain high-quality data for software development, testing, and model training.
To learn more about Structural de-identification, Textual file redaction, and Ephemeral temporary databases, connect with our team today.
FAQs
Why does software testing require high-quality data?
Complete and accurate testing requires consistent access to high-quality data.
Data quality issues such as incomplete, inconsistent, or obsolete data increase the risk that an issue escapes detection during testing. This is particularly true for edge cases, which might not be covered by script-produced data.
Why does model training require high-quality data?
An effective data model requires large volumes of data that reflect the full spectrum of values. For example, to generate a model for a patient chat, you need to have data from patients of all ages from different locations and with different conditions and symptoms.
When the data provided to the model is insufficient, the model itself becomes less and less useful. For example, if you feed a patient model data that only contains information about young people who have head colds, the resulting tool will not be able to assist elderly patients who have arthritis.
How does automation help to maintain data quality?
Data and data structures can change over time. New records are added, and data schemas change.
With automation, you can ensure that new records are de-identified as they are added, to make even more data available for development, testing, and training.
Automated processes can also pick up changes to the data structure, to automatically produce new datasets that are completely up-to-date, so that tests can cover every possible use case.