
How to generate synthetic data: a comprehensive guide

Author
Chiara Colombi
September 4, 2024

What is synthetic data generation?

Synthetic data generation is the process of creating artificial data that resembles real-world data. This data can be generated using a variety of methods and techniques, depending on the needs of the use case at hand. These methods aim to ensure that the synthetic data maintains the statistical properties and patterns of real-world data without containing any actual personal or sensitive information.

Among its key use cases, data synthesis proves valuable in software development and testing, especially in environments where using real production data poses privacy risks or regulatory challenges. By generating synthetic data, developers can build, test, and validate their applications more effectively, ensuring high-quality releases while safeguarding consumer privacy.

Techniques for synthetic data generation

At a high level, traditional methods of synthetic data generation can be grouped into two broad categories: rule-based data synthesis and statistical techniques.

Rule-based data synthesis involves generating synthetic data through predefined rules and logic, providing high control and consistency but sometimes lacking scalability and realism for complex datasets. In some instances, it means generating data “from scratch”, without involving or relying on any real-world data in the generation process. In other instances, the rules are applied to existing data to transform it according to specific requirements. This method is ideal for scenarios where strict adherence to business rules is necessary and the data structures to be created are straightforward. However, particularly for methods that generate data from scratch, it requires manual setup and maintenance, which can be labor-intensive.

In contrast, statistical approaches rely on real-world data and deep learning in their generation processes. The real-world data acts as a seed: statistical models and algorithms learn from it to build a model capable of generating synthetic data that mirrors the underlying properties and distributions of the real data. These methods, including machine learning techniques like GANs and VAEs, can produce highly realistic synthetic data, making them suitable for machine learning training and privacy-preserving data analysis.

While these methods offer greater realism, they can be computationally intensive and provide less granular control over specific data attributes compared to rule-based methods. What’s more, current capabilities run into serious limitations when applied to complex databases involving multiple tables. In short, GANs and VAEs work well on smaller, individual tables but are not yet able to fully synthesize multiple tables with cross-table interdependencies.

Traditional methods

The methods below offer a few examples of traditional techniques for creating synthetic data; a brief code sketch follows the list.

  • Random sampling: One of the simplest methods, random sampling involves generating data points by randomly selecting values from a predefined distribution. While easy to implement, this method may not capture the complexities and correlations present in real-world data.
  • Conditional data generation: This rule-based approach generates data “from scratch” based on predefined conditions or rules that reflect domain-specific knowledge or business logic. For example, a rule might constrain a data type’s range or format so that generated values fall within specific ranges or adhere to certain formats, like a credit card number. The goal is to create synthetic data that behaves like real-world data in specific contexts. Because this requires a specific and thorough understanding of what the data should look like and how it should behave, the approach is useful for individual columns but unsuitable as a blanket strategy for synthesizing complete datasets.
  • Data shuffling: This technique involves shuffling real data to create new datasets. For instance, swapping the values of certain columns in a dataset to create a new synthetic dataset. This can help maintain the statistical properties of the data but may still risk re-identification if not done carefully.
  • Data augmentation: Commonly used in image and text data generation, data augmentation involves applying transformations (e.g., rotation, translation, noise addition) to existing data to create new synthetic samples. This is particularly useful in machine learning to increase the diversity and size of training datasets.
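
To make these techniques more concrete, here is a minimal Python sketch of three of them: random sampling from predefined distributions, rule-based (“from scratch”) generation of a formatted value, and column shuffling. The schema and distribution parameters are hypothetical, chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n_rows = 1_000

# Random sampling: draw values from predefined distributions.
ages = rng.integers(18, 90, size=n_rows)                    # uniform integer draw
incomes = rng.lognormal(mean=10.5, sigma=0.6, size=n_rows)  # skewed, income-like values

# Rule-based ("from scratch") generation: values must follow a format rule.
# Here, a 16-digit card-number-like string (illustrative only, no valid checksum).
def fake_card_number() -> str:
    return "".join(str(rng.integers(0, 10)) for _ in range(16))

synthetic = pd.DataFrame({
    "age": ages,
    "income": incomes,
    "card_number": [fake_card_number() for _ in range(n_rows)],
})

# Data shuffling: permute a column of an existing dataset to break row-level
# linkage while keeping the column's marginal distribution intact.
real = pd.DataFrame({"age": ages, "income": incomes})  # stand-in for real data
shuffled = real.copy()
shuffled["income"] = rng.permutation(shuffled["income"].to_numpy())

print(synthetic.head())
print(shuffled.head())
```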

Advanced techniques

Advanced techniques leverage machine learning and deep learning models to generate more realistic and complex synthetic data, without relying on rules defined by a user. As mentioned above, these techniques work well for smaller datasets or when applied to a single table, but they are not currently capable of synthesizing multiple tables or full databases. A minimal sketch of the adversarial approach follows the list below.

  • Generative Adversarial Networks (GANs): GANs consist of two neural networks—a generator and a discriminator—that work in tandem. The generator creates synthetic data, while the discriminator evaluates its authenticity. Through this adversarial process, GANs can generate highly realistic synthetic data that closely resembles real-world data.
  • Variational Autoencoders (VAEs): VAEs are another type of neural network used for synthetic data generation. They work by encoding real data into a latent space and then decoding it back into synthetic data. VAEs are particularly effective for generating continuous data and can capture complex data distributions.
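
For readers who want a feel for the adversarial setup described above, below is a heavily simplified PyTorch sketch of a GAN trained on a single numeric table. It is illustrative only: production tabular GANs (CTGAN, for example) add conditioning, mode-specific normalization, and proper handling of categorical columns, and the sizes and data here are stand-ins.

```python
import torch
import torch.nn as nn

latent_dim, n_features = 16, 4  # toy sizes for a small numeric table

# Generator: maps random noise to a synthetic row.
generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(),
    nn.Linear(64, n_features),
)

# Discriminator: scores how "real" a row looks (outputs a raw logit).
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

real_data = torch.randn(512, n_features)  # stand-in for a normalized real table

for step in range(1_000):
    real = real_data[torch.randint(0, len(real_data), (64,))]
    fake = generator(torch.randn(64, latent_dim))

    # Train the discriminator to separate real rows from generated ones.
    d_loss = (loss_fn(discriminator(real), torch.ones(64, 1))
              + loss_fn(discriminator(fake.detach()), torch.zeros(64, 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Train the generator to fool the updated discriminator.
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# After training, sample as many synthetic rows as needed.
synthetic_rows = generator(torch.randn(1_000, latent_dim)).detach()
```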

In the realm of test data synthesis for software development, most teams find that they require a combination of techniques to effectively craft the realistic data they need. Some data types are better served by rule-based approaches, while others cannot accurately be synthesized without a deep learning algorithm to properly capture the data’s statistical distributions. The solution required is one that brings all options to the table, so that teams can mix and match approaches to generate high-quality, high-utility data. Next, we’ll look at best practices in implementing these approaches across the board.

Best practices for generating synthetic data

Ensure data privacy and security

Prioritizing data privacy in generating synthetic data involves a variety of best practices to ensure both the utility of the data and the protection of sensitive information. Here are some key practices to stay on top of:

  1. Understand the source data: Before generating synthetic data, fully understand the characteristics, distributions, and dependencies in the original data. This understanding will help ensure that the synthetic data maintains the statistical properties necessary for valid testing and development.
  2. Implement advanced data synthesis techniques: Incorporate mechanisms like differential privacy, which adds randomness to the data-generating process to minimize the risk of re-identifying individuals from the source dataset (a toy sketch follows this list). Note that differential privacy naturally softens outliers, so it can remove edge cases that may be important for fully representative software testing.
  3. Validate synthetic data quality: Regularly test your synthetic data against real scenarios to confirm that it retains the essential characteristics of original data for your use case. Likewise, make use of tools and methodologies to evaluate the risk of re-identification or information leakage from your synthetic data.
  4. Continuously monitor and update: Regularly monitor the use and performance of your synthetic data to identify any unexpected behavior or potential privacy concerns. Adjust your data synthesis methods as new vulnerabilities are discovered or as compliance requirements change.
  5. Comply with legal standards: Ensure that your generation and use of synthetic data comply with all relevant data protection regulations (including GDPR, HIPAA, CCPA, etc.) and ethical guidelines.
  6. Documentation and transparency: Maintain thorough documentation of your data synthesis processes, including the techniques used, the rationale for their selection, and any implications for data privacy. Audit trails are a must.
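
To make the differential privacy idea in point 2 above more tangible, here is a toy sketch that adds calibrated Laplace noise to an aggregate statistic before it is used to parameterize a generator. It is not production-grade differential privacy: real implementations track a privacy budget across every query, and the bounds and epsilon value below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng()

def dp_mean(values: np.ndarray, lower: float, upper: float, epsilon: float) -> float:
    """Differentially private mean via the Laplace mechanism (toy version).

    Values are clipped to [lower, upper] so the sensitivity of the mean is
    bounded by (upper - lower) / n, then Laplace noise with scale
    sensitivity / epsilon is added to the result.
    """
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)

ages = rng.integers(18, 90, size=10_000)  # stand-in for a sensitive column
noisy_mean = dp_mean(ages, lower=18, upper=90, epsilon=1.0)
print(noisy_mean)  # feed this noisy statistic into your data-generating model
```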

Quality assurance in synthetic data generation

To create synthetic data from real data effectively, quality assurance is crucial. This involves several best practices:

  1. Use appropriate generation techniques: Depending on the complexity and requirements of the original data, choose the right data generation techniques. Optimize for utility and privacy (a difficult line to walk) by leveraging a variety of approaches best suited to each data type in need of transformation.
  2. Perform validation checks: Regularly validate the synthetic data against the original data to ensure that it statistically resembles the original in terms of means, variances, correlations, and other statistical measures without replicating the actual data points (a simple comparison sketch follows this list).
  3. Ensure realistic data variation: Synthetic data should include realistic variations to effectively mimic real-world scenarios, like edge cases or rare occurrences found in your original data. Just be mindful that those edge cases may introduce privacy risks if they mirror your source data too closely.
  4. Preserve data relationships: Ensure that relationships between variables (e.g., correlations, hierarchies) are maintained in the synthetic data. For statistical data, deep learning methods can be helpful here. For relational data more broadly speaking, the ability to link columns and ensure consistency throughout your database is essential.
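
As a starting point for the validation checks in point 2 above, the sketch below compares per-column statistics and correlation structure between a real table and its synthetic counterpart. The usage is hypothetical; dedicated evaluation tooling goes much further, adding distribution tests and re-identification risk metrics.

```python
import pandas as pd

def compare_tables(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Compare per-column mean/std and overall correlation drift."""
    numeric = real.select_dtypes(include="number").columns
    report = pd.DataFrame({
        "real_mean": real[numeric].mean(),
        "synth_mean": synthetic[numeric].mean(),
        "real_std": real[numeric].std(),
        "synth_std": synthetic[numeric].std(),
    })
    # Largest absolute difference between the two correlation matrices.
    corr_drift = (real[numeric].corr() - synthetic[numeric].corr()).abs().max().max()
    print(f"max correlation drift: {corr_drift:.3f}")
    return report

# Hypothetical usage with two DataFrames loaded elsewhere:
# report = compare_tables(real_df, synthetic_df)
# print(report)
```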

Scaling up synthetic data generation processes

The demand for synthetic data is only set to grow, so your generation processes need to be optimized to scale efficiently:

  • Automation: Implement automated pipelines for generating synthetic data at scale. This includes using tools and platforms that support automated data synthesis and data provisioning to manage your workflows, especially when dealing with large datasets or complex generation tasks.
  • Parallel processing: Leverage parallel processing techniques to improve efficiency and speed by distributing data generation tasks across multiple processors or machines (a small sketch follows this list).
  • Batch processing: For extremely large datasets, consider using batch processing for bulk data generation tasks.
  • Utilize scalable infrastructure: Employ scalable cloud-based solutions or high-performance computing environments to handle the computation-heavy tasks of data synthesis, especially when using advanced techniques like deep learning models. This allows for dynamic allocation of resources based on the workload.
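
To illustrate the parallel and batch processing points above, here is a small sketch that splits a large generation job into batches and distributes them across worker processes using Python's standard library. The generate_batch function is a hypothetical stand-in for whatever synthesis routine you actually use.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd

def generate_batch(args):
    """Hypothetical stand-in for a real synthesis routine: one batch of rows."""
    seed, n_rows = args
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "age": rng.integers(18, 90, size=n_rows),
        "income": rng.lognormal(mean=10.5, sigma=0.6, size=n_rows),
    })

if __name__ == "__main__":
    total_rows, batch_size = 1_000_000, 100_000
    batches = [(seed, batch_size) for seed in range(total_rows // batch_size)]

    # Distribute batch generation across worker processes.
    with ProcessPoolExecutor() as pool:
        frames = list(pool.map(generate_batch, batches))

    synthetic = pd.concat(frames, ignore_index=True)
    print(f"generated {len(synthetic):,} synthetic rows")
```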

Tools for synthetic data generation

Choosing the right tools for synthetic data generation is crucial for achieving high-quality results. Several open-source and commercial products are available to help developers generate synthetic data from real data efficiently and securely, including Tonic.ai’s industry-leading solutions. Below are several pioneering tools in the market:

Commercial tools

  1. Tonic.ai
    • Tonic Structural: Tonic Structural specializes in generating synthetic data from structured and semi-structured data for software development and testing. It offers PII detection, de-identification, synthesis, and subsetting, by way of an intuitive UI and extensive database support. Structural streamlines the process of generating realistic test data on demand, to ensure that developers can work with high-fidelity data that mirrors production data without risking data privacy.
    • Tonic Textual: Tonic Textual focuses on synthesizing unstructured free-text data for AI development. It uses proprietary Named Entity Recognition (NER) models to identify and redact or replace sensitive data entities, ensuring that proprietary data is protected. Tonic Textual transforms various unstructured data formats into AI-friendly formats, streamlining the ingestion and vectorization processes for LLM training and enterprise RAG systems.
  2. Mostly AI: Mostly AI focuses on generating synthetic data for data analytics use cases, retaining the utility and statistical properties of the original data while ensuring privacy protection. It uses AI techniques to create synthetic datasets that are suitable for self-service analytics, ML training, and data sharing.
  3. DataGen: DataGen specializes in generating synthetic data for computer vision applications. It creates high-fidelity 3D simulations to generate diverse and annotated image data, helping AI developers train and validate their models with realistic synthetic data.
  4. MDClone: MDClone offers synthetic data solutions tailored for the healthcare industry, enabling secure data analysis and research. Their platform allows healthcare organizations to create synthetic versions of their data, facilitating innovation while protecting patient privacy.

Open-source tools

  1. SDV (Synthetic Data Vault)
    • SDV is an open-source library that provides a suite of tools for generating synthetic data from real data. It supports various data types and offers multiple generative models, including GANs and VAEs. SDV allows users to fit models to real data and generate new synthetic datasets while preserving the statistical properties of the original data (see the sketch after this list).
  2. Synthea
    • Synthea is an open-source synthetic patient generator that models the medical histories of synthetic patients. It is particularly useful for healthcare applications, providing realistic synthetic health records that can be used for research, testing, and training machine learning models.
  3. PrivBayes
    • PrivBayes is an open-source tool that generates synthetic data using a differentially private Bayesian network. It aims to preserve the privacy of the original data while ensuring that the synthetic data retains its utility for analysis and modeling purposes.
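
As a quick example of the open-source route, here is a sketch of SDV's single-table workflow. Class and method names reflect the SDV 1.x API and may differ across versions, and the CSV path is hypothetical.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Hypothetical input file containing real (or already de-identified) data.
real_data = pd.read_csv("customers.csv")

# Describe the table so the synthesizer knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# Fit a generative model to the real table, then sample new synthetic rows.
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=1_000)

synthetic_data.to_csv("customers_synthetic.csv", index=False)
```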

Comparison of tools

When selecting a tool for synthetic data generation, consider the following factors:

  • Data type and source support: Ensure that the tool supports the types and sources of data you are working with, whether structured, semi-structured, or unstructured, and stored on-premises or in the cloud.
  • Usability and integration: Look for tools that offer an intuitive user interface and can easily integrate with your existing data pipelines and workflows.
  • Scalability: Ensure that the tool can handle large datasets and scale up to meet your data generation needs.
  • Privacy and compliance: Check that the tool provides robust privacy protection features that will enable you to comply with relevant data protection regulations.

The takeaway

Realistic data synthesis is not achieved by a one-size-fits-all approach. It requires a combination of traditional and advanced techniques, underpinned by best practices to ensure data utility, privacy, quality, and scalability. Solutions like Tonic Structural and Tonic Textual empower software and AI developers to create synthetic data tailored to their specific needs and use cases. 

By equipping themselves with the right combination of tools, developers can unlock the full potential of synthetic data in their workflows, upleveling testing and accelerating software and AI development to meet the speed that innovation requires. Connect with our team to learn more about how our solutions fulfill your data synthesis needs.
