
Decoding Generative AI’s Privacy Paradox

Madelyn Goodman
September 7, 2023

TL;DR: As generative AI tools continue to improve, developers need more data to train models with higher efficacy. This growing appetite for data is directly at odds with prioritizing data privacy, even as generative AI itself gives us new tools to safeguard sensitive data through self-deployed models. Here we take a deep dive into the benefits and drawbacks of generative AI for data privacy and security.

Generative AI and Work

This time last year, if you had never heard of GPT-3 or DALL-E 2, you wouldn’t necessarily have been at a disadvantage. Now it seems that if you don’t have a ChatGPT tab open at all times, you’re lagging behind.

Generative AI increases the efficiency of work across industries by automating rote tasks, accelerating upskilling, and allowing for the quicker development of proofs of concept. The development and improvement of these models, however, comes at a cost. Every time we ask ChatGPT for help we are exposing our data to an external source, risking our privacy.

The Threat Generative AI Poses to Data Privacy 

The key ingredients behind the recent breakthroughs in generative AI are model architectures, enormous training datasets, and massive computational resources. Because high-quality data is so important for training these models, foundation model providers are incentivized to draw on their customers’ data for training. And because these models can emit sensitive information contained in their training data, everyone using generative AI is concerned about one thing: data privacy.

Below is a comparison of the accuracy of several LLMs by their number of parameters and the amount of training data they saw, measured in tokens. The color gradient represents model performance, with darker colors indicating higher performance.

Figure adapted from “Chinchilla’s Wild Implications”

This research from DeepMind shows that smaller models trained on larger datasets can outperform larger models with more parameters, such as GPT-3. In other words, data is the key to scaling model improvements. As a result, smaller open-source models with under 100B parameters, trained on increasingly large corpora of text, are becoming more and more common. Meta’s Llama 2 models, for example, were trained on roughly 2 trillion tokens.
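To make the scaling intuition concrete, here is a minimal sketch of the parametric loss model fit by Hoffmann et al. (2022). The constants below are approximate values reported in that paper, so treat the exact numbers as an illustration rather than a definitive implementation.

```python
# Approximate parametric loss fit from Hoffmann et al., 2022 ("Chinchilla"):
#   L(N, D) = E + A / N**alpha + B / D**beta
# where N = parameters and D = training tokens. Constants are the paper's
# reported fit (approximate), used here purely for illustration.

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for a model with n_params parameters
    trained on n_tokens tokens, under the fitted scaling law."""
    return E + A / n_params**alpha + B / n_tokens**beta

# A 70B-parameter model trained on 1.4T tokens (Chinchilla-style)...
print(predicted_loss(70e9, 1.4e12))   # ~1.94
# ...is predicted to beat a 280B-parameter model trained on 300B tokens.
print(predicted_loss(280e9, 300e9))   # ~1.99
```

Under this fit, the smaller model trained on more tokens reaches a lower predicted loss, which is the heart of the Chinchilla result.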

Building the next generation of models will require even more data, whether for pre-training or fine-tuning. This demand for data could become fundamentally at odds with privacy and security requirements.

When it comes to actually using LLMs in your organization, it’s important to be cautious. Your team could unintentionally be sending your data to third parties whose data security policies differ from yours, and thereby sharing confidential information in breach of regulations. For example, data submitted to OpenAI from 2020 through March 2023 was potentially used to train future models, such as ChatGPT.

The risk of a model disclosing information from its training set grows with the size of the model (its number of parameters) and with the number of times a particular data point appears in the training set. Research by Carlini et al. exposed severe training data leakage in diffusion models, as shown below.

Figure from “Extracting Training Data from Diffusion Models”

Training data leakage is a huge liability for an organization. If sensitive data is used in training, it is at risk of being disclosed, plain and simple. Companies like Stability AI and Microsoft are currently facing the consequences, with lawsuits stemming from their models reproducing copyrighted content without proper credit or compensation to the original creators. 

How to Preserve Data Privacy with Generative AI

All is not lost, however, when it comes to preserving data privacy while also reaping the benefits of generative AI. There are tools that allow you to safely use your data with third-party hosted models like ChatGPT and Stability AI’s image models. And beyond third-party hosted models, there are other options for integrating LLMs into your organization.

Tools for using third-party hosted models safely

Generative AI itself can be used to protect data before requests are sent to third-party hosted models. LLMs can auto-redact sensitive information from a dataset. Previous auto-redaction methods struggled to recognize when the context of a sentence makes a word sensitive. For example, take the following sentence:

“The President tested positive for COVID-19 again Saturday per a letter from presidential physician Dr. Kevin O’Connor.”

Previous techniques would only label “Saturday” as a time and “Dr. Kevin O’Connor” as a person. Clearly, however, “President” and “COVID-19” are sensitive aspects of the sentence as well. Because transformer models can detect context, they are able to correctly flag these words as sensitive. Advancements in prompt engineering have further increased LLMs’ value for this kind of task: research from Carnegie Mellon shows how careful few-shot prompting, i.e., providing the LLM with a few examples of how it should respond within the prompt itself, can generate consistent and accurate results, as sketched below.
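Here is a minimal sketch of what few-shot prompting for redaction might look like. The prompt wording and the `call_llm` helper are hypothetical placeholders, not a specific vendor API; swap in whichever client or self-hosted model your organization uses.

```python
# Hypothetical few-shot redaction prompt. `call_llm` is a placeholder for
# whatever client you use (a hosted API or a self-hosted model); it is not
# a real library function.

FEW_SHOT_PROMPT = """Redact sensitive entities (people, roles, health conditions, dates) by replacing them with [REDACTED].

Text: Jane Doe, the CFO, was treated for diabetes on March 3.
Redacted: [REDACTED], the [REDACTED], was treated for [REDACTED] on [REDACTED].

Text: {text}
Redacted:"""

def redact(text: str, call_llm) -> str:
    """Ask the model to rewrite `text` with sensitive spans redacted."""
    return call_llm(FEW_SHOT_PROMPT.format(text=text)).strip()

# Example (actual output depends on the model you plug in):
# redact("The President tested positive for COVID-19 again Saturday per a "
#        "letter from presidential physician Dr. Kevin O'Connor.", call_llm)
# -> "The [REDACTED] tested positive for [REDACTED] again [REDACTED] per a
#     letter from presidential physician [REDACTED]."
```

The few-shot example in the prompt is what teaches the model to treat roles and health conditions as sensitive, not just classic named entities.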

Deploying models on premises 

Most major cloud providers are expected to offer ways to deploy large third-party models inside your own VPC. Alternatively, training and hosting your own LLM on premises is becoming more accessible, as smaller fine-tuned models are outperforming large general-purpose models on some tasks. Like the Chinchilla model, these models have fewer parameters but are trained on larger sets of more specialized data.

These models perform comparably to their larger, third-party hosted counterparts on common benchmarks. Stability AI’s open-source Stable Beluga 2 scored almost the same as the closed-source GPT-4, around 60%, on TruthfulQA, a benchmark that measures how prone a model is to reproducing falsehoods found on the internet. Current research also suggests that GPT-4’s accuracy has drifted as the model is fine-tuned broadly; locally tuned models avoid this accuracy drift because you control the fine-tuning yourself.
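To make the on-premises option concrete, here is a minimal sketch of loading an open model with the Hugging Face transformers library so that prompts and data never leave your infrastructure. The model name is only an example; substitute whichever open model your hardware and license allow.

```python
# Minimal sketch: run an open-source chat model entirely on your own hardware,
# so prompts never leave your infrastructure. The model name is an example
# (and a gated model requiring license acceptance); swap in any open model
# your hardware and license permit.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # example open model
    device_map="auto",                      # place weights on available GPUs/CPU
)

prompt = "Summarize our internal incident report in two sentences:\n..."
result = generator(prompt, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"])
```

Because the weights and the prompt both live inside your environment, no third party ever sees the sensitive text you pass in.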

How Generative AI Can Be Used to Protect Data Privacy

There is a growing trend of leveraging large, general-purpose models to generate synthetic data, which is then used to supplement real datasets when training more specialized models. Synthetic, or fake, data is artificially generated to mimic real information without using actual data points, preserving privacy and security.

A group at Stanford used the “self-instruct” method: starting from a small seed set of tasks, they used GPT-3 to generate a large amount of instruction-following data and then fine-tuned Meta’s LLaMA model into a ChatGPT-like model for just $700, as opposed to the millions it took to train ChatGPT. This model, trained largely on synthetic data, was not only more cost-effective to train but also more secure to use.
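A rough sketch of a self-instruct-style generation loop is below, reusing the same hypothetical `call_llm` helper as above; the prompt wording and seed examples are illustrative and do not reproduce the exact Alpaca pipeline.

```python
# Rough sketch of a self-instruct-style data generation loop. `call_llm` is
# the same hypothetical helper as above; prompts and parsing are simplified
# for illustration.
import json
import random

seed_tasks = [
    {"instruction": "Rewrite this sentence in the passive voice.",
     "input": "The team shipped the feature.",
     "output": "The feature was shipped by the team."},
    # ...in practice, a few hundred human-written seed tasks
]

def generate_tasks(call_llm, n_rounds: int = 10) -> list[dict]:
    tasks = list(seed_tasks)
    for _ in range(n_rounds):
        # Show the model a few existing tasks and ask for a brand-new one.
        examples = random.sample(tasks, k=min(3, len(tasks)))
        prompt = (
            "Here are example tasks as JSON objects with 'instruction', "
            "'input', and 'output' fields:\n"
            + "\n".join(json.dumps(t) for t in examples)
            + "\nWrite one new, different task as a JSON object:"
        )
        try:
            tasks.append(json.loads(call_llm(prompt)))
        except json.JSONDecodeError:
            continue  # skip malformed generations
    return tasks  # later used to fine-tune a smaller local model
```

The resulting dataset contains no customer records at all, which is what makes the downstream fine-tuned model safer to build and share.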

Further research trains models on “distilled” training data generated by larger models. Data is distilled from LLMs by following certain prompting patterns, often involving chain-of-thought tasks, to produce more specific data for training smaller, more specialized models. These smaller models show a steeper scaling curve: the more specialized, distilled data they are trained on, the better they perform.
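As a sketch of what chain-of-thought distillation might look like in practice, again using the hypothetical `call_llm` helper and an illustrative prompt rather than the exact setup from the cited papers:

```python
# Sketch of chain-of-thought distillation: ask a large "teacher" model to
# show its reasoning, then keep (question, rationale) records as training
# data for a smaller "student" model. `call_llm` is the same hypothetical
# helper as above.

def distill_example(question: str, call_llm) -> dict:
    rationale = call_llm(
        f"Question: {question}\nThink step by step, then give the final answer."
    )
    return {"question": question, "rationale": rationale}

# Running distill_example over a list of domain-specific questions yields a
# dataset of worked solutions that a smaller local model can be fine-tuned on.
```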

Curious to learn more about the data privacy risks generative AI poses? Check out this webinar from Tonic.ai, where founder and CEO Ian Coe and Head of AI Ander Steele discuss how to manage these risks. Also, tune into our upcoming webinar, Data Safety in the Age of AI, in partnership with A.Team, where Ander returns alongside Anjana Harve, a seasoned Global Chief Digital & Information Officer, and Michael Rispin, General Counsel at Sprout Social.

This article is also published in Mission by A.Team.

Further Resources

  1. Scaling Laws for Neural Language Models, Kaplan et al, 2020  
  2. An empirical analysis of compute-optimal large language model training, Hoffmann et al, 2022
  3. Chinchilla’s Wild Implications, Nostalgebraist, 2022  
  4. GPT-4 Technical Report, OpenAI, 2023 
  5. Training Compute-Optimal Large Language Models, Hoffmann et al, 2022
  6. Sparks of Artificial General Intelligence: Early experiments with GPT-4, Bubeck et al, 2023
  7. Extracting Training Data from Large Language Models, Carlini et al, 2021
  8. Extracting Training Data from Diffusion Models, Carlini et al, 2023
  9. LLaMA: Open and Efficient Foundation Language Models, Touvron, Lavril, Izacard et al, 2023
  10. PromptNER: Prompting for Named Entity Recognition, Ashok, Lipton, June 2023
  11. Alpaca: A Strong, Replicable Instruction-Following Model, Taori, Gulrajani, Zhang, Dubois, Li, et al, March 2023
  12. Self-Instruct: Aligning Language Models with Self-Generated Instructions, Wang, Kordi, Mishra, Liu, Smith, et al, May 2023
  13. Teaching Small Language Models to Reason, Magister, Mallinson, Adamek, Malmi, Severyn, June 2023
  14. Specializing Smaller Language Models towards Multi-Step Reasoning, Fu, Peng, Ou, Sabharwal, Khot, January 2023
Madelyn Goodman
Data Science
Driven by a passion for promoting game changing technologies, Madelyn creates mission-focused content as a Product Marketing Associate at Tonic. With a background in Data Science, she recognizes the importance of community for developers and creates content to galvanize that community.
