Big data is only getting bigger. And as that happens, users are getting more and more uncomfortable with just how much organizations and advertisers know about them, their habits, and their preferences. (Understandably.)
Legislation like the California Consumer Privacy Act (CCPA) and General Data Protection Regulation (GDPR) has imposed major restrictions on data collection and privacy approaches. That means analysts, big data firms, and other companies have to rethink how they keep private data safe.
Differential privacy is one of the newer and more innovative approaches to address both user squeamishness and data privacy restrictions. Today, we’ll look at how differential privacy works on a practical level—and what implementing it can do for your organization.
Okay, so. Differential privacy works by introducing random noise into data sets to make it harder to tie specific data records back to specific individuals, thereby preserving privacy. It’s often used when looking at the data in aggregate — like taking averages — and provides plausible deniability to individuals represented in the data. Organizations can still use machine learning and other means to analyze data sets and compute statistics like mean, median, mode, and more. But differentially private algorithms obscure specific responses, correlations, and connections between individual data points/people.
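To make that concrete, here's a minimal sketch of one common way to add that noise, the Laplace mechanism, applied to a simple counting query. The function name, the toy data, and the epsilon value are illustrative assumptions, not part of any particular product or library:

```python
# A minimal sketch of the Laplace mechanism, one standard way to add
# differentially private noise. Names, data, and epsilon are illustrative.
import numpy as np

def dp_count(records, predicate, epsilon=0.5):
    """Return a noisy count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person
    changes the true count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: how many users in the data set are over 40?
ages = [23, 45, 31, 67, 52, 38, 41, 29]
print(dp_count(ages, lambda age: age > 40, epsilon=0.5))
```

Lower values of epsilon mean more noise (stronger privacy, less accuracy); higher values mean less noise.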
This also means that analyses or queries performed on a data set through a differentially private algorithm will yield very similar results whether any one individual's record is added to the data set, removed from it, or modified. This ensures personal privacy and provides a strict, formal (aka mathematical) assurance that personal or private information won't be leaked or used to re-identify individuals. (Yay!)
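Continuing the toy sketch above, here's what that stability looks like in practice: run the same noisy count on two data sets that differ by one person's record, and the released answers are hard to tell apart. The data and the "Steve" record are made up for illustration:

```python
# Continuing the sketch above: dp_count() is the function from the
# previous example, with noise scale 1/epsilon = 2.
import numpy as np

with_steve    = [23, 45, 31, 67, 52, 38, 41, 29]
without_steve = [23, 45, 31, 67, 52, 38, 29]   # Steve's record (41) removed

runs = 10_000
a = [dp_count(with_steve, lambda age: age > 40) for _ in range(runs)]
b = [dp_count(without_steve, lambda age: age > 40) for _ in range(runs)]

# The true counts differ by exactly 1, but the noise is wide enough that
# any single released answer is plausible under either data set, so an
# observer can't confidently tell whether Steve was included.
print(np.mean(a), np.std(a))   # roughly 4, spread of about 2.8
print(np.mean(b), np.std(b))   # roughly 3, spread of about 2.8
```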
All that said, differential privacy does not necessarily mean that all information about individuals is 100% private.
Huh?
Well, differential privacy can mathematically guarantee that nothing specific to a single individual, such as whether they participated or what their record contains, can be learned from the released results.
However, differential privacy does not guarantee that general conclusions about the population as a whole will stay hidden.
The difference lies in what counts as "individual" versus "general" information.
Individual information is data tied to a specific person (think PII or PHI) that needs to remain private, and should therefore be unidentifiable after it's fed into a statistical analysis. General information describes the population as a whole: trends, averages, and patterns that don't point back to any one person.
Think of it like this: a health study might reveal that smoking raises the risk of cancer. That general conclusion affects every smoker, including people who never joined the study, but a differentially private analysis won't reveal whether any particular participant is a smoker.
So you can see why differentially private analysis is an important statistical and algorithmic framework, and how it can play a major role in useful, privacy-preserving data collection and analytics.
Many industries and companies are already leveraging differential privacy approaches and algorithms. Apple, for example, uses it when collecting usage statistics from iPhones, Google has used it to gather browsing telemetry in Chrome, and the US Census Bureau applied it to protect respondents in the 2020 census (more on that below).
Differentially private algorithms are crucial for businesses for a variety of reasons: they help satisfy regulations like the CCPA and GDPR, they preserve the ability to run meaningful analytics on sensitive data sets, and they build trust with users who are increasingly wary of how their data is handled.
Differential privacy works to appease folks who want their data to be more secure (i.e. everyone) without removing companies’ ability to perform big data analytics.
While differentially private procedures can be very valuable, they come with downsides, and there are circumstances where you really don't need them.
Specifically, differential privacy approaches aren't needed when the data is already public or contains nothing sensitive, when the data set is so small that the added noise would drown out the signal, or when you need exact, record-level answers rather than aggregate statistics.
When it comes to these tradeoffs, a great example is one we briefly mentioned above: the US Census Bureau.
Historically, the Census Bureau has had difficulty collecting data from at-risk individuals or populations, including people of color. Certain segments of the population have feared that their data, if volunteered, could be used against them.
To solve this, the US Census Bureau implemented privacy protections for the 2020 census using a differential privacy approach. They introduced data noise at the neighborhood and census block level, though the proverbial data fog lifts at the state level and above.
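Here's a rough sketch of that idea (decidedly not the Bureau's actual TopDown algorithm, and the numbers are made up): noise that is noticeable at the block level mostly washes out once blocks are rolled up into a larger geography.

```python
# Toy illustration: inject Laplace noise into small block-level counts,
# then roll the blocks up. Relative error shrinks in the aggregate.
import numpy as np

rng = np.random.default_rng(0)
true_blocks = rng.integers(10, 200, size=1000)            # hypothetical census blocks
noisy_blocks = true_blocks + rng.laplace(0, 2.0, size=1000)

block_rel_error = np.mean(np.abs(noisy_blocks - true_blocks) / true_blocks)
state_rel_error = abs(noisy_blocks.sum() - true_blocks.sum()) / true_blocks.sum()

print(f"average relative error per block: {block_rel_error:.1%}")
print(f"relative error for the rolled-up total: {state_rel_error:.3%}")
```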
According to the Senior Advisor for Data Access and Privacy, Michael Hawes, this was a step backward in data accuracy for census blocks. He felt that the differentially private mechanism directly and negatively affected the conclusions made from the 2020 census.
It turns out that, in this case, the noise introduced for privacy could have wide-reaching consequences for the next decade in terms of redistricting, zoning, and more.
The best data is both ethical and effective.
Organizations and individuals should take note of the use cases and benefits of differential privacy. It’s up to you to use it only when appropriate.
All in all, differential privacy is a powerful approach to ensuring the protection of private data. It still allows organizations and firms to continue to learn about large populations or user groups, but securely de-identified data means they don’t know it’s Steve’s shopping cart that they’re looking at. You’re welcome, Steve.
When used responsibly, differential privacy makes it possible to analyze private data that would otherwise be off-limits, thanks to improvements in theoretical computer science. It can help organizations make the most of their privacy budgets (the cumulative cap on how much information their combined queries are allowed to leak), and perhaps best of all, it offers consumers peace of mind when it comes to how their sensitive data is being handled.
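If you're curious what a privacy budget looks like in practice, here's a toy illustration (not any particular library's API): each query spends part of a total epsilon, and the basic composition rule says those expenditures add up.

```python
# A toy privacy budget: four queries at epsilon = 0.25 each exhaust a
# total budget of 1.0, after which no further queries are allowed.
import numpy as np

class PrivacyBudget:
    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon

budget = PrivacyBudget(total_epsilon=1.0)
ages = [23, 45, 31, 67, 52, 38, 41, 29]

for _ in range(4):
    budget.spend(0.25)
    noisy = len(ages) + np.random.laplace(0, 1 / 0.25)
    print(round(noisy, 1), "people (noisy count)")

# A fifth spend(0.25) would raise an error: the epsilons add up, so the
# analyst has to stop here or accept weaker guarantees.
```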
That’s the bottom line: Ethical, useful data is the only data you really need.
While differential privacy is not a one-size-fits-all solution, it’s a major element of data de-identification — one that should absolutely be in your privacy solution toolkit.
Want to learn more about protecting PII/PHI with state-of-the-art privacy solutions? Check out our ebook, The Subtle Art Of Giving A F*** About Data Privacy!