Big data is only getting bigger. And as that happens, users are getting more and more uncomfortable with just how much organizations and advertisers know about them, their habits, and their preferences. (Understandably.)
Legislation like the California Consumer Privacy Act (CCPA) and General Data Protection Regulation (GDPR) has imposed major restrictions on data collection and privacy approaches. That means analysts, big data firms, and other companies have to rethink how they keep private data safe.
Differential privacy is one of the newer and more innovative approaches to address both user squeamishness and data privacy restrictions. Today, we’ll look at how differential privacy works on a practical level—and what implementing it can do for your organization.
Okay, so. Differential privacy works by introducing random noise into data sets to make it harder to tie specific data records back to specific individuals, thereby preserving privacy. It’s often used when looking at the data in aggregate — like taking averages — and provides plausible deniability to individuals represented in the data. Organizations can still use machine learning and other means to analyze data sets and compute statistics like mean, median, mode, and more. But differentially private algorithms obscure specific responses, correlations, and connections between individual data points/people.
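To make that concrete, here's a minimal sketch of one common way to add that noise, the Laplace mechanism, applied to a simple counting query. The function name, the toy data, and the epsilon value are illustrative assumptions, not part of any particular product or library:

```python
# A minimal sketch of the Laplace mechanism, one standard way to add
# differentially private noise. Names, data, and epsilon are illustrative.
import numpy as np

def dp_count(records, predicate, epsilon=0.5):
    """Return a noisy count of records matching `predicate`.

    A counting query has sensitivity 1 (adding or removing one person
    changes the true count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: how many users in the data set are over 40?
ages = [23, 45, 31, 67, 52, 38, 41, 29]
print(dp_count(ages, lambda age: age > 40, epsilon=0.5))
```

Lower values of epsilon mean more noise (stronger privacy, less accuracy); higher values mean less noise.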
This also means that analyses or queries performed on a data set through a differentially private algorithm will yield very similar results whether any one individual's record is added to the data set, removed from it, or modified. This ensures personal privacy and provides a strict, formal (aka mathematical) assurance that personal or private information won't be leaked or used to re-identify individuals. (Yay!)
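Continuing the toy sketch above, here's what that stability looks like in practice: run the same noisy count on two data sets that differ by one person's record, and the released answers are hard to tell apart. The data and the "Steve" record are made up for illustration:

```python
# Continuing the sketch above: dp_count() is the function from the
# previous example, with noise scale 1/epsilon = 2.
import numpy as np

with_steve    = [23, 45, 31, 67, 52, 38, 41, 29]
without_steve = [23, 45, 31, 67, 52, 38, 29]   # Steve's record (41) removed

runs = 10_000
a = [dp_count(with_steve, lambda age: age > 40) for _ in range(runs)]
b = [dp_count(without_steve, lambda age: age > 40) for _ in range(runs)]

# The true counts differ by exactly 1, but the noise is wide enough that
# any single released answer is plausible under either data set, so an
# observer can't confidently tell whether Steve was included.
print(np.mean(a), np.std(a))   # roughly 4, spread of about 2.8
print(np.mean(b), np.std(b))   # roughly 3, spread of about 2.8
```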
All that said, differential privacy does not necessarily mean that all information about individuals is 100% private.
Huh?
Well, differential privacy can mathematically guarantee that nothing specific to a single individual, such as whether they participated or what their record contains, can be learned from the released results.
However, differential privacy does not guarantee that general conclusions about the population as a whole will stay hidden.
The difference lies in what counts as "individual" versus "general" information.
Individual information is data tied to a specific person (think PII or PHI) that needs to remain private, and should therefore be unidentifiable after it's fed into a statistical analysis. General information describes the population as a whole: trends, averages, and patterns that don't point back to any one person.
Think of it like this: a health study might reveal that smoking raises the risk of cancer. That general conclusion affects every smoker, including people who never joined the study, but a differentially private analysis won't reveal whether any particular participant is a smoker.
So you can see why differentially private analysis is an important statistical and algorithmic framework, and how it can play a major role in useful, privacy-preserving data collection and analytics.
Many industries and companies are already leveraging differential privacy approaches and algorithms. Apple, for example, uses it when collecting usage statistics from iPhones, Google has used it to gather browsing telemetry in Chrome, and the US Census Bureau applied it to protect respondents in the 2020 census (more on that below).
Differentially private algorithms are crucial for businesses for a variety of reasons: they help satisfy regulations like the CCPA and GDPR, they preserve the ability to run meaningful analytics on sensitive data sets, and they build trust with users who are increasingly wary of how their data is handled.
Differential privacy works to appease folks who want their data to be more secure (i.e. everyone) without removing companies’ ability to perform big data analytics.
While differentially private procedures can be very valuable, they come with downsides, and there are circumstances where you really don't need them.
Specifically, differential privacy approaches aren't needed when the data is already public or contains nothing sensitive, when the data set is so small that the added noise would drown out the signal, or when you need exact, record-level answers rather than aggregate statistics.
When it comes to these tradeoffs, a great example is one we briefly mentioned above: the US Census Bureau.
Historically, the Census Bureau has had difficulty collecting data from at-risk individuals or populations, including people of color. Certain segments of the population have feared that their data, if volunteered, could be used against them.
To solve this, the US Census Bureau implemented privacy protections for the 2020 census using a differential privacy approach. They introduced data noise at the neighborhood and census block level, though the proverbial data fog lifts at the state level and above.
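Here's a rough sketch of that idea (decidedly not the Bureau's actual TopDown algorithm, and the numbers are made up): noise that is noticeable at the block level mostly washes out once blocks are rolled up into a larger geography.

```python
# Toy illustration: inject Laplace noise into small block-level counts,
# then roll the blocks up. Relative error shrinks in the aggregate.
import numpy as np

rng = np.random.default_rng(0)
true_blocks = rng.integers(10, 200, size=1000)            # hypothetical census blocks
noisy_blocks = true_blocks + rng.laplace(0, 2.0, size=1000)

block_rel_error = np.mean(np.abs(noisy_blocks - true_blocks) / true_blocks)
state_rel_error = abs(noisy_blocks.sum() - true_blocks.sum()) / true_blocks.sum()

print(f"average relative error per block: {block_rel_error:.1%}")
print(f"relative error for the rolled-up total: {state_rel_error:.3%}")
```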
According to the Senior Advisor for Data Access and Privacy, Michael Hawes, this was a step backward in data accuracy for census blocks. He felt that the differentially private mechanism directly and negatively affected the conclusions made from the 2020 census.
It turns out that, in this case, the noise introduced for privacy could have wide-reaching consequences for the next decade in terms of redistricting, zoning, and more.
The best data is both ethical and effective.
Organizations and individuals should take note of the use cases and benefits of differential privacy. It’s up to you to use it only when appropriate.
All in all, differential privacy is a powerful approach to ensuring the protection of private data. It still allows organizations and firms to continue to learn about large populations or user groups, but securely de-identified data means they don’t know it’s Steve’s shopping cart that they’re looking at. You’re welcome, Steve.
When used responsibly, differential privacy makes it possible to analyze private data that would otherwise be off-limits, thanks to improvements in theoretical computer science. It can help organizations make the most of their privacy budgets (the cumulative cap on how much information their combined queries are allowed to leak), and perhaps best of all, it offers consumers peace of mind when it comes to how their sensitive data is being handled.
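If you're curious what a privacy budget looks like in practice, here's a toy illustration (not any particular library's API): each query spends part of a total epsilon, and the basic composition rule says those expenditures add up.

```python
# A toy privacy budget: four queries at epsilon = 0.25 each exhaust a
# total budget of 1.0, after which no further queries are allowed.
import numpy as np

class PrivacyBudget:
    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon

budget = PrivacyBudget(total_epsilon=1.0)
ages = [23, 45, 31, 67, 52, 38, 41, 29]

for _ in range(4):
    budget.spend(0.25)
    noisy = len(ages) + np.random.laplace(0, 1 / 0.25)
    print(round(noisy, 1), "people (noisy count)")

# A fifth spend(0.25) would raise an error: the epsilons add up, so the
# analyst has to stop here or accept weaker guarantees.
```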
That’s the bottom line: Ethical, useful data is the only data you really need.
While differential privacy is not a one-size-fits-all solution, it’s a major element of data de-identification — one that should absolutely be in your privacy solution toolkit.
Want to learn more about protecting PII/PHI with state-of-the-art privacy solutions? Check out our ebook, The Subtle Art Of Giving A F*** About Data Privacy!