In today’s data-driven world, organizations rely on vast amounts of data for analysis, decision making, and business growth. But that data often contains sensitive user information, and users are understandably unhappy about their data being made public. Differential privacy is a technique that can resolve this tension.
Understanding Differential Privacy
Differential privacy is a technique (not an algorithm) that protects people’s privacy while analyzing data. It adds random noise to the data, making it harder to identify individuals. It also ensures that even if someone’s data is included or excluded, it won’t impact the overall results significantly. Differential privacy finds a balance between privacy and data usefulness, making it a valuable tool for privacy protection during data analysis.
Differential privacy reveals nothing about any individual; instead, it lets us learn about the population as a whole.
Can’t we just replace users’ identifying details to protect their privacy?
Replacing user identifiers such as names with dummy details sounds like a good option. But the replacement happens at the organization’s end, so users still have to trust the organization when sharing their information.
How Anonymous is Dummy Data?
Dummy data doesn’t actually provide privacy: the real users’ details can still be recovered. A few historical incidents prove this:
- In 2006, Netflix organized a machine learning competition in which participants had to create an algorithm that could predict how someone would rate a movie. For this event, Netflix provided over 100 million ratings submitted by more than 480,000 users for over 17,000 movies, with the actual user information replaced by dummy identifiers.
But guess what? Users could be re-identified by combining this dataset with public IMDB ratings. Attacks of this type are called linkage attacks.
- Research has shown that a large majority of American citizens can be uniquely identified from just three details: zip code, birth date, and gender. So even after their names are removed, individuals may be easily identified.
The conclusion is that merely changing user information is not enough to protect privacy. Differential privacy, by contrast, neutralizes these types of attacks.
A linkage attack is a method used to link and combine multiple datasets to uncover private or sensitive information about individuals or entities. It poses a privacy risk by re-identifying seemingly anonymous individuals and revealing hidden information.
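To make the risk concrete, here is a minimal sketch of a linkage attack. All names, values, and datasets below are invented for illustration: an “anonymized” medical dataset is joined with a public voter list on shared quasi-identifiers (zip code, birth date, gender).

```python
# Hypothetical "anonymized" medical records: names removed,
# but quasi-identifiers kept.
medical = [
    {"zip": "02138", "dob": "1965-07-31", "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "dob": "1971-02-14", "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical public voter roll, which does include names.
voters = [
    {"name": "Alice Smith", "zip": "02138", "dob": "1965-07-31", "sex": "F"},
]

# Link records whose quasi-identifiers match exactly.
key = lambda r: (r["zip"], r["dob"], r["sex"])
by_key = {key(v): v["name"] for v in voters}
for record in medical:
    if key(record) in by_key:
        print(by_key[key(record)], "->", record["diagnosis"])
# Re-identifies "Alice Smith" even though her name was removed
# from the medical dataset.
```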
How Does Differential Privacy Work?
Let’s take an example: we are collecting data on whether patients have diabetes. Since this is sensitive data and patients are unwilling to share it publicly, we promise to protect their identities.
Related Book: The Algorithmic Foundations of Differential Privacy
Let’s see how a differentially private algorithm (known as randomized response) provides privacy to these patients. After collecting each answer, we modify it using the following method:
- Flip a coin.
- If it is heads, send the real answer.
- If it is tails, flip the coin again.
- If it is heads, send No, i.e. the patient doesn’t have diabetes.
- If it is tails, send Yes, i.e. the patient has diabetes.
With this method, you can never be sure whether a given patient’s reported answer (diabetes or not) is real.
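The coin-flip procedure above can be sketched in a few lines of Python. The function name and 50/50 coin are just one concrete reading of the steps:

```python
import random

def randomized_response(has_diabetes: bool) -> bool:
    """Report a patient's answer using the coin-flip method above."""
    if random.random() < 0.5:      # first flip: heads
        return has_diabetes        # send the real answer
    # first flip was tails: flip again and send a random answer
    return random.random() < 0.5   # heads -> one answer, tails -> the other

# Every reported "Yes" is plausibly deniable: it may be the truth,
# or it may just be the outcome of two coin flips.
```

Note that a truthful “Yes” is reported as “Yes” with probability 0.75 (0.5 from the first heads, plus 0.5 × 0.5 from the second flip), which is what makes the aggregate statistics recoverable later.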
Does changing the data (adding noise) affect the analysis?
Adding noise will definitely affect the analysis when the dataset is small. But for large datasets the overall insights are preserved, because the noise is random and averages out.
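We can see why the noise averages out. Under the coin-flip scheme, each report is “Yes” with probability 0.5·p + 0.25, where p is the true proportion of patients with diabetes, so p ≈ 2 × (observed fraction) − 0.5. A simulation sketch (the 30% true rate and sample size are arbitrary choices for illustration):

```python
import random

def estimate_true_rate(reports: list[bool]) -> float:
    """Recover the true 'Yes' rate from randomized responses.

    Each report is 'Yes' with probability 0.5*p + 0.25, where p is
    the true rate, so p = 2*observed - 0.5.
    """
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

# Simulate 100,000 patients, roughly 30% of whom truly have diabetes.
random.seed(42)
true_answers = [random.random() < 0.3 for _ in range(100_000)]

# Apply the coin-flip mechanism to each true answer.
reports = [
    ans if random.random() < 0.5 else random.random() < 0.5
    for ans in true_answers
]

print(estimate_true_rate(reports))  # estimate close to 0.30
```

No individual report can be trusted, yet the population-level statistic is recovered with high accuracy.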
Related Link: Google's Differential Privacy Github