Part of a series on differential privacy. You might want to start with the previous articles below!
When people talk about differential privacy, they don't always have the same thing in mind. People agree on the definition, but context also matters a lot. There are two "flavors" of differential privacy: the global model and the local model. In this post, I'll first explain the difference between the two options. Then, I'll give a glimpse of an exciting new idea that combines them.
What do I mean by "context"? In abstract terms, a differentially private algorithm transforms an input into an output.
The main difference between these two models is: who has access to the raw input?
In the global model of differential privacy, a central aggregator has access to the real data. What is this "aggregator"? Typically, it's a service or a research organization collecting data about individuals. In this model, each user sends their data to this aggregator without noise. The aggregator takes this data, and transforms it with a differentially private mechanism.
The differentially private mechanism is only applied once, at the end of the process. The aggregator can then e.g. publish the result or share it with third parties.
This model has one big advantage: accuracy. In the global model, you usually don't need to add a lot of noise to get valuable results with a low ε. Remember when I explained how to release statistics or histograms using differential privacy? Those examples used this global model. It worked pretty well: only a little noise was needed to hide someone in a count.
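To make this concrete, here is a minimal sketch of the global model in Python. It assumes the simplest possible setting: the aggregator holds the raw records and releases a single count using the Laplace mechanism. The function and variable names are mine, for illustration only.

```python
import numpy as np

def private_count(records, predicate, epsilon):
    """Global model: the aggregator sees the raw records and adds noise once.

    A counting query has sensitivity 1 (adding or removing one person changes
    the count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for record in records if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# The aggregator holds the raw data...
raw_data = [{"age": 34, "smoker": True}, {"age": 52, "smoker": False}]
# ...and only the final statistic gets noised before publication.
print(private_count(raw_data, lambda r: r["smoker"], epsilon=1.0))
```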
Where's the catch, then? Well, the central aggregator needs to know the real data. In the scenarios above, we added noise to real counts. This is only possible if we know the true numbers in the first place… To enable this, each user has to trust the aggregator enough to share data with it. That might be difficult: the aggregator can be an untrusted company or government. Also, with the global model, all the data is collected in one place. This increases the risk of catastrophic failure, for example if the aggregator gets hacked and leaks all the data.
The most famous real-world example of the global model is probably the US Census. In 2020, the US Census will use differential privacy to anonymize the data before publication. This is pretty exciting! You can read more about it here.
What's the alternative, then? It's the local model of differential privacy. In this model, there is still an aggregator, but they no longer have access to the real data. Instead, each user applies a differentially private mechanism to their own data. And they only send their data to the aggregator once it's already anonymized.
After collecting this noisy data, the aggregator can compute some statistics, and publish them. This last step doesn't need to be differentially private: the data is anonymous to begin with. In theory, the aggregator could publish the entire dataset they collected.
The big advantage of this model is that it no longer requires trust. Since each user is protecting their own data, they're safe even if the aggregator is malicious. This makes the local model well-suited to situations where trust is difficult to establish. And we already saw an example of this! Remember the survey about drug use that used randomized response to gather data? The scheme allowed subjects to answer honestly without admitting to breaking the law. This is a typical application of the local model.
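As a refresher, here is a minimal sketch of randomized response, the local mechanism from that survey. The exact coin-flipping scheme can vary; this version, where each user ends up answering honestly with probability 3/4, satisfies ε = ln(3).

```python
import random

def randomized_response(truthful_answer: bool) -> bool:
    """Each user runs this locally, before sending anything to the aggregator.

    Flip a coin: on heads, answer honestly; on tails, flip a second coin and
    report that instead. The report is truthful with probability 3/4 overall.
    """
    if random.random() < 0.5:
        return truthful_answer       # honest answer
    return random.random() < 0.5     # random answer, independent of the truth

# The aggregator only ever sees the noisy bits.
reported = randomized_response(truthful_answer=True)
```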
Can you guess the drawback of this model? Since each user must add noise to their own data, the total noise is much larger. You typically need many more users than in the global model to get useful results. To mitigate this problem, practical applications often use high values of ε.
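To see where the extra noise comes from, here is a sketch of what the aggregator could do with the noisy answers from the randomized response sketch above: invert the randomization to get an unbiased estimate of the true rate. The estimate's error only shrinks roughly like 1/√n, which is why so many users are needed. The helper name is mine, and the snippet reuses the randomized_response function defined earlier.

```python
import random

def estimate_true_rate(reported_answers):
    """Invert the randomized response above: each report is truthful with
    probability 1/2 and a fair coin flip otherwise, so
    P(report "yes") = 0.5 * true_rate + 0.25. Solve for the true rate."""
    observed = sum(reported_answers) / len(reported_answers)
    return 2 * observed - 0.5

# With many users, the estimate converges to the true rate;
# with only a few, it is drowned in the per-user noise.
true_rate = 0.3
answers = [randomized_response(random.random() < true_rate)
           for _ in range(100_000)]
print(estimate_true_rate(answers))  # close to 0.3 for large samples
```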
Besides randomized response, the most famous example of this model is probably RAPPOR. This clever scheme was invented to collect differentially private data in Google Chrome. Another example, a bit more recent, is the mechanism that Apple uses to collect data on the iOS keyboard.
Until very recently, there was no middle ground between the two options. The choice was binary: either accept a much larger level of noise, or collect raw data. This is starting to change, thanks to recent work on a novel type of architecture. This new design is called ESA: Encode, Shuffle, Analyze. In the two previous models, there were only two types of participants: users and the aggregator. By contrast, the ESA architecture has three elements.
In ESA, the data is encrypted twice, in two layers. The shuffler can only decrypt the first layer. It contains the user IDs, and something called a "group ID". This group ID describes what kind of data this is, but not the actual value of the data. For example, the group ID could be a label like "app latency for iOS", while the actual data would be the latency value.
The shuffler then groups all reports by group ID and counts how many users are in each group. If there are enough users in a group, it passes all their reports to the analyzer. The analyzer can then decrypt the second layer of the data, and compute the output.
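Here is a toy sketch of that shuffling step, based only on the description above. The encryption layers are omitted, and the threshold value and names are illustrative, not taken from the ESA paper.

```python
import random
from collections import defaultdict

MIN_GROUP_SIZE = 1000  # illustrative threshold, not the value from the paper

def shuffle_reports(reports):
    """Toy shuffler: `reports` is a list of dicts, each with a user_id,
    a group_id, and a payload the shuffler cannot decrypt (the analyzer's
    layer). It drops small groups, shuffles, and strips the user IDs."""
    groups = defaultdict(list)
    for report in reports:
        groups[report["group_id"]].append(report["encrypted_payload"])

    forwarded = {}
    for group_id, payloads in groups.items():
        if len(payloads) >= MIN_GROUP_SIZE:  # drop small, identifying groups
            random.shuffle(payloads)         # break the link to user IDs
            forwarded[group_id] = payloads   # no user_id is forwarded
    return forwarded
```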
Having two layers separates the data from its metadata. The shuffler can see IDs, but not the actual data. Meanwhile, the analyzer only sees the data in batches, and cannot know who sent what. The magic comes from tweaking what the shuffler does: adding randomness in well-chosen places ensures that the final output is differentially private.
This nice property comes from the separation between shuffler and analyzer. Of course, if the same entity runs both, that benefit disappears. This design can still be useful! In particular, catastrophic failures are much less likely. But from a trust perspective, it becomes like the global model. So there are two options:
Both options are probably a few years away, but I find them pretty exciting. They could, in theory, bring the best of both worlds: the low noise levels of the global model, and the trustless aspect of the local model. If you want to learn more about it, you can check out the paper that introduced this design.
What comes next
What will we be learning next? I'm not sure yet. I recently wrote a research paper about differential privacy variants, so I'll cover that here at some point. Apart from that, here are a few ideas…
If you have opinions about what you want to learn next, let me know — my contact info is below.