Udacity_Differential Privacy

4.8

4.8-1

In this concept, we're going to answer the question: how do we actually use epsilon (ϵ) and delta (δ)? We're going to learn how to take a query and add a certain degree of noise to make what we call a randomized mechanism. We want this randomized mechanism to satisfy a certain degree of differential privacy.

We want to augment a query, like a sum, threshold, mean or average, and add a certain amount of noise to this query so that we get a certain amount of differential privacy. In particular, we're going to leave behind the local differential privacy previously discussed, and instead opt for global differential privacy. As I mentioned earlier, the difference between local and global is that global differential privacy adds noise to the output of a query while local differential privacy adds noise to each data input to the query. So given that we are going for global DP and adding noise to the output, how much noise should we add?

We're going to add the minimum amount required to satisfy a certain level of epsilon and delta, which we will term our privacy budget for a given query. Now, in order to do this, there are two types of noise we could add, as I mentioned earlier: Gaussian noise or Laplacian noise. Generally speaking, Laplacian noise works better, but technically both are still valid and can give us varying levels of epsilon-delta privacy. In this case, we're going to focus exclusively on Laplacian. Now, to the hard question.

How much noise should we add?

The amount of noise necessary to add to the output of the query is a function of four things.

  • First, the amount of noise is dependent on the type of noise that we’re adding. We’re just going to focus on Laplacian here, so that one’s easy.
  • Second, we must take into account the sensitivity of the query that we are using to query the database. As mentioned, some queries are far more sensitive to removing a person from the database than other queries. Some queries have a very consistent sensitivity, in that every database always has the same level of sensitivity for that query type, whereas other queries have varying levels of sensitivity that depend on the database.
  • Then of course, the two other things we must take into account are the desired epsilon and delta.

Thus, for each type of noise that we’re adding, we have a different way of calculating how much noise to add as a function of the sensitivity of the query to meet a certain epsilon-delta constraint. So to restate this, each noise type has a specific function, which tells us how much noise to add given a certain sensitivity, epsilon and delta. For Laplacian noise, this function is the following.

4.8-2

Laplacian noise takes an input parameter beta, which determines how significant the noise is. We set the beta by taking the sensitivity of the query and dividing it by the epsilon that we want to achieve. As it happens, delta is always zero for Laplacian noise, so we can just ignore it.

In other words, if we set beta to be this value when creating our Laplacian noise, then we know we will have a privacy leakage which is less than or equal to a certain amount of epsilon. Furthermore, the nice thing about Laplacian noise is that we don’t have to worry about delta because it’s always set to zero.

β = sensitivity(query) / ϵ
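As a minimal sketch of what this looks like in code (using np.random.laplace, which comes up again later in this lesson; the function name laplacian_mechanism is just for illustration), assuming we already know the query's sensitivity:

```python
import numpy as np

def laplacian_mechanism(true_result, sensitivity, epsilon):
    # beta is the scale of the Laplace distribution: sensitivity / epsilon
    beta = sensitivity / epsilon
    # add zero-centered Laplacian noise to the true query result
    return true_result + np.random.laplace(loc=0.0, scale=beta)
```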

Gaussian noise has a non-zero delta, which is why it's somewhat less desirable. Thus, we're using Laplacian for this exercise. There's a really nice proof for why this is the case, but that proof isn't necessary in order to know how to use Laplacian noise. Furthermore, when reading the literature on differential privacy, you may have heard the term "Laplacian mechanism", which refers to a function being augmented with Laplacian noise in this way, forming the mechanism "N" in the original differential privacy function discussed earlier.

The thing we need to know here, however, is that we can take any query for which we have a measure of sensitivity, choose any arbitrary epsilon budget that we want to preserve, and add the appropriate amount of Laplacian noise to the output of the query. Pretty neat.

In the next project, I want you to do this yourself. First, modify a query for sum with the appropriate amount of Laplacian noise so that you can satisfy a certain epsilon delta constraint. So this new sum query should automatically add the appropriate noise given an arbitrary epsilon level.

For Laplace, you can use the Laplace function np.random.laplace. After you have this mechanism working for the sum function, I then want you to do the same thing for the mean function, scaling the Laplacian noise correctly given the fact that mean has a different level of sensitivity than sum.
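Here is one possible sketch of such noisy sum and mean queries, assuming the kind of 0/1 database used earlier in the course (so the sensitivity of a sum is 1, and of a mean roughly 1/n); the database and function names here are chosen purely for illustration:

```python
import numpy as np

# toy 0/1 database of 100 entries, standing in for the one used earlier
db = np.random.randint(0, 2, size=100).astype(float)

def noisy_sum(db, epsilon):
    # removing one person changes a sum of 0/1 values by at most 1
    sensitivity = 1.0
    return np.sum(db) + np.random.laplace(0.0, sensitivity / epsilon)

def noisy_mean(db, epsilon):
    # the mean changes by at most about 1/len(db), so far less noise is needed
    sensitivity = 1.0 / len(db)
    return np.mean(db) + np.random.laplace(0.0, sensitivity / epsilon)

print(noisy_sum(db, epsilon=0.5))
print(noisy_mean(db, epsilon=0.5))
```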

4.8-3

So Laplacian noise increases or decreases according to a scale parameter beta (β). Before I get to that, there is a wide variety of different kinds of randomized mechanisms. In this course, we're only going to go through a small handful of them, and I highly encourage you, when you do finish this course, to Google around and learn some more about the different kinds of differentially private randomized mechanisms that can be appropriate for different use cases.

Okay, so back to Laplacian noise. The amount of noise you're adding from a Laplacian distribution is increased or decreased according to a scale parameter beta. We choose beta based on the following formula: beta equals the sensitivity of our query, that is, the query we are adding this noise to, divided by epsilon.

This epsilon, again, is something we're spending for every query. So if we're querying a database, every time we do it, we're going to spend this amount of epsilon.

So the notion here is that we have a certain epsilon budget that we want to stay underneath, and that by using this simple formula, we can know how much noise we have to add to the output of these queries in order to make sure that we are preserving privacy.

So in other words, if we set beta to this value, then we know that we'll have a privacy leakage of less than or equal to epsilon. The nice thing about Laplacian noise is that it actually guarantees that we do this with a delta that is equal to zero. So we have those four things right here:

Type of noise, sensitivity, epsilon, and delta. Laplacian noise always has a delta of zero. So if you remember, delta was the probability that we would accidentally leak more than this amount of epsilon; Laplacian is guaranteed to not leak more than this amount of epsilon. Now, one other question you might have: what happens if we want to query repeatedly?

Well, as it happens, if we do query repeatedly, then we can simply add the epsilons across the different queries. So if we have an epsilon of, say, five, we could do five queries that each leak an epsilon of one, for example. This is how the Laplacian mechanism works.
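A small sketch of this budgeting idea, splitting a total epsilon of five evenly across five noisy sum queries (all names and numbers here are illustrative):

```python
import numpy as np

db = np.random.randint(0, 2, size=100).astype(float)

total_epsilon = 5.0        # overall privacy budget for this database
num_queries = 5
epsilon_per_query = total_epsilon / num_queries   # each query may leak epsilon = 1

# under simple sequential composition, the total leakage of all five
# noisy sum queries is at most total_epsilon
noisy_sums = [np.sum(db) + np.random.laplace(0.0, 1.0 / epsilon_per_query)
              for _ in range(num_queries)]
print(noisy_sums)
```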

In the next section, what I would like for you to do is actually perform a sum and a mean query. So you can take the sum and the mean query over the database, using the ones we used previously in the course. I want you to add a certain amount of Laplacian noise to the output, so that you're underneath a certain level of epsilon. In the next lesson, I'll show you how I would do this. See you then.

5 Differential Privacy for Deep Learning

5.1

In the last few lessons, you might have been wondering, what does all this have to do with deep learning? Well, it turns out the same techniques that we were just studying formed the core principles for how differential privacy provides guarantees in the context of deep learning.

Previously, we defined perfect privacy as something like, a query to a database returns the same value even if we remove any person from that database. If we’re able to do that, then no person is really contributing information to the final query and their privacy is protected. We use this intuition in the description of epsilon delta.

In the context of deep learning, we have a similar standard, which is based on these ideas, which instead of querying a database, we’re training a model. Our definition of perfect privacy would then be something like, training a model on a dataset should return the same model even if we remove any person from the training dataset. So we’ve replaced, “querying a database with training a model on a dataset”.

In essence, the training process is actually a query, but one should notice that this adds two points of complexity which databases didn't have.

  • First, do we always know where people are referenced in a training dataset?

In a database, every row corresponded to a person, so it was very easy to calculate the sensitivity because we could just remove individuals. We knew where all of them were. However, in a training dataset, let's say I'm training a sentiment classifier on movie reviews, I have no idea where all the people are referenced inside of that training dataset because it's just a bunch of natural language. So in some cases, this can actually be quite a bit more challenging.

  • Secondly, neural models rarely ever train to the same state, the same location, even when they're trained on the same dataset twice.

So if I train the same deep neural network twice, even if I train over the exact same data, the model is not going to train to the same state. There's already an element of randomness in the training process.

So, how do we actually prove or create training setups where differential privacy is present?

The answer to the first question, by default, seems to be to treat each training example as a single, separate person. Strictly speaking, this is often a bit overzealous, as many examples have no relevance to people at all, while others may have multiple partial individuals contained within that training example. Consider an image which has multiple people contained within it: localizing exactly where people are referenced, and thus how much the model would change if those people were removed, could be quite challenging. But obviously, there's a technique we're about to talk about that tries to overcome this.

The answer to the second question, regarding how models rarely ever train to the same location and how we can know what the sensitivity truly is, has several interesting proposed solutions as well, which we'll be discussing shortly.

But first, let's suppose a new scenario within which we want to train a deep neural network. As mentioned previously, privacy preserving technology is ultimately about protecting data owners from individuals or parties they don't trust. We only want to add as much noise as is necessary to protect these individuals, as adding excess noise needlessly hurts the model accuracy, while failing to add enough noise might expose someone to privacy risk.

Thus, when discussing tools with differential privacy, it’s very important to discuss it in the context of different parties who either do or do not trust each other, so that we can make sure that we’re using an appropriate technique.

5.2 Demo Intro

To ground our discussion of differentially private deep learning, let’s consider a scenario.

Let's say you work for a hospital, and you have a large collection of images about your patients. However, you don't know what's in them. You would like to use these images to develop a neural network which can automatically classify them.

However, since your images aren't labeled, they aren't sufficient to train a classifier. Being a cunning strategist, though, you realize that you can reach out to 10 partner hospitals, which do have annotated data.

It is your hope to train your new classifier on their datasets so you can automatically label your own. While these hospitals are interested in helping, they have privacy concerns regarding information about their own patients. Thus, you will use the following technique to train a classifier which protects the privacy of the patients in the other hospitals.

  1. So first, you’ll ask each of the 10 hospitals to train a model on their own datasets, so generating 10 different models.

  2. Second, you’ll then use each of these 10 partner models to predict on your local dataset generating 10 labels for each of your datapoints for each of your images.

  3. Then, for each local datapoint, now with 10 labels, you will perform a differentially private query to generate a final true label for each example. This query will be a max function, where max is the most frequent label across the 10 labels assigned for each individual image.

  4. We will then need to add Laplacian noise to make this differentially private to a certain epsilon delta constraint.

  5. Finally, we will then retrain a new model on our local dataset, which now has these labels that we have automatically generated.

This will be our final differentially private model. So let's walk through these steps. I will assume you are already familiar with how to train and predict with a deep neural network, so we'll skip steps one and two and work with example data.

We’ll focus instead on step three, namely how to perform the differentially private query for each example using toy data.
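As a rough sketch of steps three and four on toy data (the partner labels, the number of classes, the helper name noisy_max_label, and the epsilon value are all made up for illustration), the differentially private max query could look something like this:

```python
import numpy as np

# 10 partner-hospital predictions for one local image, from 3 possible classes
partner_labels = np.array([0, 1, 1, 1, 2, 1, 0, 1, 2, 1])
num_classes = 3

def noisy_max_label(labels, num_classes, epsilon):
    # count how many partner models voted for each class
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    # adding or removing one partner's vote changes any count by at most 1,
    # so Laplacian noise with scale 1/epsilon covers that sensitivity
    noisy_counts = counts + np.random.laplace(0.0, 1.0 / epsilon, size=num_classes)
    # the class with the highest noisy count becomes the final label
    return int(np.argmax(noisy_counts))

print(noisy_max_label(partner_labels, num_classes, epsilon=0.1))
```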
