forked from 626_privacy/tensorflow_privacy
add walkthrough MD
PiperOrigin-RevId: 241016765
This commit is contained in:
parent a8e54d928c
commit 8507094f2b
2 changed files with 434 additions and 1 deletion
README.md | 11

@@ -65,7 +65,16 @@ Github pull requests. To speed the code review process, we ask that:

 ## Tutorials directory

-To help you get started with the functionalities provided by this library, the
+To help you get started with the functionalities provided by this library, we
+provide a detailed walkthrough [here](tutorials/walkthrough/walkthrough.md) that
+will teach you how to wrap existing optimizers
+(e.g., SGD, Adam, ...) into their differentially private counterparts using
+TensorFlow (TF) Privacy. You will also learn how to tune the parameters
+introduced by differentially private optimization and how to
+measure the privacy guarantees provided using analysis tools included in TF
+Privacy.
+
+In addition, the
 `tutorials/` folder comes with scripts demonstrating how to use the library
 features. The list of tutorials is described in the README included in the
 tutorials directory.

tutorials/walkthrough/walkthrough.md | 424 (new file)

@@ -0,0 +1,424 @@

# Machine Learning with Differential Privacy in TensorFlow

*Cross-posted from [cleverhans.io](http://www.cleverhans.io/privacy/2019/03/26/machine-learning-with-differential-privacy-in-tensorflow.html)*

Differential privacy is a framework for measuring the privacy guarantees
provided by an algorithm. Through the lens of differential privacy, we can
design machine learning algorithms that responsibly train models on private
data. Learning with differential privacy provides provable guarantees of
privacy, mitigating the risk of exposing sensitive training data in machine
learning. Intuitively, a model trained with differential privacy should not be
affected by any single training example, or small set of training examples, in its data set.

You may recall our [previous blog post on PATE](http://www.cleverhans.io/privacy/2018/04/29/privacy-and-machine-learning.html),
an approach that achieves private learning by carefully
coordinating the activity of several different ML
models [[Papernot et al.]](https://arxiv.org/abs/1610.05755).
In this post, you will learn how to train a differentially private model with
another approach that relies on Differentially
Private Stochastic Gradient Descent (DP-SGD) [[Abadi et al.]](https://arxiv.org/abs/1607.00133).
DP-SGD and PATE are two different ways to achieve the same goal of privacy-preserving
machine learning. DP-SGD makes fewer assumptions about the ML task than PATE,
but this comes at the expense of making modifications to the training algorithm.

Indeed, DP-SGD is
a modification of the stochastic gradient descent algorithm,
which is the basis for many optimizers that are popular in machine learning.
Models trained with DP-SGD have provable privacy guarantees expressed in terms
of differential privacy (we will explain what this means at the end of this
post). We will be using the [TensorFlow Privacy](https://github.com/tensorflow/privacy) library,
which provides an implementation of DP-SGD, to illustrate the algorithm and
provide a hands-on tutorial.

The only prerequisite for following this tutorial is to be able to train a
simple neural network with TensorFlow. If you are not familiar with
convolutional neural networks or how to train them, we recommend reading
[this tutorial first](https://www.tensorflow.org/tutorials/keras/basic_classification)
to get started with TensorFlow and machine learning.

Upon completing the tutorial presented in this post,
you will be able to wrap existing optimizers
(e.g., SGD, Adam, ...) into their differentially private counterparts using
TensorFlow (TF) Privacy. You will also learn how to tune the parameters
introduced by differentially private optimization. Finally, you will learn how to
measure the privacy guarantees provided using analysis tools included in TF
Privacy.

## Getting started

Before we get started with DP-SGD and TF Privacy, we need to put together a
script that trains a simple neural network with TensorFlow.

In the interest of keeping this tutorial focused on the privacy aspects of
training, we've included
such a script as companion code for this blog post in the `walkthrough` [subdirectory](https://github.com/tensorflow/privacy/tree/master/tutorials/walkthrough) of the
`tutorials` directory in the [TensorFlow Privacy](https://github.com/tensorflow/privacy) repository. The code found in the file `mnist_scratch.py`
trains a small
convolutional neural network on the MNIST dataset for handwriting recognition.
This script will be used as the basis for our exercise below.

Next, we highlight some important code snippets from the `mnist_scratch.py`
script.

The first snippet includes the definition of a convolutional neural network
using `tf.keras.layers`. The model contains two convolutional layers coupled
with max pooling layers, a fully-connected layer, and a softmax. The model's
output is a vector where each component indicates how likely the input is to be
in one of the 10 classes of the handwriting recognition problem we consider.
If any of this sounds unfamiliar, we recommend reading
[this tutorial first](https://www.tensorflow.org/tutorials/keras/basic_classification)
to get started with TensorFlow and machine learning.

```python
input_layer = tf.reshape(features['x'], [-1, 28, 28, 1])
y = tf.keras.layers.Conv2D(16, 8,
                           strides=2,
                           padding='same',
                           activation='relu').apply(input_layer)
y = tf.keras.layers.MaxPool2D(2, 1).apply(y)
y = tf.keras.layers.Conv2D(32, 4,
                           strides=2,
                           padding='valid',
                           activation='relu').apply(y)
y = tf.keras.layers.MaxPool2D(2, 1).apply(y)
y = tf.keras.layers.Flatten().apply(y)
y = tf.keras.layers.Dense(32, activation='relu').apply(y)
logits = tf.keras.layers.Dense(10).apply(y)
predicted_labels = tf.argmax(input=logits, axis=1)
```

The second snippet shows how the model is trained using the `tf.Estimator` API,
which takes care of all the boilerplate code required to form minibatches used
to train and evaluate the model. To prepare ourselves for the modifications we
will be making to provide differential privacy, we still expose the loop over
different epochs of learning: an epoch is defined as one pass over all of the
training points included in the training set.

```python
steps_per_epoch = 60000 // FLAGS.batch_size
for epoch in range(1, FLAGS.epochs + 1):
  # Train the model for one epoch.
  mnist_classifier.train(input_fn=train_input_fn, steps=steps_per_epoch)

  # Evaluate the model and print results
  eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
  test_accuracy = eval_results['accuracy']
  print('Test accuracy after %d epochs is: %.3f' % (epoch, test_accuracy))
```

We are now ready to train our MNIST model without privacy. The model should
achieve above 99% test accuracy after 15 epochs at a learning rate of 0.15 on
minibatches of 256 training points.

```shell
python mnist_scratch.py
```

### Stochastic Gradient Descent

Before we dive into how DP-SGD and TF Privacy can be used to provide differential privacy
during machine learning, we first provide a brief overview of the stochastic
gradient descent algorithm, which is one of the most popular optimizers for
neural networks.

Stochastic gradient descent is an iterative procedure. At each iteration, a
batch of data is randomly sampled from the training set (this is where the
*stochasticity* comes from). The error between the model's prediction and the
training labels is then computed. This error, also called the loss, is then
differentiated with respect to the model's parameters. These derivatives (or
*gradients*) tell us how we should update each parameter to bring the model
closer to predicting the correct label. Iteratively recomputing gradients and
applying them to update the model's parameters is what is referred to as the
*descent*. To summarize, the following steps are repeated until the model's
performance is satisfactory:

1. Sample a minibatch of training points `(x, y)` where `x` is an input and `y`
   a label.

2. Compute loss (i.e., error) `L(theta, x, y)` between the model's prediction
   `f_theta(x)` and label `y` where `theta` represents the model parameters.

3. Compute gradient of the loss `L(theta, x, y)` with respect to the model
   parameters `theta`.

4. Multiply these gradients by the learning rate and apply the product to
   update model parameters `theta`.
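
To make these steps concrete, here is a minimal, framework-free sketch of one SGD update. It is an illustration only: `grad_fn` is a hypothetical helper that returns the gradient of the loss for a single labeled example, not a function from the tutorial's code.

```python
import numpy as np

def sgd_step(theta, minibatch, grad_fn, learning_rate=0.15):
  """One plain (non-private) SGD update over a sampled minibatch (step 1)."""
  # Steps 2-3: compute per-example loss gradients and average them.
  grads = [grad_fn(theta, x, y) for x, y in minibatch]
  avg_grad = np.mean(grads, axis=0)
  # Step 4: scale by the learning rate and update the parameters.
  return theta - learning_rate * avg_grad
```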

### Modifications needed to make stochastic gradient descent a differentially private algorithm

Two modifications are needed to ensure that stochastic gradient descent is a
differentially private algorithm.

First, the sensitivity of each gradient needs to be bounded. In other words, we
need to limit how much each individual training point sampled in a minibatch can
influence the resulting gradient computation. This can be done by clipping each
gradient computed on each training point between steps 3 and 4 above.
Intuitively, this allows us to bound how much each training point can possibly
impact model parameters.

Second, we need to randomize the algorithm's behavior to make it statistically
impossible to know whether or not a particular point was included in the
training set by comparing the updates that stochastic gradient descent applies
when it operates with or without this particular point in the training set.
This is achieved by sampling random noise and adding it to the clipped gradients.

Thus, here is the stochastic gradient descent algorithm adapted from above to be
differentially private:

1. Sample a minibatch of training points `(x, y)` where `x` is an input and `y`
   a label.

2. Compute loss (i.e., error) `L(theta, x, y)` between the model's prediction
   `f_theta(x)` and label `y` where `theta` represents the model parameters.

3. Compute gradient of the loss `L(theta, x, y)` with respect to the model
   parameters `theta`.

4. Clip gradients, per training example included in the minibatch, to ensure
   each gradient has a known maximum Euclidean norm.

5. Add random noise to the clipped gradients.

6. Multiply these clipped and noised gradients by the learning rate and apply
   the product to update model parameters `theta`.
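
For comparison with the plain SGD sketch above, here is the same kind of illustrative sketch with the two DP-SGD modifications added: per-example clipping (step 4) and Gaussian noise (step 5). This follows the algorithm as described in this post rather than the TF Privacy implementation; `grad_fn` is again a hypothetical per-example gradient helper.

```python
import numpy as np

def dp_sgd_step(theta, minibatch, grad_fn,
                learning_rate=0.15, l2_norm_clip=1.5, noise_multiplier=1.3):
  """One DP-SGD update: clip per-example gradients, add noise, then descend."""
  clipped = []
  for x, y in minibatch:
    g = grad_fn(theta, x, y)  # steps 2-3: per-example gradient
    # Step 4: rescale so the gradient's Euclidean norm is at most l2_norm_clip.
    g = g / max(1.0, np.linalg.norm(g) / l2_norm_clip)
    clipped.append(g)
  # Step 5: add Gaussian noise whose scale is tied to the clipping norm.
  noise = np.random.normal(
      scale=noise_multiplier * l2_norm_clip, size=theta.shape)
  noisy_avg_grad = (np.sum(clipped, axis=0) + noise) / len(minibatch)
  # Step 6: apply the clipped and noised gradient.
  return theta - learning_rate * noisy_avg_grad
```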

### Implementing DP-SGD with TF Privacy

It's now time to make changes to the code we started with to take into account
the two modifications outlined in the previous section: gradient clipping and
noising. This is where TF Privacy kicks in: it provides code that wraps an
existing TF optimizer to create a variant that performs both of these steps
needed to obtain differential privacy.

As mentioned above, step 1 of the algorithm, that is forming minibatches of
training data and labels, is implemented by the `tf.Estimator` API in our
tutorial. We can thus go straight to step 2 of the algorithm outlined above and
compute the loss (i.e., model error) between the model's predictions and labels.

```python
vector_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)
```

TensorFlow provides implementations of common losses; here we use the
cross-entropy loss, which is well-suited for our classification problem. Note how we
computed the loss as a vector, where each component of the vector corresponds to
an individual training point and label. This is required to support per-example
gradient manipulation later in step 4.

We are now ready to create an optimizer. In TensorFlow, an optimizer object can
be instantiated by passing it a learning rate value, which is used in step 6
outlined above.
This is what the code would look like *without* differential privacy:

```python
optimizer = tf.train.GradientDescentOptimizer(FLAGS.learning_rate)
train_op = optimizer.minimize(loss=scalar_loss)
```

Note that our code snippet assumes that a TensorFlow flag was
defined for the learning rate value.
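
For completeness, here is a hedged sketch of how that flag and the scalar loss used above could be defined; the exact flag name and default value are illustrative, and the companion scripts define their own.

```python
# Illustrative definitions only (flag name and default are assumptions).
tf.flags.DEFINE_float('learning_rate', 0.15, 'Learning rate for training')
FLAGS = tf.flags.FLAGS

# A scalar loss can be obtained by averaging the per-example vector loss
# computed earlier over the minibatch.
scalar_loss = tf.reduce_mean(vector_loss)
```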

Now, we use the `optimizers.dp_optimizer` module of TF Privacy to implement the
optimizer with differential privacy. Under the hood, this code implements steps
3-6 of the algorithm above:

```python
optimizer = optimizers.dp_optimizer.DPGradientDescentGaussianOptimizer(
    l2_norm_clip=FLAGS.l2_norm_clip,
    noise_multiplier=FLAGS.noise_multiplier,
    num_microbatches=FLAGS.microbatches,
    learning_rate=FLAGS.learning_rate,
    population_size=60000)
train_op = optimizer.minimize(loss=vector_loss)
```

In these two code snippets, we used the stochastic gradient descent
optimizer, but it could be replaced by another optimizer implemented in
TensorFlow. For instance, the `AdamOptimizer` can be replaced by `DPAdamGaussianOptimizer`. In addition to the standard optimizers already
included in TF Privacy, most optimizers that are instances of a child class
of `tf.train.Optimizer`
can be made differentially private by calling `optimizers.dp_optimizer.make_gaussian_optimizer_class()`.
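
As a hedged sketch of that last point, and assuming `make_gaussian_optimizer_class()` takes the optimizer class as its argument and the resulting class accepts the same DP parameters shown above, wrapping Adam could look like this:

```python
# Sketch only: wrap a standard TF optimizer class into a DP Gaussian variant.
DPAdamOptimizer = optimizers.dp_optimizer.make_gaussian_optimizer_class(
    tf.train.AdamOptimizer)

optimizer = DPAdamOptimizer(
    l2_norm_clip=FLAGS.l2_norm_clip,
    noise_multiplier=FLAGS.noise_multiplier,
    num_microbatches=FLAGS.microbatches,
    learning_rate=FLAGS.learning_rate)
train_op = optimizer.minimize(loss=vector_loss)
```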

As you can see, only one line needs to change but there are a few things going
on that are best to unwrap before we continue. In addition to the learning rate, we
passed the size of the training set as the `population_size` parameter. This is
used to measure the strength of privacy achieved; we will come back to this
accounting aspect later.

More importantly, TF Privacy introduces three new hyperparameters to the
optimizer object: `l2_norm_clip`, `noise_multiplier`, and `num_microbatches`.
You may have deduced what `l2_norm_clip` and `noise_multiplier` are from the two
changes outlined above.

Parameter `l2_norm_clip` is the maximum Euclidean norm of each individual
gradient that is computed on an individual training example from a minibatch. This
parameter is used to bound the optimizer's sensitivity to individual training
points. Note how in order for the optimizer to be able to compute these
per-example gradients, we must pass it a *vector* loss as defined previously, rather
than the loss averaged over the entire minibatch.

Next, the `noise_multiplier` parameter is used to control how much noise is
sampled and added to gradients before they are applied by the optimizer.
Generally, more noise results in better privacy (often, but not necessarily, at
the expense of lower utility).

The third parameter relates to an aspect of DP-SGD that was not discussed
previously. In practice, clipping gradients on a per-example basis can be
detrimental to the performance of our approach because computations can no
longer be batched and parallelized at the granularity of minibatches. Hence, we
introduce a new granularity by splitting each minibatch into multiple
microbatches [[McMahan et al.]](https://arxiv.org/abs/1812.06210). Rather than
clipping gradients on a per-example basis, we clip them on a microbatch basis.
For instance, if we have a minibatch of 256 training examples, rather than
clipping each of the 256 gradients individually, we would clip 32 gradients
averaged over microbatches of 8 training examples when `num_microbatches=32`.
This allows for some degree of parallelism. Hence, one can think of
`num_microbatches` as a parameter that allows us to trade off performance (when
the parameter is set to a small value) with utility (when the parameter is set
to a value close to the minibatch size).
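
To make the arithmetic above concrete, here is a toy sketch (not the library's internal code) of how a minibatch of 256 per-example losses could be grouped into 32 microbatches of 8 examples each; the DP optimizer then clips one gradient per microbatch instead of one per example.

```python
# Toy illustration of the microbatch bookkeeping; TF Privacy handles this internally.
num_microbatches = 32  # with batch_size=256, each microbatch holds 8 examples
microbatch_losses = tf.reduce_mean(
    tf.reshape(vector_loss, [num_microbatches, -1]), axis=1)
# One gradient is computed and clipped per microbatch: 32 clipped gradients
# instead of 256, trading some utility for better parallelism.
```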

Once you've implemented all these changes, try training your model again with
the differentially private stochastic gradient optimizer. You can use the
following hyperparameter values to obtain a reasonable model (95% test
accuracy):

```python
learning_rate=0.25
noise_multiplier=1.3
l2_norm_clip=1.5
batch_size=256
epochs=15
num_microbatches=256
```
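
If your modified script (or the complete `mnist_dpsgd_tutorial.py` solution referenced at the end of this post) exposes these hyperparameters as command-line flags, a run could look like the following; the flag names are assumptions mirroring the `FLAGS` fields used in the snippets above.

```shell
python mnist_dpsgd_tutorial.py \
  --learning_rate=0.25 \
  --noise_multiplier=1.3 \
  --l2_norm_clip=1.5 \
  --batch_size=256 \
  --epochs=15 \
  --microbatches=256
```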

### Measuring the privacy guarantee achieved

At this point, we have made all the changes needed to train our model with
differential privacy. Congratulations! Yet, we are still missing one crucial
piece of the puzzle: we have not computed the privacy guarantee achieved. Recall
the two modifications we made to the original stochastic gradient descent
algorithm: clip and randomize gradients.

It is intuitive to machine learning practitioners how clipping gradients limits
the ability of the model to overfit to any of its training points. In fact,
gradient clipping is commonly employed in machine learning even when privacy is
not a concern. The intuition for introducing randomness to a learning algorithm
that is already randomized is a little more subtle, but this additional
randomization is required to make it hard to tell which behavioral aspects of
the model defined by the learned parameters came from randomness and which came
from the training data. Without randomness, we would be able to ask questions
like: “What parameters does the learning algorithm choose when we train it on
this specific dataset?” With randomness in the learning algorithm, we instead
ask questions like: “What is the probability that the learning algorithm will
choose parameters in this set of possible parameters, when we train it on this
specific dataset?”

We use a version of differential privacy which requires that the probability of
learning any particular set of parameters stays roughly the same if we change a
single training example in the training set. This could mean adding a training
example, removing a training example, or changing the values within one training
example. The intuition is that if a single training point does not affect the
outcome of learning, the information contained in that training point cannot be
memorized and the privacy of the individual who contributed this data point to our
dataset is respected. We often refer to this probability as the privacy budget:
smaller privacy budgets correspond to stronger privacy guarantees.

The accounting required to compute the privacy budget spent to train our machine
learning model is another feature provided by TF Privacy. Knowing what level of
differential privacy was achieved allows us to put into perspective the drop in
utility that is often observed when switching to differentially private
optimization. It also allows us to compare two models objectively to determine
which of the two is more privacy-preserving.

Before we derive a bound on the privacy guarantee achieved by our optimizer, we
first need to identify all the parameters that are relevant to measuring the
potential privacy loss induced by training. These are the `noise_multiplier`,
the sampling ratio `q` (the probability of an individual training point being
included in a minibatch), and the number of `steps` the optimizer takes over the
training data. We simply report the `noise_multiplier` value provided to the
optimizer and compute the sampling ratio and number of steps as follows:

```python
noise_multiplier = FLAGS.noise_multiplier
sampling_probability = FLAGS.batch_size / 60000
steps = FLAGS.epochs * 60000 // FLAGS.batch_size
```
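
With the hyperparameter values suggested earlier (a batch size of 256 and 15 epochs), these formulas evaluate to roughly the following:

```python
sampling_probability = 256 / 60000   # ~0.0043: each point lands in a given batch with ~0.43% probability
steps = 15 * 60000 // 256            # 3515 optimizer steps over 15 epochs
```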

At a high level, the privacy analysis measures how including or excluding any
particular point in the training data is likely to change the probability that
we learn any particular set of parameters. In other words, the analysis measures
the difference between the distributions of model parameters on neighboring training
sets (pairs of any training sets with a Hamming distance of 1). In TF Privacy,
we use the Rényi divergence to measure this distance between distributions.
Indeed, our analysis is performed in the framework of Rényi Differential Privacy
(RDP), which is a generalization of pure differential privacy
[[Mironov]](https://arxiv.org/abs/1702.07476). RDP is a useful tool here because
it is particularly well suited to analyze the differential privacy guarantees
provided by sampling followed by Gaussian noise addition, which is how gradients
are randomized in the TF Privacy implementation of the DP-SGD optimizer.

We will express our differential privacy guarantee using two parameters:
`epsilon` and `delta`.

* Delta bounds the probability of our privacy guarantee not holding. A rule of
  thumb is to set it to be less than the inverse of the training data size
  (i.e., the population size). Here, we set it to `10^-5` because MNIST has
  60000 training points.

* Epsilon measures the strength of our privacy guarantee. In the case of
  differentially private machine learning, it gives a bound on how much the
  probability of a particular model output can vary by including (or removing)
  a single training example. We usually want it to be a small constant.
  However, this is only an upper bound, and a large value of epsilon could
  still mean good practical privacy.
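
For reference, `epsilon` and `delta` are the two parameters in the standard formal statement of differential privacy: a randomized learning algorithm `M` is `(epsilon, delta)`-differentially private if, for every pair of training sets `d` and `d'` differing in a single example and every set `S` of possible output models,

```latex
\Pr[M(d) \in S] \le e^{\epsilon} \Pr[M(d') \in S] + \delta
```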

The TF Privacy library provides two methods for deriving the privacy
guarantees achieved from the three parameters outlined in the last code snippet: `compute_rdp`
and `get_privacy_spent`.
These methods are found in its `analysis.rdp_accountant` module. Here is how to use them.

First, we need to define a list of orders at which the Rényi divergence will be
computed. The first method, `compute_rdp`, returns the Rényi differential privacy
achieved by the Gaussian mechanism applied to gradients in DP-SGD, for each of
these orders.

```python
orders = [1 + x / 10. for x in range(1, 100)] + list(range(12, 64))
rdp = compute_rdp(q=sampling_probability,
                  noise_multiplier=FLAGS.noise_multiplier,
                  steps=steps,
                  orders=orders)
```

Then, the method `get_privacy_spent` computes the best `epsilon` for a given
value `target_delta` of delta by taking the minimum over all orders.

```python
epsilon = get_privacy_spent(orders, rdp, target_delta=1e-5)[0]
```
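
Putting the two calls together, a self-contained sketch might look as follows; the import path is an assumption based on the `analysis.rdp_accountant` module name mentioned above, so adjust it to match your installed copy of the library.

```python
# Assumed import path for TF Privacy's RDP accountant.
from privacy.analysis.rdp_accountant import compute_rdp, get_privacy_spent

sampling_probability = 256 / 60000
steps = 15 * 60000 // 256
orders = [1 + x / 10. for x in range(1, 100)] + list(range(12, 64))

rdp = compute_rdp(q=sampling_probability,
                  noise_multiplier=1.3,
                  steps=steps,
                  orders=orders)
epsilon = get_privacy_spent(orders, rdp, target_delta=1e-5)[0]
print('DP-SGD with these hyperparameters satisfies (%.2f, 1e-5)-DP' % epsilon)
```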

Running the code snippets above with the hyperparameter values used during
training will estimate the `epsilon` value that was achieved by the
differentially private optimizer, and thus the strength of the privacy guarantee
which comes with the model we trained. Once we have computed the value of `epsilon`,
interpreting it is at times
difficult. One possibility is to purposely
insert secrets in the model's training set and measure how likely
they are to be leaked by a differentially private model
(compared to a non-private model) at inference time
[[Carlini et al.]](https://arxiv.org/abs/1802.08232).

### Putting all the pieces together

We covered a lot in this blog post! If you made all the changes discussed
directly in the `mnist_scratch.py` file, you should have been able to train a
differentially private neural network on MNIST and measure the privacy guarantee
achieved.

However, in case you ran into an issue or you'd like to see what a complete
implementation looks like, the "solution" to the tutorial presented in this blog
post can be [found](https://github.com/tensorflow/privacy/blob/master/tutorials/mnist_dpsgd_tutorial.py) in the
tutorials directory of TF Privacy. It is the script called `mnist_dpsgd_tutorial.py`.