As of mid-2017 there are already many proposals for fooling DNNs with adversarial examples. Papernot et al (2016) proposed what is so far the most powerful such attack against image classifiers, by exploiting transferability. According to their experiments, the method can achieve nearly 90% misclassification rates against DNN services hosted on Amazon and Google. This practical result shows that black-box attacks are feasible.
An adversarial example is an input crafted deliberately to fool a DNN. The topic started when Szegedy et al (2013) found that for DNNs trained on ImageNet (such as AlexNet), a small change in the input can often result in vast differences in the output.
For example, if a model correctly recognizes a picture of a truck, changing relatively few pixels may be enough to make the model classify it differently, and the changes can be so small relative to the picture that human eyes are unlikely to notice them.
This will be a two-part blog post on recent developments in adversarial examples for DNNs. We will start by reviewing the important properties of these examples, pointing readers to the two main algorithms for finding them, and surveying some published strategies of defense. In the second part we will collect some important developments in this area since 2017. Readers may also wish to consult the excellent introduction Attacking Machine Learning with Adversarial Examples by OpenAI.
Important properties of adversarial examples
There are some basic findings in Szegedy et al (2013):
- DNNs trained on different subsets of the same dataset, with different network architectures, are supposedly different networks. However, experiments have found that they are often misled by the same adversarial example. In other words, an adversarial example found for one DNN can often be transferred to another DNN trained on the same problem.
- Finding an adversarial example is basically an optimisation problem: one needs to search around an input data point algorithmically. It is not going to be easy to find one by randomly perturbing the data.
So far the best explanation for the existence of adversarial examples is that they are the result of the linear computations in DNNs. This is explained in Goodfellow et al (2014) roughly as follows. The linear computations in DNNs are equations such as $y = w^\top x + b$, where $w$ and $b$ are the weight and bias respectively, $x$ is the input, and $y$ is the output activation. Here $w$ and $x$ are both vectors, and $w^\top x$ is their inner product. For a perturbed input $x + \eta$, the output becomes $y = w^\top x + w^\top \eta + b$. Therefore when $\eta$ is parallel to $w$, a small change represented by $\eta$ can result in a vast difference in the output $y$, because $w^\top \eta$ can be large. When the dimension of $w$ and $x$ is high (when there are many model features), $w^\top \eta$ is big enough to push the output over a decision boundary.
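A quick numeric sketch of this argument, in plain Python (the Gaussian weight vector is an illustrative stand-in for a trained layer, not something from the paper):

```python
import random

random.seed(0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

eps = 0.01  # tiny per-coordinate perturbation, imperceptible for an image

for n in (10, 1_000, 100_000):
    # Illustrative weight vector of one linear unit:
    w = [random.gauss(0, 1) for _ in range(n)]
    # Pick eta parallel to the sign of w, so every coordinate adds up:
    eta = [eps if wi >= 0 else -eps for wi in w]
    # The activation shift w . eta = eps * sum(|w_i|) grows linearly with n:
    print(f"n = {n:>7}: shift in output = {dot(w, eta):8.2f}")
```

Each coordinate moves by only 0.01, yet the shift in the output grows linearly with the number of input features, which is why high-dimensional inputs such as images are especially vulnerable.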
This view of adversarial examples explains a few things:
- Common regularization techniques such as dropout, pretraining, and model averaging are unlikely to be effective against adversarial examples.
- The existence of adversarial examples is a result of the geometric properties of the decision boundary; since models trained on the same problem tend to learn similar decision boundaries, this also explains their transferability.
- Against nonlinear models such as RBF networks, adversarial examples are not as effective as against linear models. Of course, nonlinear models are much harder to train. Model designers may therefore face a choice between linear models, which are easier to train but potentially unstable, and nonlinear models, which are harder to train but more robust.
Algorithms for finding adversarial examples
So far there are two main algorithms for finding adversarial examples:
- Fast Gradient Sign Method (FGSM) in Goodfellow et al (2014).
- Jacobian-based Saliency Map Approach (JSMA) in Papernot et al (2016).
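To make FGSM concrete, here is a minimal sketch for a toy linear model, where the gradient of a squared-error loss with respect to the input has a closed form; the function names are illustrative, not from the paper:

```python
def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def score(w, b, x):
    # A single linear unit: w . x + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fgsm(w, b, x, target, eps):
    # For the loss L = (score(x) - target)^2 the input gradient is
    # dL/dx_i = 2 * (score(x) - target) * w_i; FGSM only needs its sign.
    s = score(w, b, x)
    grad_sign = [sign(2.0 * (s - target) * wi) for wi in w]
    # Step every coordinate by eps in the direction that increases the loss:
    return [xi + eps * g for xi, g in zip(x, grad_sign)]

w, b = [1.0, -1.0], 0.0
x = [0.6, 0.4]                              # score(x) is about 0.2
x_adv = fgsm(w, b, x, target=0.0, eps=0.1)  # pushes the score up to about 0.4
```

In a real DNN the gradient comes from backpropagation rather than a closed form, and pixel values are clipped back to their valid range, but the one-step structure x + eps * sign(grad) is the same.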
The threat model for exploiting DNNs with adversarial examples generally has the following categories (Papernot et al (2016)).
Adversarial goals (from easy to difficult):
1. Confidence reduction: reduce the confidence of the output.
2. Misclassification: change the output to any incorrect class.
3. Targeted misclassification: craft an input that produces a specific output.
4. Source/target misclassification: modify a specific input so that it produces a specific output.
Adversarial capabilities (from most to least):
1. Architecture & training data.
2. Architecture.
3. Training data sample.
4. Oracle: the ability to query the model for predictions on chosen inputs.
Under this threat model, the research mentioned so far all pursues adversarial goal (4). As for adversarial capabilities, scenarios before 2015 all require (2), meaning the attackers need to know the architecture of the target network; since 2016 attacks only require (4), meaning the attackers only need access to some prediction results.
Note that through the model-extraction scenarios in Tramèr et al (2016), a malicious entity can gain information about the target network's architecture and then apply that information to generate adversarial examples.
Strategies of defense
Two broad categories of defense exist: reactive and proactive.
- Reactive: add defenses against adversarial examples after the model is trained. Many of these results concern how to filter adversarial examples out of the model input.
- Proactive: train the model to be more resistant to adversarial examples.
The two main approaches to proactive defense are:
- Adversarial training in Shaham et al (2015) uses generated adversarial examples to retrain the model, thereby reducing the output changes caused by perturbations. Such retrained models are locally more stable, and generating adversarial examples for them is harder.
- Defensive distillation in Papernot et al (2016) feeds the probability vectors produced by the model back into training to distill the weights, smoothing the decision boundaries and making adversarial examples harder to find.
The more effective approach so far seems to be defensive distillation, which has been shown to be effective against FGSM and JSMA. This was before Papernot et al (2016) and Carlini & Wagner (2016) announced new attack scenarios that defeat it. Papernot & McDaniel (2017) attempt to harden defensive distillation, but the results are limited.
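The mechanism at the heart of distillation is the temperature-scaled softmax: training at a high temperature T and deploying at T = 1 smooths the gradients an attacker can exploit. A minimal sketch of the temperature softmax itself (a standard formula, not code from the paper):

```python
import math

def softmax(logits, T=1.0):
    # Subtract the max for numerical stability, then divide logits by T:
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 2.0, 1.0]
p_sharp = softmax(logits, T=1.0)   # sharply peaked on the first class
p_soft = softmax(logits, T=20.0)   # much smoother distribution at high T
```

In defensive distillation the first network is trained at a high T, its soft probability vectors become the labels for a second distilled network trained at the same T, and the distilled network is then deployed at T = 1.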
- Carlini, N., & Wagner, D. (2016). Towards Evaluating the Robustness of Neural Networks. Retrieved from http://arxiv.org/abs/1608.04644
- Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and Harnessing Adversarial Examples, 1–11. Retrieved from http://arxiv.org/abs/1412.6572
- Papernot, N., & McDaniel, P. (2017). Extending Defensive Distillation. arXiv. Retrieved from http://arxiv.org/abs/1705.05264
- Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2016). Practical Black-Box Attacks against Machine Learning. https://doi.org/10.1145/3052973.3053009
- Papernot, N., McDaniel, P., Wu, X., Jha, S., & Swami, A. (2016). Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks. Proceedings — 2016 IEEE Symposium on Security and Privacy, SP 2016, 582–597. https://doi.org/10.1109/SP.2016.41
- Papernot, N., Mcdaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., & Swami, A. (2016). The limitations of deep learning in adversarial settings. Proceedings — 2016 IEEE European Symposium on Security and Privacy, EURO S and P 2016, 372–387. https://doi.org/10.1109/EuroSP.2016.36
- Shaham, U., Yamada, Y., & Negahban, S. (2015). Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization, 1–12. Retrieved from http://arxiv.org/abs/1511.05432
- Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. Retrieved from http://arxiv.org/abs/1312.6199
- Tramèr, F., Zhang, F., & Juels, A. (2016). Stealing Machine Learning Models via Prediction APIs. In Proceedings of the 25th USENIX Security Symposium (pp. 601–618).