Over the course of this book, you have come to understand GANs as an intuitive concept. However, in 2014, GANs seemed like a massive leap of faith, especially for those unfamiliar with the emerging field of adversarial examples, including Ian Goodfellow’s and others’ work in this field.[1] This chapter dives into adversarial examples—specially constructed examples that make other classification algorithms fail catastrophically.
[1] See “Intriguing Properties of Neural Networks,” by Christian Szegedy et al., 2014, https://arxiv.org/pdf/1312.6199.pdf.
We also talk about their connections to GANs and how and why adversarial learning is still largely an unsolved problem in ML—an important but rarely discussed flaw of the current approaches. That is true even though adversarial examples have an important role to play in ML robustness, fairness, and (cyber)security.
There is no denying we have made substantial progress in machine learning’s capacity to match and surpass human-level performance over the last five years—for example, in computer vision (CV) classification tasks or the ability to play games.[2] However, looking only at metrics and ROC curves[3] is insufficient for us to understand (a) why neural networks make the decisions they do (how they work) and (b) what errors they are prone to making. This chapter touches on the first and dives into the second. Before we begin, it should be said that although this chapter deals almost exclusively with CV problems, adversarial examples have been identified in diverse areas such as text or even in humans.[4]
[2] What constitutes human-level performance in vision-classification tasks is a complicated topic. However, at least in, for example, Dota 2 and Go, AI has beaten human experts by a substantial margin.
[3] A receiver operating characteristic (ROC) curve explains the trade-offs between false positives and false negatives. We also encountered them in chapter 2. For more details, Wikipedia has an excellent explanation.
[4] See “Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey,” by Wei Emma Zhang et al., 2019, http://arxiv.org/abs/1901.06796. See also “Adversarial Examples That Fool Both Computer Vision and Time-Limited Humans,” by Gamaleldin F. Elsayed et al., 2018, http://arxiv.org/abs/1802.08195.
First of all, when we speak about neural networks’ performance, we frequently read that their error rate is lower than that of humans on the large ImageNet dataset. This often-cited statistic—which started more as an academic joke than anything else—belies the performance differences hidden underneath this average. While humans’ error rate tends to be driven mostly by their inability to distinguish between different breeds of dogs that appear prominently in this dataset, the machine learning failures are much more ominous. Upon further investigation, adversarial examples were born.
Unlike humans, CV algorithms struggle with problems that are very different in nature and can be close to the training data. Because the algorithm has to make predictions for every picture possible, it has to extrapolate between the isolated and far-apart individual instances it has seen in the training data, even if we have lots of them.
When we have trained networks such as Inception V3 and VGG-19, we have found an amazing way of making image classification work on a thin manifold around the training data. But when people tried to poke holes in the classification ability of these algorithms, they discovered a cosmic crater—current machine learning algorithms get easily fooled by even minor distortions. Virtually all major successful machine learning algorithms to date suffer from this flaw to some extent, and, indeed, some speculate that is why machine learning works at all.
In supervised settings, think of our training set. We have a training manifold, which is just a fancy word describing a high-dimensional distribution in which our examples live. For example, our 300 × 300 pixel images live in a 270,000-dimensional space (300 × 300 × 3 colors). That makes training very complicated.
To start, we want to quickly touch on why we included this chapter toward the end of the book:
In terms of applications, adversarial examples are interesting for several reasons:
As current research stands, learning about adversarial examples is the only way to start to understand adversarial defenses, as most papers begin with a description of the types of attacks they defend against and only then try to solve them. At the time of writing this book, no universal defenses work against all types of attack. But whether this is a good reason to study them depends on your view on adversarial examples. We decided not to cover defenses in detail, apart from the high-level ideas toward the end of this chapter, because anything more is beyond the scope of this book.
To truly understand adversarial examples, we must come back to the domain of CV classification tasks—partially to understand how difficult a task it is. Recall that to go from raw pixels to ultimately being able to classify sets of images is challenging.
This is in part because, in order to have a truly generalizable algorithm, we have to make sensible predictions on data nowhere near anything that we have seen in the training set. Moreover, the pixel-level differences between the image at hand and the closest image in the training set of the same class are large, even when we slightly change the angle at which the picture was taken.
When we have our training set of 100,000 examples of 300 × 300 images in RGB space, we have to somehow deal with 270,000 dimensions. When we consider all possible images (not the ones that we actually observe, but the ones that could happen), the pixel value of each dimension is independent of the other dimensions, because we can always generate a valid picture by rolling a hypothetical 256-sided die 270,000 times. Therefore, we theoretically have 256^270,000 examples (a number that is 650,225 digits long) in 8-bit color space.
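The digit count quoted above can be checked without ever materializing the number itself. The short sketch below relies on the fact that a positive integer n has floor(log10(n)) + 1 decimal digits; the variable names are ours and purely illustrative.

```python
import math

# Number of independent pixel values in one 300 x 300 RGB image.
dimensions = 300 * 300 * 3

# Each value can take 256 levels at 8-bit color depth, so there are
# 256 ** dimensions possible images. A positive integer n has
# floor(log10(n)) + 1 decimal digits, and
# log10(256 ** d) = d * log10(256).
digit_count = math.floor(dimensions * math.log10(256)) + 1

print(dimensions)   # 270000
print(digit_count)  # 650225
```

Working with the logarithm avoids building a 650,225-digit integer just to measure its length.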
We would need a lot of examples to cover even 1% of this space. Of course, most of these images would not make any sense. Frequently, our training set is a lot sparser than that, so we need our algorithms to train using this relatively limited data to extrapolate even into regions they have not seen at all yet. This is because most inputs the algorithm will meet are likely nowhere near anything it has seen in the training set.
Having 100,000 examples is frequently cited as a minimum at which deep learning algorithms should really start to shine.
We understand that algorithms have to meaningfully generalize; they have to be able to meaningfully fill in the huge part of space where they have not seen any example. Computer vision algorithms work mostly because they can come up with good guesses for the vast swaths of missing probability, but their strength is also their greatest weakness.
In this section, we introduce two ways of thinking about adversarial examples: one from first principles and the other by analogy. The first way to think about adversarial examples is to start from the way a machine learning classifier is trained. Remember that these are networks with tens of millions of parameters. Throughout training, we update some of them so that the predicted class matches the label as provided in the training set. We need to find just the right parameter updates, which is what stochastic gradient descent (SGD) allows us to do.
Now think back to the simple classifier days, before you knew a lot about GANs. Here we have some sort of learnable classification function fθ(x) (for example, a deep neural network, or DNN), which is parametrized by θ (parameters of the DNN) and takes x (for example, an image) as input and produces a classification ŷ. At training time, we then take ŷ and compare it with the true y, which is how we get our loss (L). We then update the parameters of fθ(x) such that the loss is minimized. Equations 10.1, 10.2, and 10.3 summarize.[5]
[5] Please remember, this is just a quick summary, and we have to skip over some details, so if you can point them out, great. If not, we suggest picking up a book such as Deep Learning with Python by François Chollet (Manning, 2017) to brush up on the specifics.
In essence, we have defined prediction as the output of the neural net after being fed an example (equation 10.1). Loss is some form of the difference between the true and predicted label (equation 10.2). The overall problem is then phrased as trying to minimize the difference between the true and predicted labels over the parameters of the DNN, which then constitute the prediction given an example (equation 10.3).
This is all working great, but how do we actually minimize our classification loss? How do we solve the optimization problem as phrased in equation 10.3? We usually use an SGD-based method: we take batches of x, compute the derivative of the loss function with respect to the current parameters (θt), multiply it by our learning rate (α), and subtract the result from θt to get our new parameters (θt + 1). See equation 10.4.
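The numbered equations referred to above are not reproduced in this text. One way to write them that is consistent with the surrounding description is sketched here; the generic norm used for the loss and the gradient notation are our assumptions, not necessarily the original typesetting.

```latex
\begin{align}
\hat{y} &= f_\theta(x) \tag{10.1} \\
L &= \lVert y - \hat{y} \rVert \tag{10.2} \\
\min_\theta \; & \lVert y - f_\theta(x) \rVert \tag{10.3} \\
\theta_{t+1} &= \theta_t - \alpha \, \nabla_\theta L \,\big|_{\theta = \theta_t} \tag{10.4}
\end{align}
```

The minus sign in (10.4) is what makes the update a descent step: the parameters move against the gradient, scaled by the learning rate α.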
This was the quickest introduction to deep learning you will ever find. But now that you have this context, think about whether this powerful tool (SGD) could be used for other purposes as well. For instance, what happens when we take a step up the loss space rather than down? Turns out, maximizing the error rather than minimizing it is much easier, but also important. And like many great discoveries, it started as a seeming bug that turned into a hack: what if we start updating the pixels rather than the weights? If we update them maliciously, adversarial examples happen.
Some of you may be confused by this quick recap of SGD, so let’s remind ourselves in figure 10.1 what a typical loss space could look like.
(Source: “Visualizing the Loss Landscape of Neural Nets,” by Tom Goldstein et al., 2018, https://github.com/tomgoldstein/loss-landscape.)
The second useful (though imperfect) mental model to think about adversarial examples is by analogy. You may think of adversarial examples as Conditional GANs like those we encountered in the preceding two chapters. With adversarial examples, we are conditioning on an entire image and trying to produce a domain transferred or similar image, except in a domain that fools the classifier. The “generator” can be a simple stochastic gradient ascent that simply adjusts the image to fool some other classifier.
Whichever of the two ways makes sense to you, let’s now dive straight into adversarial examples and what they look like. They were discovered with an observation of how easy it is to misclassify these altered images. One of the first methods to achieve this is the fast gradient sign method (FGSM), which is as simple as our previous description.
You start with the gradient update (equation 10.4), look at the sign, and then make a small step in the opposite direction. In fact, frequently the images come out looking (almost) identical! A picture is worth a thousand words to show you how little noise is needed; see figure 10.2.
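To make the sign-of-the-gradient step concrete, here is a minimal NumPy sketch. It is not the ResNet-50 setup used in this chapter: a tiny logistic-regression "classifier" with random weights stands in for the network, pixel values are assumed to lie in [0, 1], and all function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_gradient_wrt_input(w, b, x, y):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = sigmoid(w @ x + b)
    return (p - y) * w

def fgsm(w, b, x, y, epsilon):
    """One gradient-sign step on the pixels: each value moves by at most epsilon
    in the direction that increases the loss; the weights stay fixed."""
    grad = loss_gradient_wrt_input(w, b, x, y)
    x_adv = x + epsilon * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)  # stay inside the assumed pixel range

def true_class_confidence(w, b, x, y):
    p = sigmoid(w @ x + b)
    return p if y == 1.0 else 1.0 - p

rng = np.random.default_rng(0)
w = 0.1 * rng.normal(size=100)       # toy "trained" weights (assumption)
b = 0.0
x = rng.uniform(0.2, 0.8, size=100)  # toy "image" with 100 pixels
# Treat the model's own prediction on the clean input as the true label.
y = 1.0 if sigmoid(w @ x + b) > 0.5 else 0.0

epsilon = 0.1
x_adv = fgsm(w, b, x, y, epsilon)

print("confidence in true class, clean:      ", true_class_confidence(w, b, x, y))
print("confidence in true class, adversarial:", true_class_confidence(w, b, x_adv, y))
print("largest per-pixel change:             ", np.max(np.abs(x_adv - x)))
```

Note that only the input changes. The update has the same shape as an SGD step, but it is applied to the pixels and goes up the loss surface instead of down.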
Now we run a ResNet-50 pretrained classifier on this unmodified vacation image and check the top three predictions, shown in table 10.1; drumroll, please.
Order | Class | Confidence |
---|---|---|
First | mountain_tent | 0.6873 |
Second | promontory | 0.0736 |
Third | valley | 0.0717 |
The top three are all sensible, with mountain_tent taking the top spot, as it should. Table 10.2 shows the adversarial image predictions. The top three miss mountain_tent completely, with some suggestions that at least match the outdoors, but even the modified image is clearly not a suspension bridge.
Order | Class | Confidence |
---|---|---|
First | volcano | 0.5914 |
Second | suspension_bridge | 0.1685 |
Third | valley | 0.0869 |
This is how much we can distort the prediction, with a budget of only approximately 200 pixel values—the equivalent of taking a single almost-black pixel and turning it into an almost-white pixel—spread across the whole image.
A somewhat scary thing is how little code it takes to create this whole example. In this chapter, we’ll use an amazing library called foolbox, which is designed specifically to make adversarial attacks easier and provides many great convenience methods to create adversarial examples. Without further ado, let’s dive into it. We start with our well-known imports, plus foolbox.

    import numpy as np
    from keras.applications.resnet50 import ResNet50
    from foolbox.criteria import Misclassification, ConfidentMisclassification
    from keras.preprocessing import image as img
    from keras.applications.resnet50 import preprocess_input, decode_predictions
    import matplotlib.pyplot as plt
    import foolbox
    import pprint as pp
    import keras
    %matplotlib inline
Next, we define a convenience function to load in more images.
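    def load_image(img_path: str):
        # Load the file, resized to the 224 x 224 input that ResNet-50 expects.
        image = img.load_img(img_path, target_size=(224, 224))
        plt.imshow(image)
        # Convert the loaded image to an array of shape (224, 224, 3).
        x = img.img_to_array(image)
        return x

    image = load_image('DSC_0897.jpg')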
Next, we have to set Keras to register our model and download ResNet-50 from the Keras convenience function.
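    # Put Keras into inference mode and load the pretrained model.
    keras.backend.set_learning_phase(0)
    kmodel = ResNet50(weights='imagenet')

    # Wrap the Keras model for foolbox: valid pixel bounds plus
    # per-channel mean subtraction.
    preprocessing = (np.array([104, 116, 123]), 1)
    fmodel = foolbox.models.KerasModel(kmodel, bounds=(0, 255),
                                       preprocessing=preprocessing)

    # The model predicts on batches, so add a batch axis: (1, 224, 224, 3).
    to_classify = np.expand_dims(image, axis=0)
    preds = kmodel.predict(to_classify)
    # pp.pprint prints directly and returns None, so call it on its own.
    print('Predicted:')
    pp.pprint(decode_predictions(preds, top=20)[0])
    # Index of the highest-scoring class, used as the label to attack.
    label = np.argmax(preds)

    # Reverse the color channels (RGB to BGR).
    image = image[:, :, ::-1]
    # Create the attack object with a strict misclassification criterion.
    attack = foolbox.attacks.FGSM(fmodel, threshold=.9,
                                  criterion=ConfidentMisclassification(.9))
    # Apply the attack to the source image.
    adversarial = attack(image, label)

    # Check the predictions on the adversarial image.
    new_preds = kmodel.predict(np.expand_dims(adversarial, axis=0))
    print('Predicted:')
    pp.pprint(decode_predictions(new_preds, top=20)[0])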
That’s how easy it is to use these examples! Now you may be thinking, maybe that’s just ResNet-50 that suffers from these examples. Well, we have some bad news for you. ResNet not only proved to be the hardest classifier to break as we were testing various code setups for this chapter, but also is an uncontested winner on DAWNBench in every ImageNet category (which is the most challenging task in the CV category on DAWNBench), as shown in figure 10.3.[6]
[6] See “Image Classification on ImageNet,” at DAWNBench, https://dawn.cs.stanford.edu/benchmark/#imagenet.
But the biggest problem of adversarial examples is their pervasiveness. Adversarial examples generalize beyond deep learning and transfer to different ML techniques. If we generate an adversarial example against one technique, there is a reasonable chance it will work even on another model we are trying to attack, as illustrated in figure 10.4.
(Source: “Transferability in Machine Learning: from Phenomena to Black-Box Attacks Using Adversarial Samples,” by Nicolas Papernot et al., 2016, https://arxiv.org/pdf/1605.07277.pdf.)
Worse yet, many of the adversarial examples are so easy to construct that we can just as easily fool the classifier with Gaussian noise that we can sample from np.random.normal. On the other hand, and to support our earlier point that ResNet-50 is a fairly robust architecture, we will show you that other architectures suffer from this issue much more.
Figure 10.5 shows the result of running ResNet-50 on pure Gaussian noise. However, we can use an adversarial attack on the noise itself to see how misclassified our image can get—rather quickly.
In listing 10.4, we’ll use a projected gradient descent (PGD) attack, illustrated in figure 10.6. Although this is still a simple attack, it warrants a high-level explanation. Unlike with the previous attacks, we are now taking a step regardless of where it may lead us—even “invalid” pixel values—and then projecting back onto the feasible space. Now let’s apply the PGD attack onto our Gaussian noise in figure 10.7 and run ResNet-50 to see how we do.
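The step-then-project idea can be sketched in a few lines of NumPy. This is not the foolbox implementation: a toy logistic-regression model stands in for ResNet-50, the feasible space is assumed to be an L-infinity ball of radius epsilon around the original input intersected with the pixel range [0, 1], and the step uses the gradient sign, which is one common variant. All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def true_class_confidence(w, b, x, y):
    p = sigmoid(w @ x + b)
    return p if y == 1.0 else 1.0 - p

def pgd_attack(w, b, x, y, epsilon, step_size, num_steps):
    """Repeated gradient steps on the input, each followed by a projection."""
    x_orig = x.copy()
    x_adv = x.copy()
    for _ in range(num_steps):
        # Gradient of the binary cross-entropy loss with respect to the input.
        p = sigmoid(w @ x_adv + b)
        grad = (p - y) * w
        # Take the step without worrying about where it lands.
        x_adv = x_adv + step_size * np.sign(grad)
        # Project back onto the epsilon-ball around the original input...
        x_adv = np.clip(x_adv, x_orig - epsilon, x_orig + epsilon)
        # ...and onto the range of valid pixel values.
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

rng = np.random.default_rng(0)
w = 0.1 * rng.normal(size=100)                        # toy weights (assumption)
b = 0.0
x = np.clip(rng.normal(0.5, 0.2, size=100), 0.0, 1.0) # clipped Gaussian noise
y = 1.0 if sigmoid(w @ x + b) > 0.5 else 0.0          # model's own clean prediction

epsilon = 0.05
x_adv = pgd_attack(w, b, x, y, epsilon=epsilon, step_size=0.01, num_steps=10)

print("confidence in original class, clean:", true_class_confidence(w, b, x, y))
print("confidence in original class, PGD:  ", true_class_confidence(w, b, x_adv, y))
```

The two `np.clip` calls are the projection: whatever the raw step does, the result never leaves the allowed perturbation budget or the valid pixel range.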
To demonstrate that most architectures are even worse, we’ll look into Inception V3—an architecture that has earned fame in the CV community. Indeed, this network has been deemed so reliable that we touched on it in chapter 5. In figure 10.8, you can see that even something that gave birth to the inception score still fails on trivial examples. To dispel any doubts, Inception V3 is still one of the better pretrained networks out there and does have superhuman accuracy.
This was just regular Gaussian noise. You can see in the code for yourself that no adversarial step was applied. Sure, you could argue that the noise could have been preprocessed better. But even that is a massive adversarial weakness.
If you are anything like us, you are thinking, no way, I want to see for myself. Well, now we give you the code to reproduce those figures. Because the code for each is similar, we go through it only once and for next time promise DRYer code.
For an explanation of don’t repeat yourself (DRY) code, see Wikipedia at https://en.wikipedia.org/wiki/Don%27t_repeat_yourself.
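    # Relies on np, plt, kmodel, and decode_predictions from the earlier listings.
    # max_vals is assumed to be a table with `mu` and `sigma` columns defined elsewhere.
    fig = plt.figure(figsize=(20, 20))
    sigma_list = list(max_vals.sigma)
    mu_list = list(max_vals.mu)
    conf_list = []

    def make_subplot(x, y, z, new_row=False):
        # Sample noise for each channel using the current (global) mu and sigma.
        rand_noise = np.random.normal(loc=mu, scale=sigma, size=(224, 224, 3))
        # Keep only valid pixel values.
        rand_noise = np.clip(rand_noise, 0, 255.)
        noise_preds = kmodel.predict(np.expand_dims(rand_noise, axis=0))
        # Top predicted class name and its confidence.
        prediction, num = decode_predictions(noise_preds, top=20)[0][0][1:3]
        num = round(num * 100, 2)
        conf_list.append(num)
        # Annotate the subplot with the prediction and its confidence.
        ax = fig.add_subplot(x, y, z)
        ax.annotate(prediction, xy=(0.1, 0.6),
                    xycoords=ax.transAxes, fontsize=16, color='yellow')
        ax.annotate(f'{num}%', xy=(0.1, 0.4),
                    xycoords=ax.transAxes, fontsize=20, color='orange')
        if new_row:
            # Raw f-string so that \mu and \sigma reach matplotlib's math renderer.
            ax.annotate(rf'$\mu$:{mu}, $\sigma$:{sigma}',
                        xy=(-.2, 0.8), xycoords=ax.transAxes,
                        rotation=90, fontsize=16, color='black')
        # Scale to [0, 1] for display.
        ax.imshow(rand_noise / 255)
        ax.axis('off')

    # 10 x 10 grid: each row starts with a new (mu, sigma) pair.
    for i in range(1, 101):
        if (i - 1) % 10 == 0:
            mu = mu_list.pop(0)
            sigma = sigma_list.pop(0)
            make_subplot(10, 10, i, new_row=True)
        else:
            make_subplot(10, 10, i)

    plt.show()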
Some people now start to worry about the security implications of adversarial examples. However, it is important to keep this in a meaningful perspective of a hypothetical attacker. If the attacker can change every pixel slightly, why not change the whole image?[7] Why not just feed in another one that is completely different? Why does the passed-in example have to be imperceptibly—rather than visibly—different?
[7] See “Motivating the Rules of the Game for Adversarial Example Research,” by Justin Gilmer et al., 2018, http://arxiv.org/abs/1807.06732.
Some people give the example of self-driving cars and adversarially perturbing stop signs. But if we can do that, why wouldn’t the attackers completely spray-paint over the stop signs or simply physically obscure the stop sign with a high speed-limit sign for a little while? After all, these “traditional attacks,” unlike adversarial examples, will work 100% of the time, whereas an adversarial attack works only when it transfers well and manages to not get distorted by the preprocessing.
This does not mean that when you have a mission-critical ML application, you can just ignore this problem. However, in most cases, adversarial attacks require far more effort than more commonplace vectors of attack, so bearing that in mind is worthwhile.
Yet, as with most security implications, adversarial attacks also have adversarial defenses that attempt to defend against the many types of attacks. The attacks covered in this chapter have been some of the easier ones, but even simpler ones exist—such as drawing a single line through MNIST. Even that is sufficient to fool most classifiers.
Adversarial defenses are an ever-evolving game, in which many good defenses are available against some types of attacks, but not all. The turnaround can be so quick that just three days after the submission deadline for ICLR 2018, seven of the eight proposed and examined defenses were broken.[8]
[8] ICLR is the International Conference on Learning Representations, one of the smaller but excellent machine learning conferences. See Anish Athalye on Twitter in 2018, http://mng.bz/ad77. It should be noted that there were three more defenses unexamined by the author.
To make the connection with GANs even clearer, imagine a system generating adversarial examples, and another one saying how good that example is—depending on whether the example managed to fool the system or not. Doesn’t that remind you of a Generator (adversary) and a Discriminator (classification algorithm)? These two algorithms are again competing: the adversary is trying to fool the classifier with slight perturbations of the image, and the classifier is trying to not get fooled. Indeed, a way to think of GANs is almost as ML-in-the-loop adversarial examples that eventually come up with images.
On the other hand, you can think of iterated adversarial attacks as if you took a GAN and, rather than specifying that the objective is to generate the most realistic examples, you specify that the objective is to generate examples that will fool the classifier. Of course, you have to always remember that important differences exist, and typically you have a fixed classifier in deployed systems. But that does not preclude us from using this idea in adversarial training in which some implementations even include a repeated retraining of the classifier based on the adversarial examples that fooled it. These techniques are then moving closer to a typical GANs setup.
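The retraining loop described above can be sketched on a toy problem. This is only an illustration under our own assumptions: a two-dimensional logistic-regression classifier, two Gaussian blobs as data, and a gradient-sign attack as the "adversary". As a simplification, every crafted example is fed back into training rather than only the ones that fooled the model. All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_batch(w, b, X, y, epsilon):
    """Gradient-sign attack against a logistic-regression classifier, per row."""
    p = sigmoid(X @ w + b)
    grad_X = (p - y)[:, None] * w[None, :]  # d(loss)/d(input) for each example
    return X + epsilon * np.sign(grad_X)

def accuracy(w, b, X, y):
    return float(np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1)))

def adversarial_training(X, y, epsilon=0.5, lr=0.1, epochs=200):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # The "adversary": craft examples against the current classifier.
        X_adv = fgsm_batch(w, b, X, y, epsilon)
        # The "classifier": one gradient step on clean and adversarial data.
        X_all = np.vstack([X, X_adv])
        y_all = np.concatenate([y, y])
        p = sigmoid(X_all @ w + b)
        w -= lr * X_all.T @ (p - y_all) / len(y_all)
        b -= lr * np.mean(p - y_all)
    return w, b

rng = np.random.default_rng(0)
n = 200
X = np.vstack([rng.normal(+2.0, 1.0, size=(n // 2, 2)),   # class 1 blob
               rng.normal(-2.0, 1.0, size=(n // 2, 2))])  # class 0 blob
y = np.concatenate([np.ones(n // 2), np.zeros(n // 2)])

w, b = adversarial_training(X, y)
clean_acc = accuracy(w, b, X, y)
adv_acc = accuracy(w, b, fgsm_batch(w, b, X, y, 0.5), y)

print("accuracy on clean data:      ", clean_acc)
print("accuracy on adversarial data:", adv_acc)
```

Unlike a GAN, the adversary here has no parameters of its own to learn; it simply reuses the classifier's gradient at every round. That is the main structural difference between this loop and the Generator-Discriminator setup.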
To give you an example, let’s take a look at one technique that has held its ground for a while as a viable defense. In the Robust Manifold Defense, we take the following steps to defend against the adversarial examples:[9]
[9] See “The Robust Manifold Defense: Adversarial Training Using Generative Models,” by Ajil Jalal et al., 2019, https://arxiv.org/pdf/1712.09196.pdf.
However, the authors of this defense find out that there are still some ambiguous cases in which the classifier does get fooled by minor perturbations. Still, we encourage you to check out their paper, as these cases tend to be unclear to humans as well, which is a sign of a robust model. To fix this, we apply adversarial training on the manifold: we get some of these adversarial cases into the training set so the classifier learns to distinguish those from the real training data.
This paper demonstrates that using GANs can give us classifiers that do not completely break down after minor perturbations, even against some of the most sophisticated methods. Performance of the downstream classifier does drop, as with most of these defenses, because our classifier now has to be trained to implicitly deal with these adversarial cases. And even at that cost, it is not a universal defense.
Adversarial training, of course, has some interesting applications. For example, for a while, the best results—state of the art—in semi-supervised learning were achieved by using adversarial training.[10] This was subsequently challenged by GANs (remember chapter 7?) and other approaches, but that does not mean that by the time you are reading these lines, adversarial training will not be the state of the art again.
[10] See “Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning,” by Takeru Miyato et al., 2018, https://arxiv.org/pdf/1704.03976.pdf.
Hopefully, this gave you another reason to study GANs and adversarial examples—partially because in mission-critical classification tasks, GANs may be the best defense going forward or because of other applications beyond the scope of this book.[11] That is best left for a hypothetical Adversarial Examples in Action.
[11] This was a hotly debated topic at ICLR 2019. Though most of these conversations were informal, using (pseudo) invertible generative models as a way to classify “out-of-sample”ness of an image seems like a fruitful avenue.
To sum up, we have laid out the notion of adversarial examples and made the connection to GANs even more specific. This is an underappreciated connection, but one that can solidify your understanding of this challenging subject. Furthermore, one of the defenses against adversarial examples is GANs themselves![12] So GANs also have the potential to close this gap that likely led to their existence in the first place.
[12] See Jalal et al., 2019, https://arxiv.org/pdf/1712.09196.pdf.
Adversarial examples are an important field, because even commercial computer vision products suffered from this shortcoming and can still be easily fooled by academics.[13] Beyond security and machine learning explainability applications, many practical uses remain in fairness and robustness.
[13] See “Black-Box Adversarial Attacks with Limited Queries and Information,” by Andrew Ilyas et al., 2018, https://arxiv.org/abs/1804.08598.
Furthermore, adversarial examples are an excellent way of solidifying your own understanding of deep learning and GANs. Adversarial examples take advantage of the difficulty in training classifiers in general and the relative ease of fooling the classifier in one particular case. The classifier has to make predictions for many images, and crafting a special offset to fool the classifier exactly right is easy because of the many degrees of freedom. As a result, we can easily get adversarial noise that completely changes the label of a picture without changing the image perceptibly.
Adversarial examples can be found in many domains and many areas of AI, not just deep learning or computer vision. But as you saw in the code, creating the ones in computer vision is not challenging. Defenses against these examples exist, and you saw one using GANs, but adversarial examples are far from being solved completely.