• Explaining and Harnessing Adversarial Examples: https://arxiv.org/abs/1412.6572
• DeepFool: a simple and accurate method to fool deep neural networks: https://arxiv.org/abs/1511.04599
• Universal adversarial perturbations: https://arxiv.org/abs/1610.08401
• Delving into Transferable Adversarial Examples and Black-box Attacks: https://arxiv.org/abs/1611.02770
• On Detecting Adversarial Perturbations: https://arxiv.org/abs/1702.04267
• Analysis of universal adversarial perturbations: https://arxiv.org/abs/1705.09554
• NO Need to Worry about Adversarial Examples in Object Detection in Autonomous Vehicles: https://arxiv.org/abs/1707.03501
[Web pages]
• Robust Adversarial Examples: https://blog.openai.com/robust-adversarial-inputs/
3/32
• A powerful universal noise can be created from only a small amount of data.
• The noise can be constructed by accumulating, over the data, vectors along the normal direction of the decision boundary.
• The normal vectors of the decision boundary point in a common direction across many data points and are strongly correlated (the different decision-boundary regions can be described in a low-dimensional subspace).
[Screenshot: first page of "Universal Adversarial Perturbations" — Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, Pascal Frossard. Abstract: given a state-of-the-art deep neural network classifier, there exists a universal (image-agnostic) and very small perturbation vector that causes natural images to be misclassified with high probability; a systematic algorithm computes such perturbations, which are quasi-imperceptible to the human eye, generalize very well across neural networks, reveal important geometric correlations in the high-dimensional decision boundary of classifiers, and outline potential security breaches: single directions in input space that adversaries can exploit to break a classifier on most natural images. Introduction: can a single small image perturbation fool a state-of-the-art deep network on all natural images? The paper shows the existence of such quasi-imperceptible universal perturbation vectors; such perturbations are dubbed universal, as they are image-agnostic.]
[Introduction, continued: the existence of these perturbations is problematic when the classifier is deployed in real-world, and possibly hostile, environments, as they can be exploited by adversaries.]
[Figure 1: when added to a natural image, a universal perturbation image causes the image to be misclassified by the deep neural network with high probability. Original labels: Joystick, Whiptail lizard, Balloon, Lycaenid, Tibetan mastiff, Thresher, Grille, Flagpole, Face powder, Labrador; perturbed predictions include Chihuahua, Jay, Labrador, Tibetan mastiff, Brabancon griffon, Border terrier. arXiv:1610.08401v3 [cs.CV], 9 Mar 2017.]
Ref: https://arxiv.org/abs/1610.08401 5/32
[Figure 5 of the paper: adversarial examples generated for AlexNet [9]. (Left) a correctly predicted sample; (center) the difference between the correct image and the incorrectly predicted image, magnified 10x (values shifted by 128 and clamped); (right) the adversarial example.]
Original image / noise / perturbed image (two columns). All perturbed images are classified as "ostrich, Struthio camelus".
Ref: https://arxiv.org/abs/1312.6199 7/32
DL model
[Excerpt, Sec. 4 of the paper:] The minimizer r might not be unique; denote one such x + r by D(x, l) — informally, the closest image to x classified as l by f. This task is non-trivial only if f(x) ≠ l. Computing D(x, l) exactly is a hard problem, so it is approximated with box-constrained L-BFGS: line-search for the minimum c > 0 for which the minimizer r of

  minimize c·|r| + loss_f(x + r, l)   subject to   x + r ∈ [0, 1]^m

satisfies f(x + r) = l. This penalty method would yield the exact solution for convex losses, but neural networks are non-convex in general, so only an approximation is obtained. The resulting "minimum distortion" function D has intriguing properties, supported by informal evidence and quantitative experiments:
1. For all networks studied (MNIST models, QuocNet [10], AlexNet), visually hard-to-detect adversarial examples misclassified by the original network could always be generated.
2. Cross-model generalization: a relatively large fraction of the examples also fool networks trained from scratch with different hyper-parameters (number of layers, regularization, or initial weights).
3. Cross-training-set generalization: a relatively large fraction of the examples also fool networks trained on a disjoint training set.
Ref: https://arxiv.org/abs/1312.6199
10/32
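The penalty objective above is easy to sketch numerically. The toy below is an illustrative assumption throughout — a random 3-class softmax-regression "network" attacked with projected gradient descent instead of the paper's box-constrained L-BFGS — but it follows the same recipe: descend on c·|r| + loss_f(x + r, l) while keeping x + r inside [0, 1]^m.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))            # toy 3-class linear "network" f (illustrative)
x = rng.uniform(0.2, 0.8, size=8)      # input "image" inside [0, 1]^m
target = 2                             # adversarial target label l

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def prob_target(r):
    return softmax(W @ (x + r))[target]

c, lr = 0.1, 0.02
r = np.zeros_like(x)
p_before = prob_target(r)
for _ in range(1000):                  # projected gradient descent stands in for L-BFGS-B
    p = softmax(W @ (x + r))
    # (sub)gradient of c*|r|_1 + cross-entropy-toward-l with respect to r
    g = W.T @ (p - np.eye(3)[target]) + c * np.sign(r)
    r -= lr * g
    r = np.clip(x + r, 0.0, 1.0) - x   # project back into the box x + r in [0, 1]^m

print(float(p_before), "->", float(prob_target(r)))
```

In the paper, c is additionally line-searched to find the smallest value whose minimizer is still misclassified as l; the sketch fixes c for brevity.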
The noise η is computed with the fast gradient sign method: ε is a small constant, θ the model parameters, x the input image, and y the label.
[Excerpt, "The linear explanation of adversarial examples":] In many problems the precision of an individual input feature is limited — digital images often use only 8 bits per pixel, discarding all information below 1/255 of the dynamic range. Because feature precision is limited, it is not rational for a classifier to respond differently to an input x than to an adversarial input x̃ = x + η if every element of the perturbation is smaller than the precision of the features: for problems with well-separated classes, the classifier is expected to assign the same class to x and x̃ as long as ‖η‖∞ < ε, with ε small enough to be discarded by the sensor or data storage apparatus. Consider the dot product between a weight vector w and the adversarial example x̃:

  wᵀx̃ = wᵀx + wᵀη.

The adversarial perturbation grows the activation by wᵀη, which is maximized under the max-norm constraint by η = ε·sign(w). If w has n dimensions and the average magnitude of its elements is m, the activation grows by εmn: ‖η‖∞ does not grow with the dimensionality of the problem, yet the change in activation caused by η can grow linearly with n, so for high-dimensional problems many infinitesimal changes to the input add up to one large change in the output — a sort of "accidental steganography", where a linear model is forced to attend exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude. This shows that a simple linear model can have adversarial examples if its input has sufficient dimensionality. Previous explanations invoked hypothesized properties of neural networks, such as their supposed highly non-linear nature; the linearity-based hypothesis is simpler, and also explains why softmax regression is vulnerable to adversarial examples.
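The εmn argument is easy to check numerically (random weights of my own, not from the paper): with η = ε·sign(w), the max-norm ‖η‖∞ stays fixed at ε while the activation shift wᵀη = ε·Σ|wᵢ| grows linearly with the dimension n.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.007                            # max-norm budget, below typical 1/255 precision
for n in (100, 1_000, 10_000):         # input dimensionality
    w = rng.normal(size=n)             # weight vector of a linear model (illustrative)
    eta = eps * np.sign(w)             # worst-case perturbation with ||eta||_inf = eps
    # activation shift w.eta = eps * sum(|w_i|), i.e. roughly eps * m * n
    print(n, float(np.abs(eta).max()), float(w @ eta))
```

The printed max-norm stays at ε for every n while the activation shift scales roughly 100x from n = 100 to n = 10,000.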
We hypothesize ural networks are too linear to resist linear adversarial perturbation. LSTMs (Hochreiter & huber, 1997), ReLUs (Jarrett et al., 2009; Glorot et al., 2011), and maxout networks (Good- et al., 2013c) are all intentionally designed to behave in very linear ways, so that they are o optimize. More nonlinear models such as sigmoid networks are carefully tuned to spend GoogLeNet’s classiﬁcation of the image. Here our ✏ of .007 corresponds to th smallest bit of an 8 bit image encoding after GoogLeNet’s conversion to real nu Let ✓ be the parameters of a model, x the input to the model, y the targets ass machine learning tasks that have targets) and J (✓ , x , y ) be the cost used to train We can linearize the cost function around the current value of ✓ , obtaining an constrained pertubation of ⌘ = ✏sign ( r x J (✓ , x , y )) . We refer to this as the “fast gradient sign method” of generating adversarial exam required gradient can be computed efﬁciently using backpropagation. We ﬁnd that this method reliably causes a wide variety of models to misclass Fig. 1 for a demonstration on ImageNet. We ﬁnd that using ✏ = . 25 , we cause classiﬁer to have an error rate of 99.9% with an average conﬁdence of 79.3% on set1. In the same setting, a maxout network misclassiﬁes 89.4% of our advers an average conﬁdence of 97.6%. Similarly, using ✏ = . 1 , we obtain an error an average probability of 96.6% assigned to the incorrect labels when using a co network on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 20 simple methods of generating adversarial examples are possible. 
For example, rotating x by a small angle in the direction of the gradient also reliably produces adversarial examples. The fact that these simple, cheap algorithms generate misclassified examples is evidence in favor of interpreting adversarial examples as a result of linearity; the algorithms are also useful for speeding up adversarial training, or simply for analysis of trained networks.
[Figure 1 of the paper (published at ICLR 2015): a demonstration of fast adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet — x classified as "panda" with 57.7% confidence; the perturbation sign(∇ₓJ(θ, x, y)) classified as "nematode" with 8.2% confidence; x + ε·sign(∇ₓJ(θ, x, y)) classified as "gibbon" with 99.3% confidence. Adding an imperceptibly small vector whose elements equal the sign of the gradient of the cost with respect to the input changes GoogLeNet's classification of the image; ε = .007 corresponds to the magnitude of the smallest bit of an 8-bit image encoding after GoogLeNet's conversion to real numbers.]
Ref: https://arxiv.org/abs/1412.6572 11/32
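FGSM itself is a one-liner once ∇ₓJ is available. The sketch below uses a toy softmax-regression model with random weights (my own stand-in, not GoogLeNet or the paper's setup), treating the model's current prediction as the label y:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(10, 784))         # theta: toy softmax-regression weights
x = rng.uniform(0.0, 1.0, size=784)    # input "image" in [0, 1]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_x(x, y):
    """Gradient of the cross-entropy J(theta, x, y) with respect to the input x."""
    return W.T @ (softmax(W @ x) - np.eye(10)[y])

y = int(np.argmax(W @ x))              # use the current prediction as the label y

eps = 0.1
x_adv = np.clip(x + eps * np.sign(grad_x(x, y)), 0.0, 1.0)  # the FGSM step

p, p_adv = softmax(W @ x), softmax(W @ x_adv)
print(y, float(p[y]), "->", int(np.argmax(W @ x_adv)), float(p_adv[y]))
```

Because this toy model is linear in x, the cross-entropy is convex in the input, so even the clipped FGSM step is guaranteed to raise the loss and lower the probability of the original label; at this ε the predicted class typically flips as well.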
[Excerpt from the paper:] Deep networks are unstable to adversarial perturbations of the data [18]: very small and often imperceptible perturbations of the data samples are sufficient to fool state-of-the-art classifiers and result in incorrect classification (e.g., Figure 1). Formally, for a given classifier, the adversarial perturbation is defined as the minimal perturbation r that is sufficient to change the estimated label k̂(x):

  Δ(x; k̂) := min_r ‖r‖₂ subject to k̂(x + r) ≠ k̂(x),   (1)

where x is an image and k̂(x) the estimated label. Δ(x; k̂) is called the robustness of k̂ at point x; the robustness of the classifier k̂ is then defined from this quantity in expectation over the data. (To encourage reproducible research, the code of DeepFool is made available at http://github.com/lts4/deepfool.)
[Figure 1: the original image x is classified as k̂(x) = "whale"; the DeepFool-perturbed image x + r is classified as k̂(x + r) = "turtle"; the image perturbed by the fast gradient sign method [4] is also classified as "turtle", but DeepFool finds a smaller perturbation.]
[Figure 2: adversarial examples for a linear binary classifier.] The robustness of f at point x₀, Δ(x₀; f), equals the distance from x₀ to the separating affine hyperplane F = {x : wᵀx + b = 0}. The minimal perturbation that changes the classifier's decision is the orthogonal projection of x₀ onto F, given by the closed-form formula

  r*(x₀) := arg min_r ‖r‖₂ subject to sign(f(x₀ + r)) ≠ sign(f(x₀))
          = −(f(x₀)/‖w‖₂²)·w.   (3)

Algorithm 1 — DeepFool for binary classifiers:
1: input: image x, classifier f.
2: output: perturbation r̂.
3: initialize x₀ ← x, i ← 0.
4: while sign(f(xᵢ)) = sign(f(x₀)) do
5:   rᵢ ← −(f(xᵢ)/‖∇f(xᵢ)‖₂²)·∇f(xᵢ),
6:   xᵢ₊₁ ← xᵢ + rᵢ,
7:   i ← i + 1.
8: end while
9: return r̂ = Σᵢ rᵢ.
[Figure 3: illustration of Algorithm 1 for n = 2.]
[Header: "DeepFool: a simple and accurate method to fool deep neural networks" — Moosavi-Dezfooli, Fawzi, Frossard; École Polytechnique Fédérale de Lausanne.]
Original image: whale. Proposed method (DeepFool): turtle. Prior method (fast gradient sign method): turtle.
Ref: https://arxiv.org/abs/1511.04599 12/32
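For an affine classifier f(x) = wᵀx + b, Algorithm 1 converges in a single step: the update is exactly the orthogonal projection onto the hyperplane F. A minimal sketch (the 2% overshoot that makes the iterate actually cross the boundary is a common implementation detail, not part of the excerpt above):

```python
import numpy as np

def deepfool_binary(x0, f, grad_f, max_iter=50, overshoot=0.02):
    """DeepFool for a binary classifier: accumulate minimal steps to the boundary."""
    x = x0.copy()
    r_total = np.zeros_like(x0)
    for _ in range(max_iter):
        if np.sign(f(x)) != np.sign(f(x0)):   # label flipped: done
            break
        g = grad_f(x)
        r = -f(x) / (g @ g) * g               # closed-form step of Algorithm 1
        r_total += r
        x = x0 + (1 + overshoot) * r_total    # tiny overshoot to cross the boundary
    return r_total

# illustrative affine toy classifier f(x) = w.x + b
w, b = np.array([3.0, -4.0]), 2.0
f = lambda x: w @ x + b
grad_f = lambda x: w

x0 = np.array([1.0, 0.5])                      # f(x0) = 3 - 2 + 2 = 3 > 0
r = deepfool_binary(x0, f, grad_f)
print(r, np.sign(f(x0 + 1.02 * r)))            # r = [-0.36, 0.48]; sign flips to -1.0
```

Here ‖r̂‖₂ = |f(x₀)|/‖w‖₂ = 3/5, the distance from x₀ to F, matching the closed form r*(x₀) = −(f(x₀)/‖w‖₂²)·w.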
DL model
https://arxiv.org/abs/1412.6572 : accumulate a small step along the gradient direction of the loss function; solve for the noise term: the noise η is computed with the fast gradient sign method, η = ε·sign(∇ₓJ(θ, x, y)), where ε is a small constant, θ the model parameters, x the input image, and y the label.
https://arxiv.org/abs/1511.04599 : add up small displacements in the direction that crosses the decision boundary; robustness is defined as Δ(x; k̂) := min_r ‖r‖₂ subject to k̂(x + r) ≠ k̂(x); an algorithm for binary classification (extensible to multi-class).
[The slide repeats screenshot excerpts of the papers quoted on the earlier slides.]
For example, digital often use only 8 bits per pixel so they discard all information below 1 / 255 of the dynamic Because the precision of the features is limited, it is not rational for the classiﬁer to respond ntly to an input x than to an adversarial input ˜ x = x + ⌘ if every element of the perturbation aller than the precision of the features. Formally, for problems with well-separated classes, ect the classiﬁer to assign the same class to x and ˜ x so long as || ⌘ ||1 < ✏, where ✏ is small to be discarded by the sensor or data storage apparatus associated with our problem. er the dot product between a weight vector w and an adversarial example ˜ x : w > ˜ x = w > x + w > ⌘ . versarial perturbation causes the activation to grow by w > ⌘ .We can maximize this increase to the max norm constraint on ⌘ by assigning ⌘ = sign (w) . If w has n dimensions and the magnitude of an element of the weight vector is m, then the activation will grow by ✏mn. |⌘||1 does not grow with the dimensionality of the problem but the change in activation by perturbation by ⌘ can grow linearly with n, then for high dimensional problems, we can many inﬁnitesimal changes to the input that add up to one large change to the output. We nk of this as a sort of “accidental steganography,” where a linear model is forced to attend vely to the signal that aligns most closely with its weights, even if multiple signals are present er signals have much greater amplitude. planation shows that a simple linear model can have adversarial examples if its input has suf- dimensionality. Previous explanations for adversarial examples invoked hypothesized prop- f neural networks, such as their supposed highly non-linear nature. Our hypothesis based arity is simpler, and can also explain why softmax regression is vulnerable to adversarial es. 
NEAR PERTURBATION OF NON-LINEAR MODELS 57.7% conﬁdence 8.2% conﬁdence 99.3 % Figure 1: A demonstration of fast adversarial example generation applied to G et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose el the sign of the elements of the gradient of the cost function with respect to the in GoogLeNet’s classiﬁcation of the image. Here our ✏ of .007 corresponds to th smallest bit of an 8 bit image encoding after GoogLeNet’s conversion to real nu Let ✓ be the parameters of a model, x the input to the model, y the targets ass machine learning tasks that have targets) and J (✓ , x , y ) be the cost used to train We can linearize the cost function around the current value of ✓ , obtaining an constrained pertubation of ⌘ = ✏sign ( r x J (✓ , x , y )) . We refer to this as the “fast gradient sign method” of generating adversarial exam required gradient can be computed efﬁciently using backpropagation. We ﬁnd that this method reliably causes a wide variety of models to misclass Fig. 1 for a demonstration on ImageNet. We ﬁnd that using ✏ = . 25 , we cause classiﬁer to have an error rate of 99.9% with an average conﬁdence of 79.3% on set1. In the same setting, a maxout network misclassiﬁes 89.4% of our advers an average conﬁdence of 97.6%. Similarly, using ✏ = . 1 , we obtain an error an average probability of 96.6% assigned to the incorrect labels when using a co network on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 20 simple methods of generating adversarial examples are possible. For example, rotating x by a small angle in the direction of the gradient reliably produces adv ably quantify the robustness of these classiﬁers. Extensive experimental results show that our approach outperforms recent methods in the task of computing adversarial pertur- bations and making classiﬁers more robust.1 1. 
Introduction Deep neural networks are powerful learning models that achieve state-of-the-art pattern recognition performance in many research areas such as bioinformatics [1, 16], speech [12, 6], and computer vision [10, 8]. Though deep net- works have exhibited very good performance in classiﬁca- tion tasks, they have recently been shown to be particularly unstable to adversarial perturbations of the data [18]. In fact, very small and often imperceptible perturbations of the data samples are sufﬁcient to fool state-of-the-art classiﬁers and result in incorrect classiﬁcation. (e.g., Figure 1). For- mally, for a given classiﬁer, we deﬁne an adversarial per- turbation as the minimal perturbation r that is sufﬁcient to change the estimated label ˆ k (x) : (x; ˆ k ) := min r k r k2 subject to ˆ k (x + r) 6 = ˆ k (x) , (1) where x is an image and ˆ k (x) is the estimated label. We call (x; ˆ k ) the robustness of ˆ k at point x . The robustness of classiﬁer ˆ k is then deﬁned as 1To encourage reproducible research, the code of DeepFool is made available at http://github.com/lts4/deepfool Figure 1: An example of adversarial First row: the original image x that i ˆ k (x) =“whale”. Second row: the image x as ˆ k (x + r) =“turtle” and the corresponding computed by DeepFool. Third row: the im as “turtle” and the corresponding perturba by the fast gradient sign method [4]. Deep smaller perturbation. arXiv:1511.04599v3 [c F f( x ) < 0 f( x ) > 0 r⇤ ( x ) (x 0 ;f) x0 Figure 2: Adversarial examples for a linear binary classiﬁer. Algorithm 1 DeepFool for binary classiﬁers 1: input: Image x , classiﬁer f. 2: output: Perturbation ˆ r . 3: Initialize x0 x , i 0 . 4: while sign ( f (xi)) = sign ( f (x0)) do 5: ri f ( xi) kr f ( xi)k2 2 rf (xi) , 6: xi +1 xi + ri , 7: i i + 1 . 8: end while 9: return ˆ r = P i ri . 13/32
To realize v as an effective perturbation that induces misclassification, two constraints are imposed:
1. A constraint guaranteeing that the noise does not become large.
2. A constraint guaranteeing that the noise induces misclassification at more than a given rate.
→ We want an algorithm that makes (ξ, δ) as small as possible.
For each data point, find a vector that steps across the decision boundary.
[Excerpt:] Let μ denote a distribution of images in R^d, and let k̂ be a classification function that outputs an estimated label k̂(x) for each image x ∈ R^d. We seek perturbation vectors v ∈ R^d that fool the classifier k̂ on almost all data points sampled from μ, i.e. a vector v such that k̂(x + v) ≠ k̂(x) for "most" x ∼ μ. Such a perturbation is coined universal: a fixed image-agnostic perturbation that causes a label change for most images sampled from the data distribution μ. Here μ represents the set of natural images, hence containing a huge amount of variability, and we examine the existence of small universal perturbations (in terms of the ℓₚ norm with p ∈ [1, ∞)) that misclassify most images. The goal is therefore to find v satisfying the two constraints:
1. ‖v‖ₚ ≤ ξ,
2. P_{x∼μ}( k̂(x + v) ≠ k̂(x) ) ≥ 1 − δ.
The parameter ξ controls the magnitude of the perturbation vector v, and δ quantifies the desired fooling rate for images sampled from the distribution μ.
Algorithm. Let X = {x₁, …, x_m} be a set of images sampled from the distribution μ. The proposed algorithm seeks a universal perturbation v with ‖v‖ₚ ≤ ξ that fools most data points in X. It proceeds iteratively over the data points in X and gradually builds up the universal perturbation (Fig. 2): at each iteration, the minimal perturbation Δvᵢ that sends the current perturbed point xᵢ + v to the decision boundary of the classifier is computed and aggregated into the current instance of the universal perturbation, provided the current v does not already fool xᵢ.
15/32
Algorithm 1 — computation of universal perturbations:
1: input: data points X, classifier k̂, desired ℓₚ norm ξ of the perturbation, desired accuracy δ on perturbed samples.
2: output: universal perturbation vector v.
3: initialize v ← 0.
4: while Err(X_v) ≤ 1 − δ do
5:   for each data point xᵢ ∈ X do
6:     if k̂(xᵢ + v) = k̂(xᵢ) then
7:       compute the minimal perturbation sending xᵢ + v to the decision boundary:
         Δvᵢ ← arg min_r ‖r‖₂ s.t. k̂(xᵢ + v + r) ≠ k̂(xᵢ)
8:       update the perturbation: v ← P_{p,ξ}(v + Δvᵢ).
9:     end if
10:  end for
11: end while
[Figure 2: schematic representation of the algorithm; data points x₁, x₂, x₃ are superimposed, and the classification regions Rᵢ (regions of constant estimated label) are shown in different colors; the algorithm aggregates sequentially the minimal perturbations sending the current perturbed points xᵢ + v outside their classification region Rᵢ.]
To ensure that the constraint ‖v‖ₚ ≤ ξ is satisfied, the updated universal perturbation is further projected onto the ℓₚ ball of radius ξ centered at 0; that is, with the projection operator

  P_{p,ξ}(v) = arg min_{v′} ‖v − v′‖₂ subject to ‖v′‖ₚ ≤ ξ,

the update rule is v ← P_{p,ξ}(v + Δvᵢ). Several passes over the data set X are performed to improve the quality of the universal perturbation.
The algorithm terminates when the empirical "fooling rate" on the perturbed data set X_v := {x₁ + v, …, x_m + v} exceeds the target threshold 1 − δ; that is, it stops whenever

  Err(X_v) := (1/m) Σᵢ₌₁^m 1[ k̂(xᵢ + v) ≠ k̂(xᵢ) ] ≥ 1 − δ.

(The detailed procedure is given in Algorithm 1.) Interestingly, in practice the number of data points m in X need not be large to compute a universal perturbation that is valid for the whole distribution μ; in particular, m can be much smaller than the number of training points (see Section 3). The algorithm solves at most m instances of the optimization problem in Eq. (1) per pass; this problem is not convex when k̂ is a deep neural network, so it is solved approximately (DeepFool in practice).
For each data point, compute the shortest vector to the decision boundary.
While preserving the vector to the decision boundary as much as possible, keep the Lp-norm below the limit.
16/32
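The outer loop is short once Δvᵢ and P_{p,ξ} are available. The sketch below is a deliberately easy toy (my own assumptions: an affine binary classifier with every sample initially on the same side, so a single shift direction can fool them all, and the closed-form DeepFool step for Δvᵢ with p = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = np.array([1.0, 2.0]), 12.0            # affine toy classifier, one-sided samples
khat = lambda x: int(w @ x + b > 0)          # estimated label

def min_perturbation(x):
    """Closed-form minimal L2 step across the hyperplane (DeepFool step, 2% overshoot)."""
    return -1.02 * (w @ x + b) / (w @ w) * w

def project(v, xi):
    """P_{2,xi}: projection of v onto the L2 ball of radius xi centered at 0."""
    n = np.linalg.norm(v)
    return v if n <= xi else v * (xi / n)

X = rng.normal(size=(200, 2))                # sampled "images"
xi, delta = 10.0, 0.2                        # norm budget, target fooling rate 1 - delta

v = np.zeros(2)
for _ in range(10):                          # passes over X (outer loop of Algorithm 1)
    for x in X:
        if khat(x + v) == khat(x):           # v does not fool x yet
            v = project(v + min_perturbation(x + v), xi)
    err = np.mean([khat(x + v) != khat(x) for x in X])
    if err >= 1 - delta:                     # empirical fooling rate reached
        break

print(round(float(err), 2), round(float(np.linalg.norm(v)), 2))
```

For p = ∞, the projection would instead clip each coordinate of v to [−ξ, ξ].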
Experiments use the combinations (p, ξ) = (2, 2000) and (∞, 10).
The L2-norm gives better results overall, but some models are better matched by the L∞-norm.
Misclassifications of GoogLeNet induced by the universal perturbation noise.

Table 1 — fooling ratios on the set X and on the validation set:
           CaffeNet [8]  VGG-F [2]  VGG-16 [17]  VGG-19 [17]  GoogLeNet [18]  ResNet-152 [6]
ℓ2   X     85.4%         85.9%      90.7%        86.9%        82.9%           89.7%
     Val.  85.6%         87.0%      90.3%        84.5%        82.0%           88.5%
ℓ∞   X     93.1%         93.8%      78.5%        77.8%        80.8%           85.4%
     Val.  93.3%         93.7%      78.3%        77.8%        78.9%           84.0%

Each result is reported on the set X used to compute the perturbation, as well as on the validation set (not used in computing the universal perturbation). For all networks, the universal perturbation achieves very high fooling rates on the validation set; the perturbations computed for CaffeNet and VGG-F fool more than 90% of the validation set (for p = ∞): for any natural image in the validation set, merely adding the universal perturbation fools the classifier more than 9 times out of 10. This is not specific to these architectures, as universal perturbations also fool VGG, GoogLeNet, and ResNet classifiers on natural images with probability edging 80%.
While the above universal perturbations are computed for a set X of 10,000 images from the training set (on average 10 images per class), the influence of the size of X is also examined: Fig. 6 shows the fooling rates on the validation set for different sizes of X for GoogLeNet. With a set X containing only 500 images, more than 30% of the validation images are fooled — significant compared to the number of ImageNet classes (1000), since a large set of unseen images is fooled even with less than one image per class in X!
The universal perturbations computed using Algorithm 1 therefore have a remarkable generalization power over unseen data points, and can be computed on a very small set of training images.
Ref: https://arxiv.org/abs/1610.08401
[Figure 3: examples of perturbed images and their corresponding labels (wool, Indian elephant, African grey, tabby, common newt, carousel, grey fox, macaw, three-toed sloth); the first 8 images belong to the ILSVRC 2012 validation set.]
18/32
Results with (p, ξ) = (∞, 10). A different perturbation is obtained for each model.
Ref: https://arxiv.org/abs/1610.08401
[Figure 3 (continued): the first 8 images belong to the ILSVRC 2012 validation set, the last 4 were taken with a mobile phone camera; see supplementary material for the original images. Figure 4: universal perturbations computed for different deep neural network architectures — (a) CaffeNet, (b) VGG-F, (c) VGG-16, (d) VGG-19, (e) GoogLeNet, (f) ResNet-152 — generated with p = ∞.]
19/32
This shows that the universal perturbation noise is not unique.
Ref: https://arxiv.org/abs/1610.08401
[Figure 5: diversity of universal perturbations for the GoogLeNet architecture; five perturbations generated from different random shufflings of the set X. The normalized inner product of any pair of universal perturbations does not exceed 0.1, highlighting their diversity.]
[First rows of Table 2, shown complete on a later slide: VGG-F 93.7% / 71.8% / 48.4% / 42.1% / 42.1% / 47.4%; CaffeNet 74.0% / 93.3% / 47.7% / 39.9% / 39.9% / 48.0%.]
20/32
Using GoogLeNet: a high misclassification rate is achieved with few data points.
This suggests that, for many data points, the nearby decision boundary has a similar geometry (structure).
Ref: https://arxiv.org/abs/1610.08401
[Residue: a truncated copy of Table 2 (complete on the next slide); the fooling-ratio (%) vs. number-of-images-in-X plot (500 / 1000 / 2000 / 4000) for GoogLeNet; and a shredded column describing a fine-tuning experiment in which universal perturbations (p = ∞, ξ = 10) are added to training samples with some probability and the network is fine-tuned on the modified training set to assess its robustness.]
21/32
If the architectures are similar, the effect appears to be higher (VGG-16 ↔ VGG-19).
The computed universal perturbation noise is "universal" in a double sense:
1. One noise vector applies to a wide variety of data points.
2. One noise vector applies to a wide variety of models.
Ref: https://arxiv.org/abs/1610.08401

Table 2 — generalizability of the universal perturbations across different networks (rows: architecture for which the perturbation is computed; columns: architecture on which the fooling rate is reported):
            VGG-F   CaffeNet  GoogLeNet  VGG-16  VGG-19  ResNet-152
VGG-F       93.7%   71.8%     48.4%      42.1%   42.1%   47.4%
CaffeNet    74.0%   93.3%     47.7%      39.9%   39.9%   48.0%
GoogLeNet   46.2%   43.8%     78.9%      39.2%   39.8%   45.5%
VGG-16      63.4%   55.8%     56.5%      78.3%   73.1%   63.4%
VGG-19      64.0%   57.2%     53.6%      73.5%   77.8%   58.0%
ResNet-152  46.3%   46.3%     50.5%      47.0%   45.5%   84.0%
22/32
Compute the matrix N of decision-boundary normals and compare its singular values with those of a matrix whose columns are sampled uniformly at random from the unit sphere.
Ref: https://arxiv.org/abs/1610.08401

To study correlations between different regions of the decision boundary of the classifier, the paper defines the matrix of normal vectors to the decision boundary in the vicinity of n data points from the validation set:

    N = [ r(x_1)/||r(x_1)||_2  ...  r(x_n)/||r(x_n)||_2 ]

For binary linear classifiers the decision boundary is a hyperplane and N has rank 1, as all normal vectors are collinear. To capture the correlations in the decision boundary of complex classifiers more generally, the singular values of N are computed (for the CaffeNet architecture, Fig. 9). While the singular values of the randomly sampled matrix decay slowly, those of N decay quickly.
[Figure 8: Comparison between fooling rates of different perturbations; experiments performed on the CaffeNet architecture.]
27/32
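The correlation argument can be reproduced in miniature: build one matrix whose unit-norm columns mostly live in a low-dimensional subspace (a synthetic stand-in for the boundary normals r(x_i)/||r(x_i)||_2) and one whose columns are uniform on the unit sphere, then compare how fast their singular-value spectra decay. All dimensions and noise levels below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 1000, 500, 20   # input dim, number of data points, subspace dim

# Correlated columns: mostly confined to a k-dimensional subspace,
# mimicking decision-boundary normals that share common directions.
N_corr = rng.standard_normal((d, k)) @ rng.standard_normal((k, n))
N_corr += 0.1 * rng.standard_normal((d, n))
N_corr /= np.linalg.norm(N_corr, axis=0)

# Baseline: columns sampled uniformly at random from the unit sphere.
N_rand = rng.standard_normal((d, n))
N_rand /= np.linalg.norm(N_rand, axis=0)

s_corr = np.linalg.svd(N_corr, compute_uv=False)
s_rand = np.linalg.svd(N_rand, compute_uv=False)

# Fraction of spectral energy captured by the top-k singular values:
# close to 1 for the correlated matrix, far smaller for the random one.
energy_corr = (s_corr[:k] ** 2).sum() / (s_corr ** 2).sum()
energy_rand = (s_rand[:k] ** 2).sum() / (s_rand ** 2).sum()
```

The fast decay of `s_corr` mirrors Fig. 9: the boundary normals concentrate in a low-dimensional subspace, which is exactly what makes a single universal direction possible.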
By examining this structure, the existence of universal perturbations was proven (https://arxiv.org/abs/1705.09554).
[Figure 9: Singular values of the matrix N of normal vectors to the decision boundary, compared with columns sampled at random.]
28/32
• A strong perturbation can be constructed from few data points.
• The perturbation can be built by summing up vectors along the normal directions of the decision boundary at the data points.
• The normal vectors of the decision boundary point in a common direction across many data points and are strongly correlated (the different decision-boundary regions can be described in a low-dimensional subspace).
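The construction in the second bullet can be sketched as an aggregation loop, in the spirit of the paper's algorithm: for every image the running perturbation does not yet fool, add a minimal step across the nearby decision boundary (computed with DeepFool in the paper; an abstract callback here), then project the total back onto the norm ball. The function names and the callback interface are illustrative assumptions, not the paper's exact pseudocode:

```python
import numpy as np

def project_l2_ball(v, xi):
    """Project v onto the l2 ball of radius xi (enforces ||v||_2 <= xi)."""
    norm = np.linalg.norm(v)
    return v if norm <= xi else v * (xi / norm)

def universal_perturbation(images, minimal_step, xi=10.0, n_passes=5):
    """Sketch of the aggregation loop. `minimal_step(z)` should return the
    smallest perturbation sending z across the decision boundary (the
    DeepFool inner step in the paper), or None if z is already fooled."""
    v = np.zeros_like(images[0])
    for _ in range(n_passes):
        for x in images:
            dv = minimal_step(x + v)
            if dv is not None:
                v = project_l2_ball(v + dv, xi)
    return v
```

With a linear toy classifier, a couple of passes suffice to flip every image's label while keeping the perturbation within the norm budget.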
[Figure 1: When added to a natural image, a universal perturbation image causes the image to be misclassified by the deep neural network with high probability. Left images: original natural images; the labels are shown on top of each image.]
32/32