Slide 13
Slide 13 text
How prior methods generate the noise
https://arxiv.org/abs/1312.6199 : computes a small noise that steers the input toward a chosen class
The noise r is obtained from the following problem:
x: input image, l: label, f: DL model
https://arxiv.org/abs/1412.6572 : piles a tiny amount onto the input along the gradient direction of the objective function
Expanding the noise term:
The noise η is computed by the following formula (fast gradient sign method):
ε: small constant, θ: model parameters, x: input image, y: label
https://arxiv.org/abs/1511.04599 : accumulates, little by little, displacements in the direction that crosses the decision boundary
Robustness is defined as:
Algorithm for binary classifiers (extensible to the multiclass case):
[Excerpt from Szegedy et al., arXiv:1312.6199, Sec. 4.1]
The noise r is the solution of

    Minimize ||r||_2 subject to f(x + r) = l and x + r ∈ [0, 1]^m.

The minimizer r might not be unique, but we denote one such x + r for an arbitrarily chosen minimizer by D(x, l). Informally, x + r is the closest image to x classified as l by f. Obviously, D(x, f(x)) = x, so this task is non-trivial only if f(x) ≠ l. In general, the exact computation of D(x, l) is a hard problem, so we approximate it by using a box-constrained L-BFGS. Concretely, we find an approximation of D(x, l) by performing line-search to find the minimum c > 0 for which the minimizer r of the following problem satisfies f(x + r) = l:

    Minimize c|r| + loss_f(x + r, l) subject to x + r ∈ [0, 1]^m.

This penalty function method would yield the exact solution for D(x, l) in the case of convex losses; however, neural networks are non-convex in general, so we end up with an approximation in this case.
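A minimal sketch of this penalty minimization (our own illustration in PyTorch; the paper uses box-constrained L-BFGS with a line search over c, which is replaced here by a plain Adam loop plus clamping, and all names are ours):

import torch

def approx_D(model, x, target, c, steps=200, lr=0.01):
    """Approximate D(x, l): minimize c*|r| + loss_f(x + r, l) with x + r kept in [0, 1]^m.

    x: batched image tensor with values in [0, 1]; target: tensor of target class indices l.
    """
    r = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([r], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        x_adv = (x + r).clamp(0.0, 1.0)                     # box constraint x + r in [0, 1]^m
        loss = c * r.norm() + loss_fn(model(x_adv), target)  # c*|r| + loss_f(x + r, l)
        loss.backward()
        opt.step()
    return (x + r).clamp(0.0, 1.0).detach(), r.detach()

The paper's outer line search would then look for the smallest c > 0 for which the returned x + r is actually classified as l.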
4.2 Experimental results
Our "minimum distortion" function D has the following intriguing properties, which we support by informal evidence and quantitative experiments in this section:
1. For all the networks we studied (MNIST, QuocNet [10], AlexNet), for each sample we have always managed to generate very close, visually hardly distinguishable adversarial examples that are misclassified by the original network (see http://goo.gl/huaGPb for examples).
2. Cross model generalization: a relatively large fraction of examples will be misclassified by networks trained from scratch with different hyper-parameters (number of layers, regularization or initial weights).
3. Cross training-set generalization: a relatively large fraction of examples will be misclassified by networks trained from scratch on a disjoint training set.
[Excerpt from Goodfellow et al., arXiv:1412.6572]
THE LINEAR EXPLANATION OF ADVERSARIAL EXAMPLES
We start with explaining the existence of adversarial examples for linear models.
In many problems, the precision of an individual input feature is limited. For example, digital images often use only 8 bits per pixel, so they discard all information below 1/255 of the dynamic range. Because the precision of the features is limited, it is not rational for the classifier to respond differently to an input x than to an adversarial input x̃ = x + η if every element of the perturbation is smaller than the precision of the features. Formally, for problems with well-separated classes, we expect the classifier to assign the same class to x and x̃ so long as ||η||_∞ < ε, where ε is small enough to be discarded by the sensor or data storage apparatus associated with our problem.
Consider the dot product between a weight vector w and an adversarial example x̃:

    w^T x̃ = w^T x + w^T η.

The adversarial perturbation causes the activation to grow by w^T η. We can maximize this increase subject to the max norm constraint on η by assigning η = sign(w). If w has n dimensions and the average magnitude of an element of the weight vector is m, then the activation will grow by εmn. Since ||η||_∞ does not grow with the dimensionality of the problem but the change in activation caused by perturbation by η can grow linearly with n, then for high dimensional problems we can make many infinitesimal changes to the input that add up to one large change to the output. We can think of this as a sort of "accidental steganography," where a linear model is forced to attend exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude.
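This linear growth is easy to check numerically. Below is a minimal sketch (plain NumPy, all names are ours) that applies η = ε·sign(w) to a random weight vector and shows the activation shift w^T η = ε·||w||_1 growing with the dimension n while ||η||_∞ stays at ε:

import numpy as np

eps = 0.007
rng = np.random.default_rng(0)

for n in (10, 1_000, 100_000):
    w = rng.normal(size=n)           # weights of a linear model
    x = rng.normal(size=n)           # some input
    eta = eps * np.sign(w)           # max-norm bounded perturbation
    shift = w @ (x + eta) - w @ x    # activation change, equals eps * ||w||_1
    print(f"n={n:>7}  ||eta||_inf={np.abs(eta).max():.3f}  shift={shift:.2f}")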
This explanation shows that a simple linear model can have adversarial examples if its input has sufficient dimensionality. Previous explanations for adversarial examples invoked hypothesized properties of neural networks, such as their supposed highly non-linear nature. Our hypothesis based on linearity is simpler, and can also explain why softmax regression is vulnerable to adversarial examples.
LINEAR PERTURBATION OF NON-LINEAR MODELS
[Figure 1 panel labels: 57.7% confidence / 8.2% confidence / 99.3% confidence]
Figure 1: A demonstration of fast adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet's classification of the image. Here our ε of .007 corresponds to the magnitude of the smallest bit of an 8 bit image encoding after GoogLeNet's conversion to real numbers.
Let θ be the parameters of a model, x the input to the model, y the targets associated with x (for machine learning tasks that have targets) and J(θ, x, y) be the cost used to train the neural network. We can linearize the cost function around the current value of θ, obtaining an optimal max-norm constrained perturbation of

    η = ε sign(∇_x J(θ, x, y)).

We refer to this as the "fast gradient sign method" of generating adversarial examples. Note that the required gradient can be computed efficiently using backpropagation.
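A minimal PyTorch sketch of this one-step perturbation (our own code; the model, loss function, and variable names are assumptions, not the paper's implementation):

import torch

def fgsm_perturbation(model, loss_fn, x, y, eps):
    """Compute eta = eps * sign(grad_x J(theta, x, y)) and the perturbed input."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)         # J(theta, x, y)
    loss.backward()                     # gradient w.r.t. the input via backpropagation
    eta = eps * x.grad.sign()           # max-norm constrained perturbation
    x_adv = (x + eta).clamp(0.0, 1.0)   # keep the image in a valid pixel range
    return x_adv.detach(), eta.detach()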
We find that this method reliably causes a wide variety of models to misclassify their input. See Fig. 1 for a demonstration on ImageNet. We find that using ε = .25, we cause a shallow softmax classifier to have an error rate of 99.9% with an average confidence of 79.3% on the MNIST test set. In the same setting, a maxout network misclassifies 89.4% of our adversarial examples with an average confidence of 97.6%. Similarly, using ε = .1, we obtain an error rate of 87.15% and an average probability of 96.6% assigned to the incorrect labels when using a convolutional maxout network on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 2009) test set. Other simple methods of generating adversarial examples are possible. For example, we also found that rotating x by a small angle in the direction of the gradient reliably produces adversarial examples.
[Excerpt from Moosavi-Dezfooli et al. (DeepFool), arXiv:1511.04599]
[...] reliably quantify the robustness of these classifiers. Extensive experimental results show that our approach outperforms recent methods in the task of computing adversarial perturbations and making classifiers more robust. (To encourage reproducible research, the code of DeepFool is made available at http://github.com/lts4/deepfool.)

1. Introduction
Deep neural networks are powerful learning models that achieve state-of-the-art pattern recognition performance in many research areas such as bioinformatics [1, 16], speech [12, 6], and computer vision [10, 8]. Though deep networks have exhibited very good performance in classification tasks, they have recently been shown to be particularly unstable to adversarial perturbations of the data [18]. In fact, very small and often imperceptible perturbations of the data samples are sufficient to fool state-of-the-art classifiers and result in incorrect classification (e.g., Figure 1). Formally, for a given classifier, we define an adversarial perturbation as the minimal perturbation r that is sufficient to change the estimated label k̂(x):

    Δ(x; k̂) := min_r ||r||_2 subject to k̂(x + r) ≠ k̂(x),    (1)

where x is an image and k̂(x) is the estimated label. We call Δ(x; k̂) the robustness of k̂ at point x. The robustness of classifier k̂ is then defined as the expectation ρ_adv(k̂) = E_x[ Δ(x; k̂) / ||x||_2 ] over the distribution of data.
Figure 1: An example of adversarial perturbations. First row: the original image x that is classified as k̂(x) = "whale". Second row: the image x + r classified as k̂(x + r) = "turtle" and the corresponding perturbation r computed by DeepFool. Third row: the image classified as "turtle" and the corresponding perturbation computed by the fast gradient sign method [4]. DeepFool leads to a smaller perturbation.
[Figure 2 labels: hyperplane F, regions f(x) < 0 and f(x) > 0, point x0, minimal perturbation r*(x) of magnitude Δ(x0; f)]
Figure 2: Adversarial examples for a linear binary classifier.
Algorithm 1 DeepFool for binary classifiers
1: input: Image x, classifier f.
2: output: Perturbation r̂.
3: Initialize x_0 ← x, i ← 0.
4: while sign(f(x_i)) = sign(f(x_0)) do
5:     r_i ← −( f(x_i) / ||∇f(x_i)||_2^2 ) ∇f(x_i),
6:     x_{i+1} ← x_i + r_i,
7:     i ← i + 1.
8: end while
9: return r̂ = Σ_i r_i.
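A minimal NumPy sketch of Algorithm 1 on a toy affine classifier f(x) = w·x + b (our own setup and names; the released implementation lives at http://github.com/lts4/deepfool). The small per-step overshoot is our numerical convenience to push x_i just past the boundary, in the spirit of the small overshoot factor the paper applies in practice:

import numpy as np

def deepfool_binary(x, f, grad_f, max_iter=50, overshoot=1e-4):
    """DeepFool for a binary classifier f (Algorithm 1): accumulate r_i until the sign of f flips."""
    x_i = x.astype(float)
    r_hat = np.zeros_like(x_i)
    for _ in range(max_iter):
        if np.sign(f(x_i)) != np.sign(f(x)):       # label flipped: stop
            break
        g = grad_f(x_i)
        r_i = -f(x_i) / np.dot(g, g) * g           # r_i = -f(x_i) / ||grad f(x_i)||^2 * grad f(x_i)
        x_i = x_i + (1 + overshoot) * r_i          # step onto (slightly past) the boundary
        r_hat += (1 + overshoot) * r_i
    return r_hat

# Toy affine classifier f(x) = w.x + b with constant gradient w.
w, b = np.array([2.0, -1.0]), 0.5
x0 = np.array([1.0, 3.0])                          # f(x0) = -0.5 < 0
r = deepfool_binary(x0, f=lambda x: w @ x + b, grad_f=lambda x: w)
print(r, np.sign(w @ (x0 + r) + b))                # perturbation and the flipped sign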