Universal adversarial perturbations

yoppe
August 06, 2017


Presentation slides for the 41st Computer Vision Study Meeting @ Kanto (Yohei Kikuta)
https://kantocv.connpass.com/event/62020/


Transcript

  1. Universal adversarial perturbations Yohei KIKUTA @yohei_kikuta 2017/08/06
 The 41st Computer Vision Study Meeting @ Kanto

  2. References 2/32

  3. [Papers]
 • Intriguing properties of neural networks: https://arxiv.org/abs/1312.6199
 • Explaining and Harnessing Adversarial Examples: https://arxiv.org/abs/1412.6572
 • DeepFool: a simple and accurate method to fool deep neural networks: https://arxiv.org/abs/1511.04599
 • Universal adversarial perturbations: https://arxiv.org/abs/1610.08401
 • Delving into Transferable Adversarial Examples and Black-box Attacks: https://arxiv.org/abs/1611.02770
 • On Detecting Adversarial Perturbations: https://arxiv.org/abs/1702.04267
 • Analysis of universal adversarial perturbations: https://arxiv.org/abs/1705.09554
 • NO Need to Worry about Adversarial Examples in Object Detection in Autonomous Vehicles: https://arxiv.org/abs/1707.03501
 [Web pages]
 • Robust Adversarial Examples: https://blog.openai.com/robust-adversarial-inputs/
 3/32
  4. Summary 4/32

  5. Summary
 • Discovered a universal perturbation noise that makes Deep Learning models misclassify
 • A single noise misclassifies a large number of images
 • The same noise can be applied to different models
 • A strong noise can be built from only a small amount of data
 • The noise can be constructed by summing up, for each data point, a vector in the normal direction from the data point toward the decision boundary
 • The normal vectors to the decision boundary point in a common direction across many data points and are strongly correlated
   (the different decision-boundary regions can be described in low dimensions)

 [Slide shows the first page of the paper: Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, Pascal Frossard, "Universal adversarial perturbations", arXiv:1610.08401v3 [cs.CV]. The abstract shows the existence of a universal (image-agnostic), quasi-imperceptible perturbation vector that causes natural images to be misclassified with high probability, proposes a systematic algorithm for computing it, and notes that such perturbations generalize well across networks, reveal geometric correlations among the high-dimensional decision boundaries of classifiers, and outline potential security breaches. Figure 1 shows original images and their perturbed versions misclassified as e.g. Joystick → Chihuahua, Tibetan mastiff → Labrador.]
 Ref: https://arxiv.org/abs/1610.08401 5/32
  6. Introduction 6/32

  7. Deep Learning is easily fooled
 By adding a perturbation noise whose difference is imperceptible to the human eye, a Deep Learning (DL) model can be made to misclassify!

 [Slide shows Figure 5 of the referenced paper: adversarial examples generated for AlexNet. Left: correctly predicted samples; center: the difference between the correct image and the incorrectly predicted image, magnified 10x (values shifted by 128 and clamped); right: adversarial examples. Original image / noise / transformed image: every transformed image in the right column is predicted as "ostrich, Struthio camelus".]
 Ref: https://arxiv.org/abs/1312.6199 7/32
  8. Why understanding misclassification matters
 Understanding the mechanism by which DL misclassifies is important both for applications and for theory
 • Improving the robustness of DL-based authentication systems
   Reducing misclassification is indispensable for realizing image-based authentication, autonomous driving, and the like
 • Improving the generalization performance of DL
   Knowing the "habits" of DL is useful for building better models
 • Clarifying the differences between human and DL recognition
   Useful for understanding where our human recognition and DL recognition agree and where they differ
 8/32
  9. Where this paper stands
 https://arxiv.org/abs/1312.6199 : discovered a method of making DL misclassify by adding noise to a target image
 https://arxiv.org/abs/1412.6572 : explained the mechanism by which the noise is amplified through high-dimensional inner products;
   found that the same noise makes other models misclassify in the same way
 https://arxiv.org/abs/1511.04599 : constructed a smaller noise by taking the decision boundary into account
 https://arxiv.org/abs/1610.08401 : constructed a single universal noise applicable to a large number of images
 ・this paper brought the invariance of misclassification-inducing noise into sharp focus
 ・it raised the momentum for clarifying general properties of DL decision boundaries, leading to later theoretical analyses
 ・one of the main works in the line of research toward building models that do not misclassify
 https://arxiv.org/abs/1702.04267 : proposed a scheme for detecting misclassification-inducing images
 https://arxiv.org/abs/1705.09554 : theoretically proved that a universal noise can be constructed
 https://arxiv.org/abs/1707.03501 : argued that misclassification does not arise for geometrically transformed images
 https://blog.openai.com/robust-adversarial-inputs/ : demonstrated that images are still misclassified after geometric transformations
 9/32
  10. How earlier methods build the noise
 https://arxiv.org/abs/1312.6199 : compute a small noise that leads to a chosen class
 The noise r is computed by the following problem, where x is the input image, l is the (target) label, and f is the DL model:

   Minimize c|r| + loss_f(x + r, l)   subject to x + r ∈ [0, 1]^m

 [Slide shows the corresponding excerpt of the paper: the minimizer r need not be unique; computing the exact minimal distortion D(x, l) is hard, so it is approximated with box-constrained L-BFGS and a line search over c. The paper reports that such adversarial examples generalize across models trained from scratch with different hyper-parameters and across training sets. The figure again shows original image / noise / transformed image, with every transformed image recognized as "ostrich, Struthio camelus".]
 Ref: https://arxiv.org/abs/1312.6199 10/32
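The optimization above can be sketched in a few lines. A minimal numpy sketch, assuming a toy logistic-regression "model" in place of a deep network and plain projected gradient descent in place of the paper's box-constrained L-BFGS (w, b, and all hyper-parameters here are illustrative stand-ins):

```python
import numpy as np

# Toy differentiable "model": binary logistic regression f(x) = sigmoid(w.x + b).
# w, b, x and the optimizer are illustrative stand-ins; the paper solves the
# same objective on a deep network with box-constrained L-BFGS.
rng = np.random.default_rng(0)
w = rng.normal(size=16)
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    return int(sigmoid(w @ x + b) > 0.5)

def adversarial_noise(x, target, c=0.01, lr=0.1, steps=500):
    """Minimize c*||r||_1 + loss_f(x + r, target) s.t. x + r in [0, 1]^m,
    here with projected gradient descent instead of box-constrained L-BFGS."""
    r = np.zeros_like(x)
    y = float(target)
    for _ in range(steps):
        p = sigmoid(w @ (x + r) + b)
        # gradient of the cross-entropy toward the *target* label, plus the
        # subgradient of the L1 penalty c*|r|
        grad = (p - y) * w + c * np.sign(r)
        r -= lr * grad
        r = np.clip(x + r, 0.0, 1.0) - x   # keep x + r inside the pixel box
    return r

x = np.clip(rng.normal(loc=0.5, scale=0.1, size=16), 0.0, 1.0)
source = predict(x)
r = adversarial_noise(x, target=1 - source)
```

The L1 penalty keeps the noise small while the loss term drags the prediction to the target class, mirroring the trade-off controlled by c in the paper's formulation.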
  11. How earlier methods build the noise
 https://arxiv.org/abs/1412.6572 : pile up a small amount along the gradient direction of the objective function
 Decompose the adversarial input into the original part and a noise part, x̃ = x + η; the noise η is computed by the following fast gradient sign method, where ε is a small constant, θ the model parameters, x the input image, and y the label:

   η = ε · sign(∇_x J(θ, x, y))

 [Slide shows excerpts of the paper: because feature precision is limited (e.g. 8 bits per pixel), a perturbation with ||η||_∞ < ε ought to leave the class unchanged, yet for a weight vector w the activation grows by w^T η, which is maximized under the max-norm constraint by η = sign(w) and grows linearly with the dimensionality n — a sort of "accidental steganography". The required gradient is computed efficiently with backpropagation; with ε = 0.25 a shallow softmax classifier reaches a 99.9% error rate on MNIST, and the famous example ("panda", 57.7% confidence, + noise → "gibbon", 99.3% confidence) demonstrates the attack on GoogLeNet with ε = 0.007.]
 Ref: https://arxiv.org/abs/1412.6572 11/32
  12. How earlier methods build the noise
 https://arxiv.org/abs/1511.04599 : add up, little by little, displacements in the direction that crosses the decision boundary
 Define robustness as the minimal perturbation r that changes the estimated label k̂(x):

   Δ(x; k̂) := min_r ||r||_2   subject to   k̂(x + r) ≠ k̂(x)

 Algorithm for binary classification (extensible to multiclass): for an affine binary classifier f, the minimal perturbation is the orthogonal projection of x_0 onto the separating hyperplane F = {x : w^T x + b = 0}, given in closed form by r*(x_0) = −(f(x_0)/||w||_2^2) w. DeepFool iterates this linearized step, x_{i+1} ← x_i + r_i, until the sign of f flips, and returns the accumulated r̂ = Σ_i r_i.

 [Slide shows Figures 1 and 2 of the paper: original image k̂(x) = "whale"; both the proposed method (DeepFool) and the prior fast gradient sign method yield "turtle", with DeepFool finding a smaller perturbation. Code: http://github.com/lts4/deepfool]
 Ref: https://arxiv.org/abs/1511.04599 12/32
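For an affine binary classifier the DeepFool step has the closed form above, so the whole algorithm fits in a short numpy sketch (the general algorithm applies the same step to a local linearization of a deep model; the overshoot constant here is an illustrative choice):

```python
import numpy as np

def deepfool_linear(x, w, b, overshoot=1e-4, max_iter=50):
    """DeepFool (binary Algorithm 1) specialized to an affine classifier
    f(x) = w.x + b, where each step r_i = -f(x_i)/||w||^2 * w is exactly
    the orthogonal projection onto the hyperplane f = 0."""
    x_i = x.copy()
    for _ in range(max_iter):
        if np.sign(w @ x_i + b) != np.sign(w @ x + b):
            break                       # the label has flipped
        f = w @ x_i + b
        r_i = -(f / (w @ w)) * w
        # a tiny overshoot so the iterate actually crosses the boundary
        x_i = x_i + (1.0 + overshoot) * r_i
    return x_i - x                      # accumulated perturbation r^ = sum_i r_i

rng = np.random.default_rng(2)
w = rng.normal(size=8)
b = 0.3
x0 = rng.normal(size=8)
r = deepfool_linear(x0, w, b)
```

In the affine case a single iteration suffices and ||r||_2 matches the robustness Δ(x_0; f) = |f(x_0)|/||w||_2 up to the overshoot, which is what makes DeepFool's perturbations smaller than FGSM's.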
  13. How earlier methods build the noise (recap)
 https://arxiv.org/abs/1312.6199 : compute a small noise that leads to a chosen class
   (minimize c|r| + loss_f(x + r, l) subject to x + r ∈ [0, 1]^m; x is the input image, l the label, f the DL model)
 https://arxiv.org/abs/1412.6572 : pile up a small amount along the gradient direction of the objective function
   (fast gradient sign method: η = ε · sign(∇_x J(θ, x, y)); ε is a small constant, θ the model parameters, x the input image, y the label)
 https://arxiv.org/abs/1511.04599 : add up displacements in the direction that crosses the decision boundary
   (define robustness as the minimal perturbation changing the estimated label; DeepFool algorithm for binary classifiers, extensible to multiclass)
 13/32
  14. Universal adversarial perturbations 14/32

  15. Formalizing the universal perturbation noise
 The universal noise is formalized as follows, where k̂ is the estimated label, x an image, and µ the data distribution:

   k̂(x + v) ≠ k̂(x)   for "most" x ∼ µ

 To realize the noise v as an effective perturbation that induces misclassification, the following constraints are imposed:
 1. ||v||_p ≤ ξ — a restriction guaranteeing that the noise does not become large
 2. P_{x∼µ}( k̂(x + v) ≠ k̂(x) ) ≥ 1 − δ — a restriction guaranteeing that the noise induces at least a given amount of misclassification
 → we want an algorithm that makes (ξ, δ) as small as possible
 For each data point, find a vector that crosses the decision boundary and add it up

 [Slide shows the corresponding passage of the paper: µ denotes a distribution of images in R^d and k̂ a classification function outputting an estimated label k̂(x) for each image; such a perturbation is coined universal because it is image-agnostic. The paper examines small universal perturbations in the ℓ_p norm with p ∈ [1, ∞) and proposes an algorithm that iterates over a set X = {x_1, ..., x_m} sampled from µ, gradually building the universal perturbation.]
 15/32
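Constraint 1 is enforced in the algorithm via the projection operator P_{p,ξ}. A minimal numpy sketch for the two norms the paper actually uses (p = 2 and p = ∞; the radii below are illustrative stand-ins for the paper's ξ values):

```python
import numpy as np

def project_lp(v, xi, p):
    """Projection P_{p,xi} onto the l_p ball of radius xi centered at 0,
    sketched for the two norms used in the paper (p = 2 and p = inf)."""
    if p == 2:
        n = np.linalg.norm(v)
        return v if n <= xi else v * (xi / n)   # rescale onto the sphere
    if p == np.inf:
        return np.clip(v, -xi, xi)              # clamp each coordinate
    raise ValueError("only p = 2 and p = inf are sketched here")

rng = np.random.default_rng(3)
v = 10.0 * rng.normal(size=100)          # stand-in perturbation, outside both balls
v2 = project_lp(v, xi=2.0, p=2)          # cf. the paper's (p, xi) = (2, 2000)
vinf = project_lp(v, xi=0.5, p=np.inf)   # cf. (inf, 10)
```

For ℓ2 the projection rescales the vector, preserving its direction toward the decision boundary; for ℓ∞ it clamps each coordinate, which is why the two settings produce visually different universal perturbations.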
  16. The algorithm

 Algorithm 1: Computation of universal perturbations
 1: input: data points X, classifier k̂, desired ℓ_p norm ξ of the perturbation, desired accuracy δ on perturbed samples
 2: output: universal perturbation vector v
 3: Initialize v ← 0
 4: while Err(X_v) ≤ 1 − δ do                ← compute the misclassification rate
 5:   for each datapoint x_i ∈ X do
 6:     if k̂(x_i + v) = k̂(x_i) then
 7:       Compute the minimal perturbation that sends x_i + v to the decision boundary:
            Δv_i ← arg min_r ||r||_2  s.t.  k̂(x_i + v + r) ≠ k̂(x_i)       ← shortest vector to the boundary for each data point
 8:       Update the perturbation: v ← P_{p,ξ}(v + Δv_i)
 9:     end if
 10:  end for
 11: end while

 To ensure that the constraint ||v||_p ≤ ξ stays satisfied, the updated perturbation is projected onto the ℓ_p ball of radius ξ centered at 0:
   P_{p,ξ}(v) = arg min_{v'} ||v − v'||_2   subject to   ||v'||_p ≤ ξ
 i.e. the vector toward the decision boundary is preserved as far as possible while its Lp-norm is capped at ξ. Several passes over X are performed to improve the quality of the universal perturbation, and the algorithm terminates when the empirical "fooling rate" on the perturbed set X_v := {x_1 + v, ..., x_m + v} exceeds the target 1 − δ:
   Err(X_v) := (1/m) Σ_{i=1}^m 1[k̂(x_i + v) ≠ k̂(x_i)] ≥ 1 − δ
 In practice the number of data points m in X need not be large to compute a perturbation valid for the whole distribution µ; it can be much smaller than the number of training points (see Section 3 of the paper). [Figure 2 of the paper illustrates the algorithm: data points x_1, x_2, x_3 are super-imposed and the minimal perturbations sending the current perturbed points x_i + v outside their classification regions R_i are aggregated sequentially.]
 16/32
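Algorithm 1 can be sketched end-to-end on a toy problem. Assumptions (all of them stand-ins, loudly labeled): the classifier is affine and binary, so the inner minimization in step 7 has the DeepFool closed form, and the toy data all start on one side of the boundary so that a single direction can fool every point; the paper instead runs DeepFool-style steps against deep multiclass networks:

```python
import numpy as np

# Algorithm 1 sketch: for each still-unfooled point, add the minimal
# perturbation sending x_i + v across the boundary, then project v back
# onto the l2 ball of radius xi. Toy affine classifier, not a deep net.

def project_l2(v, xi):
    n = np.linalg.norm(v)
    return v if n <= xi else v * (xi / n)

def khat(x, w, b):
    return int(w @ x + b > 0)

def universal_perturbation(X, w, b, xi, delta, max_epochs=20, overshoot=0.02):
    v = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        for x in X:
            if khat(x + v, w, b) == khat(x, w, b):
                f = w @ (x + v) + b
                # minimal step sending x + v to the boundary, slightly overshot
                dv = -(1.0 + overshoot) * (f / (w @ w)) * w
                v = project_l2(v + dv, xi)
        fooled = np.mean([khat(x + v, w, b) != khat(x, w, b) for x in X])
        if fooled >= 1.0 - delta:       # empirical fooling rate reached
            break
    return v

rng = np.random.default_rng(4)
w, b = rng.normal(size=5), 0.1
X = rng.normal(size=(40, 5))
X = X[X @ w + b > 0]                    # toy data: every point starts in class 1
v = universal_perturbation(X, w, b, xi=50.0, delta=0.1)
fool_rate = np.mean([khat(x + v, w, b) != khat(x, w, b) for x in X])
```

In this contrived setting every per-point correction points along −w, so one pass already fools the whole set; with a deep network the corrections only partially align, which is exactly why the paper's observation that they correlate strongly is surprising.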
  17. Several experiments 17/32

  18. Fooling rates of each model
 Fooling rates computed for a variety of models
 ・Data from ILSVRC ( X: 10,000, Val.: 50,000 )
 ・Experiments with the combinations (p, ξ) = (2, 2000) and (∞, 10)

 Table 1: Fooling ratios on the set X and on the validation set
           CaffeNet  VGG-F   VGG-16  VGG-19  GoogLeNet  ResNet-152
 ℓ2  X      85.4%    85.9%   90.7%   86.9%   82.9%      89.7%
 ℓ2  Val.   85.6%    87.0%   90.3%   84.5%   82.0%      88.5%
 ℓ∞  X      93.1%    93.8%   78.5%   77.8%   80.8%      85.4%
 ℓ∞  Val.   93.3%    93.7%   78.3%   77.8%   78.9%      84.0%

 The L2-norm gives better results overall, but there are also models that the L∞-norm matches well
 [Slide also shows misclassifications of GoogLeNet induced by the universal perturbation noise (Figure 3 of the paper), e.g. wool → Indian elephant, tabby → African grey, carousel → grey fox, three-toed sloth → macaw. The paper notes that the perturbations computed on X generalize very well to the unseen validation set — for CaffeNet and VGG-F, more than 90% of validation images are fooled for p = ∞ — and that even with |X| = 500 images (fewer than one per ImageNet class) more than 30% of the validation images are fooled.]
 Ref: https://arxiv.org/abs/1610.08401 18/32
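The quantity reported throughout Table 1, and used as the stopping criterion of Algorithm 1, is the empirical fooling rate Err(X_v). A small sketch with a hypothetical toy classifier standing in for the networks above:

```python
import numpy as np

def fooling_rate(X, v, khat):
    """Err(X_v) = (1/m) * sum_i 1[khat(x_i + v) != khat(x_i)]."""
    return float(np.mean([khat(x + v) != khat(x) for x in X]))

# Hypothetical toy classifier (not one of the Table 1 networks):
# it simply thresholds the first coordinate.
khat = lambda x: int(x[0] > 0.0)
X = np.array([[0.2, 1.0], [0.5, -1.0], [-0.3, 0.4], [-0.9, 0.0]])
v = np.array([-0.6, 0.0])   # shifts the first coordinate down by 0.6
# fools the first two points (0.2 -> -0.4, 0.5 -> -0.1) but not the others
```

Note that the same v is applied to every sample; a per-image attack like FGSM or DeepFool would be evaluated point by point instead.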
  19. Universal perturbation for each model
Universal perturbation vectors computed for various models (note that these are not unique)
・ILSVRC data is used
・Results for (p, ξ) = (∞, 10)
A different perturbation is obtained for each model
(Figure 4 of the paper: the universal perturbations computed for CaffeNet, VGG-F, VGG-16, VGG-19, GoogLeNet, and ResNet-152)
Ref: https://arxiv.org/abs/1610.08401 19/32
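The (p, ξ) pairs above encode the constraint that the perturbation must stay inside the ℓp ball of radius ξ (for (∞, 10), every component is bounded by 10). A sketch of such a projection step, written from the constraint alone rather than taken from the authors' code:

```python
import numpy as np

def project_lp(v, xi, p):
    """Project v onto the ball {x : ||x||_p <= xi}, for p = 2 or p = inf."""
    if p == 2:
        norm = np.linalg.norm(v.ravel())
        return v if norm <= xi else v * (xi / norm)  # rescale onto the sphere
    if p == np.inf:
        return np.clip(v, -xi, xi)  # per-component clipping
    raise ValueError("only p = 2 or p = inf are handled in this sketch")

print(project_lp(np.array([6.0, -15.0, 3.0]), 10, np.inf))  # clips -15 to -10
print(project_lp(np.array([3.0, 4.0]), 2.5, 2))             # rescales to norm 2.5
```

An iterative perturbation-construction loop would apply such a projection after every update to keep the accumulated perturbation quasi-imperceptible.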
  20. Non-uniqueness of the universal perturbation
Five universal perturbations computed with a single model while shuffling the training data
・ILSVRC data is used
・GoogLeNet is used as the model
The perturbations look similar, but the normalized inner product of any pair is at most 0.1
This shows that the universal perturbation is not unique
(Figure 5 of the paper: five diverse universal perturbations for GoogLeNet, generated from different random shufflings of the set X)
Ref: https://arxiv.org/abs/1610.08401 20/32
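The "normalized inner product at most 0.1" statement is just the cosine similarity between the flattened perturbations. A sketch of that check (using independent random noise in place of the paper's actual perturbations; in high dimensions such noise is likewise near-orthogonal):

```python
import numpy as np

def pairwise_cosine(perts):
    """Matrix of normalized inner products between flattened perturbations."""
    V = np.stack([p.ravel() / np.linalg.norm(p.ravel()) for p in perts])
    return V @ V.T  # entry (i, j) is the cosine between perturbations i and j

rng = np.random.default_rng(0)
perts = [rng.standard_normal((224, 224, 3)) for _ in range(5)]
G = pairwise_cosine(perts)

# In ~150k dimensions, independent perturbations are nearly orthogonal,
# so every off-diagonal entry falls far below the 0.1 threshold of the slide.
print(np.max(np.abs(G - np.eye(5))))
```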
  21. Dependence on the amount of training data
Fooling rate computed with a single model while varying the size of the training set
・ILSVRC data is used ( X: varied in the experiment, Val.: 50,000 )
・GoogLeNet is used as the model
A high fooling rate is achieved even with little data
This suggests that the decision boundaries near many data points share a similar geometry
(Figure 6 of the paper: fooling ratio on the validation set for |X| = 500, 1000, 2000, 4000)
Ref: https://arxiv.org/abs/1610.08401 21/32
  22. Generalization of the universal perturbation across models
A universal perturbation learned on one model is applied to other models
・ILSVRC data is used ( X: 10,000, Val.: 50,000 )
It induces high fooling rates even on different models
The effect appears stronger when the architectures are similar (VGG-16 ↔ VGG-19)
The computed universal perturbation is therefore "universal" in a double sense:
1. One perturbation is applicable to various data
2. One perturbation is applicable to various models

Cross-model fooling rates (Table 2 of the paper; rows: model the perturbation was computed for, columns: model it is applied to):

            VGG-F   CaffeNet  GoogLeNet  VGG-16  VGG-19  ResNet-152
VGG-F       93.7%   71.8%     48.4%      42.1%   42.1%   47.4%
CaffeNet    74.0%   93.3%     47.7%      39.9%   39.9%   48.0%
GoogLeNet   46.2%   43.8%     78.9%      39.2%   39.8%   45.5%
VGG-16      63.4%   55.8%     56.5%      78.3%   73.1%   63.4%
VGG-19      64.0%   57.2%     53.6%      73.5%   77.8%   58.0%
ResNet-152  46.3%   46.3%     50.5%      47.0%   45.5%   84.0%

Ref: https://arxiv.org/abs/1610.08401 22/32
  23. I tried it myself
The repository is public, so I built a docker environment and ran the analysis:
https://github.com/yoheikikuta/universal
( prerequisite: an environment where nvidia-docker is available )

$ git clone https://github.com/yoheikikuta/universal.git
$ cd universal
$ bash run.sh
(in the container)
$ python3 demo_inception.py -i data/test_img.png

A pre-computed universal perturbation is included, so it can immediately be applied to your own images
universal_pert.py, which computes a universal perturbation, is also provided 23/32
  24. I tried it myself
The images actually used in the paper 24/32

  25. Misclassifications follow a pattern
A graph structure is examined to understand the pattern of misclassifications
・GoogLeNet is used as the model
・ILSVRC data is used
・A directed edge i → j is added whenever data of class i is misclassified as class j
A tendency for edges to concentrate on a small number of specific nodes was observed
These classes presumably occupy a large fraction of the data space (though this does not match intuition)
(Figure residue: class labels from the misclassification graph, e.g., great grey owl, platypus, Arctic fox, leopard, digital clock, space shuttle, refrigerator, window shade)
Ref: https://arxiv.org/abs/1610.08401 25/32
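The directed graph described above can be reproduced with nothing more than a counter over (true label, fooled label) pairs; the "few dominant nodes" then appear as classes with unusually high in-degree. A sketch with hypothetical labels (the class names are only placeholders):

```python
from collections import Counter

def misclassification_graph(true_labels, fooled_labels):
    """Directed edges i -> j whenever class i is misclassified as j,
    together with the (weighted) in-degree of each target class."""
    edges = Counter((i, j) for i, j in zip(true_labels, fooled_labels) if i != j)
    in_degree = Counter()
    for (_, j), count in edges.items():
        in_degree[j] += count
    return edges, in_degree

true_labels = ["owl", "fox", "sloth", "owl", "clock"]
fooled_labels = ["newt", "newt", "newt", "owl", "newt"]
edges, in_degree = misclassification_graph(true_labels, fooled_labels)
print(in_degree.most_common(1))  # -> [('newt', 4)]
```

On real ILSVRC predictions, `in_degree.most_common()` would surface the dominant target classes the slide refers to.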
  26. Discussions about the decision boundaries 26/32

  27. Characteristics of the decision boundary
It has been shown that a single universal perturbation misclassifies many images
→ the decision boundary near data points seems describable by a low-dimensional subspace spanned by a few vectors
To examine this, study the normal vectors to the decision boundary at each data point
For CaffeNet, build the matrix of normalized boundary normals at n validation points

  N = [ r(x_1)/‖r(x_1)‖_2, …, r(x_n)/‖r(x_n)‖_2 ]

where r(x) is the minimal perturbation sending x across the decision boundary, and compare the singular values of N with those obtained when its columns are sampled uniformly at random from the unit sphere
(For a binary linear classifier the boundary is a hyperplane and N has rank 1, since all normals are collinear; for complex classifiers the singular values of N capture the correlations in the decision boundary)
Ref: https://arxiv.org/abs/1610.08401 27/32
  28. Characteristics of the decision boundary
In the matrix N built from CaffeNet, a small number of singular values dominate
→ the decision boundary near data points can indeed be described in far fewer dimensions than the data dimension
→ might the local geometry of the decision boundary near data points share common properties that do not depend on the individual point?
Whether a universal perturbation can be constructed is non-trivial, but a detailed study of the decision boundary near data points later proved its existence ( https://arxiv.org/abs/1705.09554 )
(Figure 9 of the paper: the singular values of N decay much faster than those of a matrix with columns sampled uniformly from the unit sphere)
Ref: https://arxiv.org/abs/1610.08401 28/32
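The comparison behind Fig. 9 is easy to reproduce with synthetic data: columns drawn uniformly from the unit sphere give a slowly decaying singular-value spectrum, while columns mostly confined to a low-dimensional subspace concentrate almost all of their energy in a few singular values. A sketch (synthetic matrices, not the paper's actual boundary normals):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 500  # ambient dimension, number of columns

# Baseline: columns sampled uniformly at random from the unit sphere.
R = rng.standard_normal((d, n))
R /= np.linalg.norm(R, axis=0)

# Correlated columns: mostly confined to a 10-dimensional subspace,
# mimicking boundary normals that share a common low-dimensional structure.
basis = rng.standard_normal((d, 10))
C = basis @ rng.standard_normal((10, n)) + 0.05 * rng.standard_normal((d, n))
C /= np.linalg.norm(C, axis=0)

s_rand = np.linalg.svd(R, compute_uv=False)
s_corr = np.linalg.svd(C, compute_uv=False)

# Fraction of the total energy captured by the top 10 singular values.
top10 = lambda s: float((s[:10] ** 2).sum() / (s ** 2).sum())
print(top10(s_rand), top10(s_corr))  # small for random, close to 1 when correlated
```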
  29. Related topics 29/32

  30. Topics that look interesting
• Why a few nodes dominate the graph structure
  An intriguing experimental fact, but a clear understanding seems to be missing
  Could a theoretical explanation or a semantic interpretation be given?
• What the geometric structure of the decision boundary suggests
  The existence of universal perturbations is now understood
  Do the decision boundaries learned by DL carry an interpretable meaning? Is there a relation to the manifold hypothesis?
• How to build models that do not misclassify
  Training on perturbed images yields some improvement, but it is still insufficient
  Could appropriately discretizing the input produce models that are robust to the perturbation?
  (Can we, like humans, switch between looking at things coarsely and looking at them in detail?) 30/32
  31. Summary 31/32

  32. Summary
• Discovered a universal perturbation that makes Deep Learning models misclassify
• A single perturbation misclassifies many images
• The same perturbation is applicable to different models
• A powerful perturbation can be built from little data
• The perturbation is constructed by accumulating vectors in the normal direction from data points toward the decision boundary
• The normal vectors to the decision boundary point in a common direction across many data points and are strongly correlated
  (different regions of the decision boundary are describable in low dimensions)
Ref: https://arxiv.org/abs/1610.08401 32/32