Slide 13
Slide 13 text
How prior methods generate the noise
https://arxiv.org/abs/1312.6199 : computes a small noise that steers the input toward a chosen class
The noise r is obtained from the following problem:
x: input image, l: label, f: DL model
https://arxiv.org/abs/1412.6572 : piles a tiny amount onto the input along the gradient direction of the objective function
Expanding the noise term:
The noise η is computed by the following formula (fast gradient sign method):
ε: small constant, θ: model parameters, x: input image, y: label
https://arxiv.org/abs/1511.04599 : accumulates, little by little, displacements in the direction that crosses the decision boundary
Robustness is defined as:
Algorithm for binary classifiers (extensible to the multiclass case):
[Excerpt from Szegedy et al., arXiv:1312.6199, Sec. 4.1]
The noise r is the solution of

    Minimize ||r||_2 subject to f(x + r) = l and x + r ∈ [0, 1]^m.

The minimizer r might not be unique, but we denote one such x + r for an arbitrarily chosen minimizer by D(x, l). Informally, x + r is the closest image to x classified as l by f. Obviously, D(x, f(x)) = x, so this task is non-trivial only if f(x) ≠ l. In general, the exact computation of D(x, l) is a hard problem, so we approximate it by using a box-constrained L-BFGS. Concretely, we find an approximation of D(x, l) by performing line-search to find the minimum c > 0 for which the minimizer r of the following problem satisfies f(x + r) = l:

    Minimize c|r| + loss_f(x + r, l) subject to x + r ∈ [0, 1]^m.

This penalty function method would yield the exact solution for D(x, l) in the case of convex losses; however, neural networks are non-convex in general, so we end up with an approximation in this case.
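A minimal sketch of this penalty minimization (our own illustration in PyTorch; the paper uses box-constrained L-BFGS with a line search over c, which is replaced here by a plain Adam loop plus clamping, and all names are ours):

import torch

def approx_D(model, x, target, c, steps=200, lr=0.01):
    """Approximate D(x, l): minimize c*|r| + loss_f(x + r, l) with x + r kept in [0, 1]^m.

    x: batched image tensor with values in [0, 1]; target: tensor of target class indices l.
    """
    r = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([r], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        x_adv = (x + r).clamp(0.0, 1.0)                     # box constraint x + r in [0, 1]^m
        loss = c * r.norm() + loss_fn(model(x_adv), target)  # c*|r| + loss_f(x + r, l)
        loss.backward()
        opt.step()
    return (x + r).clamp(0.0, 1.0).detach(), r.detach()

The paper's outer line search would then look for the smallest c > 0 for which the returned x + r is actually classified as l.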
4.2 Experimental results
Our "minimum distortion" function D has the following intriguing properties, which we support by informal evidence and quantitative experiments in this section:
1. For all the networks we studied (MNIST, QuocNet [10], AlexNet), for each sample we have always managed to generate very close, visually hardly distinguishable adversarial examples that are misclassified by the original network (see http://goo.gl/huaGPb for examples).
2. Cross model generalization: a relatively large fraction of examples will be misclassified by networks trained from scratch with different hyper-parameters (number of layers, regularization or initial weights).
3. Cross training-set generalization: a relatively large fraction of examples will be misclassified by networks trained from scratch on a disjoint training set.
[Excerpt from Goodfellow et al., arXiv:1412.6572]
THE LINEAR EXPLANATION OF ADVERSARIAL EXAMPLES
We start with explaining the existence of adversarial examples for linear models.
In many problems, the precision of an individual input feature is limited. For example, digital images often use only 8 bits per pixel, so they discard all information below 1/255 of the dynamic range. Because the precision of the features is limited, it is not rational for the classifier to respond differently to an input x than to an adversarial input x̃ = x + η if every element of the perturbation is smaller than the precision of the features. Formally, for problems with well-separated classes, we expect the classifier to assign the same class to x and x̃ so long as ||η||_∞ < ε, where ε is small enough to be discarded by the sensor or data storage apparatus associated with our problem.
Consider the dot product between a weight vector w and an adversarial example x̃:

    w^T x̃ = w^T x + w^T η.

The adversarial perturbation causes the activation to grow by w^T η. We can maximize this increase subject to the max norm constraint on η by assigning η = sign(w). If w has n dimensions and the average magnitude of an element of the weight vector is m, then the activation will grow by εmn. Since ||η||_∞ does not grow with the dimensionality of the problem but the change in activation caused by perturbation by η can grow linearly with n, then for high dimensional problems we can make many infinitesimal changes to the input that add up to one large change to the output. We can think of this as a sort of "accidental steganography," where a linear model is forced to attend exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude.
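This linear growth is easy to check numerically. Below is a minimal sketch (plain NumPy, all names are ours) that applies η = ε·sign(w) to a random weight vector and shows the activation shift w^T η = ε·||w||_1 growing with the dimension n while ||η||_∞ stays at ε:

import numpy as np

eps = 0.007
rng = np.random.default_rng(0)

for n in (10, 1_000, 100_000):
    w = rng.normal(size=n)           # weights of a linear model
    x = rng.normal(size=n)           # some input
    eta = eps * np.sign(w)           # max-norm bounded perturbation
    shift = w @ (x + eta) - w @ x    # activation change, equals eps * ||w||_1
    print(f"n={n:>7}  ||eta||_inf={np.abs(eta).max():.3f}  shift={shift:.2f}")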
This explanation shows that a simple linear model can have adversarial examples if its input has sufficient dimensionality. Previous explanations for adversarial examples invoked hypothesized properties of neural networks, such as their supposed highly non-linear nature. Our hypothesis based on linearity is simpler, and can also explain why softmax regression is vulnerable to adversarial examples.
LINEAR PERTURBATION OF NON-LINEAR MODELS
[Figure 1 panel labels: 57.7% confidence / 8.2% confidence / 99.3% confidence]
Figure 1: A demonstration of fast adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, we can change GoogLeNet's classification of the image. Here our ε of .007 corresponds to the magnitude of the smallest bit of an 8 bit image encoding after GoogLeNet's conversion to real numbers.
Let θ be the parameters of a model, x the input to the model, y the targets associated with x (for machine learning tasks that have targets) and J(θ, x, y) be the cost used to train the neural network. We can linearize the cost function around the current value of θ, obtaining an optimal max-norm constrained perturbation of

    η = ε sign(∇_x J(θ, x, y)).

We refer to this as the "fast gradient sign method" of generating adversarial examples. Note that the required gradient can be computed efficiently using backpropagation.
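A minimal PyTorch sketch of this one-step perturbation (our own code; the model, loss function, and variable names are assumptions, not the paper's implementation):

import torch

def fgsm_perturbation(model, loss_fn, x, y, eps):
    """Compute eta = eps * sign(grad_x J(theta, x, y)) and the perturbed input."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)         # J(theta, x, y)
    loss.backward()                     # gradient w.r.t. the input via backpropagation
    eta = eps * x.grad.sign()           # max-norm constrained perturbation
    x_adv = (x + eta).clamp(0.0, 1.0)   # keep the image in a valid pixel range
    return x_adv.detach(), eta.detach()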
We find that this method reliably causes a wide variety of models to misclassify their input. See Fig. 1 for a demonstration on ImageNet. We find that using ε = .25, we cause a shallow softmax classifier to have an error rate of 99.9% with an average confidence of 79.3% on the MNIST test set. In the same setting, a maxout network misclassifies 89.4% of our adversarial examples with an average confidence of 97.6%. Similarly, using ε = .1, we obtain an error rate of 87.15% and an average probability of 96.6% assigned to the incorrect labels when using a convolutional maxout network on a preprocessed version of the CIFAR-10 (Krizhevsky & Hinton, 2009) test set. Other simple methods of generating adversarial examples are possible. For example, we also found that rotating x by a small angle in the direction of the gradient reliably produces adversarial examples.
[Excerpt from Moosavi-Dezfooli et al. (DeepFool), arXiv:1511.04599]
[...] reliably quantify the robustness of these classifiers. Extensive experimental results show that our approach outperforms recent methods in the task of computing adversarial perturbations and making classifiers more robust. (To encourage reproducible research, the code of DeepFool is made available at http://github.com/lts4/deepfool.)

1. Introduction
Deep neural networks are powerful learning models that achieve state-of-the-art pattern recognition performance in many research areas such as bioinformatics [1, 16], speech [12, 6], and computer vision [10, 8]. Though deep networks have exhibited very good performance in classification tasks, they have recently been shown to be particularly unstable to adversarial perturbations of the data [18]. In fact, very small and often imperceptible perturbations of the data samples are sufficient to fool state-of-the-art classifiers and result in incorrect classification (e.g., Figure 1). Formally, for a given classifier, we define an adversarial perturbation as the minimal perturbation r that is sufficient to change the estimated label k̂(x):

    Δ(x; k̂) := min_r ||r||_2 subject to k̂(x + r) ≠ k̂(x),    (1)

where x is an image and k̂(x) is the estimated label. We call Δ(x; k̂) the robustness of k̂ at point x. The robustness of classifier k̂ is then defined as the expectation ρ_adv(k̂) = E_x[ Δ(x; k̂) / ||x||_2 ] over the distribution of data.
Figure 1: An example of adversarial perturbations. First row: the original image x that is classified as k̂(x) = "whale". Second row: the image x + r classified as k̂(x + r) = "turtle" and the corresponding perturbation r computed by DeepFool. Third row: the image classified as "turtle" and the corresponding perturbation computed by the fast gradient sign method [4]. DeepFool leads to a smaller perturbation.
[Figure 2 labels: hyperplane F, regions f(x) < 0 and f(x) > 0, point x0, minimal perturbation r*(x) of magnitude Δ(x0; f)]
Figure 2: Adversarial examples for a linear binary classifier.
Algorithm 1 DeepFool for binary classifiers
1: input: Image x, classifier f.
2: output: Perturbation r̂.
3: Initialize x_0 ← x, i ← 0.
4: while sign(f(x_i)) = sign(f(x_0)) do
5:     r_i ← −( f(x_i) / ||∇f(x_i)||_2^2 ) ∇f(x_i),
6:     x_{i+1} ← x_i + r_i,
7:     i ← i + 1.
8: end while
9: return r̂ = Σ_i r_i.
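A minimal NumPy sketch of Algorithm 1 on a toy affine classifier f(x) = w·x + b (our own setup and names; the released implementation lives at http://github.com/lts4/deepfool). The small per-step overshoot is our numerical convenience to push x_i just past the boundary, in the spirit of the small overshoot factor the paper applies in practice:

import numpy as np

def deepfool_binary(x, f, grad_f, max_iter=50, overshoot=1e-4):
    """DeepFool for a binary classifier f (Algorithm 1): accumulate r_i until the sign of f flips."""
    x_i = x.astype(float)
    r_hat = np.zeros_like(x_i)
    for _ in range(max_iter):
        if np.sign(f(x_i)) != np.sign(f(x)):       # label flipped: stop
            break
        g = grad_f(x_i)
        r_i = -f(x_i) / np.dot(g, g) * g           # r_i = -f(x_i) / ||grad f(x_i)||^2 * grad f(x_i)
        x_i = x_i + (1 + overshoot) * r_i          # step onto (slightly past) the boundary
        r_hat += (1 + overshoot) * r_i
    return r_hat

# Toy affine classifier f(x) = w.x + b with constant gradient w.
w, b = np.array([2.0, -1.0]), 0.5
x0 = np.array([1.0, 3.0])                          # f(x0) = -0.5 < 0
r = deepfool_binary(x0, f=lambda x: w @ x + b, grad_f=lambda x: w)
print(r, np.sign(w @ (x0 + r) + b))                # perturbation and the flipped sign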