x′ is an adversarial example iff:
  f(x′) = t          (class is t, targeted)
  Δ(x, x′) ≤ ε       (difference below threshold)

Δ(x, x′) is defined in some (simple!) metric space: the L0 norm (number of components changed), L1, the L2 norm (“Euclidean distance”), or L∞.

Assumption (to map to the earlier definition): a small perturbation does not change the class in Oracle space.
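As a concrete illustration of this definition, here is a minimal numpy sketch; `model_predict` and the function names are illustrative assumptions, not part of the original material:

```python
import numpy as np

def lp_distances(x, x_adv):
    """Common L_p distances between an original input x and a candidate
    adversarial example x_adv (numpy arrays of the same shape)."""
    diff = (x_adv - x).ravel()
    return {
        "L0": int(np.count_nonzero(diff)),        # number of components changed
        "L1": float(np.abs(diff).sum()),
        "L2": float(np.sqrt((diff ** 2).sum())),  # "Euclidean distance"
        "Linf": float(np.abs(diff).max()),
    }

def is_targeted_adversarial(model_predict, x, x_adv, target, eps, norm="Linf"):
    """x_adv is a targeted adversarial example iff the model assigns it the
    target class AND its distance from x in the chosen norm is below eps."""
    return model_predict(x_adv) == target and lp_distances(x, x_adv)[norm] <= eps
```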
Clever adversaries can still find adversarial examples.

2. Build a robust classifier
− Adversarial retraining, increasing model capacity, etc.
− If we could build a perfect model, we would!

Our strategy, “Feature Squeezing”, reduces the search space available to the adversary. (Weilin Xu, Yanjun Qi)
Detection framework: run the input through the model directly and through the model preceded by each feature squeezer (Squeezer 1, …, Squeezer k), producing Prediction 0, Prediction 1, …, Prediction k. If the predictions differ by more than a threshold, the input is flagged as adversarial; otherwise it is treated as legitimate.

A feature squeezer coalesces similar inputs into one point:
• Little change for legitimate inputs.
• Destroys adversarial perturbations.
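The comparison step can be sketched briefly; this is a minimal sketch in which `predict_probs`, `squeezers`, and `threshold` are assumed interfaces, and the L1 distance between softmax vectors follows the detector described later in these slides:

```python
import numpy as np

def detect_adversarial(predict_probs, squeezers, x, threshold):
    """Feature-squeezing detection sketch: compare the model's softmax output
    on the raw input with its outputs on squeezed versions of the input.
    A large L1 difference suggests the input is adversarial."""
    p_orig = predict_probs(x)                     # Prediction 0: raw input
    scores = [np.abs(p_orig - predict_probs(squeeze(x))).sum()
              for squeeze in squeezers]           # Predictions 1..k
    return max(scores) > threshold                # True => flag as adversarial
```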
3×3 median filter: replaces each pixel with the median value of its neighbors. Effective at eliminating “salt-and-pepper” noise (L0 attacks).
(Image from https://sultanofswing90.wordpress.com/tag/image-processing/)
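A minimal median-smoothing squeezer, assuming images are numpy arrays; the per-channel handling for color images is an assumption, since the slide shows only the grayscale case:

```python
from scipy.ndimage import median_filter

def median_smoothing_squeezer(x, size=3):
    """Local spatial smoothing: replace each pixel with the median of its
    size x size neighborhood."""
    if x.ndim == 3:
        # (H, W, C) image: smooth spatially, but do not mix color channels.
        return median_filter(x, size=(size, size, 1))
    return median_filter(x, size=size)
```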
Many other squeezers are possible:
• Thermometer encoding (a learnable bit depth reduction)
• Image denoising using autoencoders, wavelets, JPEG, etc.
• Image resizing
• ...

References: C Xie et al., Mitigating Adversarial Effects Through Randomization, ICLR 2018; J Buckman et al., Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018; D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017; A Prakash et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018; ...
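The simplest squeezer in this family is plain (non-learnable) bit depth reduction, the baseline that thermometer encoding generalizes. A minimal sketch, assuming pixel values have already been scaled to [0, 1]:

```python
import numpy as np

def bit_depth_reduction(x, bits=4):
    """Quantize pixel values to 2**bits levels, then rescale back to [0, 1],
    removing the low-order bits an adversary could hide perturbations in."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels
```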
Feature squeezing conjecture: for any distance-limited adversary, there exists some feature squeezer that accurately detects its adversarial examples.

Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces the original and adversarial example into the same sample.
Training a detector (MNIST). [Figure: histograms of the maximum L1 distance between the model's outputs on the original and squeezed input, for legitimate vs. adversarial examples.] Set the detection threshold to keep the false positive rate below a target: threshold = 0.0029 gives 98.2% detection with FP < 4%.
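The threshold-selection step can be sketched as follows; this is a minimal sketch in which `legit_scores` is assumed to hold the detector scores (maximum L1 distances) computed on held-out legitimate examples:

```python
import numpy as np

def choose_threshold(legit_scores, target_fpr=0.05):
    """Pick the detection threshold from legitimate examples only, so that
    at most target_fpr of them score above it (false positives)."""
    scores = np.sort(np.asarray(legit_scores))
    idx = int(np.ceil((1.0 - target_fpr) * len(scores))) - 1
    return scores[idx]
```

The 0.0029 threshold reported on the slide is the value obtained for MNIST with this kind of false-positive target.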
Adaptive attack (Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song, Adversarial Example Defense: Ensembles of Weak Defenses are not Strong, USENIX WOOT ’17):

  minimize  ‖f(x′) − t‖ + c · Δ(x, x′) + d · Score(x′)
            (misclassification term + distance term + detection term)
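The same objective can be written as a loss the attacker optimizes over x′. This is a sketch under the assumptions that `predict_probs` returns a softmax vector and `detector_score` returns the detector's score; the coefficient names c and d and the use of L1 for both terms are illustrative choices:

```python
import numpy as np

def adaptive_attack_loss(predict_probs, detector_score, x, x_adv, target_onehot,
                         c=1.0, d=1.0):
    """Adaptive attacker's objective against a detector: be classified as the
    target, stay close to the original input, and keep the detector score low."""
    misclassification = np.abs(predict_probs(x_adv) - target_onehot).sum()
    distance = np.abs(x_adv - x).sum()        # Δ(x, x') measured in L1 here
    detection = detector_score(x_adv)
    return misclassification + c * distance + d * detection
```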
“Black-box attacker”:
• Interacts with the model only through an API
• Limited number of interactions
• Output is a <class, confidence> vector, e.g. (“bird”, 0.09), (“horse”, 0.84), ...
• Decision-based setting: the output is just the class
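To make the interaction model concrete, here is a toy query-limited attack loop. It is not any specific published attack (the ZOO attack discussed next uses zeroth-order gradient estimates instead), and `query`, assumed to return a (predicted class, confidence vector) pair, is an illustrative API:

```python
import numpy as np

def random_walk_blackbox_attack(query, x, target, eps, step=0.01, max_queries=10000):
    """Toy score-based black-box attack: propose small random steps inside an
    L_inf ball of radius eps around x, keep moves that raise the target class's
    confidence, and count how many model queries were spent."""
    rng = np.random.default_rng(0)
    best = x.copy()
    _, conf = query(best)
    best_conf = conf[target]
    for n in range(1, max_queries + 1):
        candidate = best + rng.uniform(-step, step, size=x.shape)
        candidate = np.clip(candidate, x - eps, x + eps)   # stay distance-limited
        candidate = np.clip(candidate, 0.0, 1.0)           # stay a valid image
        cls, conf = query(candidate)
        if cls == target:
            return candidate, n + 1          # queries used (including the initial one)
        if conf[target] > best_conf:
            best, best_conf = candidate, conf[target]
    return None, max_queries + 1             # budget exhausted
```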
[Figure: ZOO attack on MNIST, histogram of the number of queries needed per seed image.] The effort (number of model interactions) to find an adversarial example varies by seed: most seeds require only a few thousand queries, but a few require more than 10x that effort.
[Figure: average number of queries (×10^4) vs. number of seed images selected (0 to 150), for MNIST and CIFAR-10, comparing Greedy Search, Random Search, and the Retroactive Optimal seed ordering.]
… domain knowledge will not be robust against motivated adversaries.

Adversarial machine learning is an immature, but fun and active, research area. We need to make progress toward meaningful threat models, robustness measures, and verifiable defenses.

Workshops: to be held at DSN (Luxembourg, 25 June) and at IEEE S&P (San Francisco, 24 May).