[Pipeline figure] Training (supervised learning): labelled training data (malicious / benign) → feature extraction → vectors → ML algorithm → trained classifier.
[Pipeline figure] Deployment: operational data → feature extraction → trained classifier → malicious / benign.
Assumption: the training data is representative.
Adversarial Examples
“panda” + 0.007 × sign(∇x J(θ, x, y)) = “gibbon”
Example from: Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples. 2014 (in ICLR 2015).
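The construction above is the fast gradient sign method (FGSM) from the cited paper. Below is a minimal sketch of it in PyTorch; `model`, `x`, and `true_label` are placeholders we assume, not something from the slides.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, true_label, epsilon=0.007):
    """One-step FGSM: x' = x + epsilon * sign(d loss / d x).

    model: a trained classifier returning logits for a batch of images.
    x: batch of images with values in [0, 1]; true_label: LongTensor of class indices.
    """
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), true_label)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()    # perturb each pixel by +/- epsilon
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in the valid range
```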
History of the Destruction of Troy, 1498 (illustration)
“...the Trojans hauled the lumber into the city, their hearts fearing no clever, packed ambush of the Argives.” Homer, The Iliad (~1200 BCE)
Goal of Machine Learning Classifier
[Visualization: Classifier Space (DNN model) vs. “Oracle” Space (human perception). Model and visualization based on work by Beilun Wang, Ji Gao and Yanjun Qi (ICLR 2017 Workshop).]
Well-Trained Classifier
[Same classifier-space vs. oracle-space visualization.]
Adversarial Examples
[Same classifier-space vs. oracle-space visualization.]
Misleading Visualization
Cartoon: 2 dimensions; few samples near boundaries; every sample near 1–3 classes.
Reality: thousands of dimensions; all samples near boundaries; every sample near all classes.
Adversarial Examples
Adversary’s goal: find a small perturbation that changes the class assigned by the classifier but is imperceptible to the oracle.
[Same classifier-space vs. oracle-space visualization.]
Adversarial Examples Definition
Given a seed sample x, x′ is an adversarial example iff:
  f(x′) = t        (class is t: targeted)
  Δ(x, x′) ≤ ε     (difference below threshold)
Δ(x, x′) is defined in some (simple!) metric space: the L0 norm (number of components changed), L1, L2 norm (“Euclidean distance”), or L∞.
Assumption (to map to the earlier definition): a small perturbation does not change the class in Oracle space.
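A quick illustration of these distance metrics for flattened image arrays; this is our own NumPy sketch, and the variable names are not from the slides.

```python
import numpy as np

def distances(x, x_adv):
    """Compute the L0, L1, L2, and Linf distances between a seed x and x_adv."""
    d = (np.asarray(x_adv, dtype=float) - np.asarray(x, dtype=float)).ravel()
    return {
        "L0":   int(np.count_nonzero(d)),        # number of components changed
        "L1":   float(np.abs(d).sum()),
        "L2":   float(np.sqrt((d ** 2).sum())),  # "Euclidean distance"
        "Linf": float(np.abs(d).max()),          # largest single change
    }
```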
Defense Strategies
1. Hide the gradients
− Transferability results: clever adversaries can still find adversarial examples.
2. Build a robust classifier
− Adversarial retraining, increasing model capacity, etc.
− If we could build a perfect model, we would!
Our strategy: “Feature Squeezing”, which reduces the search space available to the adversary. (Weilin Xu, Yanjun Qi)
Detection Framework
[Figure: the input is fed to the original model (Prediction0) and, through Squeezer 1 ... Squeezer k, to k squeezed copies of the model (Prediction1 ... Predictionk); a disagreement score d(Prediction0, Prediction1, ..., Predictionk) above a threshold flags the input as adversarial, otherwise legitimate.]
A feature squeezer coalesces similar inputs into one point:
− Little change for legitimate inputs.
− Destroys adversarial perturbations.
Spatial Smoothing: Median Filter
Replace each pixel with the median of its neighbors (e.g., a 3×3 median filter).
Effective at eliminating “salt-and-pepper” noise (L0 attacks).
Image from https://sultanofswing90.wordpress.com/tag/image-processing/
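A minimal median-filter squeezer sketch using SciPy; the library choice and function names here are ours, not prescribed by the slides.

```python
import numpy as np
from scipy.ndimage import median_filter

def median_smooth(image, size=3):
    """image: H x W x C array with values in [0, 1]; returns a smoothed copy.

    The filter window spans only the spatial dimensions, not the channels.
    """
    return median_filter(np.asarray(image, dtype=float), size=(size, size, 1))
```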
Spatial Smoothing: Non-local Means
Replace each patch p with a weighted mean of similar patches qi found within a search region:
  p′ = Σi w(p, qi) · qi
Preserves edges, while removing noise.
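A non-local-means squeezer sketch using scikit-image's denoise_nl_means (assuming a recent scikit-image; the parameter values are illustrative, not the slides' settings).

```python
from skimage.restoration import denoise_nl_means

def nl_means_smooth(image, h=0.02):
    """image: H x W x C float array in [0, 1]; h controls the filtering strength."""
    return denoise_nl_means(image, h=h, patch_size=5, patch_distance=6,
                            channel_axis=-1)
```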
Other Potential Squeezers
− Thermometer encoding (learnable bit depth reduction): J. Buckman et al., Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018.
− Image denoising using autoencoders, wavelets, JPEG, etc.: D. Meng and H. Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017; A. Prakash et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018.
− Image resizing and randomization: C. Xie et al., Mitigating Adversarial Effects Through Randomization, ICLR 2018.
− ...
“Feature Squeezing” (Vacuous) Conjecture
For any distance-limited adversarial method, there exists some feature squeezer that accurately detects its adversarial examples.
Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces the original and the adversarial example into the same sample.
Feature Squeezing Detection
[Figure: the input goes to the model (a 7-layer CNN) three ways: directly (Prediction0), after 1-bit depth reduction (Prediction1), and after a 2×2 median filter (Prediction2).]
The input is flagged as adversarial iff max( d1(p0, p1), d1(p0, p2) ) > T, where d1 is the L1 distance between prediction vectors; otherwise it is treated as legitimate.
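A sketch of this detector configuration in NumPy. `model` is assumed to return a softmax probability vector for a single H x W x C image; the squeezer implementations are simplified versions of the ones described above.

```python
import numpy as np
from scipy.ndimage import median_filter

def reduce_bit_depth(image, bits=1):
    """Quantize pixel values in [0, 1] to 2**bits levels."""
    levels = 2 ** bits - 1
    return np.round(image * levels) / levels

def detection_score(model, image):
    p0 = model(image)                                  # original input
    p1 = model(reduce_bit_depth(image, bits=1))        # 1-bit depth squeezer
    p2 = model(median_filter(image, size=(2, 2, 1)))   # 2x2 median squeezer
    # Score = max L1 distance between original and squeezed predictions.
    return max(np.abs(p0 - p1).sum(), np.abs(p0 - p2).sum())

def is_adversarial(model, image, threshold):
    return detection_score(model, image) > threshold
```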
Training a Detector (MNIST)
[Histogram: maximum L1 distance between predictions on the original and squeezed inputs, for legitimate vs. adversarial examples.]
Set the detection threshold to keep the false positive rate below a target: threshold = 0.0029 gives 98.2% detection with a false positive rate below 4%.
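Threshold selection itself is simple: choose the smallest threshold whose false positive rate on held-out legitimate examples stays below the target. A sketch (our code; `scores_legit` would be detection_score values computed on legitimate inputs):

```python
import numpy as np

def choose_threshold(scores_legit, target_fp_rate=0.05):
    """Pick a threshold so at most target_fp_rate of legitimate scores exceed it."""
    return float(np.quantile(np.asarray(scores_legit), 1.0 - target_fp_rate))
```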
Threat Models
− Oblivious attack: the adversary has full knowledge of the target model, but is not aware of the detector.
− Adaptive attack: the adversary has full knowledge of the target model and the detector.
Adaptive Adversarial Examples
No successful adversarial examples were found for images originally labeled as 3 or 8.
Mean L2 distance by attack type: untargeted 2.80; targeted (next class) 4.14; targeted (least likely class) 4.67.
Attacker Visibility
− “White-box attacker”: knows the model architecture and all parameters.
− “Black-box attacker”: interacts with the model only through an API, with a limited number of interactions. The output may be a score vector (e.g., “bird”: 0.09, “horse”: 0.84, ...); in the decision-based setting, the output is just the predicted class.
Black-Box Batch Attacker
[Plot: number of queries per image (up to ~10^5) for the ZOO attack on MNIST.]
The effort (number of model interactions) needed to find an adversarial example varies by seed: most images require only a few thousand queries, but a few require more than 10× more effort.
Greedy Search Works
[Plots (MNIST and CIFAR-10): average number of queries (×10^4) vs. number of images selected, for Greedy Search, Random Search, and Retroactive Optimal; greedy search closely tracks the retroactive optimal and needs far fewer queries than random selection.]
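For intuition, here is a heavily hedged sketch of one greedy batch-attack ordering: probe every seed with a small query budget, then finish the remaining seeds cheapest-looking first. This heuristic, and the run_attack / estimate_cost callables, are our illustration, not necessarily the exact strategy behind the plot.

```python
def greedy_batch_attack(seeds, run_attack, estimate_cost, probe_budget=100):
    """run_attack(seed, budget) -> (success, queries_used); budget=None means run
    to completion. estimate_cost(probe_result) -> estimated queries still needed."""
    probes = [run_attack(s, probe_budget) for s in seeds]    # phase 1: probe all seeds
    total_queries = sum(q for _, q in probes)
    # Phase 2: finish the unfinished seeds, cheapest-looking first.
    todo = sorted((i for i, (ok, _) in enumerate(probes) if not ok),
                  key=lambda i: estimate_cost(probes[i]))
    for i in todo:
        _, queries = run_attack(seeds[i], None)
        total_queries += queries
    return total_queries
```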
Conclusions
− Domain expertise still matters: machine learning models designed without domain knowledge will not be robust against motivated adversaries.
− Adversarial machine learning is an immature, but fun and active, research area: we need progress toward meaningful threat models, robustness measures, and verifiable defenses.
− Workshop to be held at IEEE S&P (San Francisco, 24 May); workshop to be held at DSN (Luxembourg, 25 June).