x′ is an adversarial example iff:
  f(x′) = t          (class is t, targeted)
  Δ(x, x′) ≤ ε       (difference below threshold)

Δ(x, x′) is defined in some (simple!) metric space: the L0 norm (number of components changed), L1, the L2 norm (“Euclidean distance”), or L∞.

Assumption (to map to the earlier definition): a small perturbation does not change the class in Oracle space.
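As a concrete illustration of this definition, here is a minimal numpy sketch; `model_predict` and the function names are illustrative assumptions, not part of the original material:

```python
import numpy as np

def lp_distances(x, x_adv):
    """Common L_p distances between an original input x and a candidate
    adversarial example x_adv (numpy arrays of the same shape)."""
    diff = (x_adv - x).ravel()
    return {
        "L0": int(np.count_nonzero(diff)),        # number of components changed
        "L1": float(np.abs(diff).sum()),
        "L2": float(np.sqrt((diff ** 2).sum())),  # "Euclidean distance"
        "Linf": float(np.abs(diff).max()),
    }

def is_targeted_adversarial(model_predict, x, x_adv, target, eps, norm="Linf"):
    """x_adv is a targeted adversarial example iff the model assigns it the
    target class AND its distance from x in the chosen norm is below eps."""
    return model_predict(x_adv) == target and lp_distances(x, x_adv)[norm] <= eps
```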
Clever adversaries can still find adversarial examples.

2. Build a robust classifier
− Adversarial retraining, increasing model capacity, etc.
− If we could build a perfect model, we would!

Our strategy, “Feature Squeezing”, reduces the search space available to the adversary. (Weilin Xu, Yanjun Qi)
Detection framework: run the input through the model directly and through the model preceded by each feature squeezer (Squeezer 1, …, Squeezer k), producing Prediction 0, Prediction 1, …, Prediction k. If the predictions differ by more than a threshold, the input is flagged as adversarial; otherwise it is treated as legitimate.

A feature squeezer coalesces similar inputs into one point:
• Little change for legitimate inputs.
• Destroys adversarial perturbations.
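The comparison step can be sketched briefly; this is a minimal sketch in which `predict_probs`, `squeezers`, and `threshold` are assumed interfaces, and the L1 distance between softmax vectors follows the detector described later in these slides:

```python
import numpy as np

def detect_adversarial(predict_probs, squeezers, x, threshold):
    """Feature-squeezing detection sketch: compare the model's softmax output
    on the raw input with its outputs on squeezed versions of the input.
    A large L1 difference suggests the input is adversarial."""
    p_orig = predict_probs(x)                     # Prediction 0: raw input
    scores = [np.abs(p_orig - predict_probs(squeeze(x))).sum()
              for squeeze in squeezers]           # Predictions 1..k
    return max(scores) > threshold                # True => flag as adversarial
```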
3×3 median filter: replaces each pixel with the median value of its neighbors. Effective at eliminating “salt-and-pepper” noise (L0 attacks).
(Image from https://sultanofswing90.wordpress.com/tag/image-processing/)
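A minimal median-smoothing squeezer, assuming images are numpy arrays; the per-channel handling for color images is an assumption, since the slide shows only the grayscale case:

```python
from scipy.ndimage import median_filter

def median_smoothing_squeezer(x, size=3):
    """Local spatial smoothing: replace each pixel with the median of its
    size x size neighborhood."""
    if x.ndim == 3:
        # (H, W, C) image: smooth spatially, but do not mix color channels.
        return median_filter(x, size=(size, size, 1))
    return median_filter(x, size=size)
```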
Many other squeezers are possible:
• Thermometer encoding (a learnable bit depth reduction)
• Image denoising using autoencoders, wavelets, JPEG, etc.
• Image resizing
• ...

References: C Xie et al., Mitigating Adversarial Effects Through Randomization, ICLR 2018; J Buckman et al., Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018; D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017; A Prakash et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018; ...
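The simplest squeezer in this family is plain (non-learnable) bit depth reduction, the baseline that thermometer encoding generalizes. A minimal sketch, assuming pixel values have already been scaled to [0, 1]:

```python
import numpy as np

def bit_depth_reduction(x, bits=4):
    """Quantize pixel values to 2**bits levels, then rescale back to [0, 1],
    removing the low-order bits an adversary could hide perturbations in."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels
```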
Feature squeezing conjecture: for any distance-limited adversary, there exists some feature squeezer that accurately detects its adversarial examples.

Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces the original and adversarial example into the same sample.
Training a detector (MNIST). [Figure: histograms of the maximum L1 distance between the model's outputs on the original and squeezed input, for legitimate vs. adversarial examples.] Set the detection threshold to keep the false positive rate below a target: threshold = 0.0029 gives 98.2% detection with FP < 4%.
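The threshold-selection step can be sketched as follows; this is a minimal sketch in which `legit_scores` is assumed to hold the detector scores (maximum L1 distances) computed on held-out legitimate examples:

```python
import numpy as np

def choose_threshold(legit_scores, target_fpr=0.05):
    """Pick the detection threshold from legitimate examples only, so that
    at most target_fpr of them score above it (false positives)."""
    scores = np.sort(np.asarray(legit_scores))
    idx = int(np.ceil((1.0 - target_fpr) * len(scores))) - 1
    return scores[idx]
```

The 0.0029 threshold reported on the slide is the value obtained for MNIST with this kind of false-positive target.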
Adaptive attack (Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song, Adversarial Example Defense: Ensembles of Weak Defenses are not Strong, USENIX WOOT ’17):

  minimize  ‖f(x′) − t‖ + c · Δ(x, x′) + d · Score(x′)
            (misclassification term + distance term + detection term)
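The same objective can be written as a loss the attacker optimizes over x′. This is a sketch under the assumptions that `predict_probs` returns a softmax vector and `detector_score` returns the detector's score; the coefficient names c and d and the use of L1 for both terms are illustrative choices:

```python
import numpy as np

def adaptive_attack_loss(predict_probs, detector_score, x, x_adv, target_onehot,
                         c=1.0, d=1.0):
    """Adaptive attacker's objective against a detector: be classified as the
    target, stay close to the original input, and keep the detector score low."""
    misclassification = np.abs(predict_probs(x_adv) - target_onehot).sum()
    distance = np.abs(x_adv - x).sum()        # Δ(x, x') measured in L1 here
    detection = detector_score(x_adv)
    return misclassification + c * distance + d * detection
```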
“Black-box attacker”:
• Interacts with the model only through an API
• Limited number of interactions
• Output is a <class, confidence> vector, e.g. (“bird”, 0.09), (“horse”, 0.84), ...
• Decision-based setting: the output is just the class
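To make the interaction model concrete, here is a toy query-limited attack loop. It is not any specific published attack (the ZOO attack discussed next uses zeroth-order gradient estimates instead), and `query`, assumed to return a (predicted class, confidence vector) pair, is an illustrative API:

```python
import numpy as np

def random_walk_blackbox_attack(query, x, target, eps, step=0.01, max_queries=10000):
    """Toy score-based black-box attack: propose small random steps inside an
    L_inf ball of radius eps around x, keep moves that raise the target class's
    confidence, and count how many model queries were spent."""
    rng = np.random.default_rng(0)
    best = x.copy()
    _, conf = query(best)
    best_conf = conf[target]
    for n in range(1, max_queries + 1):
        candidate = best + rng.uniform(-step, step, size=x.shape)
        candidate = np.clip(candidate, x - eps, x + eps)   # stay distance-limited
        candidate = np.clip(candidate, 0.0, 1.0)           # stay a valid image
        cls, conf = query(candidate)
        if cls == target:
            return candidate, n + 1          # queries used (including the initial one)
        if conf[target] > best_conf:
            best, best_conf = candidate, conf[target]
    return None, max_queries + 1             # budget exhausted
```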
[Figure: ZOO attack on MNIST, histogram of the number of queries needed per seed image.] The effort (number of model interactions) to find an adversarial example varies by seed: most seeds require only a few thousand queries, but a few require more than 10x that effort.
[Figure: average number of queries (×10^4) vs. number of seed images selected (0 to 150), for MNIST and CIFAR-10, comparing Greedy Search, Random Search, and the Retroactive Optimal seed ordering.]
… domain knowledge will not be robust against motivated adversaries.

Adversarial machine learning is an immature, but fun and active, research area. We need to make progress toward meaningful threat models, robustness measures, and verifiable defenses.

Workshops: to be held at DSN (Luxembourg, 25 June) and at IEEE S&P (San Francisco, 24 May).