David Evans, University of Virginia. Work mostly with Weilin Xu and Yanjun Qi. evadeML.org
Center for IT-Security, Privacy and Accountability (CISPA), Universität des Saarlandes, 10 July 2017
2 dimensions: few samples near boundaries; every sample is near 1-3 classes.
Thousands of dimensions: all samples are near boundaries; every sample is near all classes.
[Detection framework: compare the model's prediction on the input with its prediction on the squeezed input; if the difference exceeds a threshold, reject the input, otherwise accept the prediction.]
Need filters that do not affect predictions on normal inputs, but that reverse malicious perturbations.
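A minimal sketch of this detection pipeline, assuming model(x) returns a softmax probability vector and each squeezer is a function on inputs; the function names and threshold handling here are illustrative, not the exact configuration from the talk.

import numpy as np

def detect(x, model, squeezers, threshold):
    # Prediction on the unmodified input (a probability vector over classes).
    p_original = model(x)
    for squeeze in squeezers:
        # Prediction on the squeezed version of the same input.
        p_squeezed = model(squeeze(x))
        # If the predictions differ by more than the threshold, reject the input.
        if np.sum(np.abs(p_original - p_squeezed)) > threshold:
            return None  # Reject: likely adversarial.
    return p_original    # Accept: return the normal prediction.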
[Plot: MNIST test accuracy vs. bit depth; accuracy stays near 0.9930 from 8 bits down to about 0.9924 at 1 bit.]
Reducing bit depth (all the way to 1 bit) barely reduces model accuracy!
Out of 10,000 MNIST test images: 19 are correct on the original image but wrong on the 1-bit filtered image, 13 are wrong on the original image but correct on the 1-bit filtered image, and some are wrong on both, but differently.
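A sketch of the bit-depth-reduction squeezer behind these numbers, assuming images are float arrays scaled to [0, 1]; this is the standard round-to-fewer-levels approach, not necessarily the exact EvadeML implementation.

import numpy as np

def reduce_bit_depth(x, bits):
    # Quantize each pixel in [0, 1] to 2**bits levels, then rescale back to [0, 1].
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

# With bits=1 this is a binary filter: every pixel becomes 0 or 1.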
All three Carlini/Wagner attacks (the L0, L2, and L∞ variants) reduce model accuracy on adversarial examples to 0.0 on both MNIST and CIFAR-10: the adversary succeeds 100% of the time with very small perturbations.
Nicholas Carlini, David Wagner. Oakland 2017 (Best Student Paper): "Our [L∞] attacks on ImageNet are so successful that we can change the classification of an image to any desired label by only flipping the lowest bit of each pixel, a change that would be impossible to detect visually."
Weilin Xu, David Evans, Yanjun Qi. https://arxiv.org/abs/1705.10686
Accuracy on adversarial examples, original vs. squeezed:
Data Set   Attack   Original   Squeezed
MNIST      CW-L2    0.0        0.904
MNIST      CW-L∞    0.0        0.942
MNIST      CW-L0    0.0        0.817
CIFAR-10   CW-L2    0.0        0.682
CIFAR-10   CW-L∞    0.0        0.661
CIFAR-10   CW-L0    0.0        0.706
Accuracy on legitimate examples: 0.783
Adversary success rate, original vs. squeezed:
Data Set   Attack   Original   Squeezed
MNIST      CW-L2    0.999      0.022
MNIST      CW-L∞    1.0        0.011
MNIST      CW-L0    1.0        0.057
CIFAR-10   CW-L2    1.0        0.033
CIFAR-10   CW-L∞    1.0        0.037
CIFAR-10   CW-L0    1.0        0.037
[Histogram: L1 distance between predictions on normal and squeezed inputs, for non-adversarial, FGSM, and JSMA examples; MNIST, 1,000 samples.]
Detection results: with a maximum L1-distance threshold of 0.362, we detect 99.5% of adversarial examples with a false positive rate of 1.5%.
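A sketch of how such a threshold can be evaluated, assuming model(x) returns a probability vector and squeeze is a feature squeezer; the function names are mine, and the score is the L1 distance used above.

import numpy as np

def squeeze_score(model, squeeze, x):
    # L1 distance between predictions on the original and the squeezed input.
    return np.sum(np.abs(model(x) - model(squeeze(x))))

def evaluate_threshold(scores_legitimate, scores_adversarial, threshold):
    # Detection rate: fraction of adversarial examples whose score exceeds the threshold.
    detection_rate = np.mean(np.asarray(scores_adversarial) > threshold)
    # False positive rate: fraction of legitimate examples wrongly flagged.
    false_positive_rate = np.mean(np.asarray(scores_legitimate) > threshold)
    return detection_rate, false_positive_rate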
[Diagram: (1) the target classifier and (2) the "oracle", both over the input metric space.]
Before: find a small perturbation that changes the class assigned by the classifier, but is imperceptible to the oracle.
Now: the perturbation must change the class assigned by both the original and the squeezed classifier, while remaining imperceptible to the oracle.
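One way to write this harder objective (my notation, not the slide's): for target model f, squeezer T, original input x with class y, oracle distance d, perturbation budget \epsilon, and detection threshold \tau, the adversary needs an x' with
\[
\arg\max f(x') \ne y, \qquad \arg\max f(T(x')) \ne y, \qquad \lVert f(x') - f(T(x')) \rVert_1 \le \tau, \qquad d(x, x') \le \epsilon .
\]
The third condition is the new obstacle: the prediction must stay consistent under squeezing, or the detector flags the input.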
Conjecture: for any distance-limited attack, there exists some feature squeezer that accurately detects its adversarial examples.
Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces the original and the adversarial example into the same sample.
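A concrete instance of this intuition, using the 1-bit squeezer from earlier (my example, not the slide's): for the binary filter S(x)_i = \mathbf{1}[x_i > 0.5],
\[
\lVert \delta \rVert_\infty < \varepsilon \quad\text{and}\quad |x_i - 0.5| > \varepsilon \ \text{for all } i \quad\Longrightarrow\quad S(x + \delta) = S(x),
\]
so any perturbation that small is erased: the squeezed adversarial example coalesces with the squeezed original, and the two predictions cannot differ.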
[Detection framework with multiple squeezers: compare the prediction on the input with the prediction on each randomized squeezed version (L1 distance); if the difference exceeds the threshold, label the input adversarial, otherwise normal.]
Squeezers can be selected randomly, and can behave differently, at random, for each feature.
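A rough sketch of what randomized squeezing could look like (purely illustrative; the talk does not give an implementation): choose a squeezer at random for each input, and randomize its behavior per feature.

import numpy as np

def random_binary_squeeze(x, rng):
    # Binarize with a per-feature random threshold, so the exact squeezer
    # behavior differs for each feature and cannot be precomputed by the adversary.
    thresholds = rng.uniform(0.4, 0.6, size=x.shape)
    return (x > thresholds).astype(x.dtype)

def random_bit_depth_squeeze(x, rng):
    # Quantize to a randomly chosen bit depth between 1 and 3 bits.
    levels = 2 ** rng.integers(1, 4) - 1
    return np.round(x * levels) / levels

def randomized_squeeze(x, rng=None):
    # Select one of the available squeezers at random for this input.
    rng = rng or np.random.default_rng()
    squeezers = [random_binary_squeeze, random_bit_depth_squeeze]
    return squeezers[rng.integers(len(squeezers))](x, rng)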
Option 1: find attacks for which it is intractable to find an effective feature squeezer.
Option 2: redefine adversarial examples so that the distance is not limited (in a simple metric space). (Focus of the rest of the talk.)
Cuckoo sandbox (https://github.com/cuckoosandbox) with a simulated network (INetSim).
Behavioral signature: HTTP_URL + HOST extracted from API traces; a sample is labeled malicious if the signature matches.
Advantage: we know the target malware behavior.
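A hedged sketch of this kind of oracle, assuming a Cuckoo report.json whose network section lists HTTP requests with host and uri fields (field names vary across Cuckoo versions, and the talk extracts them from API traces); the signature values are hypothetical placeholders for the target malware's callback.

import json

# Hypothetical HTTP_URL + HOST signature for the target malware's callback.
SIGNATURE = {"host": "command-server.example", "uri": "/gate.php"}

def behaves_maliciously(report_path):
    # Load the Cuckoo analysis report produced with the INetSim simulated network.
    with open(report_path) as f:
        report = json.load(f)
    # Scan recorded HTTP requests; the signature matches if the expected
    # host and URI appear, i.e. the malicious behavior was preserved.
    for request in report.get("network", {}).get("http", []):
        if (request.get("host") == SIGNATURE["host"]
                and request.get("uri") == SIGNATURE["uri"]):
            return True
    return False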
Trust Requires Understanding
• Machine learning without understanding is vulnerable: adversaries can exploit unnecessary features.
• Good results against test data do not apply to adaptive adversaries.
• ...but there is hope for building robust ML models!