
Adversarial Machine Learning: Are We Playing the Wrong Game?

https://www.icsi.berkeley.edu/icsi/events/2017/06/adversarial-machine-learning

Presented by David Evans

Thursday, June 8, 2017
10:30 a.m.
ICSI Lecture Hall

Abstract:

Machine learning classifiers are increasingly popular for security applications, and often achieve outstanding performance in testing. When deployed, however, classifiers can be thwarted by motivated adversaries who adaptively construct adversarial examples that exploit flaws in the classifier's model. Much work on adversarial examples, including Carlini and Wagner's attacks, which are the best results to date, has focused on finding small distortions to inputs that fool a classifier. Previous defenses have been both ineffective and very expensive in practice. In this talk, I'll describe a new, very simple strategy, feature squeezing, that can be used to harden classifiers by detecting adversarial examples. Feature squeezing reduces the search space available to an adversary by coalescing samples that correspond to many different inputs in the original space into a single sample. Adversarial examples can be detected by comparing the model's predictions on the original and squeezed samples. In practice, of course, adversaries are not limited to small distortions in a particular metric space. Indeed, it may be possible to make large changes to an input without losing its intended malicious behavior. We have developed an evolutionary framework to search for such adversarial examples, and demonstrated that it can automatically find evasive variants against state-of-the-art classifiers. This suggests that work on adversarial machine learning needs a better definition of adversarial examples, and must make progress towards understanding how classifiers and oracles perceive samples differently.

Speaker Bio:

David Evans (https://www.cs.virginia.edu/evans/) is a Professor of Computer Science at the University of Virginia and leader of the Security Research Group. He is the author of an open computer science textbook and a children's book on combinatorics and computability. He is Program Co-Chair for the ACM Conference on Computer and Communications Security (CCS) 2017, and previously was Program Co-Chair for the 30th (2009) and 31st (2010) IEEE Symposia on Security and Privacy (where he initiated the SoK papers). He has SB, SM, and PhD degrees in Computer Science from MIT and has been a faculty member at the University of Virginia since 1999.

Project site: https://evademl.org
Demo: https://github.com/QData/AdversarialDNN-Playground

Transcript

  1. Adversarial Machine Learning: Are We Playing the Wrong Game? David

    Evans University of Virginia work with Weilin Xu and Yanjun Qi evadeML.org ICSI, Berkeley CA 8 June 2017
  2. … and can solve all Security Problems!
     Spam, IDS, Malware, Fake Accounts, … "Fake News"
  3. [Diagram] Training (supervised learning): Labelled Training Data → Feature Extraction → Vectors → ML Algorithm → Trained Classifier.
     Deployment: Operational Data → Trained Classifier → Malicious / Benign
  4. [Same training/deployment diagram as slide 3]
     Assumption: Training Data is Representative
  5. Adversarial Examples 7
     "panda" + 0.007 × [noise] = "gibbon"
     Example from: Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples. ICLR 2015.
  6. Goal of Machine Learning Classifier 8 Metric Space 1: Target

    Classifier Metric Space 2: “Oracle” Model and visualization based on work by Beilun Wang, Ji Gao and Yanjun Qi (ICLR 2017 Workshop)
  7. Well-Trained Classifier 9 Metric Space 1: Target Classifier Metric Space

    2: “Oracle” Model and visualization based on work by Beilun Wang, Ji Gao and Yanjun Qi (ICLR 2017 Workshop)
  8. Adversarial Examples 10 Metric Space 1: Target Classifier Metric Space

    2: “Oracle” Model and visualization based on work by Beilun Wang, Ji Gao and Yanjun Qi (ICLR 2017 Workshop)
  9. Adversarial Examples 11 Metric Space 1: Target Classifier Metric Space

    2: “Oracle” Adversary’s goal: find a small perturbation that changes class for classifier, but imperceptible to oracle.
  10. Formalizing Adversarial Examples Game 12
      Given seed sample x, find x′ where:
        f(x′) ≠ f(x)       (class is different)
        Δ(x, x′) ≤ ε       (difference below threshold)
  11. Formalizing Adversarial Examples Game 13
      Given seed sample x, find x′ where:
        f(x′) ≠ f(x)       (class is different)
        Δ(x, x′) ≤ ε       (difference below threshold)
      Δ is defined in some metric space:
        L0 "norm" (# different): #{i : x_i ≠ x′_i}
        L1 norm: Σ_i |x_i − x′_i|
        L2 norm ("Euclidean"): √(Σ_i (x_i − x′_i)²)
        L∞ norm: max_i |x_i − x′_i|
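
A minimal NumPy sketch (my illustration, not from the slides) of how these four perturbation metrics can be computed for a pair of flattened inputs x and x_adv:

import numpy as np

def perturbation_metrics(x, x_adv):
    """Compute the L0, L1, L2, and L-infinity size of the perturbation x_adv - x."""
    d = (np.asarray(x_adv, dtype=float) - np.asarray(x, dtype=float)).ravel()
    return {
        "L0": int(np.count_nonzero(d)),        # number of features that changed
        "L1": float(np.abs(d).sum()),          # total absolute change
        "L2": float(np.sqrt((d ** 2).sum())),  # Euclidean distance
        "Linf": float(np.abs(d).max()),        # largest change to any single feature
    }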
  12. Targeted Attacks 14
      Untargeted attack: given seed sample x, find x′ where:
        f(x′) ≠ f(x)       (class is different)
        Δ(x, x′) ≤ ε       (difference below threshold)
      Targeted attack: given seed sample x and target class t, find x′ where:
        f(x′) = t          (class is t)
        Δ(x, x′) ≤ ε       (difference below threshold)
  13. Datasets 15
      MNIST: 70,000 images, 28×28 pixels, 8-bit grayscale; scanned hand-written digits labeled by humans. LeCun, Cortes, Burges [1998]
  14. Datasets 16
      MNIST: 70,000 images, 28×28 pixels, 8-bit grayscale; scanned hand-written digits labeled by humans. LeCun, Cortes, Burges [1998]
      CIFAR-10: 60,000 images, 32×32 pixels, 24-bit color; human-labeled subset of the Tiny Images Dataset in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Alex Krizhevsky [2009]
  15. L∞ Adversary (Fast Gradient Sign) 17
      [Images: original and adversarial versions at adversary power ε = 0.1, 0.2, 0.3, 0.4, 0.5]
      L∞ adversary: max_i |x_i − x′_i| ≤ ε
      Fast gradient sign: x′ = x − ε · sign(∇_x loss_F(x))
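
A minimal FGSM-style sketch (my illustration, not the authors' code). It uses the common untargeted form that steps by ε in the sign direction of the loss gradient; the exact sign convention depends on how the loss is defined. The helper grad_loss(model, x, y) is a hypothetical function returning ∂loss/∂x.

import numpy as np

def fgsm(x, y, model, grad_loss, eps=0.1):
    """One fast-gradient-sign step with an L-infinity budget of eps.

    x         : input image as a float array scaled to [0, 1]
    y         : true label of x
    grad_loss : hypothetical helper returning d(loss)/d(x) for (model, x, y)
    """
    g = grad_loss(model, x, y)        # gradient of the training loss w.r.t. the input
    x_adv = x + eps * np.sign(g)      # step that increases the loss (sign convention may differ)
    return np.clip(x_adv, 0.0, 1.0)   # keep pixels in the valid range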
  16. L∞ Adversary: Binary Filter 18
      [Images: original and ε = 0.1, 0.2, 0.3, 0.4, 0.5, each after a 1-bit filter]  Adversary power: ε
  17. Is this the right game? 20
      Given seed sample x, find x′ where:
        f(x′) ≠ f(x)       (class is different)
        Δ(x, x′) ≤ ε       (difference below threshold)
  18. Arms Race 22 ICLR 2014 ICLR 2015 S&P 2016 S&P

    2017 NDSS 2013 NDSS 2016 NDSS 2016 This Talk
  19. New Idea: Detect Adversarial Examples 23
      Given seed sample x, find x′ where:
        f(x′) ≠ f(x)       (class is different)
        Δ(x, x′) ≤ ε       (difference below threshold)
      Deployed classifier only sees x′ - can we search for x?
  20. [Detection pipeline diagram] 24
      Input → Model → Prediction; Input → Filter 1 → Model → Prediction′; Input → Filter 2 → Model → Prediction′′.
      Compare predictions: if the difference exceeds a threshold, reject the prediction; otherwise accept it.
      Need filters that do not affect predictions on normal inputs, but that reverse malicious perturbations.
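
A minimal sketch of this comparison step (my illustration under assumptions, not the authors' implementation). Here model is assumed to return a probability vector, and filters is a list of squeezing functions such as the bit-depth and smoothing filters described later.

import numpy as np

def detect_adversarial(x, model, filters, threshold):
    """Reject x if any filtered copy changes the model's prediction by more than threshold."""
    p_original = np.asarray(model(x), dtype=float)           # prediction on the raw input
    for squeeze in filters:
        p_squeezed = np.asarray(model(squeeze(x)), dtype=float)  # prediction on the squeezed input
        gap = float(np.abs(p_original - p_squeezed).sum())       # L1 distance between predictions
        if gap > threshold:
            return "reject"                                  # predictions disagree: likely adversarial
    return "ok"                                              # predictions agree: accept model(x)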
  21. “Feature Squeezing” 25
      Adversarial example x′ with f(x′) ≠ f(x):
        [0.054, 0.4894, 0.9258, 0.0116, 0.2898, 0.5222, 0.5074, …]
        [0.0491, 0.4903, 0.9292, 0.009, 0.2942, 0.5243, 0.5078, …]
  22. “Feature Squeezing” 26
      [0.054, 0.4894, 0.9258, 0.0116, 0.2898, 0.5222, 0.5074, …]  → squeeze →  [0.0, 0.5, 1.0, 0.0, 0.25, 0.5, 0.5, …]
      [0.0491, 0.4903, 0.9292, 0.009, 0.2942, 0.5243, 0.5078, …]  → squeeze →  [0.0, 0.5, 1.0, 0.0, 0.25, 0.5, 0.5, …]
      Squeeze: x_i = round(x_i × 4)/4
      squeeze(x′) ≈ squeeze(x) ⟹ f(squeeze(x′)) ≈ f(squeeze(x))
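
A minimal NumPy sketch of this squeezer, plus a more general bit-depth reducer (a common formulation given here as an assumption, not the authors' exact code); both assume features scaled to [0, 1].

import numpy as np

def squeeze_quarter(x):
    """The squeezer shown on the slide: round each feature to the nearest multiple of 1/4."""
    return np.round(np.asarray(x, dtype=float) * 4) / 4

def reduce_bit_depth(x, bits):
    """Keep only `bits` bits of depth by rounding to 2**bits - 1 evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(np.asarray(x, dtype=float) * levels) / levels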
  23. Squeezing Images 28
      Reduce color depth: 8-bit greyscale → 1-bit monochrome
      Median smoothing: 3×3 smoothing replaces each pixel with the median of that pixel and its neighbors
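
A minimal sketch of these two image squeezers (one reasonable implementation using SciPy; the authors' code may differ). binarize assumes an 8-bit greyscale image; median_smooth works on any 2-D array.

import numpy as np
from scipy.ndimage import median_filter

def median_smooth(image, window=3):
    """Replace each pixel with the median of the window x window patch centered on it."""
    return median_filter(image, size=window)

def binarize(image):
    """1-bit monochrome squeezer: threshold an 8-bit greyscale image at mid-range."""
    return (np.asarray(image) > 127).astype(np.uint8) * 255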
  24. MNIST Results: Accuracy 29
      Bit depth:  8 (original)  7      6      5      4      3      2      1
      Accuracy:   .9930         .9930  .9930  .9930  .9930  .9928  .9926  .9924
      Reducing bit depth (all the way to 1) barely reduces model accuracy!
      Out of 10,000 MNIST test images: 19 correct on the original image but wrong on the 1-bit filtered image; 13 wrong on the original image but correct on the 1-bit filtered image; a few both wrong, but differently.
  25. Robustness Results (MNIST) 30
      [Chart: accuracy vs. bit depth (8 down to 1) for adversary strength ε = 0.0 (non-adversarial), 0.1, 0.2, 0.3, and accuracy vs. ε for 8-bit (unfiltered) and 1-bit filtered inputs; unfiltered accuracies shown: .987, .944, .640, .107]
      Even for strong adversaries, the 1-bit filter effectively removes adversarial perturbations.
  26. L0 Adversary (Jacobian-based Saliency Map) 31
      [Images: original vs. JSMA]
      L0 "norm" (# different): #{i : x_i ≠ x′_i}
      Adversary strength ε = 0.1 (can modify up to 10% of pixels)
  27. Smoothing Results (MNIST) 33
      Smoothing window (n×n):  1     2     3     4     5     6     7     8
      Original accuracy:       .993  .988  .991  .980  .943  .845  .650  .479
      Adversarial (JSMA):      .014  .700  .976  .953  .906  .791  .616  .454
      No smoothing: adversary succeeds 98.6% of the time
  28. Smoothing Results 34
      [Same MNIST chart as slide 27, plus CIFAR-10 results for smoothing windows 1-4; values shown include .9257, .8592, .7812 and .0100, .8400, .7500]
      2×2 smoothing defeats the adversary, but reduces accuracy
  29. Carlini/Wagner Untargeted Attacks 35
      Nicholas Carlini, David Wagner. Oakland 2017 (Best Student Paper)
      Accuracy on adversarial examples:
        MNIST:     L2 0.0   L∞ 0.0   L0 0.0
        CIFAR-10:  L2 0.0   L∞ 0.0   L0 0.0
      Adversary succeeds 100% of the time with very small perturbations.
      “Our L∞ attacks on ImageNet are so successful that we can change the classification of an image to any desired label by only flipping the lowest bit of each pixel, a change that would be impossible to detect visually.”
  30. Squeezing Results (2×2 Median Smoothing) 36
      Weilin Xu, David Evans, Yanjun Qi. https://arxiv.org/abs/1705.10686
      Accuracy on adversarial examples (Original → Squeezed):
        MNIST:     L2 0.0 → 0.904   L∞ 0.0 → 0.942   L0 0.0 → 0.817
        CIFAR-10:  L2 0.0 → 0.682   L∞ 0.0 → 0.661   L0 0.0 → 0.706
  31. Results on Carlini/Wagner Untargeted Attacks 37
      Weilin Xu, David Evans, Yanjun Qi. https://arxiv.org/abs/1705.10686
      Accuracy on adversarial examples (Original → Squeezed):
        MNIST:     L2 0.0 → 0.904   L∞ 0.0 → 0.942   L0 0.0 → 0.817
        CIFAR-10:  L2 0.0 → 0.682   L∞ 0.0 → 0.661   L0 0.0 → 0.706
      Accuracy on legitimate examples: 0.783
  32. Results on Carlini/Wagner Targeted Attacks 38
      Weilin Xu, David Evans, Yanjun Qi. https://arxiv.org/abs/1705.10686
      Adversary success rate (Original → Squeezed):
        MNIST:     L2 0.999 → 0.022   L∞ 1.0 → 0.011   L0 1.0 → 0.057
        CIFAR-10:  L2 1.0 → 0.033     L∞ 1.0 → 0.037   L0 1.0 → 0.037
  33. Detecting Adversarial Examples 39
      Input → Model → Prediction; Input → Bit Depth squeeze → Model → Prediction′; Input → Median Smoothing squeeze → Model → Prediction′′.
      Compare predictions (L1 distance): if the difference exceeds a threshold, flag the input as adversarial; otherwise treat it as normal.
  34. Detection Results 40
      [Histogram over 1000 MNIST samples: L1 distance between normal and squeezed predictions for non-adversarial, FGSM, and JSMA inputs]
  35. Detection Results 41
      [Same histogram as slide 34: L1 distance between normal and squeezed predictions, MNIST, 1000 samples]
      With a max L1 threshold of 0.362, detect 99.5% of adversarial examples with a false positive rate of 1.5%.
  36. Detection Results on C/W Attacks
      Weilin Xu, David Evans, Yanjun Qi. https://arxiv.org/abs/1705.10686
      Dataset    ROC-AUC  L1 Distance Threshold  Accuracy  True Positive Rate  False Positive Rate
      MNIST      0.9950   0.1147                 0.9880    0.9933              0.0173
      CIFAR-10   0.8711   0.7423                 0.8750    0.9527              0.2027
      (Validated results: half the samples are used to determine the threshold, the other half for testing)
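
A minimal sketch of this validation protocol (my illustration; the function names are assumptions): choose the L1-distance threshold on one half of the samples, then evaluate detection on the held-out half.

import numpy as np

def pick_threshold(distances, labels):
    """Pick the threshold maximizing detection accuracy on the fitting half.

    distances : L1 gaps between original and squeezed predictions (array)
    labels    : 1 for adversarial samples, 0 for legitimate samples (array)
    """
    best_t, best_acc = 0.0, 0.0
    for t in np.unique(distances):
        acc = float(np.mean((distances > t) == labels))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def validate(distances, labels, seed=0):
    """Split the samples in half: fit the threshold on one half, test on the other."""
    distances, labels = np.asarray(distances), np.asarray(labels)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(distances))
    fit, test = idx[: len(idx) // 2], idx[len(idx) // 2 :]
    t = pick_threshold(distances[fit], labels[fit])
    accuracy = float(np.mean((distances[test] > t) == labels[test]))
    return t, accuracy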
  37. Arms Race 43 ICLR 2014 ICLR 2015 S&P 2016 S&P

    2017 NDSS 2013 NDSS 2016 NDSS 2016 Feature Squeezing Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song (upcoming paper) Authors TBD (not yet started paper)
  38. Raising the Bar or Changing the Game? 44 Metric Space

    1: Target Classifier Metric Space 2: “Oracle” Before: find a small perturbation that changes class for classifier, but imperceptible to oracle.
  39. Raising the Bar or Changing the Game? 45 Metric Space

    1: Target Classifier Metric Space 2: “Oracle” Before: find a small perturbation that changes class for classifier, but imperceptible to oracle. Now: change class for both original and squeezed classifier, but imperceptible to oracle.
  40. “Feature Squeezing” Conjecture For any distance-limited adversarial method, there exists

    some feature squeezer that accurately detects its adversarial examples. 46 Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces original and adversarial example into same sample.
  41. Changing the Game 47
      Option 1: Find distance-limited adversarial methods for which it is intractable to find an effective feature squeezer.
      Option 2: Redefine adversarial examples so distance is not limited (in a simple metric space). (focus of rest of the talk)
  42. Faraway Adversarial Examples 49 Metric Space 1: Target Classifier Metric

    Space 2: “Oracle” Need a domain where we know Metric Space 2: “Oracle”
  43. [Chart: vulnerabilities reported in Adobe Acrobat Reader per year, 2006-2017]
      Source: http://www.cvedetails.com/vulnerability-list.php?vendor_id=53&product_id=921
  44. PDF Malware Classifiers
      PDFrate [ACSAC 2012]: Random Forest; manual features (object counts, lengths, positions, …)
      Hidost13 [NDSS 2013]: Support Vector Machine; automated features (object structural paths)
      Hidost16 [JIS 2016]: Random Forest; automated features (object structural paths)
      Very robust against “strongest conceivable mimicry attack”.
  45. Automated Classifier Evasion Using Genetic Programming
      [Diagram: Malicious PDF → Clone → Mutation (using Benign PDFs) → Variants → evaluated (✓/✗) against the Oracle → Select Variants → Found Evasive?]
  46. Generating Variants
      [Same genetic search diagram as slide 45]
  47. Generating Variants
      [Same genetic search diagram as slide 45]
      Select a random node in the PDF structure (e.g. /Root → /Catalog → /Pages → … → /JavaScript eval(‘…’);)
  48. Generating Variants
      [Same genetic search diagram as slide 45]
      Select a random node, then randomly transform it: delete, insert, or replace
  49. Generating Variants
      [Same genetic search diagram as slide 45]
      Select a random node, then randomly transform it: delete, insert, or replace, drawing inserted or replacement nodes from benign PDFs
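
A minimal sketch of the mutation step (my illustration; EvadeML itself operates on parsed PDF object trees, represented here simply as a list of (path, object) pairs):

import random

def mutate(pdf_nodes, benign_nodes):
    """Apply one random delete/insert/replace transformation to a cloned variant.

    pdf_nodes    : list of (path, object) pairs from the malicious seed PDF
    benign_nodes : pool of (path, object) pairs harvested from benign PDFs
    """
    variant = list(pdf_nodes)                           # clone before mutating
    i = random.randrange(len(variant))                  # select a random node
    op = random.choice(["delete", "insert", "replace"])
    if op == "delete":
        del variant[i]
    elif op == "insert":
        variant.insert(i, random.choice(benign_nodes))  # splice in a benign node
    else:
        variant[i] = random.choice(benign_nodes)        # replace with a benign node
    return variant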
  50. Selecting Promising Variants
      [Same genetic search diagram as slide 45]
  51. Selecting Promising Variants
      [Same genetic search diagram as slide 45]
      Fitness function: each candidate variant is sent to the Oracle and the Target Classifier, and the (oracle verdict, classifier score) pair is combined into a score.
  52. Oracle
      Execute each candidate in a vulnerable Adobe Reader in a virtual environment (Cuckoo sandbox, https://github.com/cuckoosandbox; simulated network: INetSim)
      Behavioral signature: malicious if the signature (HTTP_URL + HOST extracted from API traces) matches
      Advantage: we know the target malware behavior
  53. Fitness Function
      Assumes lost malicious behavior will not be recovered.
      fitness(x) = 0.5 − classifier_score(x)   if oracle(x) = “malicious”
                 = −∞                          otherwise
      classifier_score ≥ 0.5: labeled malicious
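
A minimal sketch of this fitness function (my illustration; oracle and classifier_score stand for the Cuckoo-based oracle and the target classifier, both assumed to be callables):

def fitness(variant, oracle, classifier_score):
    """Score a candidate variant: reward low classifier scores, but only when the
    oracle confirms that the malicious behavior survived the mutations."""
    if oracle(variant) != "malicious":
        return float("-inf")                  # behavior lost and assumed unrecoverable: discard
    return 0.5 - classifier_score(variant)    # positive once the classifier labels it benign (< 0.5)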
  54. Classifier Performance
      Metric                                   PDFrate   Hidost
      Accuracy                                 0.9976    0.9996
      False Negative Rate                      0.0000    0.0056
      False Negative Rate against Adversary    1.0000    1.0000
  55. [Chart: Seeds Evaded (out of 500) vs. Number of Mutations, for PDFrate and Hidost]
  56. [Same chart as slide 55: Seeds Evaded (out of 500) vs. Number of Mutations]
      Simple transformations often worked
  57. [Same chart as slide 55]
      (insert, /Root/Pages/Kids, 3:/Root/Pages/Kids/4/Kids/5/) works on 162/500 seeds
  58. [Same chart as slide 55]
      Works on 162/500 seeds; some seeds required complex transformations
  59. 85-step mutation trace evading Hidost; effective for 198/500 seeds
      Insert: Threads, ViewerPreferences/Direction, Metadata, Metadata/Length, Metadata/Subtype, Metadata/Type, OpenAction/Contents, OpenAction/Contents/Filter, OpenAction/Contents/Length, Pages/MediaBox
      Delete: AcroForm, Names/JavaSCript/Names/S, AcroForm/DR/Encoding/PDFDocEncoding, AcroForm/DR/Encoding/PDFDocEncoding/Differences, AcroForm/DR/Encoding/PDFDocEncoding/Type, Pages/Rotate, AcroForm/Fields, AcroForm/DA, Outlines/Type, Outlines, Outlines/Count, Pages/Resources/ProcSet, Pages/Resources
  60. Oracle Execution Cost
      [Chart: hours to find all 500 variants on one desktop PC, broken down into Oracle, Mutation, and Classifier time, for Hidost and PDFrate]
  61. Possible Defense: Adjust Threshold Charles Smutz, Angelos Stavrou. When a

    Tree Falls: Using Diversity in Ensemble Classifiers to Identify Evasion in Malware Detectors. NDSS 2016.
  62. [Same training/deployment diagram as slide 3, with an added feedback step: Retrain Classifier]
  63. [Chart: Seeds Evaded (out of 500) vs. Generations, Hidost16]
      Original classifier: takes 614 generations to evade all seeds
  64. [Chart: Seeds Evaded (out of 500) vs. Generations, for Hidost16 and the retrained HidostR1]
  65. [Same chart as slide 64]
  66. [Chart: Seeds Evaded (out of 500) vs. Generations, for Hidost16, HidostR1, and HidostR2]
  67. [Same chart as slide 66]
  68. [Same chart as slide 66]
      False Positive Rates (Genome / Contagio Benign):
        Hidost16: 0.00 / 0.00
        HidostR1: 0.78 / 0.30
        HidostR2: 0.85 / 0.53
  69. Hiding the Classifier
      [Same genetic search diagram as slide 51: candidate variants are scored by the fitness function from the (oracle verdict, classifier score) pair]
  70. Cross-Evasion Effects PDF Malware Seeds Hidost 13 Evasive PDF Malware

    (against PDFrate) Automated Evasion PDFrate 2/500 Evasive (0.4% Success) Potentially Good News?
  71. Evasive PDF Malware (against PDFrate) Cross-Evasion Effects PDF Malware Seeds

    Hidost 13 Automated Evasion PDFrate 2/500 Evasive (0.4% Success) Evasive PDF Malware (against Hidost) 387/500 Evasive (77.4% Success)
  72. Cross-Evasion Effects PDF Malware Seeds Automated Evasion 6/500 Evasive (0.6%

    Success) Hidost 13 Evasive PDF Malware (against Hidost)
  73. Evading Gmail’s Classifier
      Evasion rate on Gmail: 179/380 (47.1%)
      for javascript in pdf.all_js:
          javascript.append_code("var ucb=1;")
      if pdf.get_size() < 7050000:
          pdf.add_padding(7050000 - pdf.get_size())
  74. Conclusions
      Domain Knowledge is not Dead
      • Classifiers trained without understanding are vulnerable
      • Adversaries can exploit unnecessary features
      Trust Requires Understanding
      • Good results against test data do not apply to adaptive adversaries
      … but there is hope for building robust ML models!
  75. Credits
      Weilin Xu, Yanjun Qi, and the Security Research Group
      Funding: National Science Foundation, Air Force Office of Scientific Research, Google, Microsoft, Amazon