that misinterpreted what this individual posted. Even though our translations are getting better each day, mistakes like these might happen from time to time and we’ve taken steps to address this particular issue. We apologize to him and his family for the mistake and the disruption this caused.”
AI risks: out-of-control AI inadvertently causes harm; malicious operators build AI to do harm; malicious abuse of benign AI.
"On Robots", Joe Berger and Pascal Wyse (The Guardian, 21 July 2018)
kind of instrument which will increase the power of the mind much more than optical lenses strengthen the eyes and which will be as far superior to microscopes or telescopes as reason is superior to sight.” Gottfried Wilhelm Leibniz (1679)
Bernoulli (Universität Basel, 1684), who advised:
Johann Bernoulli (Universität Basel, 1694), who advised:
Leonhard Euler (Universität Basel, 1726), who advised:
Joseph Louis Lagrange, who advised:
Siméon Denis Poisson, who advised:
Michel Chasles (École Polytechnique, 1814), who advised:
H. A. (Hubert Anson) Newton (Yale, 1850), who advised:
E. H. Moore (Yale, 1885), who advised:
Oswald Veblen (U. of Chicago, 1903), who advised:
Philip Franklin (Princeton, 1921), who advised:
Alan Perlis (MIT Math PhD, 1950), who advised:
Jerry Feldman (CMU Math, 1966), who advised:
Jim Horning (Stanford CS PhD, 1969), who advised:
John Guttag (U. of Toronto CS PhD, 1975), who advised:
David Evans (MIT CS PhD, 2000)
my academic great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-grandparent!
Normal computing amplifies (quadrillions of times faster) and aggregates (enables millions of humans to work together) human cognitive abilities; AI goes beyond what humans can do.
Machine learning means computers do things their programmers don't understand well enough to program explicitly. If we could specify precisely what the model should do, we wouldn't need ML to do it!
The best we can hope for is verifying certain properties, for example model similarity between two models M₁ and M₂: ∀x. M₁(x) = M₂(x).
DeepXplore: Automated Whitebox Testing of Deep Learning Systems. Kexin Pei, Yinzhi Cao, Junfeng Yang, Suman Jana. SOSP 2017.
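As a concrete illustration, one can at least sample such a similarity property by differential testing: run both models on the same inputs and collect disagreements (this is the spirit of DeepXplore, though not its coverage-guided whitebox method). A minimal sketch, assuming two PyTorch classifiers `m1`, `m2` and a standard data loader (all names are placeholders):

```python
import torch

def disagreement_inputs(m1, m2, loader, device="cpu"):
    """Collect inputs where the two models' predicted classes differ.

    This only samples the property  forall x: M1(x) = M2(x);
    it cannot prove the property holds everywhere.
    """
    m1.eval()
    m2.eval()
    found = []
    with torch.no_grad():
        for x, _ in loader:
            x = x.to(device)
            p1 = m1(x).argmax(dim=1)
            p2 = m2(x).argmax(dim=1)
            mask = p1 != p2
            if mask.any():
                found.append(x[mask].cpu())
    return torch.cat(found) if found else torch.empty(0)
```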
x and x + Δ are essentially the same input (to an oracle), but the model M classifies them differently.
Adversary's goal: find a "small" perturbation Δ that changes the model output (targeted attack: changes it in some desired way).
Defender's goals:
Robust model: find a model where this is hard.
Detection: detect inputs that are adversarial.
Greek gift’s free of treachery? Is that Ulysses’s reputation? Either there are Greeks in hiding, concealed by the wood, or it’s been built as a machine to use against our walls, or spy on our homes, or fall on the city from above, or it hides some other trick: Trojans, don’t trust this horse. Whatever it is, I’m afraid of Greeks even those bearing gifts.’ Virgil, The Aeneid (Book II)
[Figure: an image classified as “panda” plus a small perturbation is classified as “gibbon”.]
Example from: Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples. 2014 (in ICLR 2015)
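The example above was generated with the fast gradient sign method (FGSM) introduced in that paper. A minimal sketch of FGSM, assuming a differentiable PyTorch classifier `model`, labels `y`, and pixel values in [0, 1] (the ε value is illustrative):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.007):
    """One-step l-infinity attack: move each pixel by eps in the
    direction of the sign of the loss gradient (Goodfellow et al.)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # stay in the valid pixel range
```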
Ball_ε(x) is typically defined in some (simple!) metric space: ℓ₀ norm (number of features that differ), ℓ₂ norm (“Euclidean distance”), ℓ∞ norm (maximum change to any feature). Without constraints on Ball_ε, every input has adversarial examples.
Prediction Change Definition: An input x′ ∈ X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x) ≠ f(x′).
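For concreteness, the three metrics can be computed directly on the perturbation Δ = x′ − x; a small NumPy sketch (array shapes and names are illustrative):

```python
import numpy as np

def perturbation_norms(x, x_adv):
    """Size of the perturbation x_adv - x under the three common metrics."""
    delta = (x_adv - x).ravel()
    return {
        "l0": int(np.count_nonzero(delta)),          # number of features changed
        "l2": float(np.linalg.norm(delta, ord=2)),   # Euclidean distance
        "linf": float(np.max(np.abs(delta))),        # largest single change
    }
```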
Error Robustness Definition: An input x′ ∈ X is an adversarial example for a (correctly classified) x ∈ X iff x′ ∈ Ball_ε(x) and f(x′) ≠ the true label of x′. A perfect classifier has no (error-robustness) adversarial examples. But if we had a way to know the true label, we wouldn't need an ML classifier.
Recent Global Robustness Results (properties of any model for the input space):
Data distributed on two concentric n-spheres: the expected safe distance (ℓ₂ norm) is relatively small.
Adversarial vulnerability for any classifier [Fawzi, Fawzi & Fawzi, 2018]: for a smooth generative model (Gaussian in latent space, L-Lipschitz generator), adversarial risk ⟶ 1 for relatively small attack strength (ℓ₂ norm): P(r(x) ≤ η) ≥ 1 − √(π/2) · e^(−η²/2L²).
Curse of Concentration in Robust Learning [Mahloujifar et al., 2018]: for normal Lévy families (unit sphere, uniform, ℓ₂ norm; Boolean hypercube, uniform, Hamming distance; ...), if the attack strength exceeds a relatively small threshold, adversarial risk > 1/2: b > √(log(k₁/ε)) / √(k₂ · n) ⟹ Risk_b(h, c) ≥ 1/2.
Common pattern: the distance to an adversarial example is small relative to the expected distance between two sampled points.
Prediction Change Definition: An input x′ ∈ X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x′) ≠ f(x). Under this definition, any non-trivial model (one where ∃x₀, x₁ ∈ X with f(x₀) ≠ f(x₁)) has adversarial examples.
Solutions: only consider particular inputs (“good” seeds); the output isn't just a class (e.g., include confidence); targeted adversarial examples; cost-sensitive adversarial robustness.
Robust Region: the maximum region around x with no adversarial example:
RobustRegion(x) = sup{ε > 0 : ∀x′ ∈ Ball_ε(x), f(x′) = f(x)}
Robust Error: for a test set T and bound ε_A:
|{x ∈ T : RobustRegion(x) < ε_A}| / |T|
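As a worked example of the robust-error definition: given a (certified or estimated) robust-region radius for each test input, the robust error is just the fraction of inputs whose radius falls below the bound ε_A. A minimal sketch (names are hypothetical):

```python
def robust_error(radii, eps_a):
    """Fraction of test inputs whose robust region is smaller than eps_a.

    radii: per-input robust-region radii, e.g. certified lower bounds
           from a verifier or upper bounds found by attacks.
    """
    radii = list(radii)
    return sum(r < eps_a for r in radii) / len(radii)

# Example: robust_error([0.05, 0.30, 0.12], eps_a=0.1) -> 1/3
```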
To evade a malware classifier, find an adversarial example x′ that satisfies: f(x′) = “benign” (the model misclassifies it) and ℬ(x′) = ℬ(x) (the malicious behavior is preserved).
Generic attack: heuristically explore the input space for an x′ that satisfies the definition. There is no requirement that x ~ x′ except through ℬ.
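A minimal sketch of such a generic attack as mutation-based search, assuming a classifier `f`, a list of candidate `mutations`, and a behavior oracle `same_behavior` (all hypothetical placeholders; real attacks on, e.g., PDF malware use genetic search over many mutation operators):

```python
import random

def evade(x, f, mutations, same_behavior, max_tries=10_000):
    """Heuristically search for x_adv with f(x_adv) == "benign" while an
    oracle confirms the malicious behavior B is preserved."""
    candidate = x
    for _ in range(max_tries):
        mutated = random.choice(mutations)(candidate)
        if not same_behavior(x, mutated):
            continue                  # behavior broken: discard this mutation
        candidate = mutated
        if f(candidate) == "benign":
            return candidate          # evasion found
    return None                       # no evasive variant found
```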
What about a detection threshold? [Figure: classification score for each malware seed, sorted by original score.]
Charles Smutz, Angelos Stavrou. When a Tree Falls: Using Diversity in Ensemble Classifiers to Identify Evasion in Malware Detectors. NDSS 2016.
Detection: An input x′ ∈ X is an adversarial example for x ∈ X iff x′ ∈ Ball_Δ(x) and f(x) ≠ f(x′).
Suggested defense: given an input x*, see how the model behaves on S(x*), where S(·) reverses transformations in Δ-space.
Metric Space 2: “Oracle”. Before: find a small perturbation that changes the class for the classifier, but is imperceptible to the oracle. Now: the perturbation must change the class for both the original and the squeezed classifier, while remaining imperceptible to the oracle.
Many proposed defenses: C. Xie et al., Mitigating Adversarial Effects Through Randomization, ICLR 2018; J. Buckman et al., Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018; D. Meng and H. Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017; A. Prakash et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018; ...
Examples: thermometer encoding (learnable bit-depth reduction); image denoising using autoencoder, wavelet, JPEG, etc.; image resizing; spatial smoothers (median filter, non-local means); ...
Anish Athalye, Nicholas Carlini, David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.
Hypothesis: for any distance-limited adversarial method, there exists some feature squeezer that accurately detects its adversarial examples. Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces the original and adversarial example into the same sample.
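Two squeezers mentioned in this deck, bit-depth reduction and median (spatial) smoothing, are simple to implement. A minimal NumPy/SciPy sketch (parameter values are illustrative, not the tuned settings from the evaluation):

```python
import numpy as np
from scipy.ndimage import median_filter

def reduce_bit_depth(x, bits=1):
    """Squeeze pixel values in [0, 1] down to 2**bits discrete levels."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def median_smooth(x, size=2):
    """Spatial smoothing: replace each pixel by the median of its neighborhood."""
    return median_filter(x, size=size)
```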
Training a detector (MNIST): set the detection threshold to keep the false-positive rate below a target. [Histogram: number of legitimate vs. adversarial examples by maximum ℓ₁ distance between original and squeezed input.] Threshold = 0.0029: detection 98.2%, FP < 4%.
Training a detector (ImageNet): [Histogram: number of legitimate vs. adversarial examples by maximum ℓ₁ distance between original and squeezed input.] Threshold = 1.24: detection 85%, FP < 5%.
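One way to implement the detector described above is to score each input by the maximum ℓ₁ distance between the model's prediction vectors on the original and squeezed versions, and to flag inputs whose score exceeds a threshold chosen on legitimate data so the false-positive rate stays below the target. A minimal sketch (assuming `predict` returns a softmax probability vector; names are hypothetical):

```python
import numpy as np

def squeeze_score(predict, squeezers, x):
    """Maximum l1 distance between predictions on x and on its squeezed versions."""
    p = predict(x)
    return max(np.sum(np.abs(p - predict(s(x)))) for s in squeezers)

def choose_threshold(predict, squeezers, legitimate_inputs, target_fp=0.05):
    """Pick the (1 - target_fp) quantile of scores on legitimate inputs, so
    flagging scores above it keeps the false-positive rate near target_fp."""
    scores = sorted(squeeze_score(predict, squeezers, x) for x in legitimate_inputs)
    index = int(np.ceil(len(scores) * (1 - target_fp))) - 1
    return scores[index]
```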
Robust Defended Region: the maximum region with no undetected adversarial example:
RobustDefendedRegion(x) = sup{ε > 0 : ∀x′ ∈ Ball_ε(x), f(x′) = f(x) ∨ detected(x′)}
Defense Failure: for a test set T and bound ε_A:
|{x ∈ T : RobustDefendedRegion(x) < ε_A}| / |T|
Can we verify a defense?
Robust error at ε = 0.1:

Model                     Accuracy   Robust Error    Robust Error with Binary Filter
Raghunathan et al.        95.82%     14.36%-30.81%   7.37%
Wong & Kolter             98.11%     4.38%           4.25%
Ours with binary filter   98.94%     2.66%-6.63%     -

Even without detection, this helps!
[Diagram: verification query; if an input x′ with max_diff > 0 is feasible, it is adversarial; otherwise the seed is verified.]
Verification: for a seed x, there is no adversarial input x′ ∈ Ball_ε(x) for which f(x′) ≠ f(x) and x′ is not detected.
Adversarially robust retrained model [Wong & Kolter], 1000 MNIST test seeds, ε = 0.1 (ℓ∞):
970 infeasible (verified: no adversarial example)
13 misclassified (original seed)
17 vulnerable
Robust error: 0.3%. Verification time ~0.2s (compared to 0.8s without binarization).
Wong & Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML 2018. Approach: replace the loss with a differentiable function based on an outer bound of the adversarial polytope, computed using a dual network; each ReLU (Rectified Linear Unit) is replaced by a linear approximation.
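The ReLU linear approximation replaces y = max(0, x), when the pre-activation is bounded in [l, u] with l < 0 < u, by the convex outer bound y ≥ 0, y ≥ x, y ≤ u(x − l)/(u − l). A simpler interval-style sketch of propagating bounds through a linear layer and a ReLU (not the dual-network formulation of the cited paper):

```python
import numpy as np

def linear_bounds(W, b, lo, hi):
    """Propagate elementwise input bounds [lo, hi] through y = W x + b."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    lower = W_pos @ lo + W_neg @ hi + b
    upper = W_pos @ hi + W_neg @ lo + b
    return lower, upper

def relu_bounds(lo, hi):
    """Bounds after ReLU; units with lo < 0 < hi are the 'unstable' ones
    that the convex relaxation (y <= u(x - l)/(u - l)) is needed for."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)
```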
Adversary models: in system security, information-theoretic or resource-bounded adversaries are required, and reasoning about capabilities, motivations, and rationality is common; in adversarial machine learning, the adversary is artificially limited (e.g., to small ℓ_p perturbations). We are making progress!
Huge gaps to close: threat models are unrealistic (but the real threats are unclear); verification techniques only work for tiny models; experimental defenses are often (quickly) broken.