Can Machine Learning Ever Be Trustworthy?

David Evans
University of Maryland
Booz Allen Hamilton Distinguished Colloquium
7 December 2018

https://evademl.org

Transcript

  1. Can Machine Learning Ever Be Trustworthy? David Evans University of

    Virginia evadeML.org 7 December 2018 University of Maryland
  2. 3 “Unfortunately, our translation systems made an error last week

    that misinterpreted what this individual posted. Even though our translations are getting better each day, mistakes like these might happen from time to time and we’ve taken steps to address this particular issue. We apologize to him and his family for the mistake and the disruption this caused.”
  3. 4

  4. Risks from Artificial Intelligence 7 Benign developers and operators: AI

    out of control; AI inadvertently causes harm. Malicious operators: build AI to do harm; malicious abuse of benign AI. “On Robots”, Joe Berger and Pascal Wyse (The Guardian, 21 July 2018)
  5. Risks from Artificial Intelligence Benign developers and operators: AI out

    of control; AI inadvertently causes harm. Malicious operators: build AI to do harm; malicious abuse of benign AI systems 8
  6. [Diagram: statistical machine learning pipeline. Training (supervised learning): labelled

    training data → feature extraction → vectors → ML algorithm → trained classifier. Deployment: operational data → feature extraction → trained classifier → malicious / benign] Statistical Machine Learning
  7. [Diagram: the same supervised learning pipeline: labelled training data →

    feature extraction → vectors → ML algorithm → trained classifier → deployment on operational data → malicious / benign] Assumption: Training Data is Representative
  8. 14

  9. More Ambition 15 “The human race will have a new

    kind of instrument which will increase the power of the mind much more than optical lenses strengthen the eyes and which will be as far superior to microscopes or telescopes as reason is superior to sight.”
  10. More Ambition 16 “The human race will have a new

    kind of instrument which will increase the power of the mind much more than optical lenses strengthen the eyes and which will be as far superior to microscopes or telescopes as reason is superior to sight.” Gottfried Wilhelm Leibniz (1679)
  11. 17 Gottfried Wilhelm Leibniz (Universität Altdorf, 1666) who advised: Jacob

    Bernoulli (Universität Basel, 1684) who advised: Johann Bernoulli (Universität Basel, 1694) who advised: Leonhard Euler (Universität Basel, 1726) who advised: Joseph Louis Lagrange who advised: Siméon Denis Poisson who advised: Michel Chasles (École Polytechnique, 1814) who advised: H. A. (Hubert Anson) Newton (Yale, 1850) who advised: E. H. Moore (Yale, 1885) who advised: Oswald Veblen (U. of Chicago, 1903) who advised: Philip Franklin (Princeton 1921) who advised: Alan Perlis (MIT Math PhD 1950) who advised: Jerry Feldman (CMU Math 1966) who advised: Jim Horning (Stanford CS PhD 1969) who advised: John Guttag (U. of Toronto CS PhD 1975) who advised: David Evans (MIT CS PhD 2000) my academic great-great-great-great-great-great-great-great-great-great-great-great-great-great-great-grandparent!
  12. More Precision 18 “The human race will have a new

    kind of instrument which will increase the power of the mind much more than optical lenses strengthen the eyes and which will be as far superior to microscopes or telescopes as reason is superior to sight.” Gottfried Wilhelm Leibniz (1679) Normal computing amplifies (quadrillions of times faster) and aggregates (enables millions of humans to work together) human cognitive abilities; AI goes beyond what humans can do.
  13. Operational Definition “Artificial Intelligence” means making computers do things their

    programmers don’t understand well enough to program explicitly. 19 If it is explainable, it’s not ML!
  14. Inherent Paradox of “Trustworthy” ML 20 “Artificial Intelligence” means making

    computers do things their programmers don’t understand well enough to program explicitly. If we could specify precisely what the model should do, we wouldn’t need ML to do it!
  15. Inherent Paradox of “Trustworthy” ML 21 If we could specify

    precisely what the model should do, we wouldn’t need ML to do it! The best we can hope for is verifying certain properties. Model Similarity: for models M1, M2: ∀x. f1(x) = f2(x). DeepXplore: Automated Whitebox Testing of Deep Learning Systems. Kexin Pei, Yinzhi Cao, Junfeng Yang, Suman Jana. SOSP 2017
  16. Inherent Paradox of “Trustworthy” ML 22 The best we can hope for

    is verifying certain properties. Model Similarity: for models M1, M2: ∀x ∈ S. f1(x) ≈ f2(x) (DeepXplore: Automated Whitebox Testing of Deep Learning Systems. Kexin Pei, Yinzhi Cao, Junfeng Yang, Suman Jana. SOSP 2017). Model Robustness: for a model M: ∀x ∈ S, ∀Δ ∈ D. f(x) ≈ f(x + Δ).
  17. Adversarial Robustness 24 For a model M: ∀x ∈ S, ∀Δ ∈ D.

    f(x) ≈ f(x + Δ). Adversary’s Goal: find a “small” perturbation Δ that changes the model output (targeted attack: in some desired way). Defender’s Goal: Robust Model: find a model where this is hard; Detection: detect inputs that are adversarial.
  18. Not a new problem... 25 Or do you think any

    Greek gift’s free of treachery? Is that Ulysses’s reputation? Either there are Greeks in hiding, concealed by the wood, or it’s been built as a machine to use against our walls, or spy on our homes, or fall on the city from above, or it hides some other trick: Trojans, don’t trust this horse. Whatever it is, I’m afraid of Greeks even those bearing gifts.’ Virgil, The Aeneid (Book II)
  19. Adversarial Examples for DNNs 26 “panda” + 0.007 × [perturbation] =

    “gibbon” Example from: Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples. 2014 (in ICLR 2015)
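As a rough illustration of how such perturbations are produced, here is a minimal FGSM sketch (assuming PyTorch; `model`, `x`, and `y` are illustrative stand-ins for a trained classifier, an input batch, and its labels, not code from the talk):

```python
# Minimal FGSM sketch: perturb the input by epsilon * sign of the loss gradient.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.007):
    """Return x perturbed in the direction that increases the classification loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()      # move each pixel by +/- epsilon
    return x_adv.clamp(0.0, 1.0).detach()    # keep pixels in the valid range
```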
  20. Impact of Adversarial Perturbations 27 [Plot: distance between each layer’s output and

    its output for the original seed; FGSM, ε = 0.0245; CIFAR-10 DenseNet; 5th–95th percentile band]
  21. Impact of Adversarial Perturbations 28 [Plot: distance between each layer’s output and

    its output for the original seed; FGSM, ε = 0.0245, compared with random noise of the same amount; CIFAR-10 DenseNet]
  22. Impact of Adversarial Perturbations 29 [Plot: distance between each layer’s output and

    its output for the original seed; Carlini-Wagner L2 attack, compared with random noise of the same amount; CIFAR-10 DenseNet]
  23. 30 Papers on “Adversarial Examples” (Google Scholar) [chart: papers per year,

    2013–2018]: 1826.68 papers expected in 2018!
  24. 31 Papers on “Adversarial Examples” (Google Scholar) [chart: papers per year,

    2013–2018]: 1826.68 papers expected in 2018!
  25. 32 Papers on “Adversarial Examples” (Google Scholar) [chart: papers per year,

    2013–2018]: Emergence of “Theory”: ICML Workshop 2015; 15% of 2018 “adversarial examples” papers contain “theorem” and “proof”
  26. Adversarial Example 33 Prediction Change Definition: An input x′ ∈

    X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x) ≠ f(x′).
  27. Adversarial Example 34 Ball_ε(x) is some space around x,

    typically defined in some (simple!) metric space: L0 norm (# of features different), L2 norm (“Euclidean distance”), L∞ norm. Without constraints on Ball_ε, every input has adversarial examples. Prediction Change Definition: An input x′ ∈ X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x) ≠ f(x′).
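A small sketch of this prediction-change definition and the Ball_ε constraint under the L0, L2, and L∞ norms mentioned above (the `predict` function is an illustrative stand-in, not part of any library):

```python
# Check whether x_adv is an adversarial example for x under the prediction-change
# definition: x_adv must lie in Ball_eps(x) and the model's prediction must change.
import numpy as np

def in_ball(x, x_adv, eps, norm="linf"):
    delta = (x_adv - x).ravel()
    if norm == "l0":
        return np.count_nonzero(delta) <= eps      # number of features changed
    if norm == "l2":
        return np.linalg.norm(delta, 2) <= eps     # Euclidean distance
    return np.max(np.abs(delta)) <= eps            # Linf: largest per-feature change

def is_adversarial(predict, x, x_adv, eps, norm="linf"):
    return in_ball(x, x_adv, eps, norm) and predict(x) != predict(x_adv)
```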
  28. Adversarial Example 35 Any non-trivial model has adversarial examples: ∃x_a,

    x_b ∈ X. f(x_a) ≠ f(x_b). Prediction Change Definition: An input x′ ∈ X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x) ≠ f(x′).
  29. Prediction Error Robustness 36 Error Robustness: An input x′ ∈

    X is an adversarial example for (correct) x ∈ X iff x′ ∈ Ball_ε(x) and f(x′) ≠ the true label for x′. A perfect classifier has no (error robustness) adversarial examples.
  30. Prediction Error Robustness 37 Error Robustness: An input x′ ∈

    X is an adversarial example for (correct) x ∈ X iff x′ ∈ Ball_ε(x) and f(x′) ≠ the true label for x′. A perfect classifier has no (error robustness) adversarial examples. If we have a way to know this, we don’t need an ML classifier.
  31. Global Robustness Properties 38 Adversarial Risk: probability an input has

    an adversarial example: Pr_{x ← D} [∃ x′ ∈ Ball_ε(x). h(x′) ≠ class(x′)]. Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody, NeurIPS 2018
  32. Global Robustness Properties 39 Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad

    Mahmoody, NeurIPS 2018. Adversarial Risk: probability an input has an adversarial example: Pr_{x ← D} [∃ x′ ∈ Ball_ε(x). h(x′) ≠ class(x′)]. Error Region Robustness: expected distance to the closest adversarial example: E_{x ← D} [inf { ε : ∃ x′ ∈ Ball_ε(x). h(x′) ≠ class(x′) }].
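One operational way to read the adversarial risk definition: sample inputs from the distribution and check whether an attack finds a misclassified point inside the ball. The sketch below is illustrative only; `sample_input`, `true_class`, `attack`, and `model` are assumed helper functions, and an attack-based check only lower-bounds the true risk.

```python
# Rough Monte Carlo estimate of adversarial risk via an attack procedure.
def estimate_adversarial_risk(sample_input, true_class, model, attack, eps, trials=1000):
    hits = 0
    for _ in range(trials):
        x = sample_input()                      # x ~ D
        x_adv = attack(model, x, eps)           # some x' in Ball_eps(x) found by the attack
        if model(x_adv) != true_class(x_adv):   # an adversarial example was found
            hits += 1
    return hits / trials                        # lower bound: the attack may miss examples
```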
  33. Recent Global Robustness Results (properties of any model for the input

    space: the distance to an adversarial example is small relative to the expected distance between two sampled points). Adversarial Spheres [Gilmer et al., 2018]: assumption: uniform distribution on two concentric n-spheres; key result: expected safe distance (L2 norm) is relatively small. Adversarial vulnerability for any classifier [Fawzi × 3, 2018]: assumption: smooth generative model (Gaussian in latent space, generator is L-Lipschitz); key result: adversarial risk → 1 for relatively small attack strength (L2 norm), P(r(x) ≤ η) ≥ 1 − √(π/2)·e^(−η²/2L²). Curse of Concentration in Robust Learning [Mahloujifar et al., 2018]: assumption: Normal Lévy families (unit sphere, uniform, L2 norm; Boolean hypercube, uniform, Hamming distance; ...); key result: if the attack strength exceeds a relatively small threshold, b > √(log(k1/ε)) / √(k2·n), then Risk_b(h, c) ≥ 1/2.
  34. Prediction Change Robustness 41 Prediction Change: An input x′ ∈

    X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x′) ≠ f(x). Any non-trivial model has adversarial examples: ∃x_a, x_b ∈ X. f(x_a) ≠ f(x_b). Solutions: only consider particular inputs (“good” seeds); output isn’t just a class (e.g., confidence); targeted adversarial examples / cost-sensitive adversarial robustness.
  35. Local (Instance) Robustness 42 Robust Region: For an input x,

    the robust region is the maximum region with no adversarial example: sup { ε > 0 : ∀x′ ∈ Ball_ε(x), f(x′) = f(x) }.
  36. Local (Instance) Robustness 43 Robust Region: For an input x,

    the robust region is the maximum region with no adversarial example: sup { ε > 0 : ∀x′ ∈ Ball_ε(x), f(x′) = f(x) }. Robust Error: For a test set T and bound ε*: |{ x ∈ T : RobustRegion(x) < ε* }| / |T|.
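Read operationally, robust error is the fraction of test inputs whose robust region falls below the bound; a minimal sketch, with `robust_radius` standing in for any procedure that bounds the robust region (an assumption, e.g., a verifier):

```python
# Robust error: fraction of test inputs whose robust region is smaller than eps_star.
def robust_error(test_set, robust_radius, eps_star):
    failures = sum(1 for x in test_set if robust_radius(x) < eps_star)
    return failures / len(test_set)
```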
  37. 44 [Chart: defense approaches by scalability vs. precision of evaluation metric]

    Formal Verification (evaluation metric: robust error): MILP solver (MIPVerify), SMT solver (Reluplex), interval analysis (ReluVal). Certified Robustness (evaluation metric: bound): CNN-Cert (Boopathy et al., 2018), Dual-LP (Kolter & Wong, 2018), Dual-SDP (Raghunathan et al., 2018). Heuristic Defenses (evaluation metric: attack success rate over a set of attacks): distillation (Papernot et al., 2016), gradient obfuscation, adversarial retraining (Madry et al., 2017), feature squeezing.
  38. 45 Theory vs. “Practice” vs. Reality: Classification Problems: distributional

    assumptions (theory); toy, arbitrary datasets (“practice”); malware, fake news, ... (reality). Adversarial Strength: Lp norm bound (theory); L∞ bound (“practice”); application specific (reality).
  39. Finding Evasive Malware 47 Given a seed sample x with desired

    malicious behavior, find an adversarial example x′ that satisfies: f(x′) = “benign” (model misclassifies) and ℬ(x′) = ℬ(x) (malicious behavior preserved). Generic attack: heuristically explore the input space for an x′ that satisfies the definition. No requirement that x ~ x′ except through ℬ.
  40. PDF Malware Classifiers: PDFrate [ACSAC 2012]: Random Forest, manual features

    (object counts, lengths, positions, …). Hidost13 [NDSS 2013]: Support Vector Machine, automated features (object structural paths). Hidost16 [JIS 2016]: Random Forest, automated features (object structural paths). Claimed to be very robust against the “strongest conceivable mimicry attack”.
  41. Evolutionary Search (Weilin Xu, Yanjun Qi) [diagram]: clone the malicious

    PDF seed → mutation (drawing from benign PDFs) → variants → evaluate each variant with the benign oracle and the target classifier (fitness selection) → select promising variants → repeat until an evasive variant is found.
  42. Generating Variants [diagram]: clone the malicious PDF → mutation (using

    benign PDFs) → variants → fitness selection → select variants → found evasive?
  43. Generating Variants [diagram, repeated]: clone the malicious PDF → mutation

    (using benign PDFs) → variants → fitness selection → select variants → found evasive?
  44. Generating Variants [diagram]: in the PDF object tree (e.g., /Root →

    /Catalog → /Pages, /JavaScript eval(‘…’)), select a random node and randomly transform it: delete, insert, or replace.
  45. Generating Variants [diagram]: select a random node and randomly transform

    it (delete, insert, replace); inserted and replacement nodes are drawn from benign PDFs.
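The mutation step described on these slides can be sketched as follows; the tree interface used here (`all_nodes`, `remove_child`, `add_child`, `replace_child`) is invented for illustration and is not the actual EvadeML implementation:

```python
# Illustrative EvadeML-style mutation: pick a random node in the PDF object tree
# and delete it, insert a node cloned from a benign PDF, or replace it.
import copy
import random

def mutate(pdf_tree, benign_nodes):
    node = random.choice(list(pdf_tree.all_nodes()))     # select a random node
    op = random.choice(["delete", "insert", "replace"])
    donor = copy.deepcopy(random.choice(benign_nodes))   # node cloned from a benign PDF
    if op == "delete":
        node.parent.remove_child(node)
    elif op == "insert":
        node.add_child(donor)
    else:  # replace
        node.parent.replace_child(node, donor)
    return pdf_tree
```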
  46. Selecting Promising Variants [diagram]: clone → mutation → variants →

    fitness selection → select variants → found evasive?
  47. Selecting Promising Variants [diagram]: each candidate variant is scored by a

    fitness function that combines the oracle’s verdict (is the variant still malicious?) with the target classifier’s score.
  48. Oracle: ℬ(x′) = ℬ(x)? Execute the candidate in a vulnerable

    Adobe Reader in a virtual environment (Cuckoo sandbox, https://github.com/cuckoosandbox; simulated network: INetSim). Behavioral signature: malicious if the signature (HTTP_URL + HOST extracted from API traces) matches.
  49. Fitness Function (assumes lost malicious behavior will not be recovered):

    fitness(x′) = 1 − classifier_score(x′) if ℬ(x′) = ℬ(x); −∞ otherwise.
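A direct transcription of this fitness function as code, with `oracle_behavior` and `classifier_score` standing in for the Cuckoo-based oracle and the target classifier (illustrative names, not the authors' API):

```python
# Variants that lose the malicious behavior get -inf; otherwise reward low classifier scores.
def fitness(variant, seed, oracle_behavior, classifier_score):
    if oracle_behavior(variant) != oracle_behavior(seed):
        return float("-inf")                 # malicious behavior lost: discard this variant
    return 1.0 - classifier_score(variant)   # closer to "benign" scores higher
```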
  50. [Chart: seeds evaded (out of 500) vs. number of

    mutations, for PDFrate and Hidost]
  51. [Chart: seeds evaded (out of 500) vs. number of

    mutations, for PDFrate and Hidost] Simple transformations often worked.
  52. [Chart: seeds evaded (out of 500) vs. number of

    mutations, for PDFrate and Hidost] (insert, /Root/Pages/Kids, 3:/Root/Pages/Kids/4/Kids/5/) works on 162/500 seeds.
  53. [Chart: seeds evaded (out of 500) vs. number of

    mutations, for PDFrate and Hidost] Some seeds required complex transformations.
  54. Evading PDFrate [chart: classification score for each malware seed (sorted by

    original score), showing the original malicious seeds above the malicious label threshold and the discovered evasive variants below it]
  55. [Chart: classification scores of the discovered evasive variants vs. the original

    malicious seeds] Adjust the threshold? Charles Smutz, Angelos Stavrou. When a Tree Falls: Using Diversity in Ensemble Classifiers to Identify Evasion in Malware Detectors. NDSS 2016.
  56. [Chart: classification score per malware seed (sorted by original score):

    variants found with threshold = 0.25 and variants found with threshold = 0.50] Adjust the threshold?
  57. Hide the Classifier Score? [diagram: the same evolutionary search loop, with the

    fitness function using the oracle and the target classifier’s score for each candidate variant]
  58. Binary Classifier Output is Enough [diagram: the same evolutionary search loop,

    using only the classifier’s binary (malicious/benign) output in the fitness function] ACM CCS 2017
  59. [Diagram: the supervised learning pipeline (labelled training data → feature

    extraction → vectors → ML algorithm → trained classifier → deployment on operational data → malicious / benign), extended with a Retrain Classifier loop]
  60. [Chart: seeds evaded (out of 500) vs. generations, Hidost16]

    Original classifier: takes 614 generations to evade all seeds.
  61. [Chart: seeds evaded (out of 500) vs. generations, Hidost16

    and the retrained HidostR1]
  62. [Chart: seeds evaded (out of 500) vs. generations, Hidost16

    and the retrained HidostR1]
  63. [Chart: seeds evaded (out of 500) vs. generations, Hidost16,

    HidostR1, and HidostR2]
  64. [Chart: seeds evaded (out of 500) vs. generations, Hidost16,

    HidostR1, and HidostR2]
  65. [Chart: seeds evaded (out of 500) vs. generations, Hidost16, HidostR1,

    HidostR2] False Positive Rates (Genome / Contagio Benign): Hidost16 0.00 / 0.00; HidostR1 0.78 / 0.30; HidostR2 0.85 / 0.53.
  66. 76 Only 8/6987 robust features (Hidost): /Names, /Names/JavaScript,

    /Names/JavaScript/Names, /Names/JavaScript/JS, /OpenAction, /OpenAction/JS, /OpenAction/S, /Pages. The robust classifier has high false positives.
  67. Malware Classification Moral: To build robust, effective malware classifiers, we need

    robust features that are strong signals for malware. 77 If you have features like this, you don’t need ML!
  68. 78 Theory vs. “Practice” vs. “Reality”: Classification Problems: distributional

    assumptions (theory); toy, arbitrary datasets (“practice”); malware, fake news, ... (“reality”). Adversarial Strength: Lp norm bound (theory); L∞ bound (“practice”); application specific (“reality”).
  69. Adversarial Examples across Domains 79 (Domain: Classifier Space vs. “Reality” Space).

    Trojan Wars: Judgment of the Trojans, f(x) = “gift”; Physical Reality, f*(x) = invading army. Malware: Malware Detector, f(x) = “benign”; Victim’s Execution, f*(x) = malicious behavior. Image Classification: DNN Classifier, f(x) = y; Human Perception, f*(x) = z.
  70. Adversarial Example 80 Prediction Change Definition: An input x′ ∈

    X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x) ≠ f(x′). Suggested Defense: given an input x*, see how the model behaves on r(x*), where r(·) reverses transformations in Δ-space.
  71. 81 [Chart: defense approaches by scalability vs. precision of evaluation metric]

    Formal Verification (robust error): MILP solver (MIPVerify), SMT solver (Reluplex), interval analysis (ReluVal). Certified Robustness (bound): CNN-Cert (Boopathy et al., 2018), Dual-LP (Kolter & Wong, 2018), Dual-SDP (Raghunathan et al., 2018). Heuristic Defenses (attack success rate over a set of attacks): distillation (Papernot et al., 2016), gradient obfuscation, adversarial retraining (Madry et al., 2017), feature squeezing.
  72. Feature Squeezing Detection Framework (Weilin Xu, Yanjun Qi) [diagram]: the input

    goes to the original model (prediction_0) and, through squeezer 1 … squeezer k, to models (prediction_1 … prediction_k); a distance d(prediction_0, prediction_1, …, prediction_k) decides adversarial vs. legitimate.
  73. Feature Squeezing Detection Framework [diagram, as above]. A feature squeezer

    coalesces similar inputs into one point: it barely changes legitimate inputs, but destroys adversarial perturbations.
  74. Coalescing by Feature Squeezing 84 Metric Space 1: Target Classifier;

    Metric Space 2: “Oracle”. Before: find a small perturbation that changes the class assigned by the classifier but is imperceptible to the oracle. Now: the perturbation must change the class for both the original and the squeezed classifier, yet remain imperceptible to the oracle.
  75. Example Squeezer: Bit Depth Reduction 85 Signal Quantization [chart:

    input vs. output quantization curves for 8-bit, 3-bit, and 1-bit depth]
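Bit depth reduction itself is a one-liner; a minimal sketch assuming images are floats in [0, 1] (the exact rounding in the authors' implementation may differ):

```python
# Quantize each pixel to 2**bits levels, matching the curves on the slide.
import numpy as np

def reduce_bit_depth(x, bits):
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels   # e.g., bits=1 maps every pixel to 0 or 1
```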
  76. Example Squeezer: Bit Depth Reduction 86 Signal Quantization [chart: input vs.

    output quantization curves for 8-bit, 3-bit, and 1-bit depth; examples: an MNIST seed and adversarial versions (FGSM, BIM, CW∞, CW2) with their predicted labels]
  77. Other Potential Squeezers 87: spatial smoothers (median filter, non-local means);

    thermometer encoding (learnable bit depth reduction) [J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018]; image denoising using autoencoder, wavelet, JPEG, etc. [D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017; A Prakash, et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018]; image resizing [C Xie, et al. Mitigating Adversarial Effects Through Randomization, ICLR 2018]; ...
  78. Other Potential Squeezers 88: spatial smoothers (median filter, non-local means);

    thermometer encoding (learnable bit depth reduction) [Buckman et al., ICLR 2018]; image denoising using autoencoder, wavelet, JPEG, etc. [Meng & Chen, CCS 2017; Prakash et al., CVPR 2018]; image resizing [Xie et al., ICLR 2018]; ... Anish Athalye, Nicholas Carlini, David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.
  79. “Feature Squeezing” (Vacuous) Conjecture For any distance-limited adversarial method, there

    exists some feature squeezer that accurately detects its adversarial examples. 89 Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces the original and adversarial example into the same sample.
  80. Feature Squeezing Detection [diagram]: the input goes to the model (7-layer

    CNN) directly (prediction_0), after bit depth-1 squeezing (prediction_1), and after median 2×2 smoothing (prediction_2); flag as adversarial if max(‖y_0 − y_1‖_1, ‖y_0 − y_2‖_1) > T, else legitimate.
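The decision rule on this slide can be sketched directly; `model` (returning softmax probabilities) and the squeezer functions are illustrative assumptions:

```python
# Flag an input if the largest L1 difference between the model's output on the
# original input and on any squeezed version exceeds the threshold.
import numpy as np

def is_adversarial_input(model, x, squeezers, threshold):
    y0 = model(x)                                                  # softmax probabilities
    scores = [np.abs(y0 - model(s(x))).sum() for s in squeezers]   # L1 distances
    return max(scores) > threshold
```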
  81. Detecting Adversarial Examples [histogram: distance between the original input and its

    squeezed version, for adversarial inputs (CW attack) vs. legitimate inputs]
  82. 92 Training a detector (MNIST) [histogram: number of examples vs. maximum

    L1 distance between original and squeezed input, for legitimate vs. adversarial examples]: set the detection threshold to keep the false positive rate below a target; threshold = 0.0029 gives detection 98.2%, FP < 4%.
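Choosing the threshold as described, from scores on known-legitimate inputs only, might look like this (illustrative sketch; the authors' exact procedure may differ):

```python
# Pick the threshold at the (1 - target_fp_rate) quantile of legitimate scores,
# so that at most target_fp_rate of legitimate inputs score above it.
import numpy as np

def choose_threshold(legitimate_scores, target_fp_rate=0.05):
    scores = np.sort(np.asarray(legitimate_scores))
    idx = int(np.ceil((1.0 - target_fp_rate) * len(scores))) - 1
    return scores[idx]
```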
  83. ImageNet Configuration [diagram]: model (MobileNet) on the original input (prediction_0)

    and on three squeezed versions: bit depth-5 (prediction_1), median 2×2 (prediction_2), non-local means (prediction_3); flag as adversarial if the maximum L1 distance between prediction_0 and {prediction_1, prediction_2, prediction_3} exceeds the threshold T.
  84. 94 Training a detector (ImageNet) [histogram: number of examples vs. maximum

    L1 distance between original and squeezed input, for legitimate vs. adversarial examples]: threshold = 1.24 gives detection 85%, FP < 5%.
  85. Instance Defense-Robustness 96 For an input x, the robust-defended region

    is the maximum region with no undetected adversarial example: sup { ε > 0 : ∀x′ ∈ Ball_ε(x), f(x′) = f(x) ∨ detected(x′) }. Defense Failure: For a test set T and bound ε*: |{ x ∈ T : RobustDefendedRegion(x) < ε* }| / |T|. Can we verify a defense?
  86. Formal Verification of a Defense Instance: exhaustively test all inputs

    x′ ∈ Ball_ε(x) for correctness or detection. Need to transform the model into a function amenable to verification.
  87. Linear Programming: find values of x that minimize a linear

    function c_1·x_1 + c_2·x_2 + c_3·x_3 + … under constraints: a_11·x_1 + a_12·x_2 + ⋯ ≤ b_1; a_21·x_1 + a_22·x_2 + ⋯ ≤ b_2; x_i ≥ 0; ...
  88. Encoding a Neural Network: Linear components (y = Wx +

    b): Convolutional Layer, Fully-connected Layer, Batch Normalization (in test mode). Non-linear: Activation (ReLU, Sigmoid, Softmax), Pooling Layer (max, avg). 99
  89. Encode ReLU: Mixed Integer Linear Programming adds discrete values to

    LP. ReLU (Rectified Linear Unit), y = max(0, x), is piecewise linear; with bounds l ≤ x ≤ u, introduce a binary a ∈ {0, 1} and the constraints: y ≥ x; y ≥ 0; y ≤ x − l·(1 − a); y ≤ u·a.
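The same encoding can be written with an off-the-shelf MILP modeling front end; below is a minimal sketch using PuLP (my choice of library, not the talk's), where `l` and `u` are assumed known bounds on the pre-activation:

```python
import pulp

def encode_relu(prob, x, name, l, u):
    """Add variables and constraints so that y == max(0, x) in any feasible solution,
    given known bounds l <= x <= u on the pre-activation x."""
    y = pulp.LpVariable(f"{name}_y", lowBound=0)     # enforces y >= 0
    a = pulp.LpVariable(f"{name}_a", cat="Binary")   # a = 1 means the unit is active
    prob += y >= x
    prob += y <= x - (1 - a) * l
    prob += y <= a * u
    return y

# usage sketch
prob = pulp.LpProblem("relu_example", pulp.LpMinimize)
x = pulp.LpVariable("x", lowBound=-1.0, upBound=2.0)
y = encode_relu(prob, x, "unit0", l=-1.0, u=2.0)
```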
  90. Mixed Integer Linear Programming (MILP): intractable in theory (NP-Complete), efficient

    in practice (e.g., Gurobi solver). MIPVerify (Vincent Tjeng, Kai Xiao, Russ Tedrake): verify NNs using MILP.
  91. Encode Feature Squeezers: the binary filter maps inputs below 0.5 to

    0 and above 0.5 to 1. The actual input is uint8 [0, 1, 2, …, 254, 255], and 127/255 = 0.498 while 128/255 = 0.502, so [0.499, 0.501] is an infeasible gap (the filter is lower semi-continuous).
  92. Verified L∞ Robustness (ε = 0.1). Raghunathan et al.: test accuracy

    95.82%, robust error 14.36%–30.81%, robust error with binary filter 7.37%. Wong & Kolter: test accuracy 98.11%, robust error 4.38%, robust error with binary filter 4.25%. Ours with binary filter: test accuracy 98.94%, robust error 2.66%–6.63%. Even without detection, this helps!
  93. Encode Detection Mechanism. Original version: score(x) = ‖f(x) − f(squeeze(x))‖_1,

    where f(x) is the softmax output. Simplify for verification: L1 → maximum difference; softmax → multiple piecewise-linear approximate sigmoids.
  94. Preliminary Experiments 105 [diagram: 4-layer CNN model on input x′ and

    on its bit depth-1 squeezed version; flag as adversarial if max_diff(y_0, y_1) > T]. Verification: for a seed x, there is no adversarial input x′ ∈ Ball_ε(x) for which the prediction differs from f(x) and the input is not detected. Adversarially robust retrained [Wong & Kolter] model, 1000 MNIST test seeds, ε = 0.1 (L∞): 970 infeasible (verified no adversarial example), 13 misclassified (original seed), 17 vulnerable. Robust error: 0.3%. Verification time ~0.2s (compared to 0.8s without binarization).
  95. 106 [Chart: defense approaches by scalability vs. precision of evaluation metric]

    Formal Verification (robust error): MILP solver (MIPVerify), SMT solver (Reluplex), interval analysis (ReluVal). Certified Robustness (bound): CNN-Cert (Boopathy et al., 2018), Dual-LP (Kolter & Wong, 2018), Dual-SDP (Raghunathan et al., 2018). Heuristic Defenses (attack success rate over a set of attacks): distillation (Papernot et al., 2016), gradient obfuscation, adversarial retraining (Madry et al., 2017), feature squeezing.
  96. 107 [Grid: seed class vs. target class robustness, Original Model (no

    robustness training)] MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, L∞.
  97. 108 [Grid: seed class vs. target class robustness, Original Model (no

    robustness training)] MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, L∞.
  98. Training a Robust Network. Eric Wong and J. Zico Kolter.

    Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML 2018. Replace the loss with a differentiable function based on an outer bound, computed using a dual network; ReLU (Rectified Linear Unit) is replaced by a linear approximation.
  99. 110 [Grid: seed class vs. target class robustness, Standard Robustness Training

    (overall robustness goal)] MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, L∞.
  100. Cost-Sensitive Robustness Training 111 (Xiao Zhang). Incorporate a cost matrix

    into robustness training: the cost matrix gives the cost of different adversarial transformations; with rows and columns (benign, malware), C = [[–, 0], [1, –]], so only malware → benign transformations carry cost.
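A minimal sketch of how such a cost matrix could weight per-pair robust losses (the `pairwise_robust_loss` array and this simple weighting are illustrative assumptions, not the exact formulation from the talk):

```python
import numpy as np

# rows: seed class (benign, malware); columns: target class (benign, malware)
cost_matrix = np.array([[0.0, 0.0],    # benign -> malware transformations: cost 0
                        [1.0, 0.0]])   # malware -> benign transformations: cost 1

def cost_sensitive_robust_loss(pairwise_robust_loss):
    """pairwise_robust_loss[i, j]: robust loss for seed class i targeted to class j."""
    return float((cost_matrix * pairwise_robust_loss).sum())
```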
  101. 112 [Grid: seed class vs. target class robustness, Standard Robustness Training

    (overall robustness goal)] MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, L∞.
  102. Security State-of-the-Art (attack success probability / threat models / proofs).

    Cryptography: ≈ 2^−128 / information theoretic, resource bounded / required. System Security: much larger / capabilities, motivations, rationality / common. Adversarial Machine Learning: larger still / artificially limited adversary / making progress! 116
  103. Security State-of-the-Art (attack success probability / threat models / proofs).

    Cryptography: ≈ 2^−128 / information theoretic, resource bounded / required. System Security: much larger / capabilities, motivations, rationality / common. Adversarial Machine Learning: larger still / artificially limited adversary / making progress! 117 Huge gaps to close: threat models are unrealistic (but real threats unclear); verification techniques only work for tiny models; experimental defenses are often (quickly) broken.
  104. David Evans, University of Virginia, [email protected], EvadeML.org. Weilin Xu, Yanjun

    Qi, Xiao Zhang. Funding: NSF, Intel, Baidu. Center for Trustworthy Machine Learning.