
FOSAD Trustworthy Machine Learning, Class 2: Defenses

19th International School on Foundations of Security Analysis and Design
Mini-course on "Trustworthy Machine Learning"
https://jeffersonswheel.org/fosad2019

David Evans
August 27, 2019

Transcript

  1. Trustworthy Machine Learning, Class 2: Defenses. David Evans, University of Virginia, jeffersonswheel.org. 19th International School on Foundations of Security Analysis and Design, Bertinoro, Italy, 27 August 2019.
  2. Threat Models 3 1. What are the attacker's goals? • Malicious behavior without detection • Commit check fraud • ... 2. What are the attacker's capabilities? Information: what do they know? Actions: what can they do? Resources: how much can they spend?
  3. Threat Models in Cryptography Ciphertext-only attack Intercept message, want to

    learn plaintext Chosen-plaintext attack Adversary has encryption function as black box, wants to learn key (or decrypt some ciphertext) Chosen-ciphertext attack Adversary has decryption function as black box, wants to learn key (or encrypt some message) 4 Goals Information Actions Resources
  4. Threat Models in Cryptography 5 Goals Information Actions Resources Polynomial

    time/space: adversary has computational resources that scale polynomially in some security parameter (e.g., key size)
  5. Security Goals in Cryptography 7 Semantic Security: an adversary with the intercepted ciphertext has no advantage over an adversary without it. Shafi Goldwasser and Silvio Micali developed semantic security in the 1980s (2013 Turing Awardees).
  6. Threat Models in Adversarial ML? 8 Ciphertext-only attack Chosen-plaintext attack

    Chosen-ciphertext attack Polynomial time/space Semantic Security proofs Can we get to threat models as precise as those used in cryptography? Can we prove strong security notions for those threat models?
  7. Threat Models in Adversarial ML? 9 Ciphertext-only attack Chosen-plaintext attack

    Chosen-ciphertext attack Polynomial time/space Semantic Security proofs Can we get to threat models as precise as those used in cryptography? Can we prove strong security notions for those threat models? Current state: “Pre-Shannon” (Nicolas Carlini)
  8. 10 Ali Rahimi, NIPS Test-of-Time Award Speech (Dec 2017): "If you're building photo-sharing systems, alchemy is okay, but we're beyond that; now we're building systems that govern healthcare and mediate our civic dialogue."
  9. 11 Ali Rahimi, NIPS Test-of-Time Award Speech (Dec 2017): "If you're building photo-sharing systems, alchemy is okay, but we're beyond that; now we're building systems that govern healthcare and mediate our civic dialogue."
  10. Alchemy (~700-1660). Well-defined, testable goal: turn lead into gold. Established theory: four elements (earth, fire, water, air). Methodical experiments and lab techniques (Jabir ibn Hayyan in the 8th century). Wrong and ultimately unsuccessful, but led to modern chemistry.
  11. Attacker Access 14 White Box: the attacker has the model, with full knowledge of all of its parameters. Black Box: the attacker can only submit inputs x and receive the output f(x) ("API access"); each model query is "expensive".
  12. Black-Box Attacks 16 PGD Attack: x'_0 = x; for T iterations: x'_{i+1} = project_{x,ε}(x'_i − α · sign(∇L(x'_i, y))); output x' = x'_T. Can we execute these attacks if we don't have the model?
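
A minimal NumPy sketch of the PGD loop on the slide above (not the speaker's code); grad_loss(x, y) is an assumed helper that returns the gradient of the attack loss with respect to the input:

```python
import numpy as np

def pgd_attack(x, y, grad_loss, eps=0.1, alpha=0.01, steps=40):
    """Sketch of the PGD update from the slide: repeatedly step against the
    sign of the loss gradient, then project back into the eps-ball around x.
    grad_loss(x, y) is an assumed, user-supplied gradient function."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_loss(x_adv, y)
        x_adv = x_adv - alpha * np.sign(g)        # signed gradient step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay in the valid input range
    return x_adv
```
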
  13. Black-Box Optimization Attacks 17 The attacker can only query the target: x → f(x). Black-Box Gradient Attack: x'_0 = x; for T iterations: use queries to estimate ∇L(x'_i, y).
  14. Black-Box Optimization Attacks 18 Black-Box Gradient Attack: x'_0 = x; for T iterations: use queries to estimate ∇L(x'_i, y); x'_{i+1} = take a step of the "white-box" attack using the estimated gradient.
  15. Black-Box Gradient Attacks 19 Hybrid Batch Attacks: Finding Black-box Adversarial

    Examples with Limited Queries. Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.
  16. Transfer Attacks 20 Run the white-box attack on a local model: x* = whiteBoxAttack(f_local, x), then submit x* to the external target model, hoping f(x*) = t. Adversarial examples against one model often transfer to another model.
  17. 21

  18. Improving Transfer Attacks 22 Attack an ensemble of local models: x* = whiteBoxAttack(ensemble(f_1, f_2, f_3), x), then submit x* to the external target model, hoping f(x*) = t. Adversarial examples against several models are more likely to transfer. Yanpei Liu, Xinyun Chen, Chang Liu, Dawn Song [ICLR 2017]
  19. Hybrid Attacks 23 Transfer attacks: efficient (only one API query) but low success rates (3% transfer rate for a targeted attack on ImageNet, even with an ensemble). Gradient attacks: expensive (10k+ queries per seed) but high success rates (100% for a targeted attack on ImageNet). Combine both attacks: efficient + high success. Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries. Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.
  20. Hybrid Attack 24 Step 1: Transfer attack. x* = whiteBoxAttack(ensemble(f_1, f_2, f_3), x) using local models; submit x* to the external target; succeed if f(x*) = t.
  21. Hybrid Attack 25 Step 1: Transfer attack. If the transfer candidate fails (f(x*) ≠ t), Step 2: Gradient attack, starting from the transfer candidate x*.
  22. Hybrid Attack 26 Step 1: Transfer attack. Step 2: Gradient attack (starting from the transfer candidate). Step 3: Tune the local models using the label byproducts of the target queries. (A sketch of the full loop follows below.)
  23. 27 Results (targeted attacks):
      Dataset / Model      | Direct Transfer Rate | Gradient Attack (AutoZOOM): Success Rate / Queries per AE | Hybrid Attack: Success Rate / Queries per AE
      MNIST (Targeted)     | 61.6% | 90.9% / 1,645  | 98.8% / 298
      CIFAR10 (Targeted)   | 63.3% | 92.2% / 1,227  | 98.1% / 227
      ImageNet (Targeted)  | 3.4%  | 95.4% / 45,166 | 98.0% / 30,089
  24. Realistic Adversary Model? 28 Knowledge: only API access to the target; good models for an ensemble (pretrained models, or access to a similar training dataset and resources); a set of starting seeds. Goals: find one adversarial example for each seed. Resources: unlimited number of API queries.
  25. Batch Attacks 29 Knowledge: only API access to the target; good models for an ensemble (pretrained models, or access to a similar training dataset and resources); a set of starting seeds. Goals: find many seed/adversarial example pairs. Resources: limited number of API queries. Prioritize seeds to attack: use resources to attack the low-cost seeds first.
  26. Requirements 1. There is a high variance across seeds in

    the cost to find adversarial examples. 2. There are ways to predict in advance which seeds will be easy to attack. 30
  27. Predicting the Low-Cost Seeds 32 Strategy 1: cost of the local attack (number of PGD steps to find a local AE). Strategy 2: loss function on the target. (Shown for the NES gradient attack on a robust CIFAR-10 model; a prioritization sketch follows below.)
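
A sketch of the prioritization idea from the slide above: spend a limited query budget on the seeds predicted to be cheapest; easiness_score and attack are assumed callables supplied by the attacker:

```python
def prioritized_batch_attack(seeds, easiness_score, attack, query_budget):
    """Batch-attack prioritization sketch: attack the seeds predicted to be
    cheapest first. easiness_score(seed) is a local estimate (e.g., number of
    PGD steps against the local models, or the target's loss on the seed);
    attack(seed) returns (adversarial_example_or_None, queries_used)."""
    results = []
    for seed in sorted(seeds, key=easiness_score):   # lowest predicted cost first
        if query_budget <= 0:
            break
        adv, used = attack(seed)
        query_budget -= used
        if adv is not None:
            results.append((seed, adv))
    return results
```
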
  28. What about Direct Transfers? 33 Strategy 1: cost of the local attack (number of PGD steps to find a local AE). Strategy 2: loss function on the target.
  29. Two-Phase Hybrid Attack 36 Retroactive Optimal: an unrealizable strategy that always picks the lowest-cost seed. Phase 1: find direct transfers (1,000 queries to find 95 direct transfers). AutoZOOM attack on the robust CIFAR-10 model.
  30. Two-Phase Hybrid Attack 37 Retroactive Optimal: an unrealizable strategy that always picks the lowest-cost seed. Phase 1: find direct transfers (1,000 queries to find 95 direct transfers). Phase 2: gradient attack (100,000 queries to find the next 95 adversarial examples). AutoZOOM attack on the robust CIFAR-10 model.
  31. Cost of Hybrid Batch Attacks 38 Total queries (standard error) to reach the goal of attacking 1% / 2% / 10% of seeds:
      CIFAR-10 (Robust), 1000 seeds: "Optimal" 10.0 (0.0) / 20.0 (0.0) / 107.8 (17.4); Two-Phase 20.4 (2.1) / 54.2 (5.6) / 826.2 (226.6); Random 24,054 (132) / 45,372 (260) / 251,917 (137)
      ImageNet, 100 seeds: "Optimal" 1.0 (0.0) / 2.0 (0.0) / 34,949 (3,742); Two-Phase 28.0 (2.0) / 38.6 (7.5) / 78,844 (11,837); Random 15,046 (423) / 45,136 (1,270) / 285,855 (8,045)
  32. How can we construct models that make it hard for

    adversaries to find adversarial examples? 40
  33. Defense Strategies 42 1. Hide the gradients − Transferability results: the attacker can still compute x* = whiteBoxAttack(f_local, x) against a local model and submit it to the target, hoping f(x*) = t.
  34. Defense Strategies 43 1. Hide the gradients − Transferability results (x* = whiteBoxAttack(f_local, x) still transfers to the target). Maybe they can work against adversaries who don't have access to training data or a similar model? (or when the transfer loss is high)
  35. Defense Strategies 1. Hide the gradients − Clever adversaries can

    still find adversarial examples 48 ICML 2018 (Best Paper award)
  36. Defense Strategies 1. Hide the gradients − Transferability results −

    Clever adversaries can still find adversarial examples 2. Build a robust classifier 49 Increase capacity
  37. Defense Strategies 1. Hide the gradients − Transferability results −

    Clever adversaries can still find adversarial examples 2. Build a robust classifier 51 Increase capacity Consider adversaries in training: adversarial training
  38. Adversarial Training (Example from Yesterday) 52 Diagram: Training Data → ML Algorithm (Training) → Clone → EvadeML → Deployment. Why didn't this work?
  39. Adversarial Training: training data → training process → candidate model; an adversarial example generator finds successful AEs against the candidate model, which are added to the training data (with correct labels). Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013
  40. Ensemble Adversarial Training: training data → training process → candidate model; an adversarial example generator finds successful AEs against the candidate model, which are added to the training data (with correct labels). Florian Tramer, et al. [ICLR 2018]
  41. Ensemble Adversarial Training: as above, but AEs generated against static pre-trained models are also added to the training data. Florian Tramer, et al. [ICLR 2018]
  42. Formalizing Adversarial Training 57 Regular training: min_θ E_{(x,y)∼D} [ L(f_θ, x, y) ]. Adversarial training (2017): min_θ E_{(x,y)∼D} [ max_{δ∈Δ} L(f_θ, x + δ, y) ]. Simulate the inner maximization with a PGD attack with multiple restarts.
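
A sketch of one adversarial-training step implementing the min-max objective above with a PGD inner maximization; grad_loss_wrt_x and train_on stand in for framework-specific model routines (assumptions, not the speaker's code):

```python
import numpy as np

def adversarial_training_step(x_batch, y_batch, grad_loss_wrt_x, train_on,
                              eps=0.3, alpha=0.01, steps=40):
    """One min-max training step: approximate the inner max with PGD (plus a
    random restart), then update parameters on the adversarial batch.
    grad_loss_wrt_x(x, y) returns dL/dx; train_on(x, y) does one update."""
    # Inner maximization: PGD toward higher loss inside the eps-ball.
    x_adv = x_batch + np.random.uniform(-eps, eps, size=x_batch.shape)
    for _ in range(steps):
        g = grad_loss_wrt_x(x_adv, y_batch)
        x_adv = x_adv + alpha * np.sign(g)                   # maximize the loss
        x_adv = np.clip(x_adv, x_batch - eps, x_batch + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    # Outer minimization: train on the worst-case examples found.
    train_on(x_adv, y_batch)
```
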
  43. Attacking Robust Models 58
      Dataset / Model            | Direct Transfer Rate | Gradient Attack (AutoZOOM): Success Rate / Queries per AE | Hybrid Attack: Success Rate / Queries per AE
      MNIST (Targeted)           | 61.6% | 90.9% / 1,645  | 98.8% / 298
      CIFAR10 (Targeted)         | 63.3% | 92.2% / 1,227  | 98.1% / 227
      MNIST-Robust (Untargeted)  | 2.9%  | 7.2% / 52,182  | 7.3% / 51,328
      CIFAR10-Robust (Untargeted)| 9.5%  | 64.4% / 2,640  | 65.2% / 2,529
      Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries. Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.
  44. Defense Strategies 59 1. Hide the gradients − Transferability results − Clever adversaries can still find adversarial examples 2. Build a robust classifier − Adversarial retraining with increased model capacity: very expensive, and assumes you can generate adversarial examples as well as the adversary can − If we could build a perfect model, we would!
  45. Defense Strategies 1. Hide the gradients − Transferability results −

    Clever adversaries can still find adversarial examples 2. Build a robust classifier − Adversarial retraining, increasing model capacity, etc. − If we could build a perfect model, we would! 60 Our strategy: “Feature Squeezing”: reduce the search space available to the adversary Weilin Xu, David Evans, Yanjun Qi [NDSS 2018]
  46. Feature Squeezing Detection Framework (Weilin Xu, Yanjun Qi): the input is classified by the model directly (prediction_0) and, through squeezers 1..k, by squeezed models (prediction_1, ..., prediction_k); a scoring function over (prediction_0, prediction_1, ..., prediction_k) labels the input Adversarial or Legitimate.
  47. Feature Squeezing Detection Framework: same pipeline as above. A feature squeezer coalesces similar inputs into one point: it barely changes legitimate inputs, but destroys adversarial perturbations.
  48. Coalescing by Feature Squeezing 63 Metric Space 1: target classifier. Metric Space 2: "oracle". Before: find a small perturbation that changes the class for the classifier but is imperceptible to the oracle. Now: the perturbation must change the class for both the original and the squeezed classifier, while remaining imperceptible to the oracle.
  49. Fast Gradient Sign [Yesterday] 64 (Examples for the original image and ε = 0.1, 0.2, 0.3, 0.4, 0.5.) Adversary power: an L∞-bounded adversary, max(abs(x_i − x'_i)) ≤ ε. x' = x − ε · sign(∇_x J(x, y)). Goodfellow, Shlens, Szegedy 2014
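
The fast gradient sign step as a one-line sketch (following the slide's sign convention); grad_loss(x, y) is again an assumed input-gradient helper:

```python
import numpy as np

def fgsm(x, y, grad_loss, eps=0.1):
    """One signed-gradient step of size eps (fast gradient sign sketch)."""
    x_adv = x - eps * np.sign(grad_loss(x, y))
    return np.clip(x_adv, 0.0, 1.0)   # keep pixels in the valid range
```
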
  50. Bit Depth Reduction 65 Signal quantization (shown for 8-bit, 3-bit, and 1-bit depth). Reduce to 1-bit: x_i = round(x_i × 2)/2. Example: normal input X = [0.012 0.571 ... 0.159 0.951] → [0. 1. ... 0. 1.]; adversarial input X* = [0.312 0.271 ... 0.159 0.651] → [0. 0. ... 0. 1.].
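
A sketch of a bit-depth-reduction squeezer, assuming the usual 2^bits − 1 quantization-levels convention (so the 1-bit case rounds each pixel to 0 or 1, matching the example outputs above):

```python
import numpy as np

def reduce_bit_depth(x, bits=1):
    """Quantize inputs in [0, 1] to 2**bits levels (bit depth reduction squeezer)."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

# Example with the slide's vector:
x = np.array([0.012, 0.571, 0.159, 0.951])
print(reduce_bit_depth(x, bits=1))   # -> [0. 1. 0. 1.]
```
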
  51. Bit Depth Reduction 66 (Figure: seed digits and their CW2, CW∞, BIM, and FGSM adversarial examples, shown before and after bit depth reduction.)
  52. Accuracy with Bit Depth Reduction 67
      Dataset  | Squeezer    | Adversarial Examples (FGSM, BIM, CW∞, DeepFool, CW2, CW0, JSMA) | Legitimate Images
      MNIST    | None        | 13.0%  | 99.43%
      MNIST    | 1-bit Depth | 62.7%  | 99.33%
      ImageNet | None        | 2.78%  | 69.70%
      ImageNet | 4-bit Depth | 52.11% | 68.00%
  53. Spatial Smoothing: Median Filter 68 Replace each pixel with the median of its neighbors (e.g., a 3×3 median filter). Effective in eliminating "salt-and-pepper" noise (L0 attacks). Image from https://sultanofswing90.wordpress.com/tag/image-processing/
  54. Spatial Smoothing: Non-local Means 69 Replace a patch p with a weighted mean of similar patches q_i found in a search region: p' = Σ_i w(p, q_i) × q_i. Preserves edges while removing noise.
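
Sketches of the two smoothers using standard image-processing libraries; the parameter values here are illustrative, not necessarily the exact configurations from the slides:

```python
from scipy.ndimage import median_filter
from skimage.restoration import denoise_nl_means

def median_squeeze(x, size=2):
    """Median-filter squeezer: each pixel becomes the median of its neighborhood."""
    return median_filter(x, size=size)

def nlm_squeeze(x, patch_size=3, patch_distance=4, h=0.05):
    """Non-local means squeezer: each patch is replaced by a weighted mean of
    similar patches found within patch_distance; h controls the weighting."""
    return denoise_nl_means(x, patch_size=patch_size,
                            patch_distance=patch_distance, h=h)
```
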
  55. 70 (Figure: CIFAR-10 predictions for an original image and for BIM (L∞) and JSMA (L0) adversarial examples, without squeezing and after a 2×2 median filter or non-local means (13-3-4) smoothing; squeezing mostly restores the Airplane prediction.)
  56. Accuracy with Spatial Smoothing 71
      Dataset  | Squeezer               | Adversarial Examples (FGSM, BIM, CW∞, DeepFool, CW2, CW0) | Legitimate Images
      ImageNet | None                   | 2.78%  | 69.70%
      ImageNet | Median Filter 2×2      | 68.11% | 65.40%
      ImageNet | Non-local Means 11-3-4 | 57.11% | 65.40%
  57. Other Potential Squeezers 72 C Xie, et al. Mitigating Adversarial

    Effects Through Randomization, ICLR 2018. J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018. D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017; A Prakash, et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018;... Thermometer Encoding (learnable bit depth reduction) Image denoising using autoencoder, wavelet, JPEG, etc. Image resizing ... Spatial Smoothers: median filter, non-local means
  58. Other Potential Squeezers 73 C Xie, et al. Mitigating Adversarial

    Effects Through Randomization, ICLR 2018. J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018. D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017; A Prakash, et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018;... Thermometer Encoding (learnable bit depth reduction) Image denoising using autoencoder, wavelet, JPEG, etc. Image resizing ... Spatial Smoothers: median filter, non-local means Anish Athalye, Nicholas Carlini, David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.
  59. “Feature Squeezing” (Vacuous) Conjecture For any distance-limited adversarial method, there

    exists some feature squeezer that accurately detects its adversarial examples. 74 Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces original and adversarial example into same sample.
  60. Feature Squeezing Detection: the input is classified by the model (a 7-layer CNN) directly (prediction_0) and after two squeezers, bit depth-1 (prediction_1) and median 2×2 (prediction_2); it is flagged as adversarial if max( L1(prediction_0, prediction_1), L1(prediction_0, prediction_2) ) > τ.
  61. Detecting Adversarial Examples Distance between original input and its squeezed

    version Adversarial inputs (CW attack) Legitimate inputs
  62. Training a Detector (MNIST) 77 Set the detection threshold to keep the false positive rate below a target. (Histogram of the maximum L1 distance between original and squeezed inputs for legitimate vs. adversarial examples; threshold = 0.0029 gives 98.2% detection with FP < 4%.) A scoring and threshold-selection sketch follows below.
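
A sketch of the detection score and threshold selection just described; predict(x) is an assumed function returning the model's softmax vector, and squeezers is a list of functions like the ones sketched earlier:

```python
import numpy as np

def squeeze_score(predict, squeezers, x):
    """Maximum L1 distance between the prediction on x and on each squeezed x."""
    p0 = predict(x)
    return max(np.abs(p0 - predict(squeeze(x))).sum() for squeeze in squeezers)

def choose_threshold(legitimate_scores, target_fpr=0.05):
    """Pick the threshold as the (1 - target_fpr) quantile of scores on
    legitimate inputs, keeping the false positive rate below the target."""
    return float(np.quantile(legitimate_scores, 1.0 - target_fpr))
```
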
  63. ImageNet Configuration: the input is classified by the model (MobileNet) directly (prediction_0) and after three squeezers, bit depth-5 (prediction_1), median 2×2 (prediction_2), and non-local means (prediction_3); it is flagged as adversarial if the maximum over i of L1(prediction_0, prediction_i) > τ.
  64. Training a Detector (ImageNet) 79 (Histogram of the maximum L1 distance between original and squeezed inputs for legitimate vs. adversarial examples; threshold = 1.24 gives 85% detection with FP < 5%.)
  65. Aggregated Detection Results 80
      Dataset  | Squeezers                                                | Threshold | False Positive Rate | Detection Rate (SAEs) | ROC-AUC (excluding FAEs)
      MNIST    | Bit Depth (1-bit), Median (2×2)                          | 0.0029 | 3.98% | 98.2% | 99.44%
      CIFAR-10 | Bit Depth (5-bit), Median (2×2), Non-local Mean (13-3-2) | 1.1402 | 4.93% | 84.5% | 95.74%
      ImageNet | Bit Depth (5-bit), Median (2×2), Non-local Mean (11-3-4) | 1.2128 | 8.33% | 85.9% | 94.24%
  66. Threat Models Oblivious attack: The adversary has full knowledge of

    the target model, but is not aware of the detector. Adaptive attack: The adversary has full knowledge of the target model and the detector. 81
  67. Adaptive Adversary 82 Adaptive CW2 attack, unbounded adversary (Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song, Adversarial Example Defense: Ensembles of Weak Defenses are not Strong, USENIX WOOT '17): minimize (misclassification term) + c · Δ(x, x') (distance term) + k · detection_score(x') (detection term).
  68. Adaptive Adversarial Examples 83 No successful adversarial examples were found for images originally labeled as 3 or 8. Mean L2: 2.80 (Untargeted), 4.14 (Targeted, next), 4.67 (Targeted, least likely).
  69. Adaptive Adversary Success Rates 84 (Plot: adversary's success rate vs. clipped ε for Untargeted, Targeted (Next), and Targeted (LL) attacks, comparing the unbounded attack with a typical ε.)
  70. Defender's Entropy Advantage: the same detection framework, but the squeezers and thresholds are selected using a random seed unknown to the adversary; the input is still labeled Adversarial or Legitimate by a scoring function over (prediction_0, prediction_1, ..., prediction_k).
  71. Countermeasure: Randomization 86 Binary filter: instead of a fixed threshold := 0.5, draw the threshold from N(0.5, 0.0625) (sketch below). Strengthen the adaptive adversary: attack an ensemble of 3 detectors with thresholds [0.4, 0.5, 0.6].
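
A sketch of the randomized binary filter, treating 0.0625 as the standard deviation of the threshold distribution (an assumption about the slide's N(0.5, 0.0625)):

```python
import numpy as np

def randomized_binary_filter(x, rng, mean=0.5, std=0.0625):
    """Threshold each pixel against a randomly drawn value instead of a fixed
    0.5, so an adaptive adversary cannot tune against one known threshold."""
    threshold = rng.normal(mean, std)
    return (x > threshold).astype(x.dtype)

# The defender keeps the seed secret: rng = np.random.default_rng(secret_seed)
```
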
  72. 87 Mean L2 of adaptive adversarial examples: attacking the deterministic detector: 2.80 (Untargeted), 4.14 (Targeted-Next), 4.67 (Targeted-LL); attacking the randomized detector: 3.63 (Untargeted), 5.48 (Targeted-Next), 5.76 (Targeted-LL).
  73. (Redefining) Adversarial Example 89 Prediction Change Definition: an input x' ∈ X is an adversarial example for x ∈ X iff ∃ x' ∈ Ball_ε(x) such that f(x) ≠ f(x').
  74. Adversarial Example 90 Ball_ε(x) is some space around x, typically defined in some (simple!) metric space: L0 norm (number of differing components), L2 norm ("Euclidean distance"), L∞ norm. Without constraints on Ball_ε, every input has adversarial examples. Prediction Change Definition: an input x' ∈ X is an adversarial example for x ∈ X iff ∃ x' ∈ Ball_ε(x) such that f(x) ≠ f(x').
  75. Adversarial Example 91 Any non-trivial model has adversarial examples: ∃ x_a, x_b ∈ X. f(x_a) ≠ f(x_b). Prediction Change Definition: an input x' ∈ X is an adversarial example for x ∈ X iff ∃ x' ∈ Ball_ε(x) such that f(x) ≠ f(x').
  76. Prediction Error Robustness 92 Error Robustness: an input x' ∈ X is an adversarial example for (correctly classified) x ∈ X iff ∃ x' ∈ Ball_ε(x) such that f(x') ≠ the true label of x'. A perfect classifier has no (error robustness) adversarial examples.
  77. Prediction Error Robustness 93 Error Robustness: an input x' ∈ X is an adversarial example for (correctly classified) x ∈ X iff ∃ x' ∈ Ball_ε(x) such that f(x') ≠ the true label of x'. A perfect classifier has no (error robustness) adversarial examples. But if we had a way to know the true label, we wouldn't need an ML classifier.
  78. Global Robustness Properties 94 Adversarial Risk: the probability that an input has an adversarial example: Pr_{x ← D} [ ∃ x' ∈ Ball_ε(x). f(x') ≠ class(x') ]. Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody, NeurIPS 2018
  79. Global Robustness Properties 95 Adversarial Risk: the probability that an input has an adversarial example: Pr_{x ← D} [ ∃ x' ∈ Ball_ε(x). f(x') ≠ class(x') ]. Error Region Robustness: the expected distance to the closest AE: E_{x ← D} [ inf { δ : ∃ x' ∈ Ball_δ(x). f(x') ≠ class(x') } ]. Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody, NeurIPS 2018
  80. Recent Global Robustness Results (properties of any model for the input space: the distance to an AE is small relative to the expected distance between two sampled points)
      Adversarial Spheres [Gilmer et al., 2018]. Assumption: uniform distribution on two concentric n-spheres. Key result: the expected safe distance (L2 norm) is relatively small.
      Adversarial vulnerability for any classifier [Fawzi × 3, 2018]. Assumption: smooth generative model (Gaussian in latent space; generator is L-Lipschitz). Key result: adversarial risk → 1 for relatively small attack strength (L2 norm): P(r(x) ≤ η) ≥ 1 − √(π/2) · e^(−η²/2L²).
      Curse of Concentration in Robust Learning [Mahloujifar et al., 2018]. Assumption: Normal Lévy families (unit sphere, uniform, L2 norm; Boolean hypercube, uniform, Hamming distance; ...). Key result: if the attack strength exceeds a relatively small threshold, adversarial risk > 1/2: b > √(log(k₁/ε)) / √(k₂ · n) ⟹ Risk_b(h, c) ≥ 1/2.
  81. Prediction Change Robustness 97 Prediction Change: an input x' ∈ X is an adversarial example for x ∈ X iff ∃ x' ∈ Ball_ε(x) such that f(x') ≠ f(x). Any non-trivial model has adversarial examples: ∃ x_a, x_b ∈ X. f(x_a) ≠ f(x_b). Solutions: only consider in-distribution inputs ("good" seeds); output more than just a class (e.g., confidence); targeted adversarial examples; cost-sensitive adversarial robustness.
  82. Local (Instance) Robustness 98 Robust Region: for an input x, the robust region is the maximum region with no adversarial example: sup { ε > 0 : ∀ x' ∈ Ball_ε(x), f(x') = f(x) }.
  83. Local (Instance) Robustness 99 Robust Region: for an input x, the robust region is the maximum region with no adversarial example: sup { ε > 0 : ∀ x' ∈ Ball_ε(x), f(x') = f(x) }. Robust Error: for a test set T and bound ε*: |{ x ∈ T : RobustRegion(x) < ε* }| / |T|.
  84. Instance Defense-Robustness 100 For an input x, the robust-defended region is the maximum region with no undetected adversarial example: sup { ε > 0 : ∀ x' ∈ Ball_ε(x), f(x') = f(x) ∨ detected(x') }. Defense Failure: for a test set T and bound ε*: |{ x ∈ T : RobustDefendedRegion(x) < ε* }| / |T|. Can we verify a defense?
  85. Formal Verification of a Defense Instance: exhaustively test all inputs x' ∈ Ball_ε(x) for correctness or detection. Need to transform the model into a function amenable to verification.
  86. Linear Programming: find values of x that minimize a linear function c_1 x_1 + c_2 x_2 + c_3 x_3 + ... subject to linear constraints: a_11 x_1 + a_12 x_2 + ... ≤ b_1; a_21 x_1 + a_22 x_2 + ... ≤ b_2; x_j ≤ 0; ...
  87. Encoding a Neural Network 103 Linear components (y = Wx + b): convolutional layer, fully-connected layer, batch normalization (in test mode). Non-linear: activation (ReLU, Sigmoid, Softmax), pooling layer (max, avg).
  88. Encode ReLU: Mixed Integer Linear Programming adds discrete variables to LP. ReLU (Rectified Linear Unit), y = max(0, x), is piecewise linear; with pre-activation bounds l ≤ x ≤ u and a binary variable a ∈ {0, 1}: y ≥ x, y ≥ 0, y ≤ x − l(1 − a), y ≤ u·a.
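
A sketch of the ReLU encoding above using the PuLP modeling library (a generic MILP front end, not MIPVerify itself); the bounds in the usage example are made up:

```python
import pulp

def encode_relu(prob, x, y, lower, upper, name):
    """Add the big-M encoding of y = max(0, x), given bounds lower <= x <= upper."""
    a = pulp.LpVariable(f"relu_active_{name}", cat="Binary")
    prob += y >= x                      # y >= x
    prob += y >= 0                      # y >= 0
    prob += y <= x - lower * (1 - a)    # binds y to x when a = 1 (ReLU active)
    prob += y <= upper * a              # forces y = 0 when a = 0
    return a

# Tiny usage example with assumed bounds:
prob = pulp.LpProblem("relu_demo", pulp.LpMinimize)
x = pulp.LpVariable("x", lowBound=-1.0, upBound=2.0)
y = pulp.LpVariable("y")
prob += y                               # objective: minimize y
encode_relu(prob, x, y, lower=-1.0, upper=2.0, name="n0")
prob.solve()
```
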
  89. Mixed Integer Linear Programming (MILP) Intractable in theory (NP-Complete) Efficient

    in practice (e.g., Gurobi solver) MIPVerify Vincent Tjeng, Kai Xiao, Russ Tedrake Verify NNs using MILP
  90. Encode Feature Squeezers: Binary Filter (threshold 0.5). The actual input is uint8 in [0, 1, 2, ..., 254, 255]: 127/255 = 0.498 and 128/255 = 0.502, so there is an infeasible gap [0.499, 0.501]; the filter is lower semi-continuous.
  91. Verified L∞ Robustness (ε = 0.1)
      Model                   | Test Accuracy | Robust Error  | Robust Error with Binary Filter
      Raghunathan et al.      | 95.82%        | 14.36%-30.81% | 7.37%
      Wong & Kolter           | 98.11%        | 4.38%         | 4.25%
      Ours with binary filter | 98.94%        | 2.66%-6.63%   | -
      Even without detection, this helps!
  92. Encode Detection Mechanism. Original version: score(x) = || f(x) − f(squeeze(x)) ||_1, where f(x) is the softmax output. Simplify for verification: L1 → maximum difference; softmax → multiple piecewise-linear approximate sigmoids.
  93. Preliminary Experiments 109 Model: 4-layer CNN with a bit depth-1 squeezer; input x' is flagged as adversarial if max_diff(y_1, y_2) > τ. Verification: for a seed x, there is no adversarial input x' ∈ Ball_ε(x) whose prediction differs from f(x) and that is not detected. Adversarially robust retrained [Wong & Kolter] model, 1000 MNIST test seeds, ε = 0.1 (L∞): 970 infeasible (verified no adversarial example), 13 misclassified (original seed), 17 vulnerable. Robust error: 0.3%. Verification time ~0.2s (compared to 0.8s without binarization).
  94. 110 (Chart comparing approaches by scalability and by how precise their evaluation metric is; feature squeezing is also placed on the chart.) Formal verification (metric: robust error): MILP solver (MIPVerify), SMT solver (Reluplex), interval analysis (ReluVal). Certified robustness (metric: bound): CNN-Cert (Boopathy et al., 2018), Dual-LP (Kolter & Wong 2018), Dual-SDP (Raghunathan et al., 2018). Heuristic defenses (metric: attack success rate on a set of attacks): distillation (Papernot et al., 2016), gradient obfuscation, adversarial retraining (Madry et al., 2017).
  95. Realistic Threat Models 111 Knowledge: full access to the target. Goals: find many seed/adversarial example pairs. Resources: limited number of API queries; limited computation. It matters which seed and target classes are involved.
  96. 112 (Figure: seed class vs. target class robustness grid for the original model, no robustness training. MNIST model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, L∞.)
  97. 113 (Figure: same seed class vs. target class grid as above, for the original model with no robustness training; ε = 0.2, L∞.)
  98. Training a Robust Network. Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML 2018. Replace the loss with a differentiable function based on an outer bound, computed using a dual network; the ReLU (Rectified Linear Unit) is replaced by a linear approximation.
  99. 115 (Figure: seed class vs. target class robustness grid with standard robustness training (overall robustness goal). MNIST model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, L∞.)
  100. Cost-Sensitive Robustness Training 116 (Xiao Zhang) Incorporate a cost matrix into robustness training, giving the cost of different adversarial transformations; e.g., for classes (benign, malware): C = [ [−, 0], [1, −] ], so only the malware → benign transformation carries cost. Xiao Zhang and David Evans [ICLR 2019]
  101. 117 (Figure: seed class vs. target class robustness grid with standard robustness training (overall robustness goal). MNIST model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, L∞.)
  102. Security State-of-the-Art 121 Cryptography: attack success probability ~2^-128; threat models: information theoretic, resource bounded; proofs: required. Considered seriously broken if an attack method increases the success probability to 2^-126, even if it requires 2^... ciphertexts.
  103. Security State-of-the-Art 122 Cryptography: attack success probability ~2^-128; threat models: information theoretic, resource bounded; proofs: required. System Security: attack success probability ~2^-...; threat models: capabilities, motivations, rationality; proofs: common. Considered seriously broken if an attack method can succeed in a "lab" environment with probability 2^-....
  104. Security State-of-the-Art 123 Cryptography: attack success probability ~2^-128; threat models: information theoretic, resource bounded; proofs: required. System Security: attack success probability ~2^-...; threat models: capabilities, motivations, rationality; proofs: common. Adversarial Machine Learning: attack success probability ~2^-1; threat models: artificially limited adversary; proofs: making progress! Considered broken if an attack method succeeds with probability 2^-....
  105. Security State-of-the-Art 124 Cryptography: attack success probability ~2^-128; threat models: information theoretic, resource bounded; proofs: required. System Security: attack success probability ~2^-...; threat models: capabilities, motivations, rationality; proofs: common. Adversarial Machine Learning: attack success probability ~2^-1; threat models: artificially limited adversary; proofs: making progress! Huge gaps to close: threat models are unrealistic (but real threats are unclear); verification techniques only work for tiny models; experimental defenses are often (quickly) broken.