Slide 1

Slide 1 text

Trustworthy Machine Learning David Evans University of Virginia jeffersonswheel.org Bertinoro, Italy 27 August 2019 19th International School on Foundations of Security Analysis and Design 2: Defenses

Slide 2

Slide 2 text

Recap/Plan. Monday (yesterday): Introduction / Attacks. Tuesday (today): Threat Models, Defenses. Wednesday: Privacy.

Slide 3

Slide 3 text

Questions 2

Slide 4

Slide 4 text

Threat Models. 1. What are the attacker's goals? • Malicious behavior without detection • Commit check fraud • ... 2. What are the attacker's capabilities? Information: what do they know? Actions: what can they do? Resources: how much can they spend?

Slide 5

Slide 5 text

Threat Models in Cryptography Ciphertext-only attack Intercept message, want to learn plaintext Chosen-plaintext attack Adversary has encryption function as black box, wants to learn key (or decrypt some ciphertext) Chosen-ciphertext attack Adversary has decryption function as black box, wants to learn key (or encrypt some message) 4 Goals Information Actions Resources

Slide 6

Slide 6 text

Threat Models in Cryptography 5 Goals Information Actions Resources Polynomial time/space: adversary has computational resources that scale polynomially in some security parameter (e.g., key size)

Slide 7

Slide 7 text

Security Goals in Cryptography 6 First formal notions of cryptography, information theory Claude Shannon (1940s)

Slide 8

Slide 8 text

Security Goals in Cryptography 7 Semantic Security: adversary with intercepted ciphertext has no advantage over adversary without it Shafi Goldwasser and Silvio Micali Developed semantic security in 1980s (2013 Turing Awardees)

Slide 9

Slide 9 text

Threat Models in Adversarial ML? 8 Ciphertext-only attack Chosen-plaintext attack Chosen-ciphertext attack Polynomial time/space Semantic Security proofs Can we get to threat models as precise as those used in cryptography? Can we prove strong security notions for those threat models?

Slide 10

Slide 10 text

Threat Models in Adversarial ML? Ciphertext-only attack, chosen-plaintext attack, chosen-ciphertext attack; polynomial time/space; semantic security proofs. Can we get to threat models as precise as those used in cryptography? Can we prove strong security notions for those threat models? Current state: “Pre-Shannon” (Nicholas Carlini)

Slide 11

Slide 11 text

Ali Rahimi, NIPS Test-of-Time Award Speech (Dec 2017): “If you're building photo-sharing systems, alchemy is okay, but we're beyond that; now we're building systems that govern healthcare and mediate our civic dialogue.”

Slide 12

Slide 12 text

Ali Rahimi, NIPS Test-of-Time Award Speech (Dec 2017): “If you're building photo-sharing systems, alchemy is okay, but we're beyond that; now we're building systems that govern healthcare and mediate our civic dialogue.”

Slide 13

Slide 13 text

Alchemy (~700−1660). Well-defined, testable goal: turn lead into gold. Established theory: the four elements (earth, fire, water, air). Methodical experiments and lab techniques (Jabir ibn Hayyan in the 8th century). Wrong and ultimately unsuccessful, but led to modern chemistry.

Slide 14

Slide 14 text

“Realistic” Threat Model for Adversarial ML 13

Slide 15

Slide 15 text

Attacker Access. White Box: attacker has the model f(x) = f_n(f_{n−1}(… f_1(x))), with full knowledge of all parameters. Black Box (“API access”): attacker can only submit an input x and receive the output f(x); each model query is “expensive”.

Slide 16

Slide 16 text

ML-as-a-Service 15

Slide 17

Slide 17 text

Black-Box Attacks. PGD Attack: x′_0 = x; for T iterations: x′_{t+1} = project_{x,ε}(x′_t − α · sign(∇ℒ(x′_t, y))); x′ = x′_T. Can we execute these attacks if we don't have the model?
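For concreteness, here is a minimal NumPy sketch of that PGD loop, assuming a hypothetical grad_loss(x, y) helper that returns the gradient of the loss with respect to the input (only available with white-box access); step size and iteration count are illustrative.

    import numpy as np

    def pgd_attack(x, y, grad_loss, eps=0.3, alpha=0.01, iters=40):
        # Projected gradient descent inside an L-infinity ball of radius eps.
        # grad_loss(x, y) is an assumed white-box helper returning the input gradient.
        x_adv = x.copy()
        for _ in range(iters):
            x_adv = x_adv - alpha * np.sign(grad_loss(x_adv, y))  # signed step (targeted form, as on the slide)
            x_adv = np.clip(x_adv, x - eps, x + eps)              # project back into the eps-ball around x
            x_adv = np.clip(x_adv, 0.0, 1.0)                      # keep pixels in a valid range
        return x_adv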

Slide 18

Slide 18 text

Black-Box Optimization Attacks. Black-Box Gradient Attack: x′_0 = x; for T iterations: use queries to the target f to estimate ∇ℒ(x′_t, y).

Slide 19

Slide 19 text

Black-Box Optimization Attacks. Black-Box Gradient Attack: x′_0 = x; for T iterations: use queries to the target f to estimate ∇ℒ(x′_t, y); x′_{t+1} = take a step of the “white-box” attack using the estimated gradients.
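One common way to get those gradient estimates from queries alone is an NES-style estimator with antithetic samples; query_loss(x, y) is a hypothetical helper that calls the target API and returns a scalar loss, so each sample below costs two queries.

    import numpy as np

    def estimate_gradient(x, y, query_loss, sigma=1e-3, samples=50):
        # Estimate the gradient of the loss using only black-box loss queries.
        grad = np.zeros_like(x)
        for _ in range(samples):
            u = np.random.randn(*x.shape)
            grad += u * (query_loss(x + sigma * u, y) - query_loss(x - sigma * u, y))
        return grad / (2 * sigma * samples)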

Slide 20

Slide 20 text

Black-Box Gradient Attacks 19 Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries. Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.

Slide 21

Slide 21 text

Transfer Attacks. Local model f̂ (trained by the attacker); external target model f. x* = whiteBoxAttack(f̂, x); hope that f(x*) = t. Adversarial examples against one model often transfer to another model.

Slide 22

Slide 22 text


Slide 23

Slide 23 text

Improving Transfer Attacks. Local ensemble f̂_1, f̂_2, f̂_3; external target model f. x* = whiteBoxAttack(ensemble(f̂_1, f̂_2, f̂_3), x); hope that f(x*) = t. Adversarial examples against several models are more likely to transfer. Yanpei Liu, Xinyun Chen, Chang Liu, Dawn Song [ICLR 2017]

Slide 24

Slide 24 text

Hybrid Attacks. Transfer attacks: efficient (only one API query), but low success rates (~3% transfer rate for a targeted attack on ImageNet, even with an ensemble). Gradient attacks: expensive (10k+ queries per seed), but high success rates (~100% for a targeted attack on ImageNet). Combine both attacks: efficient + high success. Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries. Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.

Slide 25

Slide 25 text

Hybrid Attack. 1: Transfer Attack — local ensemble f̂_1, f̂_2, f̂_3; x* = whiteBoxAttack(ensemble(f̂_1, f̂_2, f̂_3), x); query the external target f with x*; succeeds if f(x*) = t.

Slide 26

Slide 26 text

Hybrid Attack. 1: Transfer Attack — x* = whiteBoxAttack(ensemble(f̂_1, f̂_2, f̂_3), x); here f(x*) ≠ t, so the transfer fails. 2: Gradient Attack — run a query-based gradient attack starting from the transfer candidate x*, sending intermediate inputs to the target.

Slide 27

Slide 27 text

Hybrid Attack. 1: Transfer Attack — x* = whiteBoxAttack(ensemble(f̂_1, f̂_2, f̂_3), x); if f(x*) ≠ t, continue. 2: Gradient Attack — query-based attack starting from the transfer candidate. 3: Tune Local Models — use the label byproducts of the queries to fine-tune the local models.
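Putting the three steps together, a sketch of the hybrid attack for one seed might look like the following; white_box_attack, gradient_attack, fine_tune, and query_target are assumed helpers (the paper's actual implementation differs in details).

    def hybrid_attack(seed, target, local_models, query_target,
                      white_box_attack, gradient_attack, fine_tune):
        # Step 1: transfer attack -- white-box attack on the local ensemble.
        candidate = white_box_attack(local_models, seed, target)
        if query_target(candidate) == target:      # direct transfer: only one query spent
            return candidate, local_models
        # Step 2: query-based gradient attack, warm-started from the candidate.
        adv, byproducts = gradient_attack(candidate, target, query_target)
        # Step 3: tune the local models with the labels observed as query byproducts.
        local_models = fine_tune(local_models, byproducts)
        return adv, local_models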

Slide 28

Slide 28 text

Dataset / Model | Direct Transfer Rate | Gradient Attack (AutoZOOM): Success Rate / Queries per AE | Hybrid Attack: Success Rate / Queries per AE
MNIST (Targeted) | 61.6% | 90.9% / 1,645 | 98.8% / 298
CIFAR10 (Targeted) | 63.3% | 92.2% / 1,227 | 98.1% / 227
ImageNet (Targeted) | 3.4% | 95.4% / 45,166 | 98.0% / 30,089

Slide 29

Slide 29 text

Realistic Adversary Model? Knowledge: only API access to the target; good models for an ensemble (pretrained models, or access to a similar training dataset and resources); a set of starting seeds. Goals: find one adversarial example for each seed. Resources: unlimited number of API queries.

Slide 30

Slide 30 text

Batch Attacks. Knowledge: only API access to the target; good models for an ensemble (pretrained models, or access to a similar training dataset and resources); a set of starting seeds. Goals: find many seed/adversarial example pairs. Resources: limited number of API queries. Prioritize seeds to attack: use resources to attack the low-cost seeds first.

Slide 31

Slide 31 text

Requirements 1. There is a high variance across seeds in the cost to find adversarial examples. 2. There are ways to predict in advance which seeds will be easy to attack. 30

Slide 32

Slide 32 text

[Plot: variation in query cost of the NES gradient attack across seeds (excludes direct transfers).]

Slide 33

Slide 33 text

Predicting the Low-Cost Seeds. Strategy 1: cost of the local attack (number of PGD steps to find a local AE). Strategy 2: loss function on the target. [Plots: NES gradient attack on the robust CIFAR-10 model.]
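A sketch of how such a prioritization could drive a batch attack under a fixed query budget; estimate_cost and attack are assumed helpers (estimate_cost could, for example, return the number of PGD steps needed against the local ensemble).

    import numpy as np

    def prioritized_batch_attack(seeds, targets, estimate_cost, attack, query_budget):
        # Attack the seeds predicted to be cheapest first.
        order = np.argsort([estimate_cost(s, t) for s, t in zip(seeds, targets)])
        results, queries_used = [], 0
        for i in order:
            adv, cost = attack(seeds[i], targets[i])   # returns an AE (or None) and queries spent
            queries_used += cost
            if adv is not None:
                results.append((seeds[i], adv))
            if queries_used >= query_budget:
                break
        return results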

Slide 34

Slide 34 text

What about Direct Transfers? Strategy 1: cost of the local attack (number of PGD steps to find a local AE). Strategy 2: loss function on the target.

Slide 35

Slide 35 text

Direct Transfers. Cost of the local attack: number of PGD steps to find a local AE.

Slide 36

Slide 36 text

Two-Phase Hybrid Attack 35 Retroactive Optimal: unrealizable strategy that always picks lowest cost seed

Slide 37

Slide 37 text

Two-Phase Hybrid Attack 36 Retroactive Optimal: unrealizable strategy that always picks lowest cost seed Phase 1: Find Direct Transfers (1000 queries to find 95 direct transfers) AutoZOOM attack on Robust CIFAR-10 Model

Slide 38

Slide 38 text

Two-Phase Hybrid Attack. Retroactive Optimal: unrealizable strategy that always picks the lowest-cost seed. Phase 1: Find Direct Transfers (1000 queries to find 95 direct transfers). Phase 2: Gradient Attack (100,000 queries to find the next 95 adversarial examples). AutoZOOM attack on Robust CIFAR-10 Model.

Slide 39

Slide 39 text

Cost of Hybrid Batch Attacks — total queries (standard error) to reach each goal (fraction of seeds attacked):

Target Model: CIFAR-10 (Robust), 1000 seeds
  “Optimal”:  1%: 10.0 (0.0) | 2%: 20.0 (0.0) | 10%: 107.8 (17.4)
  Two-Phase:  1%: 20.4 (2.1) | 2%: 54.2 (5.6) | 10%: 826.2 (226.6)
  Random:     1%: 24,054 (132) | 2%: 45,372 (260) | 10%: 251,917 (137)

Target Model: ImageNet, 100 seeds
  “Optimal”:  1%: 1.0 (0.0) | 2%: 2.0 (0.0) | 10%: 34,949 (3,742)
  Two-Phase:  1%: 28.0 (2.0) | 2%: 38.6 (7.5) | 10%: 78,844 (11,837)
  Random:     1%: 15,046 (423) | 2%: 45,136 (1,270) | 10%: 285,855 (8,045)

Slide 40

Slide 40 text

Defenses 39

Slide 41

Slide 41 text

How can we construct models that make it hard for adversaries to find adversarial examples? 40

Slide 42

Slide 42 text

Defense Strategies 1. Hide the gradients 41

Slide 43

Slide 43 text

Defense Strategies. 1. Hide the gradients − Transferability results: [local model f̂; x* = whiteBoxAttack(f̂, x); target f(x*) = t].

Slide 44

Slide 44 text

Defense Strategies. 1. Hide the gradients − Transferability results: [local model f̂; x* = whiteBoxAttack(f̂, x); target f(x*) = t]. Maybe they can work against adversaries who don't have access to training data or a similar model? (or when transfer loss is high)

Slide 45

Slide 45 text

44 Visualization by Nicholas Carlini

Slide 46

Slide 46 text

45 Visualization by Nicholas Carlini

Slide 47

Slide 47 text

46 Visualization by Nicholas Carlini

Slide 48

Slide 48 text

47 Visualization by Nicholas Carlini

Slide 49

Slide 49 text

Defense Strategies. 1. Hide the gradients − Clever adversaries can still find adversarial examples. [ICML 2018 (Best Paper award)]

Slide 50

Slide 50 text

Defense Strategies. 1. Hide the gradients − Transferability results − Clever adversaries can still find adversarial examples. 2. Build a robust classifier: increase capacity.

Slide 51

Slide 51 text

Increasing Model Capacity 50 Image from Aleksander Mądry, et al. 2017

Slide 52

Slide 52 text

Defense Strategies. 1. Hide the gradients − Transferability results − Clever adversaries can still find adversarial examples. 2. Build a robust classifier: increase capacity; consider adversaries in training (adversarial training).

Slide 53

Slide 53 text

Adversarial Training (Example from Yesterday). [Diagram: Training Data → ML Algorithm (training) → cloned model; EvadeML attacks the deployed classifier.] Why didn't this work?

Slide 54

Slide 54 text

Adversarial Training. Training Data → Training Process → Candidate Model f_i. Adversarial Example Generator: successful AEs against f_i are added to the training data (with correct labels). Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013

Slide 55

Slide 55 text

Ensemble Adversarial Training Training Data Training Process Candidate Model !" Adversarial Example Generator Successful AEs against !" add to training data (with correct labels) Florian Tramer, et al. [ICLR 2018]

Slide 56

Slide 56 text

Ensemble Adversarial Training Training Data Training Process Candidate Model !" Adversarial Example Generator Successful AEs against !" add to training data (with correct labels) Florian Tramer, et al. [ICLR 2018] Static Model !# Adversarial Example Generator AEs against !# Static Model !$ Adversarial Example Generator AEs against !$

Slide 57

Slide 57 text

Formalizing Adversarial Training (Madry et al., 2017). Regular training: min_θ E_{(x,y)∼D} ℒ(f_θ, x, y).

Slide 58

Slide 58 text

Formalizing Adversarial Training (Madry et al., 2017). Regular training: min_θ E_{(x,y)∼D} ℒ(f_θ, x, y). Adversarial training: min_θ E_{(x,y)∼D} [max_{δ∈Δ} ℒ(f_θ, x + δ, y)]. Simulate the inner maximization with a PGD attack with multiple restarts.
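A minimal sketch of one adversarial-training step under this formulation, reusing a PGD routine (like the one sketched earlier) to approximate the inner maximization; pgd_attack and loss_grad are assumed helpers.

    def adversarial_training_step(params, batch_x, batch_y, pgd_attack, loss_grad, lr=0.1):
        # Inner maximization: approximate the worst-case perturbation with PGD.
        x_adv = pgd_attack(batch_x, batch_y)
        # Outer minimization: ordinary gradient step on the loss at the adversarial points.
        return params - lr * loss_grad(params, x_adv, batch_y)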

Slide 59

Slide 59 text

Attacking Robust Models.
Dataset / Model | Direct Transfer Rate | Gradient Attack (AutoZOOM): Success Rate / Queries per AE | Hybrid Attack: Success Rate / Queries per AE
MNIST (Targeted) | 61.6% | 90.9% / 1,645 | 98.8% / 298
CIFAR10 (Targeted) | 63.3% | 92.2% / 1,227 | 98.1% / 227
MNIST-Robust (Untargeted) | 2.9% | 7.2% / 52,182 | 7.3% / 51,328
CIFAR10-Robust (Untargeted) | 9.5% | 64.4% / 2,640 | 65.2% / 2,529
Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries. Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.

Slide 60

Slide 60 text

Defense Strategies. 1. Hide the gradients − Transferability results − Clever adversaries can still find adversarial examples. 2. Build a robust classifier − Adversarial retraining with increased model capacity: very expensive, and assumes you can generate adversarial examples as well as the adversary can. − If we could build a perfect model, we would!

Slide 61

Slide 61 text

Defense Strategies. 1. Hide the gradients − Transferability results − Clever adversaries can still find adversarial examples. 2. Build a robust classifier − Adversarial retraining, increasing model capacity, etc. − If we could build a perfect model, we would! Our strategy: “Feature Squeezing” — reduce the search space available to the adversary. Weilin Xu, David Evans, Yanjun Qi [NDSS 2018]

Slide 62

Slide 62 text

Model Model Squeezer 1 Prediction0 Prediction1 "($%&'( , $%&'* , … , $%&', ) Input Adversarial Legitimate Model’ Squeezer k … Predictionk Feature Squeezing Detection Framework Weilin Xu Yanjun Qi

Slide 63

Slide 63 text

Model Model Squeezer 1 Prediction0 Prediction1 "($%&'( , $%&'* , … , $%&', ) Input Adversarial Legitimate Model’ Squeezer k … Predictionk Feature Squeezing Detection Framework Feature Squeezer coalesces similar inputs into one point: • Barely change legitimate inputs. • Destruct adversarial perturbations.

Slide 64

Slide 64 text

Coalescing by Feature Squeezing 63 Metric Space 1: Target Classifier Metric Space 2: “Oracle” Before: find a small perturbation that changes class for classifier, but imperceptible to oracle. Now: change class for both original and squeezed classifier, but imperceptible to oracle.

Slide 65

Slide 65 text

Fast Gradient Sign [Yesterday]. ℓ∞-bounded adversary: max(abs(x_i − x′_i)) ≤ ε. x′ = x − ε · sign(∇_x ℒ(x, y)). Goodfellow, Shlens, Szegedy 2014. [Images: original and adversarial examples for adversary power ε = 0.1, 0.2, 0.3, 0.4, 0.5.]
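FGSM is just a single signed-gradient step; a short NumPy sketch, with the same hypothetical grad_loss helper as before (the slide's targeted form subtracts the step).

    import numpy as np

    def fgsm(x, y, grad_loss, eps=0.1):
        # One signed-gradient step of size eps, clipped to the valid pixel range.
        return np.clip(x - eps * np.sign(grad_loss(x, y)), 0.0, 1.0)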

Slide 66

Slide 66 text

Bit Depth Reduction (signal quantization). Reduce to b bits: x_i ← round(x_i × (2^b − 1)) / (2^b − 1); for 1 bit this is x_i ← round(x_i). Example: normal input X = [0.012 0.571 … 0.159 0.951] → [0. 1. … 0. 1.]; adversarial example X* = [0.312 0.271 … 0.159 0.651] → [0. 0. … 0. 1.]. [Plot: quantization curves for 8-bit, 3-bit, and 1-bit depth.]
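The squeezer itself is a one-liner; a NumPy sketch of bit depth reduction (inputs assumed scaled to [0, 1]), reproducing the 1-bit example above.

    import numpy as np

    def reduce_bit_depth(x, bits):
        # Quantize each feature to 2**bits levels in [0, 1].
        levels = 2 ** bits - 1
        return np.round(x * levels) / levels

    x = np.array([0.012, 0.571, 0.159, 0.951])
    print(reduce_bit_depth(x, 1))   # -> [0. 1. 0. 1.]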

Slide 67

Slide 67 text

Bit Depth Reduction. [Image grid: a seed MNIST digit and adversarial examples generated by CW2, CW∞, BIM, and FGSM, with the model's predicted labels before and after squeezing.]

Slide 68

Slide 68 text

Accuracy with Bit Depth Reduction.
Dataset | Squeezer | Adversarial Examples (FGSM, BIM, CW∞, DeepFool, CW2, CW0, JSMA) | Legitimate Images
MNIST | None | 13.0% | 99.43%
MNIST | 1-bit Depth | 62.7% | 99.33%
ImageNet | None | 2.78% | 69.70%
ImageNet | 4-bit Depth | 52.11% | 68.00%

Slide 69

Slide 69 text

Spatial Smoothing: Median Filter. Replace a pixel with the median of its neighbors. Effective in eliminating “salt-and-pepper” noise (ℓ0 attacks). [Image: 3×3 median filter, from https://sultanofswing90.wordpress.com/tag/image-processing/]

Slide 70

Slide 70 text

Spatial Smoothing: Non-local Means. Replace a patch p with a weighted mean of similar patches q_i (within a search region): p′ = Σ_i w(p, q_i) · q_i. Preserves edges while removing noise.
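Both smoothing squeezers are available in standard image libraries; a sketch using SciPy and scikit-image (parameter names follow recent scikit-image versions, and the values here are illustrative rather than the paper's exact settings).

    import numpy as np
    from scipy.ndimage import median_filter
    from skimage.restoration import denoise_nl_means

    img = np.random.rand(32, 32, 3)                    # stand-in for a CIFAR-10-sized image

    median_squeezed = median_filter(img, size=(2, 2, 1))   # 2x2 median filter, per channel
    nlm_squeezed = denoise_nl_means(img, patch_size=3,
                                    patch_distance=6, h=0.1,
                                    channel_axis=-1)       # non-local means smoothing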

Slide 71

Slide 71 text

[Classification results for an airplane image under each squeezer:
            | No squeezing     | Median Filter (2×2) | Non-local Means (13-3-4)
Original    | Airplane 94.4%   | Airplane 98.4%      | Airplane 98.3%
BIM (L∞)    | Truck 99.9%      | Airplane 99.9%      | Airplane 80.8%
JSMA (L0)   | Automobile 56.5% | Ship 46.0%          | Airplane 70.0%]

Slide 72

Slide 72 text

Accuracy with Spatial Smoothing.
Dataset | Squeezer | Adversarial Examples (FGSM, BIM, CW∞, DeepFool, CW2, CW0) | Legitimate Images
ImageNet | None | 2.78% | 69.70%
ImageNet | Median Filter 2×2 | 68.11% | 65.40%
ImageNet | Non-local Means 11-3-4 | 57.11% | 65.40%

Slide 73

Slide 73 text

Other Potential Squeezers: thermometer encoding (learnable bit depth reduction); image denoising using autoencoder, wavelet, JPEG, etc.; image resizing; spatial smoothers: median filter, non-local means; ... References: C Xie, et al. Mitigating Adversarial Effects Through Randomization, ICLR 2018. J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018. D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017. A Prakash, et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018; ...

Slide 74

Slide 74 text

Other Potential Squeezers: thermometer encoding (learnable bit depth reduction); image denoising using autoencoder, wavelet, JPEG, etc.; image resizing; spatial smoothers: median filter, non-local means; ... References: C Xie, et al. Mitigating Adversarial Effects Through Randomization, ICLR 2018. J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018. D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017. A Prakash, et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018; ... But see: Anish Athalye, Nicholas Carlini, David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

Slide 75

Slide 75 text

“Feature Squeezing” (Vacuous) Conjecture: For any distance-limited adversarial method, there exists some feature squeezer that accurately detects its adversarial examples. Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces the original and adversarial example into the same sample.

Slide 76

Slide 76 text

Feature Squeezing Detection (MNIST). [Diagram: model (7-layer CNN); squeezers: 1-bit depth and 2×2 median filter; compute prediction_0 (no squeezing), prediction_1, prediction_2; flag the input as adversarial if max(d(prediction_0, prediction_1), d(prediction_0, prediction_2)) > T, otherwise legitimate.]
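A sketch of the detection rule, assuming model(x) returns the softmax probability vector and squeezers is a list of squeezing functions like those above; the ℓ1 distance between prediction vectors is the score.

    import numpy as np

    def squeezing_detector(x, model, squeezers, threshold):
        # Flag x as adversarial if any squeezed prediction differs too much (L1 distance).
        p0 = model(x)
        score = max(np.abs(p0 - model(squeeze(x))).sum() for squeeze in squeezers)
        return score > threshold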

Slide 77

Slide 77 text

Detecting Adversarial Examples. [Histogram: distance between the original input and its squeezed version, for adversarial inputs (CW attack) vs. legitimate inputs.]

Slide 78

Slide 78 text

Training a detector (MNIST): set the detection threshold to keep the false positive rate below a target. [Histogram: maximum ℓ1 distance between original and squeezed input, legitimate vs. adversarial.] Threshold = 0.0029 gives 98.2% detection with FP < 4%.
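Choosing the threshold this way is just a quantile of the scores on legitimate inputs; a sketch (legit_scores would be the ℓ1 scores computed on a held-out legitimate set).

    import numpy as np

    def pick_threshold(legit_scores, target_fp_rate=0.05):
        # Smallest threshold such that at most target_fp_rate of legitimate inputs exceed it.
        return np.quantile(legit_scores, 1.0 - target_fp_rate)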

Slide 79

Slide 79 text

ImageNet Configuration. [Diagram: model (MobileNet); squeezers: 5-bit depth, 2×2 median filter, and non-local means; compute prediction_0, …, prediction_3; flag as adversarial if max(d(prediction_0, {prediction_1, prediction_2, prediction_3})) > T, otherwise legitimate.]

Slide 80

Slide 80 text

Training a detector (ImageNet). [Histogram: maximum ℓ1 distance between original and squeezed input, legitimate vs. adversarial.] Threshold = 1.24 gives 85% detection with FP < 5%.

Slide 81

Slide 81 text

Aggregated Detection Results.
Dataset | Squeezers | Threshold | False Positive Rate | Detection Rate (SAEs) | ROC-AUC (excluding FAEs)
MNIST | Bit Depth (1-bit), Median (2×2) | 0.0029 | 3.98% | 98.2% | 99.44%
CIFAR-10 | Bit Depth (5-bit), Median (2×2), Non-local Means (13-3-2) | 1.1402 | 4.93% | 84.5% | 95.74%
ImageNet | Bit Depth (5-bit), Median (2×2), Non-local Means (11-3-4) | 1.2128 | 8.33% | 85.9% | 94.24%

Slide 82

Slide 82 text

Threat Models Oblivious attack: The adversary has full knowledge of the target model, but is not aware of the detector. Adaptive attack: The adversary has full knowledge of the target model and the detector. 81

Slide 83

Slide 83 text

Adaptive Adversary. Adaptive CW2 attack, unbounded adversary: minimize [misclassification term] + c · Δ(x, x′) [distance term] + k · score(x′) [detection term]. Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song, Adversarial Example Defense: Ensembles of Weak Defenses are not Strong, USENIX WOOT'17.
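The structure of that adaptive objective, as a sketch; misclassify_loss, distance, and detection_score are assumed helpers (e.g., a CW-style loss, an L2 distance, and the squeezing score).

    def adaptive_objective(x_adv, x, c, k, misclassify_loss, distance, detection_score):
        # Misclassification term + weighted distance term + weighted detection term.
        return misclassify_loss(x_adv) + c * distance(x, x_adv) + k * detection_score(x_adv)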

Slide 84

Slide 84 text

Adaptive Adversarial Examples. Mean L2: Untargeted 2.80, Targeted (next) 4.14, Targeted (least likely) 4.67. No successful adversarial examples were found for images originally labeled as 3 or 8.

Slide 85

Slide 85 text

Adaptive Adversary Success Rates. [Plot: adversary's success rate vs. clipped ε for Untargeted, Targeted (Next), and Targeted (LL) attacks, comparing the unbounded adversary to a typical ε bound.]

Slide 86

Slide 86 text

Defender's Entropy Advantage. [Diagram: the feature squeezing detection framework (input → model and squeezers 1…k → predictions → d(prediction_0, …, prediction_k) → adversarial / legitimate), parameterized by a random seed unknown to the adversary.]

Slide 87

Slide 87 text

Counter Measure: Randomization. Binary filter: instead of threshold := 0.5, use threshold := 𝒩(0.5, 0.0625). Strengthen the adaptive adversary: attack an ensemble of 3 detectors with thresholds [0.4, 0.5, 0.6].

Slide 88

Slide 88 text

Mean L2 of adaptive adversarial examples. Attacking the deterministic detector: 2.80 (Untargeted), 4.14 (Targeted-Next), 4.67 (Targeted-LL). Attacking the randomized detector: 3.63 (Untargeted), 5.48 (Targeted-Next), 5.76 (Targeted-LL).

Slide 89

Slide 89 text

Are defenses against adversarial examples even possible? 88

Slide 90

Slide 90 text

(Redefining) Adversarial Example. Prediction Change Definition: an input x′ ∈ X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x) ≠ f(x′).

Slide 91

Slide 91 text

Adversarial Example. Prediction Change Definition: an input x′ ∈ X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x) ≠ f(x′). Ball_ε(x) is some space around x, typically defined in some (simple!) metric space: ℓ0 norm (number of differing components), ℓ2 norm (“Euclidean distance”), ℓ∞. Without constraints on Ball_ε, every input has adversarial examples.

Slide 92

Slide 92 text

Adversarial Example. Prediction Change Definition: an input x′ ∈ X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x) ≠ f(x′). Any non-trivial model has adversarial examples: ∃ x_a, x_b ∈ X. f(x_a) ≠ f(x_b).

Slide 93

Slide 93 text

Prediction Error Robustness. Error Robustness: an input x′ ∈ X is an adversarial example for (correct) x ∈ X iff x′ ∈ Ball_ε(x) and f(x′) ≠ the true label of x′. A perfect classifier has no (error robustness) adversarial examples.

Slide 94

Slide 94 text

Prediction Error Robustness. Error Robustness: an input x′ ∈ X is an adversarial example for (correct) x ∈ X iff x′ ∈ Ball_ε(x) and f(x′) ≠ the true label of x′. A perfect classifier has no (error robustness) adversarial examples. If we have a way to know this, we don't need an ML classifier.

Slide 95

Slide 95 text

Global Robustness Properties. Adversarial Risk: the probability that an input has an adversarial example: Pr_{x ← D} [∃ x′ ∈ Ball_ε(x). f(x′) ≠ class(x′)]. Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody, NeurIPS 2018

Slide 96

Slide 96 text

Global Robustness Properties. Adversarial Risk: the probability that an input has an adversarial example: Pr_{x ← D} [∃ x′ ∈ Ball_ε(x). f(x′) ≠ class(x′)]. Error Region Robustness: the expected distance to the closest adversarial example: E_{x ← D} [inf { δ : ∃ x′ ∈ Ball_δ(x). f(x′) ≠ class(x′) }]. Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody, NeurIPS 2018

Slide 97

Slide 97 text

Recent Global Robustness Results — properties of any model for the input space: the distance to an adversarial example is small relative to the expected distance between two sampled points.

Adversarial Spheres [Gilmer et al., 2018]. Assumption: uniform distribution on two concentric n-spheres. Key result: the expected safe distance (ℓ2-norm) is relatively small.

Adversarial vulnerability for any classifier [Fawzi ×3, 2018]. Assumption: smooth generative model (Gaussian in latent space; generator is L-Lipschitz). Key result: adversarial risk → 1 for relatively small attack strength (ℓ2-norm): P(r(x) ≤ η) ≥ 1 − √(π/2) · e^(−η²/2L²).

Curse of Concentration in Robust Learning [Mahloujifar et al., 2018]. Assumption: normal Lévy families (unit sphere with uniform distribution and ℓ2 norm; Boolean hypercube with uniform distribution and Hamming distance; ...). Key result: if the attack strength exceeds a relatively small threshold, adversarial risk > 1/2.

Slide 98

Slide 98 text

Prediction Change Robustness. Prediction Change: an input x′ ∈ X is an adversarial example for x ∈ X iff x′ ∈ Ball_ε(x) and f(x′) ≠ f(x). Any non-trivial model has adversarial examples: ∃ x_a, x_b ∈ X. f(x_a) ≠ f(x_b). Solutions: − only consider in-distribution inputs (“good” seeds) − output isn't just a class (e.g., confidence) − targeted adversarial examples → cost-sensitive adversarial robustness.

Slide 99

Slide 99 text

Local (Instance) Robustness. Robust Region: for an input x, the robust region is the maximum region with no adversarial example: sup { ε > 0 : ∀ x′ ∈ Ball_ε(x), f(x′) = f(x) }.

Slide 100

Slide 100 text

Local (Instance) Robustness. Robust Region: for an input x, the robust region is the maximum region with no adversarial example: sup { ε > 0 : ∀ x′ ∈ Ball_ε(x), f(x′) = f(x) }. Robust Error: for a test set T and bound ε*: |{ x ∈ T : RobustRegion(x) < ε* }| / |T|.

Slide 101

Slide 101 text

Instance Defense-Robustness. For an input x, the robust-defended region is the maximum region with no undetected adversarial example: sup { ε > 0 : ∀ x′ ∈ Ball_ε(x), f(x′) = f(x) ∨ detected(x′) }. Defense Failure: for a test set T and bound ε*: |{ x ∈ T : RobustDefendedRegion(x) < ε* }| / |T|. Can we verify a defense?
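Given per-seed robust-defended radii (e.g., from a verifier), the defense failure rate is a one-liner; a sketch.

    import numpy as np

    def defense_failure_rate(robust_defended_radii, eps):
        # Fraction of test seeds whose robust-defended region is smaller than eps.
        return float(np.mean(np.asarray(robust_defended_radii) < eps))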

Slide 102

Slide 102 text

Formal Verification of a Defense Instance: exhaustively test all inputs x′ ∈ Ball_ε(x) for correctness or detection. Need to transform the model into a function amenable to verification.

Slide 103

Slide 103 text

Linear Programming. Find values of x that minimize a linear objective c_1 x_1 + c_2 x_2 + c_3 x_3 + … subject to linear constraints: a_11 x_1 + a_12 x_2 + ⋯ ≤ b_1; a_21 x_1 + a_22 x_2 + ⋯ ≤ b_2; x_j ≥ 0; ...

Slide 104

Slide 104 text

Encoding a Neural Network. Linear components (z = Wx + b): convolutional layer, fully-connected layer, batch normalization (in test mode). Non-linear: activations (ReLU, Sigmoid, Softmax), pooling layers (max, avg).

Slide 105

Slide 105 text

Encode ReLU. Mixed Integer Linear Programming adds discrete variables to LP. ReLU (Rectified Linear Unit), y = max(0, x), is piecewise linear. With bounds l ≤ x ≤ u and an indicator a ∈ {0, 1}: y ≥ x; y ≥ 0; y ≤ x − l(1 − a); y ≤ u·a.
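A minimal sketch of this ReLU encoding using the PuLP MILP library (assuming it is installed with its default CBC solver); the bounds l and u and the fixed input value are illustrative.

    from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, value

    l, u = -2.0, 3.0                       # known bounds on the pre-activation x
    prob = LpProblem("relu_encoding", LpMinimize)
    x = LpVariable("x", lowBound=l, upBound=u)
    y = LpVariable("y", lowBound=0.0)      # post-activation; y >= 0 enforced by the lower bound
    a = LpVariable("a", cat=LpBinary)      # indicator: a = 1 when the unit is active

    prob += y >= x
    prob += y <= x - l * (1 - a)
    prob += y <= u * a
    prob += x == 1.7                       # fix the input to check the encoding
    prob += y                              # objective: minimize y
    prob.solve()
    print(value(y))                        # 1.7 == max(0, 1.7)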

Slide 106

Slide 106 text

Mixed Integer Linear Programming (MILP) Intractable in theory (NP-Complete) Efficient in practice (e.g., Gurobi solver) MIPVerify Vincent Tjeng, Kai Xiao, Russ Tedrake Verify NNs using MILP

Slide 107

Slide 107 text

Encode Feature Squeezers. Binary filter: a step function at threshold 0.5 (lower semi-continuous). The actual input is uint8 [0, 1, 2, …, 254, 255]: 127/255 = 0.498 and 128/255 = 0.502, so [0.499, 0.501] is an infeasible gap.

Slide 108

Slide 108 text

Verified L∞ Robustness (ε = 0.1).
Model | Test Accuracy | Robust Error | Robust Error with Binary Filter
Raghunathan et al. | 95.82% | 14.36%–30.81% | 7.37%
Wong & Kolter | 98.11% | 4.38% | 4.25%
Ours with binary filter | 98.94% | 2.66%–6.63% | –
Even without detection, this helps!

Slide 109

Slide 109 text

Encode Detection Mechanism. Original version: score(x) = ‖f(x) − f(squeeze(x))‖_1, where f(x) is the softmax output. Simplify for verification: ℓ1 → maximum difference; softmax → multiple piecewise-linear approximations of the sigmoid.

Slide 110

Slide 110 text

Preliminary Experiments. [Setup: model (4-layer CNN); squeezer: 1-bit depth; input x′ is flagged adversarial if max_diff(y_0, y_1) > τ.] Verification: for a seed x, there is no adversarial input x′ ∈ Ball_ε(x) for which y_1 ≠ f(x) and the input is not detected. Adversarially robust retrained [Wong & Kolter] model, 1000 MNIST test seeds, ε = 0.1 (ℓ∞): 970 infeasible (verified no adversarial example), 13 misclassified (original seed), 17 vulnerable. Robust error: 0.3%. Verification time ~0.2s (compared to 0.8s without binarization).

Slide 111

Slide 111 text

[Chart: scalability vs. precision of the evaluation metric. Formal Verification — MILP solver (MIPVerify), SMT solver (Reluplex), interval analysis (ReluVal); metric: robust error. Certified Robustness — CNN-Cert (Boopathy et al., 2018), Dual-LP (Kolter & Wong 2018), Dual-SDP (Raghunathan et al., 2018); metric: bound. Heuristic Defenses — distillation (Papernot et al., 2016), gradient obfuscation, adversarial retraining (Madry et al., 2017); metric: attack success rate (set of attacks). Feature squeezing is also placed on the chart.]

Slide 112

Slide 112 text

Realistic Threat Models. Knowledge: full access to the target. Goals: find many seed/adversarial example pairs. Resources: limited number of API queries, limited computation. It matters which seed and target classes.

Slide 113

Slide 113 text

[Heatmap: robustness by seed class × target class for the original model (no robustness training). MNIST model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, ℓ∞.]

Slide 114

Slide 114 text

[Heatmap: robustness by seed class × target class for the original model (no robustness training). MNIST model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, ℓ∞.]

Slide 115

Slide 115 text

Training a Robust Network. Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML 2018. Replace the loss with a differentiable function based on an outer bound, computed using a dual network. ReLU (Rectified Linear Unit) → linear approximation.

Slide 116

Slide 116 text

[Heatmap: seed class × target class robustness with standard robustness training (overall robustness goal). MNIST model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, ℓ∞.]

Slide 117

Slide 117 text

Cost-Sensitive Robustness Training (Xiao Zhang and David Evans [ICLR 2019]). Incorporate a cost matrix into robustness training. Cost matrix C: the cost of different adversarial transformations, e.g. for a malware detector (rows = seed class, columns = target class, over benign/malware): C[benign → malware] = 0, C[malware → benign] = 1 (same-class entries not applicable).
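A sketch of how a cost matrix like this can turn pairwise robustness results into a single cost-sensitive metric (the ICLR 2019 method folds the costs into the certified training objective itself; this only illustrates the evaluation side). pairwise_vulnerable is assumed to come from some certification or attack procedure.

    import numpy as np

    def cost_sensitive_robust_cost(pairwise_vulnerable, cost_matrix):
        # pairwise_vulnerable[i, j]: fraction of class-i seeds that can be pushed to class j
        # within the allowed perturbation; cost_matrix[i, j]: cost of that transformation.
        return float(np.sum(np.asarray(pairwise_vulnerable) * np.asarray(cost_matrix)))

    # Example with the malware cost matrix from the slide (classes: benign, malware):
    cost = np.array([[0.0, 0.0],    # benign -> {benign, malware}: no cost
                     [1.0, 0.0]])   # malware -> benign costs 1; same-class entries unused
    print(cost_sensitive_robust_cost([[0.0, 0.2], [0.05, 0.0]], cost))   # -> 0.05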

Slide 118

Slide 118 text

[Heatmap: seed class × target class robustness with standard robustness training (overall robustness goal). MNIST model: 2 convolutional layers, 2 fully-connected layers (100, 10 units); ε = 0.2, ℓ∞.]

Slide 119

Slide 119 text

[Heatmap: seed class × target class robustness with cost-sensitive robustness training, protecting odd classes from evasion.]

Slide 120

Slide 120 text

[Heatmap: seed class × target class robustness with cost-sensitive robustness training, protecting even classes from evasion.]

Slide 121

Slide 121 text

History of the destruction of Troy, 1498 Wrap-Up

Slide 122

Slide 122 text

Security State-of-the-Art. Cryptography — attack success probability: ~2⁻¹²⁸; threat models: information theoretic, resource bounded; proofs: required. Considered seriously broken if an attack method increases the success probability even slightly, even if it requires 2⁶⁴ ciphertexts.

Slide 123

Slide 123 text

Security State-of-the-Art. Cryptography — attack success probability: ~2⁻¹²⁸; threat models: information theoretic, resource bounded; proofs: required. System Security — attack success probability: ~2⁻³²; threat models: capabilities, motivations, rationality; proofs: common. Considered seriously broken if an attack method can succeed in a “lab” environment with even a small probability.

Slide 124

Slide 124 text

Security State-of-the-Art. Cryptography — attack success probability: ~2⁻¹²⁸; threat models: information theoretic, resource bounded; proofs: required. System Security — attack success probability: ~2⁻³²; threat models: capabilities, motivations, rationality; proofs: common. Adversarial Machine Learning — attack success probability: ~2⁻¹; threat models: artificially limited adversary; proofs: making progress! Considered broken if an attack method succeeds with non-negligible probability.

Slide 125

Slide 125 text

Security State-of-the-Art. Cryptography — attack success probability: ~2⁻¹²⁸; threat models: information theoretic, resource bounded; proofs: required. System Security — attack success probability: ~2⁻³²; threat models: capabilities, motivations, rationality; proofs: common. Adversarial Machine Learning — attack success probability: ~2⁻¹; threat models: artificially limited adversary; proofs: making progress! Huge gaps to close: threat models are unrealistic (but real threats unclear); verification techniques only work for tiny models; experimental defenses are often (quickly) broken.

Slide 126

Slide 126 text

Tomorrow: Privacy 125 David Evans University of Virginia [email protected] https://www.cs.virginia.edu/evans