19th International School on Foundations of Security Analysis and Design
Mini-course on "Trustworthy Machine Learning"
https://jeffersonswheel.org/fosad2019 David Evans

Trustworthy Machine Learning
2: Defenses
David Evans, University of Virginia (jeffersonswheel.org)
Bertinoro, Italy, 27 August 2019

Threat Models
1. What are the attacker’s goals?
• Malicious behavior without detection
• Commit check fraud
• ...
2. What are the attacker’s capabilities?
• information: what do they know?
• actions: what can they do?
• resources: how much can they spend?

Threat Models in Cryptography
Dimensions: Goals, Information, Actions, Resources
• Ciphertext-only attack: adversary intercepts a message and wants to learn the plaintext
• Chosen-plaintext attack: adversary has the encryption function as a black box and wants to learn the key (or decrypt some ciphertext)
• Chosen-ciphertext attack: adversary has the decryption function as a black box and wants to learn the key (or encrypt some message)

Threat Models in Cryptography
Dimensions: Goals, Information, Actions, Resources
Resources: polynomial time/space. The adversary has computational resources that scale polynomially in some security parameter (e.g., key size).

Security Goals in Cryptography
Semantic Security: an adversary with the intercepted ciphertext has no advantage over an adversary without it.
Shafi Goldwasser and Silvio Micali developed semantic security in the 1980s (2012 Turing Award recipients).


Threat Models in Adversarial ML?
Cryptography has: ciphertext-only, chosen-plaintext, and chosen-ciphertext attacks; polynomial time/space adversaries; semantic security proofs.
• Can we get to threat models as precise as those used in cryptography?
• Can we prove strong security notions for those threat models?
Current state: “Pre-Shannon” (Nicholas Carlini)

Ali Rahimi, NIPS Test-of-Time Award Speech (Dec 2017):
“If you're building photo-sharing systems alchemy is okay but we're beyond that; now we're building systems that govern healthcare and mediate our civic dialogue”


Alchemy (~700 − 1660)
• Well-defined, testable goal: turn lead into gold
• Established theory: four elements (earth, fire, water, air)
• Methodical experiments and lab techniques (Jabir ibn Hayyan in 8th century)
Wrong and ultimately unsuccessful, but led to modern chemistry.

Attacker Access
White Box: attacker has the model, with full knowledge of all parameters: $f(x) = f_n(f_{n-1}(\dots f_1(x)))$
Black Box (“API Access”): attacker submits $x$ and only receives the output $f(x)$; each model query is “expensive”

Transfer Attacks
Local: $x^* = \mathrm{whiteBoxAttack}(f_1, x)$, where $f_1$ is a local model
External: the target model $f$ is queried with $x^*$ and returns $f(x^*)$
Adversarial examples against one model often transfer to another model.

Improving Transfer Attacks
Local: $x^* = \mathrm{whiteBoxAttack}(\mathrm{ensemble}(f_1, f_2, f_3), x)$
External: the target model $f$ is queried with $x^*$ and returns $f(x^*)$
Adversarial examples against several models are more likely to transfer.
Yanpei Liu, Xinyun Chen, Chang Liu, Dawn Song [ICLR 2017]

Hybrid Attacks
Transfer Attacks
• Efficient: only one API query
• Low success rates: 3% transfer rate for targeted attack on ImageNet (ensemble)
Gradient Attacks
• Expensive: 10k+ queries/seed
• High success rates: 100% for targeted attack on ImageNet
Combine both attacks: efficient + high success
Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries. Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.
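The query cost of gradient attacks comes from estimating the gradient through the API rather than reading it off the model. The sketch below illustrates this with an NES-style estimator (NES is the gradient attack named later in these slides). The `CountingOracle` wrapper, the function names, and all parameter values are assumptions made for illustration; this is not the implementation from the Hybrid Batch Attacks paper.

```python
import numpy as np


class CountingOracle:
    """Hypothetical stand-in for a black-box API: returns a scalar score and counts queries."""

    def __init__(self, score_fn):
        self.score_fn = score_fn
        self.queries = 0

    def __call__(self, x):
        self.queries += 1
        return self.score_fn(x)


def nes_gradient(query, x, sigma=0.01, n_pairs=50, rng=None):
    """Estimate the gradient of query at x from 2 * n_pairs black-box queries.

    Uses antithetic Gaussian samples: each random direction u is queried
    at x + sigma * u and x - sigma * u.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_pairs):
        u = rng.standard_normal(x.shape)
        grad += (query(x + sigma * u) - query(x - sigma * u)) * u
    return grad / (2.0 * sigma * n_pairs)


# Usage on a toy linear score whose true gradient is `a`.
a = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
oracle = CountingOracle(lambda x: float(a @ x))
estimate = nes_gradient(oracle, np.zeros(5), n_pairs=2000,
                        rng=np.random.default_rng(0))
print(oracle.queries)  # 4000 queries for a single gradient estimate
```

Every attack step needs a fresh estimate, which is why the per-seed query count grows so quickly compared with the single query of a transfer attack.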

Realistic Adversary Model?
Knowledge
• Only API access to target
• Good models for ensemble
 − pretrained models
 − (or access to similar training dataset, resources)
• Set of starting seeds
Goals
• Find one adversarial example for each seed
Resources
• Unlimited number of API queries

Batch Attacks
Knowledge
• Only API access to target
• Good models for ensemble
 − pretrained models
 − (or access to similar training dataset, resources)
• Set of starting seeds
Goals
• Find many seed/adversarial example pairs
Resources
• Limited number of API queries
Prioritize seeds to attack: use resources to attack the low-cost seeds first

Requirements
1. There is a high variance across seeds in the cost to find adversarial examples.
2. There are ways to predict in advance which seeds will be easy to attack.

Predicting the Low-Cost Seeds
Strategy 1: Cost of local attack (number of PGD steps to find local AE)
Strategy 2: Loss function on target
[Figure: NES gradient attack on robust CIFAR-10 model]
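The effect of seed prioritization under a fixed query budget can be sketched with a small simulation. Everything here is hypothetical: `predicted_cost` stands for whichever predictor is used (for example, the number of local PGD steps), `true_cost` is the number of API queries the attack would actually need, and the numbers are made up.

```python
def batch_attack(predicted_cost, true_cost, budget):
    """Attack seeds in order of predicted cost until the query budget runs out.

    Returns the indices of seeds for which an adversarial example was found.
    """
    order = sorted(range(len(true_cost)), key=lambda i: predicted_cost[i])
    remaining = budget
    successes = []
    for i in order:
        if remaining <= 0:
            break
        if true_cost[i] <= remaining:
            remaining -= true_cost[i]
            successes.append(i)
        else:
            # The attack on this seed consumes the rest of the budget and fails.
            remaining = 0
    return successes


true_cost = [500, 50, 2000, 100]       # queries needed per seed (made up)
pgd_steps = [3, 1, 4, 2]               # hypothetical predictor: local PGD steps
print(batch_attack(pgd_steps, true_cost, budget=700))     # [1, 3, 0]
print(batch_attack([0, 1, 2, 3], true_cost, budget=700))  # [0, 1]
```

With the same budget, attacking in predicted-cost order yields three adversarial examples in this toy setting, against two for the arbitrary order. The gain depends on both requirements above: high variance in cost and a predictor that ranks seeds well.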

Defense Strategies
1. Hide the gradients
 − Transferability results: $x^* = \mathrm{whiteBoxAttack}(f_1, x)$ computed on a local model $f_1$ can then be submitted to the target model $f$
Maybe hiding gradients can work against adversaries who don’t have access to training data or a similar model (or when transfer loss is high)?

Adversarial Training
Training Data → Training Process → Candidate Model → Adversarial Example Generator
Successful AEs against the candidate model: add to training data (with correct labels)
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013


Ensemble Adversarial Training
Training Data → Training Process → Candidate Model → Adversarial Example Generator
Successful AEs against the candidate model: add to training data (with correct labels)
Static Models: each static model has its own Adversarial Example Generator, which supplies AEs against that static model
Florian Tramer, et al. [ICLR 2018]
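The adversarial training loop in the diagram can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the procedure from either cited paper: the candidate model is a logistic regression, the adversarial example generator is a one-step sign-gradient (FGSM-style) attack, and the data, function names, and hyperparameters are all made up.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(X, y, steps=300, lr=0.5):
    """Training process: fit logistic regression by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b


def predict(X, w, b):
    return (X @ w + b > 0).astype(float)


def generate_adversarial(X, y, w, b, eps):
    """AE generator: move each input by eps along the sign of the loss gradient."""
    grad_x = (sigmoid(X @ w + b) - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad_x)


def adversarial_training(X, y, eps, rounds=3):
    X_train, y_train = X.copy(), y.copy()
    w, b = train(X_train, y_train)                      # candidate model
    for _ in range(rounds):
        X_adv = generate_adversarial(X, y, w, b, eps)
        fooled = predict(X_adv, w, b) != y              # successful AEs only
        # Add successful AEs to the training data with their correct labels.
        X_train = np.vstack([X_train, X_adv[fooled]])
        y_train = np.concatenate([y_train, y[fooled]])
        w, b = train(X_train, y_train)
    return w, b


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = adversarial_training(X, y, eps=0.5)
```

For the ensemble variant, the same loop would additionally draw adversarial examples from generators run against fixed, pre-trained static models.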

Defense Strategies
1. Hide the gradients
 − Transferability results
 − Clever adversaries can still find adversarial examples
2. Build a robust classifier
 − Adversarial retraining with increased model capacity
  Very expensive
  Assumes you can generate adversarial examples as well as the adversary
 − If we could build a perfect model, we would!

Defense Strategies
1. Hide the gradients
 − Transferability results
 − Clever adversaries can still find adversarial examples
2. Build a robust classifier
 − Adversarial retraining, increasing model capacity, etc.
 − If we could build a perfect model, we would!
Our strategy: “Feature Squeezing”: reduce the search space available to the adversary
Weilin Xu, David Evans, Yanjun Qi [NDSS 2018]

Coalescing by Feature Squeezing
Metric Space 1: Target Classifier
Metric Space 2: “Oracle”
Before: find a small perturbation that changes the class for the classifier but is imperceptible to the oracle.
Now: change the class for both the original and the squeezed classifier, while remaining imperceptible to the oracle.

Spatial Smoothing: Median Filter
Replace a pixel with the median of its neighbors.
Effective in eliminating “salt-and-pepper” noise ($L_0$ attacks)
[Figure: 3×3 median filter; image from https://sultanofswing90.wordpress.com/tag/image-processing/]
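A median filter is short enough to write out directly. The sketch below assumes a 2-D grayscale image stored as a NumPy array, an odd window size, and reflection padding at the border; those choices are illustrative and may differ from the squeezers evaluated in the paper.

```python
import numpy as np


def median_filter(img, k=3):
    """Replace each pixel by the median of the k x k window centered on it (k odd)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out


# A single "salt" pixel on a black image is removed entirely.
img = np.zeros((5, 5))
img[2, 2] = 1.0
print(median_filter(img).max())  # 0.0
```

An attack that changes only a few pixels by a large amount (small $L_0$ distance) is exactly the kind of perturbation a median can discard.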

Spatial Smoothing: Non-local Means
Replace a patch with a weighted mean of similar patches (in region):
$p' = \sum_i w(p, q_i) \times q_i$
Preserves edges, while removing noise.


Other Potential Squeezers
• Thermometer Encoding (learnable bit depth reduction)
• Image denoising using autoencoder, wavelet, JPEG, etc.
• Image resizing
• ...
• Spatial Smoothers: median filter, non-local means
C Xie, et al. Mitigating Adversarial Effects Through Randomization, ICLR 2018.
J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018.
D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017.
A Prakash, et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018.
...
Anish Athalye, Nicholas Carlini, David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

“Feature Squeezing” (Vacuous) Conjecture
For any distance-limited adversarial method, there exists some feature squeezer that accurately detects its adversarial examples.
Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces the original and adversarial example into the same sample.

Feature Squeezing Detection
Model (7-layer CNN)
Input → Model → Prediction0 ($y_0$)
Input → Bit Depth-1 → Model → Prediction1 ($y_1$)
Input → Median 2×2 → Model → Prediction2 ($y_2$)
Detection rule: $\max(L_1(y_0, y_1), L_1(y_0, y_2)) > T$. Yes: Adversarial. No: Legitimate.
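The detection rule compares the model's prediction on the raw input with its predictions on squeezed copies. A minimal sketch follows, assuming inputs scaled to [0, 1] and a model that returns a probability vector. `toy_model`, the threshold value, and the example inputs are hypothetical stand-ins, not the 7-layer CNN or the tuned threshold from the paper.

```python
import numpy as np


def reduce_bit_depth(x, bits):
    """Squeeze values in [0, 1] down to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels


def squeeze_score(model, x, squeezers):
    """Maximum L1 distance between the prediction on x and on each squeezed x."""
    y0 = model(x)
    return max(float(np.abs(y0 - model(s(x))).sum()) for s in squeezers)


def is_adversarial(model, x, squeezers, threshold):
    return squeeze_score(model, x, squeezers) > threshold


def toy_model(x):
    """Hypothetical two-class model: softmax over two fixed pixel sums."""
    logits = 5.0 * np.array([x[0] + x[3], x[1] + x[2]])
    e = np.exp(logits - logits.max())
    return e / e.sum()


squeezers = [lambda x: reduce_bit_depth(x, 1)]
clean = np.array([0.0, 1.0, 1.0, 0.0])       # already binary: squeezing is a no-op
perturbed = np.array([0.1, 0.9, 0.9, 0.1])   # squeezing maps it back to `clean`
print(squeeze_score(toy_model, clean, squeezers))          # 0.0
print(squeeze_score(toy_model, perturbed, squeezers) > 0)  # True
```

Further squeezers, such as a median filter, can be appended to the `squeezers` list; the score is the maximum over all of them.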

Training a detector (MNIST)
[Figure: histogram of the maximum $L_1$ distance between original and squeezed input (x-axis, 0.0 to 2.0) against Number of Examples (y-axis, 0 to 800), for Legitimate and Adversarial inputs]
Set the detection threshold to keep the false positive rate below target.
threshold = 0.0029; detection: 98.2%, FP < 4%
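Choosing the threshold from legitimate examples alone can be sketched as picking an order statistic of their scores. The function names and the synthetic scores below are assumptions for illustration; the slide's threshold of 0.0029 came from the actual MNIST experiment, not from this code.

```python
import numpy as np


def choose_threshold(legit_scores, target_fpr):
    """Smallest observed score that leaves at most target_fpr of legitimate scores above it."""
    scores = np.sort(np.asarray(legit_scores, dtype=float))
    allowed = min(int(np.floor(target_fpr * len(scores))), len(scores) - 1)
    return float(scores[len(scores) - 1 - allowed])


def false_positive_rate(legit_scores, threshold):
    """Fraction of legitimate inputs that the rule `score > threshold` would flag."""
    return float(np.mean(np.asarray(legit_scores) > threshold))


legit_scores = np.arange(100) / 1000.0   # synthetic scores, for illustration only
threshold = choose_threshold(legit_scores, target_fpr=0.05)
print(false_positive_rate(legit_scores, threshold) <= 0.05)  # True
```

The detection rate is then measured separately on adversarial examples, since the threshold itself is fixed without looking at them.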

ImageNet Configuration
Model (MobileNet)
Input → Model → Prediction0 ($y_0$)
Input → Bit Depth-5 → Model → Prediction1 ($y_1$)
Input → Median 2×2 → Model → Prediction2 ($y_2$)
Input → Non-local Mean → Model → Prediction3 ($y_3$)
Detection rule: $\max(L_1(y_0, \{y_1, y_2, y_3\})) > T$. Yes: Adversarial. No: Legitimate.

Threat Models
Oblivious attack: The adversary has full knowledge of the target model, but is not aware of the detector.
Adaptive attack: The adversary has full knowledge of the target model and the detector.

Adaptive Adversarial Examples
Attack: Mean $L_2$
• Untargeted: 2.80
• Targeted (next): 4.14
• Targeted (least likely): 4.67
No successful adversarial examples were found for images originally labeled as 3 or 8.

(Redefining) Adversarial Example
Prediction Change Definition: An input, $x' \in X$, is an adversarial example for $x \in X$, iff $x' \in \mathrm{Ball}_\epsilon(x)$ and $f(x) \ne f(x')$.

Adversarial Example
$\mathrm{Ball}_\epsilon(x)$ is some space around $x$, typically defined in some (simple!) metric space: $L_0$ norm (# different), $L_2$ norm (“Euclidean distance”), $L_\infty$.
Without constraints on $\mathrm{Ball}_\epsilon$, every input has adversarial examples.
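The three distance measures are easy to state in code. A minimal sketch, with function names chosen here for illustration:

```python
import numpy as np


def l0_distance(x, x_adv):
    """Number of coordinates that differ."""
    return int(np.count_nonzero(x != x_adv))


def l2_distance(x, x_adv):
    """Euclidean distance."""
    return float(np.linalg.norm((x - x_adv).ravel()))


def linf_distance(x, x_adv):
    """Largest change to any single coordinate."""
    return float(np.abs(x - x_adv).max())


def in_ball(x, x_adv, eps, distance):
    """Membership test for the ball of radius eps around x under the given distance."""
    return distance(x, x_adv) <= eps


x = np.zeros(4)
x_adv = np.array([0.0, 0.3, 0.0, -0.4])
print(l0_distance(x, x_adv))                  # 2
print(in_ball(x, x_adv, 0.4, linf_distance))  # True
print(in_ball(x, x_adv, 0.4, l2_distance))    # False
```

The same perturbation can lie inside the ball under one distance and outside it under another, so a robustness claim is only meaningful together with the distance and the radius.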

Adversarial Example
Any non-trivial model has adversarial examples: $\exists x_1, x_2 \in X.\ f(x_1) \ne f(x_2)$


Prediction Error Robustness
Error Robustness: An input, $x' \in X$, is an adversarial example for (correct) $x \in X$, iff $x' \in \mathrm{Ball}_\epsilon(x)$ and $f(x') \ne$ true label for $x'$.
A perfect classifier has no (error robustness) adversarial examples.
If we had a way to know the true label for $x'$, we wouldn’t need an ML classifier.

Global Robustness Properties
Adversarial Risk: probability an input has an adversarial example
$\Pr_{x \leftarrow D}[\exists x' \in \mathrm{Ball}_\epsilon(x).\ f(x') \ne \mathrm{class}(x')]$
Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody, NeurIPS 2018

Recent Global Robustness Results
Adversarial Spheres [Gilmer et al., 2018]
 Assumption: uniform distribution on two concentric $n$-spheres
 Key result: expected safe distance ($L_2$-norm) is relatively small.
Adversarial vulnerability for any classifier [Fawzi × 3, 2018]
 Assumption: smooth generative model: 1. Gaussian in latent space. 2. Generator is L-Lipschitz.
 Key result: adversarial risk ⟶ 1 for relatively small attack strength ($L_2$-norm): $P(r(x) \le \eta) \ge 1 - \sqrt{\pi/2}\, e^{-\eta^2 / 2L^2}$
Curse of Concentration in Robust Learning [Mahloujifar et al., 2018]
 Assumption: Normal Lévy families
 • Unit sphere, uniform, $L_2$ norm
 • Boolean hypercube, uniform, Hamming distance
 • ...
 Key result: if attack strength exceeds a relatively small threshold, adversarial risk > 1/2: $b > \sqrt{\log(k_1/\varepsilon)} / \sqrt{k_2 \cdot n} \implies \mathrm{Risk}_b(h, c) \ge 1/2$
Properties of any model for input space: distance to AE is small relative to expected distance between two sampled points

Prediction Change Robustness
Prediction Change: An input, $x' \in X$, is an adversarial example for $x \in X$, iff $x' \in \mathrm{Ball}_\epsilon(x)$ and $f(x') \ne f(x)$.
Any non-trivial model has adversarial examples: $\exists x_1, x_2 \in X.\ f(x_1) \ne f(x_2)$
Solutions:
 − only consider inputs from the distribution (“good” seeds)
 − output isn’t just class (e.g., confidence)
 − targeted adversarial examples: cost-sensitive adversarial robustness


Local (Instance) Robustness
Robust Region: For an input $x$, the robust region is the maximum region with no adversarial example:
$\sup\{\epsilon > 0 \mid \forall x' \in \mathrm{Ball}_\epsilon(x),\ f(x') = f(x)\}$
Robust Error: For a test set, $T$, and bound, $\epsilon^*$:
$|\{x \in T,\ \mathrm{RobustRegion}(x) < \epsilon^*\}| \,/\, |T|$
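Once a robust region has been computed (or lower-bounded) for every test seed, the robust error is a simple count. In the sketch below, the robust-region values are made-up numbers and `robust_error` is a name chosen for illustration; computing the regions themselves is the hard part and is what verification addresses.

```python
import numpy as np


def robust_error(robust_regions, eps_bound):
    """Fraction of test seeds whose robust region is smaller than the bound."""
    regions = np.asarray(robust_regions, dtype=float)
    return float(np.mean(regions < eps_bound))


# Hypothetical robust-region radii for four test seeds.
regions = [0.5, 0.05, 0.2, 0.01]
print(robust_error(regions, eps_bound=0.1))  # 0.5
```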

Instance Defense-Robustness
For an input $x$, the robust-defended region is the maximum region with no undetected adversarial example:
$\sup\{\epsilon > 0 \mid \forall x' \in \mathrm{Ball}_\epsilon(x),\ f(x') = f(x) \lor \mathrm{detected}(x')\}$
Defense Failure: For a test set, $T$, and bound, $\epsilon^*$:
$|\{x \in T,\ \mathrm{RobustDefendedRegion}(x) < \epsilon^*\}| \,/\, |T|$
Can we verify a defense?

Formal Verification of Defense Instance
Exhaustively test all inputs $x' \in \mathrm{Ball}_\epsilon(x)$ for correctness or detection.
Need to transform the model into a function amenable to verification.
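To make "amenable to verification" concrete, the sketch below checks an $L_\infty$ ball around one input for a tiny ReLU network. It deliberately swaps in interval bound propagation, a simpler technique than MILP: it is sound (a `True` answer is a proof) but incomplete (a `False` answer only means "not proven"). The network weights and function names are made up for illustration.

```python
import numpy as np


def interval_affine(lo, hi, W, b):
    """Propagate the box [lo, hi] through x -> W @ x + b."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b


def verify_linf(x, eps, layers, label):
    """Return True if every input within eps of x (L-infinity) provably gets `label`.

    `layers` is a list of (W, b) pairs with ReLU between consecutive layers.
    """
    lo, hi = x - eps, x + eps
    for k, (W, b) in enumerate(layers):
        lo, hi = interval_affine(lo, hi, W, b)
        if k < len(layers) - 1:
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)  # ReLU is monotone
    # Proven robust if the label's worst-case logit beats every other logit's best case.
    return bool(lo[label] > np.delete(hi, label).max())


layers = [
    (np.eye(2), np.zeros(2)),
    (np.array([[1.0, -1.0], [-1.0, 1.0]]), np.zeros(2)),
]
x = np.array([1.0, 0.0])
print(verify_linf(x, 0.1, layers, label=0))  # True
print(verify_linf(x, 0.6, layers, label=0))  # False
```

A MILP encoding represents each ReLU exactly with a binary variable instead of relaxing it to an interval, which is what makes that approach complete at the price of worst-case exponential time.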

Mixed Integer Linear Programming (MILP)
• Intractable in theory (NP-Complete)
• Efficient in practice (e.g., Gurobi solver)
MIPVerify (Vincent Tjeng, Kai Xiao, Russ Tedrake): verify NNs using MILP

Verified $L_\infty$ Robustness
Model | Test Accuracy | Robust Error (ε = 0.1) | Robust Error with Binary Filter
Raghunathan et al. | 95.82% | 14.36%-30.81% | 7.37%
Wong & Kolter | 98.11% | 4.38% | 4.25%
Ours with binary filter | 98.94% | 2.66-6.63% | -
Even without detection, this helps!

Preliminary Experiments
Model (4-layer CNN)
Input $x'$ → Model → $y_0$
Input $x'$ → Bit Depth-1 → Model → $y_1$
Detection rule: max_diff($y_0$, $y_1$) > $T$. Yes: Adversarial. No: $y_1$ valid.
Verification: for a seed $x$, there is no adversarial input $x' \in \mathrm{Ball}_\epsilon(x)$ for which $y_1 \ne f(x)$ and which is not detected
Adversarially robust retrained [Wong & Kolter] model
1000 test MNIST seeds, $\epsilon = 0.1$ ($L_\infty$)
• 970 infeasible (verified no adversarial example)
• 13 misclassified (original seed)
• 17 vulnerable
Robust error: 0.3%
Verification time ~0.2s (compared to 0.8s without binarization)

Realistic Threat Models
Knowledge
• Full access to target
Goals
• Find many seed/adversarial example pairs
Resources
• Limited number of API queries
• Limited computation
It matters which seed and target classes

[Figure: seed class by target class, Original Model (no robustness training)]
MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units)
$\epsilon = 0.2$, $L_\infty$


Training a Robust Network
Replace the loss with a differentiable function based on an outer bound, computed using a dual network.
[Figure: ReLU (Rectified Linear Unit) and its linear approximation]
Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML 2018.

Cost-Sensitive Robustness Training
Cost-matrix: cost of different adversarial transformations
$C = \begin{pmatrix} - & 0 \\ 1 & - \end{pmatrix}$ (rows and columns ordered benign, malware)
Incorporate a cost-matrix into robustness training
Xiao Zhang and David Evans [ICLR 2019]
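One way to read the cost matrix is as a weighting over seed/target class pairs when counting vulnerable examples. The sketch below assumes that rows index the seed class and columns the target class, and treats the undefined diagonal entries as zero; the vulnerability counts are made up, and this is an evaluation-style illustration rather than the training objective from the paper.

```python
import numpy as np


def cost_weighted_vulnerability(cost_matrix, vulnerable_counts):
    """Sum of cost[j][k] * (number of class-j seeds that can be pushed to class k).

    Diagonal entries are ignored: staying in the same class is not an attack.
    """
    C = np.asarray(cost_matrix, dtype=float)
    V = np.asarray(vulnerable_counts, dtype=float)
    off_diagonal = ~np.eye(C.shape[0], dtype=bool)
    return float((C * V)[off_diagonal].sum())


# Classes ordered (benign, malware): only malware -> benign evasion carries cost.
cost = [[0, 0],
        [1, 0]]
# Hypothetical counts: 30 benign seeds can be pushed to "malware", 5 malware seeds to "benign".
vulnerable = [[0, 30],
              [5, 0]]
print(cost_weighted_vulnerability(cost, vulnerable))  # 5.0
```

Under this matrix, the 30 benign-to-malware transformations contribute nothing, so robustness effort can be concentrated on the five transformations that matter to the defender.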

Security State-of-the-Art
Field | Attack success probability | Threat models | Proofs
Cryptography | $2^{-128}$ | information theoretic, resource bounded | required
Considered seriously broken if attack method increases to 2^(...) even if it requires 2^(...) ciphertexts.

Security State-of-the-Art
Field | Attack success probability | Threat models | Proofs
System Security | $2^{-32}$ | capabilities, motivations, rationality | common
Considered seriously broken if attack method can succeed in “lab” environment with probability 2^(...).

Security State-of-the-Art
Field | Attack success probability | Threat models | Proofs
Adversarial Machine Learning | $2^{-1}$ | artificially limited adversary | making progress!
Considered broken if attack method succeeds with probability 2^(...).

Security State-of-the-Art
Huge gaps to close:
• threat models are unrealistic (but real threats unclear)
• verification techniques only work for tiny models
• experimental defenses often (quickly) broken