19th International School on Foundations of Security Analysis and Design
Mini-course on "Trustworthy Machine Learning"
https://jeffersonswheel.org/fosad2019
David Evans

• Malicious behavior without detection • Commit check fraud • ... 2. What are the attacker's capabilities? Information: what do they know? Actions: what can they do? Resources: how much can they spend?

... learn plaintext. Chosen-plaintext attack: the adversary has the encryption function as a black box and wants to learn the key (or decrypt some ciphertext). Chosen-ciphertext attack: the adversary has the decryption function as a black box and wants to learn the key (or encrypt some message). [Threat-model dimensions: Goals, Information, Actions, Resources]
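As a concrete toy instance of the chosen-plaintext threat model, here is a sketch (illustrative only: the one-block XOR "cipher" is a deliberately broken stand-in, not a real scheme):

```python
import os

# Toy one-block XOR "cipher": E_k(m) = m XOR k.
# Illustrates the chosen-plaintext threat model: the adversary gets the
# encryption function as a black box and wants to learn the key.
KEY = os.urandom(16)  # the secret the adversary wants to recover

def encryption_oracle(plaintext: bytes) -> bytes:
    """Black-box encryption access granted to the adversary."""
    return bytes(p ^ k for p, k in zip(plaintext, KEY))

def chosen_plaintext_attack() -> bytes:
    # The adversary *chooses* the all-zero plaintext; the ciphertext IS the key.
    return encryption_oracle(bytes(16))

recovered = chosen_plaintext_attack()
assert recovered == KEY
```

One oracle query suffices here, which is exactly why precise threat models (what the adversary may query, and how often) matter.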

Chosen-ciphertext attack; polynomial time/space; semantic security proofs. Can we get to threat models as precise as those used in cryptography? Can we prove strong security notions for those threat models? Current state: "pre-Shannon" (Nicolas Carlini)

"If you're building photo-sharing systems, alchemy is okay. But we're beyond that; now we're building systems that govern healthcare and mediate our civic dialogue." (Ali Rahimi, NIPS 2017)

... gold. Established theory: four elements (earth, fire, water, air). Methodical experiments and lab techniques (Jabir ibn Hayyan, 8th century). Alchemy was wrong and ultimately unsuccessful, but it led to modern chemistry.

Target model: adversarial examples crafted against an ensemble of several local models are more likely to transfer. With local models F1, F2, F3: x* = whiteBoxAttack(ensemble(F1, F2, F3), x). Yanpei Liu, Xinyun Chen, Chang Liu, Dawn Song [ICLR 2017]
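A minimal sketch of the ensemble white-box attack idea (the tiny logistic-regression "models" and the FGSM-style step are illustrative stand-ins for the paper's DNNs and optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical local "models": logistic regressors with different weights.
weights = [rng.normal(size=4) for _ in range(3)]

def ensemble_loss_grad(x, y):
    """Gradient of the summed logistic loss of the ensemble w.r.t. the input x."""
    grad = np.zeros_like(x)
    for w in weights:
        p = 1.0 / (1.0 + np.exp(-w @ x))  # model's probability of class 1
        grad += (p - y) * w               # d/dx of the cross-entropy loss
    return grad

def white_box_attack(x, y, eps=0.5):
    """FGSM-style step against the whole ensemble: x* = x + eps * sign(grad)."""
    return x + eps * np.sign(ensemble_loss_grad(x, y))

x = rng.normal(size=4)
x_adv = white_box_attack(x, y=1)
```

Attacking the joint loss of several local models, rather than any single one, is what makes the resulting x* more likely to transfer to an unseen target.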

Resources: pretrained models for the ensemble (or access to a similar training dataset and resources); a set of starting seeds; unlimited number of API queries. Goals: find one adversarial example for each seed.

Resources: pretrained models for the ensemble (or access to a similar training dataset and resources); a set of starting seeds; limited number of API queries. Goals: find many seed/adversarial-example pairs. Prioritize seeds to attack: use resources to attack the low-cost seeds first.
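The seed-prioritization idea can be sketched with a priority queue (the per-seed cost estimates and the budget below are made up for illustration):

```python
import heapq

def prioritized_attack(seed_costs, query_budget):
    """Greedy batch-attack scheduler: spend the limited query budget on the
    estimated-cheapest seeds first, maximizing seed/AE pairs found."""
    heap = [(cost, seed) for seed, cost in seed_costs.items()]
    heapq.heapify(heap)
    attacked = []
    while heap:
        cost, seed = heapq.heappop(heap)
        if cost > query_budget:
            break                 # cannot afford even the cheapest remaining seed
        query_budget -= cost
        attacked.append(seed)
    return attacked

# Hypothetical per-seed cost estimates (queries expected to find an AE).
costs = {"a": 100, "b": 5, "c": 40, "d": 1000}
order = prioritized_attack(costs, query_budget=150)
```

With a budget of 150 queries the scheduler attacks "b", "c", then "a" and skips the expensive "d", yielding three pairs instead of one.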

AutoZOOM attack on the Robust CIFAR-10 model, always picking the lowest-cost seed. Phase 1: find direct transfers (1,000 queries to find 95 direct transfers). Phase 2: gradient attack (100,000 queries to find 95 more adversarial examples).

! "∗ ! "∗ = % !& '∗ = whiteBoxAttack(!& , ') Maybe they can work against adversaries who don’t have access to training data/similar model? (or transfer loss is high)

Clever adversaries can still find adversarial examples. 2. Build a robust classifier: increase capacity; consider adversaries in training (adversarial training).

Adversarial training: successful adversarial examples against F1 are added to the training data (with correct labels). Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus, 2013.
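The retraining loop can be sketched on a toy problem (illustration only: the logistic model, the Gaussian blobs, the FGSM perturbation, and ε = 0.3 are all invented here, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two Gaussian blobs; "model" f_w(x) = sigmoid(w @ x).
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
w = np.zeros(2)

def grads(w, X, y):
    """Gradients of the logistic loss w.r.t. the weights and the inputs."""
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y), (p - y)[:, None] * w

for step in range(200):
    gw, gx = grads(w, X, y)
    w -= 0.1 * gw                      # train on the current (augmented) data
    if step % 50 == 49:                # periodically generate AEs against f_w
        X_adv = X + 0.3 * np.sign(gx)  # FGSM-style perturbation of the data
        # ... and add them to the training data with the correct labels:
        X, y = np.vstack([X, X_adv]), np.concatenate([y, y])

acc = np.mean((1 / (1 + np.exp(-X @ w)) > 0.5) == y)
```

The key step matching the slide is the last line of the loop: adversarial examples are folded back into the training set with their correct labels.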

Ensemble adversarial training [Florian Tramèr et al., ICLR 2018]: successful adversarial examples against F1 are added to the training data (with correct labels), and adversarial examples are also generated against static pretrained models F2 and F3 and added to the training data.

Clever adversaries can still find adversarial examples.
2. Build a robust classifier
− Adversarial retraining with increased model capacity: very expensive, and assumes you can generate adversarial examples as well as the adversary can.
− If we could build a perfect model, we would!

Clever adversaries can still find adversarial examples.
2. Build a robust classifier
− Adversarial retraining, increasing model capacity, etc.
− If we could build a perfect model, we would!
Our strategy, "feature squeezing": reduce the search space available to the adversary. Weilin Xu, David Evans, Yanjun Qi [NDSS 2018]

Metric Space 2: "Oracle". Before: the adversary finds a small perturbation that changes the class for the classifier but is imperceptible to the oracle. Now: the perturbation must change the class for both the original and the squeezed classifier, while remaining imperceptible to the oracle.

... replaces each pixel with the median of its neighbors (here, a 3×3 median filter). Effective in eliminating "salt-and-pepper" noise (L0 attacks). Image from https://sultanofswing90.wordpress.com/tag/image-processing/
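A minimal 3×3 median-filter squeezer (pure NumPy sketch; real pipelines would use an optimized library routine):

```python
import numpy as np

def median_filter_3x3(img):
    """3x3 median filter (edge-padded): each pixel becomes the median of its
    neighborhood, which wipes out isolated salt-and-pepper (L0-style) pixels."""
    padded = np.pad(img, 1, mode="edge")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + 3, j:j + 3])
    return out

# A flat gray image with one maliciously flipped pixel (an L0 perturbation).
img = np.full((8, 8), 0.5)
img[4, 4] = 1.0
cleaned = median_filter_3x3(img)
```

Because a single flipped pixel can never be the median of its 3×3 neighborhood, the spike at (4, 4) is restored to the background value.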

Other squeezers proposed: thermometer encoding (learnable bit-depth reduction) [J. Buckman et al., Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018]; image denoising using autoencoders, wavelets, JPEG, etc. [D. Meng and H. Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017; A. Prakash et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018]; image resizing and randomization [Mitigating Adversarial Effects Through Randomization, ICLR 2018]; spatial smoothers (median filter, non-local means); ...
Many of these defenses were subsequently circumvented: Anish Athalye, Nicholas Carlini, David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

... there exists some feature squeezer that accurately detects its adversarial examples. Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces the original and adversarial example into the same sample.

Training a detector (MNIST): set the detection threshold to keep the false-positive rate below a target. [Histogram: number of legitimate vs. adversarial examples by maximum L1 distance between original and squeezed input.] Threshold = 0.0029; detection: 98.2%, FP < 4%.
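A toy sketch of the detection rule (the "model", the bit-depth squeezer, and the threshold here are invented for illustration; the real detector compares a DNN's softmax outputs on the original and squeezed inputs):

```python
import numpy as np

def squeeze_bit_depth(x, bits=1):
    """Reduce color bit depth: a simple feature squeezer."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def detect(model, x, threshold):
    """Flag x as adversarial if the model's predictions on the original and
    the squeezed input differ by more than `threshold` in L1 distance."""
    d = np.sum(np.abs(model(x) - model(squeeze_bit_depth(x))))
    return d > threshold

# Hypothetical 'model': softmax over fixed linear scores (stands in for a DNN).
W = np.array([[4.0, -4.0], [-4.0, 4.0]])
def model(x):
    z = W @ np.array([x.mean(), 1 - x.mean()])
    e = np.exp(z - z.max())
    return e / e.sum()

legit = np.full(16, 0.9)        # squeezing barely moves a clean input's output
borderline = np.full(16, 0.45)  # squeezing snaps it to zeros, flipping the output
```

On the clean input the two predictions nearly agree, so it passes; on the borderline input squeezing changes the prediction drastically, so it is flagged.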

Training a detector (ImageNet). [Histogram: number of legitimate vs. adversarial examples by maximum L1 distance between original and squeezed input.] Threshold = 1.24; detection: 85%, FP < 5%.

Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song. Adversarial Example Defenses: Ensembles of Weak Defenses are not Strong. USENIX WOOT '17.
Adaptive attacker's objective: minimize Misclassification(x') + c · Δ(x, x') + k · DetectionScore(x')
(misclassification term + distance term + detection term)

... typically defined in some (simple!) metric space: L0 norm (number of components changed), L2 norm ("Euclidean distance"), L∞ norm (maximum per-component change). Without constraints on Ball_ε, every input has adversarial examples. Prediction-change definition: an input x' ∈ X is an adversarial example for x ∈ X iff x' ∈ Ball_ε(x) and f(x) ≠ f(x').
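The three metric-space constraints can be made concrete in a few lines (a small sketch; the function name is mine):

```python
import numpy as np

def in_ball(x, x_prime, eps, norm):
    """Is x' inside Ball_eps(x) under the given metric?"""
    d = x_prime - x
    if norm == 0:       # L0: number of components changed
        return np.count_nonzero(d) <= eps
    if norm == 2:       # L2: Euclidean distance
        return np.linalg.norm(d) <= eps
    if norm == "inf":   # Linf: largest per-component change
        return np.max(np.abs(d)) <= eps
    raise ValueError(f"unsupported norm: {norm}")

x = np.zeros(4)
```

The same perturbation can be inside one ball and outside another, which is why a threat model must fix the metric before "small perturbation" means anything.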

Error-robustness definition: an input x' ∈ X is an adversarial example for a (correctly classified) x ∈ X iff x' ∈ Ball_ε(x) and f(x') ≠ the true label for x'. A perfect classifier has no (error-robustness) adversarial examples; but if we had a way to know the true label, we wouldn't need an ML classifier.

Recent global robustness results: properties of any model for a given input space show that the distance to an adversarial example is small relative to the expected distance between two sampled points.
• Concentric spheres: for a distribution on two concentric n-spheres, the expected safe distance (L2 norm) is relatively small.
• Adversarial vulnerability for any classifier [Fawzi × 3, 2018]: assume a smooth generative model (Gaussian in latent space; the generator is L-Lipschitz). Then adversarial risk → 1 for relatively small attack strength (L2 norm): P(r(x) ≤ η) ≥ 1 − √(π/2) · e^(−η² / 2L²).
• Curse of Concentration in Robust Learning [Mahloujifar et al., 2018]: for normal Lévy families (unit sphere, uniform, L2 norm; Boolean hypercube, uniform, Hamming distance; ...), if the attack strength exceeds a relatively small threshold, adversarial risk exceeds 1/2: b ≥ √(log(k₁/ε)) / √(k₂ · n) ⟹ Risk_b(h, c) ≥ 1/2.

... an input x' ∈ X is an adversarial example for x ∈ X iff x' ∈ Ball_ε(x) and f(x') ≠ f(x). Any non-trivial model has adversarial examples: ∃x₀, x₁ ∈ X. f(x₀) ≠ f(x₁). Solutions: only consider inputs on the distribution ("good" seeds); output more than just a class (e.g., confidence); consider only targeted adversarial examples; cost-sensitive adversarial robustness.

The robust region for an input x is the maximum region with no adversarial example:
RobustRegion(x) = sup{ ε > 0 : ∀x' ∈ Ball_ε(x), f(x') = f(x) }
Robust error: for a test set S and bound ε*:
|{ x ∈ S : RobustRegion(x) < ε* }| / |S|
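Given verified robust radii for each seed, the robust-error definition is a one-liner (the radii below are hypothetical; a misclassified seed has radius 0):

```python
def robust_error(robust_regions, eps_star):
    """Fraction of test seeds whose robust region is smaller than the bound:
    |{x in S : RobustRegion(x) < eps*}| / |S|."""
    return sum(r < eps_star for r in robust_regions) / len(robust_regions)

# Hypothetical verified radii for five test seeds (0.0 = misclassified seed).
radii = [0.3, 0.05, 0.0, 0.2, 0.15]
err = robust_error(radii, eps_star=0.1)
```

With ε* = 0.1, two of the five seeds fall below the bound, so the robust error is 0.4.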

... is the maximum region with no undetected adversarial example:
RobustDefendedRegion(x) = sup{ ε > 0 : ∀x' ∈ Ball_ε(x), f(x') = f(x) ∨ detected(x') }
Defense failure: for a test set S and bound ε*:
|{ x ∈ S : RobustDefendedRegion(x) < ε* }| / |S|
Can we verify a defense?

Robust error with binary filter, ε = 0.1:
Model | Accuracy | Certified robust error | Empirical robust error
Raghunathan et al. | 95.82% | 14.36%–30.81% | 7.37%
Wong & Kolter | 98.11% | 4.38% | 4.25%
Ours with binary filter | 98.94% | 2.66%–6.63% | −
Even without detection, this helps!

Verification: for a seed x, show there is no adversarial input x' ∈ Ball_ε(x) for which the prediction differs from f(x) and x' is not detected. Adversarially robust retrained [Wong & Kolter] model, 1,000 MNIST test seeds, ε = 0.1 (L∞): 970 infeasible (verified no adversarial example), 13 misclassified (original seed), 17 vulnerable. Robust error: 3.0%. Verification time ~0.2 s per seed (compared to 0.8 s without binarization).

Eric Wong and Zico Kolter. Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope. ICML 2018. Replace the loss with a differentiable function based on an outer bound, computed using a dual network; the key step is a linear approximation of the ReLU (rectified linear unit).
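The ReLU linear approximation can be sketched as the standard "triangle" relaxation over a pre-activation interval (a simplified illustration of the bound such certification methods build on, not the paper's full dual-network construction):

```python
def relu_relaxation(l, u):
    """Triangle (linear) relaxation of ReLU over a pre-activation interval
    [l, u]: returns (slope, intercept) of the upper bounding line plus the
    output interval of ReLU on [l, u]."""
    if u <= 0:                  # ReLU is identically 0 on this interval
        return 0.0, 0.0, (0.0, 0.0)
    if l >= 0:                  # ReLU is the identity on this interval
        return 1.0, 0.0, (l, u)
    slope = u / (u - l)         # upper line through (l, 0) and (u, u)
    return slope, -slope * l, (0.0, u)

slope, intercept, (lo, hi) = relu_relaxation(-1.0, 3.0)
```

For the interval [-1, 3] the upper line is 0.75·z + 0.75, and the ReLU output is confined to [0, 3]; because the relaxation is linear, it can be folded into a differentiable training loss.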

Not all adversarial transformations matter equally: e.g., for classes (benign, malware), a cost matrix C with C(malware → benign) = 1 and C(benign → malware) = 0 encodes that only evasion errors are costly. Incorporate a cost matrix into robustness training. Xiao Zhang and David Evans [ICLR 2019]
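The cost-weighting idea can be sketched as follows (illustration only: the hinge-on-margin loss and all names here are mine, not the paper's training objective):

```python
def cost_sensitive_robust_loss(margins, labels, cost_matrix):
    """Weight each seed's robustness penalty by how costly its potential
    adversarial misclassification is (cost_matrix[true][target])."""
    loss = 0.0
    for margin, y in zip(margins, labels):
        target = 1 - y  # binary case: the only other class
        # hinge on the certified robustness margin (negative = vulnerable)
        loss += cost_matrix[y][target] * max(0.0, -margin)
    return loss

# Hypothetical cost matrix: only malware (1) evading as benign (0) is costly.
C = [[0.0, 0.0],
     [1.0, 0.0]]
# Two seeds, both with negative robustness margin -0.5: one benign, one malware.
total = cost_sensitive_robust_loss([-0.5, -0.5], [0, 1], C)
```

Only the vulnerable malware seed contributes to the loss, so training effort concentrates on the transformations the cost matrix says matter.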

Cryptography — threat models: information-theoretic, resource-bounded; proofs: required. Considered seriously broken if an attack reduces the work factor even slightly below brute force (e.g., to 2^126.1 for a 128-bit key), even if it requires 2^88 ciphertexts.

Cryptography — threat models: information-theoretic, resource-bounded; proofs: required. System security — threat models: adversary capabilities, motivations, rationality; proofs: common. Considered seriously broken if an attack can succeed in a "lab" environment with probability 2^−20.

Cryptography — threat models: information-theoretic, resource-bounded; proofs: required. System security — threat models: adversary capabilities, motivations, rationality; proofs: common. Adversarial machine learning — threat models: artificially limited adversary; proofs: making progress! Considered broken if an attack method succeeds with probability 2^−1.

Cryptography — threat models: information-theoretic, resource-bounded; proofs: required. System security — threat models: adversary capabilities, motivations, rationality; proofs: common. Adversarial machine learning — threat models: artificially limited adversary; proofs: making progress! Huge gaps to close: threat models are unrealistic (but the real threats are unclear); verification techniques only work for tiny models; experimental defenses are often (quickly) broken.