FOSAD Trustworthy Machine Learning: Class 2

Trustworthy Machine Learning David Evans University of Virginia jeffersonswheel.org Bertinoro,
Italy 27 August 2019 19th International School on Foundations of Security Analysis and Design 2: Defenses

Recap/Plan Monday (Yesterday) Introduction / Attacks Tuesday (Today) Threat Models
Defenses Wednesday Privacy + 1

Questions 2

Threat Models 3 1. What are the attacker’s goals? •
Malicious behavior without detection • Commit check fraud • ... 2. What are the attacker’s capabilities? information: what do they know actions: what they can do resources: how much they can spend?

Threat Models in Cryptography Ciphertext-only attack Intercept message, want to
learn plaintext Chosen-plaintext attack Adversary has encryption function as black box, wants to learn key (or decrypt some ciphertext) Chosen-ciphertext attack Adversary has decryption function as black box, wants to learn key (or encrypt some message) 4 Goals Information Actions Resources

Threat Models in Cryptography 5 Goals Information Actions Resources Polynomial
time/space: adversary has computational resources that scale polynomially in some security parameter (e.g., key size)

Security Goals in Cryptography 6 First formal notions of cryptography,
information theory Claude Shannon (1940s)

Security Goals in Cryptography 7 Semantic Security: adversary with intercepted
ciphertext has no advantage over adversary without it Shafi Goldwasser and Silvio Micali Developed semantic security in 1980s (2013 Turing Awardees)

Threat Models in Adversarial ML? 8 Ciphertext-only attack Chosen-plaintext attack
Chosen-ciphertext attack Polynomial time/space Semantic Security proofs Can we get to threat models as precise as those used in cryptography? Can we prove strong security notions for those threat models?

Threat Models in Adversarial ML? 9 Ciphertext-only attack Chosen-plaintext attack
Chosen-ciphertext attack Polynomial time/space Semantic Security proofs Can we get to threat models as precise as those used in cryptography? Can we prove strong security notions for those threat models? Current state: “Pre-Shannon” (Nicolas Carlini)

10 Ali Rahimi NIPS Test-of-Time Award Speech (Dec 2017) ”If
you're building photo- sharing systems alchemy is okay but we're beyond that; now we're building systems that govern healthcare and mediate our civic dialogue”

11 Ali Rahimi NIPS Test-of-Time Award Speech (Dec 2017) ”If
you're building photo- sharing systems alchemy is okay but we're beyond that; now we're building systems that govern healthcare and mediate our civic dialogue”

Alchemy (~700 − 1660) Well-defined, testable goal turn lead into
gold Established theory four elements: earth, fire, water, air Methodical experiments and lab techniques (Jabir ibn Hayyan in 8th century) Wrong and ultimately unsuccessful, but led to modern chemistry.

“Realistic” Threat Model for Adversarial ML 13

Attacker Access White Box Attack has model: full knowledge of
all parameters Black Box 14 ! " = ! $ ! % &' … ! ' !(") ! " !(") Each model query is “expensive” Only receives output “API Access”

ML-as-a-Service 15

Black-Box Attacks 16 PGD Attack !" # = ! for
% iterations: !&'( # = project0,2 (!& # − 5 ⋅ sign(∇ < !& # , = ) !# = !? ′ Can we execute these attacks if we don’t have the model?

Black-Box Optimization Attacks 17 ! " !(") Black-Box Gradient Attack
"% & = " for ( iterations: use queries to estimate ∇ * "+ & , -

Black-Box Optimization Attacks 18 ! " !(") Black-Box Gradient Attack
"% & = " for ( iterations: use queries to estimate ∇ * "+ & , - "+./ & = take step in “white-box” attack using estimated gradients "+./ &

Black-Box Gradient Attacks 19 Hybrid Batch Attacks: Finding Black-box Adversarial
Examples with Limited Queries. Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.

Transfer Attacks 20 ! "∗ ! "∗ = % !&
Target Model '∗ = whiteBoxAttack(!& , ') Adversarial examples against one model, often transfer to another model. External Local

Improving Transfer Attacks 22 ! "∗ ! "∗ = %
Target Model Adversarial examples against several models, more likely to transfer. External Local !& !' !( "∗ = whiteBoxAttack(6(!& , !' , !( ), ") Yanpei Liu, Xinyun Chen, Chang Liu, Dawn Song [ICLR 2017]

Hybrid Attacks Transfer Attacks Efficient: only one API query Low
success rates - 3% transfer rate for targeted attack on ImageNet (ensemble) Gradient Attacks Expensive: 10k+ queries/seed High success rates - 100% for targeted attack on Imagenet 23 Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries. Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020. Combine both attacks: efficient + high success

Hybrid Attack 24 ! "∗ ! "∗ = % External
Local !& !' !( "∗ = whiteBoxAttack(6(!& , !' , !( ), ") 1: Transfer Attack

Hybrid Attack 25 ! "∗ ! "∗ ≠ % External
Local !& !' !( "∗ = whiteBoxAttack(7(!& , !' , !( ), ") 1: Transfer Attack ": ; "< ; "∗ 2: Gradient Attack (starting from transfer candidate)

Hybrid Attack 26 ! "∗ ! "∗ ≠ % External
Local !& !' !( "∗ = whiteBoxAttack(7(!& , !' , !( ), ") 1: Transfer Attack ": ; "< ; "∗ 2: Gradient Attack 3: Tune Local Models using label byproducts

27 Dataset / Model Direct Transfer Rate Gradient Attack (AutoZoom)
Hybrid Attack Success Rate Queries per AE Success Rate Queries per AE MNIST (Targeted) 61.6 90.9 1,645 98.8 298 CIFAR10 (Targeted) 63.3 92.2 1,227 98.1 227 ImageNet (Targeted) 3.4 95.4 45,166 98.0 30,089

Realistic Adversary Model? Knowledge Only API access to target Good
models for ensemble − pretrained models − (or access to similar training dataset, resources) Set of starting seeds Goals Find one adversarial example for each seed 28 Resources Unlimited number of API queries

Batch Attacks Knowledge Only API access to target Good models
for ensemble − pretrained models − (or access to similar training dataset, resources) Set of starting seeds Goals Find many seed/adversarial example pairs 29 Resources Limited number of API queries Prioritize seeds to attack: use resources to attack the low-cost seeds first

Requirements 1. There is a high variance across seeds in
the cost to find adversarial examples. 2. There are ways to predict in advance which seeds will be easy to attack. 30

31 Variation in Query cost of NES Gradient Attack Excludes
direct transfers

Predicting the Low-Cost Seeds Strategy 1: Cost of local attack
number of PGD steps to find local AE Strategy 2: Loss function on target 32 NES gradient attack on robust CIFAR-10 model

What about Direct Transfers? Strategy 1: Cost of local attack
number of PGD steps to find local AE Strategy 2: Loss function on target 33

Direct Transfers Cost of local attack number of PGD steps
to find local AE 34

Two-Phase Hybrid Attack 35 Retroactive Optimal: unrealizable strategy that always
picks lowest cost seed

picks lowest cost seed Phase 1: Find Direct Transfers (1000 queries to find 95 direct transfers) AutoZOOM attack on Robust CIFAR-10 Model

picks lowest cost seed Phase 1: Find Direct Transfers (1000 queries to find 95 direct transfers) AutoZOOM attack on Robust CIFAR-10 Model Phase 2: Gradient Attack (100,000 queries to find 95 direct transfers)

Cost of Hybrid Batch Attacks 38 Target Model Prioritization Total
Queries (Standard Error) Goal (number of seeds attacked) 1% 2% 10% CIFAR-10 (Robust) 1000 seeds “Optimal” 10.0 (0.0) 20.0 (0.0) 107.8 (17.4) Two-Phase 20.4 (2.1) 54.2 (5.6) 826.2 (226.6) Random 24,054 (132) 45,372 (260) 251,917 (137) ImageNet 100 seeds “Optimal” 1.0 (0.0) 2.0 (0.0) 34,949 (3,742) Two-Phase 28.0 (2.0) 38.6 (7.5) 78,844 (11,837) Random 15,046 (423) 45,136 (1,270) 285,855 (8,045)

Defenses 39

How can we construct models that make it hard for
adversaries to find adversarial examples? 40

Defense Strategies 1. Hide the gradients 41

Defense Strategies 1. Hide the gradients − Transferability results 42
! "∗ ! "∗ = % !& '∗ = whiteBoxAttack(!& , ')

Defense Strategies 1. Hide the gradients − Transferability results 43
! "∗ ! "∗ = % !& '∗ = whiteBoxAttack(!& , ') Maybe they can work against adversaries who don’t have access to training data/similar model? (or transfer loss is high)

44 Visualization by Nicholas Carlini

Defense Strategies 1. Hide the gradients − Clever adversaries can
still find adversarial examples 48 ICML 2018 (Best Paper award)

Defense Strategies 1. Hide the gradients − Transferability results −
Clever adversaries can still find adversarial examples 2. Build a robust classifier 49 Increase capacity

Increasing Model Capacity 50 Image from Aleksander Mądry, et al.
2017

Clever adversaries can still find adversarial examples 2. Build a robust classifier 51 Increase capacity Consider adversaries in training: adversarial training

Adversarial Training (Example from Yesterday) 52 Training Data ML Algorithm
Training Clone 010110011 01 EvadeML Deployment Why didn’t this work?

Adversarial Training Training Data Training Process Candidate Model !" Adversarial
Example Generator Successful AEs against !" add to training data (with correct labels) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013

Ensemble Adversarial Training Training Data Training Process Candidate Model !"
Adversarial Example Generator Successful AEs against !" add to training data (with correct labels) Florian Tramer, et al. [ICLR 2018]

Ensemble Adversarial Training Training Data Training Process Candidate Model !"
Adversarial Example Generator Successful AEs against !" add to training data (with correct labels) Florian Tramer, et al. [ICLR 2018] Static Model !# Adversarial Example Generator AEs against !# Static Model !$ Adversarial Example Generator AEs against !$

Formalizing Adversarial Training 56 2017 min$ (&(',))∼, ℒ(.$ , /,
0)) Regular training:

Formalizing Adversarial Training 57 2017 min$ (&(',))∼, max/∈1 (ℒ(3$ ,
4 + 6, 7) ) min$ (&(',))∼, ℒ(3$ , 4, 7)) Regular training: Adversarial training: Simulate with PGD attack with multiple restarts

Attacking Robust Models 58 Dataset / Model Direct Transfer Rate
Gradient Attack (AutoZoom) Hybrid Attack Success Rate Queries per AE Success Rate Queries per AE MNIST (Targeted) 61.6 90.9 1,645 98.8 298 CIFAR10 (Targeted) 63.3 92.2 1,227 98.1 227 MNIST-Robust (Untar’d) 2.9 7.2 52,182 7.3 51,328 CIFAR10 Robust (Untar’d) 9.5 64.4 2,640 65.2 2,529 Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries. Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.

Clever adversaries can still find adversarial examples 2. Build a robust classifier − Adversarial retraining with increased model capacity Very expensive Assumes you can generate adversarial examples as well as adversary − If we could build a perfect model, we would! 59

Clever adversaries can still find adversarial examples 2. Build a robust classifier − Adversarial retraining, increasing model capacity, etc. − If we could build a perfect model, we would! 60 Our strategy: “Feature Squeezing”: reduce the search space available to the adversary Weilin Xu, David Evans, Yanjun Qi [NDSS 2018]

Model Model Squeezer 1 Prediction0 Prediction1 "($%&'( , $%&'* ,
… , $%&', ) Input Adversarial Legitimate Model’ Squeezer k … Predictionk Feature Squeezing Detection Framework Weilin Xu Yanjun Qi

Model Model Squeezer 1 Prediction0 Prediction1 "($%&'( , $%&'* ,
… , $%&', ) Input Adversarial Legitimate Model’ Squeezer k … Predictionk Feature Squeezing Detection Framework Feature Squeezer coalesces similar inputs into one point: • Barely change legitimate inputs. • Destruct adversarial perturbations.

Coalescing by Feature Squeezing 63 Metric Space 1: Target Classifier
Metric Space 2: “Oracle” Before: find a small perturbation that changes class for classifier, but imperceptible to oracle. Now: change class for both original and squeezed classifier, but imperceptible to oracle.

Fast Gradient Sign [Yesterday] 64 original 0.1 0.2 0.3 0.4
0.5 Adversary Power: ! "# -bounded adversary: max(abs(*+ −*+ -)) ≤ ! *- = * − ! ⋅ sign(∇* 6(*, 8)) Goodfellow, Shlens, Szegedy 2014

Bit Depth Reduction 0 0.1 0.2 0.3 0.4 0.5 0.6
0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 3-bit 1-bit 8-bit Reduce to 1-bit !" = round(!" ×2)/2 Reduce to 1-bit !" = round(!" ×2)/2 [0.312 0.271 …… 0.159 0.651] X* [0.012 0.571 …… 0.159 0.951] X Input Output 65 [0. 1. …… 0. 1. ] [0. 0. …… 0. 1. ] Signal Quantization Adversarial Example Normal Example

Bit Depth Reduction 66 Seed 1 1 4 2 2
1 1 1 1 1 CW 2 CW ∞ BIM FGSM

Accuracy with Bit Depth Reduction 67 Dataset Squeezer Adversarial Examples
(FGSM, BIM, CW∞ , Deep Fool, CW2 , CW0 , JSMA) Legitimate Images MNIST None 13.0% 99.43% 1-bit Depth 62.7% 99.33% ImageNet None 2.78% 69.70% 4-bit Depth 52.11% 68.00%

Spatial Smoothing: Median Filter Replace a pixel with median of
its neighbors. Effective in eliminating ”salt-and-pepper” noise (!" attacks) 68 Image from https://sultanofswing90.wordpress.com/tag/image-processing/ 3×3 Median Filter

Spatial Smoothing: Non-local Means Replace a patch with weighted mean
of similar patches (in region). 69 ! "# "$ !% = '((!, "+ )×"+ Preserves edges, while removing noise.

70 Airplane 94.4% Truck 99.9% Automobile 56.5% Airplane 98.4% Airplane
99.9% Ship 46.0% Airplane 98.3% Airplane 80.8% Airplane 70.0% Median Filter (2×2) Non-local Means (13-3-4) Original BIM (L ∞ ) JSMA (L 0 )

Accuracy with Spatial Smoothing 71 Dataset Squeezer Adversarial Examples (FGSM,
BIM, CW∞ , Deep Fool, CW2 , CW0 ) Legitimate Images ImageNet None 2.78% 69.70% Median Filter 2*2 68.11% 65.40% Non-local Means 11-3-4 57.11% 65.40%

Other Potential Squeezers 72 C Xie, et al. Mitigating Adversarial
Effects Through Randomization, ICLR 2018. J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018. D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017; A Prakash, et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018;... Thermometer Encoding (learnable bit depth reduction) Image denoising using autoencoder, wavelet, JPEG, etc. Image resizing ... Spatial Smoothers: median filter, non-local means

Other Potential Squeezers 73 C Xie, et al. Mitigating Adversarial
Effects Through Randomization, ICLR 2018. J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018. D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017; A Prakash, et al., Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018;... Thermometer Encoding (learnable bit depth reduction) Image denoising using autoencoder, wavelet, JPEG, etc. Image resizing ... Spatial Smoothers: median filter, non-local means Anish Athalye, Nicholas Carlini, David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML 2018.

“Feature Squeezing” (Vacuous) Conjecture For any distance-limited adversarial method, there
exists some feature squeezer that accurately detects its adversarial examples. 74 Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces original and adversarial example into same sample.

Feature Squeezing Detection Model (7-layer CNN) Model Model Bit Depth-
1 Median 2×2 Prediction0 Prediction1 Prediction2 Yes Input Adversarial No Legitimate max '( )* , )( , '( )* , )2 > -

Detecting Adversarial Examples Distance between original input and its squeezed
version Adversarial inputs (CW attack) Legitimate inputs

77 0 200 400 600 800 0.0 0.4 0.8 1.2
1.6 2.0 Number of Examples Legitimate Adversarial Maximum !" distance between original and squeezed input threshold = 0.0029 detection: 98.2%, FP < 4% Training a detector (MNIST) set the detection threshold to keep false positive rate below target

ImageNet Configuration Model (MobileNet) Model Model Bit Depth- 5 Median
2×2 Prediction0 Prediction1 Prediction2 Yes Input Adversarial No Legitimate max(() (*+ , {*) , *. , */ }) > 3 Model Non-local Mean Prediction3

79 0 20 40 60 80 100 120 140 0.0
0.4 0.8 1.2 1.6 2.0 Legitimate Adversarial Maximum !" distance between original and squeezed input threshold = 1.24 detection: 85%, FP < 5% Training a detector (ImageNet)

Aggregated Detection Results Dataset Squeezers Threshold False Positive Rate Detection
Rate (SAEs) ROC-AUC Exclude FAEs MNIST Bit Depth (1-bit), Median (2x2) 0.0029 3.98% 98.2% 99.44% CIFAR-10 Bit Depth (5-bit), Median (2x2), Non-local Mean (13-3-2) 1.1402 4.93% 84.5% 95.74% ImageNet Bit Depth (5-bit), Median (2x2), Non-local Mean (11-3-4) 1.2128 8.33% 85.9% 94.24% 80

Threat Models Oblivious attack: The adversary has full knowledge of
the target model, but is not aware of the detector. Adaptive attack: The adversary has full knowledge of the target model and the detector. 81

Adaptive Adversary Adaptive CW 2 attack, unbounded adversary: Warren He,
James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song, Adversarial Example Defense: Ensembles of Weak Defenses are not Strong, USENIX WOOT’17. !"#"!"$% & '( − * + , ∗ Δ ', '( + 0 ∗ 12 3456%('′) 82 Misclassification term Distance term Detection term

Adaptive Adversarial Examples 83 No successful adversarial examples were found
for images originally labeled as 3 or 8. Mean L2 2.80 4.14 4.67 Attack Untargeted Targeted (next) Targeted (least likely)

Adaptive Adversary Success Rates 84 0.68 0.06 0.01 0.44 0.01
0.24 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Adversary’s Success Rate Clipped ε Targeted (Next) Targeted (LL) Untargeted Unbounded Typical !

Model Model Model Squeezer 1 Squeezer 2 Prediction0 Prediction1 Prediction2
#(%&'() , %&'(+ , … , %&'(- ) Yes Input Adversarial No Legitimate Model’ Squeezer k … Predictionk Defender’s Entropy Advantage random seed

Counter Measure: Randomization Binary filter threshold := 0.5 threshold :=
! 0.5, 0.0625 86 0 0.2 0.4 0.6 0.8 1 0 0.5 1 0 0.2 0.4 0.6 0.8 1 0 0.5 1 Strengthen the adaptive adversary Attack an ensemble of 3 detectors with thresholds: [0.4, 0.5, 0.6]

87 2.80, Untargeted 4.14, Targeted-Next 4.67, Targeted-LL 3.63, Untargeted 5.48,
Targeted-Next 5.76, Targeted-LL Attack Deterministic Detector Mean L2 Attack Randomized Detector

Are defenses against adversarial examples even possible? 88

(Redefining) Adversarial Example 89 Prediction Change Definition: An input, !′
∈ $, is an adversarial example for ! ∈ $, iff ∃!& ∈ Ball* (!) such that - ! ≠ - !& .

Adversarial Example 90 Ball$ (&) is some space around &,
typically defined in some (simple!) metric space: () norm (# different), (* norm (“Euclidean distance”), (+ Without constraints on Ball$ , every input has adversarial examples. Prediction Change Definition: An input, &′ ∈ /, is an adversarial example for & ∈ /, iff ∃&1 ∈ Ball$ (&) such that 2 & ≠ 2 &1 .

Adversarial Example 91 Any non-trivial model has adversarial examples: ∃"#
, "% ∈ '. ) "# ≠ )("% ) Prediction Change Definition: An input, -′ ∈ ', is an adversarial example for - ∈ ', iff ∃-/ ∈ Ball3 (-) such that ) - ≠ ) -/ .

Prediction Error Robustness 92 Error Robustness: An input, !′ ∈
$, is an adversarial example for (correct) ! ∈ $, iff ∃!& ∈ Ball* (!) such that - !′ ≠ true label for !′. Perfect classifier has no (error robustness) adversarial examples.

Prediction Error Robustness 93 Error Robustness: An input, !′ ∈
$, is an adversarial example for (correct) ! ∈ $, iff ∃!& ∈ Ball* (!) such that - !′ ≠ true label for !′. Perfect classifier has no (error robustness) adversarial examples. If we have a way to know this, don’t need an ML classifier.

Global Robustness Properties 94 Adversarial Risk: probability an input has
an adversarial example Pr # ← % [∃ () ∈ +,--. ( . 0 () ≠ class (′ ] Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad Mahmoody, NeurIPS 2018

Global Robustness Properties 95 Dimitrios I. Diochnos, Saeed Mahloujifar, Mohammad
Mahmoody, NeurIPS 2018 Adversarial Risk: probability an input has an adversarial example Pr # ← % [∃ () ∈ +,--. ( . 0 () ≠ class (′ ] Error Region Robustness: expected distance to closest AE: 8 # ← % [inf { =: ∃ () ∈ +,--. ( . 0 () ≠ class () }]

Assumption Key Result Adversarial Spheres [Gilmer et al., 2018] Uniform
distribution on two concentric !-spheres Expected safe distance ("# -norm) is relatively small. Adversarial vulnerability for any classifier [Fawzi × 3, 2018] Smooth generative model: 1. Gaussian in latent space. 2. Generator is L-Lipschitz. Adversarial risk ⟶ 1 for relatively small attack strength ("# -norm). Curse of Concentration in Robust Learning [Mahloujifar et al., 2018] Normal Lévy families • Unit sphere, uniform, "# norm • Boolean hypercube, uniform, Hamming distance ... If attack strength exceeds a relatively small threshold, adversarial risk > 1/2. b > p log(k1/") p k2 · n ! Riskb(h, c) 1/2 Recent Global Robustness Results P(r(x)  ⌘) 1 r ⇡ 2 e ⌘2/2L2 Properties of any model for input space: distance to AE is small relative to expected distance between two sampled points

Prediction Change Robustness 97 Prediction Change: An input, !′ ∈
$, is an adversarial example for ! ∈ $, iff ∃!& ∈ Ball* (!) such that - !′ ≠ - ! . Any non-trivial model has adversarial examples: ∃!0 , !2 ∈ $. - !0 ≠ -(!2 ) Solutions: - only consider distribution inputs (“good” seeds) - output isn’t just class (e.g., confidence) - targeted adversarial examples cost-sensitive adversarial robustness

Local (Instance) Robustness 98 Robust Region: For an input !,
the robust region is the maximum region with no adversarial example: sup % > 0 ∀)* ∈ Ball/ ) , 1 )* = 1 ) }

Local (Instance) Robustness 99 Robust Region: For an input !,
the robust region is the maximum region with no adversarial example: sup % > 0 ∀)* ∈ Ball/ ) , 1 )* = 1 ) } Robust Error: For a test set, 4, and bound, %5 : | ) ∈ 4, RobustRegion ) < %5 } | 4|

Instance Defense-Robustness 100 For an input !, the robust-defended region
is the maximum region with no undetected adversarial example: sup % > 0 ∀)* ∈ Ball/ ) , 1 )* = 1 ) ⋁ 45657654(!*)} Defense Failure: For a test set, ;, and bound, %< : | ) ∈ ;, RobustDefendedRegion ) < %< } | ;| Can we verify a defense?

Formal Verification of Defense Instance exhaustively test all inputs in
∀"# ∈ Ball( " for correctness or detection Need to transform model into a function amenable to verification

Linear Programming !"" #" + !"% #% + ⋯ ≤
(" !%" #" + !%% #% + ⋯ ≤ (% #) ≤ 0 ... Find values of + that minimize linear function under constraints: ," #" + ,% #% + ,- #- + …

Encoding a Neural Network Linear Components (! = #$ +
&) Convolutional Layer Fully-connected Layer Batch Normalization (in test mode) Non-linear Activation (ReLU, Sigmoid, Softmax) Pooling Layer (max, avg) 103

Encode ReLU Mixed Integer Linear Programming adds discrete values to
LP ReLU (Rectified Linear Unit ) ! = max(0, )) + ∈ 0, 1 ! ≥ ) ! ≥ 0 ! ≤ ) − 1 1 − + ! ≤ 2+ 1 2 Piecewise Linear

Mixed Integer Linear Programming (MILP) Intractable in theory (NP-Complete) Efficient
in practice (e.g., Gurobi solver) MIPVerify Vincent Tjeng, Kai Xiao, Russ Tedrake Verify NNs using MILP

Encode Feature Squeezers Binary Filter 0.5 1 0 Actual Input:
uint8 [0, 1, 2, … 254, 255] 127 / 255 = 0.498 128 / 255 = 0.502 An infeasible gap [0.499, 0.501] Lower semi-continuous

Verified L ∞ Robustness Model Test Accuracy Robust Error ε
= 0.1 Robust Error with Binary Filter Raghunathan et al. 95.82% 14.36%-30.81% 7.37% Wong & Kolter 98.11% 4.38% 4.25% Ours with binary filter 98.94% 2.66-6.63% - Even without detection, this helps!

Encode Detection Mechanism Original version: Simplify for verification: !" ⟶
maximum difference softmax ⟶ multiple piecewise-linear approximate sigmoid score(*) = - * − -(squeeze * ) " where f(x) is softmax output

Preliminary Experiments 109 Model (4-layer CNN) Model Bit Depth-1 Yes
Input !’ Adversarial No y1 valid max_diff +, , +. > 0 Verification: for a seed !, there is no adversarial input !1 ∈ Ball5 ! for which +. ≠ 7 ! and not detected Adversarially robust retrained [Wong & Kolter] model 1000 test MNIST seeds, 8 = 0.1 (=> ) 970 infeasible (verified no adversarial example) 13 misclassified (original seed) 17 vulnerable Robust error: 0.3% Verification time ~0.2s (compared to 0.8s without binarization)

110 Scalability Formal Verification MILP solver (MIPVerify) SMT solver (Reluplex)
Interval analysis (Reluval) robust error Heuristic Defenses distillation (Papernot et al., 2016) gradient obfuscation adversarial retraining (Madry et al., 2017) attack success rate (set of attacks) Certified Robustness CNN-Cert (Boopathy et al., 2018) Dual-LP (Kolter & Wong 2018) Dual-SDP (Raghunathan et al., 2018) bound Evaluation Metric precise feature squeezing

Realistic Threat Models Knowledge Full access to target Goals Find
many seed/adversarial example pairs 111 Resources Limited number of API queries Limited computation It matters which seed and target classes

112 target class Original Model (no robustness training) seed class
target class MNIST Model 2 convolutional layers 2 fully-connected layers (100, 10 units) ! = 0.2, '(

113 target class Original Model (no robustness training) seed class
target class MNIST Model 2 convolutional layers 2 fully-connected layers (100, 10 units) ! = 0.2, '(

Training a Robust Network Eric Wong and J. Zico Kolter.
Provable defenses against adversarial examples via the convex outer adversarial polytope. ICML 2018. replace loss with differentiable function based on outer bound using dual network ReLU (Rectified Linear Unit ) linear approximation ! "

115 seed class target class Standard Robustness Training (overall robustness
goal) MNIST Model 2 convolutional layers 2 fully-connected layers (100, 10 units) ! = 0.2, '(

Cost-Sensitive Robustness Training 116 Xiao Zhang Cost-matrix: cost of different
adversarial transformations ! = − 0 1 − benign malware benign malware Incorporate a cost-matrix into robustness training Xiao Zhang and David Evans [ICLR 2019]

117 seed class target class Standard Robustness Training (overall robustness
goal) MNIST Model 2 convolutional layers 2 fully-connected layers (100, 10 units) ! = 0.2, '(

118 seed class target class Cost- Sensitive Robustness Training Protect
odd classes from evasion

119 seed class target class Cost- Sensitive Robustness Training Protect
even classes from evasion

History of the destruction of Troy, 1498 Wrap-Up

Security State-of-the-Art Attack success probability Threat models Proofs Cryptography !−#!$
information theoretic, resource bounded required 121 Considered seriously broken if attack method increases to !%#!& even if it requires 2() ciphertexts.

information theoretic, resource bounded required System Security !−%! capabilities, motivations, rationality common 122 Considered seriously broken if attack method can succeed in “lab” environment with probability 2'(.

information theoretic, resource bounded required System Security !−%! capabilities, motivations, rationality common Adversarial Machine Learning !−# artificially limited adversary making progress! 123 Considered broken if attack method succeeds with probability 2'.

information theoretic, resource bounded required System Security !−%! capabilities, motivations, rationality common Adversarial Machine Learning !−# artificially limited adversary making progress! 124 Huge gaps to close: threat models are unrealistic (but real threats unclear) verification techniques only work for tiny models experimental defenses often (quickly) broken

Tomorrow: Privacy 125 David Evans University of Virginia [email protected] https://www.cs.virginia.edu/evans

FOSAD Trustworthy Machine Learning: Class 2

FOSAD Trustworthy Machine Learning: Class 2

More Decks by David Evans

Other Decks in Education

Featured

Transcript