
Lessons from the Last 3000 Years of Adversarial Examples

Invited Talk at Huawei Strategy and Technology Workshop (STW)
Shenzhen, China
15 May 2018

David Evans


Transcript

  1. Lessons from the Last 3000 Years of Adversarial Examples David

    Evans University of Virginia evadeML.org Huawei STW Shenzhen, China 15 May 2018
  2. … and can solve all Security Problems! Spam, IDS, Malware, Fake Accounts, … “Fake News”
  3. Training (supervised learning): Labelled Training Data → Feature Extraction → Vectors → ML Algorithm → Trained Classifier.
     Deployment: Operational Data → Feature Extraction → Trained Classifier → Malicious / Benign.
  4. The same training and deployment pipeline, with its key assumption made explicit: Training Data is Representative.
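As an illustration of the training-then-deployment pipeline on slides 3-4, here is a minimal sketch in Python; the byte-histogram features, the random forest model, and the toy data are illustrative assumptions, not anything from the talk.

```python
# Minimal sketch: labelled data -> feature extraction -> training -> deployment.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_features(sample: bytes) -> np.ndarray:
    """Toy feature extractor: normalized byte histogram of a raw sample."""
    counts = np.bincount(np.frombuffer(sample, dtype=np.uint8), minlength=256)
    return counts / max(len(sample), 1)

# Training (supervised learning): labelled samples -> feature vectors -> model.
labelled = [(b"MZ\x90benign-ish bytes", 0), (b"MZ\x90malicious-ish bytes", 1)]
X = np.stack([extract_features(s) for s, _ in labelled])
y = [label for _, label in labelled]
clf = RandomForestClassifier(n_estimators=10).fit(X, y)

# Deployment: operational data goes through the same feature extraction.
print(clf.predict(extract_features(b"new operational sample").reshape(1, -1)))
```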
  5. Adversarial Examples 8
     “panda” + 0.007 × [perturbation] = “gibbon”
     Example from: Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples. 2014 (in ICLR 2015)
  6. Papers on “Adversarial Examples” (Google Scholar) 9
     [Bar chart: papers per year, 2013 through 2018 (as of 5/13); y-axis 0 to 1200; the 2018 bar is labeled 654.]
  7. [Image: History of the destruction of Troy, 1498] 12
     The Trojans hauled the lumber into the city themselves, their hearts fearing no clever, packed ambush of the Argives. Homer, The Iliad (~1200 BCE)
  8. Adversarial Examples across Domains 13
     Domain               | Classifier Space                     | Oracle Space
     Trojan Wars          | Judgment of Trojans: f(x) = “gift”   | Physical Reality: f*(x) = invading army
     Malware              | Malware Detector: f(x) = “benign”    | Victim’s Execution: f*(x) = malicious behavior
     Image Classification | DNN Classifier: f(x) = y             | Human Perception: f*(x) = z
  9. Adversarial Examples across Domains 14: the same table, annotated “Today”.
  10. Adversarial Examples across Domains 15: the same table, annotated “Fixing (Breaking?) the Definition”.
  11. Adversarial Examples across Domains 16: the same table in condensed form, annotated “Fixing (Breaking?) the Definition”.
  12. Goal of Machine Learning Classifier 17 Classifier Space (DNN Model)

    “Oracle” Space (human perception) Model and visualization based on work by Beilun Wang, Ji Gao and Yanjun Qi (ICLR 2017 Workshop)
  13. Well-Trained Classifier 18 Model and visualization based on work by

    Beilun Wang, Ji Gao and Yanjun Qi (ICLR 2017 Workshop) Classifier Space (DNN Model) “Oracle” Space (human perception)
  14. Adversarial Examples 19 Model and visualization based on work by

    Beilun Wang, Ji Gao and Yanjun Qi (ICLR 2017 Workshop) Classifier Space (DNN Model) “Oracle” Space (human perception)
  15. Misleading Visualization 20
     Classifier Space (DNN Model)
     Cartoon: 2 dimensions; few samples near boundaries; every sample near 1-3 classes.
     Reality: thousands of dimensions; all samples near boundaries; every sample near all classes.
  16. Adversarial Examples 21
     Adversary’s goal: find a small perturbation that changes the class assigned by the classifier, but is imperceptible to the oracle.
     Classifier Space (DNN Model) “Oracle” Space (human perception)
  17. Adversarial Examples Definition 22
     Given a seed sample x, x′ is an adversarial example iff:
       f(x′) = t          Class is t (targeted)
       Δ(x, x′) ≤ ε       Difference below threshold
     Δ(x, x′) is defined in some (simple!) metric space: the L0 norm (number of features that differ), L1, the L2 norm (“Euclidean distance”), or L∞.
     Assumption (to map to the earlier definition): a small perturbation does not change the class in Oracle space.
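The distance metrics named in this definition can be written down directly; a minimal sketch, assuming flattened numpy inputs in [0, 1]:

```python
# Distance metrics between a seed x and a candidate adversarial example x_adv.
import numpy as np

def l0(x, x_adv):
    """Number of features that differ."""
    return int(np.sum(x != x_adv))

def l2(x, x_adv):
    """Euclidean distance."""
    return float(np.linalg.norm(x - x_adv))

def linf(x, x_adv):
    """Largest change to any single feature."""
    return float(np.max(np.abs(x - x_adv)))
```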
  18. L∞ Adversary: Fast Gradient Sign 23
     [Images: the original and adversarial versions at ε = 0.1, 0.2, 0.3, 0.4, 0.5]
     Adversary power ε:  max|x′ − x| < ε
     x′ = x − ε · sign(∇ lossF(x))
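A hedged sketch of the fast gradient sign step as written on this slide (targeted variant, stepping down the target-class loss); the PyTorch framing and the cross-entropy loss are assumptions, not the talk's code:

```python
# One fast-gradient-sign step: x' = x - epsilon * sign(grad of target-class loss).
import torch
import torch.nn.functional as F

def fgsm_targeted(model, x, target, epsilon):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), target)   # loss toward the target class
    loss.backward()
    # Step down the target-class loss, then clip back to the valid pixel range.
    x_adv = x - epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```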
  19. Many Other Adversarial Methods 24
     [Figure: original image + perturbation = adversarial example. An original “1” (100% confidence) becomes a “4” at 100% under BIM (L∞), a “2” at 99.9% under JSMA (L0), and a “2” at 83.8% under CW2 (L2).]
  20. Defense Strategies 29
     1. Hide the gradients − Transferability results − Clever adversaries can still find adversarial examples
  21. Defense Strategies 30
     1. Hide the gradients − Transferability results − Clever adversaries can still find adversarial examples
     2. Build a robust classifier − Adversarial retraining, increasing model capacity, etc.
  22. Defense Strategies 31
     1. Hide the gradients − Transferability results − Clever adversaries can still find adversarial examples
     2. Build a robust classifier − Adversarial retraining, increasing model capacity, etc. − If we could build a perfect model, we would!
  23. Defense Strategies 32
     1. Hide the gradients − Transferability results − Clever adversaries can still find adversarial examples
     2. Build a robust classifier − Adversarial retraining, increasing model capacity, etc. − If we could build a perfect model, we would!
     Our strategy: “Feature Squeezing”, reducing the search space available to the adversary. Weilin Xu, Yanjun Qi
  24. Detection Framework
     [Diagram: the input is fed to the model directly (Prediction0) and through Squeezer 1, …, Squeezer k into copies of the model (Prediction1, …, Predictionk); a score over (Prediction0, Prediction1, …, Predictionk) is compared to a threshold: Yes → Adversarial, No → Legitimate.]
  25. Detection Framework
     [Same diagram as above.]
     A Feature Squeezer coalesces similar inputs into one point:
     • Little change for legitimate inputs.
     • Destroys adversarial perturbations.
  26. Bit Depth Reduction 35
     [Plot: signal quantization curves mapping input values in [0, 1] to output values in [0, 1] at 8-bit, 3-bit, and 1-bit depth.]
     Reduce to 1-bit: xi′ = round(xi × (2¹ − 1)) / (2¹ − 1) = round(xi)
     Normal example:      X  = [0.012 0.571 …… 0.159 0.951]  →  [0. 1. …… 0. 1.]
     Adversarial example: X* = [0.312 0.271 …… 0.159 0.651]  →  [0. 0. …… 0. 1.]
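A minimal sketch of bit depth reduction as a squeezer, following the i-bit quantization formula reconstructed above (pixel values assumed in [0, 1]):

```python
# Quantize each pixel to 2**bits levels, then rescale back to [0, 1].
import numpy as np

def reduce_bit_depth(x, bits=1):
    """round(x * (2**bits - 1)) / (2**bits - 1)."""
    levels = 2 ** bits - 1
    return np.round(np.asarray(x) * levels) / levels

print(reduce_bit_depth([0.012, 0.571, 0.159, 0.951], bits=1))  # -> [0. 1. 0. 1.]
```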
  27. Bit Depth Reduction 36
     [Figure: a seed digit “1” and its adversarial versions under FGSM, BIM, CW∞, and CW2, with predicted labels after squeezing: 1, 1, 4, 2, 2, 1, 1, 1, 1, 1.]
  28. Spatial Smoothing: Median Filter 37
     Replace each pixel with the median of its neighbors. Effective in eliminating “salt-and-pepper” noise (L0 attacks).
     [Image: 3×3 median filter illustration, from https://sultanofswing90.wordpress.com/tag/image-processing/]
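A minimal sketch of the median-filter squeezer; the use of SciPy and the per-channel handling are assumptions:

```python
# Replace each pixel with the median of its size x size neighborhood.
import numpy as np
from scipy.ndimage import median_filter

def median_smooth(image, size=2):
    """Apply a size x size median filter; color channels are filtered independently."""
    if image.ndim == 3:
        return np.stack([median_filter(image[..., c], size=size)
                         for c in range(image.shape[-1])], axis=-1)
    return median_filter(image, size=size)
```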
  29. Spatial Smoothing: Non-local Means 38
     Replace a patch p with a weighted mean of similar patches qi found in a search region:  p′ = Σi w(p, qi) × qi
     Preserves edges, while removing noise.
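A hedged sketch of a non-local means squeezer using scikit-image (library choice is an assumption); the slide's 13-3-x style settings roughly correspond to search window, patch size, and filter strength, mapped here onto skimage's patch_distance, patch_size, and h:

```python
# Weighted average of similar patches found within the search window.
from skimage.restoration import denoise_nl_means

def nl_means_smooth(image, patch_size=3, patch_distance=6, h=0.08):
    """Replace each patch by a weighted mean of similar nearby patches."""
    return denoise_nl_means(image, patch_size=patch_size,
                            patch_distance=patch_distance, h=h,
                            channel_axis=-1)  # last axis holds color channels
```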
  30. [CIFAR-10 predictions before and after squeezing] 39
                                  Original         BIM (L∞)        JSMA (L0)
     No squeezing                 Airplane 94.4%   Truck 99.9%     Automobile 56.5%
     Median Filter (2×2)          Airplane 98.4%   Airplane 99.9%  Ship 46.0%
     Non-local Means (13-3-4)     Airplane 98.3%   Airplane 80.8%  Airplane 70.0%
  31. Other Potential Squeezers 40
     Thermometer Encoding (learnable bit depth reduction); image denoising using autoencoder, wavelet, JPEG, etc.; image resizing; ...
     C Xie, et al. Mitigating Adversarial Effects Through Randomization, ICLR 2018.
     J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial Examples, ICLR 2018.
     D Meng and H Chen. MagNet: a Two-Pronged Defense against Adversarial Examples, CCS 2017.
     A Prakash, et al. Deflecting Adversarial Attacks with Pixel Deflection, CVPR 2018. ...
  32. “Feature Squeezing” (Vacuous) Conjecture 41
     For any distance-limited adversarial method, there exists some feature squeezer that accurately detects its adversarial examples.
     Intuition: if the perturbation is small (in some simple metric space), there is some squeezer that coalesces the original and the adversarial example into the same sample.
  33. Feature Squeezing Detection
     [Diagram: the input goes to the model (a 7-layer CNN) directly and through two squeezers, Bit Depth (1-bit) and Median (2×2), yielding Prediction0, Prediction1, Prediction2.]
     Flag the input as adversarial if  max( L1(pred0, pred1), L1(pred0, pred2) ) > T
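The detection rule on this slide can be sketched directly; `model` returning a softmax probability vector and the squeezer callables are assumptions:

```python
# Flag the input when the largest L1 gap between the prediction on the raw
# input and the predictions on its squeezed versions exceeds the threshold T.
import numpy as np

def is_adversarial(model, x, squeezers, threshold):
    pred0 = model(x)                                  # prediction on the original input
    dists = [np.sum(np.abs(pred0 - model(sq(x))))     # L1 distance per squeezer
             for sq in squeezers]
    return max(dists) > threshold

# e.g. squeezers = [lambda im: reduce_bit_depth(im, 1), lambda im: median_smooth(im, 2)]
```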
  34. Detecting Adversarial Examples 43 Distance between original input and its

    squeezed version Adversarial inputs (CW attack) Legitimate inputs
  35. Training a detector (MNIST) 44
     [Histogram: number of examples vs. maximum L1 distance between original and squeezed input, for legitimate and adversarial inputs.]
     Set the detection threshold to keep the false positive rate below a target: threshold = 0.0029 gives 98.2% detection with FP < 4%.
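A hedged sketch of choosing the threshold from legitimate inputs only, as the slide describes; the quantile-based selection and variable names are assumptions:

```python
# Pick the score at the (1 - target FP rate) quantile of benign scores, so at
# most roughly that fraction of legitimate inputs gets flagged as adversarial.
import numpy as np

def pick_threshold(legitimate_scores, target_fp_rate=0.05):
    """legitimate_scores: max-L1 detector scores computed on benign inputs."""
    return float(np.quantile(legitimate_scores, 1.0 - target_fp_rate))
```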
  36. Aggregated Detection Results 45
     Dataset   | Squeezers                                                  | Threshold | False Positive Rate | Detection Rate (SAEs) | ROC-AUC (excluding FAEs)
     MNIST     | Bit Depth (1-bit), Median (2×2)                            | 0.0029    | 3.98%               | 98.2%                 | 99.44%
     CIFAR-10  | Bit Depth (5-bit), Median (2×2), Non-local Means (13-3-2)  | 1.1402    | 4.93%               | 84.5%                 | 95.74%
     ImageNet  | Bit Depth (5-bit), Median (2×2), Non-local Means (11-3-4)  | 1.2128    | 8.33%               | 85.9%                 | 94.24%
  37. Threat Models Oblivious attack: The adversary has full knowledge of

    the target model, but is not aware of the detector. Adaptive attack: The adversary has full knowledge of the target model and the detector. 46
  38. Adaptive Adversary 47
     Adaptive CW2 attack, unbounded adversary: Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song. Adversarial Example Defense: Ensembles of Weak Defenses are not Strong. USENIX WOOT ’17.
     minimize:  misclassification(x′, t)  +  c · Δ(x, x′)  +  k · detection_score(x′)
                (misclassification term)     (distance term)   (detection term)
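A rough sketch of the adaptive objective reconstructed above; the weights c and k, the squared L2 distance, and the helper callables are illustrative assumptions:

```python
# Combined objective the adaptive attacker minimizes over x_adv.
def adaptive_objective(x, x_adv, target, model_loss, detector_score, c=1.0, k=1.0):
    return (model_loss(x_adv, target)            # misclassification term
            + c * ((x_adv - x) ** 2).sum()       # distance term
            + k * detector_score(x_adv))         # detection term
```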
  39. Adaptive Adversarial Examples 48
     No successful adversarial examples were found for images originally labeled as 3 or 8.
     Mean L2 distance by attack: Untargeted 2.80, Targeted (next) 4.14, Targeted (least likely) 4.67.
  40. Adaptive Adversary Success Rates 49
     [Plot: adversary’s success rate vs. clipped ε for Untargeted, Targeted (Next), and Targeted (LL) attacks, comparing an unbounded adversary with a typical ε; reported success rates include 0.68, 0.44, and 0.24 (unbounded) falling to 0.06, 0.01, and 0.01 when ε is clipped.]
  41. Security State-of-the-Art 50
     Field                         | Attack success probability | Threat models                            | Proofs
     Cryptography                  | ~2⁻¹²⁸                     | information theoretic, resource bounded  | required
     System Security               | ?                          | capabilities, motivations, rationality   | common
     Adversarial Machine Learning  | ?                          | white-box, black-box                     | rare!
  42. Attacker Visibility
     “White-box attacker”: knows the model architecture and all parameters.
     “Black-box attacker”: interacts with the model through an API; limited number of interactions; output is a <class, confidence> vector (e.g., “bird”: 0.09, “horse”: 0.84, ...); decision-based: output is just the class.
  43. Black-Box Batch Attacker
     [Plot: number of queries (×10⁵) needed per image, ZOO attack on MNIST, ~150 seed images.]
     Effort (number of model interactions) to find an adversarial example varies by seed: most require only a few thousand queries, but a few require > 10x more effort.
  44. Easy and Hard Examples
     “Easy” images: the 5 with the fewest queries needed to find an adversarial example. “Hard” images: the 5 with the highest number of queries needed (or that failed).
     [Figure: query counts range from 1024, 1280, 1536, 2560, 2816 and 4608, 6912, 12,800, 13,568, 14,336 for easy seeds up to 71,424, 75,008, 97,792, 101,376, 138,240 for hard seeds, plus one seed with 768,000 query attempts without success; tasks shown: 2 → 7 and “bird” → “horse”.]
  45. Greedy Search Works
     [Plots (MNIST and CIFAR-10): average number of queries (×10⁴) vs. number of images selected, comparing Greedy Search, Random Search, and Retroactive Optimal.]
  46. Conclusions
     Domain expertise still matters: machine learning models designed without domain knowledge will not be robust against motivated adversaries.
     Immature, but fun and active research area: need to make progress toward meaningful threat models, robustness measures, and verifiable defenses.
     Workshops: at IEEE S&P (San Francisco, 24 May) and at DSN (Luxembourg, 25 June).
  47. David Evans, University of Virginia, [email protected], EvadeML.org
     Weilin Xu, Yanjun Qi, Suya, Yuan Tian, Mainuddin Jonas
     Funding: NSF, Intel