
FOSAD Trustworthy Machine Learning
Class 2: Defenses

19th International School on Foundations of Security Analysis and Design
Mini-course on "Trustworthy Machine Learning"
https://jeffersonswheel.org/fosad2019

David Evans
August 27, 2019

Transcript

  1. Trustworthy
    Machine
    Learning
    David Evans
    University of Virginia
    jeffersonswheel.org
    Bertinoro, Italy
    27 August 2019
    19th International School on Foundations of Security Analysis and Design
    2: Defenses


  2. Recap/Plan
    Monday (Yesterday)
    Introduction / Attacks
    Tuesday (Today)
    Threat Models
    Defenses
    Wednesday
    Privacy +
    1


  3. Questions
    2


  4. Threat Models
    3
    1. What are the attacker’s goals?
    • Malicious behavior without detection
    • Commit check fraud
    • ...
    2. What are the attacker’s capabilities?
    information: what do they know?
    actions: what can they do?
    resources: how much can they spend?


  5. Threat Models in Cryptography
    Ciphertext-only attack
    Intercept message, want to learn plaintext
    Chosen-plaintext attack
    Adversary has encryption function as black box,
    wants to learn key (or decrypt some ciphertext)
    Chosen-ciphertext attack
    Adversary has decryption function as black box,
    wants to learn key (or encrypt some message)
    4
    Goals
    Information
    Actions
    Resources


  6. Threat Models in Cryptography
    5
    Goals
    Information
    Actions
    Resources
    Polynomial time/space:
    adversary has computational resources
    that scale polynomially in some
    security parameter (e.g., key size)


  7. Security Goals in
    Cryptography
    6
    First formal notions of
    cryptography, information
    theory
    Claude Shannon (1940s)


  8. Security Goals in
    Cryptography
    7
    Semantic Security:
    adversary with
    intercepted
    ciphertext has no
    advantage over
    adversary without it
    Shafi Goldwasser and Silvio Micali
    developed semantic security in the 1980s
    (2013 Turing Awardees)


  9. Threat Models in Adversarial ML?
    8
    Ciphertext-only attack
    Chosen-plaintext attack
    Chosen-ciphertext attack
    Polynomial time/space
    Semantic Security proofs
    Can we get to threat models as precise
    as those used in cryptography?
    Can we prove strong security notions
    for those threat models?


  10. Threat Models in Adversarial ML?
    9
    Ciphertext-only attack
    Chosen-plaintext attack
    Chosen-ciphertext attack
    Polynomial time/space
    Semantic Security proofs
    Can we get to threat models as precise
    as those used in cryptography?
    Can we prove strong security notions
    for those threat models?
    Current state: “Pre-Shannon” (Nicholas Carlini)


  11. 10
    Ali Rahimi
    NIPS Test-of-Time Award Speech
    (Dec 2017)
    ”If you're building photo-
    sharing systems alchemy
    is okay but we're beyond
    that; now we're building
    systems that govern
    healthcare and mediate
    our civic dialogue”


  12. 11
    Ali Rahimi
    NIPS Test-of-Time Award Speech
    (Dec 2017)
    ”If you're building photo-
    sharing systems alchemy
    is okay but we're beyond
    that; now we're building
    systems that govern
    healthcare and mediate
    our civic dialogue”


  13. Alchemy (~700 − 1660)
    Well-defined, testable goal
    turn lead into gold
    Established theory
    four elements: earth, fire, water, air
    Methodical experiments and lab
    techniques
    (Jabir ibn Hayyan in 8th century)
    Wrong and ultimately unsuccessful,
    but led to modern chemistry.


  14. “Realistic” Threat
    Model for
    Adversarial ML
    13


  15. Attacker Access
    White Box
    Attacker has the model: full knowledge of all parameters
    $F(x) = f_d(f_{d-1}(\cdots f_1(x)))$
    Black Box ("API Access")
    Attacker only submits inputs $x$ and receives outputs $F(x)$
    Each model query is "expensive"
    14


  16. ML-as-a-Service
    15


  17. Black-Box Attacks
    16
    PGD Attack
    $x'_0 = x$
    for $T$ iterations:
        $x'_{i+1} = \mathrm{project}_{x,\epsilon}\big(x'_i - \alpha \cdot \mathrm{sign}(\nabla \mathcal{L}(x'_i, t))\big)$
    $x' = x'_T$

    Can we execute these attacks if we don’t have the model?

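To make the PGD loop above concrete, here is a minimal numpy sketch. It assumes white-box access through a `grad_loss` callable that returns the gradient of the attack loss at the current point; the names and parameters are illustrative, not the exact implementation from any paper cited in this deck.

```python
import numpy as np

def pgd_attack(x, grad_loss, epsilon, alpha, num_steps):
    """Minimal PGD sketch over the L-infinity ball around x.

    x          : original input, values assumed in [0, 1]
    grad_loss  : callable returning the gradient of the attack loss at x'
                 (white-box access; black-box attacks must estimate this)
    epsilon    : maximum L-infinity perturbation
    alpha      : step size per iteration
    num_steps  : number of iterations
    """
    x_adv = x.copy()
    for _ in range(num_steps):
        # Step following the sign convention on the slide above.
        x_adv = x_adv - alpha * np.sign(grad_loss(x_adv))
        # Project back into the epsilon-ball around x and the valid input range.
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

The same loop is the starting point for the black-box attacks that follow; they differ only in how the gradient is obtained.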

  18. Black-Box Optimization Attacks
    17
    Query access only: submit $x$, receive $F(x)$
    Black-Box Gradient Attack
    $x'_0 = x$
    for $T$ iterations:
        use queries to estimate $\nabla \mathcal{L}(x'_i, t)$


  19. Black-Box Optimization Attacks
    18
    Query access only: submit $x$, receive $F(x)$
    Black-Box Gradient Attack
    $x'_0 = x$
    for $T$ iterations:
        use queries to estimate $\nabla \mathcal{L}(x'_i, t)$
        $x'_{i+1}$ = take a step of the "white-box" attack using the estimated gradients

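A minimal sketch of the "use queries to estimate the gradient" step, using symmetric finite differences along random directions (NES-style antithetic sampling is one common choice). `loss_fn` stands for a query to the target model's loss on an input; the sampling parameters are illustrative assumptions.

```python
import numpy as np

def estimate_gradient(x, loss_fn, sigma=0.001, num_samples=50):
    """Estimate the gradient of loss_fn at x using only model queries."""
    grad = np.zeros_like(x)
    for _ in range(num_samples):
        u = np.random.randn(*x.shape)
        # Two queries per sample: symmetric finite differences along u.
        grad += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) * u
    return grad / (2 * sigma * num_samples)
```

Each estimate costs 2 * num_samples queries, which is why pure gradient attacks typically need thousands of queries per seed.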

  20. Black-Box Gradient Attacks
    19
    Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries.
    Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.


  21. Transfer Attacks
    20
    20
    Target Model $F$ (External): query $x^*$, receive $F(x^*) = t$
    Local Model $F_L$
    $x^* = \text{whiteBoxAttack}(F_L, x)$
    Adversarial examples against one model often transfer to another model.


  22. 21


  23. Improving Transfer Attacks
    22
    Target Model $F$ (External): query $x^*$, receive $F(x^*) = t$
    Local Models $F_1$, $F_2$, $F_3$
    $x^* = \text{whiteBoxAttack}(\text{ensemble}(F_1, F_2, F_3), x)$
    Adversarial examples against several models are more likely to transfer.
    Yanpei Liu, Xinyun Chen, Chang Liu, Dawn Song [ICLR 2017]

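A minimal sketch of attacking an ensemble of local models: average the local models' loss gradients and feed the result into a white-box attack (for example, the pgd_attack sketch above). Averaging gradients is one simple combination; it is an assumption here, not the exact fusion used in the ICLR 2017 paper.

```python
import numpy as np

def ensemble_grad_loss(local_grad_losses):
    """Combine per-model gradient callables into one ensemble gradient.

    local_grad_losses : list of callables, each returning the gradient of
                        that local model's attack loss at x' (white-box
                        access to the attacker's own models).
    """
    def grad(x_adv):
        # Examples that fool all local models at once are more likely
        # to transfer to the unseen target model.
        return np.mean([g(x_adv) for g in local_grad_losses], axis=0)
    return grad

# Example usage with the earlier PGD sketch (names assumed):
# x_star = pgd_attack(x, ensemble_grad_loss([grad_f1, grad_f2, grad_f3]),
#                     epsilon=0.03, alpha=0.005, num_steps=40)
```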

  24. Hybrid Attacks
    Transfer Attacks
    Efficient: only one API query
    Low success rates
    - 3% transfer rate for targeted
    attack on ImageNet (ensemble)
    Gradient Attacks
    Expensive: 10k+ queries/seed
    High success rates
    - 100% for targeted attack on
    ImageNet
    23
    Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries.
    Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.
    Combine both attacks: efficient + high success


  25. Hybrid Attack
    24
    Target Model $F$ (External): query $x^*$, receive $F(x^*) = t$
    Local Models $F_1$, $F_2$, $F_3$
    $x^* = \text{whiteBoxAttack}(\text{ensemble}(F_1, F_2, F_3), x)$
    1: Transfer Attack


  26. Hybrid Attack
    25
    Target Model $F$ (External): query $x^*$, receive $F(x^*) \neq t$
    Local Models $F_1$, $F_2$, $F_3$
    $x^* = \text{whiteBoxAttack}(\text{ensemble}(F_1, F_2, F_3), x)$
    1: Transfer Attack
    2: Gradient Attack (starting from the transfer candidate;
    iterates $x'_1, x'_2, \ldots$ lead to $x^*$)


  27. Hybrid Attack
    26
    Target Model $F$ (External): query $x^*$, receive $F(x^*) \neq t$
    Local Models $F_1$, $F_2$, $F_3$
    $x^* = \text{whiteBoxAttack}(\text{ensemble}(F_1, F_2, F_3), x)$
    1: Transfer Attack
    2: Gradient Attack (iterates $x'_1, x'_2, \ldots$ lead to $x^*$)
    3: Tune Local Models using label byproducts


  28. 27
    Dataset / Model       Direct Transfer Rate   Gradient Attack (AutoZOOM)    Hybrid Attack
                                                 Success Rate   Queries/AE     Success Rate   Queries/AE
    MNIST (Targeted)      61.6                   90.9           1,645          98.8           298
    CIFAR10 (Targeted)    63.3                   92.2           1,227          98.1           227
    ImageNet (Targeted)   3.4                    95.4           45,166         98.0           30,089


  29. Realistic Adversary Model?
    Knowledge
    Only API access to target
    Good models for ensemble
    − pretrained models
    − (or access to similar training
    dataset, resources)
    Set of starting seeds
    Goals
    Find one adversarial example
    for each seed
    28
    Resources
    Unlimited number of API
    queries


  30. Batch Attacks
    Knowledge
    Only API access to target
    Good models for ensemble
    − pretrained models
    − (or access to similar training
    dataset, resources)
    Set of starting seeds
    Goals
    Find many seed/adversarial
    example pairs
    29
    Resources
    Limited number of API queries
    Prioritize seeds to attack: use resources to attack the low-cost seeds first

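A minimal sketch of the batch-attack strategy above: rank seeds by a cheap cost estimate and spend the limited query budget on the cheapest seeds first. The cost_estimate and attack callables are placeholders for the strategies on the following slides (local PGD step count, target-model loss).

```python
def prioritize_seeds(seeds, cost_estimate):
    """Order seeds so that those estimated cheapest to attack come first."""
    return sorted(seeds, key=cost_estimate)

def hybrid_batch_attack(seeds, cost_estimate, attack, query_budget):
    """Spend a limited query budget on the most promising seeds first.

    attack(seed) is assumed to return (adversarial_example_or_None, queries_used).
    """
    results = []
    for seed in prioritize_seeds(seeds, cost_estimate):
        if query_budget <= 0:
            break
        adv, used = attack(seed)
        query_budget -= used
        if adv is not None:
            results.append((seed, adv))
    return results
```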

  31. Requirements
    1. There is a high variance
    across seeds in the cost to
    find adversarial examples.
    2. There are ways to predict in
    advance which seeds will
    be easy to attack.
    30


  32. 31
    Variation in Query cost of NES Gradient Attack
    Excludes direct transfers


  33. Predicting the Low-Cost Seeds
    Strategy 1:
    Cost of local attack
    number of PGD steps to find
    local AE
    Strategy 2:
    Loss function on target
    32
    NES gradient attack on robust CIFAR-10 model


  34. What about Direct Transfers?
    Strategy 1:
    Cost of local attack
    number of PGD steps to find
    local AE
    Strategy 2:
    Loss function on target
    33


  35. Direct Transfers
    Cost of local attack
    number of PGD steps to find
    local AE
    34


  36. Two-Phase
    Hybrid Attack
    35
    Retroactive Optimal:
    unrealizable strategy that
    always picks lowest cost seed


  37. Two-Phase
    Hybrid Attack
    36
    Retroactive Optimal:
    unrealizable strategy that
    always picks lowest cost seed
    Phase 1: Find Direct Transfers
    (1000 queries to find 95 direct
    transfers)
    AutoZOOM attack on Robust CIFAR-10 Model


  38. Two-Phase
    Hybrid Attack
    37
    Retroactive Optimal:
    unrealizable strategy that
    always picks lowest cost seed
    Phase 1: Find Direct Transfers
    (1000 queries to find 95 direct
    transfers)
    AutoZOOM attack on Robust CIFAR-10 Model
    Phase 2: Gradient Attack
    (100,000 queries to find 95
    direct transfers)


  39. Cost of Hybrid Batch Attacks
    38
    Target Model       Prioritization   Total Queries (Standard Error) to reach goal (% of seeds attacked)
                                        1%               2%               10%
    CIFAR-10 (Robust)  “Optimal”        10.0 (0.0)       20.0 (0.0)       107.8 (17.4)
    1000 seeds         Two-Phase        20.4 (2.1)       54.2 (5.6)       826.2 (226.6)
                       Random           24,054 (132)     45,372 (260)     251,917 (137)
    ImageNet           “Optimal”        1.0 (0.0)        2.0 (0.0)        34,949 (3,742)
    100 seeds          Two-Phase        28.0 (2.0)       38.6 (7.5)       78,844 (11,837)
                       Random           15,046 (423)     45,136 (1,270)   285,855 (8,045)


  40. Defenses
    39


  41. How can we construct
    models that make it hard
    for adversaries to find
    adversarial examples?
    40


  42. Defense Strategies
    1. Hide the gradients
    41


  43. Defense Strategies
    1. Hide the gradients
    − Transferability results
    42
    Target Model $F$ (External): query $x^*$, receive $F(x^*) = t$
    Local Model $F_L$: $x^* = \text{whiteBoxAttack}(F_L, x)$


  44. Defense Strategies
    1. Hide the gradients
    − Transferability results
    43
    Target Model $F$ (External): query $x^*$, receive $F(x^*) = t$
    Local Model $F_L$: $x^* = \text{whiteBoxAttack}(F_L, x)$
    Maybe they can work against
    adversaries who don’t have
    access to training data/similar
    model? (or transfer loss is high)


  45. 44
    Visualization by Nicholas Carlini


  46. 45
    Visualization by Nicholas Carlini


  47. 46
    Visualization by Nicholas Carlini


  48. 47
    Visualization by Nicholas Carlini


  49. Defense Strategies
    1. Hide the gradients
    − Clever adversaries can still find adversarial examples
    48
    ICML 2018 (Best Paper award)


  50. Defense Strategies
    1. Hide the gradients
    − Transferability results
    − Clever adversaries can still find adversarial examples
    2. Build a robust classifier
    49
    Increase capacity


  51. Increasing Model Capacity
    50
    Image from Aleksander Mądry, et al. 2017


  52. Defense Strategies
    1. Hide the gradients
    − Transferability results
    − Clever adversaries can still find adversarial examples
    2. Build a robust classifier
    51
    Increase capacity
    Consider adversaries in training: adversarial training


  53. Adversarial Training (Example from Yesterday)
    52
    Training
    Data
    ML
    Algorithm
    Training Clone
    010110011
    01
    EvadeML
    Deployment
    Why didn’t this work?


  54. Adversarial Training
    Training
    Data
    Training
    Process
    Candidate
    Model !"
    Adversarial
    Example
    Generator
    Successful
    AEs against !"
    add to training data
    (with correct labels)
    Christian Szegedy, Wojciech
    Zaremba, Ilya Sutskever, Joan
    Bruna, Dumitru Erhan, Ian
    Goodfellow, and Rob Fergus. 2013


  55. Ensemble Adversarial Training
    Training
    Data
    Training
    Process
    Candidate
    Model !"
    Adversarial
    Example
    Generator
    Successful
    AEs against !"
    add to training data
    (with correct labels)
    Florian Tramer, et al. [ICLR 2018]


  56. Ensemble Adversarial Training
    Training
    Data
    Training
    Process
    Candidate
    Model !"
    Adversarial
    Example
    Generator
    Successful
    AEs against !"
    add to training data
    (with correct labels)
    Florian Tramer, et al. [ICLR 2018]
    Static
    Model !#
    Adversarial
    Example
    Generator
    AEs against
    !#
    Static
    Model !$
    Adversarial
    Example
    Generator
    AEs against
    !$


  57. Formalizing Adversarial Training
    56
    2017
    Regular training:
    $\min_\theta \; \mathbb{E}_{(x,y)\sim D}\left[\mathcal{L}(f_\theta, x, y)\right]$


  58. Formalizing Adversarial Training
    57
    2017
    Regular training:
    $\min_\theta \; \mathbb{E}_{(x,y)\sim D}\left[\mathcal{L}(f_\theta, x, y)\right]$
    Adversarial training:
    $\min_\theta \; \mathbb{E}_{(x,y)\sim D}\left[\max_{\delta \in \Delta}\; \mathcal{L}(f_\theta, x + \delta, y)\right]$
    Simulate the inner maximization with a PGD attack with multiple restarts

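A minimal sketch of the min-max training loop above: the inner maximization is approximated with PGD on each batch, and the model is then updated on the resulting adversarial examples. The loss_grad_wrt_input and train_step callables are assumptions standing in for whatever framework is used; random restarts (mentioned on the slide) are omitted for brevity.

```python
import numpy as np

def adversarial_training_epoch(batches, loss_grad_wrt_input, train_step,
                               epsilon, alpha, pgd_steps):
    """One epoch of adversarial training (sketch).

    batches             : iterable of (x, y) numpy arrays
    loss_grad_wrt_input : callable(x, y) returning d(loss)/d(x) for the current model
    train_step          : callable(x, y) that performs one parameter update
    """
    for x, y in batches:
        # Inner maximization: approximate the worst-case perturbation with PGD.
        x_adv = x + np.random.uniform(-epsilon, epsilon, size=x.shape)
        for _ in range(pgd_steps):
            x_adv = x_adv + alpha * np.sign(loss_grad_wrt_input(x_adv, y))
            x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
        # Outer minimization: train on the adversarial examples.
        train_step(x_adv, y)
```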

  59. Attacking Robust Models
    58
    Dataset / Model               Direct Transfer Rate   Gradient Attack (AutoZOOM)    Hybrid Attack
                                                         Success Rate   Queries/AE     Success Rate   Queries/AE
    MNIST (Targeted)              61.6                   90.9           1,645          98.8           298
    CIFAR10 (Targeted)            63.3                   92.2           1,227          98.1           227
    MNIST-Robust (Untargeted)     2.9                    7.2            52,182         7.3            51,328
    CIFAR10-Robust (Untargeted)   9.5                    64.4           2,640          65.2           2,529
    Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries.
    Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.


  60. Defense Strategies
    1. Hide the gradients
    − Transferability results
    − Clever adversaries can still find adversarial examples
    2. Build a robust classifier
    − Adversarial retraining with increased model capacity
    Very expensive
    Assumes you can generate adversarial examples as well as the adversary can
    − If we could build a perfect model, we would!
    59


  61. Defense Strategies
    1. Hide the gradients
    − Transferability results
    − Clever adversaries can still find adversarial examples
    2. Build a robust classifier
    − Adversarial retraining, increasing model capacity, etc.
    − If we could build a perfect model, we would!
    60
    Our strategy: “Feature Squeezing”: reduce the
    search space available to the adversary
    Weilin Xu, David Evans, Yanjun Qi [NDSS 2018]


  62. Feature Squeezing Detection Framework
    Input → Model → Prediction0
    Input → Squeezer 1 → Model → Prediction1
    ...
    Input → Squeezer k → Model' → Predictionk
    Detection: $d(\mathrm{pred}_0, \mathrm{pred}_1, \ldots, \mathrm{pred}_k)$ → Adversarial / Legitimate
    Weilin Xu, Yanjun Qi


  63. Feature Squeezing Detection Framework
    Input → Model → Prediction0
    Input → Squeezer 1 → Model → Prediction1
    ...
    Input → Squeezer k → Model' → Predictionk
    Detection: $d(\mathrm{pred}_0, \mathrm{pred}_1, \ldots, \mathrm{pred}_k)$ → Adversarial / Legitimate
    A feature squeezer coalesces similar inputs into one point:
    • Barely changes legitimate inputs.
    • Destroys adversarial perturbations.


  64. Coalescing by Feature Squeezing
    63
    Metric Space 1: Target Classifier Metric Space 2: “Oracle”
    Before: find a small perturbation that changes class for classifier, but imperceptible to oracle.
    Now: change class for both original and squeezed classifier, but imperceptible to oracle.


  65. Fast Gradient Sign [Yesterday]
    64
    (images: original and adversarial examples at ε = 0.1, 0.2, 0.3, 0.4, 0.5)
    Adversary Power: $\epsilon$
    $L_\infty$-bounded adversary: $\max_i |x_i - x'_i| \leq \epsilon$
    $x' = x - \epsilon \cdot \mathrm{sign}(\nabla_x \mathcal{L}(x, t))$
    Goodfellow, Shlens, Szegedy 2014

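The fast gradient sign step above as a short numpy sketch, following the slide's sign convention (a single epsilon-sized step); grad_loss is an assumed callable for the white-box gradient of the attack loss.

```python
import numpy as np

def fgsm(x, grad_loss, epsilon):
    """One fast-gradient-sign step; grad_loss(x) returns the attack-loss gradient."""
    x_adv = x - epsilon * np.sign(grad_loss(x))
    return np.clip(x_adv, 0.0, 1.0)   # stay in the valid pixel range
```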

  66. Bit Depth Reduction
    (plot: signal quantization at 1-bit, 3-bit, and 8-bit depth)
    Reduce to 1-bit: $x_i = \mathrm{round}(x_i \times 2)/2$
    Adversarial example X*: input [0.312 0.271 …… 0.159 0.651] → output [0. 0. …… 0. 1. ]
    Normal example X:       input [0.012 0.571 …… 0.159 0.951] → output [0. 1. …… 0. 1. ]
    65

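A minimal bit-depth-reduction squeezer in numpy. Quantizing to 2^bits levels over [0, 1] is a common formulation and an assumption here; the exact rounding used on the slide may differ slightly, but bits=1 maps every value to 0 or 1 as in the example above.

```python
import numpy as np

def reduce_bit_depth(x, bits):
    """Squeeze inputs in [0, 1] down to the given bit depth.

    bits=1 maps every value to 0 or 1; bits=8 is (nearly) the identity
    for standard 8-bit images.
    """
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels
```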

  67. Bit Depth Reduction
    66
    (figure: a seed digit and adversarial examples from CW, BIM, and FGSM attacks;
    predicted labels: 1 1 4 2 2 and 1 1 1 1 1)


  68. Accuracy with Bit Depth Reduction
    67
    Dataset    Squeezer      Adversarial Examples (FGSM, BIM, CW∞, DeepFool, CW2, CW0, JSMA)   Legitimate Images
    MNIST      None          13.0%                                                             99.43%
               1-bit Depth   62.7%                                                             99.33%
    ImageNet   None          2.78%                                                             69.70%
               4-bit Depth   52.11%                                                            68.00%


  69. Spatial Smoothing: Median Filter
    Replace a pixel with median of its neighbors.
    Effective in eliminating "salt-and-pepper" noise ($L_0$ attacks)
    68
    Image from https://sultanofswing90.wordpress.com/tag/image-processing/
    3×3 Median Filter

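A minimal spatial-smoothing squeezer using scipy's median filter. The 2×2 window matches the configuration used on the later detection slides; filtering each color channel independently is an implementation assumption.

```python
import numpy as np
from scipy.ndimage import median_filter

def median_smooth(x, size=2):
    """Replace each pixel with the median of its size x size neighborhood."""
    if x.ndim == 2:                       # grayscale image
        return median_filter(x, size=size)
    # Color image: filter each channel independently.
    return np.stack([median_filter(x[..., c], size=size)
                     for c in range(x.shape[-1])], axis=-1)
```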

  70. Spatial Smoothing: Non-local Means
    Replace a patch with weighted mean of similar patches (in region).
    69
    $p' = \sum_i w(p, q_i) \cdot q_i$
    (patch $p$ is replaced by a weighted mean of similar patches $q_i$)
    Preserves edges while removing noise.


  71. 70
    (figure: CIFAR-10 predictions for the original image and for BIM ($L_\infty$) and JSMA ($L_0$) adversarial examples)
                               Original          BIM ($L_\infty$)   JSMA ($L_0$)
    No squeezing               Airplane 94.4%    Truck 99.9%        Automobile 56.5%
    Median Filter (2×2)        Airplane 98.4%    Airplane 99.9%     Ship 46.0%
    Non-local Means (13-3-4)   Airplane 98.3%    Airplane 80.8%     Airplane 70.0%


  72. Accuracy with Spatial Smoothing
    71
    Dataset    Squeezer                 Adversarial Examples (FGSM, BIM, CW∞, DeepFool, CW2, CW0)   Legitimate Images
    ImageNet   None                     2.78%                                                       69.70%
               Median Filter 2×2        68.11%                                                      65.40%
               Non-local Means 11-3-4   57.11%                                                      65.40%


  73. Other Potential Squeezers
    72
    C Xie, et al. Mitigating Adversarial Effects Through Randomization, ICLR 2018.
    J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial
    Examples, ICLR 2018.
    D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial
    Examples, CCS 2017; A Prakash, et al., Deflecting Adversarial Attacks with Pixel
    Deflection, CVPR 2018;...
    Thermometer Encoding (learnable bit depth reduction)
    Image denoising using autoencoder, wavelet, JPEG, etc.
    Image resizing
    ...
    Spatial Smoothers: median filter, non-local means


  74. Other Potential Squeezers
    73
    C Xie, et al. Mitigating Adversarial Effects Through Randomization, ICLR 2018.
    J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial
    Examples, ICLR 2018.
    D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial
    Examples, CCS 2017; A Prakash, et al., Deflecting Adversarial Attacks with Pixel
    Deflection, CVPR 2018;...
    Thermometer Encoding (learnable bit depth reduction)
    Image denoising using autoencoder, wavelet, JPEG, etc.
    Image resizing
    ...
    Spatial Smoothers: median filter, non-local means
    Anish Athalye, Nicholas Carlini, David Wagner.
    Obfuscated Gradients Give a False Sense of
    Security: Circumventing Defenses to
    Adversarial Examples. ICML 2018.


  75. “Feature Squeezing” (Vacuous) Conjecture
    For any distance-limited adversarial method,
    there exists some feature squeezer that
    accurately detects its adversarial examples.
    74
    Intuition: if the perturbation is small (in some simple
    metric space), there is some squeezer that coalesces
    original and adversarial example into same sample.


  76. Feature Squeezing Detection
    Input → Model (7-layer CNN) → Prediction0
    Input → Bit Depth-1 → Model → Prediction1
    Input → Median 2×2 → Model → Prediction2
    $\max\big(\|y_0 - y_1\|_1, \|y_0 - y_2\|_1\big) > T$  →  Yes: Adversarial / No: Legitimate

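A minimal sketch of the detection rule above: compare the model's softmax output on the raw input with its outputs on each squeezed version, and flag the input when the largest L1 difference exceeds a threshold (chosen on legitimate data to keep false positives low). The predict callable and the squeezer list are assumptions, for example the reduce_bit_depth and median_smooth sketches earlier.

```python
import numpy as np

def feature_squeezing_detect(x, predict, squeezers, threshold):
    """Return (is_adversarial, score) for one input.

    predict   : callable returning the model's softmax vector for an input
    squeezers : list of callables, e.g.
                [lambda v: reduce_bit_depth(v, 1), median_smooth]
    """
    y0 = predict(x)
    score = max(np.sum(np.abs(y0 - predict(squeeze(x)))) for squeeze in squeezers)
    return score > threshold, score
```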

  77. Detecting Adversarial Examples
    Distance between original input and its squeezed version
    Adversarial
    inputs
    (CW attack)
    Legitimate
    inputs


  78. 77
    Training a detector (MNIST)
    (histogram: number of examples vs. maximum $L_1$ distance between original and squeezed input,
    for legitimate and adversarial inputs)
    threshold = 0.0029
    detection: 98.2%, FP < 4%
    Set the detection threshold to keep the false positive rate below a target.


  79. ImageNet Configuration
    Input → Model (MobileNet) → Prediction0
    Input → Bit Depth-5 → Model → Prediction1
    Input → Median 2×2 → Model → Prediction2
    Input → Non-local Means → Model → Prediction3
    $\max_{i \in \{1,2,3\}} \|y_0 - y_i\|_1 > T$  →  Yes: Adversarial / No: Legitimate


  80. 79
    Training a detector (ImageNet)
    (histogram: number of examples vs. maximum $L_1$ distance between original and squeezed input,
    for legitimate and adversarial inputs)
    threshold = 1.24
    detection: 85%, FP < 5%


  81. Aggregated Detection Results
    Dataset    Squeezers                                                  Threshold   False Positive Rate   Detection Rate (SAEs)   ROC-AUC (Exclude FAEs)
    MNIST      Bit Depth (1-bit), Median (2×2)                            0.0029      3.98%                 98.2%                   99.44%
    CIFAR-10   Bit Depth (5-bit), Median (2×2), Non-local Mean (13-3-2)   1.1402      4.93%                 84.5%                   95.74%
    ImageNet   Bit Depth (5-bit), Median (2×2), Non-local Mean (11-3-4)   1.2128      8.33%                 85.9%                   94.24%
    80


  82. Threat Models
    Oblivious attack: The adversary has full knowledge of
    the target model, but is not aware of the detector.
    Adaptive attack: The adversary has full knowledge of
    the target model and the detector.
    81


  83. Adaptive Adversary
    Adaptive CW2 attack, unbounded adversary:
    Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song,
    Adversarial Example Defense: Ensembles of Weak Defenses are
    not Strong, USENIX WOOT’17.
    minimize  $\ell(x') \;+\; c \cdot \Delta(x, x') \;+\; k \cdot \mathrm{score}_{L_1}(x')$
    (misclassification term + distance term + detection term)
    82


  84. Adaptive Adversarial Examples
    83
    No successful adversarial examples were found
    for images originally labeled as 3 or 8.
    Attack                     Mean L2
    Untargeted                 2.80
    Targeted (next)            4.14
    Targeted (least likely)    4.67


  85. Adaptive Adversary Success Rates
    84
    (plot: adversary's success rate vs. clipped ε for untargeted, targeted (next), and
    targeted (least likely) attacks, with the unbounded attack and a typical ε marked;
    data labels: 0.68, 0.06, 0.01, 0.44, 0.01, 0.24)


  86. Input → Model → Prediction0
    Input → Squeezer 1 → Model → Prediction1
    Input → Squeezer 2 → Model → Prediction2
    ...
    Input → Squeezer k → Model' → Predictionk
    $d(\mathrm{pred}_0, \mathrm{pred}_1, \ldots, \mathrm{pred}_k)$ → Yes: Adversarial / No: Legitimate
    Defender's Entropy Advantage: random seed


  87. Countermeasure: Randomization
    86
    Binary filter: threshold := 0.5  →  threshold := $\mathcal{N}(0.5, 0.0625)$
    (plots: deterministic vs. randomized binary filter)
    Strengthen the adaptive adversary:
    attack an ensemble of 3 detectors with thresholds [0.4, 0.5, 0.6]

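A minimal sketch of the randomized binary filter: the threshold is drawn fresh for each input, so an adaptive adversary cannot optimize against a single fixed detector. Interpreting N(0.5, 0.0625) as mean and variance (standard deviation 0.25) is an assumption.

```python
import numpy as np

def randomized_binary_filter(x, rng=None):
    """Binary-filter squeezer with a per-input random threshold."""
    if rng is None:
        rng = np.random.default_rng()
    threshold = rng.normal(loc=0.5, scale=0.25)   # N(0.5, 0.0625) as mean/variance
    return (x > threshold).astype(x.dtype)
```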

  88. 87
    Attack          Mean L2 (Attack Deterministic Detector)   Mean L2 (Attack Randomized Detector)
    Untargeted      2.80                                      3.63
    Targeted-Next   4.14                                      5.48
    Targeted-LL     4.67                                      5.76


  89. Are defenses against
    adversarial examples
    even possible?
    88


  90. (Redefining) Adversarial Example
    89
    Prediction Change Definition:
    An input $x' \in X$ is an adversarial example for $x \in X$, iff
    $\exists x' \in \mathrm{Ball}_\epsilon(x)$ such that $F(x) \neq F(x')$.


  91. Adversarial Example
    90
    $\mathrm{Ball}_\epsilon(x)$ is some space around $x$, typically defined in
    some (simple!) metric space:
    $L_0$ norm (# different), $L_2$ norm ("Euclidean distance"), $L_\infty$
    Without constraints on $\mathrm{Ball}_\epsilon$, every input has
    adversarial examples.
    Prediction Change Definition:
    An input $x' \in X$ is an adversarial example for $x \in X$, iff
    $\exists x' \in \mathrm{Ball}_\epsilon(x)$ such that $F(x) \neq F(x')$.


  92. Adversarial Example
    91
    Any non-trivial model has adversarial examples:
    $\exists x_a, x_b \in X.\; F(x_a) \neq F(x_b)$
    Prediction Change Definition:
    An input $x' \in X$ is an adversarial example for $x \in X$, iff
    $\exists x' \in \mathrm{Ball}_\epsilon(x)$ such that $F(x) \neq F(x')$.


  93. Prediction Error Robustness
    92
    Error Robustness:
    An input $x' \in X$ is an adversarial example for (correct) $x \in X$, iff
    $\exists x' \in \mathrm{Ball}_\epsilon(x)$ such that $F(x') \neq$ true label for $x'$.
    A perfect classifier has no (error robustness) adversarial examples.


  94. Prediction Error Robustness
    93
    Error Robustness:
    An input $x' \in X$ is an adversarial example for (correct) $x \in X$, iff
    $\exists x' \in \mathrm{Ball}_\epsilon(x)$ such that $F(x') \neq$ true label for $x'$.
    A perfect classifier has no (error robustness) adversarial examples.
    If we have a way to know this, we don't need an ML classifier.


  95. Global Robustness Properties
    94
    Adversarial Risk: probability an input has an adversarial example
    $\Pr_{x \leftarrow D}\left[\,\exists\, x' \in \mathrm{Ball}_\epsilon(x).\; F(x') \neq \mathrm{class}(x')\,\right]$
    Dimitrios I. Diochnos, Saeed Mahloujifar,
    Mohammad Mahmoody, NeurIPS 2018


  96. Global Robustness Properties
    95
    Dimitrios I. Diochnos, Saeed Mahloujifar,
    Mohammad Mahmoody, NeurIPS 2018
    Adversarial Risk: probability an input has an adversarial example
    $\Pr_{x \leftarrow D}\left[\,\exists\, x' \in \mathrm{Ball}_\epsilon(x).\; F(x') \neq \mathrm{class}(x')\,\right]$
    Error Region Robustness: expected distance to closest AE:
    $\mathbb{E}_{x \leftarrow D}\left[\,\inf \{\, \epsilon : \exists\, x' \in \mathrm{Ball}_\epsilon(x).\; F(x') \neq \mathrm{class}(x')\,\}\,\right]$


  97. Recent Global Robustness Results
    Adversarial Spheres [Gilmer et al., 2018]
      Assumption: uniform distribution on two concentric n-spheres
      Key result: expected safe distance ($L_2$-norm) is relatively small.
    Adversarial vulnerability for any classifier [Fawzi × 3, 2018]
      Assumption: smooth generative model (Gaussian in latent space; generator is L-Lipschitz)
      Key result: adversarial risk ⟶ 1 for relatively small attack strength ($L_2$-norm):
      $P(r(x) \leq \eta) \geq 1 - \sqrt{\pi/2}\, e^{-\eta^2 / 2L^2}$
    Curse of Concentration in Robust Learning [Mahloujifar et al., 2018]
      Assumption: Normal Lévy families (unit sphere, uniform, $L_2$ norm; Boolean hypercube,
      uniform, Hamming distance; ...)
      Key result: if the attack strength exceeds a relatively small threshold, adversarial risk > 1/2:
      $b > \sqrt{\log(k_1/\epsilon)} / \sqrt{k_2 \cdot n} \;\Rightarrow\; \mathrm{Risk}_b(h, c) \geq 1/2$
    Properties of any model for the input space: the distance to an AE is small relative to
    the expected distance between two sampled points.


  98. Prediction Change Robustness
    97
    Prediction Change:
    An input $x' \in X$ is an adversarial example for $x \in X$, iff
    $\exists x' \in \mathrm{Ball}_\epsilon(x)$ such that $F(x') \neq F(x)$.
    Any non-trivial model has adversarial examples:
    $\exists x_a, x_b \in X.\; F(x_a) \neq F(x_b)$
    Solutions:
    - only consider inputs from the distribution ("good" seeds)
    - output isn't just a class (e.g., confidence)
    - targeted adversarial examples
      cost-sensitive adversarial robustness


  99. Local (Instance) Robustness
    98
    Robust Region: For an input $x$, the robust region is the
    maximum region with no adversarial example:
    $\sup \{\, \epsilon > 0 : \forall x' \in \mathrm{Ball}_\epsilon(x),\; F(x') = F(x) \,\}$


  100. Local (Instance) Robustness
    99
    Robust Region: For an input $x$, the robust region is the
    maximum region with no adversarial example:
    $\sup \{\, \epsilon > 0 : \forall x' \in \mathrm{Ball}_\epsilon(x),\; F(x') = F(x) \,\}$
    Robust Error: For a test set $S$ and bound $\epsilon^*$:
    $\dfrac{|\{\, x \in S : \mathrm{RobustRegion}(x) < \epsilon^* \,\}|}{|S|}$


  101. Instance Defense-Robustness
    100
    For an input $x$, the robust-defended region is the maximum
    region with no undetected adversarial example:
    $\sup \{\, \epsilon > 0 : \forall x' \in \mathrm{Ball}_\epsilon(x),\; F(x') = F(x) \vee \mathrm{detected}(x') \,\}$
    Defense Failure: For a test set $S$ and bound $\epsilon^*$:
    $\dfrac{|\{\, x \in S : \mathrm{RobustDefendedRegion}(x) < \epsilon^* \,\}|}{|S|}$
    Can we verify a defense?


  102. Formal Verification of Defense Instance
    exhaustively test all inputs $x' \in \mathrm{Ball}_\epsilon(x)$ for correctness or detection
    Need to transform model into a
    function amenable to verification


  103. Linear Programming
    Find values of $x$ that minimize a linear function
    $c_1 x_1 + c_2 x_2 + c_3 x_3 + \ldots$
    subject to linear constraints:
    $a_{11} x_1 + a_{12} x_2 + \cdots \leq b_1$
    $a_{21} x_1 + a_{22} x_2 + \cdots \leq b_2$
    $x_i \leq 0$
    ...


  104. Encoding a Neural Network
    Linear Components ($y = Wx + b$)
    Convolutional Layer
    Fully-connected Layer
    Batch Normalization (in test mode)
    Non-linear
    Activation (ReLU, Sigmoid, Softmax)
    Pooling Layer (max, avg)
    103


  105. Encode ReLU
    Mixed Integer Linear Programming adds discrete values to LP
    ReLU (Rectified Linear Unit): $y = \max(0, x)$ is piecewise linear
    With bounds $l \leq x \leq u$ and a binary variable $a \in \{0, 1\}$:
    $y \geq x$
    $y \geq 0$
    $y \leq x - l(1 - a)$
    $y \leq u \cdot a$

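A minimal sketch of the ReLU encoding above using the PuLP MILP modeling library (any MILP modeler would do). The helper name and the assumption that pre-activation bounds with l <= 0 <= u are already known (e.g., from interval analysis) are mine, not taken from MIPVerify.

```python
from pulp import LpProblem, LpMinimize, LpVariable, LpBinary

def encode_relu(prob, x, name, l, u):
    """Add MILP constraints forcing y == max(0, x), assuming l <= x <= u and l <= 0 <= u."""
    y = LpVariable(f"{name}_out", lowBound=0)   # enforces y >= 0
    a = LpVariable(f"{name}_on", cat=LpBinary)  # a = 1 iff the unit is active
    prob += y >= x
    prob += y <= x - l * (1 - a)                # if a = 1: y <= x
    prob += y <= u * a                          # if a = 0: y <= 0
    return y

# Tiny usage example: encode y = relu(x) for a scalar x in [-1, 2].
prob = LpProblem("relu_example", LpMinimize)
x = LpVariable("x", lowBound=-1, upBound=2)
y = encode_relu(prob, x, "relu1", l=-1, u=2)
```

Solvers such as Gurobi (used by MIPVerify, next slide) handle large collections of these constraints far better in practice than the worst-case NP-completeness of MILP suggests.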

  106. Mixed Integer Linear Programming (MILP)
    Intractable in theory (NP-Complete)
    Efficient in practice
    (e.g., Gurobi solver)
    MIPVerify
    Vincent Tjeng, Kai Xiao, Russ Tedrake
    Verify NNs using MILP


  107. Encode Feature Squeezers
    Binary Filter: step function with threshold 0.5 (lower semi-continuous)
    Actual input: uint8 [0, 1, 2, … 254, 255]
    127 / 255 = 0.498
    128 / 255 = 0.502
    An infeasible gap: [0.499, 0.501]


  108. Verified L∞ Robustness
    Model                     Test Accuracy   Robust Error (ε = 0.1)   Robust Error with Binary Filter
    Raghunathan et al.        95.82%          14.36%-30.81%            7.37%
    Wong & Kolter             98.11%          4.38%                    4.25%
    Ours with binary filter   98.94%          2.66%-6.63%              -
    Even without detection, this helps!


  109. Encode Detection Mechanism
    Original version:
    $\mathrm{score}(x) = \|f(x) - f(\mathrm{squeeze}(x))\|_1$, where $f(x)$ is the softmax output
    Simplify for verification:
    $L_1$ ⟶ maximum difference
    softmax ⟶ multiple piecewise-linear approximations of sigmoid

  110. Preliminary Experiments
    109
    Input $x'$ → Model (4-layer CNN) → $y_1$
    Input $x'$ → Bit Depth-1 → Model → $y_2$
    $\mathrm{max\_diff}(y_1, y_2) > T$  →  Yes: Adversarial / No: valid
    Verification: for a seed $x$, there is no adversarial input
    $x' \in \mathrm{Ball}_\epsilon(x)$ for which the prediction differs from $F(x)$
    and $x'$ is not detected
    Adversarially robust retrained [Wong & Kolter] model
    1000 test MNIST seeds, $\epsilon = 0.1$ ($L_\infty$)
    970 infeasible (verified no adversarial example)
    13 misclassified (original seed)
    17 vulnerable
    Robust error: 0.3%
    Verification time ~0.2s
    (compared to 0.8s without binarization)


  111. 110
    (chart: scalability vs. precision of the evaluation metric)
    Formal Verification (evaluation metric: robust error):
      MILP solver (MIPVerify), SMT solver (Reluplex), interval analysis (ReluVal)
    Certified Robustness (evaluation metric: bound):
      CNN-Cert (Boopathy et al., 2018), Dual-LP (Kolter & Wong 2018), Dual-SDP (Raghunathan et al., 2018)
    Heuristic Defenses (evaluation metric: attack success rate against a set of attacks):
      distillation (Papernot et al., 2016), gradient obfuscation, adversarial retraining (Madry et al., 2017)
    feature squeezing


  112. Realistic Threat Models
    Knowledge
    Full access to target
    Goals
    Find many seed/adversarial
    example pairs
    111
    Resources
    Limited number of API queries
    Limited computation
    It matters which seed
    and target classes


  113. 112
    (heatmap: seed class vs. target class)
    Original Model (no robustness training)
    MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units)
    $\epsilon = 0.2$, $L_\infty$

  114. 113
    (heatmap: seed class vs. target class)
    Original Model (no robustness training)
    MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units)
    $\epsilon = 0.2$, $L_\infty$


  115. Training a Robust Network
    Eric Wong and J. Zico Kolter. Provable defenses against adversarial
    examples via the convex outer adversarial polytope. ICML 2018.
    Replace the loss with a differentiable function based on an outer bound,
    computed using a dual network.
    (figure: ReLU (Rectified Linear Unit) and its linear approximation)


  116. 115
    (heatmap: seed class vs. target class)
    Standard Robustness Training (overall robustness goal)
    MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units)
    $\epsilon = 0.2$, $L_\infty$


  117. Cost-Sensitive Robustness Training
    116
    Xiao Zhang
    Cost matrix: cost of different adversarial transformations
    $C = \begin{pmatrix} - & 0 \\ 1 & - \end{pmatrix}$
    (rows: seed class {benign, malware}; columns: target class {benign, malware})
    Incorporate the cost matrix into robustness training
    Xiao Zhang and David Evans [ICLR 2019]


  118. 117
    (heatmap: seed class vs. target class)
    Standard Robustness Training (overall robustness goal)
    MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units)
    $\epsilon = 0.2$, $L_\infty$


  119. 118
    (heatmap: seed class vs. target class)
    Cost-Sensitive Robustness Training
    Protect odd classes from evasion


  120. 119
    (heatmap: seed class vs. target class)
    Cost-Sensitive Robustness Training
    Protect even classes from evasion


  121. History of the
    destruction
    of Troy, 1498
    Wrap-Up


  122. Security State-of-the-Art
    121
                    Attack success probability   Threat models                              Proofs
    Cryptography    $2^{-128}$                   information theoretic, resource bounded    required
    Considered seriously broken if the attack method increases this to $2^{-120}$,
    even if it requires $2^{64}$ ciphertexts.

    View Slide

  123. Security State-of-the-Art
    122
                       Attack success probability   Threat models                              Proofs
    Cryptography       $2^{-128}$                   information theoretic, resource bounded    required
    System Security    $2^{-32}$                    capabilities, motivations, rationality     common
    Considered seriously broken if the attack method can succeed in a "lab" environment
    with probability $2^{-8}$.


  124. Security State-of-the-Art
    123
                                    Attack success probability   Threat models                              Proofs
    Cryptography                    $2^{-128}$                   information theoretic, resource bounded    required
    System Security                 $2^{-32}$                    capabilities, motivations, rationality     common
    Adversarial Machine Learning    $2^{-1}$                     artificially limited adversary             making progress!
    Considered broken if the attack method succeeds with probability $2^{-2}$.


  125. Security State-of-the-Art
    124
                                    Attack success probability   Threat models                              Proofs
    Cryptography                    $2^{-128}$                   information theoretic, resource bounded    required
    System Security                 $2^{-32}$                    capabilities, motivations, rationality     common
    Adversarial Machine Learning    $2^{-1}$                     artificially limited adversary             making progress!
    Huge gaps to close:
    threat models are unrealistic (but real threats unclear)
    verification techniques only work for tiny models
    experimental defenses often (quickly) broken


  126. Tomorrow:
    Privacy
    125
    David Evans
    University of Virginia
    [email protected]
    https://www.cs.virginia.edu/evans
