FOSAD Trustworthy Machine Learning: Class 2

19th International School on Foundations of Security Analysis and Design
Mini-course on "Trustworthy Machine Learning"
https://jeffersonswheel.org/fosad2019

Class 2: Defenses
David Evans
August 27, 2019

Transcript

  1. Trustworthy
    Machine
    Learning
    David Evans
    University of Virginia
    jeffersonswheel.org
    Bertinoro, Italy
    27 August 2019
    19th International School on Foundations of Security Analysis and Design
    2: Defenses


  2. Recap/Plan
    Monday (Yesterday)
    Introduction / Attacks
    Tuesday (Today)
    Threat Models
    Defenses
    Wednesday
    Privacy +
    1


  3. Threat Models
    3
    1. What are the attacker’s goals?
    • Malicious behavior without detection
    • Commit check fraud
    • ...
    2. What are the attacker’s capabilities?
    information: what do they know?
    actions: what can they do?
    resources: how much can they spend?


  4. Threat Models in Cryptography
    Ciphertext-only attack
    Intercept message, want to learn plaintext
    Chosen-plaintext attack
    Adversary has encryption function as black box,
    wants to learn key (or decrypt some ciphertext)
    Chosen-ciphertext attack
    Adversary has decryption function as black box,
    wants to learn key (or encrypt some message)
    4
    Goals
    Information
    Actions
    Resources


  5. Threat Models in Cryptography
    5
    Goals
    Information
    Actions
    Resources
    Polynomial time/space:
    adversary has computational resources
    that scale polynomially in some
    security parameter (e.g., key size)


  6. Security Goals in
    Cryptography
    6
    First formal notions of
    cryptography, information
    theory
    Claude Shannon (1940s)


  7. Security Goals in
    Cryptography
    7
    Semantic Security:
    adversary with
    intercepted
    ciphertext has no
    advantage over
    adversary without it
    Shafi Goldwasser and Silvio Micali
    Developed semantic security in 1980s
    (2013 Turing Awardees)


  8. Threat Models in Adversarial ML?
    8
    Ciphertext-only attack
    Chosen-plaintext attack
    Chosen-ciphertext attack
    Polynomial time/space
    Semantic Security proofs
    Can we get to threat models as precise
    as those used in cryptography?
    Can we prove strong security notions
    for those threat models?


  9. Threat Models in Adversarial ML?
    9
    Ciphertext-only attack
    Chosen-plaintext attack
    Chosen-ciphertext attack
    Polynomial time/space
    Semantic Security proofs
    Can we get to threat models as precise
    as those used in cryptography?
    Can we prove strong security notions
    for those threat models?
    Current state: “Pre-Shannon” (Nicholas Carlini)


  10. 10
    Ali Rahimi
    NIPS Test-of-Time Award Speech
    (Dec 2017)
    ”If you're building photo-
    sharing systems alchemy
    is okay but we're beyond
    that; now we're building
    systems that govern
    healthcare and mediate
    our civic dialogue”


  11. 11
    Ali Rahimi
    NIPS Test-of-Time Award Speech
    (Dec 2017)
    ”If you're building photo-
    sharing systems alchemy
    is okay but we're beyond
    that; now we're building
    systems that govern
    healthcare and mediate
    our civic dialogue”


  12. Alchemy (~700 − 1660)
    Well-defined, testable goal
    turn lead into gold
    Established theory
    four elements: earth, fire, water, air
    Methodical experiments and lab
    techniques
    (Jabir ibn Hayyan in 8th century)
    Wrong and ultimately unsuccessful,
    but led to modern chemistry.


  13. “Realistic” Threat
    Model for
    Adversarial ML
    13


  14. Attacker Access
    14
    White Box
    Attacker has the model: full knowledge of all parameters
    f(x) = f_n(f_{n−1}(… f_1(x)))
    Black Box
    “API Access”: attacker submits x, only receives the output f(x)
    Each model query is “expensive”

  15. ML-as-a-Service
    15


  16. Black-Box Attacks
    16
    PGD Attack
    x′_0 = x
    for T iterations:
        x′_{i+1} = project_{x,ε}(x′_i − α ⋅ sign(∇L(x′_i, t)))
    x′ = x′_T

    Can we execute these attacks if we don’t have the model?
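For concreteness, here is a minimal white-box PGD sketch, assuming a differentiable PyTorch model and cross-entropy loss; eps, alpha, and steps are illustrative values, not the settings used in the talk. This untargeted variant steps to increase the loss (the targeted form on the slide subtracts the signed gradient of a target-class loss instead).

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.3, alpha=0.01, steps=40):
    """Iterated signed-gradient steps, projected back into the L-infinity eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()           # ascend the loss
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project into the eps-ball
        x_adv = x_adv.clamp(0, 1)                              # stay a valid image
    return x_adv
```

As written, this needs gradients of the target model, which is exactly what the black-box setting takes away.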


  17. Black-Box Optimization Attacks
    17
    x → f(x)  (query access only)
    Black-Box Gradient Attack
    x′_0 = x
    for T iterations:
        use queries to estimate ∇L(x′_i, t)

  18. Black-Box Optimization Attacks
    18
    x → f(x)  (query access only)
    Black-Box Gradient Attack
    x′_0 = x
    for T iterations:
        use queries to estimate ∇L(x′_i, t)
        x′_{i+1} = take a step of the “white-box” attack using the estimated gradients
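A common way to do the estimation step is NES-style finite differences. A minimal sketch (query_loss is an assumed helper that returns the target model's scalar loss for an input via its API; sigma and n_samples are illustrative):

```python
import numpy as np

def estimate_gradient(query_loss, x, sigma=0.01, n_samples=50):
    """Estimate the loss gradient at x using 2 * n_samples queries."""
    grad = np.zeros_like(x, dtype=float)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape)                 # random probe direction
        # Antithetic finite differences around x.
        grad += (query_loss(x + sigma * u) - query_loss(x - sigma * u)) * u
    return grad / (2 * sigma * n_samples)
```

Each attack iteration then costs on the order of a hundred queries, which is why query budgets dominate the cost of black-box gradient attacks.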


  19. Black-Box Gradient Attacks
    19
    Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries.
    Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.


  20. Transfer Attacks
    20
    Target Model (external):  x* → f(x*) = t
    Local model: f_L
    x* = whiteBoxAttack(f_L, x)
    Adversarial examples against one model often transfer to another model.

  21. Improving Transfer Attacks
    22
    Target Model (external):  x* → f(x*) = t
    Local models: f_1, f_2, f_3
    x* = whiteBoxAttack(ensemble(f_1, f_2, f_3), x)
    Adversarial examples against several models are more likely to transfer.
    Yanpei Liu, Xinyun Chen, Chang Liu, Dawn Song [ICLR 2017]
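One simple way to attack an ensemble of local models (a sketch, assuming PyTorch models): average their losses and run the same white-box attack against the combined loss.

```python
import torch.nn.functional as F

def ensemble_loss(models, x, target):
    """Average cross-entropy of the target class over all local models."""
    return sum(F.cross_entropy(m(x), target) for m in models) / len(models)
```

This loss can be dropped into the PGD loop sketched earlier in place of the single-model loss.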


  22. Hybrid Attacks
    Transfer Attacks
    Efficient: only one API query
    Low success rates
    - 3% transfer rate for targeted
    attack on ImageNet (ensemble)
    Gradient Attacks
    Expensive: 10k+ queries/seed
    High success rates
    - 100% for targeted attack on
    ImageNet
    23
    Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries.
    Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.
    Combine both attacks: efficient + high success


  23. Hybrid Attack
    24
    Target Model (external):  x* → f(x*) = t
    Local models: f_1, f_2, f_3
    x* = whiteBoxAttack(ensemble(f_1, f_2, f_3), x)
    1: Transfer Attack

  24. Hybrid Attack
    25
    Target Model (external):  x* → f(x*) ≠ t
    Local models: f_1, f_2, f_3
    x* = whiteBoxAttack(ensemble(f_1, f_2, f_3), x)
    1: Transfer Attack
    2: Gradient Attack (starting from the transfer candidate x*)

  25. Hybrid Attack
    26
    Target Model (external):  x* → f(x*) ≠ t
    Local models: f_1, f_2, f_3
    x* = whiteBoxAttack(ensemble(f_1, f_2, f_3), x)
    1: Transfer Attack
    2: Gradient Attack (starting from the transfer candidate)
    3: Tune Local Models using label byproducts
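Putting the three steps together, schematically (all helper names here — white_box_attack, gradient_attack, fine_tune, target_api — are hypothetical stand-ins for illustration, not the authors' implementation):

```python
def hybrid_attack(seed, desired_label, target_api, local_models):
    # Step 1: transfer attack against the local ensemble.
    candidate = white_box_attack(local_models, seed, desired_label)
    if target_api.predict(candidate) == desired_label:
        return candidate                                    # direct transfer succeeded
    # Step 2: query-based gradient attack, seeded from the transfer candidate.
    adv, query_log = gradient_attack(target_api, start=candidate)
    # Step 3: recycle the queried labels to fine-tune the local models.
    fine_tune(local_models, query_log)
    return adv
```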


  26. 27
    Dataset / Model        Direct Transfer   Gradient Attack (AutoZOOM)       Hybrid Attack
                           Rate              Success Rate   Queries per AE    Success Rate   Queries per AE
    MNIST (Targeted)       61.6              90.9           1,645             98.8           298
    CIFAR10 (Targeted)     63.3              92.2           1,227             98.1           227
    ImageNet (Targeted)    3.4               95.4           45,166            98.0           30,089

  27. Realistic Adversary Model?
    Knowledge
    Only API access to target
    Good models for ensemble
    − pretrained models
    − (or access to similar training
    dataset, resources)
    Set of starting seeds
    Goals
    Find one adversarial example
    for each seed
    28
    Resources
    Unlimited number of API
    queries


  28. Batch Attacks
    Knowledge
    Only API access to target
    Good models for ensemble
    − pretrained models
    − (or access to similar training
    dataset, resources)
    Set of starting seeds
    Goals
    Find many seed/adversarial
    example pairs
    29
    Resources
    Limited number of API queries
    Prioritize seeds to attack: use resources to attack the low-cost seeds first
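A sketch of the resulting batch-attack loop under a query budget (estimate_cost and attack_one are assumed helpers; the cost estimate could be the number of PGD steps the local attack needed, or the target model's loss on the seed):

```python
def batch_attack(seeds, budget, estimate_cost, attack_one):
    results = []
    for seed in sorted(seeds, key=estimate_cost):   # attack cheapest-looking seeds first
        adv, queries_used = attack_one(seed, budget)
        budget -= queries_used
        if adv is not None:
            results.append((seed, adv))
        if budget <= 0:
            break
    return results
```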


  29. Requirements
    1. There is a high variance
    across seeds in the cost to
    find adversarial examples.
    2. There are ways to predict in
    advance which seeds will
    be easy to attack.
    30


  30. 31
    Variation in Query cost of NES Gradient Attack
    Excludes direct transfers


  31. Predicting the Low-Cost Seeds
    Strategy 1:
    Cost of local attack
    number of PGD steps to find
    local AE
    Strategy 2:
    Loss function on target
    32
    NES gradient attack on robust CIFAR-10 model


  32. What about Direct Transfers?
    Strategy 1:
    Cost of local attack
    number of PGD steps to find
    local AE
    Strategy 2:
    Loss function on target
    33


  33. Direct Transfers
    Cost of local attack
    number of PGD steps to find
    local AE
    34


  34. Two-Phase
    Hybrid Attack
    35
    Retroactive Optimal:
    unrealizable strategy that
    always picks lowest cost seed


  35. Two-Phase
    Hybrid Attack
    36
    Retroactive Optimal:
    unrealizable strategy that
    always picks lowest cost seed
    Phase 1: Find Direct Transfers
    (1000 queries to find 95 direct
    transfers)
    AutoZOOM attack on Robust CIFAR-10 Model


  36. Two-Phase
    Hybrid Attack
    37
    Retroactive Optimal:
    unrealizable strategy that
    always picks lowest cost seed
    Phase 1: Find Direct Transfers
    (1000 queries to find 95 direct
    transfers)
    AutoZOOM attack on Robust CIFAR-10 Model
    Phase 2: Gradient Attack
    (100,000 queries to find the
    next 95 adversarial examples)


  37. Cost of Hybrid Batch Attacks
    38
    Total Queries (Standard Error) to reach goal (fraction of seeds successfully attacked)

    Target Model        Prioritization   1%              2%               10%
    CIFAR-10 (Robust)   “Optimal”        10.0 (0.0)      20.0 (0.0)       107.8 (17.4)
    1000 seeds          Two-Phase        20.4 (2.1)      54.2 (5.6)       826.2 (226.6)
                        Random           24,054 (132)    45,372 (260)     251,917 (137)
    ImageNet            “Optimal”        1.0 (0.0)       2.0 (0.0)        34,949 (3,742)
    100 seeds           Two-Phase        28.0 (2.0)      38.6 (7.5)       78,844 (11,837)
                        Random           15,046 (423)    45,136 (1,270)   285,855 (8,045)

  38. How can we construct
    models that make it hard
    for adversaries to find
    adversarial examples?
    40


  39. Defense Strategies
    1. Hide the gradients
    41


  40. Defense Strategies
    1. Hide the gradients
    − Transferability results
    42
    Target Model:  x* → f(x*) = t
    Local model: f_L
    x* = whiteBoxAttack(f_L, x)

  41. Defense Strategies
    1. Hide the gradients
    − Transferability results
    43
    Target Model:  x* → f(x*) = t
    Local model: f_L
    x* = whiteBoxAttack(f_L, x)
    Maybe they can work against
    adversaries who don’t have
    access to training data or a similar
    model? (or when transfer loss is high)

  42. 44
    Visualization by Nicholas Carlini


  43. 45
    Visualization by Nicholas Carlini


  44. 46
    Visualization by Nicholas Carlini


  45. 47
    Visualization by Nicholas Carlini


  46. Defense Strategies
    1. Hide the gradients
    − Clever adversaries can still find adversarial examples
    48
    ICML 2018 (Best Paper award)


  47. Defense Strategies
    1. Hide the gradients
    − Transferability results
    − Clever adversaries can still find adversarial examples
    2. Build a robust classifier
    49
    Increase capacity


  48. Increasing Model Capacity
    50
    Image from Aleksander Mądry, et al. 2017


  49. Defense Strategies
    1. Hide the gradients
    − Transferability results
    − Clever adversaries can still find adversarial examples
    2. Build a robust classifier
    51
    Increase capacity
    Consider adversaries in training: adversarial training


  50. Adversarial Training (Example from Yesterday)
    52
    Training: Training Data → ML Algorithm → Clone
    Deployment: EvadeML
    Why didn’t this work?

  51. Adversarial Training
    Training Data → Training Process → Candidate Model f_i
    Adversarial Example Generator → Successful AEs against f_i
    → add to training data (with correct labels)
    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna,
    Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013

  52. Ensemble Adversarial Training
    Training Data → Training Process → Candidate Model f_i
    Adversarial Example Generator → Successful AEs against f_i
    → add to training data (with correct labels)
    Florian Tramer, et al. [ICLR 2018]

  53. Ensemble Adversarial Training
    Training Data → Training Process → Candidate Model f_i
    Adversarial Example Generator → Successful AEs against f_i
    → add to training data (with correct labels)
    Static Model f_1 → Adversarial Example Generator → AEs against f_1
    Static Model f_2 → Adversarial Example Generator → AEs against f_2
    Florian Tramer, et al. [ICLR 2018]

  54. Formalizing Adversarial Training
    56
    2017
    Regular training:
    min_θ E_{(x,y)∼D} [ ℒ(f_θ, x, y) ]

  55. Formalizing Adversarial Training
    57
    2017
    Regular training:
    min_θ E_{(x,y)∼D} [ ℒ(f_θ, x, y) ]
    Adversarial training:
    min_θ E_{(x,y)∼D} [ max_{δ∈Δ} ℒ(f_θ, x + δ, y) ]
    Simulate the inner maximization with a PGD attack
    with multiple restarts
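A minimal sketch of what one epoch of this looks like in practice, assuming a PyTorch model, optimizer, and data loader, and the pgd_attack helper sketched earlier (restarts omitted for brevity):

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, eps=0.3):
    model.train()
    for x, y in loader:
        # Inner maximization: approximate the worst-case perturbation with PGD.
        x_adv = pgd_attack(model, x, y, eps=eps)
        # Outer minimization: update the model on the adversarial examples.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```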


  56. Attacking Robust Models
    58
    Dataset / Model             Direct Transfer   Gradient Attack (AutoZOOM)       Hybrid Attack
                                Rate              Success Rate   Queries per AE    Success Rate   Queries per AE
    MNIST (Targeted)            61.6              90.9           1,645             98.8           298
    CIFAR10 (Targeted)          63.3              92.2           1,227             98.1           227
    MNIST-Robust (Untar’d)      2.9               7.2            52,182            7.3            51,328
    CIFAR10-Robust (Untar’d)    9.5               64.4           2,640             65.2           2,529
    Hybrid Batch Attacks: Finding Black-box Adversarial Examples with Limited Queries.
    Fnu Suya, Jianfeng Chi, David Evans, Yuan Tian. USENIX Security 2020.

  57. Defense Strategies
    1. Hide the gradients
    − Transferability results
    − Clever adversaries can still find adversarial examples
    2. Build a robust classifier
    − Adversarial retraining with increased model capacity
    Very expensive
    Assumes you can generate adversarial examples as well as adversary
    − If we could build a perfect model, we would!
    59


  58. Defense Strategies
    1. Hide the gradients
    − Transferability results
    − Clever adversaries can still find adversarial examples
    2. Build a robust classifier
    − Adversarial retraining, increasing model capacity, etc.
    − If we could build a perfect model, we would!
    60
    Our strategy: “Feature Squeezing”: reduce the
    search space available to the adversary
    Weilin Xu, David Evans, Yanjun Qi [NDSS 2018]


  59. Feature Squeezing Detection Framework
    Input → Model → Prediction_0
    Input → Squeezer 1 → Model → Prediction_1
    ...
    Input → Squeezer k → Model’ → Prediction_k
    T(Prediction_0, Prediction_1, …, Prediction_k) → Adversarial / Legitimate
    Weilin Xu    Yanjun Qi

  60. Feature Squeezing Detection Framework
    Input → Model → Prediction_0
    Input → Squeezer 1 → Model → Prediction_1
    ...
    Input → Squeezer k → Model’ → Prediction_k
    T(Prediction_0, Prediction_1, …, Prediction_k) → Adversarial / Legitimate
    A feature squeezer coalesces similar inputs into one point:
    • Barely changes legitimate inputs.
    • Destroys adversarial perturbations.

  61. Coalescing by Feature Squeezing
    63
    Metric Space 1: Target Classifier Metric Space 2: “Oracle”
    Before: find a small perturbation that changes class for classifier, but imperceptible to oracle.
    Now: change class for both original and squeezed classifier, but imperceptible to oracle.


  62. Fast Gradient Sign [Yesterday]
    64
    original   ε = 0.1   0.2   0.3   0.4   0.5
    Adversary Power: ε
    L∞-bounded adversary: max(abs(x′_i − x_i)) ≤ ε
    x′ = x − ε ⋅ sign(∇_x J(x, y))
    Goodfellow, Shlens, Szegedy 2014
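A one-step FGSM sketch in PyTorch (assuming a differentiable model). This is the common untargeted form that adds the signed gradient to increase the true-class loss; the slide writes it with a minus sign, and the sign convention depends on how the loss J and its label argument are defined.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.1):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    # Single signed-gradient step, clipped to the valid pixel range.
    return (x + eps * grad.sign()).clamp(0, 1).detach()
```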


  63. Bit Depth Reduction
    65
    [Figure: signal quantization curves for 8-bit, 3-bit, and 1-bit depth]
    Reduce to 1-bit: x_i′ = round(x_i × (2¹ − 1)) / (2¹ − 1) = round(x_i)
    Normal Example       X:  [0.012 0.571 …… 0.159 0.951]  →  [0. 1. …… 0. 1.]
    Adversarial Example  X*: [0.312 0.271 …… 0.159 0.651]  →  [0. 0. …… 0. 1.]
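The same squeezer for an arbitrary bit depth, as a small sketch (inputs assumed to be floats in [0, 1]):

```python
import numpy as np

def reduce_bit_depth(x, bits):
    """Quantize each feature to 2**bits levels in [0, 1]."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

# 1-bit reduction binarizes the input, matching the example above:
# reduce_bit_depth(np.array([0.012, 0.571, 0.159, 0.951]), 1)  ->  [0., 1., 0., 1.]
```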


  64. Bit Depth Reduction
    66
    [Figure: seed images and adversarial examples (CW_2, CW_∞, BIM, FGSM)
    shown at reduced bit depths]

  65. Accuracy with Bit Depth Reduction
    67
    Dataset    Squeezer      Adversarial Examples                           Legitimate Images
                             (FGSM, BIM, CW∞, DeepFool, CW2, CW0, JSMA)
    MNIST      None          13.0%                                          99.43%
               1-bit Depth   62.7%                                          99.33%
    ImageNet   None          2.78%                                          69.70%
               4-bit Depth   52.11%                                         68.00%

  66. Spatial Smoothing: Median Filter
    68
    Replace each pixel with the median of its neighbors.
    Effective in eliminating “salt-and-pepper” noise (L_0 attacks)
    3×3 Median Filter
    Image from https://sultanofswing90.wordpress.com/tag/image-processing/

  67. Spatial Smoothing: Non-local Means
    69
    Replace a patch p with a weighted mean of similar patches q_i (in a search region):
    p̂ = Σ_i w(p, q_i) × q_i
    Preserves edges, while removing noise.
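Both smoothers are available off the shelf; a sketch (the median filter via SciPy; the commented non-local-means call uses OpenCV with illustrative parameters, not necessarily the exact configurations reported later):

```python
import numpy as np
from scipy.ndimage import median_filter

def median_smooth(x, size=2):
    """Apply a size x size median filter to each channel of an HxWxC image."""
    return median_filter(x, size=(size, size, 1))

# Non-local means (requires OpenCV and uint8 images):
# import cv2
# smoothed = cv2.fastNlMeansDenoisingColored(img_uint8, None, h=4, hColor=4,
#                                            templateWindowSize=3, searchWindowSize=11)
```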


  68. 70
    Predictions (confidence) with and without squeezing:

                               Original          BIM (L∞)          JSMA (L0)
    No squeezing               Airplane 94.4%    Truck 99.9%       Automobile 56.5%
    Median Filter (2×2)        Airplane 98.4%    Airplane 99.9%    Ship 46.0%
    Non-local Means (13-3-4)   Airplane 98.3%    Airplane 80.8%    Airplane 70.0%

  69. Accuracy with Spatial Smoothing
    71
    Dataset    Squeezer                  Adversarial Examples                      Legitimate Images
                                         (FGSM, BIM, CW∞, DeepFool, CW2, CW0)
    ImageNet   None                      2.78%                                     69.70%
               Median Filter 2×2         68.11%                                    65.40%
               Non-local Means 11-3-4    57.11%                                    65.40%

  70. Other Potential Squeezers
    72
    C Xie, et al. Mitigating Adversarial Effects Through Randomization, ICLR 2018.
    J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial
    Examples, ICLR 2018.
    D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial
    Examples, CCS 2017; A Prakash, et al., Deflecting Adversarial Attacks with Pixel
    Deflection, CVPR 2018;...
    Thermometer Encoding (learnable bit depth reduction)
    Image denoising using autoencoder, wavelet, JPEG, etc.
    Image resizing
    ...
    Spatial Smoothers: median filter, non-local means


  71. Other Potential Squeezers
    73
    C Xie, et al. Mitigating Adversarial Effects Through Randomization, ICLR 2018.
    J Buckman, et al. Thermometer Encoding: One Hot Way To Resist Adversarial
    Examples, ICLR 2018.
    D Meng and H Chen, MagNet: a Two-Pronged Defense against Adversarial
    Examples, CCS 2017; A Prakash, et al., Deflecting Adversarial Attacks with Pixel
    Deflection, CVPR 2018;...
    Thermometer Encoding (learnable bit depth reduction)
    Image denoising using autoencoder, wavelet, JPEG, etc.
    Image resizing
    ...
    Spatial Smoothers: median filter, non-local means
    Anish Athalye, Nicholas Carlini, David Wagner.
    Obfuscated Gradients Give a False Sense of
    Security: Circumventing Defenses to
    Adversarial Examples. ICML 2018.


  72. “Feature Squeezing” (Vacuous) Conjecture
    For any distance-limited adversarial method,
    there exists some feature squeezer that
    accurately detects its adversarial examples.
    74
    Intuition: if the perturbation is small (in some simple
    metric space), there is some squeezer that coalesces
    original and adversarial example into same sample.


  73. Feature Squeezing Detection
    Input → Model (7-layer CNN) → Prediction_0
    Input → Bit Depth-1 → Model → Prediction_1
    Input → Median 2×2 → Model → Prediction_2
    max( ||Prediction_0 − Prediction_1||_1 , ||Prediction_0 − Prediction_2||_1 ) > T ?
    Yes → Adversarial;  No → Legitimate
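The detection score in code form, as a sketch (predict is assumed to return the model's softmax probability vector; squeezers is a list of squeezing functions like those above):

```python
import numpy as np

def squeezing_score(predict, x, squeezers):
    """Maximum L1 distance between the raw prediction and any squeezed prediction."""
    p0 = predict(x)
    return max(np.abs(p0 - predict(squeeze(x))).sum() for squeeze in squeezers)

def is_adversarial(predict, x, squeezers, threshold):
    return squeezing_score(predict, x, squeezers) > threshold
```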


  74. Detecting Adversarial Examples
    Distance between original input and its squeezed version
    Adversarial
    inputs
    (CW attack)
    Legitimate
    inputs


  75. 77
    Training a detector (MNIST):
    set the detection threshold to keep the false positive rate below target.
    [Histogram: number of examples vs. maximum L1 distance between original and
    squeezed input, for legitimate and adversarial inputs]
    threshold = 0.0029
    detection: 98.2%, FP < 4%
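Threshold selection only needs benign data; a sketch (legitimate_scores are squeezing scores computed on held-out legitimate inputs):

```python
import numpy as np

def choose_threshold(legitimate_scores, target_fpr=0.05):
    """Pick the score at the (1 - target_fpr) percentile of benign scores."""
    return float(np.percentile(legitimate_scores, 100 * (1 - target_fpr)))
```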


  76. ImageNet Configuration
    Input → Model (MobileNet) → Prediction_0
    Input → Bit Depth-5 → Model → Prediction_1
    Input → Median 2×2 → Model → Prediction_2
    Input → Non-local Means → Model → Prediction_3
    max( ||Prediction_0 − Prediction_i||_1 for i ∈ {1, 2, 3} ) > T ?
    Yes → Adversarial;  No → Legitimate

  77. 79
    Training a detector (ImageNet)
    [Histogram: number of examples vs. maximum L1 distance between original and
    squeezed input, for legitimate and adversarial inputs]
    threshold = 1.24
    detection: 85%, FP < 5%

  78. Aggregated Detection Results
    80
    Dataset     Squeezers                          Threshold   False Positive   Detection Rate   ROC-AUC
                                                               Rate             (SAEs)           (Exclude FAEs)
    MNIST       Bit Depth (1-bit), Median (2×2)    0.0029      3.98%            98.2%            99.44%
    CIFAR-10    Bit Depth (5-bit), Median (2×2),   1.1402      4.93%            84.5%            95.74%
                Non-local Means (13-3-2)
    ImageNet    Bit Depth (5-bit), Median (2×2),   1.2128      8.33%            85.9%            94.24%
                Non-local Means (11-3-4)

  79. Threat Models
    Oblivious attack: The adversary has full knowledge of
    the target model, but is not aware of the detector.
    Adaptive attack: The adversary has full knowledge of
    the target model and the detector.
    81


  80. Adaptive Adversary
    82
    Adaptive CW_2 attack, unbounded adversary:
    minimize  ℓ(F(x′), t)  +  c ⋅ Δ(x, x′)  +  k ⋅ L2score(x′)
              misclassification term   distance term   detection term
    Warren He, James Wei, Xinyun Chen, Nicholas Carlini, Dawn Song,
    Adversarial Example Defense: Ensembles of Weak Defenses are
    not Strong, USENIX WOOT ’17.

  81. Adaptive Adversarial Examples
    83
    Attack                     Mean L2
    Untargeted                 2.80
    Targeted (next)            4.14
    Targeted (least likely)    4.67
    No successful adversarial examples were found
    for images originally labeled as 3 or 8.

  82. Adaptive Adversary Success Rates
    84
    [Chart: adversary’s success rate vs. clipped ε, for Untargeted, Targeted (Next),
    and Targeted (LL) attacks; unbounded success rates (0.68, 0.44, 0.24) fall to
    0.06, 0.01, 0.01 once the perturbation is clipped to a typical ε]

  83. Input → Model → Prediction_0
    Input → Squeezer 1 → Model → Prediction_1
    Input → Squeezer 2 → Model → Prediction_2
    ...
    Input → Squeezer k → Model’ → Prediction_k
    T(Prediction_0, Prediction_1, …, Prediction_k) → Adversarial / Legitimate
    Defender’s Entropy Advantage: random seed

  84. Counter Measure: Randomization
    86
    Binary filter threshold := 0.5   →   threshold := N(0.5, 0.0625)
    [Plots: deterministic step at 0.5 vs. randomized threshold]
    Strengthen the adaptive adversary:
    attack an ensemble of 3 detectors with thresholds [0.4, 0.5, 0.6]

  85. 87
    Mean L2 of adaptive attacks:
                     Attack Deterministic Detector   Attack Randomized Detector
    Untargeted       2.80                            3.63
    Targeted-Next    4.14                            5.48
    Targeted-LL      4.67                            5.76

  86. Are defenses against
    adversarial examples
    even possible?
    88


  87. (Redefining) Adversarial Example
    89
    Prediction Change Definition:
    An input, x′ ∈ X, is an adversarial example for x ∈ X, iff
    ∃x′ ∈ Ball_ε(x) such that F(x) ≠ F(x′).

  88. Adversarial Example
    90
    Prediction Change Definition:
    An input, x′ ∈ X, is an adversarial example for x ∈ X, iff
    ∃x′ ∈ Ball_ε(x) such that F(x) ≠ F(x′).
    Ball_ε(x) is some space around x, typically defined in
    some (simple!) metric space:
    L_0 norm (# different), L_2 norm (“Euclidean distance”), L_∞
    Without constraints on Ball_ε, every input has adversarial examples.

  89. Adversarial Example
    91
    Prediction Change Definition:
    An input, x′ ∈ X, is an adversarial example for x ∈ X, iff
    ∃x′ ∈ Ball_ε(x) such that F(x) ≠ F(x′).
    Any non-trivial model has adversarial examples:
    ∃x_a, x_b ∈ X. F(x_a) ≠ F(x_b)

  90. Prediction Error Robustness
    92
    Error Robustness:
    An input, x′ ∈ X, is an adversarial example for (correct) x ∈ X, iff
    ∃x′ ∈ Ball_ε(x) such that F(x′) ≠ true label for x′.
    A perfect classifier has no (error robustness) adversarial examples.

  91. Prediction Error Robustness
    93
    Error Robustness:
    An input, x′ ∈ X, is an adversarial example for (correct) x ∈ X, iff
    ∃x′ ∈ Ball_ε(x) such that F(x′) ≠ true label for x′.
    A perfect classifier has no (error robustness) adversarial examples.
    If we had a way to know the true label, we wouldn’t need an ML classifier.

  92. Global Robustness Properties
    94
    Adversarial Risk: probability an input has an adversarial example
    Pr_{x ← D} [ ∃x′ ∈ Ball_ε(x). F(x′) ≠ class(x′) ]
    Dimitrios I. Diochnos, Saeed Mahloujifar,
    Mohammad Mahmoody, NeurIPS 2018

  93. Global Robustness Properties
    95
    Adversarial Risk: probability an input has an adversarial example
    Pr_{x ← D} [ ∃x′ ∈ Ball_ε(x). F(x′) ≠ class(x′) ]
    Error Region Robustness: expected distance to the closest adversarial example
    E_{x ← D} [ inf { ε : ∃x′ ∈ Ball_ε(x). F(x′) ≠ class(x′) } ]
    Dimitrios I. Diochnos, Saeed Mahloujifar,
    Mohammad Mahmoody, NeurIPS 2018

  94. Recent Global Robustness Results
    Properties of any model for these input spaces: the distance to an adversarial example
    is small relative to the expected distance between two sampled points.

    Adversarial Spheres [Gilmer et al., 2018]
      Assumption: uniform distribution on two concentric n-spheres.
      Key result: expected safe distance (L_2 norm) is relatively small.

    Adversarial vulnerability for any classifier [Fawzi ×3, 2018]
      Assumption: smooth generative model — (1) Gaussian in latent space, (2) generator is L-Lipschitz.
      Key result: adversarial risk → 1 for relatively small attack strength (L_2 norm):
      P(r(x) ≤ η) ≥ 1 − √(π/2) ⋅ e^(−η²/2L²)

    Curse of Concentration in Robust Learning [Mahloujifar et al., 2018]
      Assumption: normal Lévy families (unit sphere, uniform, L_2 norm;
      Boolean hypercube, uniform, Hamming distance; ...).
      Key result: if attack strength exceeds a relatively small threshold, adversarial risk ≥ 1/2:
      b ≥ √(log(k₁/ε) / (k₂ ⋅ n))  ⟹  Risk_b(h, c) ≥ 1/2

  95. Prediction Change Robustness
    97
    Prediction Change:
    An input, x′ ∈ X, is an adversarial example for x ∈ X, iff
    ∃x′ ∈ Ball_ε(x) such that F(x′) ≠ F(x).
    Any non-trivial model has adversarial examples:
    ∃x_a, x_b ∈ X. F(x_a) ≠ F(x_b)
    Solutions:
    - only consider in-distribution inputs (“good” seeds)
    - output isn’t just a class (e.g., confidence)
    - targeted adversarial examples:
      cost-sensitive adversarial robustness

  96. Local (Instance) Robustness
    98
    Robust Region: For an input x, the robust region is the
    maximum region with no adversarial example:
    sup { ε > 0 | ∀x′ ∈ Ball_ε(x), F(x′) = F(x) }

  97. Local (Instance) Robustness
    99
    Robust Region: For an input x, the robust region is the
    maximum region with no adversarial example:
    sup { ε > 0 | ∀x′ ∈ Ball_ε(x), F(x′) = F(x) }
    Robust Error: For a test set T and bound ε*:
    |{ x ∈ T : RobustRegion(x) < ε* }| / |T|
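Once a verifier has produced a robust radius for each test input, the robust error is just a fraction; a sketch (robust_radii would come, e.g., from the MILP formulation described below):

```python
def robust_error(robust_radii, eps_star):
    """Fraction of test inputs whose robust region is smaller than eps_star."""
    return sum(r < eps_star for r in robust_radii) / len(robust_radii)
```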


  98. Instance Defense-Robustness
    100
    For an input x, the robust-defended region is the maximum
    region with no undetected adversarial example:
    sup { ε > 0 | ∀x′ ∈ Ball_ε(x), F(x′) = F(x) ∨ detected(x′) }
    Defense Failure: For a test set T and bound ε*:
    |{ x ∈ T : RobustDefendedRegion(x) < ε* }| / |T|
    Can we verify a defense?

  99. Formal Verification of a Defense Instance
    Exhaustively check all inputs x′ ∈ Ball_ε(x)
    for correctness or detection.
    Need to transform the model into a
    function amenable to verification.

  100. Linear Programming
    Find values of x that minimize a linear
    function under linear constraints:
    minimize    c₁x₁ + c₂x₂ + c₃x₃ + …
    subject to  a₁₁x₁ + a₁₂x₂ + ⋯ ≤ b₁
                a₂₁x₁ + a₂₂x₂ + ⋯ ≤ b₂
                x₃ ≤ 0
                ...
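A tiny LP solved with SciPy, just to make the formulation concrete (coefficients are illustrative):

```python
from scipy.optimize import linprog

c = [1.0, 2.0]                    # minimize  x1 + 2*x2
A_ub = [[-1.0, 1.0], [1.0, 1.0]]  # subject to  -x1 + x2 <= 1,  x1 + x2 <= 4
b_ub = [1.0, 4.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x)                      # optimal assignment (here [0, 0])
```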


  101. Encoding a Neural Network
    103
    Linear components (y = Wx + b):
    Convolutional Layer
    Fully-connected Layer
    Batch Normalization (in test mode)
    Non-linear components:
    Activation (ReLU, Sigmoid, Softmax)
    Pooling Layer (max, avg)

  102. Encode ReLU
    Mixed Integer Linear Programming
    adds discrete variables to LP.
    ReLU (Rectified Linear Unit): y = max(0, x), piecewise linear.
    With pre-activation bounds l ≤ x ≤ u and an indicator a ∈ {0, 1}:
    y ≥ x
    y ≥ 0
    y ≤ x − l(1 − a)
    y ≤ u ⋅ a
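A sketch of this encoding using the PuLP MILP library (purely illustrative; MIPVerify itself, named on the next slide, is a Julia tool). The bounds l, u on the pre-activation are assumed to come from some earlier analysis:

```python
from pulp import LpProblem, LpVariable, LpMinimize

def encode_relu(prob, x, name, l, u):
    """Add variables and constraints so that y == max(0, x) at any feasible point."""
    y = LpVariable(f"{name}_out", lowBound=0)      # lowBound=0 encodes y >= 0
    a = LpVariable(f"{name}_ind", cat="Binary")    # a = 1 iff the unit is active
    prob += y >= x
    prob += y <= x - l * (1 - a)
    prob += y <= u * a
    return y

prob = LpProblem("relu_demo", LpMinimize)
x = LpVariable("x", lowBound=-1.0, upBound=2.0)
y = encode_relu(prob, x, "relu1", l=-1.0, u=2.0)
```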


  103. Mixed Integer Linear Programming (MILP)
    Intractable in theory (NP-Complete)
    Efficient in practice
    (e.g., Gurobi solver)
    MIPVerify
    Vincent Tjeng, Kai Xiao, Russ Tedrake
    Verify NNs using MILP


  104. Encode Feature Squeezers
    Binary Filter: step from 0 to 1 at threshold 0.5 (lower semi-continuous).
    Actual input: uint8 values [0, 1, 2, … 254, 255], scaled to [0, 1]:
    127 / 255 = 0.498
    128 / 255 = 0.502
    so [0.499, 0.501] is an infeasible gap.

  105. Verified L∞ Robustness
    Model                      Test Accuracy   Robust Error    Robust Error with
                                               (ε = 0.1)       Binary Filter
    Raghunathan et al.         95.82%          14.36%–30.81%   7.37%
    Wong & Kolter              98.11%          4.38%           4.25%
    Ours with binary filter    98.94%          2.66%–6.63%     –
    Even without detection, this helps!

  106. Encode Detection Mechanism
    Original version:
    score(x) = ‖ f(x) − f(squeeze(x)) ‖₁   where f(x) is the softmax output
    Simplify for verification:
    L₁ → maximum difference
    softmax → multiple piecewise-linear approximate sigmoids

  107. Preliminary Experiments
    109
    Input x′ → Model (4-layer CNN) → y₁
    Input x′ → Bit Depth-1 → Model → y₂
    max_diff(y₁, y₂) > T ?  Yes → Adversarial;  No → valid
    Verification: for a seed x, there is no adversarial input x′ ∈ Ball_ε(x)
    for which the prediction differs from F(x) and x′ is not detected.
    Adversarially robust retrained [Wong & Kolter] model
    1000 test MNIST seeds, ε = 0.1 (L∞)
    970 infeasible (verified no adversarial example)
    13 misclassified (original seed)
    17 vulnerable
    Robust error: 0.3%
    Verification time ~0.2s
    (compared to 0.8s without binarization)

  108. 110
    Scalability vs. evaluation metric:
    Formal Verification — MILP solver (MIPVerify), SMT solver (Reluplex),
      interval analysis (ReluVal); metric: precise robust error
    Heuristic Defenses — distillation (Papernot et al., 2016), gradient obfuscation,
      adversarial retraining (Madry et al., 2017), feature squeezing;
      metric: attack success rate (for a set of attacks)
    Certified Robustness — CNN-Cert (Boopathy et al., 2018), Dual-LP (Kolter & Wong 2018),
      Dual-SDP (Raghunathan et al., 2018); metric: bound

  109. Realistic Threat Models
    Knowledge
    Full access to target
    Goals
    Find many seed/adversarial
    example pairs
    111
    Resources
    Limited number of API queries
    Limited computation
    It matters which seed
    and target classes


  110. 112
    Original Model (no robustness training)
    [Heatmap: seed class vs. target class]
    MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units)
    ε = 0.2, L∞

  111. 113
    Original Model (no robustness training)
    [Heatmap: seed class vs. target class]
    MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units)
    ε = 0.2, L∞

  112. Training a Robust Network
    Replace the loss with a differentiable function based on an
    outer bound, computed using a dual network.
    ReLU (Rectified Linear Unit) → linear approximation over [l, u]
    Eric Wong and J. Zico Kolter. Provable defenses against adversarial
    examples via the convex outer adversarial polytope. ICML 2018.

  113. 115
    Standard Robustness Training (overall robustness goal)
    [Heatmap: seed class vs. target class]
    MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units)
    ε = 0.2, L∞

  114. Cost-Sensitive Robustness Training
    116
    Cost matrix: cost of different adversarial transformations
    C =              benign   malware
        benign         −        0
        malware        1        −
    Incorporate the cost matrix into robustness training.
    Xiao Zhang and David Evans [ICLR 2019]
    Xiao Zhang
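As a concrete (hypothetical) sketch, the cost matrix can be written as an array and used to weight per-class-pair robustness terms; the exact way the paper folds this into the certified training objective differs in detail:

```python
import numpy as np

# Rows: seed class, columns: target class. Only malware -> benign evasion is costly.
cost = np.array([[0.0, 0.0],   # from benign
                 [1.0, 0.0]])  # from malware

def cost_weighted_robust_loss(pairwise_robust_loss):
    """pairwise_robust_loss[i, j]: robustness loss for seed class i pushed toward class j."""
    return float((cost * pairwise_robust_loss).sum())
```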


  115. 117
    Standard Robustness Training (overall robustness goal)
    [Heatmap: seed class vs. target class]
    MNIST Model: 2 convolutional layers, 2 fully-connected layers (100, 10 units)
    ε = 0.2, L∞

  116. 118
    Cost-Sensitive Robustness Training: protect odd classes from evasion
    [Heatmap: seed class vs. target class]

  117. 119
    Cost-Sensitive Robustness Training: protect even classes from evasion
    [Heatmap: seed class vs. target class]

  118. History of the
    destruction
    of Troy, 1498
    Wrap-Up


  119. Security State-of-the-Art
    121
                     Attack success probability   Threat models                              Proofs
    Cryptography     2^-128                       information theoretic, resource bounded    required
    Considered seriously broken if the attack success probability
    increases even slightly, even when the attack requires an
    astronomical number of ciphertexts.

  120. Security State-of-the-Art
    122
                       Attack success probability    Threat models                              Proofs
    Cryptography       2^-128                        information theoretic, resource bounded    required
    System Security    2^-k, for much smaller k      capabilities, motivations, rationality     common
    Considered seriously broken if an attack can succeed
    in a “lab” environment, even with small probability.

  121. Security State-of-the-Art
    123
                                   Attack success probability    Threat models                              Proofs
    Cryptography                   2^-128                        information theoretic, resource bounded    required
    System Security                2^-k, for much smaller k      capabilities, motivations, rationality     common
    Adversarial Machine Learning   2^-1                          artificially limited adversary             making progress!
    Considered broken only if an attack method succeeds
    with high probability (around 2^-1).

  122. Security State-of-the-Art
    124
                                   Attack success probability    Threat models                              Proofs
    Cryptography                   2^-128                        information theoretic, resource bounded    required
    System Security                2^-k, for much smaller k      capabilities, motivations, rationality     common
    Adversarial Machine Learning   2^-1                          artificially limited adversary             making progress!
    Huge gaps to close:
    threat models are unrealistic (but real threats unclear)
    verification techniques only work for tiny models
    experimental defenses often (quickly) broken

  123. Tomorrow:
    Privacy
    125
    David Evans
    University of Virginia
    [email protected]
    https://www.cs.virginia.edu/evans
