
Architecture Support for Robust Deep Learning: Exploiting Software 1.0 Techniques to Defend Software 2.0

Yuhao Zhu
October 28, 2020

Transcript

  1. Research areas: Algorithm & Software; Hardware Architecture; Architecture + X.
     Web & Cloud ▹ [ISCA 2019] ▹ [PLDI 2016] ▹ [HPCA 2015] ▹ [HPCA 2013] ▹ [HPCA 2016] ▹ [MICRO 2015] ▹ [ISCA 2014]
     Robust ML ▹ [CVPR 2020] ▹ [CVPR 2019] ▹ [ICLR 2019] ▹ [MICRO 2020]
     Visual Computing ▹ [MICRO 2020] ▹ [ISCA 2019] ▹ [ISCA 2018] ▹ [IROS 2020] ▹ [CVPR 2019] ▹ [MICRO 2019] ▹ [FPGA 2020]
  2. Deep Learning Isn’t Robust
     ‣ “Hidden person”: an adversarial example causes object mis-detection, so the person goes undetected.
     https://www.vox.com/2017/9/12/16294510/fatal-tesla-crash-self-driving-elon-musk-autopilot
  3. Software 1.0 vs. Software 2.0
     ‣ Software 1.0: explicit instructions with explicit logic.
     ‣ Software 2.0: neural networks as self-written programs.
     [Figure: a control-flow graph instrumented with path counters (r=0; count[r]++), from Efficient Path Profiling, MICRO 1996, shown next to a DNN. A benign input and an adversarial input each exercise a specific path: through the program in Software 1.0, through the network in Software 2.0.]
  4. Exploiting Dynamic Behaviors of a DNN
     ‣ Software 1.0 uses program “hot paths” for:
     ▹ Profile-guided optimizations
     ▹ Feedback-driven optimizations
     ▹ Tracing JIT compilation
     ▹ Dynamic dead-code elimination
     ▹ Run-time accelerator offloading
     ▹ Dynamic remote offloading
     ▹ …
     ‣ Can we exploit DNN paths to detect and defend against adversarial attacks?
     ✓ Inference-time detection ✓ Accurate detection
  5. Defining Activation Path in a DNN
     ‣ Loosely, an activation path is the collection of important neurons and connections that contribute significantly to the inference output.
     ‣ An activation path is input-specific, just like a program path.
     [Figure: the activation path for a benign input vs. the activation path for an adversarial input.]
  6. Extracting Activation Paths
     ‣ Extraction is necessarily a backward process, starting from the last layer.
     ▹ The last layer Ln has only one important neuron, n.
     ▹ The important neurons in Ln-1 are given by the minimal set of neurons whose cumulative partial sums contribute at least a fraction θ of n’s value, where 0 ≤ θ ≤ 1. The definition is recursive.
     [Worked example (fully-connected layer): the output neuron 0.46 = 0.1×2.1 + 1.0×0.09 + 0.4×0.2 + 0.3×0.2 + 0.2×0.1 is the important neuron carried over from the layer behind it. With θ = 0.5, the input neurons 0.1 and 1.0 become the important neurons at the current layer, since their partial sums satisfy 0.21 + 0.09 ≥ 0.5 × 0.46.]
  7. Extracting Activation Paths
     ‣ Important neuron extraction in a fully-connected layer: 0.46 = 0.1×2.1 + 1.0×0.09 + 0.4×0.2 + 0.3×0.2 + 0.2×0.1, and 0.1×2.1 + 1.0×0.09 > 0.6×0.46 assuming θ = 0.6, so the important neurons identified in the current layer are 1.0 and 0.1 (the important neuron in the output layer, identified before, is 0.46).
     ‣ Important neuron extraction in a convolution layer: the output feature map (OFMap) neuron 5.47 = 2.0×0.7 + 1.4×0.9 + 1.5×0.8 + 1.0×0.9 + ……, and 2.0×0.7 + 1.4×0.9 + 1.5×0.8 > 0.6×5.47 assuming θ = 0.6, so the important neurons identified in the current layer are 2.0, 1.4, and 1.5 (the important neuron in the OFMap, identified before, is 5.47). A code sketch of this rule follows.
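To make the extraction rule concrete, here is a minimal NumPy sketch of cumulative-threshold extraction for one fully-connected layer, using the slide’s numbers. The function and variable names are mine rather than Ptolemy’s implementation, and negative partial sums are ignored for simplicity.

```python
import numpy as np

def important_neurons_fc(inputs, weights, out_idx, theta):
    """Return the minimal set of input-neuron indices whose partial sums
    (input * weight) cover at least a theta fraction of the output value."""
    psums = inputs * weights[:, out_idx]   # partial sums feeding neuron out_idx
    out_val = psums.sum()                  # the output neuron's value
    keep, acc = [], 0.0
    for i in np.argsort(psums)[::-1]:      # largest contributions first
        keep.append(int(i))
        acc += psums[i]
        if acc >= theta * out_val:         # coverage reached: minimal set found
            break
    return keep

# The slide's example: 0.46 = 0.1*2.1 + 1.0*0.09 + 0.4*0.2 + 0.3*0.2 + 0.2*0.1
inputs = np.array([0.1, 1.0, 0.4, 0.3, 0.2])
weights = np.array([[2.1], [0.09], [0.2], [0.2], [0.1]])
print(important_neurons_fc(inputs, weights, out_idx=0, theta=0.6))  # -> [0, 1]
```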
  8. Extracting Activation Paths
     ‣ θ controls the differentiability of an activation path, i.e., how many important neurons are included in it.
     ▹ θ = 1 includes all neurons.
  9. Representing Activation Paths
     ‣ An activation path (for an input) is represented by a bit mask, e.g., 0 0 1 0 1 1 0 0 0 1 0 0 1 0.
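A one-function sketch of this encoding (the boolean-array form is my illustration; in the hardware the masks live in accelerator SRAM):

```python
import numpy as np

def path_mask(important_neurons, num_neurons):
    """Encode an activation path as a bit mask: 1 marks an important neuron."""
    mask = np.zeros(num_neurons, dtype=bool)
    mask[important_neurons] = True
    return mask

# Reproduces the slide's example mask.
print(path_mask([2, 4, 5, 9, 12], 14).astype(int))
# [0 0 1 0 1 1 0 0 0 1 0 0 1 0]
```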
  10. From Input Paths to Class Paths
     ‣ A class path for a class c aggregates (e.g., bitwise-ORs) the activation paths of different inputs that are correctly predicted as class c: P_c = ⋃_{x ∈ x̄_c} P(x), where P(x) is the activation path of a correctly predicted benign input x and P_c is the class path for class c.
     ‣ Class paths are generated from a training dataset. (A sketch of the aggregation follows.)
     [Example class path: 1 0 1 0 1 1 0 0 1 1 0 0 1 0.]
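A sketch of the aggregation, assuming paths are the bit masks above; p2 is a hypothetical second input of the same class, chosen so the union reproduces the slide’s class path.

```python
import numpy as np

def class_path(input_paths):
    """P_c = union (bitwise OR) of the activation paths of all inputs
    correctly predicted as class c."""
    return np.any(np.stack(input_paths), axis=0)

p1 = np.array([0,0,1,0,1,1,0,0,0,1,0,0,1,0], dtype=bool)  # path from slide 9
p2 = np.array([1,0,1,0,0,1,0,0,1,1,0,0,0,0], dtype=bool)  # hypothetical input
print(class_path([p1, p2]).astype(int))
# [1 0 1 0 1 1 0 0 1 1 0 0 1 0]  -- the class path shown on this slide
```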
  11. Path Similarities
     ‣ Class paths of different classes are very different, as measured by the Jaccard index of the class paths of c1 and c2: S = |P_c1 ∩ P_c2| / |P_c1 ∪ P_c2|.
     [Charts: pairwise class-path similarities for ResNet18 @ CIFAR-10 and for AlexNet @ ImageNet (10 random classes).]
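The metric is a straight Jaccard index over bit masks; a direct NumPy rendering (the helper name is mine):

```python
import numpy as np

def jaccard(path_a, path_b):
    """S = |P_c1 ∩ P_c2| / |P_c1 ∪ P_c2| for two paths given as bit masks."""
    intersection = np.logical_and(path_a, path_b).sum()
    union = np.logical_or(path_a, path_b).sum()
    # Low pairwise similarity across classes is what makes class paths
    # discriminative enough to support detection.
    return intersection / union
```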
  12. Path Similarities
     ‣ Activation paths of benign inputs are similar to their corresponding class path.
     ‣ Activation paths of adversarial inputs are different from the class path.
     ‣ The similarity is S = |P(x) ∩ P_c| / |P(x)|, where P(x) is the activation path of input x (x is not used in generating P_c) and P_c is the class path.
     [Charts (LeNet): similarity distributions separate benign inputs from adversarial inputs generated by different attacks.]
  13. Basic Idea of the Detection Algorithm
     ‣ For an input x that is predicted as class c: if the activation path of x does not resemble the class path of c, x is likely an adversarial sample.
     ‣ Offline: class path construction from the training data. Online: inference plus activation path extraction, followed by adversarial classification of the extracted path against the class path.
     ‣ The final classifier is a random forest: lightweight, and it works effectively. (A sketch of the pipeline follows.)
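A sketch of the offline/online split, assuming scikit-learn’s RandomForestClassifier as the final classifier; the per-layer similarities (the S^l metric from the backup slide) serve as features, and extract_path, class_paths, and the training matrices are placeholders, not Ptolemy’s actual interfaces.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def similarity_features(input_path, class_path):
    """One feature per layer: S^l = |P(x)^l ∩ P_c^l| / |P(x)^l|."""
    return np.array([np.logical_and(p, pc).sum() / max(p.sum(), 1)
                     for p, pc in zip(input_path, class_path)])

detector = RandomForestClassifier(n_estimators=100)
# Offline: fit on similarity features of known benign (0) / adversarial (1) inputs.
#   detector.fit(X_train, y_train)
# Online: input x is predicted as class c; flag it as adversarial if its
# activation path does not resemble the class path of c.
#   feats = similarity_features(extract_path(x), class_paths[c])
#   is_adversarial = bool(detector.predict(feats.reshape(1, -1))[0])
```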
  14. Run-time Overhead
     ‣ Forward inference (layer 1 → N) is followed by backward path extraction (layer N → 1): during inference, compute and store the partial sums; during extraction, rank the neurons, read the partial sums back, and construct the path.
     [Charts: memory overhead (log scale) for LeNet, AlexNet, and ResNet18; operation overhead (%) as the threshold θ sweeps from 0.9 down to 0.1.]
     ‣ A pure software implementation incurs 15.4× (AlexNet) and 50.7× (ResNet) overhead.
  15. Hiding Cost: Extraction Direction
     ‣ Extract in the forward direction, so the important neurons are determined locally as inference proceeds.
     ‣ Pros: extraction overlaps with inference. Cons: locally important neurons aren’t necessarily eventually important.
  16. Reducing Cost: Selective Extraction + Thresholding
     ‣ Selective extraction: extract paths from only a subset of the layers (see early termination and late start below).
     ‣ Cumulative threshold: for 0.46 = 0.1×2.1 + 1.0×0.09 + 0.4×0.2 + 0.3×0.2 + 0.2×0.1, sort the partial sums and accumulate the largest ones until they cover the θ fraction of the output.
     ‣ Absolute threshold: keep any partial sum larger than a fixed value, with no sorting. With threshold 0.1: 0.1×2.1 > 0.1 (kept); 1.0×0.09 < 0.1, 0.4×0.2 < 0.1, 0.3×0.2 < 0.1, and 0.2×0.1 < 0.1 (dropped). Both rules are sketched below.
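A side-by-side NumPy sketch of the two thresholding rules on the slide’s partial sums (function names are mine):

```python
import numpy as np

def cumulative_threshold(psums, theta):
    """Keep the largest partial sums until they cover a theta fraction
    of the output value (requires a sort)."""
    keep, acc, cutoff = [], 0.0, theta * psums.sum()
    for i in np.argsort(psums)[::-1]:
        keep.append(int(i))
        acc += psums[i]
        if acc >= cutoff:
            break
    return keep

def absolute_threshold(psums, t):
    """Keep any partial sum above a fixed value t (no sort needed)."""
    return np.flatnonzero(psums > t).tolist()

psums = np.array([0.21, 0.09, 0.08, 0.06, 0.02])  # the slide's partial sums
print(cumulative_threshold(psums, 0.6))  # -> [0, 1]
print(absolute_threshold(psums, 0.1))    # -> [0]
```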
  17. Hardware Support
     ‣ Nothing special, nothing outrageous: a minimal extension to a conventional NPU.
     [Block diagram: the DNN accelerator’s SRAM additionally holds partial sums and masks; each PE is extended with a mask-generation comparator (psum >? threshold) and a MUX that emits a 0/1 mask bit to SRAM; a path constructor (sort-and-merge and accumulate units, with SRAM for partial sums, partial masks, and masks) builds the paths; a controller (SRAM for code and paths) coordinates with DRAM.]
  18. Mapping an Algorithmic Variant to Hardware
     ‣ Key: statically schedule the execution.
     ‣ Forward extraction:
         for j = 1 to L { inf(j); <extraction for j> }
       Software pipelining (overlap inference of layer j+1 with extraction of layer j):
         inf(1)
         for j = 1 to L-1 { inf(j+1); <extraction for j> }
         <extraction for L>
     ‣ Cumulative thresholding:
         for i = 1 to N { ldpsum(i); sort(i); acum(i) }
       Software pipelining:
         ldpsum(1); sort(1)
         for i = 1 to N-1 { ldpsum(i+1); sort(i+1); acum(i) }
         acum(N)
     ‣ Recompute rather than store the partial sums (only 5% of the partial sums are later used):
         csps(1); sort(1)
         for i = 1 to N-1 { csps(i+1); sort(i+1); acum(i) }
         acum(N)
       A sketch of the pipelined schedule follows.
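To make the static schedule concrete, a small Python sketch that emits the pipelined schedule for the cumulative-thresholding loop; the op names follow the slide, and each list is one statically scheduled step (the grouping into steps is my illustration).

```python
def pipelined_schedule(n):
    """Software-pipelined cumulative thresholding: the load-and-sort of
    neuron i+1 overlaps the accumulate of neuron i."""
    steps = [["ldpsum(1)", "sort(1)"]]                       # prologue
    for i in range(1, n):
        steps.append([f"ldpsum({i+1})", f"sort({i+1})", f"acum({i})"])
    steps.append([f"acum({n})"])                             # epilogue
    return steps

for step in pipelined_schedule(3):
    print(step)
# ['ldpsum(1)', 'sort(1)']
# ['ldpsum(2)', 'sort(2)', 'acum(1)']
# ['ldpsum(3)', 'sort(3)', 'acum(2)']
# ['acum(3)']
```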
  19. Putting It Together
     ‣ Algorithm knobs: extraction direction (hide cost), selective extraction and thresholding mechanism (reduce cost); offline profiling and extraction over DNN models and legitimate training samples yields the class paths.
     ‣ Framework: a programming interface (output = Inference(); foreach Layer { ExtractImptNeurons(); GenMask() }; return Classify()), compiler optimizations (layer-level pipelining, neuron-level pipelining, compute-memory trade-off), and an ISA (e.g., .set rfsize 0x200; mov r3, rfsize; findrf r4, r1; sort r1, r3, r6; acum r6, r1, r5).
     ‣ Hardware architecture: a memory-augmented DNN accelerator with a programmable path extractor.
  20. Evaluation Setup
     ‣ Hardware: a cycle-level simulator parameterized with synthesis results from an RTL implementation (Silvaco’s Open-Cell 15nm library).
     ‣ Baselines: EP [CVPR 2019], CDRP [CVPR 2018], DeepFense [ICCAD 2018].
     ‣ Datasets: ImageNet, CIFAR-100. Networks: ResNet18, AlexNet, VGG.
     ‣ Attacks:
     ▹ BIM, CWL2, DeepFool, FGSM, and JSMA, which together cover all three input-perturbation measures (L0, L2, and L∞).
     ▹ Adaptive attacks, specifically designed to defeat our detection mechanism.
     https://github.com/ptolemy-dl/ptolemy
  21. Evaluation Setup
     ‣ Variants:
     ▹ BwCu: backward extraction + cumulative thresholding
     ▹ BwAb: backward extraction + absolute thresholding
     ▹ FwAb: forward extraction + absolute thresholding
     ▹ Hybrid: BwAb + BwCu
  22. Detection Accuracy
     [Charts: detection accuracy of BwCu, BwAb, FwAb, Hybrid, EP, and CDRP on AlexNet @ ImageNet and on ResNet18 @ CIFAR-100.]
  23. Detection Overhead
     [Charts: latency and energy overhead (log scale) of BwCu, BwAb, FwAb, Hybrid, and EP.]
     ‣ CDRP requires re-training, which makes it unsuitable for inference-time attack detection.
  24. Comparison with DeepFense
     ‣ DeepFense uses modular redundancy to defend against adversarial samples.
     ‣ Ptolemy is both more accurate and faster.
     [Charts: accuracy and latency/energy overhead of Ptolemy’s BwCu, BwAb, FwAb, and Hybrid vs. DeepFense’s DFL, DFM, and DFH.]
  25. Adaptive Attacks
     ‣ If attackers know our defense mechanism, what can they do?
     ‣ Given an input x whose true class is c, add a minimal perturbation Δx to generate an adversarial input xa = x + Δx such that the path of xa resembles that of a totally different input xt whose true class is t (≠ c).
     ‣ Loss function when generating adversarial samples: Σi ‖zi(x + Δx) − zi(xt)‖₂², sketched below.
     ‣ The average Δx is 0.007 (in MSE): xa still looks like x.
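A PyTorch sketch of this loss: the attacker nudges Δx so that every layer’s activations match those of the target input xt. The tiny Sequential model and the optimizer settings are illustrative only, not the attack configuration used in the evaluation.

```python
import torch
import torch.nn as nn

def activations(model, x):
    """Collect the per-layer activations z_i(x) of a sequential model."""
    zs = []
    for layer in model:
        x = layer(x)
        zs.append(x)
    return zs

def adaptive_loss(model, x, delta, x_target):
    """Attacker's objective: sum_i || z_i(x + delta) - z_i(x_t) ||_2^2."""
    return sum(((za - zt) ** 2).sum()
               for za, zt in zip(activations(model, x + delta),
                                 activations(model, x_target)))

# Minimize the loss by gradient descent on the perturbation delta.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))
x, x_target = torch.randn(1, 8), torch.randn(1, 8)
delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    adaptive_loss(model, x, delta, x_target).backward()
    opt.step()
```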
  26. Adaptive Attacks
     ‣ 100% attack success rate without our defense.
     ‣ Adaptive attacks are more effective than non-adaptive attacks.
     ‣ When an adversarial input resembles the activations in more layers, it has a better chance of fooling our defense.
     [Chart: detection accuracy of BwCu and FwAb under adaptive attacks (AT1, AT2, AT3, AT8) and under BIM, CWL2, DeepFool, FGSM, and JSMA.]
  27. Early Termination (for Backward Extraction)
     ‣ Terminating at an earlier layer (i.e., extracting more layers) increases accuracy but also increases overhead.
     ‣ Accuracy plateaus beyond layer 6 (i.e., extracting only 3 layers), yet overhead keeps increasing.
     [Charts: normalized latency, normalized energy, and accuracy vs. termination layer, for AlexNet (8 layers in total) with BwCu.]
  28. Late Start (for Forward Extraction)
     ‣ Starting at an earlier layer increases accuracy and energy overhead.
     ‣ It has little impact on latency, which is hidden behind inference anyway.
     [Charts: accuracy, normalized latency, and normalized energy vs. start layer, for AlexNet (8 layers in total) with FwAb.]
  29. Optics, Sensor, Computing Co-Defense
     ‣ How can we co-design the optics (lens model, phase profile), the image sensor (spectral sensitivity function, noise and quantization models), and the DNN (layer weights, sparsity, bit width, network loss) to improve system robustness?
     [Image: the light → optics → sensor → vision algorithm pipeline; https://thesmartphonephotographer.com/phone-camera-sensor/]
  30. “Adversarial Attacks” in Neural Scientific Computing
     ‣ Numerical stability cast as an adversarial robustness problem.
     ▹ Can DNNs improve the numerical instability of scientific computing algorithms?
     ▹ Will DNNs introduce new robustness issues to scientific computing?
     [Images: physics simulation (https://blog.yiningkarlli.com/2015/06/attenuated-transmission.html); detecting extreme weather (https://www.slideshare.net/SAMSI_Info/program-on-mathematical-and-statistical-methods-for-climate-and-the-earth-system-deep-learning-for-extreme-weather-detection-prabhat-aug-23-2017, https://www.osti.gov/servlets/purl/1471083).]
  31. Connections Between Software 1.0 and Software 2.0
     ‣ Software 1.0: explicit instructions with explicit logic. Software 2.0: neural networks as self-written programs.
     ‣ Program path/trace: program optimizations (1.0) ↔ adversarial defense (2.0)
     ‣ Over-parameterization: approximate computing (1.0) ↔ model compression (2.0)
     ‣ Modular redundancy: fault tolerance (1.0) ↔ ensembles (2.0)
     ‣ ???
  32. Summary
     ‣ Robustness is a major roadblock and is under-explored in our community.
     ‣ Ptolemy defends DNNs against adversarial attacks by leveraging dynamic program paths, a critical connection between Software 1.0 and Software 2.0.
     ‣ Ptolemy is not an algorithm; it is an algorithmic framework.
     ‣ Overhead can be very low with the right choice of algorithm variants and static instruction scheduling.
     ‣ The hardware extension is minimal and principled.
  33. Architecture Support for Robust Deep Learning: Exploiting Software 1.0 Techniques to Defend Software 2.0
     Yiming Gan, Yuxian Qiu, Jingwen Leng, Qiuyue Sun, Sam Triest, Yu Feng, Amir Taherin, Yawo Siatitse
     http://horizon-lab.org
  34. Path Similarities
     ‣ Activation paths of benign inputs are similar to the class path.
     ‣ Activation paths of adversarial inputs are different from the class path.
     ‣ Per-layer similarity: S^l = |P(x)^l ∩ P_c^l| / |P(x)^l|, where P(x)^l is the activation path of input x at layer l and P_c^l is the class path at layer l.
     [Chart: per-layer similarities for AlexNet on ImageNet; the plotted perturbations are enhanced by 100 times to highlight the differences.]
  35. Adaptive Attacks
     ‣ Detection accuracy doesn’t change much as more perturbation (Δx) is added, likely because the perturbation is very small, which is a desirable property.
     ‣ Ptolemy is not more vulnerable when the attacker simply targets a similar class when generating the attacks.
     [Charts: accuracy vs. distortion/perturbation (×10⁻³ MSE) and accuracy vs. path similarity.]
  36. Choice of the Final Classifier
     ‣ Random forest works the best while having low overhead; the linear model performs worst.
     ‣ Cf. training a dedicated DNN to detect adversarial inputs.
     [Figures from the paper: ROC for AlexNet on ImageNet with weight-based joint similarity; detection accuracy comparison under different final classifiers.]