Algorithm-SoC Co-Design for Mobile Continuous Vision

ISCA 2018 Main Talk

Yuhao Zhu

June 06, 2018

Transcript

  1. Algorithm-SoC Co-Design for Mobile Continuous Vision. Yuhao Zhu, Department of

    Computer Science, University of Rochester, with Anand Samajdar (Georgia Tech), Matthew Mattina (ARM Research), and Paul Whatmough (ARM Research)
  2. Mobile Continuous Vision: Excessive Energy Consumption. At 720p and 30 FPS,

    the energy budget is 109 nJ/pixel (under a 3 W TDP), while CNN-based object detection consumes 1400 nJ/pixel.
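The 109 nJ/pixel budget follows directly from the power envelope and the pixel rate; a quick sanity check using only the figures on the slide (3 W TDP, 720p, 30 FPS):

```python
# Energy budget per pixel for continuous vision under a 3 W TDP,
# at 720p (1280 x 720) and 30 frames per second.
TDP_W = 3.0
PIXELS_PER_FRAME = 1280 * 720   # 921,600 pixels
FPS = 30

pixels_per_second = PIXELS_PER_FRAME * FPS
budget_nj_per_pixel = TDP_W / pixels_per_second * 1e9  # joules -> nanojoules

print(round(budget_nj_per_pixel))  # -> 109, matching the slide
```

Object detection at 1400 nJ/pixel therefore overshoots the budget by more than an order of magnitude, which is the gap the rest of the talk attacks.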
  6. Application Drivers for Continuous Vision: Autonomous Drones, ADAS, Augmented

    Reality, and Security Cameras
  10. Expanding the Scope. Conventional scope: Vision Kernels map RGB Frames to

    Semantic Results. Our scope also includes Imaging, which converts Photons to RGB Frames and exposes Motion Metadata as a byproduct. Motion-based Synthesis: f(xt) = f(x1, …, xt-1) ⊕ (xt ⊖ xt-1), where ⊖ is a cheap diff (motion) and ⊕ is synthesis.
  20. Getting Motion Data. The motion metadata already exists inside the imaging

    pipeline: after Conversion, the ISP runs Bayer-domain stages (Demosaic, Dead Pixel Correction, …) and YUV-domain stages (Temporal Denoising, …). Temporal denoising block-matches Frame k against Frame k-1: a block at <u, v> in Frame k matches a block at <x, y> in Frame k-1, giving Motion Vector = <x - u, y - v>. This motion info can be exposed to the vision kernels.
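The motion vectors above come from the ISP's block-matching search. A minimal sketch of the idea follows: an exhaustive sum-of-absolute-differences search over a small window. The block size and search range are illustrative defaults, not the ISP's actual parameters.

```python
import numpy as np

def motion_vector(prev, cur, u, v, block=16, search=8):
    """Find the motion vector for the block at (u, v) in frame `cur` by
    exhaustively matching it against frame `prev` (sum of absolute
    differences). Returns <x - u, y - v>, as defined on the slide."""
    ref = cur[v:v + block, u:u + block].astype(np.int32)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = u + dx, v + dy
            # Skip candidate blocks that fall outside the previous frame.
            if x < 0 or y < 0 or y + block > prev.shape[0] or x + block > prev.shape[1]:
                continue
            cand = prev[y:y + block, x:x + block].astype(np.int32)
            sad = np.abs(ref - cand).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (x - u, y - v)
    return best_mv
```

For example, if a bright 16x16 patch moves 4 pixels to the right between frames, the block at its new position matches the old position, yielding a motion vector of (-4, 0) back toward Frame k-1.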
  28. Synthesis Operation: f(xt) = f(x1, …, xt-1) ⊕ (xt ⊖ xt-1). ▸ Synthesis

    operation: extrapolate based on motion vectors. Frames alternate between full Inference (I-Frame) and cheap Extrapolation (E-Frame): e.g., with Extrapolation Window = 2, t0 is an I-Frame and t1 an E-Frame; with Extrapolation Window = 3, t2 is an I-Frame and t3, t4 are E-Frames.
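The I-Frame/E-Frame cadence above is a simple periodic schedule. A minimal sketch, assuming a fixed extrapolation window (the slide's example also varies the window over time):

```python
def frame_kind(t, ew):
    """With Extrapolation Window `ew`, every ew-th frame runs full CNN
    inference (I-Frame); the frames in between are extrapolated (E-Frames)."""
    return "I" if t % ew == 0 else "E"

# Extrapolation Window = 2: I, E, I, E, ...
print([frame_kind(t, 2) for t in range(4)])   # ['I', 'E', 'I', 'E']
# Extrapolation Window = 3: I, E, E, I, E, ...
print([frame_kind(t, 3) for t in range(5)])   # ['I', 'E', 'E', 'I', 'E']
```

A larger window amortizes each expensive inference over more frames, at the cost of extrapolating further from the last ground-truth result.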
  36. Synthesis Operation ▸ Synthesis operation: extrapolate based on motion vectors

    ▸ Address three challenges: ▹ Handle deformable parts ▹ Filter motion noise ▹ When to run inference vs. extrapolation? (See the paper for details.) Computationally efficient: extrapolation takes ~10K operations/frame vs. ~50B operations/frame for CNN inference.
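One plausible form of the extrapolation step is sketched below: shift the previous bounding box by the average motion vector of the blocks it covers. The simple unweighted average is an illustrative simplification; the paper's actual algorithm also handles deformable parts and filters motion noise.

```python
import numpy as np

def extrapolate_roi(roi, mv_field, block=16):
    """Shift a bounding box (x, y, w, h) by the mean motion vector of the
    blocks it covers. `mv_field` is an (H/block, W/block, 2) array of
    per-block motion vectors (dx, dy)."""
    x, y, w, h = roi
    bx0, by0 = x // block, y // block
    bx1, by1 = (x + w) // block + 1, (y + h) // block + 1
    mvs = mv_field[by0:by1, bx0:bx1].reshape(-1, 2)
    dx, dy = mvs.mean(axis=0)  # plain average; the paper filters noise first
    return (x + int(round(dx)), y + int(round(dy)), w, h)
```

With roughly one multiply-add per covered block, this is on the order of the ~10K operations/frame quoted above, versus ~50B for a full CNN inference.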
  42. Euphrates: An Algorithm-SoC Co-Designed System for Energy-Efficient Mobile

    Continuous Vision. Algorithm: motion-based tracking and detection synthesis. SoC: exploits synergies across IP blocks; enables task autonomy. Results: 66% energy saving with ~1% accuracy loss, validated with RTL models and board measurement.
  47. SoC Architecture. The pipeline maps onto IP blocks: the Camera Sensor and

    Sensor Interface capture Photons, the Image Signal Processor produces RGB Frames, and the CNN Accelerator runs the Vision Kernels that yield Semantic Results, all connected by the On-chip Interconnect.
  51. SoC Architecture. The SoC comprises the Camera Sensor, Sensor Interface, Image

    Signal Processor, CNN Accelerator, CPU (Host), DMA Engine, and Memory Controller on the On-chip Interconnect, with DRAM (holding the Frame Buffer) and the Display off-chip. Euphrates makes two additions: (1) the ISP exposes motion metadata alongside frames; (2) a new Motion Controller IP consumes that metadata.
  64. ISP Augmentation ▸ Expose motion vectors (MVs) to the rest of the SoC

    ▸ Design decision: transfer MVs through DRAM ▹ One 1080p frame: 8KB of MV traffic vs. ~6MB of pixel data ▹ Easy to piggyback on the existing SoC communication scheme ▸ Lightweight modification to the ISP Sequencer. The existing Temporal Denoising stage (Motion Estimation, Motion Compensation, SRAM, DMA) already computes MVs while turning noisy frames into denoised frames; the augmented ISP simply writes the MVs out to the Frame Buffer (DRAM) over the SoC interconnect.
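The roughly 750x gap between MV traffic and pixel traffic is easy to sanity-check. The macroblock size and per-MV byte count below are assumptions for illustration only; the slide gives just the 8KB and ~6MB endpoints:

```python
# Per-frame traffic at 1080p (1920 x 1080).
W, H = 1920, 1080

pixel_bytes = W * H * 3  # ~6 MB at an assumed 3 bytes/pixel (RGB)

# Assume one motion vector per 16x16 macroblock, packed into 1 byte
# (a small signed component per axis) -- an assumption, not the
# paper's exact encoding.
mv_bytes = (W // 16) * ((H + 15) // 16) * 1

print(pixel_bytes / 2**20)  # ~5.9 MB of pixel data
print(mv_bytes / 2**10)     # ~8 KB of MV data
```

Because the MV payload is so small, routing it through DRAM adds negligible bandwidth, which is why piggybacking on the existing frame-buffer path is a reasonable design choice.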
  75. Motion Controller IP. Internals: a Motion Vector Buffer, an Extrapolation Unit

    (a 4-way SIMD unit plus scalar ROI selection), a DMA engine, a Sequencer (FSM), and memory-mapped registers (ROI, window size, base addresses, config). It sits on the SoC interconnect between the ISP and the CNN accelerator. ▸ Why a new IP rather than augmenting the CNN accelerator? ▹ Stays independent of the vision algorithm/architecture implementation ▸ Why a new IP rather than synthesizing on the CPU? ▹ The CPU can be switched off, enabling "always-on" vision
  83. Euphrates: An Algorithm-SoC Co-Designed System for Energy-Efficient Mobile

    Continuous Vision (recap). Algorithm: motion-based tracking and detection synthesis. SoC: exploits synergies across IP blocks; enables task autonomy. Results: 66% energy saving with ~1% accuracy loss, validated with RTL models and board measurement.
  84. Experimental Setup ▸ In-house simulator modeling a commercial mobile SoC:

    Nvidia Tegra X2 ▹ Calibrated against real board measurements ▸ RTL models developed for IPs unavailable on TX2 ▹ CNN Accelerator (651 mW, 1.58 mm²) ▹ Motion Controller (2.2 mW, 0.035 mm²) ▸ Evaluated on object tracking and object detection ▹ Important domains that are building blocks for many vision applications ▹ IP vendors have started shipping standalone tracking/detection IPs ▸ Object detection baseline CNN: YOLOv2 (state-of-the-art detection results) ▸ CNN accelerator modeled with SCALE-Sim, a systolic-array-based, cycle-accurate simulator: https://github.com/ARM-software/SCALE-Sim
  90. Evaluation Results. [Chart: detection accuracy (0.1 to 0.7) and normalized

    energy (0 to 1) for baseline YOLOv2, Euphrates with extrapolation windows EW-2 through EW-32, and scaled-down CNNs (TinyYOLO). EW = Extrapolation Window.] 66% system energy saving with ~1% accuracy loss. More efficient than simply scaling down the CNN.
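The energy trend across extrapolation windows follows a simple amortization model. The per-frame cost split below is an illustrative assumption; the slide reports only the measured 66% system-level figure, which also includes sensing/ISP energy that extrapolation cannot remove:

```python
def compute_saving(ew, e_inf=1.0, e_ext=1e-6):
    """Fraction of CNN compute energy saved with extrapolation window `ew`:
    one full inference (cost e_inf) amortized over ew frames, with the
    remaining ew - 1 frames extrapolated at near-zero cost e_ext."""
    per_frame = (e_inf + (ew - 1) * e_ext) / ew
    return 1 - per_frame / e_inf

for ew in (2, 4, 8, 16, 32):
    print(ew, round(compute_saving(ew), 3))  # approaches (ew - 1) / ew
```

This is why the curve flattens at large windows: compute savings saturate toward 100% while accuracy keeps degrading, so moderate windows give the best tradeoff.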
  97. Conclusions ▸ We must expand our focus from isolated accelerators to holistic

    SoC architecture. ▸ Euphrates co-designs the SoC with a motion-based synthesis algorithm. ▸ 66% SoC energy savings with ~1% accuracy loss; more efficient than scaling down CNNs.
  102. Thank you! Anand Samajdar (Georgia Tech), Paul Whatmough (ARM Research), and

    Matt Mattina (ARM Research)