Algorithm-SoC Co-Design for Mobile Continuous Vision

ISCA 2018 Main Talk

Yuhao Zhu

June 06, 2018

Transcript

  1. 1.

    Algorithm-SoC Co-Design for Mobile Continuous Vision. Yuhao Zhu, Department of Computer Science, University of Rochester; with Anand Samajdar (Georgia Tech), Matthew Mattina (ARM Research), and Paul Whatmough (ARM Research).
  2. 5.

    Mobile Continuous Vision: Excessive Energy Consumption. Energy budget under a 3 W TDP at 720p, 30 FPS: 109 nJ/pixel. Object detection energy consumption: 1400 nJ/pixel.
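The budget figure is just power divided by pixel throughput; a quick arithmetic check of the slide's numbers:

```python
# Per-pixel energy budget under a 3 W TDP at 720p, 30 FPS (slide numbers).
power_w = 3.0
pixels_per_second = 1280 * 720 * 30
budget_nj_per_pixel = power_w / pixels_per_second * 1e9
print(round(budget_nj_per_pixel))         # ~109 nJ/pixel, as on the slide
print(round(1400 / budget_nj_per_pixel))  # object detection overshoots ~13x
```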
  10. 19.

    Expanding the Scope. Our scope so far: Vision Kernels (RGB Frames → Semantic Results). Expanded scope: Imaging (Photons → RGB Frames) plus Motion Metadata. Motion-based Synthesis: f(xt) = f(x1, …, xt-1) ⊕ (xt ⊖ xt-1), where ⊖ (diff) extracts motion between frames and ⊕ (synthesis) is cheap.
  17. 27.

    Getting Motion Data. The imaging pipeline (Photons → Conversion → Bayer domain: Demosaic, Dead Pixel Correction, … → YUV domain: Temporal Denoising, …) already computes motion internally: a block at <u, v> in Frame k-1 matches a block at <x, y> in Frame k, giving Motion Vector = <x - u, y - v>. This motion info is produced alongside the RGB frames fed to the vision kernels.
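The per-block motion search behind those vectors can be sketched as exhaustive block matching; a minimal illustration (the 16x16 block size, search range, and sum-of-absolute-differences cost are common choices, not specifics from the talk):

```python
import numpy as np

def motion_vector(prev, cur, x, y, block=16, search=8):
    """Find where the block x block patch at (x, y) in `cur` came from
    in `prev` by minimizing sum-of-absolute-differences (SAD).
    Returns the motion vector <x - u, y - v> as on the slide."""
    patch = cur[y:y + block, x:x + block].astype(np.int32)
    best, best_uv = None, (x, y)
    for v in range(max(0, y - search), min(prev.shape[0] - block, y + search) + 1):
        for u in range(max(0, x - search), min(prev.shape[1] - block, x + search) + 1):
            sad = np.abs(prev[v:v + block, u:u + block].astype(np.int32) - patch).sum()
            if best is None or sad < best:
                best, best_uv = sad, (u, v)
    u, v = best_uv
    return (x - u, y - v)

# A patch shifted right by 3 pixels yields motion vector (3, 0).
prev = np.zeros((64, 64), dtype=np.uint8)
prev[20:36, 20:36] = 255           # bright block at (20, 20) in frame k-1
cur = np.roll(prev, 3, axis=1)     # same content shifted right in frame k
print(motion_vector(prev, cur, 23, 20))  # -> (3, 0)
```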
  23. 33.

    Synthesis Operation: f(xt) = f(x1, …, xt-1) ⊕ (xt ⊖ xt-1), with ⊖ the motion diff and ⊕ the synthesis (Motion-based Synthesis). Extrapolate based on motion vectors: an Inference (I-Frame) is followed by Extrapolation (E-Frames). With Extrapolation Window = 2 the pattern is I-frame at t0, E-frame at t1; with Extrapolation Window = 3, an I-frame at t2 is followed by E-frames at t3 and t4.
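That cadence can be written as a one-line schedule generator; a small sketch (treating the window as the I-frame plus its E-frames, which is how the slide's t0…t4 timeline reads):

```python
def frame_schedule(num_frames, window):
    """Label each frame: 'I' (full CNN inference) starts every window;
    the remaining window - 1 frames are 'E' (motion-based extrapolation)."""
    return ['I' if t % window == 0 else 'E' for t in range(num_frames)]

print(frame_schedule(4, 2))  # -> ['I', 'E', 'I', 'E']  (Extrapolation Window = 2)
print(frame_schedule(6, 3))  # -> ['I', 'E', 'E', 'I', 'E', 'E']  (window = 3)
```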
  31. 41.

    Synthesis Operation: f(xt) = f(x1, …, xt-1) ⊕ (xt ⊖ xt-1), where ⊖ is the motion diff and ⊕ the synthesis (Motion-based Synthesis).
    ▸ Synthesis operation: extrapolate based on motion vectors.
    ▸ Address three challenges: ▹ handle deformable parts; ▹ filter motion noise; ▹ decide when to infer vs. extrapolate (see paper for details).
    ▸ Computationally efficient: extrapolation costs ~10K operations/frame; CNN inference costs ~50B operations/frame.
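The extrapolation step (the ⊕ above) can be sketched as shifting the previous result's bounding box by the average motion vector inside it; this is a toy illustration, not the paper's exact algorithm (which also handles deformable parts and filters motion noise):

```python
def extrapolate_roi(roi, motion_vectors):
    """Shift a bounding box (x0, y0, x1, y1) by the mean motion vector of
    the blocks it covers; a handful of ops per ROI vs. billions for a CNN.
    `motion_vectors` maps block coordinates to (dx, dy) vectors."""
    x0, y0, x1, y1 = roi
    mvs = [mv for (bx, by), mv in motion_vectors.items()
           if x0 <= bx < x1 and y0 <= by < y1]
    if not mvs:                       # no motion info: keep the old ROI
        return roi
    dx = sum(mv[0] for mv in mvs) / len(mvs)
    dy = sum(mv[1] for mv in mvs) / len(mvs)
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)

# An object covered by two blocks, both moving right by ~4 pixels:
mvs = {(16, 16): (4, 0), (32, 16): (4, 2)}
print(extrapolate_roi((10, 10, 50, 40), mvs))  # -> (14.0, 11.0, 54.0, 41.0)
```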
  34. 45.

    Euphrates: An Algorithm-SoC Co-Designed System for Energy-Efficient Mobile Continuous Vision.
    ▸ Algorithm: motion-based tracking and detection synthesis.
    ▸ SoC: exploits synergies across IP blocks; enables task autonomy.
    ▸ Results: 66% energy saving with ~1% accuracy loss, validated with RTL models and measurement.
  37. 50.

    SoC Architecture. Camera Sensor → Sensor Interface → Image Signal Processor → CNN Accelerator over the on-chip interconnect, implementing the Photons → (Imaging) → RGB Frames → (Vision Kernels) → Semantic Results flow.
  46. 60.

    SoC Architecture. The CPU (host), Image Signal Processor, CNN Accelerator, and the new Motion Controller sit on the on-chip interconnect, alongside the Camera Sensor's Sensor Interface, DMA Engine, Memory Controller, and Display; the Frame Buffer lives in DRAM. (1) The ISP emits motion metadata; (2) the Motion Controller consumes it.
  57. 72.

    ISP Augmentation. The ISP pipeline (Demosaic, Color Balance, …, and a Temporal Denoising Stage containing Motion Estimation, Motion Compensation, SRAM, and DMA) is driven by the ISP Sequencer and exchanges noisy/denoised frames (current and previous) with the Frame Buffer in DRAM over the SoC interconnect; the augmented ISP additionally writes out MVs.
    ▸ Expose motion vectors to the rest of the SoC.
    ▸ Design decision: transfer MVs through DRAM. ▹ One 1080p frame: 8 KB of MV traffic vs. ~6 MB of pixel data. ▹ Easy to piggyback on the existing SoC communication scheme.
    ▸ Light-weight modification to the ISP Sequencer.
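The traffic comparison is easy to reproduce back-of-the-envelope; the macroblock size and bytes-per-vector below are my assumptions, not figures from the talk:

```python
# Back-of-the-envelope check of the 1080p traffic comparison.
# Assumptions (mine): 16x16 macroblocks, 1 byte per packed motion
# vector, 3 bytes per RGB pixel.
width, height = 1920, 1080
blocks = (width // 16) * (-(-height // 16))   # 120 * 68 = 8160 blocks
mv_bytes = blocks * 1                         # ~8 KB of MV traffic
pixel_bytes = width * height * 3              # ~6.2 MB of pixel data
print(f"{mv_bytes / 1024:.1f} KB of MVs vs {pixel_bytes / 1e6:.1f} MB of pixels")
```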
  67. 82.

    Motion Controller IP, sitting between the ISP and the CNN Accelerator on the SoC interconnect. Internals: an Extrapolation Unit (ROI Selection, 4-way SIMD unit, scalar unit), a Motion Vector Buffer, DMA, a Sequencer (FSM), and memory-mapped registers (ROI, window size, base addresses, config).
    ▸ Why a new IP rather than directly augmenting the CNN accelerator? ▹ Stays independent of the vision algorithm/architecture implementation.
    ▸ Why a new IP rather than synthesizing on the CPU? ▹ The CPU can be switched off, enabling "always-on" vision.
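Driving such an IP typically means a few one-time writes to its memory-mapped registers, after which the FSM sequencer runs without the CPU. A hypothetical sketch: the register names come from the slide, but the offsets, packing, and start bit are invented for illustration:

```python
# Hypothetical register map for the Motion Controller (offsets invented).
REGS = {"ROI": 0x00, "WINSIZE": 0x04, "MV_BASE": 0x08, "CONF": 0x0C}

class MotionControllerRegs:
    """Stand-in for the IP's MMap register file; a real driver would
    mmap the device and issue 32-bit stores at these offsets."""
    def __init__(self):
        self.mem = {}

    def write(self, name, value):
        self.mem[REGS[name]] = value

mc = MotionControllerRegs()
mc.write("ROI", (10 << 16) | 20)   # packed initial ROI coordinates (illustrative)
mc.write("WINSIZE", 4)             # extrapolation window = 4
mc.write("MV_BASE", 0x80000000)    # where the ISP drops MVs in DRAM (illustrative)
mc.write("CONF", 1)                # start bit: the FSM now runs autonomously
```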
  74. 89.

    Experimental Setup.
    ▸ In-house simulator modeling a commercial mobile SoC, the Nvidia Tegra X2. ▹ Real board measurement.
    ▸ Developed RTL models for IPs unavailable on TX2: ▹ CNN Accelerator (651 mW, 1.58 mm2); ▹ Motion Controller (2.2 mW, 0.035 mm2).
    ▸ Evaluated on object tracking and object detection: ▹ important domains that are building blocks for many vision applications; ▹ IP vendors have started shipping standalone tracking/detection IPs.
    ▸ Object detection baseline CNN: YOLOv2 (state-of-the-art detection results).
    ▸ SCALE-Sim: a systolic-array-based, cycle-accurate CNN accelerator simulator. https://github.com/ARM-software/SCALE-Sim
  80. 96.

    Evaluation Results. [Chart: accuracy and normalized energy for the YOLOv2 baseline, extrapolation windows EW-2 through EW-32, and the scaled-down TinyYOLO.] 66% system energy saving with ~1% accuracy loss; more efficient than simply scaling down the CNN. EW = Extrapolation Window.
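The energy trend across extrapolation windows follows a simple amortization model; a sketch using the talk's per-IP power numbers (651 mW CNN accelerator vs. 2.2 mW motion controller), under the simplifying assumption that per-frame compute energy scales with IP power and ignoring the fixed energy of the rest of the SoC (which is why the measured system saving levels off near 66% rather than approaching 100%):

```python
def relative_compute_energy(window, e_inference=1.0, e_extrapolate=2.2 / 651):
    """Average per-frame compute energy, normalized to inference-every-frame:
    one I-frame plus window - 1 cheap E-frames per window."""
    return (e_inference + (window - 1) * e_extrapolate) / window

for ew in (2, 4, 8, 16, 32):
    print(f"EW-{ew}: {relative_compute_energy(ew):.2f}x baseline")
```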
  84. 101.

    Conclusions.
    ▸ We must expand our focus from isolated accelerators to holistic SoC architecture.
    ▸ Euphrates co-designs the SoC with a motion-based synthesis algorithm.
    ▸ 66% SoC energy savings with ~1% accuracy loss; more efficient than scaling down CNNs.