Slide 1

Algorithm-SoC Co-Design for Mobile Continuous Vision
Yuhao Zhu, Department of Computer Science, University of Rochester
with Anand Samajdar (Georgia Tech), Matthew Mattina (ARM Research), and Paul Whatmough (ARM Research)

Slides 2-5

Mobile Continuous Vision: Excessive Energy Consumption
720p, 30 FPS
Energy Budget (under 3 W TDP): 109 nJ/pixel
Object Detection Energy Consumption: 1400 nJ/pixel
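The budget figure follows directly from the power envelope and the pixel rate. A quick sanity check (the 3 W TDP, 720p, and 30 FPS figures are from the slides; the per-pixel arithmetic is ours):

```python
# Energy budget per pixel under a 3 W thermal design power (TDP),
# streaming 1280x720 frames at 30 FPS (figures from the slides).
TDP_W = 3.0
PIXELS_PER_SEC = 1280 * 720 * 30          # ~27.6 M pixels/s

budget_nj_per_pixel = TDP_W / PIXELS_PER_SEC * 1e9
print(f"budget: {budget_nj_per_pixel:.0f} nJ/pixel")   # ~109 nJ/pixel

# Object detection at 1400 nJ/pixel would instead need:
detect_power_w = 1400e-9 * PIXELS_PER_SEC
print(f"detection power: {detect_power_w:.1f} W")      # ~38.7 W, ~13x over budget
```

This is the gap the rest of the talk closes: detection as-is overshoots the mobile power budget by more than an order of magnitude.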

Slides 6-9

Application Drivers for Continuous Vision
▸ Autonomous Drones
▸ ADAS
▸ Augmented Reality
▸ Security Camera

Slides 10-19

Expanding the Scope
Conventional scope: RGB Frames → Vision Kernels → Semantic Results.
Our scope: Photons → Imaging → RGB Frames plus Motion Metadata → Vision Kernels → Semantic Results.

f(xt) = f(x1, …, xt-1) ⊕ (xt ⊖ xt-1)

where ⊖ is a cheap diff (motion) and ⊕ is the motion-based synthesis.

Slides 20-27

Getting Motion Data
The imaging stage already computes motion. The ISP pipeline (Conversion; Bayer Domain: Dead Pixel Correction, Demosaic, …; YUV Domain: Temporal Denoising, …) block-matches Frame k against Frame k-1 during temporal denoising, producing a Motion Vector per block. This motion info can be exposed alongside the RGB frames.
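Block matching is the standard way such motion vectors are computed: for each block in frame k, find the best-matching block in frame k-1; the displacement is the motion vector. A minimal sketch using a plain sum-of-absolute-differences search (block size, search range, and the toy frames are illustrative assumptions, not the ISP's actual parameters):

```python
import numpy as np

def motion_vector(prev, curr, by, bx, bs=8, search=4):
    """Motion vector for the bs x bs block of `curr` at (by, bx):
    the displacement into `prev` minimizing sum of absolute differences."""
    block = curr[by:by+bs, bx:bx+bs].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bs > prev.shape[0] or x + bs > prev.shape[1]:
                continue
            sad = np.abs(prev[y:y+bs, x:x+bs].astype(np.int32) - block).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv

# Toy example: frame k is frame k-1 shifted right by 2 pixels.
prev = np.zeros((32, 32), dtype=np.uint8)
prev[8:16, 8:16] = 255
curr = np.roll(prev, 2, axis=1)
print(motion_vector(prev, curr, 8, 10))   # block came from 2 px to the left: (0, -2)
```

A hardware temporal-denoising stage does the same search per block, so the vectors come at no extra compute cost to the vision pipeline.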

Slides 28-41

Synthesis Operation

f(xt) = f(x1, …, xt-1) ⊕ (xt ⊖ xt-1), where ⊖ is a cheap diff (motion) and ⊕ is the motion-based synthesis.

▸ Synthesis operation: extrapolate based on motion vectors
▸ Frames alternate between full CNN inference (I-Frames) and motion-based extrapolation (E-Frames): with Extrapolation Window = 2, every other frame is extrapolated; with Extrapolation Window = 3, two of every three frames are.
▸ Address three challenges:
▹ Handle deformable parts
▹ Filter motion noise
▹ When to infer vs. extrapolate?
▹ See paper for details!
▸ Computationally efficient: extrapolation costs ~10K operations/frame vs. ~50B operations/frame for CNN inference.
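The inference/extrapolation cadence can be sketched as a simple schedule: run the full CNN once per extrapolation window, and shift the previous result by the filtered motion for the frames in between. This is a structural sketch, not the paper's exact algorithm; `run_cnn` and `mean_motion` are hypothetical stand-ins for the CNN accelerator and the averaged ISP motion vectors:

```python
def process_stream(frames, window, run_cnn, mean_motion):
    """Alternate CNN inference (I-Frames) with motion-based
    extrapolation (E-Frames). With window=2, every other frame is
    extrapolated; with window=3, two of every three; and so on."""
    results, boxes = [], None
    for t, frame in enumerate(frames):
        if t % window == 0:                      # I-Frame: full inference
            boxes = run_cnn(frame)
        else:                                    # E-Frame: ~10K ops vs ~50B
            dy, dx = mean_motion(frame)          # averaged motion vectors
            boxes = [(x0 + dx, y0 + dy, x1 + dx, y1 + dy)
                     for (x0, y0, x1, y1) in boxes]
        results.append(boxes)
    return results

# Toy run: one box moving right 1 px/frame, CNN fires every 3rd frame.
out = process_stream(list(range(6)), window=3,
                     run_cnn=lambda f: [(f, 0, f + 10, 10)],
                     mean_motion=lambda f: (0, 1))
print(out[2])   # frame 2 extrapolated from frame 0's inference: [(2, 0, 12, 10)]
```

The asymmetry in the two branches (thousands vs. tens of billions of operations) is what makes large extrapolation windows so profitable.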

Slides 42-46

Euphrates: An Algorithm-SoC Co-Designed System for Energy-Efficient Mobile Continuous Vision
▸ Algorithm: motion-based tracking and detection synthesis.
▸ SoC: exploits synergies across IP blocks; enables task autonomy.
▸ Results: 66% energy saving with ~1% accuracy loss, validated via RTL modeling and board measurement.

Slides 47-50

SoC Architecture
The pipeline maps onto SoC IP blocks: Camera Sensor → Sensor Interface → Image Signal Processor (imaging) → CNN Accelerator (vision kernels), connected over the on-chip interconnect.

Slides 51-63

SoC Architecture
The full SoC: CPU (host), Image Signal Processor, CNN Accelerator, Camera Sensor and Sensor Interface, DMA Engine, and Memory Controller on the on-chip interconnect, with the Frame Buffer in DRAM and a Display. Euphrates makes two changes: (1) the ISP is augmented to emit motion metadata alongside frames, and (2) a new Motion Controller IP consumes that metadata to synthesize results.

Slides 64-74

ISP Augmentation
▸ Expose motion vectors to the rest of the SoC
▸ Design decision: transfer MVs through DRAM
▹ One 1080p frame: 8 KB MV traffic vs. ~6 MB pixel data
▹ Easy to piggyback on the existing SoC communication scheme
▸ Light-weight modification to the ISP Sequencer
(Diagram: within the ISP pipeline — Demosaic, Color Balance, and a Temporal Denoising stage containing Motion Estimation, Motion Compensation, SRAM, and DMA — the ISP Sequencer is modified so that the MVs produced while denoising the noisy frame against the previous frame are also written out over the SoC interconnect to the Frame Buffer in DRAM.)
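The DRAM-traffic argument is easy to check: pixel data for one 1080p frame is MB-scale, while per-block motion vectors are KB-scale. The block size and bytes-per-vector below are our assumptions (the slide's exact 8 KB figure implies a somewhat different packing), but the roughly three-orders-of-magnitude gap is robust:

```python
# One 1080p frame: pixel traffic vs. motion-vector traffic.
W, H = 1920, 1080
pixel_bytes = W * H * 3                      # RGB, 1 byte/channel: ~5.9 MB

# Assume one motion vector per 16x16 block, 2 bytes per vector (dx, dy).
blocks = (W // 16) * ((H + 15) // 16)        # 120 * 68 = 8160 blocks
mv_bytes = blocks * 2                        # ~16 KB (slide cites 8 KB)

print(f"pixels: {pixel_bytes / 2**20:.1f} MB, MVs: {mv_bytes / 2**10:.1f} KB")
print(f"MV traffic is ~{pixel_bytes // mv_bytes}x smaller")
```

At this scale the MV write-back is negligible, which is why routing it through the existing DRAM path costs essentially nothing.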

Slides 75-82

Motion Controller IP
(Diagram: an Extrapolation Unit with a Motion Vector Buffer, DMA, Sequencer (FSM), ROI Selection, and a 4-way SIMD unit plus a scalar unit that produce the new ROI; memory-mapped registers hold the ROI, window size, base addresses, and configuration. The controller sits on the SoC interconnect between the ISP and the CNN accelerator.)
▸ Why a new IP rather than directly augmenting the CNN accelerator?
▹ Stays independent of the vision algorithm/accelerator implementation
▸ Why a new IP rather than synthesizing on the CPU?
▹ The CPU can be switched off, enabling "always-on" vision

Slide 83

Euphrates: An Algorithm-SoC Co-Designed System for Energy-Efficient Mobile Continuous Vision
▸ Algorithm: motion-based tracking and detection synthesis.
▸ SoC: exploits synergies across IP blocks; enables task autonomy.
▸ Results: 66% energy saving with ~1% accuracy loss, validated via RTL modeling and board measurement.

Slide 84

Slide 84 text

Experimental Setup ▸ In-house simulator modeling a commercial mobile SoC: Nvidia Tegra X2 ▹ Real board measurement 14

Slide 85

Slide 85 text

Experimental Setup ▸ In-house simulator modeling a commercial mobile SoC: Nvidia Tegra X2 ▹ Real board measurement ▸ Develop RTL models for IPs unavailable on TX2 ▹ CNN Accelerator (651 mW, 1.58 mm2) ▹ Motion Controller (2.2 mW, 0.035 mm2) 14

Slide 86

Slide 86 text

Experimental Setup ▸ In-house simulator modeling a commercial mobile SoC: Nvidia Tegra X2 ▹ Real board measurement ▸ Develop RTL models for IPs unavailable on TX2 ▹ CNN Accelerator (651 mW, 1.58 mm2) ▹ Motion Controller (2.2 mW, 0.035 mm2) 14 ▸ Evaluate on Object Tracking and Object Detection ▹Important domains that are building blocks for many vision applications ▹IP vendors have started shipping standalone tracking/detection IPs

Slide 87

Slide 87 text

Experimental Setup ▸ In-house simulator modeling a commercial mobile SoC: Nvidia Tegra X2 ▹ Real board measurement ▸ Develop RTL models for IPs unavailable on TX2 ▹ CNN Accelerator (651 mW, 1.58 mm2) ▹ Motion Controller (2.2 mW, 0.035 mm2) 14 ▸ Evaluate on Object Tracking and Object Detection ▹Important domains that are building blocks for many vision applications ▹IP vendors have started shipping standalone tracking/detection IPs

Slide 88

Slide 88 text

Experimental Setup ▸ In-house simulator modeling a commercial mobile SoC: Nvidia Tegra X2 ▹ Real board measurement ▸ Develop RTL models for IPs unavailable on TX2 ▹ CNN Accelerator (651 mW, 1.58 mm2) ▹ Motion Controller (2.2 mW, 0.035 mm2) 14 ▸ Evaluate on Object Tracking and Object Detection ▹Important domains that are building blocks for many vision applications ▹IP vendors have started shipping standalone tracking/detection IPs ▸ Object Detection ▹Baseline CNN: YOLOv2 (state-of-the-art detection results)

Slide 89

Slide 89 text

Experimental Setup ▸ In-house simulator modeling a commercial mobile SoC: Nvidia Tegra X2 ▹ Real board measurement ▸ Develop RTL models for IPs unavailable on TX2 ▹ CNN Accelerator (651 mW, 1.58 mm2) ▹ Motion Controller (2.2 mW, 0.035 mm2) 14 ▸ Evaluate on Object Tracking and Object Detection ▹Important domains that are building blocks for many vision applications ▹IP vendors have started shipping standalone tracking/detection IPs ▸ Object Detection ▹Baseline CNN: YOLOv2 (state-of-the-art detection results) ▸ SCALESim: A systolic array-based, cycle-accurate CNN accelerator simulator. https://github.com/ARM-software/SCALE-Sim.

Slide 90

Slide 90 text

0.1 0.2 0.3 0.4 0.5 0.6 0.7 YOLOv2 Evaluation Results 15 Accuracy

Slide 91

Slide 91 text

0.1 0.2 0.3 0.4 0.5 0.6 0.7 YOLOv2 0 0.25 0.5 0.75 1 YOLOv2 Evaluation Results 15 Accuracy Norm. Energy

Slide 92

Slide 92 text

0.1 0.2 0.3 0.4 0.5 0.6 0.7 YOLOv2 0 0.25 0.5 0.75 1 YOLOv2 YOLOv2 EW-2 EW-4 EW-8 EW-16 EW-32 Evaluation Results 15 Accuracy Norm. Energy EW = Extrapolation Window

Slide 93

Slide 93 text

0.1 0.2 0.3 0.4 0.5 0.6 0.7 YOLOv2 0 0.25 0.5 0.75 1 YOLOv2 YOLOv2 EW-2 EW-4 EW-8 EW-16 EW-32 YOLOv2 EW-2 EW-4 EW-8 EW-16 EW-32 Evaluation Results 15 Accuracy Norm. Energy EW = Extrapolation Window

Slide 94

Slide 94 text

0.1 0.2 0.3 0.4 0.5 0.6 0.7 YOLOv2 0 0.25 0.5 0.75 1 YOLOv2 YOLOv2 EW-2 EW-4 EW-8 EW-16 EW-32 YOLOv2 EW-2 EW-4 EW-8 EW-16 EW-32 Evaluation Results 15 Accuracy Norm. Energy 66% system energy saving with ~ 1% accuracy loss. EW = Extrapolation Window

Slide 95

Slide 95 text

Scale-down CNN 0.1 0.2 0.3 0.4 0.5 0.6 0.7 YOLOv2 0 0.25 0.5 0.75 1 YOLOv2 YOLOv2 EW-2 EW-4 EW-8 EW-16 EW-32 YOLOv2 EW-2 EW-4 EW-8 EW-16 EW-32 YOLOv2 EW-4 EW-16 TinyYOLO Evaluation Results 15 Accuracy Norm. Energy 66% system energy saving with ~ 1% accuracy loss. EW = Extrapolation Window

Slide 96

Slide 96 text

0.1 0.2 0.3 0.4 0.5 0.6 0.7 YOLOv2 0 0.25 0.5 0.75 1 YOLOv2 YOLOv2 EW-2 EW-4 EW-8 EW-16 EW-32 YOLOv2 EW-2 EW-4 EW-8 EW-16 EW-32 YOLOv2 EW-4 EW-16 TinyYOLO Evaluation Results 15 Accuracy Norm. Energy 66% system energy saving with ~ 1% accuracy loss. More efficient than simply scaling-down the CNN. EW = Extrapolation Window
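The energy trend tracks a simple first-order model: per-frame energy is one full inference per window plus cheap extrapolations for the rest. The split below uses the slides' ~10K vs. ~50B operation counts as an illustrative cost proxy; measured savings (66%) also fold in ISP, memory, and system energy, which this model ignores:

```python
# First-order model: normalized energy per frame with extrapolation window w.
# Costs use operation counts from the slides as a proxy.
E_INF, E_EXT = 50e9, 10e3   # CNN inference vs. motion extrapolation

def norm_energy(w):
    """One inference + (w-1) extrapolations, averaged over w frames,
    normalized to inference-every-frame."""
    return (E_INF + (w - 1) * E_EXT) / (w * E_INF)

for w in (2, 4, 8, 16, 32):
    print(f"EW-{w}: {norm_energy(w):.3f}")   # EW-2 -> ~0.500, EW-4 -> ~0.250, ...
```

Extrapolation is so cheap that compute energy falls almost exactly as 1/window; accuracy, not energy, is what bounds the usable window size.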

Slide 97

Slide 97 text

Conclusions 16

Slide 98

Slide 98 text

Conclusions 16 ▸ We must expand our focus from isolated accelerators to holistic SoC architecture.

Slide 99

Slide 99 text

Conclusions 16 ▸ We must expand our focus from isolated accelerators to holistic SoC architecture.

Slide 100

Slide 100 text

Conclusions 16 ▸ Euphrates co-designs the SoC with a motion-based synthesis algorithm. ▸ We must expand our focus from isolated accelerators to holistic SoC architecture.

Slide 101

Slide 101 text

Conclusions 16 ▸ Euphrates co-designs the SoC with a motion-based synthesis algorithm. ▸ We must expand our focus from isolated accelerators to holistic SoC architecture. ▸ 66% SoC energy savings with ~1% accuracy loss. More efficient than scaling-down CNNs.

Slide 102

Thank you!
Anand Samajdar (Georgia Tech), Paul Whatmough (ARM Research), Matt Mattina (ARM Research)