Algorithm-SoC Co-Design for Mobile Continuous Vision

ISCA 2018 Main Talk

Yuhao Zhu

June 06, 2018

Transcript

  1. Algorithm-SoC Co-Design
    for Mobile Continuous Vision
    Yuhao Zhu
    Department of Computer Science

    University of Rochester

    with

    Anand Samajdar, Georgia Tech

    Matthew Mattina, ARM Research

    Paul Whatmough, ARM Research

    View full-size slide

  2. Mobile Continuous Vision:
    Excessive Energy Consumption

    View full-size slide

  3. Mobile Continuous Vision:
    Excessive Energy Consumption
    720p, 30 FPS

    View full-size slide

  4. Mobile Continuous Vision:
    Excessive Energy Consumption
    Energy Budget:
    (under 3 W TDP)
    109 nJ/pixel
    720p, 30 FPS

    View full-size slide

  5. Mobile Continuous Vision:
    Excessive Energy Consumption
    Energy Budget:
    (under 3 W TDP)
    109 nJ/pixel
    Object Detection
    Energy Consumption
    1400 nJ/pixel
    720p, 30 FPS

    View full-size slide
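The 109 nJ/pixel budget appears to follow from dividing the 3 W TDP by the pixel rate of a 720p, 30 FPS stream. A quick check of that arithmetic, assuming 720p means 1280 × 720 pixels and that the entire TDP is available to the vision pipeline (neither is stated on the slide):

```python
# Back-of-the-envelope check of the per-pixel energy budget.
# Assumptions (not stated on the slide): 720p = 1280 x 720 pixels,
# and the entire 3 W TDP is available to the vision pipeline.
WIDTH, HEIGHT, FPS = 1280, 720, 30
TDP_WATTS = 3.0
DETECTION_NJ_PER_PIXEL = 1400.0  # object-detection figure from the slide

pixels_per_second = WIDTH * HEIGHT * FPS

# Energy available per pixel if the whole TDP is spent on vision.
budget_nj_per_pixel = TDP_WATTS / pixels_per_second * 1e9

# Power implied by the slide's object-detection energy figure.
detection_watts = DETECTION_NJ_PER_PIXEL * 1e-9 * pixels_per_second

# How far object detection overshoots the budget.
overshoot = DETECTION_NJ_PER_PIXEL / budget_nj_per_pixel

print(f"budget: {budget_nj_per_pixel:.1f} nJ/pixel")
print(f"detection power: {detection_watts:.1f} W")
print(f"over budget by: {overshoot:.1f}x")
```

Under these assumptions the budget comes out at roughly 108.5 nJ/pixel, in line with the slide's 109 nJ/pixel, and object detection at 1400 nJ/pixel would need well over ten times the available power.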

  6. Application Drivers for Continuous Vision
    3
    Autonomous Drones

    View full-size slide

  7. Application Drivers for Continuous Vision
    3
    Autonomous Drones
    ADAS

    View full-size slide

  8. Application Drivers for Continuous Vision
    3
    Autonomous Drones
    Augmented Reality
    ADAS

    View full-size slide

  9. Application Drivers for Continuous Vision
    3
    Autonomous Drones
    Augmented Reality
    ADAS
    Security Camera

    View full-size slide

  10. Expanding the Scope
    4
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Conventional
    Scope

    View full-size slide

  11. Expanding the Scope
    4
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Conventional
    Scope

    View full-size slide

  12. Expanding the Scope
    Our Scope
    4
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons

    View full-size slide

  13. Expanding the Scope
    Our Scope
    4
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Motion
    Metadata

    View full-size slide

  14. Expanding the Scope
    Our Scope
    4
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Motion
    Metadata
    f(xt) =

    View full-size slide

  15. Expanding the Scope
    Our Scope
    4
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Motion
    Metadata
    diff (motion)
    f(xt) =

    (xt ⊖ xt-1)

    View full-size slide

  16. Expanding the Scope
    Our Scope
    4
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Motion
    Metadata
    diff (motion)
    f(xt) =

    f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    View full-size slide

  17. Expanding the Scope
    Our Scope
    4
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Motion
    Metadata
    diff (motion)
    synthesis
    f(xt) =

    f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    View full-size slide

  18. Expanding the Scope
    Our Scope
    4
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Motion
    Metadata
    diff (motion)
    synthesis
    cheap
    f(xt) =

    f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    View full-size slide

  19. Expanding the Scope
    Our Scope
    4
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Motion
    Metadata
    diff (motion)
    synthesis
    Motion-based

    Synthesis
    cheap
    f(xt) =

    f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    View full-size slide

  20. Getting Motion Data
    5
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Motion
    Metadata

    View full-size slide

  21. Getting Motion Data
    5
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Conversion
    Demosaic …
    Bayer Domain
    Dead Pixel
    Correction

    YUV Domain
    Temporal
    Denoising

    View full-size slide

  23. Getting Motion Data
    5
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Conversion
    Demosaic …
    Bayer Domain
    Dead Pixel
    Correction

    YUV Domain
    Temporal
    Denoising

    Frame k

    View full-size slide

  25. Getting Motion Data
    5
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Conversion
    Demosaic …
    Bayer Domain
    Dead Pixel
    Correction

    YUV Domain
    Temporal
    Denoising

    Frame k-1 Frame k


    View full-size slide

  26. Getting Motion Data
    5
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Conversion
    Demosaic …
    Bayer Domain
    Dead Pixel
    Correction

    YUV Domain
    Temporal
    Denoising

    Frame k-1 Frame k


    Motion Vector =

    View full-size slide

  27. Getting Motion Data
    5
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons
    Conversion
    Demosaic …
    Bayer Domain
    Dead Pixel
    Correction

    YUV Domain
    Temporal
    Denoising

    Motion
    Info.
    Frame k-1 Frame k


    Motion Vector =

    View full-size slide
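The slides do not say how the ISP's temporal-denoising stage estimates motion between Frame k-1 and Frame k. A common choice is exhaustive block matching with a sum-of-absolute-differences (SAD) cost, sketched below as an illustration only. The function names, block size, and search range are hypothetical and not taken from the talk.

```python
# Minimal exhaustive block-matching sketch (illustrative only).
# Frames are 2-D lists of grayscale values; block size and search
# range are hypothetical parameters, not values from the talk.

def sad(prev, curr, py, px, cy, cx, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(curr[cy + r][cx + c] - prev[py + r][px + c])
        for r in range(size)
        for c in range(size)
    )

def motion_vector(prev, curr, by, bx, size=4, search=2):
    """Return (dy, dx) such that the block at (by, bx) in `curr`
    best matches the block at (by - dy, bx - dx) in `prev`."""
    height, width = len(prev), len(prev[0])
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            py, px = by - dy, bx - dx
            if not (0 <= py <= height - size and 0 <= px <= width - size):
                continue  # candidate block falls outside the frame
            cost = sad(prev, curr, py, px, by, bx, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv

def make_frame(top, left, size=4, dim=12):
    """Black dim x dim frame with one bright size x size square."""
    return [[255 if top <= r < top + size and left <= c < left + size else 0
             for c in range(dim)] for r in range(dim)]

# The square moves one row down and two columns right between frames.
prev_frame = make_frame(2, 2)
curr_frame = make_frame(3, 4)
mv = motion_vector(prev_frame, curr_frame, by=3, bx=4)
print(mv)
```

In this toy example the returned vector is the displacement of the bright square. A real ISP would run the search per macroblock in hardware, which is why the talk treats motion vectors as essentially free metadata.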

  28. Synthesis Operation
    6
    f(xt) = f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    diff (motion)
    synthesis
    Motion-based

    Synthesis

    View full-size slide

  29. Synthesis Operation
    6
    f(xt) = f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    diff (motion)
    synthesis
    Motion-based

    Synthesis
    ▸ Synthesis operation: Extrapolate
    based on motion vectors

    View full-size slide

  33. Synthesis Operation
    f(xt) = f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    diff (motion)
    synthesis
    Motion-based

    Synthesis
    [Figure: timeline of frames t0 to t4 marked as Inference (I-Frame) or Extrapolation (E-Frame), with one Extrapolation Window = 2 and one Extrapolation Window = 3]
    ▸ Synthesis operation: Extrapolate
    based on motion vectors

    View full-size slide
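The I-Frame/E-Frame timeline can be read as a schedule: an inference frame starts each window and the remaining frames in that window are extrapolated. The sketch below assumes a constant window, where an extrapolation window of N means one I-frame followed by N - 1 E-frames. This is a simplification, since the slide itself shows windows of different lengths, and the function names are hypothetical.

```python
# Sketch of an inference/extrapolation schedule.
# Assumption: with a constant extrapolation window `window`, every
# window-th frame is an I-frame (full CNN inference) and the
# window - 1 frames that follow are E-frames (extrapolated).

def frame_kind(index, window):
    """Return "I" for an inference frame, "E" for an extrapolated one."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return "I" if index % window == 0 else "E"

def schedule(num_frames, window):
    """Frame kinds for frames 0 .. num_frames - 1."""
    return [frame_kind(i, window) for i in range(num_frames)]

def inference_fraction(window):
    """Fraction of frames that still run the full CNN."""
    return 1.0 / window

print(schedule(8, 4))
print(inference_fraction(4))
```

With a window of 4, only a quarter of the frames run the CNN. Choosing the window adaptively is one of the challenges the talk defers to the paper.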

  36. Synthesis Operation
    6
    f(xt) = f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    diff (motion)
    synthesis
    Motion-based

    Synthesis
    ▸ Synthesis operation: Extrapolate
    based on motion vectors
    ▸ Address three challenges:

    View full-size slide

  37. Synthesis Operation
    6
    f(xt) = f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    diff (motion)
    synthesis
    Motion-based

    Synthesis
    ▸ Synthesis operation: Extrapolate
    based on motion vectors
    ▸ Address three challenges:
    ▹ Handle deformable parts

    View full-size slide

  38. Synthesis Operation
    6
    f(xt) = f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    diff (motion)
    synthesis
    Motion-based

    Synthesis
    ▸ Synthesis operation: Extrapolate
    based on motion vectors
    ▸ Address three challenges:
    ▹ Handle deformable parts
    ▹ Filter motion noise

    View full-size slide

  39. Synthesis Operation
    6
    f(xt) = f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    diff (motion)
    synthesis
    Motion-based

    Synthesis
    ▸ Synthesis operation: Extrapolate
    based on motion vectors
    ▸ Address three challenges:
    ▹ Handle deformable parts
    ▹ Filter motion noise
    ▹ When to infer vs. extrapolate?

    View full-size slide

  40. Synthesis Operation
    6
    f(xt) = f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    diff (motion)
    synthesis
    Motion-based

    Synthesis
    ▸ Synthesis operation: Extrapolate
    based on motion vectors
    ▸ Address three challenges:
    ▹ Handle deformable parts
    ▹ Filter motion noise
    ▹ When to infer vs. extrapolate?
    ▹ See paper for details!

    View full-size slide

  41. Synthesis Operation
    6
    f(xt) = f(x1, …, t-1) ⊕ (xt ⊖ xt-1)

    diff (motion)
    synthesis
    Motion-based

    Synthesis
    ▸ Synthesis operation: Extrapolate
    based on motion vectors
    ▸ Address three challenges:
    ▹ Handle deformable parts
    ▹ Filter motion noise
    ▹ When to infer vs. extrapolate?
    ▹ See paper for details!
    Computationally efficient:

    Extrapolation: 10K operations/frame
    CNN Inference: 50B operations/frame

    View full-size slide
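To make the synthesis step concrete, here is a minimal sketch of extrapolating a region of interest (ROI) from motion vectors, together with the amortized operation count implied by the slide's two figures. It assumes one motion vector per 16 × 16 macroblock and a plain mean over the macroblocks the ROI overlaps. The paper's handling of deformable parts and motion noise is not reproduced, and all names are hypothetical.

```python
# Sketch of motion-based ROI extrapolation (the "⊕" in the slide formula).
# Assumptions: motion vectors come as one (dx, dy) pair per
# BLOCK x BLOCK pixel macroblock, and the ROI simply shifts by the
# mean vector of the macroblocks it overlaps.
BLOCK = 16  # hypothetical macroblock size in pixels

def extrapolate_roi(roi, mv_grid, block=BLOCK):
    """roi = (x, y, w, h) in pixels; mv_grid[row][col] = (dx, dy)."""
    x, y, w, h = roi
    rows, cols = len(mv_grid), len(mv_grid[0])
    # Macroblock index range covered by the ROI, clamped to the grid.
    col0, col1 = max(0, x // block), min(cols - 1, (x + w - 1) // block)
    row0, row1 = max(0, y // block), min(rows - 1, (y + h - 1) // block)
    vectors = [mv_grid[r][c]
               for r in range(row0, row1 + 1)
               for c in range(col0, col1 + 1)]
    if not vectors:
        return roi  # ROI lies outside the motion-vector grid
    mean_dx = sum(v[0] for v in vectors) / len(vectors)
    mean_dy = sum(v[1] for v in vectors) / len(vectors)
    return (x + round(mean_dx), y + round(mean_dy), w, h)

def average_ops_per_frame(window, cnn_ops=50e9, extrapolation_ops=10e3):
    """Mean operations per frame when one inference is followed by
    window - 1 extrapolations (op counts taken from the slide)."""
    return (cnn_ops + (window - 1) * extrapolation_ops) / window

# Uniform motion field: every macroblock moves 3 px right, 2 px up.
grid = [[(3, -2)] * 4 for _ in range(4)]
new_roi = extrapolate_roi((16, 16, 32, 32), grid)
print(new_roi)
print(average_ops_per_frame(4))
```

Because extrapolation costs about 10K operations against about 50B for inference, the average cost per frame is dominated by how often the CNN runs.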

  42. 7
    Euphrates
    An Algorithm-SoC Co-Designed System for
    Energy-Efficient Mobile Continuous Vision

    View full-size slide

  43. 7
    Euphrates
    An Algorithm-SoC Co-Designed System for
    Energy-Efficient Mobile Continuous Vision
    Algorithm: Motion-based tracking and detection synthesis.

    View full-size slide

  44. 7
    Euphrates
    An Algorithm-SoC Co-Designed System for
    Energy-Efficient Mobile Continuous Vision
    SoC: Exploits synergies across IP blocks. Enables task autonomy.
    Algorithm: Motion-based tracking and detection synthesis.

    View full-size slide

  45. 7
    Euphrates
    An Algorithm-SoC Co-Designed System for
    Energy-Efficient Mobile Continuous Vision
    Results: 66% energy saving & 1% accuracy loss with RTL/measurement.
    SoC: Exploits synergies across IP blocks. Enables task autonomy.
    Algorithm: Motion-based tracking and detection synthesis.

    View full-size slide

  47. SoC Architecture
    8
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons

    View full-size slide

  48. SoC Architecture
    8
    CNN
    Accelerator
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons

    View full-size slide

  49. SoC Architecture
    8
    Image Signal
    Processor
    CNN
    Accelerator
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons

    View full-size slide

  50. SoC Architecture
    8
    Image Signal
    Processor
    CNN
    Accelerator
    Camera
    Sensor
    Sensor
    Interface
    On-chip Interconnect
    Vision
    Kernels
    RGB
    Frames
    Semantic
    Results
    Imaging
    Photons

    View full-size slide

  51. SoC Architecture
    9
    Image Signal
    Processor
    CNN
    Accelerator
    Camera
    Sensor
    Sensor
    Interface
    On-chip Interconnect

    View full-size slide

  52. DRAM
    Display
    SoC Architecture
    9
    Image Signal
    Processor
    CNN
    Accelerator
    Camera
    Sensor
    Sensor
    Interface
    On-chip Interconnect
    CPU
    (Host)
    Memory
    Controller
    DMA
    Engine
    SoC

    View full-size slide

  53. DRAM
    Display Frame Buffer
    SoC Architecture
    9
    Image Signal
    Processor
    CNN
    Accelerator
    Camera
    Sensor
    Sensor
    Interface
    On-chip Interconnect
    CPU
    (Host)
    Memory
    Controller
    DMA
    Engine
    SoC

    View full-size slide

  57. DRAM
    Display Frame Buffer
    SoC Architecture
    9
    CNN
    Accelerator
    Camera
    Sensor
    Sensor
    Interface
    On-chip Interconnect
    CPU
    (Host)
    Memory
    Controller
    DMA
    Engine
    Image Signal
    Processor
    SoC

    View full-size slide

  58. DRAM
    Display Frame Buffer
    SoC Architecture
    9
    CNN
    Accelerator
    Camera
    Sensor
    Sensor
    Interface
    On-chip Interconnect
    CPU
    (Host)
    Memory
    Controller
    DMA
    Engine
    Image Signal
    Processor
    Metadata
    SoC

    View full-size slide

  59. DRAM
    Display Frame Buffer
    SoC Architecture
    9
    CNN
    Accelerator
    Camera
    Sensor
    Sensor
    Interface
    On-chip Interconnect
    CPU
    (Host)
    Memory
    Controller
    DMA
    Engine
    Image Signal
    Processor
    Metadata
    1
    SoC

    View full-size slide

  60. DRAM
    Display Frame Buffer
    SoC Architecture
    9
    CNN
    Accelerator
    Camera
    Sensor
    Sensor
    Interface
    On-chip Interconnect
    CPU
    (Host)
    Memory
    Controller
    DMA
    Engine
    Image Signal
    Processor
    Motion
    Controller
    Metadata
    1
    2

    View full-size slide

  64. ISP Augmentation
    ▸ Expose motion vectors to the rest of the SoC
    10

    View full-size slide

  65. ISP Augmentation
    ▸ Expose motion vectors to the rest of the SoC
    ▸ Design decision: transfer MVs through DRAM
    10

    View full-size slide

  66. ISP Augmentation
    ▸ Expose motion vectors to the rest of the SoC
    ▸ Design decision: transfer MVs through DRAM
    ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data
    10

    View full-size slide

  67. ISP Augmentation
    ▸ Expose motion vectors to the rest of the SoC
    ▸ Design decision: transfer MVs through DRAM
    ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data
    ▹ Easy to piggyback on the existing SoC communication scheme
    10

    View full-size slide

  68. ISP Augmentation
    ▸ Expose motion vectors to the rest of the SoC
    ▸ Design decision: transfer MVs through DRAM
    ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data
    ▹ Easy to piggyback on the existing SoC communication scheme
    ▸ Light-weight modification to ISP Sequencer
    10

    View full-size slide
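The traffic comparison can be sanity-checked with simple arithmetic. The sketch below assumes 3 bytes per pixel, one motion vector per 16 × 16 macroblock, and 1 byte per motion vector. None of these parameters appear on the slide; they are chosen because they land close to the quoted 8KB and ~6MB figures.

```python
import math

# Rough check of the per-frame traffic figures for a 1080p frame.
# Assumptions (not on the slide): 3 bytes per pixel, one motion
# vector per 16 x 16 macroblock, and 1 byte per motion vector.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 3
BLOCK = 16
BYTES_PER_MV = 1

pixel_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL

# Partial macroblocks at the frame edge still need a vector.
blocks = math.ceil(WIDTH / BLOCK) * math.ceil(HEIGHT / BLOCK)
mv_bytes = blocks * BYTES_PER_MV

print(f"pixel data: {pixel_bytes / 1e6:.2f} MB")
print(f"motion vectors: {mv_bytes / 1e3:.2f} KB")
print(f"ratio: {pixel_bytes / mv_bytes:.0f}x")
```

Under these assumptions the motion-vector traffic is several hundred times smaller than the pixel traffic, which supports routing it through DRAM alongside the frame data.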

  69. Temporal Denoising Stage
    Motion
    Estimation
    Motion
    Compensation
    SRAM
    DMA
    Demosaic
    Color
    Balance
    ISP Internal
    Interconnect
    SoC
    Interconnect
    ISP Pipeline
    Frame Buffer
    (DRAM)
    ISP
    Sequencer
    Noisy
    Frame
    Denoised
    Frame
    Prev.
    Noisy
    Frame
    Prev.
    Denoised
    Frame
    ISP Augmentation
    ▸ Expose motion vectors to the rest of the SoC
    ▸ Design decision: transfer MVs through DRAM
    ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data
    ▹ Easy to piggyback on the existing SoC communication scheme
    ▸ Light-weight modification to ISP Sequencer
    10

    View full-size slide

  72. Temporal Denoising Stage
    Motion
    Estimation
    Motion
    Compensation
    SRAM
    DMA
    Demosaic
    Color
    Balance
    ISP Internal
    Interconnect
    SoC
    Interconnect
    ISP Pipeline
    Frame Buffer
    (DRAM)
    ISP
    Sequencer
    Noisy
    Frame
    Denoised
    Frame
    Prev.
    Noisy
    Frame
    Prev.
    Denoised
    Frame
    ISP Augmentation
    ▸ Expose motion vectors to the rest of the SoC
    ▸ Design decision: transfer MVs through DRAM
    ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data
    ▹ Easy to piggyback on the existing SoC communication scheme
    ▸ Light-weight modification to ISP Sequencer
    10
    MVs

    View full-size slide

  75. Motion Controller IP
    11
    Extrapolation Unit
    Motion
    Vector
    Buffer
    DMA
    Sequencer (FSM)
    ROI Selection
    ROI
    4-Way
    SIMD Unit
    Scalar
    MVs
    New
    ROI
    MMap
    Regs
    ROI
    Winsize
    Base
    Addrs
    Conf

    View full-size slide

  80. Motion Controller IP
    ▸ Why a new IP instead of directly augmenting the CNN accelerator?
    ▹ Independent of the vision algorithm/architecture implementation
    11
    Extrapolation Unit
    Motion
    Vector
    Buffer
    DMA
    Sequencer (FSM)
    ROI Selection
    ROI
    4-Way
    SIMD Unit
    Scalar
    MVs
    New
    ROI
    MMap
    Regs
    ROI
    Winsize
    Base
    Addrs
    Conf

    View full-size slide

  81. Motion Controller IP
    ▸ Why a new IP instead of directly augmenting the CNN accelerator?
    ▹ Independent of the vision algorithm/architecture implementation
    ▸ Why a new IP instead of synthesizing on the CPU?
    ▹ Switch off the CPU to enable “always-on” vision
    11
    Extrapolation Unit
    Motion
    Vector
    Buffer
    DMA
    Sequencer (FSM)
    ROI Selection
    ROI
    4-Way
    SIMD Unit
    Scalar
    MVs
    New
    ROI
    MMap
    Regs
    ROI
    Winsize
    Base
    Addrs
    Conf

    View full-size slide

  82. Motion Controller
    CNN
    Accelerator
    Motion Controller IP
    12
    Extrapolation Unit
    Motion
    Vector
    Buffer
    DMA
    Sequencer (FSM)
    ROI Selection
    ROI
    4-Way
    SIMD Unit
    Scalar
    MVs
    New
    ROI
    MMap
    Regs
    ROI
    Winsize
    Base
    Addrs
    Conf
    ISP
    SoC Interconnect
    ▸ Why a new IP instead of directly augmenting the CNN accelerator?
    ▹ Independent of the vision algorithm/architecture implementation
    ▸ Why a new IP instead of synthesizing on the CPU?
    ▹ Switch off the CPU to enable “always-on” vision

    View full-size slide

  83. 13
    Euphrates
    An Algorithm-SoC Co-Designed System for
    Energy-Efficient Mobile Continuous Vision
    Algorithm: Motion-based tracking and detection synthesis.
    SoC: Exploits synergies across IP blocks. Enables task autonomy.
    Results: 66% energy saving & 1% accuracy loss with RTL/measurement.

    View full-size slide

  84. Experimental Setup
    ▸ In-house simulator modeling a commercial
    mobile SoC: Nvidia Tegra X2

    ▹ Real board measurement
    14

    View full-size slide

  85. Experimental Setup
    ▸ In-house simulator modeling a commercial
    mobile SoC: Nvidia Tegra X2

    ▹ Real board measurement
    ▸ Develop RTL models for IPs unavailable on TX2

    ▹ CNN Accelerator (651 mW, 1.58 mm²)
    ▹ Motion Controller (2.2 mW, 0.035 mm²)
    14

    View full-size slide

  86. Experimental Setup
    ▸ In-house simulator modeling a commercial
    mobile SoC: Nvidia Tegra X2

    ▹ Real board measurement
    ▸ Develop RTL models for IPs unavailable on TX2

    ▹ CNN Accelerator (651 mW, 1.58 mm²)
    ▹ Motion Controller (2.2 mW, 0.035 mm²)
    14
    ▸ Evaluate on Object Tracking and Object Detection

    ▹ Important domains that are building blocks for many vision applications
    ▹ IP vendors have started shipping standalone tracking/detection IPs

    View full-size slide

  88. Experimental Setup
    ▸ In-house simulator modeling a commercial
    mobile SoC: Nvidia Tegra X2

    ▹ Real board measurement
    ▸ Develop RTL models for IPs unavailable on TX2

    ▹ CNN Accelerator (651 mW, 1.58 mm²)
    ▹ Motion Controller (2.2 mW, 0.035 mm²)
    14
    ▸ Evaluate on Object Tracking and Object Detection

    ▹ Important domains that are building blocks for many vision applications
    ▹ IP vendors have started shipping standalone tracking/detection IPs
    ▸ Object Detection

    ▹ Baseline CNN: YOLOv2 (state-of-the-art detection results)

    View full-size slide

  89. Experimental Setup
    ▸ In-house simulator modeling a commercial
    mobile SoC: Nvidia Tegra X2

    ▹ Real board measurement
    ▸ Develop RTL models for IPs unavailable on TX2

    ▹ CNN Accelerator (651 mW, 1.58 mm²)
    ▹ Motion Controller (2.2 mW, 0.035 mm²)
    14
    ▸ Evaluate on Object Tracking and Object Detection

    ▹ Important domains that are building blocks for many vision applications
    ▹ IP vendors have started shipping standalone tracking/detection IPs
    ▸ Object Detection

    ▹ Baseline CNN: YOLOv2 (state-of-the-art detection results)
    ▸ SCALE-Sim: A systolic array-based, cycle-accurate CNN accelerator
    simulator. https://github.com/ARM-software/SCALE-Sim.

    View full-size slide

  90. Evaluation Results
    [Figure: Accuracy (axis 0.1 to 0.7) for YOLOv2]

    View full-size slide

  91. Evaluation Results
    [Figure: Accuracy (axis 0.1 to 0.7) and Norm. Energy (axis 0 to 1) for YOLOv2]

    View full-size slide

  92. Evaluation Results
    [Figure: Accuracy (axis 0.1 to 0.7) and Norm. Energy (axis 0 to 1); one panel labeled YOLOv2, EW-2, EW-4, EW-8, EW-16, EW-32]
    EW = Extrapolation Window

    View full-size slide

  93. Evaluation Results
    [Figure: Accuracy (axis 0.1 to 0.7) and Norm. Energy (axis 0 to 1); both panels labeled YOLOv2, EW-2, EW-4, EW-8, EW-16, EW-32]
    EW = Extrapolation Window

    View full-size slide

  94. Evaluation Results
    [Figure: Accuracy (axis 0.1 to 0.7) and Norm. Energy (axis 0 to 1); both panels labeled YOLOv2, EW-2, EW-4, EW-8, EW-16, EW-32]
    66% system energy saving with ~1% accuracy loss.
    EW = Extrapolation Window

    View full-size slide

  95. Evaluation Results
    [Figure: Accuracy (axis 0.1 to 0.7) and Norm. Energy (axis 0 to 1); both panels labeled YOLOv2, EW-2, EW-4, EW-8, EW-16, EW-32; additional labels YOLOv2, EW-4, EW-16, TinyYOLO with a "Scale-down CNN" annotation]
    66% system energy saving with ~1% accuracy loss.
    EW = Extrapolation Window

    View full-size slide

  96. Evaluation Results
    [Figure: Accuracy (axis 0.1 to 0.7) and Norm. Energy (axis 0 to 1); both panels labeled YOLOv2, EW-2, EW-4, EW-8, EW-16, EW-32; additional labels YOLOv2, EW-4, EW-16, TinyYOLO]
    66% system energy saving with ~1% accuracy loss.
    More efficient than simply scaling down the CNN.
    EW = Extrapolation Window

    View full-size slide

  97. Conclusions
    16

    View full-size slide

  98. Conclusions
    16
    ▸ We must expand our focus from isolated
    accelerators to holistic SoC architecture.

    View full-size slide

  100. Conclusions
    16
    ▸ Euphrates co-designs the SoC with a
    motion-based synthesis algorithm.
    ▸ We must expand our focus from isolated
    accelerators to holistic SoC architecture.

    View full-size slide

  101. Conclusions
    16
    ▸ Euphrates co-designs the SoC with a
    motion-based synthesis algorithm.
    ▸ We must expand our focus from isolated
    accelerators to holistic SoC architecture.
    ▸ 66% SoC energy savings with ~1% accuracy
    loss. More efficient than scaling down CNNs.

    View full-size slide

  102. Thank you!
    Anand Samajdar, Georgia Tech
    Matt Mattina, ARM Research
    Paul Whatmough, ARM Research

    View full-size slide