Tigris: Architecture and Algorithms for 3D Perception in Point Clouds

HorizonLab
October 08, 2019

MICRO 2019 Talk. Presented by Boyuan Tian and Tiancheng Xu.

Transcript

  1. 1
    Tigris: Architecture and Algorithms for 3D Perception in Point Clouds
    Tiancheng Xu*, Boyuan Tian* 

    with Yuhao Zhu
    Department of Computer Science

    University of Rochester

    http://horizon-lab.org

    View Slide

  2. To-do: add Notre Dame Cathedral figure
    Goal: Intro
    2
    Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/

    View Slide

  4. 3
    Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/

    View Slide

  10. 4
    Point Cloud
    ‣ Points in 3D space, i.e., XYZ coordinates
    ‣ Effective in capturing visual features
    ‣ 3D scanners/sensors
    ▹ Scan from multiple perspectives
    ▹ Stitch these Point Clouds to form a complete Point Cloud

    View Slide

  12. 5
    Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/

    View Slide

  16. Point Cloud Registration
    6
    ▸ Aligns two point clouds by calculating a transformation
    Transformation = Rotation + Translation

    View Slide
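The definition "Transformation = Rotation + Translation" can be sketched in a few lines. This is a minimal illustration, not the Tigris implementation: `rotation_z` and `apply_transform` are hypothetical helper names, and the rotation is restricted to the Z axis to keep the example short.

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the Z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_transform(points, R, t):
    """Return R @ p + t for every XYZ point p in the cloud."""
    out = []
    for p in points:
        rotated = [sum(R[i][k] * p[k] for k in range(3)) for i in range(3)]
        out.append(tuple(rotated[i] + t[i] for i in range(3)))
    return out

# Hypothetical three-point cloud, rotated 90 degrees about Z and then shifted.
source = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
R = rotation_z(math.pi / 2)
t = (1.0, 2.0, 3.0)
aligned = apply_transform(source, R, t)
```

Registration solves the inverse problem: given `source` and a second cloud of the same scene, it estimates the `R` and `t` that bring them into alignment.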

  25. Motivation
    7
    Point Cloud Registration: a fundamental building block
    3D Reconstruction
    Autonomous Driving
    Mixed Reality
    3D Visual Computing
    High Performance Requirement
    Limited Energy Budget

    View Slide

  29. Tigris System Overview
    8
    Towards real-time and energy-efficient Point Cloud Registration
    Characterization, SW/HW Co-design, Evaluation

    View Slide

  30. Tigris System Overview
    9
    Towards real-time and energy-efficient Point Cloud Registration
    Characterization, SW/HW Co-design, Evaluation

    View Slide

  35. Point Cloud Registration Pipeline
    11
    Registration: Initial Estimation, Fine-tuning
    Stage1: NE
    Stage2: KPTD
    Stage3: DC
    Stage4: KPCE
    Stage5: CR
    Stage6: RPCE
    Stage7: EM

    View Slide

  41. Example: Normal Estimation (NE)
    12
    Registration
    NE KPTD DC KPCE CR RPCE EM
    ALG
    • DNN
    • SVD
    PARAM
    • Search radius

    View Slide
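As an illustration of the SVD option and its search-radius parameter, here is a minimal sketch of plane-fit normal estimation. It is not the pipeline's code. Two things are swapped in to keep it dependency-free and are assumptions of this sketch: the neighbor lookup is brute force rather than a KD-tree, and the least-variance direction is found by power iteration instead of a library SVD. All function names are hypothetical.

```python
import math

def neighbors_within(points, query, radius):
    """Brute-force radius search: all points within `radius` of `query`."""
    return [p for p in points if math.dist(p, query) <= radius]

def covariance(pts):
    """3x3 covariance matrix of a list of XYZ points."""
    n = len(pts)
    mean = [sum(p[i] for p in pts) / n for i in range(3)]
    return [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pts) / n
             for j in range(3)] for i in range(3)]

def estimate_normal(points, query, radius, iters=200):
    """Unit normal at `query`: the least-variance direction of its neighborhood."""
    nbrs = neighbors_within(points, query, radius)
    if len(nbrs) < 3:
        raise ValueError("need at least 3 neighbors to fit a plane")
    C = covariance(nbrs)
    trace = C[0][0] + C[1][1] + C[2][2]
    # The dominant eigenvector of (trace*I - C) is the eigenvector of C with
    # the smallest eigenvalue, i.e. the normal of the best-fit plane.
    M = [[(trace if i == j else 0.0) - C[i][j] for j in range(3)] for i in range(3)]
    v = [1.0, 1.0, 1.0]  # start vector; fails only if exactly orthogonal to the normal
    for _ in range(iters):
        w = [sum(M[i][k] * v[k] for k in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate neighborhood (all neighbors coincide)
        v = [x / norm for x in w]
    return tuple(v)
```

A larger search radius averages over more neighbors, which smooths noise but costs more neighbor-search work. That is one way this single parameter moves a design point along both the accuracy and the latency axis.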

  44. Huge Design Space
    13
    Registration
    NE KPTD DC KPCE CR RPCE EM
    ALG
    PARAM
    -
    Options, in slide order:
    • DNN
    • SVD
    • Search radius
    • SIFT
    • NARF
    • Scale range
    • FPFH
    • 3DSC
    • Search radius
    • Reciprocity
    • Ratio
    • Dist
    • RANSAC
    • THRESH
    • NORM-S
    • PROJECT
    • Converging criteria
    • Reciprocity
    • MetricT
    • SolverT
    Configurable pipeline: https://github.com/horizon-research/PointCloud-pipeline

    View Slide

  49. Design Space Exploration
    15
    Axes: Execution Time, Error Rate
    A Design Point with X Error Rate and Y Latency: (X, Y)

    View Slide
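One common way to narrow a cloud of (error rate, latency) design points down to a few representatives is to keep only the Pareto-optimal ones. The slides do not state how the representative points were picked, so treat this as an assumed selection rule. The candidate names and numbers are made up for illustration.

```python
def pareto_front(design_points):
    """Keep points that no other point beats on both error rate and latency.

    Each design point is (name, error_rate, latency); lower is better for both.
    """
    front = []
    for name, err, lat in design_points:
        dominated = any(
            (e <= err and l <= lat) and (e < err or l < lat)
            for _, e, l in design_points
        )
        if not dominated:
            front.append((name, err, lat))
    return sorted(front, key=lambda p: p[2])  # fastest first

# Hypothetical design points: (name, error rate X, latency Y in seconds).
candidates = [
    ("A", 0.10, 5.0),
    ("B", 0.05, 9.0),
    ("C", 0.12, 6.0),   # worse than A on both axes
    ("D", 0.04, 20.0),
    ("E", 0.05, 12.0),  # same error as B but slower
]
front = pareto_front(candidates)
```

With these numbers, C and E are dropped because A and B dominate them, leaving A, B and D as the accuracy/latency trade-off curve.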

  51. Design Space Exploration
    16
    Transformation = Rotation + Translation
    Error Rate: Rotational & Translational Error
    Plots: Translational Error and Rotational Error, each against Execution Time

    View Slide
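A minimal sketch of how the two error components could be measured against a ground-truth transformation. The exact metric used in the talk is not given on the slide (KITTI's official metric, for instance, normalizes by distance travelled), so the formulas here are a generic assumption: the angle of the residual rotation and the Euclidean distance between translations.

```python
import math

def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the residual rotation R_gt^T @ R_est."""
    # trace(R_gt^T @ R_est) equals the element-wise product sum of the matrices.
    tr = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp against rounding
    return math.degrees(math.acos(cos_angle))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations."""
    return math.dist(t_est, t_gt)
```

Keeping the two errors separate matters because a configuration can be good on one and poor on the other, which is why the design space is plotted twice.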

  55. Representative Design Points
    20
    Figure: representative design points DP1 to DP8, marked on the Translational Error and Rotational Error plots against Execution Time

    View Slide

  56. Characterization
    21
    NE KPTD DC KPCE CR RPCE EM
    Using the representative design points (DP1-8)

    View Slide

  59. Characterization
    22
    NE KPTD DC KPCE CR RPCE EM
    KD-Tree Search

    View Slide

  61. Bottleneck: KD-Tree Search
    24
    Figure: KD-Tree Search / End-to-End Pipeline Latency (%), axis 0% to 100%, for DP1 to DP8
    Bar labels (as extracted): 85%, 85%, 80%, 76%, 75%, 74%, 65%, 52%

    View Slide

  67. KD-Tree Search
    ▸ Neighbor Search (NS)
    ▹ Universal in Point Cloud processing
    ▹ To find the neighbors of one point among a set of points
    25
    Figure labels: Query Point, Search Points

    View Slide

  73. KD-Tree Search
    ▸ Neighbor Search (NS)
    ▹ Universal in Point Cloud processing
    ▹ To find the neighbors of one point among a set of points
    ▸ KD-Tree Search
    ▹ Standard implementation for NS in point clouds
    ▹ Effectively reduces the computation workload of NS
    ▹ Inefficient on GPUs due to its sequential nature
    ▹ Challenging for hardware acceleration
    26

    View Slide
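To make the "sequential nature" concrete, here is a minimal textbook KD-tree with nearest-neighbor search. It is a generic sketch, not code from the talk, and the names (`Node`, `build_kdtree`, `nearest`) are hypothetical. Each step depends on the best distance found so far, which is what prunes work and also what serializes the traversal.

```python
import math

class Node:
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points, depth=0):
    """Split on X, Y, Z in turn, storing the median point at each node."""
    if not points:
        return None
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build_kdtree(pts[:mid], depth + 1),
                build_kdtree(pts[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return (distance, point) of the nearest neighbor of `query`."""
    if node is None:
        return best
    d = math.dist(node.point, query)
    if best is None or d < best[0]:
        best = (d, node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    # The far side is visited only if the splitting plane is closer than the
    # current best; this decision needs the result of the near-side search.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best
```

The pruning test is why the tree reduces the computation workload compared with scanning every point, and the data dependency on `best` is why the traversal maps poorly onto wide parallel hardware.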

  74. Tigris System Overview
    27
    Towards real-time and energy-efficient Point Cloud Registration
    Characterization, SW/HW Co-design, Evaluation

    View Slide

  79. Redundancy vs. Parallelism
    28
    Unordered Set
    ▹Huge Parallelism, Huge Redundancy
    Canonical KD-Tree
    ▹No Redundancy, No Parallelism
    Figure label: Current Node

    View Slide

  83. Two-Stage KD-Tree
    New data structure
    ▹Balances parallelism and redundancy
    29
    Figure labels: Two-Stage KD-Tree, Top-Tree, Leaf Nodes, Children of Leaf Nodes

    View Slide

  85. Two-Stage KD-Tree
    New data structure
    ▹Balances parallelism and redundancy
    30
    Two-Stage KD-Tree
    Canonical KD-Tree
    Same First Few Levels

    View Slide

  87. Two-Stage KD-Tree
    New data structure
    ▹Balances parallelism and redundancy
    30
    Two-Stage KD-Tree
    Canonical KD-Tree
    Sub-Tree
    Unordered Set
    Sequential Traversal

    View Slide

  88. Two-Stage KD-Tree
    New data structure
    ▹Balances parallelism and redundancy
    30
    Figure labels: Two-Stage KD-Tree, Canonical KD-Tree, Sequential Traversal, Parallel Search

    View Slide
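The two-stage idea can be sketched as follows. This is an interpretation of the slides, not the authors' code: the top tree is an ordinary KD-tree cut off at a fixed depth (the hypothetical `top_depth` parameter), and everything below the cut is kept as an unordered set that is scanned exhaustively.

```python
import math

def build_two_stage(points, top_depth, depth=0):
    """Regular KD splits for `top_depth` levels, then an unordered leaf set."""
    if depth == top_depth or len(points) <= 1:
        return {"leaf": list(points)}  # children of a leaf node, unordered
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid], "axis": axis,
        "left": build_two_stage(pts[:mid], top_depth, depth + 1),
        "right": build_two_stage(pts[mid + 1:], top_depth, depth + 1),
    }

def nearest_two_stage(node, query, best=(math.inf, None)):
    """Sequential traversal in the top tree, exhaustive scan inside leaves."""
    if "leaf" in node:
        # Every distance here is independent of the others, so in hardware
        # these comparisons could all be issued in parallel.
        for p in node["leaf"]:
            d = math.dist(p, query)
            if d < best[0]:
                best = (d, p)
        return best
    d = math.dist(node["point"], query)
    if d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest_two_stage(near, query, best)
    if abs(diff) < best[0]:
        best = nearest_two_stage(far, query, best)
    return best
```

Setting `top_depth=0` degenerates to a single unordered set (maximum parallelism, maximum redundancy), while a very large `top_depth` reproduces the canonical KD-tree. Values in between trade one for the other.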

  91. Quantifying Redundancy
    31
    Two-Stage KD-Tree
    Canonical KD-Tree

    View Slide

  93. Quantifying Redundancy
    32
    35X more points need to be visited

    View Slide

  94. Approximate Search
    New search algorithm
    ▹Mitigates redundancy introduced by new data structure
    33

    View Slide

  98. Approximate Search
    35
    Qi
    N
    Qj
    New search algorithm
    ▹Close queries are likely to share similar search results

    View Slide

  103. Approximate Search
    39
    R
    Qj
    R
    Qi
    New search algorithm
    ▹Close queries are likely to share similar search results

    View Slide

  107. Approximate Search
    41
    Qi leader
    R
    New search algorithm
    ▹Leader: search in children of leaf nodes as usual

    View Slide

  110. Approximate Search
    43
    Qi leader
    R
    Qj
    follower
    New search algorithm
    ▹Follower: search in neighbors of a leader

    View Slide

  112. Approximate Search
    44
    New search algorithm
    ▹Efficiently mitigates search redundancy
    Total savings of node visits: 72.8%
    Negligible effect on registration accuracy
    Figure labels: Qi (leader), Qj (follower), R

    View Slide
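A minimal sketch of the leader/follower idea for radius search inside one leaf set. This is a reading of the slides rather than the authors' algorithm: the rule for deciding who becomes a follower (the hypothetical `leader_thresh` distance) and the function name are assumptions.

```python
import math

def approximate_radius_search(queries, leaf_points, radius, leader_thresh):
    """Radius search where followers reuse a nearby leader's result set.

    Returns (results, visits): results maps each query to its neighbors,
    visits counts how many points were compared in total.
    """
    leaders = []   # list of (leader query, leader's neighbor list)
    results = {}
    visits = 0
    for q in queries:
        leader = next((l for l in leaders
                       if math.dist(l[0], q) <= leader_thresh), None)
        if leader is None:
            candidates = leaf_points   # leader: scan the children of the leaf
        else:
            candidates = leader[1]     # follower: scan only the leader's neighbors
        visits += len(candidates)
        found = [p for p in candidates if math.dist(p, q) <= radius]
        results[q] = found
        if leader is None:
            leaders.append((q, found))
    return results, visits

# Hypothetical leaf set: a 10 x 10 grid; two queries that sit close together.
leaf_points = [(float(x), float(y), 0.0) for x in range(10) for y in range(10)]
qi, qj = (4.5, 4.5, 0.0), (4.6, 4.5, 0.0)
results, visits = approximate_radius_search([qi, qj], leaf_points, 1.0, 0.5)
```

The search is approximate because a follower can miss points that lie within `radius` of itself but outside the leader's result. The smaller `leader_thresh` is, the rarer that becomes and the fewer visits are saved.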

  115. Software-Hardware Co-design for Neighbor Search
    45
    New data structure + new search algorithm:
    ▹Expose huge parallelism with negligible search redundancy
    Software-Hardware co-design:
    ▹Sequential traversal in top tree
    ▹Parallel search in leaf nodes

    View Slide

  120. Hardware Architecture
    46
    Decoupled architecture
    ▹Front-end for tree traversal
    ▹Back-end for parallel search
    Front-end
    ▹Exploits QLP + limited NLP
    ▹QLP: Query-Level Parallelism
    ▹NLP: Node-Level Parallelism
    Sequential traversal in top tree
    Parallel search in leaf nodes
    Diagram labels: RU RU RU, Front-end Buffer, Query Distribution Network, Back-end, Global buffer

    View Slide

  123. Recursion Unit Hardware
    48
    Originally no NLP can be exploited
    ▹Tree traversal in top tree is still sequential
    Limited NLP exploited by pipelining different nodes
    ▹Two optimizations to avoid data dependencies and pipeline stalls
    Figure: Recursion Unit Microarchitecture
    ▹QLP: Query-Level Parallelism
    ▹NLP: Node-Level Parallelism

    View Slide

  127. Hardware Architecture
    50
    Decoupled architecture
    ▹Front-end for tree traversal
    ▹Back-end for parallel search
    Front-end
    ▹Exploits QLP + limited NLP
    Back-end
    ▹Exploits QLP + NLP
    ▹QLP: Query-Level Parallelism
    ▹NLP: Node-Level Parallelism
    Leaf node ID % num. of SUs
    Diagram labels: RU RU RU, Front-end Buffer, Query Distribution Network, SU SU SU, Buf Buf Buf, Global buffer

    View Slide
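The mapping "Leaf node ID % num. of SUs" can be illustrated with a small routing function. This is a software sketch of what the slide's formula implies, assuming it describes how queries leaving the front-end are assigned to back-end SUs; the function name and the (query ID, leaf ID) input format are hypothetical.

```python
from collections import defaultdict

def distribute_queries(query_leaf_ids, num_sus):
    """Route each (query ID, leaf node ID) pair to SU number leaf_id % num_sus."""
    su_queues = defaultdict(list)
    for query_id, leaf_id in query_leaf_ids:
        su_queues[leaf_id % num_sus].append((query_id, leaf_id))
    return dict(su_queues)

# Four queries whose top-tree traversal ended at leaf nodes 0, 4, 7 and 5.
routed = distribute_queries([(0, 0), (1, 4), (2, 7), (3, 5)], num_sus=3)
```

A modulo mapping keeps all queries for the same leaf node on the same SU, so that leaf's children only need to be present in one SU's buffer.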

  141. Search Unit Hardware
    52
    Execution Model Exploiting QLP
    ▸ Multiple Query Multiple LeafNodes
    ▹High PE Utilization; High data bandwidth
    ▸ Multiple Query Single LeafNode
    ▹Low data bandwidth; Low PE Utilization
    PE Data-flow Exploiting NLP
    ▸ 1-D systolic array
    ▸ Query-stationary data-flow
    ▹Children in a leaf node to be searched in parallel
    ▹QLP: Query-Level Parallelism
    ▹NLP: Node-Level Parallelism
    Diagram labels: PE, Q, N, Query-Level Parallelism, Node-Level Parallelism

    View Slide
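A toy cycle-level model of a query-stationary 1-D systolic array, to show what the data-flow means. It is an assumption-laden sketch, not the Search Unit's microarchitecture: each PE permanently holds one query, node points enter at PE 0 and shift one PE per cycle, and every PE compares its query with whatever node it currently holds.

```python
import math

def systolic_nearest(queries, nodes):
    """Simulate the array; return (per-query best (distance, node), cycles)."""
    num_pes = len(queries)
    pipeline = [None] * num_pes            # node currently held by each PE
    best = [(math.inf, None)] * num_pes    # running nearest neighbor per PE
    cycles = 0
    idx = 0
    while True:
        # Shift: a new node enters PE 0, the others move one PE down the chain.
        incoming = nodes[idx] if idx < len(nodes) else None
        idx += 1
        pipeline = ([incoming] + pipeline)[:num_pes]
        if all(n is None for n in pipeline):
            break                           # array has drained
        # Compute: all PEs work in the same cycle on different nodes.
        for pe, node in enumerate(pipeline):
            if node is not None:
                d = math.dist(queries[pe], node)
                if d < best[pe][0]:
                    best[pe] = (d, node)
        cycles += 1
    return best, cycles
```

In this model, N nodes and P PEs take N + P - 1 cycles instead of the N × P comparisons a single comparator would perform one at a time, and each node is fetched from the buffer only once.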

  142. Tigris System Overview
    53
    Towards real-time and energy-efficient Point Cloud Registration
    Characterization, SW/HW Co-design, Evaluation

    View Slide

  148. Metrics & Dataset
    54
    ‣ Speed-up & Power Reduction
    • Performance Bottleneck: KD-Tree Search
    • End-to-end Registration Pipeline

    ‣ Dataset: Self-Driving Benchmark (KITTI)
    • Each Point Cloud frame: ~130,000 Points

    View Slide

  155. Hardware
    55
    ‣ KD-Tree Search
    • Baseline: GPU (RTX 2080 Ti)
    • Our System:
    ✓ Performance: Cycle-accurate simulator parameterized by an RTL model
    ✓ Power & Area: Post-layout simulation in a 16nm process node
    ‣ All Other Parts: CPU (Intel Xeon Silver 4110)

    View Slide

  162. Comparisons
    56
    To examine the individual benefits of SW and HW optimizations
    Four systems for Comparison
    Baseline (KD): No SW Optimization, No HW Optimization
    Baseline (2SKD): + SW Optimization, No HW Optimization
    Tigris (KD): No SW Optimization, + HW Optimization
    Tigris (2SKD): + SW Optimization, + HW Optimization

  165. Performance
    57
    [Figure: bar charts of Speedup (x) and Power Reduction (x) for Baseline (KD), Baseline (2SKD), Our System (KD), Our System (2SKD). Bar labels: 5.9, 20.9, 1.0, 1.1]

  169. Performance & Power
    58
    [Figure: bar charts of Speedup (x) and Power Reduction (x) for Baseline (KD), Baseline (2SKD), Our System (KD), Our System (2SKD). Bar labels: 5.9, 20.9, 17.8, 10.5, 1.0, 1.0]
    20.9X speed-up on KD-Tree search
    3.5X end-to-end speed-up
    10.5X power reduction on KD-Tree search
    3.0X end-to-end power reduction
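
A rough way to relate the two speed-up figures on this slide is Amdahl's law. The sketch below is not from the talk: it assumes a single uniform baseline and that only KD-Tree search is accelerated, which simplifies the actual setup (GPU baseline for the search, CPU for all other parts). The function names are hypothetical.

```python
def amdahl(fraction, kernel_speedup):
    """End-to-end speedup when `fraction` of baseline time is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def implied_fraction(kernel_speedup, end_to_end_speedup):
    """Invert Amdahl's law: the share of baseline time the accelerated kernel must occupy."""
    return (1.0 - 1.0 / end_to_end_speedup) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    # Numbers from the slide: 20.9X on KD-Tree search, 3.5X end-to-end.
    f = implied_fraction(20.9, 3.5)
    print(f"implied KD-Tree search share of baseline time: {f:.1%}")
    print(f"check, end-to-end speedup: {amdahl(f, 20.9):.2f}X")
```

Under these assumptions the search would account for roughly three quarters of the baseline run time, which is only a sanity check and not a measured breakdown.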

  173. Summary
    59
    ‣ Point Cloud Registration
    ▹ A fundamental building block in emerging domains such as Autonomous Driving and Mixed Reality
    ‣ Our Tigris System
    ▹ An early step towards efficient Point Cloud Registration
    ‣ Key Insight
    ▹ Co-designing Software and Hardware to boost efficiency

  180. Rethink Systems Stack for Point Cloud Processing
    60
    2-D Image / Video
    Application: Starfish (LiKamWa et al., 2013), Focus (Hsieh et al., 2018), … …
    Compiler: Halide (Ragan-Kelley et al., 2013), Darkroom (Hegarty et al., 2014), Opt (Devito et al., 2018), … …
    Architecture: IDEAL (Mahmoud et al., 2017), Eyeriss (Chen et al., 2016), Euphrates (Zhu et al., 2018), … …
    Point Cloud
    Application: ??????
    Compiler: ??????
    Architecture: ??????
    Point Clouds are high-dimensional, sparse and irregular
    Computation / Memory Access Patterns are fundamentally different

  181. Thank you!

  182. Q & A

  183. Representative Design Points
    63
    [Figure: two scatter plots of design points labeled DP1 to DP8; panel axes: Execution Time and Translational Error, Execution Time and Rotational Error]

  184. Performance: absolute time
    Intel Xeon Silver 4110 core: ~5.0 - 10.0 seconds
    Our system: ~1.0 - 3.0 seconds
    64

  185. Speed-up & Power Reduction
    65
    [Figure: Speedup (X) and Power Reduction (X) bar charts for Base-KD, Base-2SKD, Acc-KD, Acc-2SKD; panels: Design Point 7 and Design Point 4]

  186. Approximate Search
    In Design Point 4:
    ▸ Computation Saving: 72.8% fewer distance computations and comparisons
    ▸ Accuracy Loss:
    ▹ Translational Error: 0
    ▹ Rotational Error: 0.05 °/meter
    66
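
To make the trade-off between saved distance computations and accuracy concrete, here is a minimal Python sketch of approximate nearest-neighbor search on a KD-tree. It uses the textbook (1 + eps) pruning rule, which is swapped in for illustration and is not the approximate search algorithm presented in this talk. All names (`build`, `nearest`, `eps`) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two 3-D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build(points, depth=0):
    """Build a KD-tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % 3  # cycle through x, y, z
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))


def nearest(root, query, eps=0.0):
    """Return (squared distance, point, number of distance computations).

    eps = 0 gives the exact nearest neighbor. With eps > 0 a far subtree is
    skipped whenever the splitting plane is at least best / (1 + eps) away,
    so the result is at most (1 + eps) times farther than the true neighbor.
    """
    best = [float("inf"), None]
    count = [0]

    def visit(node):
        if node is None:
            return
        point, axis, left, right = node
        count[0] += 1
        d = sq_dist(point, query)
        if d < best[0]:
            best[0], best[1] = d, point
        diff = query[axis] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near)
        # Every point on the far side is at least |diff| away from the query.
        if (diff * (1.0 + eps)) ** 2 < best[0]:
            visit(far)

    visit(root)
    return best[0], best[1], count[0]


if __name__ == "__main__":
    rng = random.Random(0)
    cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(2000)]
    tree = build(cloud)
    query = (0.5, 0.5, 0.5)
    for eps in (0.0, 0.5):
        d, _, n = nearest(tree, query, eps)
        print(f"eps={eps}: distance={d ** 0.5:.4f}, distance computations={n}")
```

Running the script prints the returned distance and the number of distance computations for the exact and the relaxed search, so the saving and the error can be compared on a synthetic cloud. The numbers it produces are not comparable to the Design Point 4 results above.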

  187. Area Analysis
    67
    SRAM: 8.38 mm^2 (53.8%)
    Compute Logic: 7.19 mm^2 (46.2%)
    [Figure: hardware architecture block diagram with Global Buffer, FE Query Queue, Query Distribution Network, Recursion Units (FQ, RS, RN, CD, PI, CL; Bypass, Forward), BE Query Buffers, Search Units (PEs), Query Buffer, Point Buffer, Query Stack Buffer, Result Buffer]

  188. Hardware Architecture
    68
    [Figure: hardware architecture block diagram with Global Buffer, FE Query Queue, Query Distribution Network, Recursion Units (FQ, RS, RN, CD, PI, CL; Bypass, Forward), BE Query Buffers, Search Units (PEs), Query Buffer, Point Buffer, Query Stack Buffer, Result Buffer]

  189. Memory Traffic
    69
    [Figure: Memory Traffic Dist. (%) for ACC-2SKD and ACC-KD, broken down by FE Query Q, Query Buf, Query Stacks, Res. Buf, BE Query Q, Node Cache, Points Buf; shown next to the hardware architecture block diagram]
