Tigris: Architecture and Algorithms for 3D Perception in Point Clouds

HorizonLab
October 08, 2019

MICRO 2019 Talk. Presented by Boyuan Tian and Tiancheng Xu.

Transcript

  1. Tigris: Architecture and Algorithms for 3D Perception in Point Clouds
     Tiancheng Xu*, Boyuan Tian*, with Yuhao Zhu
     Department of Computer Science, University of Rochester
     http://horizon-lab.org
  2. Intro (Notre Dame Cathedral figure)
     Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/
  3. Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/
  10. Point Cloud
     ‣ Points in 3-D space, i.e., XYZ coordinates
     ‣ Effective in capturing visual features
     ‣ 3-D scanners/sensors
       ▹ Scan from multiple perspectives
       ▹ Stitch these point clouds to form a complete point cloud
  11. Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/
  16. Point Cloud Registration
     ▸ Aligns two point clouds by calculating a transformation
     Transformation = Rotation + Translation
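To make "Transformation = Rotation + Translation" concrete, here is a minimal sketch of applying a rigid transformation to a point cloud. The function names and the choice of a z-axis rotation are illustrative assumptions, not taken from the talk.

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_transform(points, R, t):
    """Map every point p to R * p + t (rotate first, then translate)."""
    return [tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
            for p in points]

# Hypothetical two-point cloud: rotate 90 degrees about z, then shift up by 1.
source = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
aligned = apply_transform(source, rotation_z(math.pi / 2), (0.0, 0.0, 1.0))
```

Registration is the inverse problem: given two overlapping point clouds, estimate the `R` and `t` that best map one onto the other.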
  25. Motivation
     Point Cloud Registration: a fundamental building block
     ‣ 3-D Reconstruction
     ‣ Autonomous Driving
     ‣ Mixed Reality
     3-D Visual Computing
     ‣ High Performance Requirement
     ‣ Limited Energy Budget
  29. Tigris System Overview
     Towards real-time and energy-efficient Point Cloud Registration
     ‣ Characterization
     ‣ SW/HW Co-design
     ‣ Evaluation
  35. Point Cloud Registration Pipeline
     Registration: Initial Estimation, then Fine-tuning
     Stages 1-7: NE, KPTD, DC, KPCE, CR, RPCE, EM
  41. Example: Normal Estimation (NE)
     Registration stages: NE, KPDT, DC, KPCE, CR, RPCE, EM
     ‣ ALG (algorithm choices for NE): DNN, SVD
     ‣ PARAM (parameters for NE): Search radius
  44. Huge Design Space
     Registration stages: NE, KD, DC, KPCE, CR, RPCE, EM
     Each stage exposes algorithm (ALG) and parameter (PARAM) choices. Choices listed on the slide, in slide order:
     • DNN • SVD • Search radius • SIFT • NARF • Scale range • FPFH • 3DSC • Search radius • Reciprocity • Ratio • Dist • RANSAC • THRESH • NORM-S • PROJECT • Converging criteria • Reciprocity • MetricT • SolverT
     Configurable pipeline: https://github.com/horizon-research/PointCloud-pipeline
  49. Design Space Exploration
     [Plot: Error Rate vs. Execution Time]
     A design point with X error rate and Y latency: (X, Y)
  51. Design Space Exploration
     [Plots: Translational Error vs. Execution Time; Rotational Error vs. Execution Time]
     Transformation = Rotation + Translation
     Error Rate: Rotational & Translational Error
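The slides split the error rate into a rotational and a translational part but do not spell out the formulas. One common way to compute the two, shown here as a sketch under that assumption, is the Euclidean distance between the estimated and ground-truth translations and the angle of the residual rotation. The function names are hypothetical, and the benchmark used in the talk may normalize these values differently (for example per meter traveled).

```python
import math

def translational_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations."""
    return math.dist(t_est, t_gt)

def rotational_error_deg(R_est, R_gt):
    """Angle (degrees) of the residual rotation R_gt^T * R_est."""
    # trace(R_gt^T * R_est) equals the element-wise product summed over all entries.
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))

# Hypothetical estimate: 10 degrees off about z, 0.5 units off in x.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
c, s = math.cos(math.radians(10)), math.sin(math.radians(10))
R_est = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
rot_err = rotational_error_deg(R_est, identity)
trans_err = translational_error((0.5, 0.0, 0.0), (0.0, 0.0, 0.0))
```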
  55. Representative Design Points
     [Plots: Translational Error vs. Execution Time; Rotational Error vs. Execution Time, with design points DP1-DP8 marked]
  59. Characterization
     Using the representative design points (DP1-8)
     Stages: NE, KPTD, DC, KPCE, CR, RPCE, EM
     KD-Tree Search
  61. Bottleneck: KD-Tree Search
     [Bar chart: KD-Tree Search / End-to-End Pipeline Latency (%), x-axis DP1-DP8]
     Bar labels: 85%, 85%, 80%, 76%, 75%, 74%, 65%, 52%
  73. KD-Tree Search
     ▸ Neighbor Search (NS)
       ▹ Universal in point cloud processing
       ▹ To find the neighbors of one point (the query point) among a set of points (the search points)
     ▸ KD-Tree Search
       ▹ Standard implementation for NS in point clouds
       ▹ Effectively reduces the computation workload of NS
       ▹ Inefficient on GPUs due to its sequential nature
       ▹ Challenging for hardware acceleration
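As an illustration of why KD-tree search reduces work yet is inherently sequential, here is a minimal radius-search sketch over a canonical KD-tree. It is a textbook construction, not the implementation used in the talk; the dictionary-based node layout and function names are assumptions.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Recursively split on x, y, z in turn; each node stores the median point."""
    if not points:
        return None
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"point": pts[mid], "axis": axis,
            "left": build_kdtree(pts[:mid], depth + 1),
            "right": build_kdtree(pts[mid + 1:], depth + 1)}

def radius_search(node, query, radius, found=None):
    """Collect all points within `radius` of `query`, pruning far subtrees."""
    if found is None:
        found = []
    if node is None:
        return found
    if math.dist(node["point"], query) <= radius:
        found.append(node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    radius_search(near, query, radius, found)
    # The far side is visited only if the splitting plane is within the radius.
    if abs(diff) <= radius:
        radius_search(far, query, radius, found)
    return found

random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(200)]
tree = build_kdtree(cloud)
query = (0.5, 0.5, 0.5)
neighbors = radius_search(tree, query, 0.2)
```

Each step of the traversal depends on the comparison made at the previous node, which is the sequential behavior the slide refers to.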
  74. Tigris System Overview
     Towards real-time and energy-efficient Point Cloud Registration
     ‣ Characterization
     ‣ SW/HW Co-design
     ‣ Evaluation
  79. Redundancy vs. Parallelism
     Unordered Set ▹ Huge Parallelism, Huge Redundancy
     Canonical KD-Tree ▹ No Redundancy, No Parallelism
  88. Two-Stage KD-Tree
     New data structure ▹ Balances parallelism and redundancy
     Structure: Top-Tree, Leaf Nodes, Children of Leaf Nodes
     Compared with the Canonical KD-Tree:
     ▹ Same first few levels (the Top-Tree): Sequential Traversal
     ▹ Each remaining Sub-Tree becomes an Unordered Set: Parallel Search
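The two-stage idea can be sketched as a KD-tree that stops splitting after a fixed number of levels and keeps the remaining points of each sub-tree as a flat, unordered list. This is a minimal illustration under assumed names (`top_height`, the dictionary layout, the `stats` counter), not the authors' code.

```python
import math
import random

def build_two_stage(points, top_height, depth=0):
    """Top tree with `top_height` internal levels; each leaf keeps an unordered set."""
    if depth == top_height or len(points) <= 1:
        return {"leaf": True, "children": list(points)}
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"leaf": False, "point": pts[mid], "axis": axis,
            "left": build_two_stage(pts[:mid], top_height, depth + 1),
            "right": build_two_stage(pts[mid + 1:], top_height, depth + 1)}

def two_stage_radius_search(node, query, radius, found, stats):
    if node["leaf"]:
        # Exhaustive scan: every child is tested independently, so this loop is
        # trivially parallel, but it visits points a full KD-tree would prune.
        stats["visited"] += len(node["children"])
        found.extend(p for p in node["children"] if math.dist(p, query) <= radius)
        return
    # Top tree: ordinary sequential KD-tree traversal.
    stats["visited"] += 1
    if math.dist(node["point"], query) <= radius:
        found.append(node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    two_stage_radius_search(near, query, radius, found, stats)
    if abs(diff) <= radius:
        two_stage_radius_search(far, query, radius, found, stats)

random.seed(1)
cloud = [(random.random(), random.random(), random.random()) for _ in range(500)]
query = (0.3, 0.6, 0.4)
tree = build_two_stage(cloud, top_height=4)
found, stats = [], {"visited": 0}
two_stage_radius_search(tree, query, 0.15, found, stats)
```

A smaller `top_height` yields larger leaf sets: more parallel work per leaf, and more redundant point visits.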
  93. Quantifying Redundancy
     Two-Stage KD-Tree vs. Canonical KD-Tree
     35X more points need to be visited
  94. Approximate Search
     New search algorithm ▹ Mitigates redundancy introduced by the new data structure
  103. Approximate Search
     New search algorithm ▹ Close queries are likely to share similar search results
     [Figure: queries Qi and Qj, each with search radius R]
  112. Approximate Search
     New search algorithm
     ▹ Leader: search in children of leaf nodes as usual
     ▹ Follower: search in neighbors of a leader
     ▹ Efficiently mitigates search redundancy
     Total savings of node visits: 72.8%
     Negligible effect on registration accuracy
     [Figure: leader Qi and follower Qj, search radius R]
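The leader/follower rule can be sketched as follows. The slides do not say how a query is classified, so this sketch assumes a distance threshold (`leader_dist`, hypothetical): a query close enough to an existing leader becomes a follower and searches only that leader's result set. A flat point list stands in for the children of one leaf node.

```python
import math

def approximate_search(queries, points, radius, leader_dist):
    """Return ({query: neighbors}, number of points visited)."""
    leaders = []   # list of (leader query, its exact neighbor list)
    results = {}
    visits = 0
    for q in queries:
        leader = next((l for l in leaders if math.dist(l[0], q) <= leader_dist), None)
        if leader is None:
            # Leader: search all children of the leaf node as usual.
            neighbors = [p for p in points if math.dist(p, q) <= radius]
            visits += len(points)
            leaders.append((q, neighbors))
        else:
            # Follower: search only among the leader's neighbors (approximate).
            candidates = leader[1]
            neighbors = [p for p in candidates if math.dist(p, q) <= radius]
            visits += len(candidates)
        results[q] = neighbors
    return results, visits

# Hypothetical 1-D layout embedded in 3-D: nine points spaced 0.5 apart.
points = [(x / 2.0, 0.0, 0.0) for x in range(-4, 5)]
q_leader, q_follower = (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)
results, visits = approximate_search([q_leader, q_follower], points,
                                     radius=1.0, leader_dist=0.5)
```

A follower's result is always a subset of the exact result, so it can miss true neighbors that lie outside the leader's radius; that is the source of the approximation.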
  115. Software-Hardware Co-design for Neighbor Search
     New data structure + new search algorithm:
     ▹ Expose huge parallelism with negligible search redundancy
     Software-Hardware co-design:
     ▹ Sequential traversal in top tree
     ▹ Parallel search in leaf nodes
  120. Hardware Architecture
     Decoupled architecture
     ▹ Front-end for tree traversal (sequential traversal in top tree)
     ▹ Back-end for parallel search (parallel search in leaf nodes)
     Front-end ▹ Exploits QLP + limited NLP
     ▹ QLP: Query-Level Parallelism
     ▹ NLP: Node-Level Parallelism
     [Diagram: Front-end (RUs, Front-end Buffer), Query Distribution Network, Back-end, Global Buffer]
  123. Recursive Unit Hardware
     Originally no NLP can be exploited
     ▹ Tree traversal in top tree is still sequential
     Limited NLP exploited by pipelining different nodes
     ▹ Two optimizations to avoid data dependency and pipeline stall
     [Figure: Recursion Unit Microarchitecture]
  127. Hardware Architecture
     Decoupled architecture
     ▹ Front-end for tree traversal
     ▹ Back-end for parallel search
     Front-end ▹ Exploits QLP + limited NLP
     Back-end ▹ Exploits QLP + NLP
     Query Distribution Network: Leaf node ID % num. of SUs
     [Diagram: RUs, Front-end Buffer, Query Distribution Network, SUs each with a buffer (Buf), Global Buffer]
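The "Leaf node ID % num. of SUs" rule reads as a modulo mapping from the leaf node a query lands in to the search unit (SU) that handles it. A minimal sketch, with hypothetical function and variable names:

```python
def assign_to_search_units(query_leaf_ids, num_sus):
    """Group query IDs by search unit, using leaf_id % num_sus as the SU index."""
    buckets = [[] for _ in range(num_sus)]
    for query_id, leaf_id in query_leaf_ids:
        buckets[leaf_id % num_sus].append(query_id)
    return buckets

# Hypothetical (query ID, leaf node ID) pairs coming out of the front-end.
pairs = [(0, 5), (1, 2), (2, 9), (3, 4)]
buckets = assign_to_search_units(pairs, num_sus=4)
```

With this mapping, queries that reach the same leaf node always go to the same SU, so that leaf's children only need to be fetched by one unit.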
  137. Search Unit Hardware
     [Diagram: SU with a buffer (Buf) of queries (Q) feeding PEs]
     Query-Level Parallelism
     Execution Model Exploiting QLP
     ▸ Multiple Query Multiple LeafNodes
       ▹ High PE utilization; high data bandwidth
     ▸ Multiple Query Single LeafNodes
       ▹ Low data bandwidth; low PE utilization
  141. Search Unit Hardware
     [Diagram: queries (Q) held in PEs, nodes (N) streamed through]
     Node-Level Parallelism
     ▹ Children in a leaf node to be searched in parallel
     Data-flow Exploiting NLP
     ▸ 1-D systolic array
     ▸ Query-stationary data-flow
  142. Tigris System Overview
     Towards real-time and energy-efficient Point Cloud Registration
     ‣ Characterization
     ‣ SW/HW Co-design
     ‣ Evaluation
  148. Metrics & Dataset
     ‣ Speed-up & Power Reduction
       • Performance Bottleneck: KD-Tree Search
       • End-to-end Registration Pipeline
     ‣ Dataset: Self-Driving Benchmark (KITTI)
       • Each Point Cloud frame: ~130,000 Points
  155. Hardware
     ‣ KD-Tree Search
       • Baseline: GPU (RTX 2080 Ti)
       • Our System:
         ✓ Performance: Cycle-accurate simulator parameterized by an RTL model
         ✓ Power & Area: Post-layout simulation in a 16nm process node
     ‣ All Other Parts: CPU (Intel Xeon Silver 4110)
  162. Comparisons
     To examine the individual benefits of SW and HW optimizations
     Four systems for comparison:
     ‣ Baseline (KD): No SW Optimization, No HW Optimization
     ‣ Baseline (2SKD): + SW Optimization, No HW Optimization
     ‣ Tigris (KD): No SW Optimization, + HW Optimization
     ‣ Tigris (2SKD): + SW Optimization, + HW Optimization
  169. Performance & Power
     [Bar charts: Speedup (x) and Power Reduction (x) for Baseline (KD), Baseline (2SKD), Our System (KD), Our System (2SKD)]
     Speedup bar labels: 5.9, 20.9, 1.0, 1.1
     Power reduction bar labels: 17.8, 10.5, 1.0, 1.0
     ‣ 20.9X speed-up on KD-Tree search; 3.5X end-to-end speed-up
     ‣ 10.5X power reduction on KD-Tree search; 3.0X end-to-end power reduction
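The gap between the kernel speed-up and the end-to-end speed-up is what Amdahl's law predicts when only part of the pipeline is accelerated. The sketch below is a back-of-the-envelope consistency check, not the authors' methodology: it assumes that only KD-tree search gets faster and borrows illustrative fractions from the range on the bottleneck slide (52% to 85%), which may not have been measured on the same baseline.

```python
def end_to_end_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall speed-up when only `kernel_fraction` of time is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# Illustrative fractions of end-to-end time spent in KD-tree search.
for fraction in (0.52, 0.75, 0.85):
    overall = end_to_end_speedup(fraction, 20.9)
    print(f"KD-tree share {fraction:.0%}: end-to-end speed-up {overall:.2f}x")
```

Under this model, a KD-tree share of roughly 75% combined with a 20.9X kernel speed-up gives an overall speed-up close to 3.5X.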
  173. Summary
     ‣ Point Cloud Registration
       ▹ A fundamental building block in emerging domains such as Autonomous Driving and Mixed Reality
     ‣ Our Tigris System
       ▹ An early step towards efficient Point Cloud Registration
     ‣ Key Insight
       ▹ Co-designing Software and Hardware to boost efficiency
  180. Rethink Systems Stack for Point Cloud Processing
     2-D Image / Video
       • Application: Starfish (LiKamWa et al., 2013), Focus (Hsieh et al., 2018), …
       • Compiler: Halide (Ragan-Kelley et al., 2013), Darkroom (Hegarty et al., 2014), Opt (Devito et al., 2018), …
       • Architecture: IDEAL (Mahmoud et al., 2017), Eyeriss (Chen et al., 2016), Euphrates (Zhu et al., 2018), …
     Point Cloud
       • Application: ??????
       • Compiler: ??????
       • Architecture: ??????
     Point clouds are high-dimensional, sparse and irregular
     Computation / memory access patterns are fundamentally different
  181. Thank you!

  182. Q & A

  183. Representative Design Points
     [Plots: Translational Error vs. Execution Time; Rotational Error vs. Execution Time, with design points DP1-DP8 marked]
  184. Performance: absolute time
     Intel Xeon Silver 4110 core: ~5.0 - 10.0 seconds
     Our system: ~1.0 - 3.0 seconds
  185. Speed-up & Power Reduction
     [Bar charts for Design Point 7 and Design Point 4: Speedup (X) and Power Reduction (X) for Base-KD, Base-2SKD, Acc-KD, Acc-2SKD]
  186. Approximate Search
     In Design Point 4:
     ▸ Computation Saving: 72.8% less distance compute & comparison
     ▸ Accuracy Loss:
       ▹ Translational Error: 0
       ▹ Rotational Error: 0.05 °/meter
  187. Area Analysis
     SRAM: 8.38 mm^2 (53.8%)
     Compute Logic: 7.19 mm^2 (46.2%)
     [Diagram: full hardware architecture, as on the following slide]
  188. Hardware Architecture
     [Diagram labels: Global Buffer; Search Units, each with PEs and a BE Query Buffer; Query Distribution Network; Recursion Units (FQ, RS, RN, CD, PI, CL; Bypass, Forward); FE Query Queue; Query Buffer; Point Buffer; Query Stack Buffer; Result Buffer]
  189. Memory Traffic
     [Chart: Memory Traffic Dist. (%) for ACC-2SKD and ACC-KD, broken down by FE Query Q, Query Buf, Query Stacks, Res. Buf, BE Query Q, Node Cache, Points Buf]