
Tigris: Architecture and Algorithms for 3D Perception in Point Clouds

MICRO 2019 Talk. Presented by Boyuan Tian and Tiancheng Xu.

HorizonLab

October 08, 2019

Transcript

  1. Tigris: Architecture and Algorithms for 3D Perception in Point Clouds. Tiancheng Xu*, Boyuan Tian*, with Yuhao Zhu. Department of Computer Science, University of Rochester. http://horizon-lab.org
  2. [Figure: laser scan of Notre Dame Cathedral] Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/
  3–6. Point Cloud ‣ Points in 3-D space, i.e., XYZ coordinates ‣ Effective in capturing visual features ‣ 3-D scanners/sensors ▹ Scan from multiple perspectives ▹ Stitch these point clouds to form a complete point cloud
  7–8. Point Cloud Registration ▸ Aligns two point clouds by calculating a transformation: Transformation = Rotation + Translation
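Registration computes a rigid transformation, a rotation R plus a translation t, that maps one cloud onto the other: p' = R p + t for every point p. A minimal sketch of applying such a transform, assuming numpy (the example rotation and translation are illustrative):

    import numpy as np

    def apply_rigid_transform(points, R, t):
        """Apply p' = R @ p + t to an (N, 3) array of XYZ points."""
        return points @ R.T + t

    # Example: rotate 90 degrees about Z and shift along X.
    theta = np.pi / 2
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
    t = np.array([1.0, 0.0, 0.0])
    cloud = np.random.rand(5, 3)            # a tiny source point cloud
    aligned = apply_rigid_transform(cloud, R, t)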
  9–12. Motivation ▸ Point Cloud Registration: a fundamental building block of 3-D visual computing ▹ 3-D reconstruction, autonomous driving, mixed reality ▸ High performance requirement, limited energy budget
  13. Tigris System Overview: towards real-time and energy-efficient Point Cloud Registration ▸ Characterization ▸ SW/HW co-design ▸ Evaluation
  14. Point Cloud Registration Pipeline ▸ Seven stages: NE, KPDT, DC, KPCE, CR, RPCE, EM ▸ Stages 1–5 produce an initial estimation; stages 6–7 perform registration fine-tuning
  15–17. Example: Normal Estimation (NE) ▸ Algorithm choices (ALG): DNN, SVD ▸ Parameter (PARAM): search radius
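The SVD option fits a plane to the neighborhood found within the search radius and takes the direction of least variance as the normal. A minimal sketch of that idea, assuming numpy and a brute-force neighbor search for clarity (a real pipeline would use the KD-tree search discussed later):

    import numpy as np

    def estimate_normal(points, query, radius):
        """Estimate the surface normal at `query` from neighbors within `radius`."""
        dists = np.linalg.norm(points - query, axis=1)
        nbrs = points[dists < radius]          # neighbor search (brute force here)
        if len(nbrs) < 3:
            return None                        # not enough neighbors to fit a plane
        centered = nbrs - nbrs.mean(axis=0)    # center the neighborhood
        # The right singular vector with the smallest singular value is the plane normal.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return vt[-1]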
  18–20. Huge Design Space ▸ Algorithm (ALG) and parameter (PARAM) choices per stage:
      NE    ALG: DNN, SVD             PARAM: search radius
      KPDT  ALG: SIFT, NARF           PARAM: scale range
      DC    ALG: FPFH, 3DSC           PARAM: search radius
      KPCE  ALG: -                    PARAM: reciprocity
      CR    ALG: Ratio, Dist, RANSAC  PARAM: THRESH
      RPCE  ALG: NORM-S, PROJECT      PARAM: converging criteria, reciprocity
      EM    ALG: -                    PARAM: MetricT, SolverT
  ▸ Configurable pipeline: https://github.com/horizon-research/PointCloud-pipeline
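Each stage therefore contributes an algorithm and/or parameter choice, and the cross product of all choices forms the design space explored next. A hypothetical configuration for a single design point, just to illustrate its shape (the keys and values here are illustrative, not the exact schema of the released pipeline):

    design_point = {
        "NE":   {"alg": "SVD",    "search_radius": 0.5},
        "KPDT": {"alg": "SIFT",   "scale_range": (0.1, 0.8)},
        "DC":   {"alg": "FPFH",   "search_radius": 1.0},
        "KPCE": {"reciprocity": True},
        "CR":   {"alg": "RANSAC", "threshold": 0.05},
        "RPCE": {"alg": "NORM-S", "reciprocity": False},
        "EM":   {"metric": "point-to-plane", "solver": "SVD"},
    }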
  21. Design Space Exploration ▸ Each design point is plotted as (X, Y): error rate X on one axis, execution time (latency) Y on the other
  22. Design Space Exploration ▸ Transformation = Rotation + Translation, so the error rate is measured as rotational error and translational error, each plotted against execution time
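Given an estimated transform and the ground truth, the translational error is the distance between the two translation vectors and the rotational error is the angle of the residual rotation. A minimal sketch, assuming numpy and 3x3 rotation matrices (the benchmark reports these errors per meter traveled; that normalization is omitted here):

    import numpy as np

    def registration_errors(R_est, t_est, R_gt, t_gt):
        """Return (translational error, rotational error in degrees)."""
        t_err = np.linalg.norm(t_est - t_gt)
        # Angle of the residual rotation R_gt^T @ R_est.
        cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
        r_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
        return t_err, r_err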
  23. Representative Design Points ▸ Eight representative design points (DP1–DP8) chosen from the execution time vs. translational error and execution time vs. rotational error plots
  24. Characterization ▸ Characterize the pipeline stages (NE, KPDT, DC, KPCE, CR, RPCE, EM) using the representative design points (DP1–DP8)
  25–26. Bottleneck: KD-Tree Search ▸ KD-Tree search as a fraction of end-to-end pipeline latency: DP1 85%, DP2 85%, DP3 80%, DP4 76%, DP5 75%, DP6 74%, DP7 65%, DP8 52%
  27–35. KD-Tree Search ▸ Neighbor Search (NS) ▹ Universal in point cloud processing ▹ Finds the neighbors of one point (the query point) among a set of points (the search points) ▸ KD-Tree Search ▹ Standard implementation for NS in point clouds ▹ Effectively reduces the computation workload of NS ▹ Inefficient on GPUs due to its sequential nature ▹ Challenging for hardware acceleration
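A KD-tree prunes the search by recursing into the half-space containing the query first and visiting the other half only when the splitting plane lies within the search radius; that data-dependent branching is what makes the traversal inherently sequential. A minimal sketch of radius search on a canonical KD-tree, assuming numpy (node layout and recursion are simplified for brevity):

    import numpy as np

    def build_kdtree(points, depth=0):
        if len(points) == 0:
            return None
        axis = depth % 3                                   # cycle through x, y, z
        pts = points[points[:, axis].argsort()]
        mid = len(pts) // 2
        return {"point": pts[mid], "axis": axis,
                "left": build_kdtree(pts[:mid], depth + 1),
                "right": build_kdtree(pts[mid + 1:], depth + 1)}

    def radius_search(node, query, radius, out):
        if node is None:
            return
        if np.linalg.norm(node["point"] - query) <= radius:
            out.append(node["point"])
        diff = query[node["axis"]] - node["point"][node["axis"]]
        near, far = ("left", "right") if diff < 0 else ("right", "left")
        radius_search(node[near], query, radius, out)      # always visit the near side
        if abs(diff) <= radius:                            # visit the far side only if needed
            radius_search(node[far], query, radius, out)

    cloud = np.random.rand(1000, 3)
    tree = build_kdtree(cloud)
    neighbors = []
    radius_search(tree, np.array([0.5, 0.5, 0.5]), 0.1, neighbors)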
  36. Tigris System Overview: towards real-time and energy-efficient Point Cloud Registration ▸ Characterization ▸ SW/HW co-design ▸ Evaluation
  37. Redundancy vs. Parallelism ▸ Unordered set: huge parallelism, huge redundancy ▸ Canonical KD-tree: no redundancy, no parallelism
  38–42. Two-Stage KD-Tree ▸ New data structure that balances parallelism and redundancy ▹ Top tree: same first few levels as a canonical KD-tree, traversed sequentially ▹ Leaf nodes: each sub-tree below a top-tree leaf is flattened into an unordered set of that leaf's children and searched in parallel
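A minimal sketch of the two-stage structure, assuming numpy: the top levels form a canonical KD-tree, and everything below a top-tree leaf is kept as an unordered set so its points can be distance-checked in parallel (here as one vectorized operation). For brevity this sketch descends without backtracking into sibling leaves, so results near leaf boundaries are approximate:

    import numpy as np

    def build_two_stage_kdtree(points, depth=0, top_levels=4):
        """Top `top_levels` levels form a canonical KD-tree; below that,
        points are kept as an unordered leaf set."""
        if depth == top_levels or len(points) <= 1:
            return {"leaf": points}                        # unordered set of points
        axis = depth % 3
        pts = points[points[:, axis].argsort()]
        mid = len(pts) // 2
        return {"axis": axis, "split": pts[mid, axis],
                "left": build_two_stage_kdtree(pts[:mid], depth + 1, top_levels),
                "right": build_two_stage_kdtree(pts[mid:], depth + 1, top_levels)}

    def leaf_radius_search(node, query, radius):
        """Sequentially descend the top tree, then search the whole leaf at once."""
        while "leaf" not in node:
            side = "left" if query[node["axis"]] < node["split"] else "right"
            node = node[side]
        pts = node["leaf"]
        dists = np.linalg.norm(pts - query, axis=1)        # parallel-friendly distance check
        return pts[dists <= radius]

    cloud = np.random.rand(2000, 3)
    tree = build_two_stage_kdtree(cloud)
    nbrs = leaf_radius_search(tree, np.array([0.5, 0.5, 0.5]), 0.05)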
  43–49. Approximate Search ▸ New search algorithm ▹ Close queries (e.g., Qi and Qj) are likely to share similar search results within the search radius R
  50–53. Approximate Search ▸ Leader (Qi): searches in the children of leaf nodes as usual ▸ Follower (Qj): searches among the neighbors already found by its leader ▸ Efficiently mitigates search redundancy
  54. Approximate Search ▸ Total savings of node visits: 72.8% ▸ Negligible effect on registration accuracy
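A minimal sketch of the leader/follower idea, assuming numpy and brute-force search in place of the tree for brevity: the first query acts as leader and searches as usual, while queries close to it (followers) filter the leader's result set instead of searching again. The grouping rule and the follow_dist threshold here are illustrative assumptions:

    import numpy as np

    def leader_follower_search(queries, points, radius, follow_dist):
        """Approximate radius search: the first query is the leader; queries within
        `follow_dist` of it reuse the leader's result set instead of re-searching."""
        leader = queries[0]
        d = np.linalg.norm(points - leader, axis=1)
        leader_result = points[d <= radius]                # leader: full search as usual
        results = []
        for q in queries:
            if np.linalg.norm(q - leader) <= follow_dist:
                cand = leader_result                       # follower: search leader's neighbors
            else:
                cand = points                              # too far away: fall back to full search
            dq = np.linalg.norm(cand - q, axis=1)
            results.append(cand[dq <= radius])
        return results

    points = np.random.rand(5000, 3)
    queries = np.random.rand(8, 3) * 0.1 + 0.45            # a batch of nearby queries
    out = leader_follower_search(queries, points, radius=0.1, follow_dist=0.02)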
  55–57. Software-Hardware Co-design for Neighbor Search ▸ New data structure + new search algorithm: expose huge parallelism with negligible search redundancy ▸ Software-hardware co-design: sequential traversal in the top tree, parallel search in the leaf nodes
  58–62. Hardware Architecture ▸ Decoupled architecture ▹ Front-end for tree traversal (sequential traversal in the top tree) ▹ Back-end for parallel search (in the leaf nodes) ▸ Front-end: an array of Recursion Units (RUs) fed by the front-end buffer and a query distribution network; exploits QLP plus limited NLP ▹ QLP: Query-Level Parallelism ▹ NLP: Node-Level Parallelism
  63–65. Recursion Unit Hardware ▸ Originally no NLP can be exploited: tree traversal in the top tree is still sequential ▸ Recursion Unit microarchitecture: limited NLP is exploited by pipelining different nodes ▹ Two optimizations avoid data dependencies and pipeline stalls
  66–69. Hardware Architecture ▸ Back-end: an array of Search Units (SUs), each with its own buffer, connected to the global buffer ▹ Exploits QLP + NLP ▹ Queries are routed to SUs by leaf node ID % number of SUs
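The routing rule on the slide, leaf node ID % number of SUs, statically maps every leaf's points to one search unit. A tiny sketch of that mapping (the names and example IDs are assumptions for illustration):

    NUM_SUS = 8

    def su_for_leaf(leaf_id, num_sus=NUM_SUS):
        """Map a leaf node to the search unit that owns its points."""
        return leaf_id % num_sus

    # Queries leaving the front-end carry the ID of the leaf they reached,
    # which determines the back-end search unit they are forwarded to.
    query_queue = [(0, 5), (1, 13), (2, 5)]                # (query_id, leaf_id) pairs
    routed = [(q, su_for_leaf(leaf)) for q, leaf in query_queue]   # [(0, 5), (1, 5), (2, 5)]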
  70–79. Search Unit Hardware ▸ Each SU buffers incoming queries and dispatches them to an array of PEs ▸ Execution model exploiting QLP ▹ Multiple queries, multiple leaf nodes: high PE utilization, but high data bandwidth ▹ Multiple queries, single leaf node: low data bandwidth, but low PE utilization
  80–83. Search Unit Hardware ▸ Node-level parallelism: the children in a leaf node are searched in parallel across PEs ▸ Data-flow exploiting NLP ▹ 1-D systolic array ▹ Query-stationary data-flow
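A minimal behavioral sketch of the query-stationary data-flow, in Python with numpy: each PE holds one query while the children of a leaf node stream past, so every PE checks every streamed point once. Cycle-level timing, the systolic shifting, and buffering are omitted:

    import numpy as np

    def query_stationary_pass(queries, leaf_points, radius):
        """Each PE pins one query; leaf points flow through the PE chain."""
        results = [[] for _ in queries]
        # Stream one point per "cycle"; in hardware the points march down the chain,
        # here we only model the work each PE ends up doing.
        for point in leaf_points:
            for pe, q in enumerate(queries):               # all PEs compare in parallel
                if np.linalg.norm(point - q) <= radius:
                    results[pe].append(point)
        return results

    leaf = np.random.rand(32, 3)                           # children of one leaf node
    qs = np.random.rand(4, 3)                              # one query pinned per PE
    res = query_stationary_pass(qs, leaf, radius=0.2)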
  84. Tigris System Overview: towards real-time and energy-efficient Point Cloud Registration ▸ Characterization ▸ SW/HW co-design ▸ Evaluation
  85–88. Metrics & Dataset ‣ Speed-up & power reduction • On the performance bottleneck: KD-Tree search • On the end-to-end registration pipeline ‣ Dataset: self-driving benchmark (KITTI) • Each point cloud frame: ~130,000 points
  89–91. Hardware ‣ KD-Tree search • Baseline: GPU (RTX 2080 Ti) • Our system: ✓ Performance: cycle-accurate simulator parameterized by an RTL model ✓ Power & area: post-layout simulation in a 16nm process node ‣ All other parts: CPU (Intel Xeon Silver 4110)
  92–96. Comparisons ▸ Four systems to examine the individual benefits of SW and HW optimizations ▹ Baseline (KD): no SW optimization, no HW optimization ▹ Baseline (2SKD): + SW optimization, no HW optimization ▹ Tigris (KD): no SW optimization, + HW optimization ▹ Tigris (2SKD): + SW optimization, + HW optimization
  97–102. Performance & Power ▸ KD-Tree search speedup (vs. Baseline (KD)): Baseline (2SKD) 1.1x, Our System (KD) 5.9x, Our System (2SKD) 20.9x ▸ KD-Tree search power reduction: Our System (KD) 17.8x, Our System (2SKD) 10.5x (baselines at 1.0x) ▸ The 20.9x speed-up on KD-Tree search translates to a 3.5x end-to-end speed-up; the 10.5x power reduction on KD-Tree search translates to a 3.0x end-to-end power reduction
  103–105. Summary ‣ Point Cloud Registration ▹ A fundamental building block in emerging domains such as autonomous driving and mixed reality ‣ Our Tigris system ▹ An early step towards efficient Point Cloud Registration ‣ Key insight ▹ Co-designing software and hardware to boost efficiency
  106–110. Rethink the Systems Stack for Point Cloud Processing ▸ 2-D image/video has a mature stack ▹ Application: Starfish (LiKamWa et al., 2013), Focus (Hsieh et al., 2018), … ▹ Compiler: Halide (Ragan-Kelley et al., 2013), Darkroom (Hegarty et al., 2014), Opt (DeVito et al., 2018), … ▹ Architecture: IDEAL (Mahmoud et al., 2017), Eyeriss (Chen et al., 2016), Euphrates (Zhu et al., 2018), … ▸ The point cloud stack is still open at every layer ▹ Point clouds are high-dimensional, sparse, and irregular; their computation and memory access patterns are fundamentally different
  111. Representative Design Points ▸ DP1–DP8 plotted as execution time vs. translational error and execution time vs. rotational error (same plot as slide 23)
  112. Performance: absolute time ▸ Intel Xeon Silver 4110 core: ~5.0–10.0 seconds ▸ Our system: ~1.0–3.0 seconds
  113. Speed-up & Power Reduction ▸ Per-configuration speed-up and power reduction (Base-KD, Base-2SKD, Acc-KD, Acc-2SKD) shown for Design Point 7 and Design Point 4
  114. Approximate Search ▸ In Design Point 4: ▹ Computation saving: 72.8% fewer distance computations and comparisons ▹ Accuracy loss: translational error 0, rotational error 0.05°/meter
  115. Area Analysis ▸ SRAM: 8.38 mm^2 (53.8%) ▸ Compute logic: 7.19 mm^2 (46.2%) ▸ (Annotated on the hardware block diagram: global buffer, front-end Recursion Units, query distribution network, back-end Search Units with PEs and query buffers)
  116. Hardware Architecture ▸ Full block diagram: front-end query queue and Recursion Units (FQ, RS, RN, CD, PI, CL stages with bypass/forward paths), query distribution network, global buffer, and back-end Search Units with per-SU query buffers and PE arrays
  117. Memory Traffic 69 100 80 60 40 20 0 Memory

    Traffic Dist. (%) ACC-2SKD ACC-KD FE Query Q Query Buf Query Stacks Res. Buf BE Query Q Node Cache Points Buf Global Buffer Search Unit …… BE Query Buffer PE … PE Search Unit PE … PE Search Unit PE … PE BE Query Buffer BE Query Buffer Recursion Unit FQ RS RN CD PI CL Bypass Forward Query Distribution Network Recursion Unit FQ RS RN CD PI CL Bypass Forward Recursion Unit FQ RS RN CD PI CL Bypass Forward …… FE Query Queue Query Buffer Point Buffer Query Stack Buffer Result Buffer