Slide 1

Slide 1 text

1 Tigris: Architecture and Algorithms for 
 3D Perception in Point Clouds Tiancheng Xu*, Boyuan Tian* 
 with Yuhao Zhu Department of Computer Science University of Rochester http://horizon-lab.org

Slide 2

Slide 2 text

Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/

Slide 3

Slide 3 text

Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/

Slide 5

Slide 5 text

4 Point Cloud

Slide 6

Slide 6 text

4 Point Cloud ‣ Points in 3-d space, i.e., XYZ coordinates

Slide 7

Slide 7 text

4 Point Cloud ‣ Points in 3-d space, i.e., XYZ coordinates ‣ Effective in capturing visual features

Slide 8

Slide 8 text

4 Point Cloud ‣ Points in 3-d space, i.e., XYZ coordinates ‣ Effective in capturing visual features ‣ 3-d scanners/sensors

Slide 9

Slide 9 text

4 Point Cloud ‣ Points in 3-d space, i.e., XYZ coordinates ‣ Effective in capturing visual features ‣ 3-d scanners/sensors ▹ Scan from multiple perspectives

Slide 10

Slide 10 text

Point Cloud
‣ Points in 3-d space, i.e., XYZ coordinates
‣ Effective in capturing visual features
‣ 3-d scanners/sensors
  ▹ Scan from multiple perspectives
  ▹ Stitch these Point Clouds to form a complete Point Cloud

Slide 11

Slide 11 text

Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/

Slide 13

Slide 13 text

Point Cloud Registration 6

Slide 14

Slide 14 text

Point Cloud Registration 6 ▸ Aligns two point clouds by calculating a transformation

Slide 15

Slide 15 text

Point Cloud Registration
▸ Aligns two point clouds by calculating a transformation
Transformation = Rotation + Translation
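
To make "Transformation = Rotation + Translation" concrete, here is a minimal sketch that applies a rigid transform p' = R p + t to a few points. The helper names (`rotation_z`, `apply_rigid_transform`) and the sample values are illustrative assumptions, not part of the Tigris code base; registration itself is the harder problem of finding R and t, which this sketch does not attempt.

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s,  c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_rigid_transform(points, R, t):
    """Return R @ p + t for every 3-D point p (rotation first, then translation)."""
    return [
        tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
        for p in points
    ]

# Hypothetical source cloud and transform, chosen only for illustration.
source = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
R = rotation_z(math.pi / 2)   # 90 degrees about z
t = (0.0, 0.0, 1.0)           # shift one unit along z
aligned = apply_rigid_transform(source, R, t)
```

With these inputs the first point (1, 0, 0) is rotated onto the y-axis and then lifted along z.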

Slide 17

Slide 17 text

Motivation 7

Slide 18

Slide 18 text

Motivation 7 Point Cloud Registration: a fundamental building block

Slide 19

Slide 19 text

Motivation 7 3-d Reconstruction Point Cloud Registration: a fundamental building block

Slide 20

Slide 20 text

Motivation 7 3-d Reconstruction Point Cloud Registration: a fundamental building block Autonomous Driving

Slide 21

Slide 21 text

Motivation 7 3-d Reconstruction Point Cloud Registration: a fundamental building block Autonomous Driving Mixed Reality

Slide 22

Slide 22 text

Motivation 7 3-d Reconstruction Point Cloud Registration: a fundamental building block Autonomous Driving Mixed Reality 3-D Visual Computing

Slide 24

Slide 24 text

Motivation 7 3-d Reconstruction Point Cloud Registration: a fundamental building block Autonomous Driving Mixed Reality 3-D Visual Computing Limited Energy Budget

Slide 25

Slide 25 text

Motivation 7 3-d Reconstruction Point Cloud Registration: a fundamental building block Autonomous Driving Mixed Reality 3-D Visual Computing High Performance Requirement Limited Energy Budget

Slide 26

Slide 26 text

Tigris System Overview 8 Towards real-time and energy-efficient Point Cloud Registration

Slide 27

Slide 27 text

Tigris System Overview 8 Towards real-time and energy-efficient Point Cloud Registration Characterization

Slide 28

Slide 28 text

Tigris System Overview 8 Towards real-time and energy-efficient Point Cloud Registration Characterization SW/HW Co-design

Slide 29

Slide 29 text

Tigris System Overview 8 Towards real-time and energy-efficient Point Cloud Registration Characterization SW/HW Co-design Evaluation

Slide 30

Slide 30 text

Tigris System 9 Towards real-time and energy-efficient Point Cloud Registration Characterization SW/HW Co-design Evaluation

Slide 31

Slide 31 text

Point Cloud Registration Pipeline 10 Registration

Slide 33

Slide 33 text

Point Cloud Registration Pipeline 10 Registration Initial Estimation Fine-tuning

Slide 34

Slide 34 text

Point Cloud Registration Pipeline 11 Registration Fine-tuning Initial Estimation

Slide 35

Slide 35 text

Point Cloud Registration Pipeline
Registration: Initial Estimation, Fine-tuning
Stage1: NE, Stage2: KPTD, Stage3: DC, Stage4: KPCE, Stage5: CR, Stage6: RPCE, Stage7: EM

Slide 36

Slide 36 text

Example: Normal Estimation (NE) 12 NE KPDT DC KPCE CR RPCE EM Registration

Slide 37

Slide 37 text

Example: Normal Estimation (NE) 12 ALG NE KPDT DC KPCE CR RPCE EM Registration

Slide 38

Slide 38 text

Example: Normal Estimation (NE) 12 ALG NE KPDT DC KPCE CR RPCE EM Registration • SVD

Slide 39

Slide 39 text

Example: Normal Estimation (NE) 12 ALG NE KPDT DC KPCE CR RPCE EM Registration • DNN • SVD

Slide 40

Slide 40 text

Example: Normal Estimation (NE) 12 ALG PARAM NE KPDT DC KPCE CR RPCE EM Registration • DNN • SVD

Slide 41

Slide 41 text

Example: Normal Estimation (NE)
Registration: NE KPDT DC KPCE CR RPCE EM
ALG: • DNN • SVD
PARAM: • Search radius

Slide 42

Slide 42 text

Huge Design Space
Registration: NE KD DC KPCE CR RPCE EM
Rows: ALG, PARAM
Options shown: • DNN • SVD • Search radius • SIFT • NARF • Scale range • FPFH • 3DSC • Search radius • Reciprocity • Ratio • Dist • RANSAC • THRESH • NORM-S • PROJECT • Converging criteria • Reciprocity • MetricT • SolverT

Slide 44

Slide 44 text

Huge Design Space
Registration: NE KD DC KPCE CR RPCE EM
Rows: ALG, PARAM
Options shown: • DNN • SVD • Search radius • SIFT • NARF • Scale range • FPFH • 3DSC • Search radius • Reciprocity • Ratio • Dist • RANSAC • THRESH • NORM-S • PROJECT • Converging criteria • Reciprocity • MetricT • SolverT
Configurable pipeline: https://github.com/horizon-research/PointCloud-pipeline

Slide 45

Slide 45 text

Design Space Exploration 14

Slide 46

Slide 46 text

Design Space Exploration 14 Error Rate

Slide 47

Slide 47 text

Design Space Exploration 14 Error Rate Execution Time

Slide 48

Slide 48 text

Design Space Exploration 15 Execution Time Error Rate

Slide 49

Slide 49 text

Design Space Exploration
Axes: Execution Time, Error Rate
A design point with error rate X and latency Y: (X, Y)
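
One way to read the (X, Y) scatter: a design point is worth keeping only if no other point is at least as good on both axes and strictly better on one. The sketch below filters hypothetical (error rate, latency) pairs down to that Pareto frontier. The names and numbers are made up for illustration, and the slides do not state how the representative design points were actually chosen.

```python
def pareto_front(points):
    """Return the names of points not dominated on (error, latency); lower is better."""
    front = []
    for name, err, lat in points:
        dominated = any(
            (e <= err and l <= lat) and (e < err or l < lat)
            for _, e, l in points
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical design points: (name, error rate, latency in seconds).
candidates = [
    ("A", 0.10, 5.0),
    ("B", 0.20, 2.0),
    ("C", 0.25, 3.0),   # worse than B on both axes
    ("D", 0.05, 9.0),
]
frontier = pareto_front(candidates)
```

Here point C drops out because B has both lower error and lower latency; A, B, and D each win on at least one axis.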

Slide 50

Slide 50 text

Design Space Exploration 16 Translational Error Rotational Error Execution Time Execution Time

Slide 51

Slide 51 text

Design Space Exploration
Plots: Translational Error vs. Execution Time; Rotational Error vs. Execution Time
Transformation = Rotation + Translation
Error Rate: Rotational & Translational Error
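
Since the error rate is split into a rotational and a translational part, a small sketch of how such errors are commonly computed may help. It assumes the usual definitions (angle of the residual rotation, Euclidean distance between translations); the slides do not spell out the exact metric, and a later slide reports rotational error in °/meter, which suggests a per-distance normalization that this sketch omits.

```python
import math

def rotation_error_deg(R_est, R_gt):
    """Angle (degrees) of the residual rotation R_gt^T @ R_est."""
    # trace(R_gt^T @ R_est) equals the element-wise product summed over all entries.
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding
    return math.degrees(math.acos(cos_angle))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth translations."""
    return math.dist(t_est, t_gt)

# Hypothetical estimate: 10 degrees off about z, and offset by (3, 4, 0).
theta = math.radians(10.0)
R_gt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_est = [[math.cos(theta), -math.sin(theta), 0.0],
         [math.sin(theta),  math.cos(theta), 0.0],
         [0.0, 0.0, 1.0]]
rot_err = rotation_error_deg(R_est, R_gt)
trans_err = translation_error((3.0, 4.0, 0.0), (0.0, 0.0, 0.0))
```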

Slide 52

Slide 52 text

Design Space Exploration 17 Translational Error Rotational Error Execution Time Execution Time

Slide 53

Slide 53 text

Design Space Exploration 18 Translational Error Rotational Error Execution Time Execution Time

Slide 54

Slide 54 text

Design Space Exploration 19 Translational Error Rotational Error Execution Time Execution Time

Slide 55

Slide 55 text

Representative Design Points
Figure: two panels (Translational Error vs. Execution Time; Rotational Error vs. Execution Time) with labeled design points DP1-DP8

Slide 56

Slide 56 text

Characterization 21 NE KPTD DC KPCE CR RPCE EM Using the representative design points (DP1-8)

Slide 57

Slide 57 text

Characterization 22 NE KPTD DC KPCE CR RPCE EM

Slide 59

Slide 59 text

Characterization 22 NE KPTD DC KPCE CR RPCE EM KD-Tree Search

Slide 60

Slide 60 text

Bottleneck: KD-Tree Search
Figure: KD-Tree Search / End-to-End Pipeline Latency (%) for DP1-DP8

Slide 61

Slide 61 text

Bottleneck: KD-Tree Search
Figure: KD-Tree Search / End-to-End Pipeline Latency (%) for DP1-DP8
Bar labels: 85%, 85%, 80%, 76%, 75%, 74%, 65%, 52%

Slide 62

Slide 62 text

KD-Tree Search 25

Slide 63

Slide 63 text

KD-Tree Search ▸ Neighbor Search (NS) 25

Slide 64

Slide 64 text

KD-Tree Search ▸ Neighbor Search (NS) ▹ Universal in Point Cloud processing 25

Slide 65

Slide 65 text

KD-Tree Search ▸ Neighbor Search (NS) ▹ Universal in Point Cloud processing ▹ To find the neighbors 25

Slide 66

Slide 66 text

KD-Tree Search ▸ Neighbor Search (NS) ▹ Universal in Point Cloud processing ▹ To find the neighbors 25 Query Point of one point

Slide 67

Slide 67 text

KD-Tree Search
▸ Neighbor Search (NS)
  ▹ Universal in Point Cloud processing
  ▹ To find the neighbors of one point (Query Point) among a set of points (Search Points)

Slide 68

Slide 68 text

KD-Tree Search ▸ Neighbor Search (NS) ▹ Universal in Point Cloud processing ▹ To find the neighbors
 26 of one point among a set of points

Slide 69

Slide 69 text

KD-Tree Search ▸ Neighbor Search (NS) ▹ Universal in Point Cloud processing ▹ To find the neighbors
 ▸ KD-Tree Search 26 of one point among a set of points

Slide 70

Slide 70 text

KD-Tree Search ▸ Neighbor Search (NS) ▹ Universal in Point Cloud processing ▹ To find the neighbors
 ▸ KD-Tree Search ▹ Standard implementation for NS in point clouds 26 of one point among a set of points

Slide 71

Slide 71 text

KD-Tree Search ▸ Neighbor Search (NS) ▹ Universal in Point Cloud processing ▹ To find the neighbors
 ▸ KD-Tree Search ▹ Standard implementation for NS in point clouds ▹ Effectively reduces the computation workload of NS 26 of one point among a set of points

Slide 72

Slide 72 text

KD-Tree Search ▸ Neighbor Search (NS) ▹ Universal in Point Cloud processing ▹ To find the neighbors
 ▸ KD-Tree Search ▹ Standard implementation for NS in point clouds ▹ Effectively reduces the computation workload of NS ▹ Inefficient on GPUs due to its sequential nature 26 of one point among a set of points

Slide 73

Slide 73 text

KD-Tree Search
▸ Neighbor Search (NS)
  ▹ Universal in Point Cloud processing
  ▹ To find the neighbors of one point among a set of points
▸ KD-Tree Search
  ▹ Standard implementation for NS in point clouds
  ▹ Effectively reduces the computation workload of NS
  ▹ Inefficient on GPUs due to its sequential nature
  ▹ Challenging for hardware acceleration
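
The sequential nature mentioned above shows up even in a textbook KD-tree: each step of the descent depends on the comparison at the previous node, and the decision to visit the far branch depends on the best distance found so far. The following is a minimal nearest-neighbor sketch in plain Python, written for illustration only; it is not the implementation profiled in the talk.

```python
import math

def build_kdtree(points, depth=0):
    """Build a KD-tree over 3-D points, splitting on x, y, z in turn at the median."""
    if not points:
        return None
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) of the nearest neighbor of query."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)   # descend toward the query first
    if abs(diff) < best[0]:             # far side can only help if the split plane is close
        best = nearest(far, query, best)
    return best

# Hypothetical, deterministic point set used only for the demo.
cloud = [(float((i * 37) % 11), float((i * 17) % 7), float((i * 29) % 13)) for i in range(50)]
tree = build_kdtree(cloud)
dist, point = nearest(tree, (4.2, 6.9, 12.1))
```

The pruning test is what saves work relative to scanning every point, and it is also what makes each query a chain of dependent steps.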

Slide 74

Slide 74 text

Tigris System Overview 27 Towards real-time and energy-efficient Point Cloud Registration Characterization SW/HW Co-design Evaluation

Slide 75

Slide 75 text

Redundancy vs. Parallelism 28

Slide 76

Slide 76 text

Redundancy vs. Parallelism 28 Unordered Set Canonical KD-Tree

Slide 77

Slide 77 text

Redundancy vs. Parallelism 28 Unordered Set Canonical KD-Tree Current Node

Slide 78

Slide 78 text

Redundancy vs. Parallelism 28 Unordered Set Canonical KD-Tree ▹No Redundancy, No Parallelism Current Node

Slide 79

Slide 79 text

Redundancy vs. Parallelism
Unordered Set
▹ Huge Parallelism, Huge Redundancy
Canonical KD-Tree
▹ No Redundancy, No Parallelism
Figure label: Current Node

Slide 80

Slide 80 text

Two-Stage KD-Tree New data structure ▹Balances parallelism and redundancy 29

Slide 81

Slide 81 text

Two-Stage KD-Tree New data structure ▹Balances parallelism and redundancy 29 Two-Stage KD-Tree

Slide 82

Slide 82 text

Two-Stage KD-Tree New data structure ▹Balances parallelism and redundancy 29 Two-Stage KD-Tree Top-Tree

Slide 83

Slide 83 text

Two-Stage KD-Tree New data structure ▹Balances parallelism and redundancy 29 Two-Stage KD-Tree Top-Tree Children of Leaf Nodes Leaf Nodes

Slide 84

Slide 84 text

Two-Stage KD-Tree New data structure ▹Balances parallelism and redundancy 30 Two-Stage KD-Tree

Slide 85

Slide 85 text

Two-Stage KD-Tree New data structure ▹Balances parallelism and redundancy 30 Two-Stage KD-Tree Canonical KD-Tree Same First Few Levels

Slide 86

Slide 86 text

Two-Stage KD-Tree New data structure ▹Balances parallelism and redundancy 30 Two-Stage KD-Tree Canonical KD-Tree Sequential Traversal

Slide 87

Slide 87 text

Two-Stage KD-Tree New data structure ▹Balances parallelism and redundancy 30 Two-Stage KD-Tree Canonical KD-Tree Sub-Tree Unordered Set Sequential Traversal

Slide 88

Slide 88 text

Two-Stage KD-Tree
New data structure
▹ Balances parallelism and redundancy
Figure labels: Two-Stage KD-Tree, Canonical KD-Tree, Sequential Traversal, Parallel Search
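
The idea on these slides can be sketched as a KD-tree that stops splitting after a fixed number of levels and stores the remaining points of each leaf as an unordered set. The code below is an illustrative reading of the slides, not the authors' implementation: `top_height` is a hypothetical parameter name, and the leaf scan is written as a plain loop even though its distance computations are independent and could run in parallel.

```python
import math

def build_two_stage(points, top_height, depth=0):
    """Top tree of at most top_height levels; leaves keep unordered point sets."""
    if depth == top_height or len(points) <= 1:
        return {"leaf_set": list(points)}
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_two_stage(pts[:mid], top_height, depth + 1),
        "right": build_two_stage(pts[mid + 1:], top_height, depth + 1),
    }

def search(node, query, best=(math.inf, None)):
    """Exact nearest neighbor: sequential in the top tree, exhaustive in leaf sets."""
    if "leaf_set" in node:
        for p in node["leaf_set"]:      # independent distance computations
            d = math.dist(p, query)
            if d < best[0]:
                best = (d, p)
        return best
    d = math.dist(node["point"], query)
    if d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = search(near, query, best)
    if abs(diff) < best[0]:
        best = search(far, query, best)
    return best

# Hypothetical point set; top_height=0 degenerates to a single unordered set,
# while a large top_height approaches a canonical KD-tree.
cloud = [(float((i * 37) % 11), float((i * 17) % 7), float((i * 29) % 13)) for i in range(50)]
two_stage = build_two_stage(cloud, top_height=2)
dist, point = search(two_stage, (4.2, 6.9, 12.1))
```

Every point in a visited leaf set is examined, including points a canonical KD-tree would have pruned. That extra work is the redundancy the slides trade for parallelism.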

Slide 89

Slide 89 text

Quantifying Redundancy 31 Two-Stage KD-Tree Canonical KD-Tree

Slide 92

Slide 92 text

Quantifying Redundancy 32

Slide 93

Slide 93 text

Quantifying Redundancy
35X more points need to be visited

Slide 94

Slide 94 text

Approximate Search New search algorithm ▹Mitigates redundancy introduced by new data structure 33

Slide 95

Slide 95 text

Approximate Search New search algorithm ▹Close queries are likely to share similar search results 34

Slide 96

Slide 96 text

Approximate Search New search algorithm ▹Close queries are likely to share similar search results 34 Qi

Slide 97

Slide 97 text

Approximate Search 35 Qi N New search algorithm ▹Close queries are likely to share similar search results

Slide 98

Slide 98 text

Approximate Search 35 Qi N Qj New search algorithm ▹Close queries are likely to share similar search results

Slide 99

Slide 99 text

Approximate Search 36 Qi R New search algorithm ▹Close queries are likely to share similar search results

Slide 100

Slide 100 text

Approximate Search 37 Qi R New search algorithm ▹Close queries are likely to share similar search results

Slide 101

Slide 101 text

Approximate Search 37 Qi R Qj R New search algorithm ▹Close queries are likely to share similar search results

Slide 102

Slide 102 text

Approximate Search 38 Qj R R Qi New search algorithm ▹Close queries are likely to share similar search results

Slide 103

Slide 103 text

Approximate Search 39 R Qj R Qi New search algorithm ▹Close queries are likely to share similar search results

Slide 104

Slide 104 text

Approximate Search 40 Qi New search algorithm ▹Leader: search in children of leaf nodes as usual

Slide 105

Slide 105 text

Approximate Search 40 Qi leader New search algorithm ▹Leader: search in children of leaf nodes as usual

Slide 106

Slide 106 text

Approximate Search 40 Qi leader New search algorithm ▹Leader: search in children of leaf nodes as usual R

Slide 107

Slide 107 text

Approximate Search 41 Qi leader R New search algorithm ▹Leader: search in children of leaf nodes as usual

Slide 108

Slide 108 text

Approximate Search 42 Qi leader New search algorithm ▹Follower: search in neighbors of a leader R

Slide 109

Slide 109 text

Approximate Search 42 Qi leader Qj follower New search algorithm ▹Follower: search in neighbors of a leader R

Slide 110

Slide 110 text

Approximate Search 43 Qi leader R Qj follower New search algorithm ▹Follower: search in neighbors of a leader

Slide 111

Slide 111 text

Approximate Search 44 Qi leader R Qj follower New search algorithm ▹Efficiently mitigate search redundancy

Slide 112

Slide 112 text

Approximate Search
New search algorithm
▹ Efficiently mitigates search redundancy
Total savings of node visits: 72.8%
Negligible effect on registration accuracy
Figure labels: Qi (leader), Qj (follower), R
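
A rough sketch of the leader/follower idea for a radius search inside one leaf set follows. It assumes a simple rule that the slides do not specify: a query becomes a follower if it lies within `leader_threshold` of an existing leader, and a follower searches only the leader's result instead of the whole leaf set. The function name, the threshold parameter, and the test data are all hypothetical.

```python
import math

def approximate_radius_search(queries, leaf_points, radius, leader_threshold):
    """Radius search where followers reuse a nearby leader's result as candidates.

    Returns (results, candidate_checks). candidate_checks counts only the
    point-to-query distance tests, not the cost of finding a leader.
    """
    leaders = []            # (leader query, leader result)
    results = []
    candidate_checks = 0
    for q in queries:
        leader = next(
            (entry for entry in leaders if math.dist(entry[0], q) <= leader_threshold),
            None,
        )
        # Leader: scan the whole leaf set. Follower: scan the leader's neighbors only.
        candidates = leaf_points if leader is None else leader[1]
        found = [p for p in candidates if math.dist(p, q) <= radius]
        candidate_checks += len(candidates)
        if leader is None:
            leaders.append((q, found))
        results.append(found)
    return results, candidate_checks

# Hypothetical leaf set on a line, with two close queries.
leaf = [(float(x), 0.0, 0.0) for x in range(10)]
queries = [(5.0, 0.0, 0.0), (5.1, 0.0, 0.0)]
results, checks = approximate_radius_search(queries, leaf, radius=1.5, leader_threshold=0.5)
```

In this toy case the follower checks 3 candidates instead of 10 and still finds the same neighbors. In general a follower can miss points near the edge of its own radius, which is why the search is approximate.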

Slide 113

Slide 113 text

New data structure + new search algorithm: ▹Expose huge parallelism with negligible search redundancy
 Software-Hardware Co-design for Neighbor Search 45

Slide 114

Slide 114 text

New data structure + new search algorithm: ▹Expose huge parallelism with negligible search redundancy
 Software-Hardware co-design: Software-Hardware Co-design for Neighbor Search 45

Slide 115

Slide 115 text

Software-Hardware Co-design for Neighbor Search
New data structure + new search algorithm:
▹ Expose huge parallelism with negligible search redundancy
Software-Hardware co-design:
▹ Sequential traversal in top tree
▹ Parallel search in leaf nodes

Slide 116

Slide 116 text

Hardware Architecture
Decoupled architecture
▹ Front-end for tree traversal
▹ Back-end for parallel search
Sequential traversal in top tree; parallel search in leaf nodes
Diagram labels: Front-end, Back-end, Query Distribution Network, Front-end Buffer, Global buffer

Slide 118

Slide 118 text

Hardware Architecture 46 Front-end Back-end Query Distribution Network Front-end Buffer Global buffer Decoupled architecture ▹Front-end for tree traversal ▹Back-end for parallel search ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism Sequential traverse in top tree Parallel search in leaf nodes

Slide 119

Slide 119 text

Hardware Architecture 46 Front-end Back-end Query Distribution Network Front-end Buffer Global buffer Decoupled architecture ▹Front-end for tree traversal ▹Back-end for parallel search Front-end ▹Exploits QLP + limited NLP ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism Sequential traverse in top tree Parallel search in leaf nodes

Slide 120

Slide 120 text

Hardware Architecture 46 Back-end Query Distribution Network Front-end Buffer Global buffer Decoupled architecture ▹Front-end for tree traversal ▹Back-end for parallel search Front-end ▹Exploits QLP + limited NLP RU RU RU … Front-end Buffer ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism Sequential traverse in top tree Parallel search in leaf nodes

Slide 121

Slide 121 text

Recursion Unit Hardware
Originally no NLP can be exploited
▹ Tree traversal in top tree is still sequential
▹ QLP: Query-Level Parallelism
▹ NLP: Node-Level Parallelism

Slide 122

Slide 122 text

Recursion Unit Hardware
Originally no NLP can be exploited
▹ Tree traversal in top tree is still sequential
Limited NLP exploited by pipelining different nodes
Figure: Recursion Unit Microarchitecture
▹ QLP: Query-Level Parallelism
▹ NLP: Node-Level Parallelism

Slide 123

Slide 123 text

Recursion Unit Hardware
Originally no NLP can be exploited
▹ Tree traversal in top tree is still sequential
Limited NLP exploited by pipelining different nodes
▹ Two optimizations to avoid data dependency and pipeline stall
Figure: Recursion Unit Microarchitecture
▹ QLP: Query-Level Parallelism
▹ NLP: Node-Level Parallelism

Slide 124

Slide 124 text

Hardware Architecture 49 RU RU RU Query Distribution Network … Global buffer Back-end SUs Back-end SUs Front-end Buffer Decoupled architecture ▹Front-end for tree traversal ▹Back-end for parallel search Front-end ▹Exploits QLP + limited NLP Back-end ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 125

Slide 125 text

Hardware Architecture 50 RU RU RU SU SU SU Buf Buf Buf Front-end Buffer Query Distribution Network Decoupled architecture ▹Front-end for tree traversal ▹Back-end for parallel search Front-end ▹Exploits QLP + limited NLP Back-end … … Global buffer ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 127

Slide 127 text

Hardware Architecture
Decoupled architecture
▹ Front-end for tree traversal
▹ Back-end for parallel search
Front-end
▹ Exploits QLP + limited NLP
Back-end
▹ Exploits QLP + NLP
Query Distribution Network: Leaf node ID % num. of SUs
▹ QLP: Query-Level Parallelism
▹ NLP: Node-Level Parallelism
Diagram labels: RU, SU, Buf, Front-end Buffer, Query Distribution Network, Global buffer
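
The "Leaf node ID % num. of SUs" rule reads as a modulo mapping from the leaf node a query ended up in to the Search Unit that serves it. A tiny software analogue is sketched below; the function name and the sample IDs are hypothetical, and the real distribution network is hardware rather than a dictionary.

```python
from collections import defaultdict

def distribute_queries(query_leaf_ids, num_search_units):
    """Group query IDs by target Search Unit using leaf_id % num_search_units."""
    per_su = defaultdict(list)
    for query_id, leaf_id in query_leaf_ids:
        per_su[leaf_id % num_search_units].append(query_id)
    return dict(per_su)

# Hypothetical (query ID, leaf node ID) pairs produced by the front-end.
assignments = distribute_queries([(0, 5), (1, 2), (2, 9), (3, 6)], num_search_units=4)
```

With this rule, queries that land in the same leaf node always reach the same Search Unit.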

Slide 128

Slide 128 text

Search Unit Hardware 51 Q Q Q Q Q ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 130

Slide 130 text

Search Unit Hardware 51 SU Buf Q Q Q Q Q ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 131

Slide 131 text

Q Q Q Q Q Search Unit Hardware 51 SU Buf Q Q Q Q Q PE PE PE PE ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 132

Slide 132 text

Q Q Q Q Q Search Unit Hardware 51 SU Buf Q Q Q Q Q Query-Level Parallelism PE PE PE PE ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 133

Slide 133 text

Q Q Q Q Q Search Unit Hardware 51 SU Buf Q Q Q Q Q Query-Level Parallelism PE PE PE PE Execution Model Exploiting QLP ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 135

Slide 135 text

Q Q Q Q Q Search Unit Hardware 51 SU Buf Q Q Q Q Q Query-Level Parallelism PE PE PE PE Execution Model Exploiting QLP ▸ Multiple Query Multiple LeafNodes ▹High PE Utilization; High data bandwidth ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 136

Slide 136 text

Q Q Q Q Q Search Unit Hardware 51 SU Buf Q Q Q Q Q Query-Level Parallelism PE PE PE PE Execution Model Exploiting QLP ▸ Multiple Query Multiple LeafNodes ▹High PE Utilization; High data bandwidth ▸ Multiple Query Single LeafNodes ▹Low data bandwidth; Low PE Utilization ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 138

Slide 138 text

Execution Model Exploiting QLP ▸ Multiple Query Multiple LeafNodes ▹High PE Utilization; High data bandwidth ▸ Multiple Query Single LeafNodes ▹Low data bandwidth; Low PE Utilization PE PE PE Q Q Q Q Q Q Q Search Unit Hardware 52 Q Q Q Q Q PE N N N N N N Children in a leaf node to be searched in parallel Query-Level Parallelism ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 139

Slide 139 text

Execution Model Exploiting QLP ▸ Multiple Query Multiple LeafNodes ▹High PE Utilization; High data bandwidth ▸ Multiple Query Single LeafNodes ▹Low data bandwidth; Low PE Utilization PE PE PE Q Q Q Q Q Q Q Search Unit Hardware 52 Q Q Q Q Q Node-Level Parallelism PE N N N N N N Children in a leaf node to be searched in parallel Query-Level Parallelism ▹QLP: Query-Level Parallelism ▹NLP: Node-Level Parallelism

Slide 140

Slide 140 text

Search Unit Hardware
Execution Model Exploiting QLP
▸ Multiple Queries, Multiple Leaf Nodes
  ▹ High PE utilization; high data bandwidth
▸ Multiple Queries, Single Leaf Node
  ▹ Low data bandwidth; low PE utilization
Data-flow Exploiting NLP
▸ 1-D systolic array
▸ Query-stationary data-flow
Children in a leaf node to be searched in parallel
▹ QLP: Query-Level Parallelism
▹ NLP: Node-Level Parallelism
Diagram labels: Q, N, PE, Query-Level Parallelism, Node-Level Parallelism

Slide 142

Slide 142 text

Tigris System Overview 53 Towards real-time and energy-efficient Point Cloud Registration Characterization SW/HW Co-design Evaluation

Slide 143

Slide 143 text

Metrics & Dataset 54

Slide 144

Slide 144 text

Metrics & Dataset 54 ‣ Speed-up & Power Reduction

Slide 145

Slide 145 text

Metrics & Dataset 54 ‣ Speed-up & Power Reduction • Performance Bottleneck: KD-Tree Search

Slide 146

Slide 146 text

Metrics & Dataset 54 ‣ Speed-up & Power Reduction • Performance Bottleneck: KD-Tree Search • End-to-end Registration Pipeline


Slide 147

Slide 147 text

Metrics & Dataset 54 ‣ Speed-up & Power Reduction • Performance Bottleneck: KD-Tree Search • End-to-end Registration Pipeline
 ‣ Dataset: Self-Driving Benchmark (KITTI)

Slide 148

Slide 148 text

Metrics & Dataset
‣ Speed-up & Power Reduction
  • Performance Bottleneck: KD-Tree Search
  • End-to-end Registration Pipeline
‣ Dataset: Self-Driving Benchmark (KITTI)
  • Each Point Cloud frame: ~130,000 Points

Slide 149

Slide 149 text

Hardware 55

Slide 150

Slide 150 text

Hardware 55 ‣ KD-Tree Search

Slide 151

Slide 151 text

Hardware 55 ‣ KD-Tree Search • Baseline: GPU (RTX 2080 Ti)

Slide 152

Slide 152 text

Hardware 55 ‣ KD-Tree Search • Baseline: GPU (RTX 2080 Ti) • Our System:

Slide 153

Slide 153 text

Hardware 55 ‣ KD-Tree Search • Baseline: GPU (RTX 2080 Ti) • Our System: ✓ Performance: 
 Cycle-accurate simulator 
 parameterized by an RTL model

Slide 154

Slide 154 text

Hardware 55 ‣ KD-Tree Search • Baseline: GPU (RTX 2080 Ti) • Our System: ✓ Performance: 
 Cycle-accurate simulator 
 parameterized by an RTL model ✓ Power & Area: 
 Post layout simulation in a 16nm process node

Slide 155

Slide 155 text

Hardware
‣ KD-Tree Search
  • Baseline: GPU (RTX 2080 Ti)
  • Our System:
    ✓ Performance: Cycle-accurate simulator parameterized by an RTL model
    ✓ Power & Area: Post-layout simulation in a 16nm process node
‣ All Other Parts: CPU (Intel Xeon Silver 4110)

Slide 156

Slide 156 text

Comparisons 56

Slide 157

Slide 157 text

Comparisons 56 To examine the individual benefits of SW and HW optimizations

Slide 158

Slide 158 text

Comparisons 56 To examine the individual benefits of SW and HW optimizations Four systems for Comparison

Slide 159

Slide 159 text

Comparisons 56 To examine the individual benefits of SW and HW optimizations 
 No SW Optimization No HW Optimization Four systems for Comparison Baseline (KD)

Slide 160

Slide 160 text

Comparisons 56 To examine the individual benefits of SW and HW optimizations 
 No SW Optimization No HW Optimization 
 + SW Optimization No HW Optimization Four systems for Comparison Baseline (KD) Baseline (2SKD)

Slide 161

Slide 161 text

Comparisons 56 To examine the individual benefits of SW and HW optimizations 
 No SW Optimization No HW Optimization 
 + SW Optimization No HW Optimization 
 No SW Optimization + HW Optimization Four systems for Comparison Baseline (KD) Baseline (2SKD) Tigris (KD)

Slide 162

Slide 162 text

Comparisons
To examine the individual benefits of SW and HW optimizations
Four systems for comparison:
Baseline (KD):   No SW Optimization, No HW Optimization
Baseline (2SKD): + SW Optimization,  No HW Optimization
Tigris (KD):     No SW Optimization, + HW Optimization
Tigris (2SKD):   + SW Optimization,  + HW Optimization

Slide 163

Slide 163 text

Performance 57

Slide 164

Slide 164 text

Performance
Figure: Speedup (x) and Power Reduction (x) for Baseline (KD), Baseline (2SKD), Our System (KD), Our System (2SKD)

Slide 165

Slide 165 text

Performance
Figure: Speedup (x) and Power Reduction (x) for Baseline (KD), Baseline (2SKD), Our System (KD), Our System (2SKD)
Data labels: 5.9, 20.9, 1.0, 1.1

Slide 166

Slide 166 text

Performance & Power
Figure: Speedup (x) and Power Reduction (x) for Baseline (KD), Baseline (2SKD), Our System (KD), Our System (2SKD)
Data labels: 5.9, 20.9, 17.8, 10.5, 1.0, 1.0

Slide 168

Slide 168 text

Performance & Power
Figure: Speedup (x) and Power Reduction (x) for Baseline (KD), Baseline (2SKD), Our System (KD), Our System (2SKD)
Data labels: 5.9, 20.9, 17.8, 10.5, 1.0, 1.0
20.9X speed-up on KD-Tree search
3.5X end-to-end speed-up

Slide 169

Slide 169 text

Performance & Power
Figure: Speedup (x) and Power Reduction (x) for Baseline (KD), Baseline (2SKD), Our System (KD), Our System (2SKD)
Data labels: 5.9, 20.9, 17.8, 10.5, 1.0, 1.0
20.9X speed-up on KD-Tree search
3.5X end-to-end speed-up
10.5X power reduction on KD-Tree search
3.0X end-to-end power reduction
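
As a back-of-the-envelope check, not a calculation taken from the slides, Amdahl's law relates the two speed-up figures. Assume that only KD-tree search is accelerated and that it accounts for a fraction f of baseline end-to-end time. Under that assumption, the reported 20.9X and 3.5X imply f of roughly 0.75, which sits inside the 52%-85% range shown on the bottleneck slide.

```latex
S_{\text{e2e}} = \frac{1}{(1 - f) + \dfrac{f}{S_{\text{kd}}}}
\quad\Longrightarrow\quad
f = \frac{1 - 1/S_{\text{e2e}}}{1 - 1/S_{\text{kd}}}
  = \frac{1 - 1/3.5}{1 - 1/20.9} \approx 0.75
```

The symbols f, S_kd, and S_e2e are introduced here for illustration. The actual end-to-end figure may depend on which design point was measured.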

Slide 170

Slide 170 text

Summary 59

Slide 171

Slide 171 text

Summary 59 ‣ Point Cloud Registration ▹ A fundamental building block in emerging domains such as Autonomous Driving and Mixed Reality

Slide 172

Slide 172 text

Summary 59 ‣ Point Cloud Registration ▹ A fundamental building block in emerging domains such as Autonomous Driving and Mixed Reality ‣ Our Tigris System ▹ An early step towards efficient Point Cloud Registration

Slide 173

Slide 173 text

Summary
‣ Point Cloud Registration
  ▹ A fundamental building block in emerging domains such as Autonomous Driving and Mixed Reality
‣ Our Tigris System
  ▹ An early step towards efficient Point Cloud Registration
‣ Key Insight
  ▹ Co-designing Software and Hardware to boost efficiency

Slide 174

Slide 174 text

Rethink Systems Stack for Point Cloud Processing 60

Slide 175

Slide 175 text

Rethink Systems Stack for Point Cloud Processing 60 2-D Image / Video

Slide 176

Slide 176 text

Rethink Systems Stack for Point Cloud Processing 60 2-D Image / Video Application Starfish (LiKamWa et al., 2013) Focus (Hsieh et al., 2018) … …

Slide 177

Slide 177 text

Rethink Systems Stack for Point Cloud Processing 60 2-D Image / Video Application Compiler Halide (Ragan-Kelley et al., 2013) Darkroom (Hegarty et al., 2014) Opt (Devito et al., 2018) … … Starfish (LiKamWa et al., 2013) Focus (Hsieh et al., 2018) … …

Slide 178

Slide 178 text

Rethink Systems Stack for Point Cloud Processing 60 2-D Image / Video Application Compiler Architecture Halide (Ragan-Kelley et al., 2013) Darkroom (Hegarty et al., 2014) Opt (Devito et al., 2018) … … Starfish (LiKamWa et al., 2013) Focus (Hsieh et al., 2018) … … IDEAL (Mahmoud et al., 2017) Eyeriss (Chen et al., 2016) … … Euphrates (Zhu et al., 2018)

Slide 179

Slide 179 text

Rethink Systems Stack for Point Cloud Processing 60 2-D Image / Video Application Compiler Architecture Point Cloud Halide (Ragan-Kelley et al., 2013) Darkroom (Hegarty et al., 2014) Opt (Devito et al., 2018) … … Starfish (LiKamWa et al., 2013) Focus (Hsieh et al., 2018) … … ?????? ?????? IDEAL (Mahmoud et al., 2017) Eyeriss (Chen et al., 2016) … … Euphrates (Zhu et al., 2018) ??????

Slide 180

Slide 180 text

Rethink Systems Stack for Point Cloud Processing
2-D Image / Video
  Application: Starfish (LiKamWa et al., 2013), Focus (Hsieh et al., 2018), …
  Compiler: Halide (Ragan-Kelley et al., 2013), Darkroom (Hegarty et al., 2014), Opt (Devito et al., 2018), …
  Architecture: IDEAL (Mahmoud et al., 2017), Eyeriss (Chen et al., 2016), Euphrates (Zhu et al., 2018), …
Point Cloud
  Application: ??????
  Compiler: ??????
  Architecture: ??????
Point Clouds are high-dimensional, sparse and irregular
Computation and memory access patterns are fundamentally different

Slide 181

Slide 181 text

Thank you!

Slide 182

Slide 182 text

Q & A

Slide 183

Slide 183 text

Representative Design Points
Figure: two panels (Translational Error vs. Execution Time; Rotational Error vs. Execution Time) with labeled design points DP1-DP8

Slide 184

Slide 184 text

Performance: absolute time Intel Xeon Silver 4110 core: ~5.0 - 10.0 seconds Our system: ~1.0 - 3.0 seconds 64

Slide 185

Slide 185 text

Speed-up & Power Reduction
Figure: Speedup (X) and Power Reduction (X) for Base-KD, Base-2SKD, Acc-KD, Acc-2SKD; panels: Design Point 7, Design Point 4

Slide 186

Slide 186 text

Approximate Search
In Design Point 4:
▸ Computation Saving: 72.8% fewer distance computations and comparisons
▸ Accuracy Loss:
  ▹ Translational Error: 0
  ▹ Rotational Error: 0.05 °/meter

Slide 187

Slide 187 text

Area Analysis
SRAM: 8.38 mm^2 (53.8%)
Compute Logic: 7.19 mm^2 (46.2%)
Diagram labels: Global Buffer, Search Unit (PE … PE), BE Query Buffer, Recursion Unit (FQ, RS, RN, CD, PI, CL, Bypass, Forward), Query Distribution Network, FE Query Queue, Query Buffer, Point Buffer, Query Stack Buffer, Result Buffer

Slide 188

Slide 188 text

Hardware Architecture
Diagram labels: Global Buffer, Search Unit (PE … PE), BE Query Buffer, Recursion Unit (FQ, RS, RN, CD, PI, CL, Bypass, Forward), Query Distribution Network, FE Query Queue, Query Buffer, Point Buffer, Query Stack Buffer, Result Buffer

Slide 189

Slide 189 text

Memory Traffic
Figure: Memory Traffic Dist. (%) for ACC-2SKD and ACC-KD; legend: FE Query Q, Query Buf, Query Stacks, Res. Buf, BE Query Q, Node Cache, Points Buf
Diagram labels: Global Buffer, Search Unit (PE … PE), BE Query Buffer, Recursion Unit (FQ, RS, RN, CD, PI, CL, Bypass, Forward), Query Distribution Network, FE Query Queue, Query Buffer, Point Buffer, Query Stack Buffer, Result Buffer