Tigris: Architecture and Algorithms for 3D Perception in Point Clouds

Tiancheng Xu*, Boyuan Tian*, with Yuhao Zhu
Department of Computer Science, University of Rochester
http://horizon-lab.org

[Figure: laser scan of Notre Dame Cathedral]
Source: https://www.nationalgeographic.com/news/2015/06/150622-andrew-tallon-notre-dame-cathedral-laser-scan-art-history-medieval-gothic/

Point Cloud 4
‣ Points in 3-D space, i.e., XYZ coordinates
‣ Effective in capturing visual features
‣ 3-D scanners/sensors
  ▹ Scan from multiple perspectives
  ▹ Stitch these point clouds to form a complete point cloud


Point Cloud Registration 6
▸ Aligns two point clouds by calculating a transformation
▸ Transformation = Rotation + Translation
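As a minimal sketch of what "Transformation = Rotation + Translation" means in practice, the snippet below applies p' = R p + t to every point of a small cloud. The helper names (`rotation_z`, `apply_transform`) and the sample values are illustrative assumptions, not taken from the deck; a registration pipeline would estimate R and t rather than being handed them.

```python
import math

def rotation_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the Z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_transform(points, R, t):
    """Apply p' = R p + t to every (x, y, z) point in the cloud."""
    out = []
    for p in points:
        out.append(tuple(
            sum(R[i][j] * p[j] for j in range(3)) + t[i]
            for i in range(3)
        ))
    return out

# Hypothetical source cloud and transformation, for illustration only.
source = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
R = rotation_z(math.pi / 2)   # 90 degree rotation about Z
t = (1.0, 2.0, 3.0)           # translation
aligned = apply_transform(source, R, t)
```

Registration solves the inverse problem: given `source` and a target cloud, find the R and t that bring them into alignment.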

Motivation 7
‣ Point Cloud Registration: a fundamental building block
‣ 3-D Visual Computing
  ▹ 3-D Reconstruction
  ▹ Autonomous Driving
  ▹ Mixed Reality
‣ High Performance Requirement
‣ Limited Energy Budget

Tigris System Overview 8
Towards real-time and energy-efficient Point Cloud Registration
‣ Characterization
‣ SW/HW Co-design
‣ Evaluation

Point Cloud Registration Pipeline 10
Registration = Initial Estimation + Fine-tuning
‣ Stage 1: NE
‣ Stage 2: KPTD
‣ Stage 3: DC
‣ Stage 4: KPCE
‣ Stage 5: CR
‣ Stage 6: RPCE
‣ Stage 7: EM

Example: Normal Estimation (NE) 12
‣ ALG (algorithm choices)
  • DNN
  • SVD
‣ PARAM (parameters)
  • Search radius

Huge Design Space 13
ALG and PARAM choices across the Registration stages (NE, KPTD, DC, KPCE, CR, RPCE, EM), in slide order:
• DNN • SVD • Search radius • SIFT • NARF • Scale range • FPFH • 3DSC • Search radius • Reciprocity • Ratio • Dist • RANSAC • THRESH • NORM-S • PROJECT • Converging criteria • Reciprocity • MetricT • SolverT
Configurable pipeline: https://github.com/horizon-research/PointCloud-pipeline

Design Space Exploration 14
[Figure: scatter plots of design points; axes: Error Rate and Execution Time]
‣ A design point (X, Y) has error rate X and latency Y
‣ Transformation = Rotation + Translation
‣ Error Rate: Rotational & Translational Error, each plotted against Execution Time
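A hedged sketch of how the two error terms could be computed for one estimated transformation: rotational error as the angle of the residual rotation, and translational error as the Euclidean distance between translation vectors. The function names and sample values are assumptions for illustration. A later backup slide reports rotational error in °/meter, which suggests normalizing by distance; that normalization is left out here.

```python
import math

def rotational_error_deg(R_est, R_gt):
    """Angle in degrees of the residual rotation R_est^T * R_gt."""
    # trace(R_est^T R_gt) equals the element-wise product summed over all entries.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def translational_error(t_est, t_gt):
    """Euclidean distance between estimated and reference translations."""
    return math.dist(t_est, t_gt)

# Hypothetical example: the estimate is off by 2 degrees about Z and 0.5 units.
theta = math.radians(2.0)
R_gt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_est = [[math.cos(theta), -math.sin(theta), 0.0],
         [math.sin(theta), math.cos(theta), 0.0],
         [0.0, 0.0, 1.0]]
rot_err = rotational_error_deg(R_est, R_gt)
trans_err = translational_error((1.0, 0.0, 0.0), (1.0, 0.3, 0.4))
```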

Representative Design Points 20
[Figure: representative design points DP1 to DP8 marked on two plots, one with axes Execution Time and Translational Error, one with axes Execution Time and Rotational Error]
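The deck does not say how the representative design points were chosen. One common approach, shown here purely as an assumption, is to keep the points that are not dominated on both error and execution time (a Pareto front). The candidate names and numbers below are made up for illustration and are not the deck's DP1 to DP8.

```python
def pareto_front(points):
    """Return names of design points not dominated by any other point.

    Each point is (name, error, time); lower is better on both axes.
    """
    front = []
    for name, err, time in points:
        dominated = any(
            (e2 <= err and t2 <= time) and (e2 < err or t2 < time)
            for _, e2, t2 in points
        )
        if not dominated:
            front.append(name)
    return front

# Made-up numbers for illustration only; not taken from the deck.
candidates = [
    ("A", 0.10, 9.0),
    ("B", 0.20, 4.0),
    ("C", 0.25, 6.0),   # worse than B on both axes
    ("D", 0.50, 1.5),
    ("E", 0.60, 2.0),   # worse than D on both axes
]
front = pareto_front(candidates)
```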

Characterization 21
‣ Stages profiled: NE, KPTD, DC, KPCE, CR, RPCE, EM
‣ Using the representative design points (DP1-8)
‣ Highlighted operation: KD-Tree Search

Bottleneck: KD-Tree Search 23
[Figure: bar chart of KD-Tree Search / End-to-End Pipeline Latency (%), 0% to 100%, for DP1 to DP8]
‣ Bar labels: 85%, 85%, 80%, 76%, 75%, 74%, 65%, 52%

KD-Tree Search 25
▸ Neighbor Search (NS)
  ▹ Universal in point cloud processing
  ▹ To find the neighbors of one point (the query point) among a set of points (the search points)
▸ KD-Tree Search
  ▹ Standard implementation for NS in point clouds
  ▹ Effectively reduces the computation workload of NS
  ▹ Inefficient on GPUs due to its sequential nature
  ▹ Challenging for hardware acceleration
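The sketch below illustrates why a KD-Tree reduces the neighbor-search workload: a radius search descends the tree and skips any subtree whose splitting plane is farther from the query than the radius. This is a generic textbook KD-Tree written for illustration, not the deck's implementation; the grid data, `Node` class, and `stats` counter are assumptions. Each step depends on the comparison at the current node, which is the sequential behavior the slide refers to.

```python
import math

class Node:
    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build(points, depth=0):
    """Build a KD-Tree by splitting on the median, cycling through X, Y, Z."""
    if not points:
        return None
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build(pts[:mid], depth + 1),
                build(pts[mid + 1:], depth + 1))

def radius_search(node, query, radius, found, stats):
    """Collect points within `radius` of `query`, pruning far subtrees."""
    if node is None:
        return
    stats["visited"] += 1
    if math.dist(node.point, query) <= radius:
        found.append(node.point)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    radius_search(near, query, radius, found, stats)
    # The far side can only hold neighbors if the splitting plane is in range.
    if abs(diff) <= radius:
        radius_search(far, query, radius, found, stats)

# Illustrative data: an 8 x 8 x 8 integer grid (512 search points).
search_points = [(float(x), float(y), float(z))
                 for x in range(8) for y in range(8) for z in range(8)]
query = (2.2, 3.1, 0.9)
tree = build(search_points)
found, stats = [], {"visited": 0}
radius_search(tree, query, 1.0, found, stats)

# Brute-force reference: one distance check per search point.
brute = [p for p in search_points if math.dist(p, query) <= 1.0]
```

Comparing `stats["visited"]` with `len(search_points)` shows how many distance checks the tree avoids for this query.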

Redundancy vs. Parallelism 28
‣ Unordered Set
  ▹ Huge Parallelism, Huge Redundancy
‣ Canonical KD-Tree
  ▹ No Redundancy, No Parallelism
[Figure: an unordered set and a canonical KD-Tree, with the current node of the search marked]

Two-Stage KD-Tree 29
New data structure
▹ Balances parallelism and redundancy
▹ Top-tree: same first few levels as the canonical KD-Tree, traversed sequentially
▹ Leaf nodes: each remaining sub-tree of the canonical KD-Tree becomes an unordered set (the children of a leaf node), searched in parallel
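A minimal sketch of the two-stage idea, under stated assumptions: the top of the tree is built like an ordinary KD-Tree down to `top_height` levels, and whatever points remain below that level are kept as an unordered list. The dictionary layout, parameter names, and test grid are hypothetical. The exhaustive leaf scan runs as a plain loop here, but every distance check in it is independent, which is what makes it parallelizable.

```python
import math

def build_two_stage(points, top_height, depth=0):
    """Top-tree of `top_height` levels; below it, points stay unordered."""
    if depth == top_height or len(points) <= 1:
        return {"leaf": True, "points": list(points)}
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"leaf": False, "axis": axis, "point": pts[mid],
            "left": build_two_stage(pts[:mid], top_height, depth + 1),
            "right": build_two_stage(pts[mid + 1:], top_height, depth + 1)}

def search(node, query, radius, found, stats):
    """Radius search: sequential in the top-tree, exhaustive in leaf nodes."""
    if node["leaf"]:
        # Exhaustive scan of the leaf's children; checks are independent.
        stats["leaf_points_scanned"] += len(node["points"])
        found.extend(p for p in node["points"] if math.dist(p, query) <= radius)
        return
    stats["top_nodes_visited"] += 1
    if math.dist(node["point"], query) <= radius:
        found.append(node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    search(node[near], query, radius, found, stats)
    if abs(diff) <= radius:
        search(node[far], query, radius, found, stats)

def run(points, query, radius, top_height):
    """Build, search, and return (sorted results, visit statistics)."""
    found = []
    stats = {"top_nodes_visited": 0, "leaf_points_scanned": 0}
    search(build_two_stage(points, top_height), query, radius, found, stats)
    return sorted(found), stats

# Illustrative data: an 8 x 8 x 8 integer grid (512 points).
grid = [(float(x), float(y), float(z))
        for x in range(8) for y in range(8) for z in range(8)]
query = (2.2, 3.1, 0.9)
exact = sorted(p for p in grid if math.dist(p, query) <= 1.0)
results = {h: run(grid, query, 1.0, h) for h in (0, 2, 4, 12)}
```

With `top_height=0` the structure degenerates into one unordered set, and with a large `top_height` it approaches a canonical KD-Tree, so the parameter moves between the two extremes from the previous slide.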

Quantifying Redundancy 31
[Figure: Two-Stage KD-Tree compared with Canonical KD-Tree]
‣ 35X more points need to be visited

Approximate Search 33
New search algorithm
▹ Mitigates redundancy introduced by the new data structure
▹ Close queries are likely to share similar search results
▹ Leader: search in children of leaf nodes as usual
▹ Follower: search in neighbors of a leader
▹ Efficiently mitigates search redundancy
[Figure: query Qi (leader) with its neighbors N within radius R, and a nearby query Qj (follower) with its own radius R]
‣ Total savings of node visits: 72.8%
‣ Negligible effect on registration accuracy
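The leader/follower scheme can be sketched as follows, with several assumptions made for illustration: a query becomes a follower when it lies within `leader_threshold` of an existing leader, the first matching leader is used, and the counter tallies every distance computation including leader lookups. None of these details, nor the sample points, come from the deck, and the example is not meant to reproduce the 72.8% figure.

```python
import math

def approximate_radius_search(queries, leaf_points, radius, leader_threshold):
    """Leaders scan all leaf points; followers scan only a leader's results."""
    leaders = []                 # (leader query, leader's neighbor list)
    results, distance_checks = [], 0
    for q in queries:
        leader_hits = None
        for leader_q, hits in leaders:
            distance_checks += 1
            if math.dist(q, leader_q) <= leader_threshold:
                leader_hits = hits
                break
        candidates = leaf_points if leader_hits is None else leader_hits
        distance_checks += len(candidates)
        found = [p for p in candidates if math.dist(p, q) <= radius]
        if leader_hits is None:
            leaders.append((q, found))   # this query becomes a leader
        results.append(found)
    return results, distance_checks

# Illustrative data: 20 points on a line, two close queries and one far query.
leaf_points = [(float(x), 0.0, 0.0) for x in range(20)]
queries = [(5.0, 0.0, 0.0), (5.1, 0.0, 0.0), (12.0, 0.0, 0.0)]
results, distance_checks = approximate_radius_search(
    queries, leaf_points, radius=2.0, leader_threshold=0.5)

# Exact reference: every query scans every point.
exact = [[p for p in leaf_points if math.dist(p, q) <= 2.0] for q in queries]
exact_checks = len(queries) * len(leaf_points)
```

A follower can miss a true neighbor that falls outside its leader's result set, which is why the search is approximate; in this small example the follower's result happens to match the exact one.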

Software-Hardware Co-design for Neighbor Search 45
New data structure + new search algorithm:
▹ Expose huge parallelism with negligible search redundancy
Software-Hardware co-design:
▹ Sequential traversal in top tree
▹ Parallel search in leaf nodes

Hardware Architecture 46
Decoupled architecture
▹ Front-end for tree traversal: sequential traversal in the top tree
▹ Back-end for parallel search: parallel search in leaf nodes
Front-end
▹ Exploits QLP + limited NLP
▹ QLP: Query-Level Parallelism
▹ NLP: Node-Level Parallelism
[Figure: Front-end (RU, RU, RU, …, Front-end Buffer), Query Distribution Network, Back-end, Global Buffer]

Recursion Unit Hardware 47
Originally no NLP can be exploited
▹ Tree traversal in the top tree is still sequential
Limited NLP exploited by pipelining different nodes
▹ Two optimizations to avoid data dependency and pipeline stall
[Figure: Recursion Unit microarchitecture]

Hardware Architecture 49
Decoupled architecture
▹ Front-end for tree traversal
▹ Back-end for parallel search
Front-end
▹ Exploits QLP + limited NLP
Back-end
▹ Exploits QLP + NLP
▹ Query distribution: leaf node ID % num. of SUs
[Figure: Front-end (RU, RU, RU, …, Front-end Buffer), Query Distribution Network, Back-end (SU, SU, SU, …, each with a Buf), Global Buffer]
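The "leaf node ID % num. of SUs" label can be read as a routing rule for the Query Distribution Network. The sketch below assumes that reading: each query leaving the front-end carries the ID of the leaf node it reached, and that ID modulo the number of Search Units picks the destination. The function name, the `(query_id, leaf_id)` tuples, and the choice of four SUs are hypothetical.

```python
from collections import defaultdict

def distribute(queries, num_search_units):
    """Route each (query_id, leaf_id) pair to SU number leaf_id % num_search_units."""
    per_su = defaultdict(list)
    for query_id, leaf_id in queries:
        per_su[leaf_id % num_search_units].append((query_id, leaf_id))
    return dict(per_su)

# Hypothetical queries leaving the front-end, tagged with the leaf they reached.
pending = [("q0", 5), ("q1", 9), ("q2", 5), ("q3", 2), ("q4", 7)]
assignment = distribute(pending, num_search_units=4)
```

Under this rule all queries that reach the same leaf node land on the same Search Unit, so that unit can reuse the leaf's points across them.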

Search Unit Hardware 51
[Figure: a Search Unit (SU) with a buffer (Buf) holding queries (Q) and a row of PEs; the children (N) of a leaf node stream through the PEs]
Query-Level Parallelism: execution model exploiting QLP
▸ Multiple Query Multiple LeafNodes
  ▹ High PE utilization; high data bandwidth
▸ Multiple Query Single LeafNode
  ▹ Low data bandwidth; low PE utilization
Node-Level Parallelism: data-flow exploiting NLP
▸ Children in a leaf node are searched in parallel
▸ 1-D systolic array
▸ Query-stationary data-flow

Tigris System Overview 53
Towards real-time and energy-efficient Point Cloud Registration

Metrics & Dataset 54
‣ Speed-up & Power Reduction
  • Performance Bottleneck: KD-Tree Search
  • End-to-end Registration Pipeline
‣ Dataset: Self-Driving Benchmark (KITTI)
  • Each Point Cloud frame: ~130,000 Points

Hardware 55
‣ KD-Tree Search
  • Baseline: GPU (RTX 2080 Ti)
  • Our System:
    ✓ Performance: cycle-accurate simulator parameterized by an RTL model
    ✓ Power & Area: post-layout simulation in a 16nm process node
‣ All Other Parts: CPU (Intel Xeon Silver 4110)

Comparisons 56
To examine the individual benefits of SW and HW optimizations
Four systems for comparison:
‣ Baseline (KD): no SW optimization, no HW optimization
‣ Baseline (2SKD): + SW optimization, no HW optimization
‣ Tigris (KD): no SW optimization, + HW optimization
‣ Tigris (2SKD): + SW optimization, + HW optimization

Performance & Power 57
[Figure: bar charts of Speedup (x) and Power Reduction (x) on KD-Tree search for the four systems]
‣ Speedup (x): Baseline (KD) 1.0, Baseline (2SKD) 1.1, Our System (KD) 5.9, Our System (2SKD) 20.9
‣ Power Reduction (x): Baseline (KD) 1.0, Baseline (2SKD) 1.0, Our System (KD) 17.8, Our System (2SKD) 10.5
‣ 20.9X speed-up on KD-Tree search; 3.5X end-to-end speed-up
‣ 10.5X power reduction on KD-Tree search; 3.0X end-to-end power reduction
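To relate the kernel speed-up to the end-to-end speed-up, the sketch below applies Amdahl's law to the KD-Tree Search latency shares shown on the bottleneck slide. This is a back-of-the-envelope consistency check under the assumption that only the KD-Tree Search portion is accelerated; it is not how the deck derived its 3.5X figure, and the function name is an assumption.

```python
def end_to_end_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: only `kernel_fraction` of the runtime is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# Latency shares taken from the bottleneck slide's bar labels (lowest, two
# middle values, highest); 20.9 is the KD-Tree search speed-up reported above.
for fraction in (0.52, 0.65, 0.75, 0.85):
    print(f"{fraction:.0%} in KD-Tree search -> "
          f"{end_to_end_speedup(fraction, 20.9):.2f}x end to end")
```

With a 75% share, the formula gives 1 / (0.25 + 0.75 / 20.9), which is roughly 3.5, in line with the reported end-to-end speed-up.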

Summary 59
‣ Point Cloud Registration
  ▹ A fundamental building block in emerging domains such as Autonomous Driving and Mixed Reality
‣ Our Tigris System
  ▹ An early step towards efficient Point Cloud Registration
‣ Key Insight
  ▹ Co-designing Software and Hardware to boost efficiency

Rethink Systems Stack for Point Cloud Processing 60
2-D Image / Video
‣ Application: Starfish (LiKamWa et al., 2013), Focus (Hsieh et al., 2018), …
‣ Compiler: Halide (Ragan-Kelley et al., 2013), Darkroom (Hegarty et al., 2014), Opt (Devito et al., 2018), …
‣ Architecture: IDEAL (Mahmoud et al., 2017), Eyeriss (Chen et al., 2016), Euphrates (Zhu et al., 2018), …
Point Cloud
‣ Application: ??????
‣ Compiler: ??????
‣ Architecture: ??????
Point clouds are high-dimensional, sparse and irregular
Computation / memory access patterns are fundamentally different

Thank you!


Performance: absolute time 64
‣ Intel Xeon Silver 4110 core: ~5.0 - 10.0 seconds
‣ Our system: ~1.0 - 3.0 seconds

Speed-up & Power Reduction 65
[Figure: two panels, Design Point 7 and Design Point 4, each showing Speedup (X) and Power Reduction (X) for Base-KD, Base-2SKD, Acc-KD and Acc-2SKD]

Approximate Search 66
In Design Point 4:
▸ Computation Saving: 72.8% less distance computation & comparison
▸ Accuracy Loss:
  ▹ Translational Error: 0
  ▹ Rotational Error: 0.05 °/meter

Area Analysis 67
‣ SRAM: 8.38 mm^2 (53.8%)
‣ Compute Logic: 7.19 mm^2 (46.2%)
[Figure: hardware architecture block diagram]

Hardware Architecture 68
[Figure: block diagram. Labeled components: Global Buffer; FE Query Queue; Recursion Units (blocks FQ, RS, RN, CD, PI, CL, with Bypass and Forward paths); Query Distribution Network; BE Query Buffers; Search Units, each with multiple PEs; Query Buffer; Point Buffer; Query Stack Buffer; Result Buffer]

Memory Traffic 69
[Figure: Memory Traffic Dist. (%), 0 to 100, for ACC-2SKD and ACC-KD, broken down by FE Query Q, Query Buf, Query Stacks, Res. Buf, BE Query Q, Node Cache, Points Buf; shown next to the hardware architecture block diagram]
