Slide 1

Slide 1 text

1 Mesorasi: Architecture Support for Point Cloud Analytics via Delayed-Aggregation
Yu Feng, Boyuan Tian, Tiancheng Xu, with Paul Whatmough (Arm Research) and Yuhao Zhu
Department of Computer Science, University of Rochester
http://horizon-lab.org
https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds

Slide 2

Slide 2 text

2

Slide 3

Slide 3 text

2 ✘

Slide 4

Slide 4 text

3

Slide 5

Slide 5 text

4

Slide 6

Slide 6 text

4 Autonomous Driving Robotics Mixed Reality Drone Navigation

Slide 7

Slide 7 text

Classification Segmentation Detection SLAM Deep Learning on Point Clouds 5

Slide 9

Slide 9 text

Classification Segmentation Detection SLAM Deep Learning on Point Clouds 5 
 Performance
 Energy-efficiency


Slide 10

Slide 10 text

Classification Segmentation Detection SLAM Deep Learning on Point Clouds 5 Can we use existing neural network accelerators on point cloud workloads?

Slide 11

Slide 11 text

6 Aggregation P3 P4 P12 P13 Neighbor Search P0 P2 P1 P6 P5 P7 P8 P9 P10 P11 Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU Key Operators P1: {P2, P3, ….} P3: {P8, P7, ….} P6: {P8, P5, ….} … … P6: … P1:

Slide 12

Slide 12 text

P1: {P2, P3, ….} P3: {P8, P7, ….} P6: {P8, P5, ….} … … P6: … P1: 7 Aggregation P3 P4 P12 P13 Neighbor Search P0 P2 P1 P6 P5 P7 P8 P9 P10 P11 Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU Key Operators

Slide 13

Slide 13 text

8 Neighbor Search in Point Clouds

Slide 14

Slide 14 text

8 Neighbor Search in Point Clouds P1 P8 P0 P2 P3 P6 P5 P7 P9 P10 P12 P13 P11 P4

Slide 16

Slide 16 text

8 Neighbor Search in Point Clouds N P1 P8 P0 P2 P3 P6 P5 P7 P9 P10 P12 P13 P11 P4

Slide 17

Slide 17 text

8 Neighbor Search in Point Clouds N P1: { P0, P2, P3 P4, P5, P6 } P8: { P5, P6, P7 P9, P10, P11 } P1 P8 P0 P2 P3 P6 P5 P7 P9 P10 P12 P13 P11 P4

Slide 18

Slide 18 text

8 Neighbor Search in Point Clouds N P1: { P0, P2, P3 P4, P5, P6 } P8: { P5, P6, P7 P9, P10, P11 } P1 P8 P0 P2 P3 P6 P5 P7 P9 P10 P12 P13 P11 P4 Sorting, KD-Tree Irregular Memory Access
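The neighbor search sketched on this slide can be illustrated with a brute-force ball query (a minimal sketch; `ball_query`, its parameters, and the sample points are illustrative, and real pipelines use KD-trees or sorting structures to cope with the irregular memory accesses):

```python
import numpy as np

def ball_query(points, centroids, radius, k):
    """For each centroid, return indices of up to k points within `radius`.

    Brute-force sketch of the neighbor-search step; production code uses
    KD-trees or GPU grid structures instead of scanning every point.
    """
    neighbors = []
    for c in centroids:
        d = np.linalg.norm(points - c, axis=1)   # distance to every point
        idx = np.nonzero(d <= radius)[0][:k]     # keep at most k in-range points
        neighbors.append(idx)
    return neighbors

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
nbrs = ball_query(pts, centroids=pts[:1], radius=0.5, k=8)
```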

Slide 19

Slide 19 text

Neighbor Search in conventional DNNs 9 [Figure: a regular pixel grid vs. irregularly scattered points] Pixels: inherently regular. Points: irregularly scattered. Regular DNNs don’t need explicit neighbor search.

Slide 20

Slide 20 text

10 Aggregation P3 P4 P12 P13 Neighbor Search P0 P2 P1 P6 P5 P7 P8 P9 P10 P11 Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU Key Operators P1: {P2, P3, ….} P3: {P8, P7, ….} P6: {P8, P5, ….} … … P6: … P1:

Slide 21

Slide 21 text

11 Aggregation P0 P2 P3 P1 P6 P5 P7 P8 P9 P10 P12 P13 P11 N P1: { P0, P2, P3, P4, P5, P6 } P8: { P5, P6, P7, P9, P10, P11 } P4 N A Points: feature vectors Neighbor Index Table

Slide 22

Slide 22 text

11 Aggregation P0 P2 P3 P1 P6 P5 P7 P8 P9 P10 P12 P13 P11 N P1: { P0, P2, P3, P4, P5, P6 } P8: { P5, P6, P7, P9, P10, P11 } P4 N A P1: Feature Matrix Points: feature vectors Neighbor Index Table

Slide 23

Slide 23 text

11 Aggregation P0 P2 P3 P1 P6 P5 P7 P8 P9 P10 P12 P13 P11 N P1: { P0, P2, P3, P4, P5, P6 } P8: { P5, P6, P7, P9, P10, P11 } P4 N A P0 P1: Feature Matrix Points: feature vectors Neighbor Index Table

Slide 24

Slide 24 text

11 Aggregation P0 P2 P3 P1 P6 P5 P7 P8 P9 P10 P12 P13 P11 N P1: { P0, P2, P3, P4, P5, P6 } P8: { P5, P6, P7, P9, P10, P11 } P4 N A P0 - P1 P1: Feature Matrix Points: feature vectors Neighbor Index Table

Slide 25

Slide 25 text

11 Aggregation P0 P2 P3 P1 P6 P5 P7 P8 P9 P10 P12 P13 P11 N P1: { P0, P2, P3, P4, P5, P6 } P8: { P5, P6, P7, P9, P10, P11 } P4 N A P0 - P1 P2 - P1 P3 - P1 P4 - P1 P5 - P1 P6 - P1 P1: Feature Matrix Points: feature vectors Neighbor Index Table

Slide 26

Slide 26 text

11 Aggregation P0 P2 P3 P1 P6 P5 P7 P8 P9 P10 P12 P13 P11 N P1: { P0, P2, P3, P4, P5, P6 } P8: { P5, P6, P7, P9, P10, P11 } P4 N A P0 - P1 P2 - P1 P3 - P1 P4 - P1 P5 - P1 P6 - P1 P1: Feature Matrix P5 - P8 P6 - P8 P7 - P8 P9 - P8 P10 - P8 P11 - P8 P8: Feature Matrix Points: feature vectors Neighbor Index Table
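The aggregation step built up on these slides (gather each centroid’s neighbor features and subtract the centroid’s own feature vector, yielding one N x M neighbor feature matrix per centroid) can be sketched as follows; `aggregate` and the toy feature table are illustrative, not the paper’s code:

```python
import numpy as np

def aggregate(features, neighbor_table):
    """Gather each centroid's neighbor feature vectors and subtract the
    centroid's own vector, producing one N x M Neighbor Feature Matrix
    (NFM) per centroid, as in the slides' P0-P1, P2-P1, ... example."""
    nfms = {}
    for centroid, nbr_idx in neighbor_table.items():
        nfms[centroid] = features[nbr_idx] - features[centroid]
    return nfms

feats = np.arange(12.0).reshape(6, 2)   # 6 points, M = 2 features each
table = {1: [0, 2, 3], 5: [3, 4]}       # toy neighbor index table
nfms = aggregate(feats, table)
```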

Slide 27

Slide 27 text

P1: {P2, P3, ….} P3: {P8, P7, ….} P6: {P8, P5, ….} … … P6: … P1: 12 Aggregation P3 P4 P12 P13 Neighbor Search P0 P2 P1 P6 P5 P7 P8 P9 P10 P11 Key Operators Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU

Slide 28

Slide 28 text

13 Feature Computation MatMul ReLU MatMul ReLU MLP layers P2 - P1 P3 - P1 P4 - P1 P5 - P1 P6 - P1 P1: Feature Matrix P5 - P8 P6 - P8 P7 - P8 P9 - P8 P10 - P8 P11 - P8 P8: Feature Matrix P0 - P1 P0 - P1

Slide 29

Slide 29 text

13 Feature Computation MatMul ReLU MatMul ReLU MLP layers P2 - P1 P3 - P1 P4 - P1 P5 - P1 P6 - P1 P1: Feature Matrix P5 - P8 P6 - P8 P7 - P8 P9 - P8 P10 - P8 P11 - P8 P8: Feature Matrix P0’ P0 - P1

Slide 30

Slide 30 text

14 Feature Computation MatMul ReLU MatMul ReLU P1: Feature Matrix P5 - P8 P6 - P8 P7 - P8 P9 - P8 P10 - P8 P11 - P8 P8: Feature Matrix P2 - P1 P3 - P1 P4 - P1 P5 - P1 P6 - P1 P0 - P1 P0’ P2’ P3’ P4’ P5’ P6’ P1’: New Feature Matrix P5’ P6’ P7’ P9’ P10’ P11’ P8’: New Feature Matrix

Slide 31

Slide 31 text

14 Feature Computation MatMul ReLU MatMul ReLU P1: Feature Matrix P5 - P8 P6 - P8 P7 - P8 P9 - P8 P10 - P8 P11 - P8 P8: Feature Matrix P2 - P1 P3 - P1 P4 - P1 P5 - P1 P6 - P1 P0 - P1 P0’ P2’ P3’ P4’ P5’ P6’ P1’: New Feature Matrix P5’ P6’ P7’ P9’ P10’ P11’ P8’: New Feature Matrix Reduction P1’ P1’: New Feature Vector P8’ P8’: New Feature Vector
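The feature-computation step on this slide (a shared MLP applied row-wise to the neighbor feature matrix, followed by a max reduction over neighbors) can be sketched as below; the weights and sizes are arbitrary placeholders:

```python
import numpy as np

def feature_compute(nfm, weights):
    """Apply a shared MLP (MatMul + ReLU per layer) to every row of the
    Neighbor Feature Matrix, then max-reduce over the N neighbors to get
    one new feature vector for the centroid."""
    x = nfm
    for w in weights:
        x = np.maximum(x @ w, 0.0)   # MatMul followed by ReLU
    return x.max(axis=0)             # reduction (max pooling) over neighbors

rng = np.random.default_rng(0)
nfm = rng.standard_normal((6, 4))    # N = 6 neighbors, M = 4 features
ws = [rng.standard_normal((4, 8)), rng.standard_normal((8, 16))]
new_feat = feature_compute(nfm, ws)  # M' = 16-dimension new feature vector
```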

Slide 32

Slide 32 text

Point Cloud Network Layer 15 P1’ P5’ M’-dimension features Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool Aggregation Neighbor Feature Matrix (NFM) Each centroid point has a NFM N x M P1: P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 N neighbors (N == 2) Neighbor Search Neighbor Index Table P1: {P2, P3} P5: {P3, P6} P1, P5: centroid points M-dimension feature vectors P1 P4 P6 P2 P3 P5 Points

Slide 33

Slide 33 text

Can We Use Existing DNN Accelerators? 16

Slide 34

Slide 34 text

Can We Use Existing DNN Accelerators? 16 ▸ No. They are not enough.

Slide 35

Slide 35 text

Can We Use Existing DNN Accelerators? 16 ▸ No. They are not enough. ▹ Neighbor Search

Slide 36

Slide 36 text

Can We Use Existing DNN Accelerators? 16 ▸ No. They are not enough. ▹ Neighbor Search ▹ Aggregation

Slide 37

Slide 37 text

Can We Use Existing DNN Accelerators? 16 ▸ No. They are not enough. ▹ Neighbor Search ▹ Aggregation [Bar chart: Characterization of Point Cloud Networks, showing % execution time of Neighbor Search, Aggregation, Feature Computation, and Others for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), and F-PointNet]

Slide 39

Slide 39 text

Can We Use Existing DNN Accelerators? 16 ▸ No. They are not enough. ▹ Neighbor Search ▹ Aggregation [Bar chart: Characterization of Point Cloud Networks, showing % execution time of Neighbor Search, Aggregation, Feature Computation, and Others for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), and F-PointNet] ▹ Slow on high-end mobile GPUs: PointNet++: 132 ms, F-PointNet: 141 ms, DGCNN: 5200 ms

Slide 40

Slide 40 text

Point Cloud Network Layer 17 P1’ P5’ M’-dimension features Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool Aggregation Neighbor Feature Matrix (NFM) Each centroid point has a NFM N x M P1: P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 N neighbors (N == 2) Neighbor Search Neighbor Index Table P1: {P2, P3} P5: {P3, P6} P1, P5: centroid points M-dimension feature vectors P1 P4 P6 P2 P3 P5 Points

Slide 41

Slide 41 text

Neighbor Feature Matrix (NFM) N x M P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 P1: Each centroid point has a NFM P1’ P5’ M’-dimension features Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool Optimization 18

Slide 42

Slide 42 text

Neighbor Feature Matrix (NFM) N x M P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 P1: Each centroid point has a NFM P1’ P5’ M’-dimension features Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool Optimization 18 N x M P3 - P1 Mat Mul ( Kernel weights , )

Slide 44

Slide 44 text

Neighbor Feature Matrix (NFM) N x M P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 P1: Each centroid point has a NFM P1’ P5’ M’-dimension features Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool = Optimization 18 N x M P3 - P1 Mat Mul ( Kernel weights , ) Mat Mul ( Kernel weights , ) N x M P3 Mat Mul ( Kernel weights , ) N x M P1 -

Slide 46

Slide 46 text

Neighbor Feature Matrix (NFM) N x M P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 P1: Each centroid point has a NFM P1’ P5’ M’-dimension features Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool = Optimization 18 N x M P3 - P1 Mat Mul ( Kernel weights , ) Mat Mul ( Kernel weights , ) N x M P3 Mat Mul ( Kernel weights , ) N x M P1 - Benefit:

Slide 47

Slide 47 text

Neighbor Feature Matrix (NFM) N x M P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 P1: Each centroid point has a NFM P1’ P5’ M’-dimension features Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool = Optimization 18 N x M P3 - P1 Mat Mul ( Kernel weights , ) Mat Mul ( Kernel weights , ) N x M P3 Mat Mul ( Kernel weights , ) N x M P1 - ▸ Effectively introduces reuse opportunities
 Benefit:

Slide 48

Slide 48 text

Neighbor Feature Matrix (NFM) N x M P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 P1: Each centroid point has a NFM P1’ P5’ M’-dimension features Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool = Optimization 18 N x M P3 - P1 Mat Mul ( Kernel weights , ) Mat Mul ( Kernel weights , ) N x M P3 Mat Mul ( Kernel weights , ) N x M P1 - ▸ Effectively introduces reuse opportunities
▸ Reduces MAC (multiply-accumulate) operations by up to 90% Benefit: each point is reused ~30 times

Slide 49

Slide 49 text

Neighbor Feature Matrix (NFM) N x M P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 P1: Each centroid point has a NFM P1’ P5’ M’-dimension features Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool = Optimization 18 N x M P3 - P1 Mat Mul ( Kernel weights , ) Mat Mul ( Kernel weights , ) N x M P3 Mat Mul ( Kernel weights , ) N x M P1 - ▸ Effectively introduces reuse opportunities
▸ Reduces MAC (multiply-accumulate) operations by up to 90% ▸ Dependency elimination Benefit: each point is reused ~30 times

Slide 50

Slide 50 text

Neighbor Feature Matrix (NFM) N x M P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 P1: Each centroid point has a NFM P1’ P5’ M’-dimension features Feature Computation Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool Optimization 18 N x M P3 - P1 Mat Mul ( Kernel weights , ) Mat Mul ( Kernel weights , ) N x M P3 Mat Mul ( Kernel weights , ) N x M P1 - ReLU ( ) ReLU ( ) ReLU ( ) ≠ ▸ Effectively introduces reuse opportunities
▸ Reduces MAC (multiply-accumulate) operations by up to 90% ▸ Dependency elimination Benefit: each point is reused ~30 times
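The optimization on these slides rests on the linearity of matrix multiplication, and the "≠" on this slide is exactly where that linearity breaks: ReLU does not distribute over the subtraction. A minimal numpy check (values arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 4))
p1, p3 = rng.standard_normal(4), rng.standard_normal(4)

# MatMul is linear: transforming the difference equals the difference of
# the transformed points, so each point's MatMul result can be reused.
exact = W @ (p3 - p1)
delayed = W @ p3 - W @ p1
assert np.allclose(exact, delayed)

# ReLU is nonlinear, so pushing the subtraction past ReLU only
# approximates the original computation (hence the accuracy study later).
relu = lambda x: np.maximum(x, 0.0)
lhs = relu(W @ (p3 - p1))
rhs = relu(W @ p3) - relu(W @ p1)
```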

Slide 51

Slide 51 text

Approximation 19 Neighbor Feature Matrix (NFM) MLP + Reduction Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool N x M P3 - P1 Mat Mul ( Kernel weights , ) Mat Mul ( Kernel weights , ) N x M P3 Mat Mul ( Kernel weights , ) N x M P1 - ReLU ( ) ReLU ( ) ReLU ( ) ≈ P1’ P5’ M’-dimension features Each centroid point has a NFM N x M P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 P1:

Slide 52

Slide 52 text

Approximation 19 Neighbor Feature Matrix (NFM) MLP + Reduction Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP + Reduction Max Pool N x M P3 - P1 Mat Mul ( Kernel weights , ) Mat Mul ( Kernel weights , ) N x M P3 Mat Mul ( Kernel weights , ) N x M P1 - ReLU ( ) ReLU ( ) ReLU ( ) ≈ P1’ P5’ M’-dimension features Each centroid point has a NFM N x M P3 - P1 P2 - P1 N x M P5: P6 - P5 P3 - P5 P1: The accuracy change 
 ranges from -0.9% to +1.2%

Slide 53

Slide 53 text

Delayed Aggregation 20 Neighbor Search Neighbor Index Table Feature Compute Point Feature Table Aggregation Neighbor Feature Matrix (NFM) } Reduction P1’ P5’ M’-dimension features Mat Mul 1 ReLU Mat Mul 2 ReLU Mat Mul 3 ReLU MLP M-dimension feature vectors P1 P4 P6 P2 P3 P5 Points
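Putting the pieces together, a delayed-aggregation layer can be sketched as below: the MLP runs once per input point, and aggregation plus reduction happen afterwards, independent of neighbor search (illustrative code, not the paper’s implementation):

```python
import numpy as np

def delayed_aggregation_layer(feats, neighbor_table, weights):
    """Delayed aggregation (sketch): run the shared MLP once per input
    point, then aggregate the transformed features per centroid with a
    max reduction; neighbor search can proceed in parallel."""
    x = feats
    for w in weights:                  # feature computation on raw points:
        x = np.maximum(x @ w, 0.0)     # each point is transformed exactly once
    out = {}
    for centroid, nbr_idx in neighbor_table.items():
        out[centroid] = x[nbr_idx].max(axis=0)   # aggregation + reduction
    return out

rng = np.random.default_rng(2)
feats = rng.standard_normal((6, 4))    # 6 points, M = 4 features
table = {1: [0, 2, 3], 5: [3, 4]}      # toy neighbor index table
ws = [rng.standard_normal((4, 8))]
new_feats = delayed_aggregation_layer(feats, table, ws)
```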

Slide 54

Slide 54 text

Bottleneck Shift 21 Time (msec.) 0 6 12 18 24 30 Original Delayed Aggr. Feature Computation Aggregation Neighbor Search Execution Time Distribution of PointNet++

Slide 57

Slide 57 text

Mesorasi: Point Cloud Acceleration Framework 22 Algorithm: Delayed Aggregation; Hardware: Aggregation Acceleration

Slide 58

Slide 58 text

Aggregation Operation 23 Neighbor Feature Matrix Point Feature Table N neighbors P1: {P2, P3, P11, P23, P31, P39, …} P3: {P5, P7, P16, P19, P38, P41, …} … Neighbor Index Table P99:{P16, P71, P96, P119, P128, P142, …} P11 P31 P23 P3 P39 P2 … …

Slide 59

Slide 59 text

Aggregation Operation 23 Neighbor Feature Matrix Point Feature Table N neighbors P1: {P2, P3, P11, P23, P31, P39, …} P3: {P5, P7, P16, P19, P38, P41, …} … Neighbor Index Table P99:{P16, P71, P96, P119, P128, P142, …} P11 P31 P23 P3 P39 P2 … … P1: {P2, P3, P11, P23, P31, P39, …}

Slide 63

Slide 63 text

Aggregation Unit 24 1: {2, 17, 9, …} 3: {8, 17, 6, …} 6: {18, 25, 34, …} …. 900: {832, 987, …} Point Feature Table (PFT) Neighbor Index Table

Slide 64

Slide 64 text

Aggregation Unit 24 1: {2, 17, 9, …} 3: {8, 17, 6, …} 6: {18, 25, 34, …} …. 900: {832, 987, …} Address Generation Point Feature Table (PFT) Neighbor Index Table

Slide 65

Slide 65 text

Aggregation Unit 24 1: {2, 17, 9, …} 3: {8, 17, 6, …} 6: {18, 25, 34, …} …. 900: {832, 987, …} Address Generation Bank 1 Bank 2 Bank B … Bank 3 … Point Feature Table (PFT) Neighbor Index Table

Slide 66

Slide 66 text

Aggregation Unit 24 1: {2, 17, 9, …} 3: {8, 17, 6, …} 6: {18, 25, 34, …} …. 900: {832, 987, …} Address Generation Bank 1 Bank 2 Bank B … Bank 3 … B-ported, B-banked; No Crossbar Point Feature Table (PFT) Neighbor Index Table

Slide 67

Slide 67 text

Aggregation Unit 25 1: {2, 17, 9, …} 3: {8, 17, 6, …} 6: {18, 25, 34, …} …. 900: {832, 987, …} Address Generation Bank 1 Bank 2 Bank B … Bank 3 … Neighbor Index Table Point Feature Table (PFT) Reduction (Max) … Shift Registers B-ported, B-banked; No Crossbar

Slide 69

Slide 69 text

Aggregation Unit 25 1: {2, 17, 9, …} 3: {8, 17, 6, …} 6: {18, 25, 34, …} …. 900: {832, 987, …} Address Generation Bank 1 Bank 2 Bank B … Bank 3 … Neighbor Index Table Point Feature Table (PFT) Reduction (Max) … Shift Registers MUX … (Store centroid’s feature vector) B-ported, B-banked; No Crossbar

Slide 70

Slide 70 text

Aggregation Unit 25 1: {2, 17, 9, …} 3: {8, 17, 6, …} 6: {18, 25, 34, …} …. 900: {832, 987, …} Address Generation Bank 1 Bank 2 Bank B … Bank 3 … Neighbor Index Table Point Feature Table (PFT) Reduction (Max) Sub Global Buffer … Shift Registers MUX … (Store centroid’s feature vector) B-ported, B-banked; No Crossbar
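As a rough functional model of the B-banked Point Feature Table (an assumption for illustration, not the paper’s RTL), accesses to distinct banks complete in one cycle while same-bank accesses serialize, which is why a conflict-aware layout matters even without a crossbar:

```python
from collections import Counter

def banked_gather_cycles(indices, num_banks):
    """Functional model (illustrative assumption): count the cycles a
    B-banked point-feature buffer needs to serve one group of neighbor
    index reads, where accesses to the same bank must serialize."""
    per_bank = Counter(i % num_banks for i in indices)  # bank = index mod B
    return max(per_bank.values())   # a conflict-free group takes 1 cycle

# 4 banks: indices 2, 17, 9, 12 map to banks 2, 1, 1, 0,
# so the busiest bank serves 2 reads and the group takes 2 cycles.
cycles = banked_gather_cycles([2, 17, 9, 12], num_banks=4)
```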

Slide 71

Slide 71 text

DNN Accelerator (NPU) DRAM CPU GPU Overall Hardware Design 26

Slide 72

Slide 72 text

DNN Accelerator (NPU) DRAM CPU GPU Overall Hardware Design 26 Input Point Cloud MLP Kernel Weights MLP Intermediate Activations Neighbor Index Table

Slide 73

Slide 73 text

DNN Accelerator (NPU) DRAM CPU GPU Overall Hardware Design 26 Input Point Cloud MLP Kernel Weights MLP Intermediate Activations Neighbor Index Table GPU (Neighbor Search)

Slide 74

Slide 74 text

DNN Accelerator (NPU) DRAM CPU GPU Overall Hardware Design 26 Input Point Cloud MLP Kernel Weights MLP Intermediate Activations Neighbor Index Table GPU (Neighbor Search) Feature Extraction + Aggregation

Slide 75

Slide 75 text

DNN Accelerator (NPU) DRAM CPU GPU Overall Hardware Design 26 Systolic MAC Unit Array BN/ReLU/ MaxPool Input Point Cloud MLP Kernel Weights MLP Intermediate Activations Neighbor Index Table GPU (Neighbor Search)

Slide 76

Slide 76 text

DNN Accelerator (NPU) DRAM CPU GPU Overall Hardware Design 26 Global Buffer (Weights /Fmaps) Global Buffer (Weights /FMaps) MCU MCU Systolic MAC Unit Array BN/ReLU/ MaxPool Input Point Cloud MLP Kernel Weights MLP Intermediate Activations Neighbor Index Table GPU (Neighbor Search)

Slide 77

Slide 77 text

DNN Accelerator (NPU) DRAM CPU GPU Overall Hardware Design 26 Global Buffer (Weights /Fmaps) Global Buffer (Weights /FMaps) MCU MCU Systolic MAC Unit Array BN/ReLU/ MaxPool Input Point Cloud MLP Kernel Weights MLP Intermediate Activations Neighbor Index Table Aggregation Logic Neighbor Index Table Point Feature Buffer Reduction (Max) Neighbor Index Table Neighbor Index Buffer GPU (Neighbor Search)

Slide 78

Slide 78 text

DNN Accelerator (NPU) DRAM CPU GPU Overall Hardware Design 26 Global Buffer (Weights /Fmaps) Global Buffer (Weights /FMaps) MCU MCU Systolic MAC Unit Array BN/ReLU/ MaxPool Input Point Cloud MLP Kernel Weights MLP Intermediate Activations Neighbor Index Table Aggregation Logic Neighbor Index Table Point Feature Buffer Reduction (Max) Neighbor Index Table Neighbor Index Buffer GPU (Neighbor Search) 
With a 3.8% area overhead over the baseline NPU


Slide 79

Slide 79 text

Experimental Setup 27

Slide 80

Slide 80 text

Experimental Setup 27 Three Point Cloud Applications: ▹ Object Classification, Object Segmentation, and Object Detection

Slide 81

Slide 81 text

Experimental Setup 27 Three Point Cloud Applications: ▹ Object Classification, Object Segmentation, and Object Detection Datasets: ▹ ModelNet40, ShapeNet, and KITTI dataset

Slide 82

Slide 82 text

Experimental Setup 27 Three Point Cloud Applications: ▹ Object Classification, Object Segmentation, and Object Detection Datasets: ▹ ModelNet40, ShapeNet, and KITTI dataset Models: ▹ Classification: PointNet++ (c), DGCNN (c), LDGCNN, DensePoint ▹ Segmentation: PointNet++ (s), DGCNN (s) ▹ Detection: F-PointNet

Slide 83

Slide 83 text

Experimental Setup 27 Three Point Cloud Applications: ▹ Object Classification, Object Segmentation, and Object Detection Datasets: ▹ ModelNet40, ShapeNet, and KITTI dataset Models: ▹ Classification: PointNet++ (c), DGCNN (c), LDGCNN, DensePoint ▹ Segmentation: PointNet++ (s), DGCNN (s) ▹ Detection: F-PointNet https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds Github:

Slide 84

Slide 84 text

Accuracy Comparison 28 [Bar chart: accuracy (%) of Original vs. Delayed Aggr. for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, and DensePoint]

Slide 87

Slide 87 text

Speedup and Energy Saving on GPU 29

Slide 88

Slide 88 text

Speedup and Energy Saving on GPU 29 [Bar chart: GPU speedup for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and AVG.]

Slide 89

Slide 89 text

Speedup and Energy Saving on GPU 29 [Bar chart: GPU speedup for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and AVG.] Average speedup: 1.6×

Slide 90

Slide 90 text

Speedup and Energy Saving on GPU 29 [Bar charts: GPU speedup and energy saving (%) for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and AVG.] Average speedup: 1.6×

Slide 91

Slide 91 text

Speedup and Energy Saving on GPU 29 [Bar charts: GPU speedup and energy saving (%) for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and AVG.] Average speedup: 1.6×; average energy saving: 51.1%

Slide 92

Slide 92 text

Hardware Performance Evaluation 30

Slide 93

Slide 93 text

Hardware Performance Evaluation 30 Hardware Baseline: ▹ A generic NPU + GPU SoC.

Slide 94

Slide 94 text

Hardware Performance Evaluation 30 Variants: ▹ Mesorasi-SW: delayed-aggregation without AU support. ▹ Mesorasi-HW: delayed-aggregation with AU support. Hardware Baseline: ▹ A generic NPU + GPU SoC.

Slide 95

Slide 95 text

Hardware Performance Evaluation 30 Variants: ▹ Mesorasi-SW: delayed-aggregation without AU support. ▹ Mesorasi-HW: delayed-aggregation with AU support. Hardware Baseline: ▹ A generic NPU + GPU SoC. Implementation: ▹ 16x16 Systolic Array ▹ Synopsys synthesis, TSMC 16nm FinFET technology

Slide 96

Slide 96 text

Speedup 31 [Bar chart: speedup of Mesorasi-SW and Mesorasi-HW for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and AVG.]

Slide 97

Slide 97 text

Speedup 31 [Bar chart: speedup of Mesorasi-SW and Mesorasi-HW for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and AVG.] Mesorasi-SW average: 1.3×

Slide 98

Slide 98 text

Speedup 31 [Bar chart: speedup of Mesorasi-SW and Mesorasi-HW for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and AVG.] Mesorasi-SW: 1.3× average (up to 2.2×); Mesorasi-HW: 1.9× average (up to 3.6×)

Slide 99

Slide 99 text

Energy Savings 32 [Bar chart: energy saving (%) of Mesorasi-SW and Mesorasi-HW for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and AVG.]

Slide 100

Slide 100 text

Energy Savings 32 [Bar chart: energy saving (%) of Mesorasi-SW and Mesorasi-HW for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and AVG.] Mesorasi-SW average: 22%

Slide 101

Slide 101 text

Energy Savings 32 [Bar chart: energy saving (%) of Mesorasi-SW and Mesorasi-HW for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and AVG.] Mesorasi-SW average: 22%; Mesorasi-HW average: 38%

Slide 102

Slide 102 text

Conclusion 33 https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds

Slide 103

Slide 103 text

Conclusion 33 ‣Delayed-aggregation decouples neighbor search from feature computation and significantly reduces the overall workload. https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds

Slide 104

Slide 104 text

Conclusion 33 ‣Delayed-aggregation decouples neighbor search from feature computation and significantly reduces the overall workload. ‣Hardware support further maximizes the effectiveness of delayed-aggregation. https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds