Yu Feng, Boyuan Tian, Tiancheng Xu, with Paul Whatmough (Arm Research) and Yuhao Zhu
Department of Computer Science, University of Rochester
http://horizon-lab.org
https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds
[Figure: structure of a point cloud network module]
▹ Points: M-dimension feature vectors P1…P6; P1 and P5 are the centroid points.
▹ Neighbor Search: produces a Neighbor Index Table, e.g., P1: {P2, P3}, P5: {P3, P6}, with N = 2 neighbors per centroid.
▹ Aggregation: each centroid point gets an N x M Neighbor Feature Matrix (NFM) of neighbor-minus-centroid differences, e.g., P1: [P2 - P1; P3 - P1], P5: [P3 - P5; P6 - P5].
▹ Feature Computation: an MLP (Mat Mul 1 + ReLU, Mat Mul 2 + ReLU, Mat Mul 3 + ReLU) followed by a Max Pool reduction.
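To make the data flow concrete, the following is a minimal sketch (not the released implementation) of one aggregate-then-compute module, assuming toy sizes: 6 points with M = 3 features, N = 2 neighbors per centroid, and a 3-layer MLP with illustrative widths 16, 16, and 32.

```python
# A sketch of the baseline order: Neighbor Search -> Aggregation (NFM) -> MLP -> Max Pool.
import numpy as np

rng = np.random.default_rng(0)

points = rng.standard_normal((6, 3))         # rows are P1..P6; M = 3 features per point
neighbor_index = {0: [1, 2], 4: [2, 5]}      # Neighbor Index Table: P1 -> {P2, P3}, P5 -> {P3, P6}
weights = [rng.standard_normal((3, 16)),     # Mat Mul 1
           rng.standard_normal((16, 16)),    # Mat Mul 2
           rng.standard_normal((16, 32))]    # Mat Mul 3; M' = 32 output features

def mlp(x):
    """Three (Mat Mul + ReLU) layers applied to each row of x."""
    for w in weights:
        x = np.maximum(x @ w, 0.0)
    return x

def aggregate_then_compute(centroid, neighbors):
    """Baseline order: build the N x M Neighbor Feature Matrix, run the MLP, max-pool."""
    nfm = points[neighbors] - points[centroid]   # rows are P_neighbor - P_centroid
    return mlp(nfm).max(axis=0)                  # reduction over the N neighbors -> M'-dim feature

out = {c: aggregate_then_compute(c, n) for c, n in neighbor_index.items()}
print({c: v.shape for c, v in out.items()})      # each centroid ends up with a 32-dim feature
```

Note that in this order the MLP runs once per NFM row, i.e., once per (centroid, neighbor) pair, so the same physical point is processed repeatedly when it appears in many neighbor lists.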
[Figure: Characterization of Point Cloud Networks: % execution time of PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), and F-PointNet, split into Neighbor Search, Aggregation, Feature Computation, and Others]
▹ Optimizing Feature Computation alone is not enough: Neighbor Search and Aggregation account for a large share of execution time.
▹ Slow on high-end mobile GPUs: PointNet++ takes 132 ms, F-PointNet 141 ms, and DGCNN 5200 ms.
Optimization: delayed aggregation
Matrix multiplication distributes over the subtraction used to build the NFM:
Mat Mul(Kernel weights, P3 - P1) = Mat Mul(Kernel weights, P3) - Mat Mul(Kernel weights, P1),
so each point can be pushed through the matrix multiplications once, and the neighbor-minus-centroid differences can be formed afterwards.
Benefits:
▸ Effectively introduces reuse opportunities: each point is used ~30 times, which reduces MAC (Multiply-Accumulate) operations by up to 90%.
▸ Dependency elimination: feature computation no longer has to wait for neighbor search and aggregation.
Caveat: ReLU is not linear, so ReLU(Mat Mul(Kernel weights, P3 - P1)) ≠ ReLU(Mat Mul(Kernel weights, P3)) - ReLU(Mat Mul(Kernel weights, P1)).
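The identity above can be checked numerically. The sketch below uses illustrative shapes and values; it also shows why the ReLU layers break the identity.

```python
# Mat Mul distributes over the aggregation subtraction; ReLU does not.
import numpy as np

rng = np.random.default_rng(1)
W  = rng.standard_normal((3, 16))   # kernel weights of one MLP layer (illustrative size)
p1 = rng.standard_normal(3)         # centroid feature vector
p3 = rng.standard_normal(3)         # neighbor feature vector

aggregate_first = (p3 - p1) @ W            # Mat Mul(Kernel weights, P3 - P1)
compute_first   = p3 @ W - p1 @ W          # Mat Mul(Kernel weights, P3) - Mat Mul(Kernel weights, P1)
print(np.allclose(aggregate_first, compute_first))   # True: exact for the linear part

relu = lambda x: np.maximum(x, 0.0)
print(np.allclose(relu(aggregate_first),
                  relu(p3 @ W) - relu(p1 @ W)))      # False: ReLU(a - b) != ReLU(a) - ReLU(b)
```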
With the ReLUs kept inside the MLP, the reordered computation is only approximately equal (≈) to the original. The accuracy change ranges from -0.9% to +1.2% (see the comparison sketch after the reordered pipeline below).
[Figure: the reordered pipeline]
Points (M-dimension feature vectors P1…P6) → MLP (Mat Mul 1 + ReLU, Mat Mul 2 + ReLU, Mat Mul 3 + ReLU) → Point Feature Table → Aggregation (Neighbor Feature Matrix per centroid) → Reduction → M'-dimension features P1', P5'.
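Below is a self-contained sketch of the reordered flow under the same toy setup as the earlier module sketch (sizes, values, and layer widths are illustrative assumptions): the MLP runs once per point to fill a Point Feature Table, and aggregation plus reduction are delayed until after it. The comparison against the original order shows the non-zero difference introduced by keeping the ReLUs.

```python
# Delayed aggregation: Feature Computation -> Point Feature Table -> Aggregation -> Reduction.
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((6, 3))
neighbor_index = {0: [1, 2], 4: [2, 5]}
weights = [rng.standard_normal((3, 16)),
           rng.standard_normal((16, 16)),
           rng.standard_normal((16, 32))]

def mlp(x):
    for w in weights:
        x = np.maximum(x @ w, 0.0)
    return x

def original(centroid, neighbors):
    nfm = points[neighbors] - points[centroid]
    return mlp(nfm).max(axis=0)

# Feature computation no longer waits for neighbor search: every point is transformed once,
# and its entry in the Point Feature Table is reused by every centroid that lists it.
point_feature_table = mlp(points)

def delayed(centroid, neighbors):
    diff = point_feature_table[neighbors] - point_feature_table[centroid]
    return diff.max(axis=0)                      # Reduction (Max) over the neighbors

for c, n in neighbor_index.items():
    print(c, np.abs(original(c, n) - delayed(c, n)).mean())   # non-zero: the reordering is approximate
```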
[Figure: aggregation logic]
▸ The Neighbor Index Table (e.g., 6: {18, 25, 34, …}, …, 900: {832, 987, …}) drives address generation into the Point Feature Table (PFT).
▸ The PFT buffer is B-ported and B-banked, with no crossbar; bank outputs are collected through shift registers and a MUX.
▸ The centroid's feature vector is stored locally; a subtraction unit subtracts it from each neighbor feature, and a Reduction (Max) unit folds the differences, interfacing with the global buffer.
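A functional software model of this datapath is sketched below. The bank count B = 4, the table sizes, and the simplified timing (a group of conflict-free reads counts as one access, conflicting reads serialize) are assumptions for illustration, not claims about the actual design; the running max shows why arrival order, and hence a crossbar, is unnecessary.

```python
# Functional model of the aggregation logic: banked PFT reads driven by the Neighbor Index
# Table, subtraction of the stored centroid feature vector, and a Reduction (Max).
import numpy as np

B = 4                                              # number of banks/ports (illustrative)
rng = np.random.default_rng(2)
pft = rng.standard_normal((1024, 32))              # Point Feature Table: one M'-dim row per point
neighbor_index = {6: [18, 25, 34], 900: [832, 987]}

def bank_of(point_id):
    """Address generation: interleave point IDs across the B banks."""
    return point_id % B

def aggregate(centroid, neighbors):
    centroid_vec = pft[centroid]                   # "store centroid's feature vector"
    acc = np.full(pft.shape[1], -np.inf)           # running max across neighbors
    accesses = 0
    for i in range(0, len(neighbors), B):          # issue reads in groups of up to B
        group = neighbors[i:i + B]
        banks = [bank_of(n) for n in group]
        # Conflict-free groups complete in one access; conflicting reads serialize (simplified model).
        accesses += 1 if len(set(banks)) == len(banks) else len(group)
        for n in group:                            # max is order-insensitive, so no reordering crossbar
            acc = np.maximum(acc, pft[n] - centroid_vec)
    return acc, accesses

for c, n in neighbor_index.items():
    vec, cost = aggregate(c, n)
    print(c, vec.shape, cost)
```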
[Figure: overall architecture]
▸ The GPU performs Neighbor Search on the input point cloud and emits the Neighbor Index Table.
▸ The NPU (global buffers for weights/fmaps, MCUs, a systolic MAC unit array, and BN/ReLU/MaxPool units) runs the MLPs; MLP kernel weights and intermediate activations reside in the global buffer, and per-point results are stored in a Point Feature Buffer.
▸ Augmented aggregation logic, consisting of a Neighbor Index Buffer and the Reduction (Max) datapath, performs delayed aggregation with 3.8% area overhead to the NPU.
‣ Delayed-aggregation significantly reduces the overall workload.
‣ Hardware support further maximizes the effectiveness of delayed-aggregation.
https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds