
Mesorasi: Architecture Support for Point Cloud Analytics via Delayed-Aggregation

HorizonLab
October 24, 2020

MICRO 2020 talk slides by Yu Feng and Tiancheng Xu.

Transcript

  1. Mesorasi: Architecture Support for Point Cloud Analytics via Delayed-Aggregation. Yu Feng, Boyuan Tian, and Tiancheng Xu, with Paul Whatmough (Arm Research) and Yuhao Zhu. Department of Computer Science, University of Rochester. http://horizon-lab.org | https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds
  5. Deep Learning on Point Clouds: classification, segmentation, detection, and SLAM. Can we use existing neural network accelerators on point cloud workloads?
  6–7. Key Operators. A point cloud network layer is built from three key operators: Neighbor Search, which maps each point to its neighbor list (P1: {P2, P3, …}, P3: {P8, P7, …}, P6: {P8, P5, …}, …); Aggregation, which assembles each point's neighbor features; and Feature Computation, a stack of MLP layers (Mat Mul 1 + ReLU, Mat Mul 2 + ReLU, Mat Mul 3 + ReLU).
  8–12. Neighbor Search in Point Clouds. Each centroid point finds its N nearest neighbors among the irregularly placed points, e.g., P1: {P0, P2, P3, P4, P5, P6} and P8: {P5, P6, P7, P9, P10, P11}. Neighbor search is implemented with sorting or a KD-tree and incurs irregular memory accesses.
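  Below is a minimal sketch of this step, assuming a k-nearest-neighbors formulation over an array of 3D coordinates (the slide names sorting and KD-trees as implementations; the point counts, centroid subset, and N = 6 are illustrative, not values from the talk). SciPy's KD-tree stands in for the search structure:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((1024, 3))   # (x, y, z) coordinates of the cloud
centroids = points[:128]         # hypothetical centroid subset
N = 6                            # neighbors per centroid, as on the slide

tree = cKDTree(points)           # build the search structure once
# k = N + 1 because each centroid, being a point of the cloud itself,
# is returned as its own nearest neighbor; drop that first column.
_, idx = tree.query(centroids, k=N + 1)
neighbor_index_table = idx[:, 1:]  # row i lists centroid i's N neighbors
```

  The gathers driven by this table are what make the workload's memory behavior irregular: which rows are touched depends on the input geometry, not on a fixed pattern.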
  13. Neighbor Search in a conventional DNN. Pixels (P0,0 … P3,4 on a grid) are inherently regular: a pixel's neighborhood is implied by its indices. Points are irregularly scattered. That is why regular DNNs don't need explicit neighbor search but point cloud networks do.
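  For contrast, a hypothetical helper showing why the grid case needs no search: on an image, the 3×3 neighborhood of a pixel falls out of index arithmetic.

```python
def pixel_neighbors(i, j, height, width):
    """3x3 neighborhood of pixel (i, j), computed purely from indices."""
    return [(i + di, j + dj)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)
            and 0 <= i + di < height and 0 <= j + dj < width]

print(pixel_neighbors(1, 1, 4, 5))  # neighbors of P1,1 in a 4 x 5 image
```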
  14. The three-operator pipeline again (Neighbor Search → Aggregation → Feature Computation), now zooming in on Aggregation.
  15–20. Aggregation. The inputs are the points' M-dimension feature vectors and the Neighbor Index Table (P1: {P0, P2, P3, P4, P5, P6}, P8: {P5, P6, P7, P9, P10, P11}). For each centroid, aggregation gathers the neighbors' feature vectors and subtracts the centroid's own vector, building one feature matrix per centroid: P1's matrix has rows P0 − P1, P2 − P1, P3 − P1, P4 − P1, P5 − P1, P6 − P1; P8's has rows P5 − P8, P6 − P8, P7 − P8, P9 − P8, P10 − P8, P11 − P8.
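  A sketch of this step using NumPy fancy indexing; the array shapes (1024 points, M = 64, 128 centroids, N = 6) are illustrative stand-ins rather than values from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((1024, 64))              # one M-dim vector per point
centroid_ids = np.arange(128)                           # hypothetical centroid indices
neighbor_index_table = rng.integers(0, 1024, (128, 6))  # N = 6 neighbors per centroid

gathered = features[neighbor_index_table]               # (centroids, N, M): irregular reads
nfm = gathered - features[centroid_ids][:, None, :]     # rows like "P2 - P1"
```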
  21. The three-operator pipeline again, now zooming in on Feature Computation.
  22–25. Feature Computation. Every row of each centroid's feature matrix (P0 − P1, P2 − P1, …) passes through the shared MLP layers (MatMul + ReLU stages), yielding a new feature matrix per centroid (P0', P2', P3', P4', P5', P6' for P1; P5', P6', P7', P9', P10', P11' for P8). A reduction then collapses each new feature matrix into a single new feature vector per centroid (P1', P8').
  26. Point Cloud Network Layer, end to end. Points (P1 … P6) carry M-dimension feature vectors; P1 and P5 are the centroid points. Neighbor Search produces the Neighbor Index Table (P1: {P2, P3}, P5: {P3, P6}, with N == 2 neighbors). Aggregation builds each centroid's N × M Neighbor Feature Matrix (NFM): rows P2 − P1 and P3 − P1 for P1, rows P3 − P5 and P6 − P5 for P5. Feature Computation (Mat Mul 1 + ReLU, Mat Mul 2 + ReLU, Mat Mul 3 + ReLU, then a max-pool reduction) yields the M'-dimension features P1' and P5'.
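  Completing the picture, a sketch of feature computation over neighbor feature matrices like those from the previous sketch; the three weight shapes (and M' = 128) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)
nfm = rng.standard_normal((128, 6, 64))   # stand-in neighbor feature matrices
W1 = rng.standard_normal((64, 64))        # Mat Mul 1
W2 = rng.standard_normal((64, 64))        # Mat Mul 2
W3 = rng.standard_normal((64, 128))       # Mat Mul 3 (M' = 128)

h = relu(relu(relu(nfm @ W1) @ W2) @ W3)  # shared MLP on every row: (centroids, N, M')
new_features = h.max(axis=1)              # max-pool reduction over the N neighbors
```

  Note that every MatMul here operates on differences like P2 − P1, which is exactly what the upcoming optimization exploits.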
  27–31. Can We Use Existing DNN Accelerators? No, they are not enough: they handle Feature Computation but not Neighbor Search or Aggregation. [Chart: characterization of point cloud networks, showing % execution time spent in neighbor search, aggregation, feature computation, and others for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), and F-PointNet.] These networks are also slow on high-end mobile GPUs: PointNet++ takes 132 ms, F-PointNet 141 ms, and DGCNN 5200 ms.
  32. The point cloud network layer once more (Neighbor Search → Aggregation → MLP + Reduction), as the starting point for the optimization.
  33–42. Optimization. In the original layer, the MLP multiplies the kernel weights with every NFM row, i.e., with differences such as P3 − P1. Matrix multiplication distributes over that subtraction: Mat Mul(Kernel weights, P3 − P1) = Mat Mul(Kernel weights, P3) − Mat Mul(Kernel weights, P1). Benefits: it effectively introduces reuse opportunities, since each point is used ~30 times, cutting up to 90% of the MAC (multiply-accumulate) operations, and it eliminates the dependency of feature computation on aggregation. The catch is that ReLU is not linear: ReLU(Mat Mul(Kernel weights, P3 − P1)) ≠ ReLU(Mat Mul(Kernel weights, P3)) − ReLU(Mat Mul(Kernel weights, P1)).
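  The two identities above are easy to check numerically; a quick demonstration (the shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)
W = rng.standard_normal((8, 8))                  # stand-in kernel weights
p3, p1 = rng.standard_normal(8), rng.standard_normal(8)

# MatMul distributes over the subtraction:
print(np.allclose((p3 - p1) @ W, p3 @ W - p1 @ W))                    # True

# ...but ReLU does not, so the two sides differ in general:
print(np.allclose(relu((p3 - p1) @ W), relu(p3 @ W) - relu(p1 @ W)))  # False
```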
  43–44. Approximation. Delayed aggregation therefore applies the transformation approximately, treating MLP-then-reduce over per-point features as a stand-in for the original MLP + Reduction over the NFMs: ReLU(Mat Mul(Kernel weights, P3 − P1)) ≈ ReLU(Mat Mul(Kernel weights, P3)) − ReLU(Mat Mul(Kernel weights, P1)). Across the evaluated networks, the accuracy change ranges from −0.9% to +1.2%.
  45. Delayed Aggregation. The reordered layer: Neighbor Search produces the Neighbor Index Table; Feature Computation (the MLP of Mat Mul + ReLU layers) runs directly on the points' M-dimension feature vectors, producing a Point Feature Table; Aggregation then gathers each centroid's Neighbor Feature Matrix from that table and reduces it to the M'-dimension output features (P1', P5').
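  A sketch of the reordered layer, reusing the illustrative shapes from the earlier sketches; the MLP now runs once per point rather than once per NFM row, and the centroid subtraction is the part handled approximately:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)
features = rng.standard_normal((1024, 64))              # per-point M-dim features
neighbor_index_table = rng.integers(0, 1024, (128, 6))  # stand-in index table
W1 = rng.standard_normal((64, 64))
W2 = rng.standard_normal((64, 64))
W3 = rng.standard_normal((64, 128))

pft = relu(relu(relu(features @ W1) @ W2) @ W3)  # Point Feature Table: one pass per point
nfm = pft[neighbor_index_table]                  # aggregation is now just a gather...
new_features = nfm.max(axis=1)                   # ...plus a max reduction per centroid
```

  Because the gather moves after the MLP, each point's features are computed once and reused by every centroid that lists it as a neighbor; that is where the MAC savings come from.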
  46–48. Bottleneck Shift. [Chart: execution time distribution of PointNet++ (0 to 30 ms), Original vs. Delayed Aggr., split into neighbor search, aggregation, and feature computation.] Delayed aggregation shrinks feature computation, so the remaining time is increasingly dominated by aggregation and neighbor search; the bottleneck shifts, which motivates dedicated hardware for aggregation.
  49–53. Aggregation Operation. After delayed aggregation, aggregation reduces to indexed gathers: an entry of the Neighbor Index Table, e.g., P1: {P2, P3, P11, P23, P31, P39, …} (alongside P3: {P5, P7, P16, P19, P38, P41, …}, …, P99: {P16, P71, P96, P119, P128, P142, …}), selects the corresponding N rows of the Point Feature Table to assemble that centroid's Neighbor Feature Matrix.
  54–61. Aggregation Unit. The Neighbor Index Table (1: {2, 17, 9, …}, 3: {8, 17, 6, …}, 6: {18, 25, 34, …}, …, 900: {832, 987, …}) drives an Address Generation stage that reads the Point Feature Table (PFT) from an on-chip buffer that is B-ported and B-banked, with no crossbar. Gathered feature vectors stream through shift registers into a Reduction (Max) tree; a MUX path stores the centroid's own feature vector; results drain into a Sub Global Buffer.
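  A toy model of the banked PFT access, assuming rows are interleaved across banks by point index: with B single-ported banks and no crossbar, reads to distinct banks complete together, while same-bank requests serialize. Everything here (B = 4, the scheduling loop) is illustrative, not the unit's actual design:

```python
B = 4  # number of banks/ports (illustrative)

def cycles_to_gather(neighbor_ids, num_banks=B):
    """Cycles to read one centroid's neighbor rows from banked SRAM."""
    cycles, pending = 0, list(neighbor_ids)
    while pending:
        served_banks, remaining = set(), []
        for pid in pending:
            bank = pid % num_banks       # rows interleaved across banks
            if bank in served_banks:
                remaining.append(pid)    # bank conflict: retry next cycle
            else:
                served_banks.add(bank)   # one access per bank per cycle
        pending = remaining
        cycles += 1
    return cycles

print(cycles_to_gather([2, 3, 11, 23, 31, 39]))  # P1's neighbor list -> 5 cycles here
```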
  62–68. Overall Hardware Design. The SoC couples a CPU, a GPU, and a DNN accelerator (NPU) over shared DRAM, which holds the input point cloud, the MLP kernel weights, the MLP intermediate activations, and the Neighbor Index Table. The GPU runs neighbor search; the NPU runs feature extraction plus aggregation. Inside the NPU, a systolic MAC unit array with BN/ReLU/MaxPool units, global buffers (weights/fmaps), and MCUs is augmented with the aggregation logic: a neighbor index buffer, a point feature buffer, and a Reduction (Max) unit. The aggregation logic adds 3.8% area overhead to the NPU.
  69–71. Experimental Setup.
     ▹ Applications: object classification, object segmentation, and object detection.
     ▹ Datasets: ModelNet40, ShapeNet, and KITTI.
     ▹ Models: classification uses PointNet++ (c), DGCNN (c), LDGCNN, and DensePoint; segmentation uses PointNet++ (s) and DGCNN (s); detection uses F-PointNet.
     ▹ GitHub: https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds
  72–74. Accuracy Comparison. [Chart: accuracy (%) of Original vs. Delayed Aggr. on PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, and DensePoint; the paired bars are nearly identical, matching the −0.9% to +1.2% accuracy change reported earlier.]
  75–78. Speedup and Energy Saving on GPU. [Chart: speedup and energy saving (%) of delayed aggregation over the original networks, across PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and the average.] On mobile GPUs, software-only delayed aggregation delivers a 1.6× average speedup and a 51.1% average energy saving.
  79–80. Hardware Performance Evaluation.
     Variants: ▹ Mesorasi-SW: delayed-aggregation without AU (aggregation unit) support. ▹ Mesorasi-HW: delayed-aggregation with AU support.
     Hardware baseline: ▹ a generic NPU + GPU SoC.
     Implementation: ▹ a 16×16 systolic array, synthesized with Synopsys tools in TSMC 16 nm FinFET technology.
  81–83. Speedup. [Chart: speedup of Mesorasi-SW and Mesorasi-HW over the baseline SoC across the seven networks and their average.] Mesorasi-SW averages 1.3× and Mesorasi-HW 1.9×; the best cases reach 2.2× (SW) and 3.6× (HW).
  84–86. Energy Savings. [Chart: energy saving (%) of Mesorasi-SW and Mesorasi-HW across the seven networks and their average.] Mesorasi-SW saves 22% on average; Mesorasi-HW saves 38%.
  87–88. Conclusion. ‣ Delayed-aggregation decouples neighbor search from feature computation and significantly reduces the overall workload. ‣ Hardware support further maximizes the effectiveness of delayed-aggregation. https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds