Mesorasi: Architecture Support for Point Cloud Analytics via Delayed-Aggregation

HorizonLab
October 24, 2020

MICRO 2020 talk slides by Yu Feng and Tiancheng Xu.

Transcript

  1. Mesorasi: Architecture Support for Point Cloud Analytics via Delayed-Aggregation

     Yu Feng, Boyuan Tian, Tiancheng Xu, with Paul Whatmough (Arm Research) and Yuhao Zhu
     Department of Computer Science, University of Rochester
     http://horizon-lab.org
     https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds
  6. Autonomous Driving, Robotics, Mixed Reality, Drone Navigation

  7. Deep Learning on Point Clouds: Classification, Segmentation, Detection, SLAM

      Performance
      Energy-efficiency

     Can we use existing neural network accelerators on point cloud workloads?
  11. Key Operators: Neighbor Search (produces a neighbor index table, e.g., P1: {P2, P3, …}, P3: {P8, P7, …}, P6: {P8, P5, …}) → Aggregation → Feature Computation (Mat Mul 1 → ReLU → Mat Mul 2 → ReLU → Mat Mul 3 → ReLU)
  13. Neighbor Search in Point Clouds: for each centroid, find its N nearest neighbors, e.g.,
     P1: { P0, P2, P3, P4, P5, P6 }
     P8: { P5, P6, P7, P9, P10, P11 }
     Implemented with sorting or KD-trees, which incur irregular memory accesses.
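The neighbor search step can be sketched as a brute-force k-nearest-neighbor query (a minimal pure-Python sketch; real pipelines use sorting or KD-trees as the slide notes, and the function name `knn_index_table` is ours, not from the talk):

```python
import math

def knn_index_table(points, k):
    """Brute-force neighbor search: for each centroid, find the indices of
    its k nearest neighbors (excluding itself). The output -- a neighbor
    index table -- is what the later aggregation step consumes."""
    table = {}
    for i, p in enumerate(points):
        # (distance, index) pairs to every other point, sorted by distance
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        table[i] = [j for _, j in dists[:k]]
    return table

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
table = knn_index_table(points, k=2)
# table[0] == [1, 2]: points 1 and 2 are the nearest neighbors of point 0
```

The quadratic scan is only for illustration; the irregular, data-dependent indexing it produces is exactly what makes this step awkward for regular DNN accelerators.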
  19. Neighbor Search in a conventional DNN: pixels are inherently regular (a fixed grid), whereas points are irregularly scattered. Regular DNNs therefore don't need explicit neighbor search.
  21. Aggregation: using the neighbor index table
     P1: { P0, P2, P3, P4, P5, P6 }
     P8: { P5, P6, P7, P9, P10, P11 }
     and the points' feature vectors, build each centroid's feature matrix by subtracting the centroid's feature vector from each neighbor's:
     P1's feature matrix: P0 - P1, P2 - P1, P3 - P1, P4 - P1, P5 - P1, P6 - P1
     P8's feature matrix: P5 - P8, P6 - P8, P7 - P8, P9 - P8, P10 - P8, P11 - P8
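In code, aggregation is a gather followed by a subtraction (a minimal sketch with made-up 2-D features; `aggregate` is our illustrative name):

```python
def aggregate(features, index_table, centroid):
    """Aggregation: gather each neighbor's feature vector via the neighbor
    index table and subtract the centroid's feature vector, producing the
    centroid's feature matrix (one row per neighbor)."""
    c = features[centroid]
    return [[x - y for x, y in zip(features[n], c)]
            for n in index_table[centroid]]

# toy 2-D features for points P0..P3; P1's neighbors are P0 and P2
features = {0: [1.0, 2.0], 1: [0.5, 0.5], 2: [2.0, 0.0], 3: [9.0, 9.0]}
index_table = {1: [0, 2]}
nfm = aggregate(features, index_table, centroid=1)
# nfm == [[0.5, 1.5], [1.5, -0.5]], i.e. [P0 - P1, P2 - P1]
```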
  28. Feature Computation: each centroid's feature matrix (e.g., P1's: P0 - P1, P2 - P1, …, P6 - P1) passes through the MLP layers (MatMul + ReLU), producing a new feature matrix (P0', P2', …, P6'). A reduction then collapses each new feature matrix into the centroid's new feature vector (P1', P8').
  32. Point Cloud Network Layer, end to end (P1 and P5 are centroid points with M-dimension feature vectors; N neighbors, N == 2 in the example):
     Neighbor Search → Neighbor Index Table (P1: {P2, P3}, P5: {P3, P6})
     Aggregation → one N x M Neighbor Feature Matrix (NFM) per centroid (P1: P2 - P1, P3 - P1; P5: P3 - P5, P6 - P5)
     Feature Computation → MLP (Mat Mul 1 → ReLU → Mat Mul 2 → ReLU → Mat Mul 3 → ReLU) + Reduction (Max Pool), yielding M'-dimension features P1', P5'
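The whole layer can be modeled in a few lines of plain Python (a sketch only: a single MatMul + ReLU stands in for the three-layer MLP, and the helper names are ours):

```python
def mlp_relu(mat, w):
    """Apply one MLP layer (MatMul + ReLU) to each row of an N x M matrix."""
    return [[max(0.0, sum(r * w[i][j] for i, r in enumerate(row)))
             for j in range(len(w[0]))]
            for row in mat]

def layer(nfm, w):
    """One point cloud network layer on a centroid's Neighbor Feature Matrix:
    MLP followed by a column-wise max reduction (the max pool), producing
    the centroid's new feature vector."""
    return [max(col) for col in zip(*mlp_relu(nfm, w))]

nfm = [[1.0, 2.0], [3.0, -1.0]]   # N = 2 neighbors, M = 2 features
w = [[1.0, 0.0], [0.0, 1.0]]      # identity weights for readability
print(layer(nfm, w))              # [3.0, 2.0]
```

Note the dependency structure this sketch makes explicit: the MLP cannot start until aggregation has materialized the NFM, which itself waits on neighbor search.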
  33. Can We Use Existing DNN Accelerators?
     ▸ No. They are not enough.
      ▹ Neighbor Search
      ▹ Aggregation
     [Chart: characterization of point cloud networks: % execution time in Neighbor Search, Aggregation, Feature Computation, and Others for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), and F-PointNet]
      ▹ Slow on high-end mobile GPUs: PointNet++: 132 ms; F-PointNet: 141 ms; DGCNN: 5200 ms
  41. Optimization: MatMul distributes over the subtraction inside the Neighbor Feature Matrix:
     Mat Mul ( Kernel weights , P3 - P1 ) = Mat Mul ( Kernel weights , P3 ) - Mat Mul ( Kernel weights , P1 )
     Benefit:
     ▸ Effectively introduces reuse opportunities: each point is used ~30 times, reducing MAC (Multiply-Accumulate) operations by up to 90%.
     ▸ Dependency elimination.
     Caveat: ReLU is nonlinear, so
     ReLU ( Mat Mul ( W , P3 - P1 ) ) ≠ ReLU ( Mat Mul ( W , P3 ) ) - ReLU ( Mat Mul ( W , P1 ) )
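The identity (and where it breaks) is easy to check numerically; this is a pure-Python check with toy 2x2 weights of our choosing:

```python
def matmul(w, v):
    """w (list of rows) times column vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def relu(v):
    return [max(0.0, x) for x in v]

w = [[1.0, -2.0], [3.0, 0.5]]      # toy kernel weights
p1, p3 = [1.0, 2.0], [4.0, -1.0]   # toy point features

# MatMul is linear, so subtracting before or after gives the same result --
# this is what lets the per-point MatMuls be hoisted before aggregation:
before = matmul(w, [a - b for a, b in zip(p3, p1)])
after = [a - b for a, b in zip(matmul(w, p3), matmul(w, p1))]

# ReLU is not linear, so once ReLU is applied the two orders differ,
# which is why fully delayed aggregation becomes an approximation:
relu_before = relu(before)
relu_after = [a - b for a, b in zip(relu(matmul(w, p3)), relu(matmul(w, p1)))]
```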
  51. Approximation: because of the nonlinear ReLU, applying the MLP per point and aggregating afterwards only approximates (≈) the original computation. In practice, the accuracy change ranges from -0.9% to +1.2%.
  53. Delayed Aggregation: the reordered layer. Neighbor Search produces the Neighbor Index Table while Feature Computation (the MLP: Mat Mul 1 → ReLU → Mat Mul 2 → ReLU → Mat Mul 3 → ReLU) runs on individual points to produce the Point Feature Table; Aggregation then gathers each centroid's Neighbor Feature Matrix, and a Reduction yields the M'-dimension features P1', P5'.
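Putting it together, a delayed-aggregation layer computes per-point features first and aggregates last. A pure-Python sketch under our own naming (a single MatMul + ReLU again stands in for the full MLP):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def mlp(v, w):
    """Per-point MLP: MatMul + ReLU on a single feature vector."""
    return relu([sum(wi * vi for wi, vi in zip(row, v)) for row in w])

def delayed_aggregation_layer(points, index_table, w):
    """Delayed aggregation: run the MLP on every *individual* point first
    (filling the Point Feature Table), then aggregate -- gather each
    centroid's neighbors, subtract the centroid, and max-reduce."""
    pft = {i: mlp(p, w) for i, p in points.items()}   # Point Feature Table
    out = {}
    for c, neighbors in index_table.items():
        mat = [[n - cc for n, cc in zip(pft[j], pft[c])] for j in neighbors]
        out[c] = [max(col) for col in zip(*mat)]      # Reduction (max)
    return out
```

Because each point goes through the MLP exactly once regardless of how many neighbor lists it appears in, the per-point work is reused across centroids, and the MLP no longer waits on neighbor search.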
  54. Bottleneck Shift: [Chart: execution time distribution of PointNet++ (0-30 msec), Original vs. Delayed Aggr., broken down into Feature Computation, Aggregation, and Neighbor Search]
  57. Mesorasi: Point Cloud Acceleration Framework
     Algorithm: Delayed Aggregation
     Hardware: Aggregation Acceleration
  58. Aggregation Operation: each entry of the Neighbor Index Table, e.g.,
     P1: {P2, P3, P11, P23, P31, P39, …}
     P3: {P5, P7, P16, P19, P38, P41, …}
     …
     P99: {P16, P71, P96, P119, P128, P142, …}
     indexes into the Point Feature Table to gather the N neighbors' feature vectors into the centroid's Neighbor Feature Matrix.
  63. Aggregation Unit: the Neighbor Index Table (1: {2, 17, 9, …}; 3: {8, 17, 6, …}; 6: {18, 25, 34, …}; …; 900: {832, 987, …}) feeds Address Generation into the Point Feature Table (PFT), a B-ported, B-banked buffer with no crossbar. Gathered vectors flow through shift registers and a MUX (which stores the centroid's feature vector) into the Reduction (Max) unit, and the results are written to a sub global buffer.
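A functional model of the banked PFT gather can illustrate why banking matters. This is our own sketch, not the paper's exact design: we assume entry i maps to bank i % B and that conflicting accesses simply serialize over extra cycles.

```python
def banked_gather(pft, neighbor_ids, num_banks):
    """Model of a B-banked Point Feature Table: entry i lives in bank
    i % num_banks. Indices hitting distinct banks are serviced in one
    cycle; bank conflicts retry in later cycles. Returns the gathered
    feature vectors and the cycle count."""
    cycles, gathered, pending = 0, [], list(neighbor_ids)
    while pending:
        used_banks, retry = set(), []
        for i in pending:
            bank = i % num_banks
            if bank in used_banks:
                retry.append(i)           # bank conflict: retry next cycle
            else:
                used_banks.add(bank)
                gathered.append(pft[i])
        pending = retry
        cycles += 1
    return gathered, cycles

pft = {i: [float(i)] for i in range(8)}   # toy point feature table
gathered, cycles = banked_gather(pft, [1, 2, 3, 5], num_banks=4)
# indices 1 and 5 both map to bank 1, so this gather takes 2 cycles
```

With B ports over B banks, a conflict-free set of indices completes in a single cycle without needing a crossbar, matching the slide's "B-ported, B-banked; No Crossbar" design point.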
  71. Overall Hardware Design: an SoC with CPU, GPU, DRAM, and a DNN accelerator (NPU).
     DRAM holds the input point cloud, MLP kernel weights, MLP intermediate activations, and the neighbor index table.
     The GPU performs Neighbor Search; the NPU (systolic MAC unit array, BN/ReLU/MaxPool units, global buffers for weights/fmaps, MCUs) performs Feature Extraction; dedicated Aggregation Logic (neighbor index buffer, point feature buffer, Reduction (Max)) performs Aggregation.
     The aggregation logic adds 3.8% area overhead to the NPU.
  79. Experimental Setup
     Three Point Cloud Applications:
      ▹ Object Classification, Object Segmentation, and Object Detection
     Datasets:
      ▹ ModelNet40, ShapeNet, and KITTI
     Models:
      ▹ Classification: PointNet++ (c), DGCNN (c), LDGCNN, DensePoint
      ▹ Segmentation: PointNet++ (s), DGCNN (s)
      ▹ Detection: F-PointNet
     GitHub: https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds
  84. Accuracy Comparison: [Chart: accuracy (%) of Original vs. Delayed Aggr. for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, and DensePoint]
  87. Speedup and Energy Saving on GPU: [Charts: speedup and energy saving (%) for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and the average] Delayed aggregation delivers a 1.6x average speedup and 51.1% average energy saving.
  92. Hardware Performance Evaluation
     Hardware Baseline:
      ▹ A generic NPU + GPU SoC
     Variants:
      ▹ Mesorasi-SW: delayed-aggregation without AU support
      ▹ Mesorasi-HW: delayed-aggregation with AU support
     Implementation:
      ▹ 16x16 Systolic Array
      ▹ Synopsys synthesis, TSMC 16nm FinFET technology
  96. Speedup: [Chart: speedup of Mesorasi-SW and Mesorasi-HW over the baseline for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and the average; annotated values: 1.3, 1.9, 2.2, and 3.6]
  99. Energy Savings: [Chart: energy saving (%) of Mesorasi-SW and Mesorasi-HW for PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint, and the average; annotated values: 22% and 38%]
  102. Conclusion
     ‣ Delayed-aggregation decouples neighbor search from feature computation and significantly reduces the overall workload.
     ‣ Hardware support further maximizes the effectiveness of delayed-aggregation.
     https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds