
Mesorasi: Architecture Support for Point Cloud Analytics via Delayed-Aggregation

HorizonLab
October 24, 2020


MICRO 2020 talk slides by Yu Feng and Tiancheng Xu.


Transcript

  1. 1
    Mesorasi: Architecture Support for Point Cloud
    Analytics via Delayed-Aggregation
    Yu Feng, Boyuan Tian, Tiancheng Xu
    with Paul Whatmough (Arm Research) and Yuhao Zhu
    Department of Computer Science

    University of Rochester

    http://horizon-lab.org
    https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds


  6. 4
    Applications: Autonomous Driving, Robotics, Mixed Reality, Drone Navigation

  9. 5
    Deep Learning on Point Clouds
    Tasks: Classification, Segmentation, Detection, SLAM
    Concerns: Performance, Energy-efficiency

  10. 5
    Deep Learning on Point Clouds
    Can we use existing neural network accelerators on point cloud workloads?

  11. 6
    Key Operators
    Neighbor Search → Aggregation → Feature Computation
    Neighbor Search runs over the scattered points (P0 … P13) and produces
    neighbor lists such as:
    P1: {P2, P3, …}
    P3: {P8, P7, …}
    P6: {P8, P5, …}
    Aggregation builds a feature matrix per centroid (e.g. P1, P6);
    Feature Computation applies MatMul 1 → ReLU → MatMul 2 → ReLU →
    MatMul 3 → ReLU.



  18. 8
    Neighbor Search in Point Clouds
    For each point in the scattered set (P0 … P13), find its N nearest
    neighbors, e.g.:
    P1: { P0, P2, P3, P4, P5, P6 }
    P8: { P5, P6, P7, P9, P10, P11 }
    Implemented with sorting or KD-trees → irregular memory accesses.
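The neighbor search step above can be sketched as a brute-force k-nearest-neighbor search (a minimal Python sketch, not the talk's implementation; the point coordinates and `k` below are made up for illustration):

```python
import math

def knn(points, k):
    """Brute-force k-nearest-neighbor search: for every point, sort all
    other points by Euclidean distance and keep the k closest.
    Returns a neighbor index table: {centroid index: [neighbor indices]}."""
    table = {}
    for i, p in enumerate(points):
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()                      # sorting-based search (cf. KD-trees)
        table[i] = [j for _, j in dists[:k]]
    return table

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(knn(points, k=2))   # e.g. point 0's neighbors are points 1 and 2
```

The data-dependent indices in the returned table are exactly what makes the subsequent gather irregular: which rows are touched is unknown until runtime.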

  19. Neighbor Search in conventional DNN
    9
    Pixels are inherently regular: a convolution's neighbors are just the
    surrounding window (e.g. the 3x3 patch around each pixel), so regular
    DNNs don't need explicit neighbor search. Points, in contrast, are
    irregularly scattered.

  26. 11
    Aggregation
    Inputs: the points' feature vectors and the neighbor index table from
    neighbor search (N), e.g.:
    P1: { P0, P2, P3, P4, P5, P6 }
    P8: { P5, P6, P7, P9, P10, P11 }
    Aggregation (A) gathers each neighbor's feature vector and subtracts
    the centroid's, yielding one feature matrix per centroid:
    P1: [ P0 - P1, P2 - P1, P3 - P1, P4 - P1, P5 - P1, P6 - P1 ]
    P8: [ P5 - P8, P6 - P8, P7 - P8, P9 - P8, P10 - P8, P11 - P8 ]


  31. 14
    Feature Computation
    MLP layers (MatMul → ReLU → MatMul → ReLU) are applied to each
    centroid's feature matrix (e.g. P1: [P0 - P1, P2 - P1, …, P6 - P1];
    P8: [P5 - P8, …, P11 - P8]), producing new feature matrices
    (P1': [P0', P2', …, P6']; P8': [P5', …, P11']). A reduction then
    collapses each new feature matrix into a single new feature vector
    (P1', P8').

  32. Point Cloud Network Layer
    15
    Points (M-dimension feature vectors: P1 … P6)
    → Neighbor Search → Neighbor Index Table, e.g. P1: {P2, P3},
      P5: {P3, P6} (P1, P5: centroid points; N neighbors, here N == 2)
    → Aggregation → Neighbor Feature Matrix (NFM): each centroid point has
      an N x M NFM, e.g. P1: [P2 - P1, P3 - P1]; P5: [P6 - P5, P3 - P5]
    → Feature Computation: MLP (MatMul 1 → ReLU → MatMul 2 → ReLU →
      MatMul 3 → ReLU) + Reduction (Max Pool) → M'-dimension features
      P1', P5'
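The layer just described (aggregation, then feature computation, then reduction) can be condensed into a small sketch. This is plain Python with toy 2-D features and identity weights, and a single MatMul + ReLU stands in for the three-layer MLP — all values are illustrative, not from the talk:

```python
def matmul(A, W):
    """Multiply each row of A (n x m) by weight matrix W (m x m')."""
    return [[sum(a[k] * W[k][j] for k in range(len(W)))
             for j in range(len(W[0]))] for a in A]

def relu(A):
    return [[max(0.0, x) for x in row] for row in A]

def layer(points, neighbors, W):
    """One layer in the original order: aggregation first, then feature
    computation on each N x M neighbor feature matrix, then a max-pool
    reduction per centroid."""
    out = {}
    for c, nbrs in neighbors.items():
        # Aggregation: gather neighbors and subtract the centroid
        nfm = [[pj - pi for pj, pi in zip(points[j], points[c])] for j in nbrs]
        feats = relu(matmul(nfm, W))                  # feature computation
        out[c] = [max(col) for col in zip(*feats)]    # reduction (max pool)
    return out

points = {1: [1.0, 2.0], 2: [2.0, 0.0], 3: [0.0, 3.0], 5: [3.0, 1.0], 6: [4.0, 2.0]}
neighbors = {1: [2, 3], 5: [3, 6]}        # N == 2, as in the slide
W = [[1.0, 0.0], [0.0, 1.0]]              # identity weights for readability
print(layer(points, neighbors, W))
```

Note that the MLP here runs once per (centroid, neighbor) pair: a point that appears in many neighborhoods is recomputed every time, which is the inefficiency the rest of the talk targets.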

  39. Can We Use Existing DNN Accelerators?
    16
    ▸ No. They are not enough.
    ▹ Neighbor Search
    ▹ Aggregation
    [Chart: Characterization of Point Cloud Networks — % execution time in
    Neighbor Search, Aggregation, Feature Computation, and Others for
    PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), and F-PointNet]
    ▹ Slow on high-end mobile GPUs:
    PointNet++: 132 ms
    F-PointNet: 141 ms
    DGCNN: 5200 ms


  50. Optimization
    18
    In each layer, the MLP runs on an NFM whose rows are differences
    (e.g. P3 - P1). Since MatMul is linear, it distributes over the
    subtraction in aggregation:
    MatMul(Kernel weights, P3 - P1)
      = MatMul(Kernel weights, P3) - MatMul(Kernel weights, P1)
    (each MatMul is followed by a ReLU in the MLP)
    Benefit:
    ▸ Effectively introduces reuse opportunities
      Each point is used ~30 times
      Reduces up to 90% of MAC (Multiply-Accumulate) operations
    ▸ Dependency elimination

  52. Approximation
    19
    Delaying aggregation past the full MLP + Reduction is only approximate,
    because ReLU is non-linear:
    ReLU(MatMul(W, P3 - P1)) ≈ ReLU(MatMul(W, P3)) - ReLU(MatMul(W, P1))
    The accuracy change ranges from -0.9% to +1.2%
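The identity behind the optimization, and why ReLU turns it into an approximation, can be checked numerically. This is a toy sketch; `W`, `p3`, and `p1` are made-up values standing in for kernel weights and point features:

```python
def matvec(W, x):
    """Multiply weight matrix W by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def sub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

W = [[0.5, -1.0], [2.0, 0.25]]     # hypothetical kernel weights
p3, p1 = [1.0, 2.0], [3.0, -1.0]   # hypothetical point features

# Exact: MatMul distributes over the subtraction in aggregation.
lhs = matvec(W, sub(p3, p1))
rhs = sub(matvec(W, p3), matvec(W, p1))
assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))

# Approximate once ReLU is involved: the two sides generally differ.
approx_lhs = relu(matvec(W, sub(p3, p1)))
approx_rhs = sub(relu(matvec(W, p3)), relu(matvec(W, p1)))
print(approx_lhs, approx_rhs)
```

The exact case is why the first MatMul can be hoisted before aggregation for free; pushing the whole ReLU-bearing MLP across is what costs the small accuracy change reported on the slide.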

  53. Delayed Aggregation
    20
    Points (M-dimension feature vectors: P1 … P6)
    → Neighbor Search → Neighbor Index Table
    → Feature Compute: MLP (MatMul 1 → ReLU → MatMul 2 → ReLU →
      MatMul 3 → ReLU) → Point Feature Table
    → Aggregation → Neighbor Feature Matrix (NFM) → Reduction
    → M'-dimension features P1', P5'
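A minimal sketch of this reordered layer: the MLP runs once per *point* to build a Point Feature Table, and aggregation plus reduction follow. Plain Python with toy values; a single MatMul + ReLU again stands in for the full MLP, and every number below is illustrative:

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def layer_delayed(points, neighbors, W):
    """Feature computation first: one MLP pass per point, stored in the
    Point Feature Table (PFT).  Each point's features are computed once
    and reused by every neighborhood that contains it."""
    pft = {i: relu(matvec(W, p)) for i, p in points.items()}
    out = {}
    for c, nbrs in neighbors.items():
        # Aggregation, delayed past the MLP: gather + subtract the centroid
        nfm = [[fj - fi for fj, fi in zip(pft[j], pft[c])] for j in nbrs]
        out[c] = [max(col) for col in zip(*nfm)]   # reduction (max)
    return out

points = {1: [1.0, 2.0], 2: [2.0, 0.0], 3: [0.0, 3.0], 5: [3.0, 1.0], 6: [4.0, 2.0]}
neighbors = {1: [2, 3], 5: [3, 6]}
W = [[1.0, 0.0], [0.0, 1.0]]   # identity weights for readability
print(layer_delayed(points, neighbors, W))
```

Compared with the original ordering, point P3 (shared by both neighborhoods here) goes through the MLP once instead of twice, and the MLP no longer depends on neighbor search finishing first.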

  54. Bottleneck Shift
    21
    [Chart: execution time distribution of PointNet++ (0-30 msec),
    Original vs. Delayed Aggr., split into Neighbor Search, Aggregation,
    and Feature Computation]


  57. Mesorasi: Point Cloud Acceleration Framework
    22
    Algorithm: Delayed Aggregation
    Hardware: Aggregation Acceleration

  58. Aggregation Operation
    23
    Neighbor Index Table (N neighbors per centroid):
    P1:  {P2, P3, P11, P23, P31, P39, …}
    P3:  {P5, P7, P16, P19, P38, P41, …}
    P99: {P16, P71, P96, P119, P128, P142, …}
    For each centroid (e.g. P1), the rows of its neighbors (P2, P3, P11,
    P23, P31, P39, …) are gathered from the Point Feature Table into a
    Neighbor Feature Matrix.

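With delayed aggregation, the aggregation operation reduces to a gather: indexing the Point Feature Table with a centroid's neighbor list. A toy sketch (the table contents and the 2-D feature width are invented; the slide's "…" entries are simply omitted):

```python
# Point Feature Table: one feature vector per point (toy 2-D features)
pft = {i: [float(i), float(i) / 2.0] for i in range(150)}

# Neighbor Index Table, as on the slide (truncated to one centroid)
nit = {1: [2, 3, 11, 23, 31, 39]}

def gather_nfm(centroid):
    """Build the centroid's Neighbor Feature Matrix by indexing the PFT
    with its neighbor list -- pure, irregular data movement."""
    return [pft[j] for j in nit[centroid]]

nfm = gather_nfm(1)
print(len(nfm))   # 6 rows, one per neighbor
```

Because this is data movement rather than arithmetic, it maps poorly onto a MAC array — which motivates the dedicated aggregation unit on the next slides.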

  70. Aggregation Unit
    25
    Neighbor Index Table (e.g. 1: {2, 17, 9, …}; 3: {8, 17, 6, …};
    6: {18, 25, 34, …}; …; 900: {832, 987, …})
    → Address Generation
    → Point Feature Table (PFT): B-ported, B-banked; no crossbar
      (Bank 1, Bank 2, Bank 3, … Bank B)
    → Shift Registers → MUX (store centroid's feature vector)
    → Sub → Reduction (Max) → Global Buffer
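Why a banked, crossbar-free PFT matters can be seen with a toy conflict model. Everything here — the bank count, the interleaved mapping, the one-read-per-bank-per-cycle rule — is an illustrative assumption, not the unit's actual design:

```python
from collections import Counter

B = 4   # assumed number of banks/ports

def bank_of(point_id):
    # Assume PFT rows are interleaved across banks; with no crossbar,
    # each port is hard-wired to one bank.
    return point_id % B

def cycles_to_gather(neighbor_ids):
    """Each bank serves one read per cycle, so reads mapping to the same
    bank serialize; the busiest bank sets the gather latency."""
    per_bank = Counter(bank_of(i) for i in neighbor_ids)
    return max(per_bank.values())

print(cycles_to_gather([2, 17, 9, 5]))   # ids spread across banks
print(cycles_to_gather([4, 8, 12, 16]))  # all map to bank 0: 4 cycles
```

In this model, neighbor lists whose indices spread evenly across banks finish in fewer cycles, which is why banking the PFT lets the unit sustain multiple gathers per cycle without a crossbar.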

  78. Overall Hardware Design
    26
    SoC: CPU + GPU (neighbor search) + DNN Accelerator (NPU), sharing DRAM.
    DRAM holds the input point cloud, MLP kernel weights, MLP intermediate
    activations, and the neighbor index table.
    NPU: Systolic MAC Unit Array, BN/ReLU/MaxPool, Global Buffers
    (Weights/FMaps), MCUs, plus Aggregation Logic (Neighbor Index Buffer,
    Point Feature Buffer, Reduction (Max)).
    With 3.8% area overhead to the NPU

  83. Experimental Setup
    27
    Three Point Cloud Applications:
    ▹ Object Classification, Object Segmentation, and Object Detection
    Datasets:
    ▹ ModelNet40, ShapeNet, and KITTI
    Models:
    ▹ Classification: PointNet++ (c), DGCNN (c), LDGCNN, DensePoint
    ▹ Segmentation: PointNet++ (s), DGCNN (s)
    ▹ Detection: F-PointNet
    GitHub:
    https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds

  84. Accuracy Comparison
    28
    [Chart: accuracy (%) of Original vs. Delayed Aggr. for PointNet++ (c),
    PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, and
    DensePoint]


  91. Speedup and Energy Saving on GPU
    29
    [Chart: speedup (0-2x) and energy saving (%) for PointNet++ (c),
    PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet, LDGCNN, DensePoint,
    and AVG.]
    Average speedup: 1.6x; average energy saving: 51.1%

  95. Hardware Performance Evaluation
    30
    Hardware Baseline:
    ▹ A generic NPU + GPU SoC.
    Variants:
    ▹ Mesorasi-SW: delayed-aggregation without AU support.
    ▹ Mesorasi-HW: delayed-aggregation with AU support.
    Implementation:
    ▹ 16x16 Systolic Array
    ▹ Synopsys synthesis, TSMC 16nm FinFET technology

  98. Speedup
    31
    [Chart: speedup of Mesorasi-SW and Mesorasi-HW over the baseline for
    PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet,
    LDGCNN, DensePoint, and AVG.; callouts at 2.2x and 3.6x]
    Average speedup: 1.3x (Mesorasi-SW), 1.9x (Mesorasi-HW)

  101. Energy Savings
    32
    [Chart: energy saving (%) of Mesorasi-SW and Mesorasi-HW for
    PointNet++ (c), PointNet++ (s), DGCNN (c), DGCNN (s), F-PointNet,
    LDGCNN, DensePoint, and AVG.]
    Average saving: 22% (Mesorasi-SW), 38% (Mesorasi-HW)

  104. Conclusion
    33
    ‣ Delayed-aggregation decouples neighbor search from feature
      computation and significantly reduces the overall workload.
    ‣ Hardware support further maximizes the effectiveness of
      delayed-aggregation.
    https://github.com/horizon-research/Efficient-Deep-Learning-for-Point-Clouds