
RTNN: Accelerating Neighbor Search Using Hardware Ray Tracing

Yuhao Zhu
March 28, 2022

A long talk for the PPoPP 2022 paper with the same title. The code is at: https://github.com/horizon-research/rtnn.

Transcript

  1. Yuhao Zhu
    https://github.com/horizon-research/rtnn
    University of Rochester
    Accelerating Neighbor Search Using Hardware Ray Tracing

  2. Game Plan
    Accelerating Neighbor Search Using Hardware Ray Tracing
    • What is ray tracing?

  3. Game Plan
    Accelerating Neighbor Search Using Hardware Ray Tracing
    • What is ray tracing?
    • How does hardware support ray tracing?

  4. Game Plan
    Accelerating Neighbor Search Using Hardware Ray Tracing
    • What is ray tracing?
    • How does hardware support ray tracing?
    • What is neighbor search?

  5. Game Plan
    Accelerating Neighbor Search Using Hardware Ray Tracing
    • What is ray tracing?
    • How does hardware support ray tracing?
    • What is neighbor search?
    • How to use hardware ray tracing to accelerate neighbor search?

  6. Game Plan
    Accelerating Neighbor Search Using Hardware Ray Tracing
    • What is ray tracing?

  7. (no text on this slide)

  8. 2D image
    cgarena.com

  9. Modeling → 3D mesh; 2D image
    cgarena.com

  10. Mesh (i.e., How Scene is Represented)
    Very informally: 3D piecewise linear approximation of arbitrary 3D surfaces
    free3d.com

  11. Mesh (i.e., How Scene is Represented)
    Very informally: 3D piecewise linear approximation of arbitrary 3D surfaces
    Quadrilateral mesh (free3d.com)

  12. Mesh (i.e., How Scene is Represented)
    Very informally: 3D piecewise linear approximation of arbitrary 3D surfaces
    Quadrilateral mesh (free3d.com)
    Triangular mesh (Valette, et al. [TVCG’08])

  13. Modeling → 3D mesh → Rendering → 2D image
    Lighting, camera, material, etc.
    cgarena.com

  14. Modeling → 3D mesh → Rendering → 2D image
    Lighting, camera, material, etc.
    Visibility, Shading
    cgarena.com

  15. Modeling → 3D mesh → Rendering → 2D image
    Lighting, camera, material, etc.
    Visibility, Shading
    cgarena.com
    Visibility Problem
    For each pixel in the image (to be rendered), which point in the scene (i.e., on the mesh) corresponds to it?

  16. Modeling → 3D mesh → Rendering → 2D image
    Lighting, camera, material, etc.
    Visibility, Shading
    cgarena.com

  17. Modeling → 3D mesh → Rendering → 2D image
    Lighting, camera, material, etc.
    Visibility, Shading
    cgarena.com
    * Usually, multiple rays are cast for each pixel

  18. Shading Problem
    What’s the color of the intersected scene point along the ray direction?
    Modeling → 3D mesh → Rendering → 2D image
    Lighting, camera, material, etc.
    Visibility, Shading
    cgarena.com
    * Usually, multiple rays are cast for each pixel

  19. Aside: Other Geometry Primitives
    Hair and fur are usually modeled using curves (e.g., Catmull–Rom spline).
    https://developer.nvidia.com/blog/optix-sdk-7-1/
    Points and spheres.
    https://www.sciencefocus.com/future-technology/notre-dame-how-faithfully-can-we-rebuild-the-cathedral-with-modern-tech/

  20. Ray-Scene Intersection
    • Goal: calculate the [x, y, z] coordinates of the closest hit between the ray and the mesh.
    • Why closest hit?
    [Figure: a ray hitting a triangular mesh at point [x, y, z], shown with x, y, z axes; mesh from Valette, et al. [TVCG’08]]

  21. Ray-Scene Intersection
    • Goal: calculate the [x, y, z] coordinates of the closest hit between the ray and the mesh.
    • Why closest hit?
    [Figure: a ray hitting a triangular mesh at point [x, y, z], shown with x, y, z axes; mesh from Valette, et al. [TVCG’08]]

  22. Exhaustive Search
    • The simplest solution:
    [Figure: a ray hitting the mesh at [x, y, z]; mesh from Valette, et al. [TVCG’08]]

  23. Exhaustive Search
    • The simplest solution:
      • iterate over all triangles
    [Figure: a ray hitting the mesh at [x, y, z]; mesh from Valette, et al. [TVCG’08]]

  24. Exhaustive Search
    • The simplest solution:
      • iterate over all triangles
      • test intersection for each triangle
    [Figure: a ray hitting the mesh at [x, y, z]; mesh from Valette, et al. [TVCG’08]]

  25. Exhaustive Search
    • The simplest solution:
      • iterate over all triangles
      • test intersection for each triangle
      • return the closest hit, if any
    [Figure: a ray hitting the mesh at [x, y, z]; mesh from Valette, et al. [TVCG’08]]

  26. Exhaustive Search
    • The simplest solution:
      • iterate over all triangles
      • test intersection for each triangle
      • return the closest hit, if any
    • Complexity:
    [Figure: a ray hitting the mesh at [x, y, z]; mesh from Valette, et al. [TVCG’08]]

  27. Exhaustive Search
    • The simplest solution:
      • iterate over all triangles
      • test intersection for each triangle
      • return the closest hit, if any
    • Complexity:
      • O(# of rays × # of triangles)
    [Figure: a ray hitting the mesh at [x, y, z]; mesh from Valette, et al. [TVCG’08]]

  28. Exhaustive Search
    • The simplest solution:
      • iterate over all triangles
      • test intersection for each triangle
      • return the closest hit, if any
    • Complexity:
      • O(# of rays × # of triangles)
    • Slow:
    [Figure: a ray hitting the mesh at [x, y, z]; mesh from Valette, et al. [TVCG’08]]

  29. Exhaustive Search
    • The simplest solution:
      • iterate over all triangles
      • test intersection for each triangle
      • return the closest hit, if any
    • Complexity:
      • O(# of rays × # of triangles)
    • Slow:
      • lots of triangles and lots of rays
    [Figure: a ray hitting the mesh at [x, y, z]; mesh from Valette, et al. [TVCG’08]]

  30. Exhaustive Search
    • The simplest solution:
      • iterate over all triangles
      • test intersection for each triangle
      • return the closest hit, if any
    • Complexity:
      • O(# of rays × # of triangles)
    • Slow:
      • lots of triangles and lots of rays
      • …and it’s recursive
    [Figure: a ray hitting the mesh at [x, y, z]; mesh from Valette, et al. [TVCG’08]]
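
The exhaustive search above can be sketched in a few lines of Python. This is only an illustration under assumptions: the slides do not say which ray-triangle test is used, so the sketch swaps in the well-known Möller–Trumbore test, and the names `ray_triangle` and `closest_hit` are made up for this example.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore test (an assumed choice); returns hit distance t or None."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:              # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv             # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv     # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None   # hits behind the origin do not count

def closest_hit(origin, direction, triangles):
    """Iterate over all triangles and keep the smallest t: O(# of triangles) per ray."""
    best_t, best_i = math.inf, None
    for i, tri in enumerate(triangles):
        t = ray_triangle(origin, direction, tri)
        if t is not None and t < best_t:
            best_t, best_i = t, i
    if best_i is None:
        return None
    point = tuple(origin[k] + best_t * direction[k] for k in range(3))
    return best_i, point
```

Calling `closest_hit` once per ray gives the O(# of rays × # of triangles) cost from the slide; keeping only the smallest `t` is what makes the result the closest hit rather than any hit.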

  31. Aside: Why Recursive Ray Tracing?
    • To implement realistic shading.
    [Figure: mesh from Valette, et al. [TVCG’08], with a “Color?” label]

  32. Aside: Why Recursive Ray Tracing?
    • To implement realistic shading.
    • The color* of an exiting ray depends on the colors* of all incident rays.
    [Figure: mesh from Valette, et al. [TVCG’08], with four “Color?” labels]

  33. Aside: Why Recursive Ray Tracing?
    • To implement realistic shading.
    • The color* of an exiting ray depends on the colors* of all incident rays.
      • color* should technically be radiance; not important for our discussion here.
    [Figure: mesh from Valette, et al. [TVCG’08], with four “Color?” labels]

  34. Aside: Why Recursive Ray Tracing?
    • To implement realistic shading.
    • The color* of an exiting ray depends on the colors* of all incident rays.
      • color* should technically be radiance; not important for our discussion here.
      • also depends on the surface material (diffuse vs. specular vs. …); not important for our discussion here.
    [Figure: mesh from Valette, et al. [TVCG’08], with four “Color?” labels]

  35. Aside: Why Recursive Ray Tracing?
    • To implement realistic shading.
    • The color* of an exiting ray depends on the colors* of all incident rays.
      • color* should technically be radiance; not important for our discussion here.
      • also depends on the surface material (diffuse vs. specular vs. …); not important for our discussion here.
    • How do we know the color of an incident ray? Cast more rays!
    [Figure: mesh from Valette, et al. [TVCG’08], with four “Color?” labels]

  36. Aside: Why Recursive Ray Tracing?
    • To implement realistic shading.
    • The color* of an exiting ray depends on the colors* of all incident rays.
      • color* should technically be radiance; not important for our discussion here.
      • also depends on the surface material (diffuse vs. specular vs. …); not important for our discussion here.
    • How do we know the color of an incident ray? Cast more rays!
    [Figure: mesh from Valette, et al. [TVCG’08], with three “Secondary Ray” labels]

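  37. Aside: Rendering Equation
    https://en.wikipedia.org/wiki/Rendering_equation
    $$L_o(x, \omega_o) = \int_{\Omega} f_r(x, \omega_o, \omega_i) \, L_i(x, \omega_i) \cos\theta \, d\omega_i$$
    • $L_o(x, \omega_o)$: “color” of exiting ray $\omega_o$
    • $L_i(x, \omega_i)$: “color” of incident ray $\omega_i$
    • $f_r(x, \omega_o, \omega_i)$: “transfer function”
    • $\int_{\Omega}$: integrate incident rays over the hemisphere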
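
As a small numeric illustration of the integral above (not something shown in the talk), the sketch below estimates it by averaging over random incident directions on the hemisphere. Everything here is an assumption made for the example: the function name `outgoing_radiance`, the uniform hemisphere sampling, the surface normal fixed to +z, and the fact that `f_r` and `L_i` are passed as functions of the incident direction only.

```python
import math
import random

def outgoing_radiance(f_r, L_i, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the hemisphere integral of f_r * L_i * cos(theta).

    f_r and L_i are callables of the incident direction w_i (hypothetical
    signatures); the normal is +z, so cos(theta) is the z component of w_i.
    """
    rng = random.Random(seed)
    pdf = 1.0 / (2.0 * math.pi)            # uniform density over the hemisphere
    total = 0.0
    for _ in range(n_samples):
        cos_theta = rng.random()           # uniform hemisphere: cos(theta) ~ U[0, 1)
        phi = 2.0 * math.pi * rng.random()
        sin_theta = math.sqrt(1.0 - cos_theta * cos_theta)
        w_i = (sin_theta * math.cos(phi), sin_theta * math.sin(phi), cos_theta)
        total += f_r(w_i) * L_i(w_i) * cos_theta / pdf
    return total / n_samples
```

For a diffuse surface with `f_r = albedo / pi` and constant incident radiance `L`, the integral has the closed form `albedo * L`, which gives a quick sanity check. In a real renderer `L_i` is not known up front; evaluating it is exactly where the secondary rays from the previous slides come in.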

  38. Speeding Up Ray-Triangle Intersection Test
    • Prune the search space.
      • Only search the part of the scene that does intersect the ray.
    intersect(space, ray) {
      if ray doesn’t intersect space boundary:
        return
      else:
        foreach subspace in space
          if (subspace != empty)
            intersect(subspace, ray)
    }

  39. Speeding Up Ray-Triangle Intersection Test
    • Prune the search space.
      • Only search the part of the scene that does intersect the ray.
    intersect(space, ray) {
      if ray doesn’t intersect space boundary:
        return
      else:
        foreach subspace in space
          if (subspace != empty)
            intersect(subspace, ray)
    }

  40. Speeding Up Ray-Triangle Intersection Test
    • Prune the search space.
      • Only search the part of the scene that does intersect the ray.
    intersect(space, ray) {
      if ray doesn’t intersect space boundary:
        return
      else:
        foreach subspace in space
          if (subspace != empty)
            intersect(subspace, ray)
    }

  41. Speeding Up Ray-Triangle Intersection Test
    • Prune the search space.
      • Only search the part of the scene that does intersect the ray.
    • Key: how to partition the space?
    intersect(space, ray) {
      if ray doesn’t intersect space boundary:
        return
      else:
        foreach subspace in space
          if (subspace != empty)
            intersect(subspace, ray)
    }

  42. Space Partition vs. Object Partition
    Space partition: one object could be in different partitions
    Object partition: different partitions could overlap in space

  43. Space Partitioning Data Structures
    Uniform Grid
    Quadtree (or Octree in 3D)

  44. Space Partitioning Data Structures
    K-d Tree
    Binary Space Partitioning Tree

  45. Bounding Volume Hierarchy (Object Partition)
    [Figure: a scene with primitives 1, 2, 3, 4, next to its BVH tree]

  46. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volume C, next to its BVH tree]

  47. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volume C; BVH tree so far: C → 1]

  48. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volumes C, D; BVH tree so far: C → 1]

  49. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volumes C, D; BVH tree so far: C → 1; D → 2, 3]

  50. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volumes B, C, D; BVH tree so far: C → 1; D → 2, 3]

  51. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volumes B, C, D; BVH tree so far: B → C, D; C → 1; D → 2, 3]

  52. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volumes B, C, D, E; BVH tree so far: B → C, D; C → 1; D → 2, 3]

  53. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volumes B, C, D, E; BVH tree so far: B → C, D; C → 1; D → 2, 3; E → 4]

  54. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volumes A, B, C, D, E; BVH tree so far: B → C, D; C → 1; D → 2, 3; E → 4]

  55. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volumes A, B, C, D, E; BVH tree: A → B, E; B → C, D; C → 1; D → 2, 3; E → 4]

  56. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volumes A, B, C, D, E; BVH tree: A → B, E; B → C, D; C → 1; D → 2, 3; E → 4]
    Tree labels: root, interior node, leaf node, primitive

  57. Bounding Volume Hierarchy (Object Partition)
    [Figure: scene with primitives 1-4 and bounding volumes A, B, C, D, E; BVH tree: A → B, E; B → C, D; C → 1; D → 2, 3; E → 4]
    Tree labels: root, interior node, leaf node, primitive
    • A, B, C, D, E are the bounding volumes, which are Axis-Aligned Bounding Boxes (AABBs) here. Other (irregular) bounding volumes are possible.
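
The slides show what a BVH looks like but not how it is built. As one possible illustration, the sketch below builds a binary BVH top-down by splitting primitives at the median along the longest axis. That strategy is an assumption chosen for brevity, not the builder used in the talk or in any GPU driver, and the names `Node`, `union`, and `build` are hypothetical. Primitives are represented only by their AABBs.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec = Tuple[float, float, float]
Box = Tuple[Vec, Vec]          # (min corner, max corner)

def union(a: Box, b: Box) -> Box:
    """Smallest AABB that encloses both a and b."""
    return (tuple(min(a[0][k], b[0][k]) for k in range(3)),
            tuple(max(a[1][k], b[1][k]) for k in range(3)))

@dataclass
class Node:
    box: Box
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prims: List[int] = field(default_factory=list)   # non-empty only in leaf nodes

def build(boxes: List[Box], ids: List[int], leaf_size: int = 2) -> Node:
    """Top-down median split (assumed strategy); ids index into boxes."""
    box = boxes[ids[0]]
    for i in ids[1:]:
        box = union(box, boxes[i])
    if len(ids) <= leaf_size:
        return Node(box, prims=list(ids))            # leaf node holding primitives
    # Split along the axis where the enclosing box is longest.
    axis = max(range(3), key=lambda k: box[1][k] - box[0][k])
    # Sort by (twice) the centroid coordinate on that axis.
    ordered = sorted(ids, key=lambda i: boxes[i][0][axis] + boxes[i][1][axis])
    mid = len(ordered) // 2
    return Node(box,
                left=build(boxes, ordered[:mid], leaf_size),
                right=build(boxes, ordered[mid:], leaf_size))
```

Because this is an object partition, every primitive lands in exactly one leaf, while the boxes of sibling nodes are free to overlap in space.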

  58. Intersection Test Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and a ray, next to the BVH tree]
    Ray-AABB Intersection Test
    Current: A
    Stack: (empty)
    ClosestHit = NA

  59. Intersection Test Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and a ray, next to the BVH tree]
    Ray-AABB Intersection Test
    Current: B
    Stack: E
    ClosestHit = NA

  60. Intersection Test Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and a ray, next to the BVH tree]
    Ray-AABB Intersection Test
    Current: C
    Stack: E, D
    ClosestHit = NA

  61. Intersection Test Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and a ray, next to the BVH tree]
    Ray-AABB Intersection Test
    Current: D
    Stack: E
    ClosestHit = NA

  62. Intersection Test Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and a ray, next to the BVH tree]
    Ray-Triangle Intersection Test
    Current: 2, 3
    Stack: E
    ClosestHit = NA

  63. Intersection Test Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and a ray, next to the BVH tree]
    Ray-AABB Intersection Test
    Current: E
    Stack: (empty)
    ClosestHit = 2

  64. Intersection Test Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and a ray, next to the BVH tree]
    Ray-AABB Intersection Test
    Current: E
    Stack: (empty)
    ClosestHit = 2
    Distance to E > Distance to 2; Stop!
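
The walk-through above (current node, explicit stack, closest hit so far) can be sketched as follows. This is an illustration under assumptions, not the code behind the slides: primitives are spheres rather than triangles to keep the primitive test short, the direction is assumed to be unit length, and `Node`, `ray_aabb`, `ray_sphere`, and `closest_hit` are hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import List

def ray_aabb(o, d, lo, hi, t_min, t_max):
    """Slab test: entry distance if [t_min, t_max] overlaps the box, else None."""
    for k in range(3):
        if d[k] == 0.0:                      # ray parallel to this pair of slabs
            if o[k] < lo[k] or o[k] > hi[k]:
                return None
            continue
        t0, t1 = (lo[k] - o[k]) / d[k], (hi[k] - o[k]) / d[k]
        if t0 > t1:
            t0, t1 = t1, t0
        t_min, t_max = max(t_min, t0), min(t_max, t1)
        if t_min > t_max:
            return None
    return t_min

def ray_sphere(o, d, center, radius):
    """Entry distance for a ray that starts outside the sphere (unit d), or None."""
    oc = [o[k] - center[k] for k in range(3)]
    b = sum(oc[k] * d[k] for k in range(3))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 0.0 else None

@dataclass
class Node:
    lo: tuple
    hi: tuple
    children: List["Node"] = field(default_factory=list)
    prims: List[int] = field(default_factory=list)   # sphere indices (leaf nodes)

def closest_hit(root, spheres, o, d):
    """Stack-based traversal; returns (primitive id, distance, primitive tests run)."""
    best_t, best_id, tests = math.inf, None, 0
    stack = [root]
    while stack:
        node = stack.pop()
        # Using best_t as t_max skips boxes that start beyond the closest hit so far.
        if ray_aabb(o, d, node.lo, node.hi, 0.0, best_t) is None:
            continue
        for i in node.prims:
            tests += 1
            t = ray_sphere(o, d, *spheres[i])
            if t is not None and t < best_t:
                best_t, best_id = t, i
        stack.extend(reversed(node.children))        # first child is visited first
    return best_id, best_t, tests
```

Passing the current closest distance as `t_max` to the box test is what reproduces the last step of the walk-through: a box whose entry point lies farther away than the closest hit found so far is rejected without testing its primitives.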

  65. Ray-AABB Intersection
    Ray: O + tD, tmin ≤ t ≤ tmax
    [Figure: a ray with origin O and direction D against an AABB, with tmin, thit, and tmax marked along it]

  66. A Subtle but Critical Case
    Ray: O + tD, tmin ≤ t ≤ tmax
    [Figure: a ray with origin O and direction D against an AABB, with tmin, thit, and tmax marked along it]

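  67. A Subtle but Critical Case
    Ray: O + tD, tmin ≤ t ≤ tmax
    [Figure: a ray with origin O and direction D against an AABB, with tmin, thit, and tmax marked along it, plus a second segment with its own tmin and tmax]
    Should this be counted as a hit?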

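  68. A Subtle but Critical Case
    Ray: O + tD, tmin ≤ t ≤ tmax
    [Figure: a ray with origin O and direction D against an AABB, with tmin, thit, and tmax marked along it, plus a second segment with its own tmin and tmax]
    Should this be counted as a hit?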

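  69. A Subtle but Critical Case
    Ray: O + tD, tmin ≤ t ≤ tmax
    [Figure: a ray with origin O and direction D against an AABB, with tmin, thit, and tmax marked along it, plus a second segment with its own tmin and tmax]
    Should this be counted as a hit?
    Yes; any ray segment that’s completely inside an AABB must be treated as intersecting.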
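
One common way to implement the ray-AABB test is the slab method, sketched below as an illustration (the slides do not say which test the hardware uses, and the name `ray_aabb_hit` is made up). It clips the interval [tmin, tmax] against the three pairs of axis-aligned planes and reports a hit whenever some part of the interval remains, which covers the subtle case: a segment that lies completely inside the box never crosses its boundary but still counts as intersecting.

```python
def ray_aabb_hit(o, d, lo, hi, t_min, t_max):
    """True if the segment O + tD, t_min <= t <= t_max, overlaps the box [lo, hi]."""
    for k in range(3):
        if d[k] == 0.0:
            # Parallel to this pair of slabs: the origin must lie between them.
            if not (lo[k] <= o[k] <= hi[k]):
                return False
            continue
        t0 = (lo[k] - o[k]) / d[k]
        t1 = (hi[k] - o[k]) / d[k]
        if t0 > t1:
            t0, t1 = t1, t0
        # Clip the running interval to where the ray is inside slab k.
        t_min = max(t_min, t0)
        t_max = min(t_max, t1)
        if t_min > t_max:
            return False
    return True


box_lo, box_hi = (0.0, 0.0, 0.0), (10.0, 10.0, 10.0)
along_x = (1.0, 0.0, 0.0)

crosses = ray_aabb_hit((-5.0, 5.0, 5.0), along_x, box_lo, box_hi, 0.0, 100.0)
too_short = ray_aabb_hit((-5.0, 5.0, 5.0), along_x, box_lo, box_hi, 0.0, 2.0)
inside = ray_aabb_hit((4.0, 5.0, 5.0), along_x, box_lo, box_hi, 0.0, 2.0)
```

Here `crosses` and `inside` are hits, while `too_short` is a miss because the segment ends before it reaches the box.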

  70. Aside: Two Terminology Confusions
    • Ray casting vs. ray tracing
      • Technically, finding the intersection of one ray and the scene is called ray casting.
      • Ray tracing refers to recursive ray casting.
    • Acceleration structures
      • Data structures that help speed up ray tracing are called “acceleration structures” (e.g., BVH), not to be confused with hardware accelerators.

  71. Game Plan
    Accelerating Neighbor Search Using Hardware Ray Tracing
    • What is ray tracing?
    • How does hardware support ray tracing?

  72. Ray Tracing on GPUs Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and three rays]

  73. Ray Tracing on GPUs Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and three rays]
    • Build the BVH.

  74. Ray Tracing on GPUs Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and three rays]
    • Build the BVH.
    • For each ray (thread):
      • Traverse the BVH (manage local stack)
      • Ray-AABB intersection test
      • Ray-primitive intersection test
      • Execute a shading algorithm

  75. Ray Tracing on GPUs Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and three rays]
    • Build the BVH.
    • For each ray (thread):
      • Traverse the BVH (manage local stack)
      • Ray-AABB intersection test
      • Ray-primitive intersection test
      • Execute a shading algorithm
    • Prior to OptiX (2010)
      • Manually implemented in CUDA.

  76. Ray Tracing on GPUs Using BVH
    [Figure: scene with primitives 1-4, AABBs A-E, and three rays]
    • Build the BVH.
    • For each ray (thread):
      • Traverse the BVH (manage local stack)
      • Ray-AABB intersection test
      • Ray-primitive intersection test
      • Execute a shading algorithm
    • Prior to OptiX (2010)
      • Manually implemented in CUDA.
    Annotations on the list: Fixed-function, ~Fixed-function

  77. Ray Tracing in OptiX and Turing GPU
    • OptiX (2010): a ray tracing-specific programming model.
      • Provides a generic ray tracing pipeline.
      • Some pipeline stages are programmable; others are fixed functions.
    [Figure: first page of the OptiX paper]
    Parker, S., Bigler, J., Dietrich, A., Friedrich, H., Hoberock, J., Luebke, D., McAllister, D., McGuire, M., Morley, K., Robison, A., Stich, M. 2010. OptiX: A General Purpose Ray Tracing Engine. ACM Trans. Graph. 29, 4, Article 66 (July 2010), 13 pages. http://doi.acm.org/10.1145/1778765.1778803

  79. Ray Tracing in OptiX and Turing GPU
    • OptiX (2010): a ray tracing-specific programming model.
    • Provides a generic ray tracing pipeline.
    • Some pipeline stages are programmable; others are fixed functions.
    • Prior to Turing architecture (2018):
      • Everything runs on CUDA cores.
    • Turing architecture:
      • RT Cores accelerate fixed-function stages.
      • Programmable stages on the CUDA cores.
    Reference: Parker, S., Bigler, J., Dietrich, A., Friedrich, H., Hoberock, J., Luebke, D., McAllister, D., McGuire, M., Morley, K., Robison, A., Stich, M. 2010. OptiX™: A General Purpose Ray Tracing Engine. ACM Trans. Graph. 29, 4, Article 66 (July 2010), 13 pages. DOI 10.1145/1778765.1778803

  80. An SM in Turing GPU
    [Figure: block diagram of a Turing streaming multiprocessor (SM).]
    Image source: https://wccftech.com/nvidia-turing-gpu-architecture-geforce-rtx-graphics-cards-detailed/

  90. OptiX Programming Model
    [Flowchart: Construct BVH → Ray Generation (RG) Shader → BVH Traversal + Ray-AABB Test (TL). Enter leaf node → Intersection (IS) Shader → "Ray primitive intersect?": Yes → Any-Hit (AH) Shader; No → back to traversal. Traversal completes → "Found a hit?" → Closest-Hit (CH) Shader or Miss Shader. Earlier builds of the slide also show a BVH with nodes A to E over primitives 1 to 4.]
    • “Shaders” are user-defined functions executing on CUDA cores.
    • Allows custom primitives (not just triangles).
    • Fixed functions executed on the RT cores.
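The control flow in this flowchart can be emulated in plain Python to make the division of labor concrete. This is only a software sketch under simplifying assumptions, not OptiX code: the names `Node`, `ray_hits_aabb`, `intersection_shader`, and `trace` are hypothetical, the BVH is built by hand, the any-hit stage accepts every hit, and rays starting inside a sphere are ignored.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec = Tuple[float, float, float]

@dataclass
class Node:
    lo: Vec                                  # AABB min corner
    hi: Vec                                  # AABB max corner
    children: List["Node"] = field(default_factory=list)
    prim: Optional[int] = None               # primitive ID for leaf nodes

def ray_hits_aabb(o: Vec, d: Vec, t_max: float, lo: Vec, hi: Vec) -> bool:
    """Slab test for the segment o + t*d with t in [0, t_max]."""
    t0, t1 = 0.0, t_max
    for k in range(3):
        if d[k] == 0.0:
            if not lo[k] <= o[k] <= hi[k]:
                return False
        else:
            a, b = (lo[k] - o[k]) / d[k], (hi[k] - o[k]) / d[k]
            t0, t1 = max(t0, min(a, b)), min(t1, max(a, b))
            if t0 > t1:
                return False
    return True

# Custom primitives (hypothetical data): spheres as (center, radius), not triangles.
SPHERES = [((2.0, 0.0, 0.0), 0.5), ((5.0, 0.0, 0.0), 0.5)]

def intersection_shader(prim: int, o: Vec, d: Vec) -> Optional[float]:
    """IS shader stand-in: ray-sphere test; d must be unit length."""
    c, r = SPHERES[prim]
    oc = [o[k] - c[k] for k in range(3)]
    b = sum(oc[k] * d[k] for k in range(3))
    disc = b * b - (sum(x * x for x in oc) - r * r)
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)                 # nearer of the two roots
    return t if t >= 0.0 else None

def trace(root: Node, o: Vec, d: Vec, t_max: float):
    hits = []
    stack = [root]
    while stack:                             # TL: traversal + ray-AABB test
        node = stack.pop()
        if not ray_hits_aabb(o, d, t_max, node.lo, node.hi):
            continue
        if node.prim is None:
            stack.extend(node.children)
            continue
        t = intersection_shader(node.prim, o, d)   # entered a leaf node
        if t is not None and t <= t_max:
            hits.append((t, node.prim))      # AH stand-in: accept every hit
    if hits:                                 # traversal completes: found a hit?
        t, prim = min(hits)
        return ("hit", prim, t)              # CH shader stand-in
    return ("miss",)                         # Miss shader stand-in

leaf0 = Node((1.5, -0.5, -0.5), (2.5, 0.5, 0.5), prim=0)
leaf1 = Node((4.5, -0.5, -0.5), (5.5, 0.5, 0.5), prim=1)
root = Node((1.5, -0.5, -0.5), (5.5, 0.5, 0.5), children=[leaf0, leaf1])

print(trace(root, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 100.0))  # reaches sphere 0 first
print(trace(root, (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), 100.0))  # misses everything
```

In the hardware pipeline the `while` loop is the part taken over by the RT cores, while the shader stand-ins correspond to user code on the CUDA cores.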

  91. OptiX Programming Model
    [Diagram: the same pipeline (Construct BVH, RG Shader, TL, IS Shader, AH Shader, CH Shader, Miss Shader) repeated side by side. Labels: "One Single CUDA Kernel", "CUDA Threads", "OptiX Rays".]

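  92. Life of an OptiX Ray
    [Diagram: a BVH with nodes A to E over primitives 1 to 4, and the timeline of one ray.]
    Order of execution: RG → TL (A, B, D) → IS (2, 3) → TL (C, E) → CH
    • CUDA Cores: RG, IS (2, 3), CH
    • RT Cores: TL (A, B, D), TL (C, E)
    Think of RT cores as special function units for BVH traversal.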

  93. Aside: Other Notable Ray Tracing Engines
    • Intel OSPRay
      • Won 2020 Oscar for Scientific and Technical Achievement.
      • Built on Intel Embree, a collection of ray tracing kernels, which uses Intel Implicit SPMD Program Compiler (ISPC) for explicit vectorization.
    • PBRT
      • Pedagogical engine.
      • The book won 2014 Oscar for Scientific and Technical Achievement.

  94. Game Plan
    Accelerating Neighbor Search Using Hardware Ray Tracing
    • What is ray tracing?
    • How does hardware support ray tracing?
    • What is neighbor search?
      • How does it relate to ray tracing?

  100. Neighbor Search
    [Figure: range search, showing a query, its radius r, and the search points; KNN search, showing the 2 nearest neighbors of a query.]
    Range Search
    • Usually also limits the total # of neighbors:
      • practical memory constraint,
      • downstream algorithms expect a fixed # of neighbors.
    rangeSearch(query, points, range, K)
    Return any K points that are within range of query
    KNN Search
    • Usually also limits the range of neighbors:
      • neighbors too far away are of no significance (e.g., force from a remote particle).
    KNN(query, points, range, K)
    Return K nearest points that are within range of query
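The two interfaces can be pinned down with a brute-force reference. This is a minimal sketch of the semantics only, not the accelerated method from the talk; the function names `range_search` and `knn_search` and the choice to return point indices are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

def range_search(query: Point, points: Sequence[Point], r: float, k: int) -> List[int]:
    """Return indices of any (up to) k points within distance r of query."""
    found = []
    for i, p in enumerate(points):
        if math.dist(query, p) <= r:
            found.append(i)
            if len(found) == k:          # bounded result size: stop early
                break
    return found

def knn_search(query: Point, points: Sequence[Point], r: float, k: int) -> List[int]:
    """Return indices of the k nearest points within distance r, nearest first."""
    in_range = []
    for i, p in enumerate(points):
        dist = math.dist(query, p)
        if dist <= r:                    # bounded range: ignore remote points
            in_range.append((dist, i))
    return [i for _, i in sorted(in_range)[:k]]

pts = [(3.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
origin = (0.0, 0.0, 0.0)
print(range_search(origin, pts, 5.0, 2))   # any two points in range
print(knn_search(origin, pts, 5.0, 2))     # the two nearest points in range
```

The difference shows up in the output: range search may return the first K points it encounters, while KNN has to rank every candidate in range.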

  101. Our Focus: Low-Dimensional Search
    • Low dimension: <= 3D.
    • Prevalent in science and engineering fields (e.g., computational fluid dynamics, graphics, vision).
      • They deal with physical data (e.g., particles, surface samples) that are inherently 2D/3D.
    • High-dimensional search is a completely different game.
      • “Curse of dimensionality” means we need different algorithms and distance metrics.

  103. Turn the Problem Around
    [Figure: query Q and radius r among the search points.]
    • Find all points within r from Q
    • Find whether Q is within r from other points

  107. Point-in-Sphere Test
    [Figure: query Q, with a sphere and its enclosing AABB around each search point.]
    • Is Q in the AABB? (Prunes remote points)
    • If so, is Q in the sphere?

  112. Point-in-AABB Test
    [Figure: query Q inside an AABB of width 2r, with a short ray from Q to Q’.]
    • Recall: any ray that’s within an AABB must be treated as intersecting.
    • Idea: generate a short ray from Q and (ask the RT cores to) perform the ray-AABB test.
      • The ray has an arbitrary direction and a very small length.
      • Why a very small ray length?
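The short-ray trick can be checked with a software ray-AABB (slab) test. The sketch below is an illustration under assumptions: `segment_hits_aabb` and `aabb_around` are hypothetical helpers, and the ray length `1e-16` is just an example of "very small". One plausible reading of the slide's closing question is that a long ray would also hit AABBs that do not enclose Q, which the last line demonstrates.

```python
from typing import Tuple

Vec = Tuple[float, float, float]

def segment_hits_aabb(o: Vec, d: Vec, t_max: float, lo: Vec, hi: Vec) -> bool:
    """Slab test for the segment o + t*d with t in [0, t_max]."""
    t0, t1 = 0.0, t_max
    for k in range(3):
        if d[k] == 0.0:
            if not lo[k] <= o[k] <= hi[k]:
                return False
        else:
            a, b = (lo[k] - o[k]) / d[k], (hi[k] - o[k]) / d[k]
            t0, t1 = max(t0, min(a, b)), min(t1, max(a, b))
            if t0 > t1:
                return False
    return True

def aabb_around(p: Vec, r: float) -> Tuple[Vec, Vec]:
    """AABB of width 2r centered at p."""
    return tuple(c - r for c in p), tuple(c + r for c in p)

Q = (0.0, 0.0, 0.0)
direction = (1.0, 0.0, 0.0)                            # arbitrary direction
near_lo, near_hi = aabb_around((0.5, 0.0, 0.0), 1.0)   # encloses Q
far_lo, far_hi = aabb_around((3.0, 0.0, 0.0), 1.0)     # does not enclose Q

SHORT, LONG = 1e-16, 10.0                              # assumed example lengths
short_near = segment_hits_aabb(Q, direction, SHORT, near_lo, near_hi)
short_far = segment_hits_aabb(Q, direction, SHORT, far_lo, far_hi)
long_far = segment_hits_aabb(Q, direction, LONG, far_lo, far_hi)

print(short_near)  # short ray starting inside the AABB: reported as intersecting
print(short_far)   # short ray starting outside the AABB: pruned
print(long_far)    # long ray: hits an AABB that does not enclose Q
```

A query lying outside an AABB but within the ray length of its boundary can still register a hit; with a tiny length this is rare, and the later sphere test would filter it out anyway.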

  113. Game Plan
    Accelerating Neighbor Search Using Hardware Ray Tracing
    • What is ray tracing?
    • How does hardware support ray tracing?
    • What is neighbor search?
    • How to use hardware ray tracing to accelerate neighbor search?

  122. Overall Idea
    rangeSearch(query, points, r, K)
    • Create an AABB of width 2r for every point.
    • Construct a BVH from the AABBs. (No control; hidden behind the OptiX APIs and most likely done in hardware.)
      • Use spheres as primitives, not triangles.
    • Generate a ray for each query. (RG Shader)
    • Traverse the BVH; skip AABBs that do not enclose the query. (No control; done in hardware.)
    • At leaf nodes: calculate the distance and collect neighbors. (IS Shader)
    Image source: https://forums.developer.nvidia.com/t/bvh-building-algorithm-and-primitive-order/182231/8
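The steps of this slide can be mirrored in a few lines of Python. This is a hedged sketch, not the RTNN implementation: there is no BVH (a linear scan over the AABBs stands in for the hardware traversal), the names `build_aabbs`, `point_in_aabb`, and `range_search_rt` are hypothetical, and stopping once K neighbors are collected is an assumption about how the K limit is enforced.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]

def build_aabbs(points: Sequence[Point], r: float) -> List[Box]:
    """One AABB of width 2r centered at every search point."""
    return [(tuple(c - r for c in p), tuple(c + r for c in p)) for p in points]

def point_in_aabb(q: Point, box: Box) -> bool:
    lo, hi = box
    return all(l <= c <= h for l, c, h in zip(lo, q, hi))

def range_search_rt(query: Point, points: Sequence[Point], r: float, k: int):
    """Return (neighbor indices, number of IS-shader-like calls)."""
    aabbs = build_aabbs(points, r)
    neighbors, is_calls = [], 0
    for i, box in enumerate(aabbs):
        if not point_in_aabb(query, box):    # stand-in for hardware BVH traversal
            continue
        is_calls += 1                        # leaf reached: "IS shader" runs
        if math.dist(query, points[i]) <= r: # is the query inside the sphere?
            neighbors.append(i)
            if len(neighbors) == k:          # assumption: stop once K are found
                break
    return neighbors, is_calls

search_points = [(0.9, 0.9, 0.9), (0.5, 0.0, 0.0), (5.0, 5.0, 5.0)]
result, calls = range_search_rt((0.0, 0.0, 0.0), search_points, 1.0, 8)
print(result, calls)
```

In this example the first point passes the AABB test but fails the sphere test, the second is a true neighbor, and the third is pruned before any distance is computed.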

  123. Another Perspective: Point-in-Sphere Test
    [Flowchart: the OptiX pipeline (Construct BVH, RG Shader, TL, IS Shader, AH Shader, CH Shader, Miss Shader), annotated with the two steps of the test.]
    • BVH Traversal + Ray-AABB Test (TL): Is Q in the AABB? (Prunes remote points)
    • Intersection (IS) Shader: If so, is Q in the sphere?

  127. Problem: Control Flow Divergence
    OptiX groups every 32 adjacent rays into a warp.

  136. Idea: Order Queries Spatially
    [Figure: items labeled 1 to 8.]
    • Intuition: group spatially close queries together so that their rays follow similar traversal paths.
      • Improving ray coherence in graphics parlance.
    • How? A simple heuristic: queries enclosed by the same AABB are spatially close.
    • A query might be enclosed by many AABBs, but any AABB will do.
    • How to find one? Cast a ray and immediately terminate the ray once the first IS shader is called.
      • optixTerminateRay()
      • Effectively returning ID (key) of the first enclosing leaf AABB.
    • Then sort by key.

  137. Search Algorithm (So Far)
    [Figure: items labeled 1 to 8.]
    bvh ← buildBVH(points, r);
    firstHitAABBs ← traceRays(bvh, queries);
    reorderQueries(queries, firstHitAABBs);
    traceRays(bvh, queries);
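The reordering step can be sketched in Python as follows. This is an illustration under assumptions rather than the talk's code: `first_enclosing_aabb` scans the points linearly, whereas on the GPU the "first" AABB depends on the BVH traversal order, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

def first_enclosing_aabb(q: Point, points: Sequence[Point], r: float) -> int:
    """Key of a query: index of the first width-2r AABB that encloses it, or -1."""
    for i, p in enumerate(points):
        if all(abs(qc - pc) <= r for qc, pc in zip(q, p)):
            return i                     # analogous to terminating the ray here
    return -1

def reorder_queries(queries: Sequence[Point], points: Sequence[Point], r: float):
    """Sort queries by key so that neighboring rays traverse similar paths."""
    keys = [first_enclosing_aabb(q, points, r) for q in queries]
    order = sorted(range(len(queries)), key=lambda i: keys[i])  # stable sort
    return [queries[i] for i in order], order

search_pts = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
qs = [(10.2, 0.0, 0.0), (0.1, 0.0, 0.0), (9.9, 0.0, 0.0), (-0.3, 0.0, 0.0)]
sorted_qs, order = reorder_queries(qs, search_pts, 1.0)
print(order)       # queries near the first point come first
print(sorted_qs)
```

After sorting, queries that share a key sit next to each other, so consecutive rays are more likely to land in the same warp and take the same path.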

  138. Problem: Large AABBs
    [Figure: query Q and an AABB of width 2r.]
    rangeSearch(query, points, r, K)
    • Strictly speaking, the AABB width must be 2r.
    • What if we can find K neighbors in a smaller range? We can use a smaller AABB.
    • What’s the benefit?

  142. Benefits of Smaller AABBs
    [Plot: Time (s), from 0 to 35, versus AABB Width, from 0 to 30.]
    • Using smaller AABBs drastically reduces the search time.
    • A smaller AABB means a query is enclosed by fewer AABBs,
      • …which leads to fewer traversals and IS shader calls.
      • Particularly important for KNN search, where the IS shader manipulates a priority queue.

  147. Idea: Query Partitioning
    [Figure: an AABB of width 2r and a smaller AABB of width d.]
    [Diagram: Queries (q0, q1, q2, q3, …) → Calc. Smallest AABB Size → Partitions, each with its own BVH (BVH 0, BVH 1, …, BVH n-1, BVH n).]
    • For each query, find an AABB size that’s just large enough to ensure correctness.
    • Group queries such that queries in each partition share the same AABB.
    • Build a different BVH for each partition.
    • Essentially trades BVH construction overhead for faster search.
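A minimal sketch of the grouping step, assuming the per-query AABB sizes have already been computed. The function name `partition_queries` and the sample sizes are hypothetical; in a full pipeline each partition would then get its own BVH built with that AABB width, which is not shown here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def partition_queries(aabb_sizes: Sequence[float]) -> Dict[float, List[int]]:
    """Group query indices by the AABB width each query needs."""
    parts = defaultdict(list)
    for query_id, size in enumerate(aabb_sizes):
        parts[size].append(query_id)
    return dict(sorted(parts.items()))       # smallest AABB width first

# Assumed example: required AABB widths for queries q0..q3.
sizes = [2.0, 1.0, 4.0, 4.0]
partitions = partition_queries(sizes)
for width, query_ids in partitions.items():
    # One BVH would be built per partition, using `width` for every AABB.
    print(f"AABB width {width}: queries {query_ids}")
```

The trade-off named on the slide is visible in the loop: every distinct width costs one extra BVH build, so the number of partitions has to stay small for the faster searches to pay off.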

  153. Determining AABB Size for Range Search
    • Build a uniform grid.
    • Start from the cell that contains the query, and iteratively grow along all four (2D) or six (3D) directions.
    • Stop when K neighbors are found (or the sphere boundary is reached).
    • We call the final collection of cells the megacell, with a width d.
    • d is the AABB size.
    [Figure: a megacell of width d.]
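
A minimal 2D sketch of the megacell search follows. It makes several simplifying assumptions that are not from the talk: the grid is a list of per-cell point counts, the window grows by one ring of cells in all directions at once, and growth stops at the grid border. The function name `megacell_width` and its parameters are hypothetical.

```python
def megacell_width(counts, cell, cell_width, k, r):
    """Grow a square window of cells around `cell` until it holds at least k
    points or its width reaches the search diameter 2r; return the width d."""
    rows, cols = len(counts), len(counts[0])
    i, j = cell
    ring = 0
    while True:
        # Clamp the window to the grid.
        lo_i, hi_i = max(i - ring, 0), min(i + ring, rows - 1)
        lo_j, hi_j = max(j - ring, 0), min(j + ring, cols - 1)
        found = sum(sum(row[lo_j:hi_j + 1]) for row in counts[lo_i:hi_i + 1])
        width = (2 * ring + 1) * cell_width
        whole_grid = (lo_i, lo_j, hi_i, hi_j) == (0, 0, rows - 1, cols - 1)
        if found >= k or width >= 2 * r or whole_grid:
            # Never report more than the sphere's bounding box.
            return min(width, 2 * r)
        ring += 1


grid = [
    [5, 0, 0, 0, 0],
    [0, 2, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
d = megacell_width(grid, cell=(2, 2), cell_width=1.0, k=3, r=10.0)
```

Here the query's own cell holds one point, so the window grows by one ring before it contains three points.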

  155. Determining AABB Size for KNN Search
    • Find the megacell (width d), just like in range search.
    • Can we use d as the AABB size?
    • No! Some of the nearest K neighbors might be outside of the megacell.
    [Figure: query q in a megacell of width d, with points p1 and p2 such that |qp1| > |qp2|.]

  161. A Conservative AABB Size for KNN
    • The circumscribing circle/sphere of the megacell is guaranteed to contain the K nearest neighbors.
    • Why? Given a circle with N neighbors, those N neighbors are by definition the N nearest neighbors; N is guaranteed to be >= K.
    • AABB must be the circumscribing square/cube of that circle/sphere.
    • Width is $\sqrt{2}\,d$ for 2D and $\sqrt{3}\,d$ for 3D.
    [Figure: query q and points p1, p2 around a megacell of width d.]
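
For reference, the conservative widths follow from the diagonal of the megacell. With R denoting the radius of the circumscribing circle or sphere (a symbol introduced here for illustration), its diameter equals the diagonal of a square or cube of side d, and the circumscribing AABB has that diameter as its width:

```latex
\text{2D: } 2R = \sqrt{d^2 + d^2} = \sqrt{2}\,d,
\qquad
\text{3D: } 2R = \sqrt{d^2 + d^2 + d^2} = \sqrt{3}\,d
```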

  164. Can We Do Better?
    • What we really want to find is sphere C, which is the smallest sphere that contains the K nearest neighbors.
    • How? We know cube A has at least K neighbors.
    [Figure: query q, points p1 and p2, and megacell width d, with regions labeled A, B, and C.]

  167. A Better AABB Size for KNN
    • Assumption: point density is locally uniform within and around a megacell.
    • A sphere C that has the same volume as cube A will contain K neighbors, which are guaranteed to be the K nearest neighbors.
    • AABB size is $2\sqrt[3]{\frac{3}{4\pi}}\,d$ for 3D.
    [Figure: query q, points p1 and p2, and megacell width d, with regions labeled A, B, and C.]
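
As a quick numeric check of the equal-volume argument: setting the sphere volume (4/3)πR³ equal to the cube volume d³ gives R = (3/(4π))^(1/3) · d, so the AABB width 2R is roughly 1.24d, noticeably smaller than the conservative √3 · d. The helper name `knn_aabb_width` below is hypothetical.

```python
import math


def knn_aabb_width(d):
    """Width of the AABB enclosing a sphere whose volume equals that of a
    cube of width d (assumes locally uniform point density)."""
    radius = (3.0 / (4.0 * math.pi)) ** (1.0 / 3.0) * d
    return 2.0 * radius


w = knn_aabb_width(1.0)
sphere_volume = 4.0 / 3.0 * math.pi * (w / 2.0) ** 3
conservative = math.sqrt(3.0) * 1.0
```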

  169. Search Algorithm (So Far)
    foreach q in queries:
        AABBSize ← findSmallestAABBSize(q);
        partitions.add(AABBSize, q); // assuming a hash table
    foreach p in partitions:
        queries ← all queries in p;
        r ← AABBSize of p;
        bvh ← buildBVH(points, r);
        firstHitAABBs ← traceRays(bvh, queries);
        reorderQueries(queries, firstHitAABBs);
        traceRays(bvh, queries);

  174. Bundle Partitions
    • Problem: too many partitions lead to high BVH construction overhead.
      • Especially bad when point density is globally non-uniform (e.g., astrophysics simulation).
    • Bundle partitions to minimize overall search time. Bundling two partitions:
      • eliminates one BVH construction cost.
      • but also increases the search cost. Why?
    [Figure: partitions p1 to p4 grouped into bundles b1 to b3.]

  184. Cost Model
    • Search cost is dictated by the number of IS shader calls, which
    • …is dictated by the number of AABBs a query resides in, which
    • …is equivalent to the number of points inside an AABB, which
    • …is density × volume (r³), assuming locally uniform density.
    • Search cost ∝ r³
    • $T_{search} = k N \rho S^3$, where N is the number of queries in a partition, ρ is the point density in a partition, S is the AABB size of the partition, and k is a constant regressed offline.
    [Figure: execution time (s) plotted against the number of IS shader calls (millions).]

  185. Cost Model
    • When combining two partitions, the AABB size of the new partition must be the max of the two.
    $k (N_1 \rho_1 + N_2 \rho_2) [\max(S_1, S_2)]^3 > k (N_1 \rho_1 S_1^3 + N_2 \rho_2 S_2^3)$
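
The cost model and the bundling penalty can be checked numerically with a small sketch. The function names and the example numbers are made up for illustration; a partition is represented as a hypothetical tuple `(n_queries, density, aabb_size)`.

```python
def search_cost(k, n_queries, density, aabb_size):
    """T_search = k * N * rho * S^3 for one partition."""
    return k * n_queries * density * aabb_size ** 3


def bundled_cost(k, p1, p2):
    """Search cost after merging two partitions: all queries must now use
    the larger of the two AABB sizes."""
    (n1, rho1, s1), (n2, rho2, s2) = p1, p2
    return k * (n1 * rho1 + n2 * rho2) * max(s1, s2) ** 3


# Illustrative numbers: a dense partition with a small AABB and a sparse
# partition with a large AABB.
dense = (100, 2.0, 1.0)
sparse = (10, 0.5, 2.0)
separate = search_cost(1.0, *dense) + search_cost(1.0, *sparse)
merged = bundled_cost(1.0, dense, sparse)
```

In this example the merged search cost is far higher than the separate cost, because the many queries of the dense partition are forced onto the larger AABB. When the two sizes are equal, the two costs coincide.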

  189. Optimal Bundling
    • Bundling increases search cost, but reduces BVH construction cost. What’s the optimal bundling?
    • Combinatorial optimization, but we have to solve it at run-time.
    • We leverage an empirical observation to simplify the problem structure, which yields an efficient linear-time solution.
    [Figure: two alternative ways of grouping partitions p1 to p4 into bundles b1 to b3.]

  193. Empirical Observation
    • Empirically: AABB size and # of queries are inversely correlated.
    • Given this empirical observation, we can derive the optimal bundling in linear time.
      • Proof omitted; see paper.
    [Figure: number of queries (log scale, 10^4 to 10^7) versus AABB size (0.3 to 2.3). Annotation: intuitively, only a handful of sparsely located queries need a large AABB to find K neighbors.]

  196. Optimal Bundling Algorithm
    • Algorithm:
      • Sort partitions in ascending order of their AABB sizes.
      • Start from the last partition and scan backward; at each step, bundle all partitions that have been scanned and leave the rest unbundled.
      • Pick the candidate bundling with the lowest search cost.
    [Figure: partitions p1 to p4 and bundles b1 to b3. Annotation: larger AABBs, fewer queries.]
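
The backward scan can be sketched as below. This is an illustrative sketch, not the paper's code: it assumes each candidate is scored as the modeled search cost plus a fixed `build_cost` per BVH (that scoring rule and the name `best_bundling` are assumptions), and it represents a partition as a hypothetical tuple `(n_queries, density, aabb_size)`. After sorting, running sums keep the scan itself linear in the number of partitions.

```python
def best_bundling(partitions, k, build_cost):
    """Return (split, cost): after sorting by AABB size, partitions[split:]
    form one bundle and partitions[:split] keep their own BVHs."""
    parts = sorted(partitions, key=lambda p: p[2])
    n = len(parts)
    # prefix[i] = search cost of the first i partitions, each left unbundled.
    prefix = [0.0]
    for nq, rho, s in parts:
        prefix.append(prefix[-1] + k * nq * rho * s ** 3)
    s_max = parts[-1][2]  # a bundle must use the largest AABB size it contains
    best_split, best_cost = n - 1, None
    weight = 0.0  # running sum of N * rho over the bundled suffix
    for split in range(n - 1, -1, -1):
        nq, rho, _ = parts[split]
        weight += nq * rho
        n_bvhs = split + 1  # one BVH per unbundled partition plus one bundle
        cost = prefix[split] + k * weight * s_max ** 3 + build_cost * n_bvhs
        if best_cost is None or cost < best_cost:
            best_split, best_cost = split, cost
    return best_split, best_cost


example = [(1000, 1.0, 1.0), (10, 1.0, 2.0), (5, 1.0, 3.0)]
no_bundle = best_bundling(example, k=1.0, build_cost=0.0)
some_bundle = best_bundling(example, k=1.0, build_cost=300.0)
all_bundle = best_bundling(example, k=1.0, build_cost=1e6)
```

With free BVH construction nothing is bundled; as construction gets more expensive, the scan first merges the two sparse, large-AABB partitions and eventually merges everything.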

  198. Final Search Algorithm
    foreach q in queries:
        AABBSize ← findSmallestAABBSize(q);
        partitions.add(AABBSize, q); // assuming a hash table
    bundle(partitions);
    foreach p in partitions:
        queries ← all queries in p;
        r ← AABBSize of p;
        bvh ← buildBVH(points, r);
        firstHitAABBs ← traceRays(bvh, queries);
        reorderQueries(queries, firstHitAABBs);
        traceRays(bvh, queries);

  199. Results

  200. Experimental Setup
    • OptiX 7.1, CUDA 11; RTX 2080.
    • Baselines:
      • cuNSearch: grid search in CUDA; used in SPlisHSPlasH fluid simulator.
      • FRNN: grid search in CUDA.
      • PCLOctree: octree search in CUDA (i.e., use octree, as opposed to BVH, to prune search).
      • FastRNN: KNN search in RT cores without our optimizations.
    • Datasets:
      • KITTI: self-driving car datasets; points are surface samples; mostly confined in 2D (ground).
      • Stanford 3D Scanning Repo: Bunny, Dragon, Buddha.
      • N-body simulation: non-uniform distribution in 3D.

  201. Speedups over Baselines
    [Figure: two log-scale speedup plots (10^-1 to 10^3) on KITTI (1M, 6M, 12M, 25M), N-body (9M, 10M), and 3D scans (Bunny-360K, Dragon-3.6M, Buddha-4.6M). Legend entries: Range Search, KNN Search, PCLOctree, cuNSearch, FRNN, FastRNN. Some entries are marked OOM or DNF.]
    Range search speedup: 2.2X to 44.0X
    KNN search speedup: 3.5X to 65.0X
    1. Higher speedups on larger inputs.
    2. Higher speedups on KNN search.

  202. Time Distribution
    [Figure: two panels of time (%) per dataset (KITTI-1M, KITTI-6M, KITTI-12M, KITTI-25M, NBody-9M, NBody-10M, Bunny-360K, Dragon-3.6M, Buddha-4.6M), broken down into Data, Opt, BVH, FS, and Search.]
    Range search: much of the time is spent on optimization, data transfer, and BVH construction.
    KNN search: time is mostly dominated by the actual search.

  203. Time Distribution
    [Figure: the same two time (%) panels (Data, Opt, BVH, FS, Search), with the N-body datasets highlighted.]
    Galaxy (point) distribution in the universe is very non-uniform, so a lot of time is spent on partitioning.

  204. Optimization Effects
    [Figure: log-scale time (s) for KNN and range search on N-body (9M) and KITTI (12M), comparing NoOpt, Sched., Sched. + Partition, Sched. + Partition + Bundle, and Oracle. Annotated values: 18.6% and 161.3 on N-body (9M); 18.8% on KITTI (12M).]

  205. Concluding Remarks

  206. General-Purpose Irregular Processor
    • Conventional GPUs evolved to support general-purpose regular applications; will the same happen to RT cores?
    • A few examples of using RT cores for non-graphics workloads.
      • Key: formulate your problem as a BVH search.
    • But very limited, because RT cores are built to support only BVH search, which has a very specific branching logic (ray-AABB test).
    • Relax the hardware? Does it make sense? Will Nvidia do it?

  207. Approximate Neighbor Search
    • Most often, applications don’t need precise search.
    • Many natural opportunities for approximation in our algorithm.
      • Use a smaller-than-necessary AABB to build the BVH.
      • Elide the ray-sphere test (skip IS shader calls); provides an error bound.
    • Even better: many applications that use neighbor search are differentiable (e.g., neural networks). We could integrate approximate neighbor search into the training process to tolerate end-to-end accuracy loss.
      • See Yu Feng’s ISCA 2022 paper.