RTNN: Accelerating Neighbor Search Using Hardware Ray Tracing

Yuhao Zhu
University of Rochester
https://github.com/horizon-research/rtnn

Game Plan
• What is ray tracing?
• How does hardware support ray tracing?
• What is neighbor search?
• How to use hardware ray tracing to accelerate neighbor search?

What is ray tracing?

[Figure: Modeling produces a 3D mesh; the end result is a 2D image. Image credit: cgarena.com]

Mesh, i.e., How Scene is Represented
• Very informally: a 3D piecewise-linear approximation of arbitrary 3D surfaces.
• Quadrilateral mesh (image: free3d.com)
• Triangular mesh (image: Valette, et al. [TVCG’08])

[Figure: Modeling produces a 3D mesh; Rendering (lighting, camera, material, etc.) turns the 3D mesh into a 2D image. Rendering consists of Visibility and Shading. Image credit: cgarena.com]
• Visibility Problem: for each pixel in the image (to be rendered), which point in the scene (i.e., on the mesh) corresponds to it?
• Shading Problem: what is the color of an intersecting scene point along the ray direction?
* Usually cast multiple rays for each pixel.

Aside: Other Geometry Primitives
• Hair and fur are usually modeled using curves (e.g., Catmull–Rom spline). https://developer.nvidia.com/blog/optix-sdk-7-1/
• Points and spheres. https://www.sciencefocus.com/future-technology/notre-dame-how-faithfully-can-we-rebuild-the-cathedral-with-modern-tech/

Ray-Scene Intersection
• Goal: calculate the [x, y, z] coordinates of the closest hit between the ray and the mesh.
• Why closest hit?
[Figure: a ray hitting a triangular mesh at the point [x, y, z], drawn against x, y, z axes. Mesh image: Valette, et al. [TVCG’08]]

Exhaustive Search
• The simplest solution:
  • iterate all triangles
  • test intersection for each triangle
  • return the closest hit, if any
• Complexity:
  • O(# of rays × # of triangles)
• Slow:
  • lots of triangles and lots of rays
  • …and it’s recursive
[Figure: a ray hitting a triangular mesh at the point [x, y, z]. Mesh image: Valette, et al. [TVCG’08]]

Aside: Why Recursive Ray Tracing?
• To implement realistic shading.
• The color* of an exiting ray depends on the colors* of all incident rays.
  • color* should technically be radiance; not important for our discussion here.
  • also depends on the surface material (diffuse vs. specular vs. …); not important for our discussion here.
• How do we know the color of an incident ray? Cast more rays!
[Figure: at the hit point on the mesh, the color of the exiting ray is unknown ("Color?"), as are the colors of the incident rays; each incident ray is resolved by casting a secondary ray. Mesh image: Valette, et al. [TVCG’08]]

Aside: Rendering Equation
https://en.wikipedia.org/wiki/Rendering_equation
$$L_o(x, \omega_o) = \int_{\Omega} f_r(x, \omega_o, \omega_i) \, L_i(x, \omega_i) \cos\theta \, d\omega_i$$
• L_o(x, ω_o): “color” of the exiting ray ω_o
• L_i(x, ω_i): “color” of the incident ray ω_i
• ∫_Ω … dω_i: integrate incident rays over the hemisphere
• f_r(x, ω_o, ω_i): “transfer function”
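As a small worked example of the rendering equation above, the sketch below estimates L_o by Monte Carlo integration for a special case with a known answer. The assumptions are illustrative and not from the slides: a diffuse (Lambertian) surface with f_r = albedo / π, the same incident radiance L_i from every direction, and directions sampled uniformly over the hemisphere. Under those assumptions the integral evaluates to albedo × L_i. The function name `estimate_outgoing_radiance` is hypothetical.

```python
import math
import random


def estimate_outgoing_radiance(albedo, incident_radiance, num_samples, seed=0):
    """Monte Carlo estimate of L_o for a diffuse surface under constant lighting.

    Assumptions (illustrative only): f_r = albedo / pi, L_i is identical for
    every incident direction, and directions are sampled uniformly over the
    hemisphere, so the sampling density is 1 / (2 * pi) per steradian.
    """
    rng = random.Random(seed)
    brdf = albedo / math.pi        # f_r, constant for a diffuse surface
    pdf = 1.0 / (2.0 * math.pi)    # uniform hemisphere sampling density
    total = 0.0
    for _ in range(num_samples):
        # For directions uniform over the hemisphere, cos(theta) is
        # uniformly distributed in [0, 1].
        cos_theta = rng.random()
        total += brdf * incident_radiance * cos_theta / pdf
    return total / num_samples


if __name__ == "__main__":
    # Should be close to albedo * L_i = 0.8 under the stated assumptions.
    print(estimate_outgoing_radiance(albedo=0.8, incident_radiance=1.0,
                                     num_samples=100_000))
```

In a real ray tracer L_i is not known in advance; each sample would require casting a secondary ray, which is where the recursion comes from.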

Speeding Up Ray-Triangle Intersection Test
• Prune the search space.
  • Only search the part of the scene that does intersect the ray.
• Key: how to partition the space?

intersect(space, ray) {
  if ray doesn’t intersect space boundary:
    return
  else:
    foreach subspace in space:
      if (subspace != empty):
        intersect(subspace, ray)
}
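For reference, the exhaustive baseline that pruning is meant to beat can be sketched as follows. The ray-triangle test uses the Möller–Trumbore algorithm, which is my choice for illustration; the slides do not name a specific test. All function names are hypothetical.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle_t(origin, direction, triangle, eps=1e-9):
    """Moller-Trumbore test: distance t along the ray to the hit, or None."""
    v0, v1, v2 = triangle
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:              # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv_det         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv_det  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv_det
    return t if t > eps else None   # ignore hits behind the origin


def closest_hit_exhaustive(origin, direction, triangles):
    """Test every triangle; return (index, t, [x, y, z] point) or None."""
    best_t, best_index = None, None
    for index, triangle in enumerate(triangles):
        t = ray_triangle_t(origin, direction, triangle)
        if t is not None and (best_t is None or t < best_t):
            best_t, best_index = t, index
    if best_t is None:
        return None
    point = tuple(o + best_t * d for o, d in zip(origin, direction))
    return best_index, best_t, point
```

Each call does work proportional to the number of triangles, so tracing R rays costs O(R × number of triangles); the pruned search above tries to skip whole subspaces instead.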

Space Partition vs. Object Partition
• Space partition: one object could be in different partitions.
• Object partition: different partitions could overlap in space.

Space Partitioning Data Structures
• Uniform Grid
• Quadtree (or Octree in 3D)
• K-d Tree
• Binary Space Partitioning Tree

Bounding Volume Hierarchy (Object Partition)
[Figure: a scene with primitives 1 to 4, enclosed step by step by bounding volumes A to E, shown next to the corresponding BVH tree. A is the root with children B and E; B has children C and D; C holds primitive 1; D holds primitives 2 and 3; E holds primitive 4. The tree is annotated with Root, Interior node, Leaf node, and Primitive.]
• A, B, C, D, E are the bounding volumes, which are Axis-Aligned Bounding Boxes (AABBs) here. Other (irregular) bounding volumes are possible.
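A minimal sketch of how a hierarchy of AABBs like the one above could be built top-down. The splitting rule (median split along the widest axis) is an assumption for illustration; the slides do not say how the BVH is constructed, and the resulting tree need not match the A to E tree in the figure. `Node`, `build_bvh`, and `leaf_primitives` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


def union(a, b):
    """Smallest AABB enclosing AABBs a and b; a box is (min_corner, max_corner)."""
    return (tuple(min(x, y) for x, y in zip(a[0], b[0])),
            tuple(max(x, y) for x, y in zip(a[1], b[1])))


@dataclass
class Node:
    box: tuple                                            # bounding volume (AABB)
    children: List["Node"] = field(default_factory=list)  # interior node if non-empty
    primitives: List[int] = field(default_factory=list)   # leaf node if non-empty


def build_bvh(boxes, indices=None, leaf_size=1):
    """Recursively group primitives (given by their AABBs) into a binary BVH."""
    if indices is None:
        indices = list(range(len(boxes)))
    box = boxes[indices[0]]
    for i in indices[1:]:
        box = union(box, boxes[i])
    if len(indices) <= leaf_size:
        return Node(box, primitives=list(indices))
    # Split at the median primitive along the widest axis of the node's box.
    extents = [hi - lo for lo, hi in zip(box[0], box[1])]
    axis = extents.index(max(extents))
    ordered = sorted(indices, key=lambda i: boxes[i][0][axis] + boxes[i][1][axis])
    mid = len(ordered) // 2
    return Node(box, children=[build_bvh(boxes, ordered[:mid], leaf_size),
                               build_bvh(boxes, ordered[mid:], leaf_size)])


def leaf_primitives(node):
    """All primitive indices stored under a node."""
    if not node.children:
        return list(node.primitives)
    return [p for child in node.children for p in leaf_primitives(child)]
```

Because this is an object partition, each primitive ends up in exactly one leaf, while sibling boxes are allowed to overlap in space.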

Intersection Test Using BVH
[Figure: one ray traversing the scene (primitives 1 to 4, bounding volumes A to E) next to the BVH tree, showing the current node, the stack, and ClosestHit at each step.]
• Current: A. Stack: empty. Ray-AABB intersection test. ClosestHit = NA.
• Current: B. Stack: E. Ray-AABB intersection test. ClosestHit = NA.
• Current: C. Stack: E, D. Ray-AABB intersection test. ClosestHit = NA.
• Current: D. Stack: E. Ray-AABB intersection test. ClosestHit = NA.
• Current: primitives 2 and 3. Stack: E. Ray-triangle intersection test. ClosestHit = NA.
• Current: E. Stack: empty. Ray-AABB intersection test. ClosestHit = 2.
• Distance to E > distance to 2; stop!
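The traversal above can be sketched with an explicit stack. To keep the example short, several things are assumptions rather than the slides' setup: the scene is 2D, the primitives are small boxes instead of triangles (so one slab test serves for both nodes and primitives), and the coordinates are invented so that the ray misses C, hits primitive 2 first, and finds E farther away than that hit. All names are hypothetical.

```python
import math


def ray_box_entry(origin, direction, box, t_min=0.0, t_max=math.inf):
    """Slab test: entry distance of the ray into an AABB, or None on a miss."""
    lo, hi = box
    t_near, t_far = t_min, t_max
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:      # parallel to this slab and outside it
                return None
            continue
        t0, t1 = sorted(((l - o) / d, (h - o) / d))
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near


def union(boxes):
    los, his = [b[0] for b in boxes], [b[1] for b in boxes]
    return (tuple(map(min, zip(*los))), tuple(map(max, zip(*his))))


def make_leaf(name, prim_ids, prims):
    return {"name": name, "box": union([prims[i] for i in prim_ids]),
            "children": [], "prims": prim_ids}


def make_interior(name, children):
    return {"name": name, "box": union([c["box"] for c in children]),
            "children": children, "prims": []}


def build_slide_scene():
    """Invented coordinates arranged like the slide: A(B(C(1), D(2, 3)), E(4))."""
    prims = {1: ((1, 2), (2, 3)), 2: ((3, -0.5), (4, 0.5)),
             3: ((5, -0.5), (6, 0.5)), 4: ((8, -0.5), (9, 0.5))}
    c, d = make_leaf("C", [1], prims), make_leaf("D", [2, 3], prims)
    b, e = make_interior("B", [c, d]), make_leaf("E", [4], prims)
    return make_interior("A", [b, e]), prims


def closest_hit(root, prims, origin, direction):
    """Stack-based BVH traversal; returns (primitive id, distance, visit order)."""
    stack, visited = [root], []
    best_t, best_prim = math.inf, None
    while stack:
        node = stack.pop()
        visited.append(node["name"])
        t_enter = ray_box_entry(origin, direction, node["box"])
        # Prune: the ray misses the box, or the box starts beyond the closest hit.
        if t_enter is None or t_enter > best_t:
            continue
        for prim_id in node["prims"]:
            t_hit = ray_box_entry(origin, direction, prims[prim_id])
            if t_hit is not None and t_hit < best_t:
                best_t, best_prim = t_hit, prim_id
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(node["children"]))
    return best_prim, best_t, visited
```

With a ray from (0, 0) along +x, node E is popped last and skipped without testing primitive 4, because its entry distance exceeds the distance to primitive 2.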

Ray-AABB Intersection
• Ray: O + tD, tmin <= t <= tmax
[Figure: a ray with origin O and direction D hitting an AABB at thit, with tmin and tmax marked along the ray.]

A Subtle but Critical Case
• Ray: O + tD, tmin <= t <= tmax
[Figure: a ray segment between tmin and tmax that lies completely inside an AABB.]
• Should this be counted as a hit?
• Yes; any ray segment that’s completely inside an AABB must be treated as intersecting.

Aside: Two Terminology Confusions
• Ray casting vs. ray tracing
  • Technically, finding the intersection of one ray and the scene is called ray casting.
  • Ray tracing refers to recursive ray casting.
• Acceleration structures
  • Data structures that help speed up ray tracing are called “acceleration structures” (e.g., BVH), not to be confused with hardware accelerators.
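The ray-AABB test, including the subtle case of a ray segment that lies completely inside the box, can be sketched with the slab method. The slab method is an assumption here, since the slides do not name an algorithm, and `boundary_only_hit` is a deliberately wrong variant included only for contrast. All names are hypothetical.

```python
import math


def slab_interval(origin, direction, box):
    """Interval [t_enter, t_exit] where the infinite line O + tD is inside the box."""
    lo, hi = box
    t_enter, t_exit = -math.inf, math.inf
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:      # parallel to this slab and outside it
                return None
            continue
        t0, t1 = sorted(((l - o) / d, (h - o) / d))
        t_enter, t_exit = max(t_enter, t0), min(t_exit, t1)
    return (t_enter, t_exit) if t_enter <= t_exit else None


def ray_aabb_hit(origin, direction, box, t_min, t_max):
    """Hit if [t_min, t_max] overlaps the interval in which the line is inside the box."""
    interval = slab_interval(origin, direction, box)
    if interval is None:
        return False
    t_enter, t_exit = interval
    return max(t_enter, t_min) <= min(t_exit, t_max)


def boundary_only_hit(origin, direction, box, t_min, t_max):
    """Wrong for BVH traversal: requires the segment to cross the box surface."""
    interval = slab_interval(origin, direction, box)
    if interval is None:
        return False
    t_enter, t_exit = interval
    return t_min <= t_enter <= t_max or t_min <= t_exit <= t_max
```

For a box spanning (0, 0, 0) to (10, 10, 10) and a ray from (5, 5, 5) along +x with tmin = 0 and tmax = 1, `ray_aabb_hit` reports a hit while `boundary_only_hit` does not. The first behavior is the one traversal needs, because primitives inside the box may still be hit by that short segment.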

How does hardware support ray tracing?

Ray Tracing on GPUs Using BVH
[Figure: the scene with primitives 1 to 4 and bounding volumes A to E, traversed by several rays at once.]
• Build the BVH.
• For each ray (thread):
  • Traverse the BVH (manage local stack)
  • Ray-AABB intersection test
  • Ray-primitive intersection test
  • Execute a shading algorithm
• Prior to OptiX (2010):
  • Manually implement in CUDA.
[Slide annotations: “Fixed-function” and “~Fixed-function” label some of the per-ray steps.]

Ray Tracing in OptiX and Turing GPU
• OptiX (2010): a ray tracing-specific programming model.
  • Provides a generic ray tracing pipeline.
  • Some pipeline stages are programmable; others are fixed functions.
• Prior to Turing architecture (2018):
  • Everything runs on CUDA cores.
• Turing architecture:
  • RT Cores accelerate fixed-function stages.
  • Programmable stages on the CUDA cores.
[Figure: first page of Parker, S., Bigler, J., Dietrich, A., Friedrich, H., Hoberock, J., Luebke, D., McAllister, D., McGuire, M., Morley, K., Robison, A., Stich, M. 2010. OptiX: A General Purpose Ray Tracing Engine. ACM Trans. Graph. 29, 4, Article 66 (July 2010), 13 pages. DOI 10.1145/1778765.1778803]

An SM in Turing GPU
[Figure: https://wccftech.com/nvidia-turing-gpu-architecture-geforce-rtx-graphics-cards-detailed/]

OptiX Programming Model
Pipeline: Construct BVH → Ray Generation (RG) Shader → BVH Traversal + Ray-AABB Test (TL)
• Enter leaf node → Intersection (IS) Shader → Ray primitive intersect?
  • Yes → Any-Hit (AH) Shader, then traversal continues.
  • No → traversal continues.
• Traversal completes → Found a hit?
  • Yes → Closest-Hit (CH) Shader.
  • No → Miss Shader.
Notes:
• “Shaders” are user-defined functions executing on CUDA cores.
• The IS shader allows custom primitives (not just triangles).
• BVH traversal and the ray-AABB test (TL) are fixed functions executed on the RT cores.
[Figure: example BVH with AABBs A to E and primitives 1 to 4]

OptiX Programming Model
• The whole pipeline (RG shader, BVH traversal + ray-AABB test, IS, AH, CH, and miss shaders) runs as one single CUDA kernel.
• OptiX rays map to CUDA threads.

Life of an OptiX Ray
[Figure: BVH with AABBs A to E and primitives 1 to 4]
• Timeline of one ray: RG (CUDA cores) → TL (A, B, D) (RT cores) → IS (2, 3) (CUDA cores) → TL (C, E) (RT cores) → CH (CUDA cores)
• Think of RT cores as special function units for BVH traversal.

Aside: Other Notable Ray Tracing Engines
• Intel OSPRay
  • Won 2020 Oscar for Scientific and Technical Achievement.
  • Built on Intel Embree, a collection of ray tracing kernels, which uses Intel Implicit SPMD Program Compiler (ISPC) for explicit vectorization.
• PBRT
  • Pedagogical engine.
  • The book won 2014 Oscar for Scientific and Technical Achievement.

Game Plan
• What is ray tracing?
• How does hardware support ray tracing?
• What is neighbor search?
• How does it relate to ray tracing?

Neighbor Search
Range Search
[Figure: a query, its search radius r, and the search points]
• Usually also limits the total # of neighbors:
  • practical memory constraint,
  • downstream algorithms expect a fixed # of neighbors.
• rangeSearch(query, points, range, K): return any K points that are within range of query.
KNN Search
[Figure: a query and its 2 nearest neighbors]
• Usually also limits ranges of neighbors:
  • neighbors too far away are of no significance (e.g., force from a remote particle).
• KNN(query, points, range, K): return K nearest points that are within range of query.
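The two definitions above can be pinned down with a brute-force reference. This is only a sketch of the semantics, not the RTNN implementation: the function names `range_search` and `knn_search` and the sample data are assumptions made for illustration.

```python
import math

def range_search(query, points, radius, k):
    """Return up to k points within `radius` of `query` (any k, not the nearest)."""
    found = []
    for p in points:
        if math.dist(query, p) <= radius:
            found.append(p)
            if len(found) == k:
                break  # the neighbor count is capped at k
    return found

def knn_search(query, points, radius, k):
    """Return the k nearest points among those within `radius` of `query`."""
    in_range = [p for p in points if math.dist(query, p) <= radius]
    return sorted(in_range, key=lambda p: math.dist(query, p))[:k]

# Illustrative data: three points in range, one far away.
points = [(0.0, 2.0), (1.0, 0.0), (0.0, 0.0), (5.0, 5.0)]
query = (0.0, 0.5)
print(range_search(query, points, 2.0, 2))
print(knn_search(query, points, 2.0, 2))
```

On this data the two searches return different sets: range search keeps the first two in-range points it encounters, while KNN keeps the two closest ones.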

Our Focus: Low-Dimensional Search
• Low dimension: <= 3D.
• Prevalent in science and engineering fields (e.g., computational fluid dynamics, graphics, vision).
  • They deal with physical data (e.g., particles, surface samples) that are inherently 2D/3D.
• High-dimensional search is a completely different game.
  • “Curse of dimensionality” means we need different algorithms and distance metrics.

Turn the Problem Around
• Find all points within r from Q.
• Equivalently: find whether Q is within r from other points.
[Figure: query Q and radius r]

Point-in-Sphere Test
• Is Q in the AABB? (Prunes remote points)
• If so, is Q in the sphere?

Point-in-AABB Test
• Recall: any ray that’s within an AABB must be treated as intersecting.
• Idea: generate a short ray from Q and (ask the RT cores to) perform the ray-AABB test.
• The ray has an arbitrary direction and a very small length.
• Why a very small ray length? A long ray starting at a point Q’ outside the AABB could still cross the AABB and be reported as intersecting.
[Figure: AABB of width 2r with Q inside and Q’ outside]
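A software version of the ray-AABB test shows why the ray must be short. This is a minimal sketch using the standard slab method, not the RT cores' actual logic; the function name `ray_hits_aabb` and the ray length `1e-6` are assumptions.

```python
def ray_hits_aabb(origin, direction, t_max, box_min, box_max):
    """Slab test: does the segment origin + t*direction, t in [0, t_max], overlap the box?"""
    t_near, t_far = 0.0, t_max
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this slab: the origin must already be inside it.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

# AABB of width 2r (r = 1) around a point at the origin.
box_min, box_max = (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)
direction = (1.0, 0.0, 0.0)  # arbitrary direction

inside = ray_hits_aabb((0.5, 0.0, 0.0), direction, 1e-6, box_min, box_max)
outside_short = ray_hits_aabb((-3.0, 0.0, 0.0), direction, 1e-6, box_min, box_max)
outside_long = ray_hits_aabb((-3.0, 0.0, 0.0), direction, 10.0, box_min, box_max)
print(inside, outside_short, outside_long)
```

With a tiny `t_max` the test answers "is the origin inside the box?"; with a long ray the outside query is wrongly reported as a candidate.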

Game Plan
• What is ray tracing?
• How does hardware support ray tracing?
• What is neighbor search?
• How to use hardware ray tracing to accelerate neighbor search?

Overall Idea
rangeSearch(query, points, r, K)
• Create an AABB of width 2r for every point.
  • Use spheres as primitives, not triangles.
• Construct a BVH from the AABBs. (No control; hidden behind the OptiX APIs and most likely done in hardware)
  • https://forums.developer.nvidia.com/t/bvh-building-algorithm-and-primitive-order/182231/8
• Generate a ray for each query. (RG Shader)
• Traverse BVH; skip non-circumscribing AABBs. (No control; done in hardware)
• At leaf nodes: calc dist, collect neighbors. (IS Shader)

Another Perspective: Point-in-Sphere Test
• BVH Traversal + Ray-AABB Test (TL): is Q in the AABB? (Prunes remote points)
• Intersection (IS) Shader: if so, is Q in the sphere?
[Figure: the OptiX pipeline (Construct BVH, RG, TL, IS, AH, CH, Miss) annotated with these two tests]
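The steps above can be emulated on the CPU to make the data flow concrete. This is a sketch under strong assumptions: a linear scan over the AABBs stands in for the hardware BVH traversal, and the name `rt_range_search` is hypothetical.

```python
import math

def rt_range_search(queries, points, r, k):
    # One AABB of width 2r per point, stored as (min corner, max corner).
    boxes = [([c - r for c in p], [c + r for c in p]) for p in points]
    results = []
    for q in queries:  # one (degenerate) ray per query, as in the RG shader
        neighbors = []
        for p, (lo, hi) in zip(points, boxes):
            # Stand-in for BVH traversal: skip AABBs that do not enclose q.
            if not all(l <= c <= h for l, c, h in zip(lo, q, hi)):
                continue
            # Stand-in for the IS shader: exact point-in-sphere test.
            if math.dist(q, p) <= r:
                neighbors.append(p)
                if len(neighbors) == k:
                    break
        results.append(neighbors)
    return results

points = [(0.0, 0.0), (0.9, 0.9), (3.0, 3.0)]
result = rt_range_search([(0.0, 0.0)], points, 1.0, 8)
print(result)
```

The point `(0.9, 0.9)` passes the AABB test but fails the sphere test, which is why the second stage is needed.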

Problem: Control Flow Divergence
• OptiX groups every 32 adjacent rays into a warp.
[Figure: example rays X and Y]

Idea: Order Queries Spatially
• Intuition: group spatially close queries together so that their rays follow similar traversal paths.
  • Improving ray coherence in graphics parlance.
• How? A simple heuristic: queries enclosed by the same AABB are spatially close.

Idea: Order Queries Spatially
• A query might be enclosed by many AABBs, but any AABB will do.
• How to find one? Cast a ray and immediately terminate the ray once the first IS shader is called.
  • optixTerminateRay()
  • Effectively returning ID (key) of the first enclosing leaf AABB.
• Then sort by key.
[Figure: example with elements labeled 1 to 8]
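The reordering step can be sketched as a key computation followed by a sort. The code below is illustrative only: `first_enclosing_aabb` scans AABBs in index order, whereas on the GPU the "first" AABB depends on the hardware traversal order; both function names are assumptions.

```python
def first_enclosing_aabb(query, points, r):
    """Stand-in for the first ray cast: index of the first AABB that contains the query."""
    for idx, p in enumerate(points):
        if all(abs(qc - pc) <= r for qc, pc in zip(query, p)):
            return idx  # mimics terminating the ray at the first IS shader call
    return len(points)  # no enclosing AABB: such queries sort last

def reorder_queries(queries, points, r):
    keys = [first_enclosing_aabb(q, points, r) for q in queries]
    order = sorted(range(len(queries)), key=keys.__getitem__)  # stable sort by key
    return [queries[i] for i in order], order

points = [(0.0, 0.0), (10.0, 10.0)]
queries = [(10.2, 9.9), (0.1, 0.1), (9.8, 10.1), (-0.2, 0.3)]
reordered, order = reorder_queries(queries, points, 1.0)
print(order)
```

After sorting, queries that share an enclosing AABB sit next to each other, so adjacent rays in a warp are more likely to follow similar traversal paths.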

Search Algorithm (So Far)
    bvh ← buildBVH(points, r);
    firstHitAABBs ← traceRays(bvh, queries);
    reorderQueries(queries, firstHitAABBs);
    traceRays(bvh, queries);

Problem: Large AABBs
rangeSearch(query, points, r, K)
• Strictly speaking, the AABB width must be 2r.
• What if we can find K neighbors in a smaller range? We can use a smaller AABB.
• What’s the benefit?
[Figure: query Q inside an AABB of width 2r]

Benefits of Smaller AABBs
[Plot: search time (s) vs. AABB width]
• Using smaller AABBs drastically reduces the search time.
• Smaller AABB means a query is enclosed by fewer AABBs.
• …which leads to fewer traversals and IS shader calls.
• Particularly important for KNN search, where the IS shader manipulates a priority queue.

Idea: Query Partitioning
• For each query, find an AABB size that’s just large enough to ensure correctness.
• Group queries such that queries in each partition share the same AABB.
• Build a different BVH for each partition.
• Essentially trades BVH construction overhead for faster search.
[Figure: an AABB of width d inside the full-size AABB of width 2r]
[Figure: queries q0, q1, q2, q3, … → Calc. Smallest AABB Size → partitions, each with its own BVH (BVH 0, BVH 1, …, BVH n-1, BVH n)]

Determining AABB Size for Range Search
• Build a uniform grid.
• Start from the cell that contains the query, and iteratively grow along all four (2D) or six (3D) directions.
• Stop when K neighbors are found (or the sphere boundary is reached).
• We call the final collection of cells the megacell, with a width d.
• d is the AABB size.
[Figure: uniform grid with a megacell of width d around the query]
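The megacell growth can be sketched in 2D as follows. This is a simplified illustration, not the paper's implementation: it grows one ring of cells per step in all four directions at once, ignores that the query is not exactly centered in its cell, and uses the hypothetical name `megacell_width` and an assumed `cell` size parameter.

```python
from collections import Counter

def megacell_width(query, points, r, k, cell):
    """Grow a square of grid cells around the query's cell (2D) until it holds
    at least k points or its width reaches 2r; return the width d."""
    # Uniform grid: number of points per cell.
    counts = Counter((int(x // cell), int(y // cell)) for x, y in points)
    qx, qy = int(query[0] // cell), int(query[1] // cell)
    steps = 0
    while True:
        width = (2 * steps + 1) * cell
        inside = sum(counts[(i, j)]
                     for i in range(qx - steps, qx + steps + 1)
                     for j in range(qy - steps, qy + steps + 1))
        if inside >= k or width >= 2 * r:
            return min(width, 2 * r)  # never exceed the full-size AABB
        steps += 1

dense = [(0.5, 0.5), (0.6, 0.4), (0.2, 0.8)]
sparse = [(0.5, 0.5), (2.5, 0.5), (-1.5, 0.5)]
print(megacell_width((0.5, 0.5), dense, 5.0, 3, 1.0))
print(megacell_width((0.5, 0.5), sparse, 5.0, 3, 1.0))
```

A query in a dense region stops after the first cell, while a query in a sparse region needs a wider megacell before it sees K points.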

Determining AABB Size for KNN Search
• Find the megacell (width d), just like in range search.
• Can we use d as the AABB size?
• No! Some of the nearest K neighbors might be outside of the megacell.
[Figure: megacell of width d around q; p1 is inside the megacell and p2 is outside it, yet |qp1| > |qp2|]

A Conservative AABB Size for KNN
• The circumscribing circle/sphere of the megacell is guaranteed to have the K nearest neighbors.
• Why? Given a circle centered at q with N neighbors, those N neighbors are by definition the N nearest neighbors; N is guaranteed to be >= K.
• AABB must be the circumscribing square/cube of that circle/sphere.
• Width is $\sqrt{2}\,d$ for 2D and $\sqrt{3}\,d$ for 3D.
[Figure: megacell of width d around q with points p1 and p2]

Can We Do Better?
• What we really want to find is sphere C, which is the smallest sphere that contains the K nearest neighbors.
• How? We know cube A has at least K neighbors.
[Figure: megacell of width d around q with points p1 and p2; shapes labeled A, B, and C]

A Better AABB Size for KNN
• Assumption: point density is locally uniform within and around a megacell.
• A sphere C that has the same volume as cube A will contain K neighbors, which are guaranteed to be the K nearest neighbors.
• AABB size is $2\sqrt[3]{\frac{3}{4\pi}}\,d$ for 3D.
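The equal-volume AABB size follows from a short derivation. The symbol $R$ for the radius of sphere C is introduced here for illustration; the numeric comparison with the conservative width $\sqrt{3}\,d$ is simple arithmetic.

```latex
% Sphere C (radius R) has the same volume as cube A (width d):
\frac{4}{3}\pi R^{3} = d^{3}
\quad\Longrightarrow\quad
R = \sqrt[3]{\frac{3}{4\pi}}\; d

% The AABB circumscribing sphere C has width 2R:
2R = 2\sqrt[3]{\frac{3}{4\pi}}\; d \approx 1.24\, d
\qquad\text{versus the conservative width}\qquad
\sqrt{3}\, d \approx 1.73\, d
```

Under the locally uniform density assumption, this shrinks the AABB width by roughly 28% compared with the conservative bound.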

Search Algorithm (So Far)
    foreach q in queries:
        AABBSize ← findSmallestAABBSize(q);
        partitions.add(AABBSize, q); // assuming a hash table
    foreach p in partitions:
        queries ← all queries in p;
        r ← AABBSize of p;
        bvh ← buildBVH(points, r);
        firstHitAABBs ← traceRays(bvh, queries);
        reorderQueries(queries, firstHitAABBs);
        traceRays(bvh, queries);

Bundle Partitions
• Problem: too many partitions lead to high BVH construction overhead.
  • Especially bad when point density is globally non-uniform (e.g., astrophysics simulation).
• Bundle partitions to minimize overall search time. Bundling two partitions:
  • eliminates one BVH construction cost,
  • but also increases the search cost. Why?
[Figure: partitions p1, p2, p3, p4 grouped into bundles b1, b2, b3]

Cost Model
• Search cost is dictated by the number of IS shader calls, which
• …is dictated by the number of AABBs a query resides in, which
• …is equivalent to the number of points inside an AABB, which
• …is density × volume ($\propto r^3$), assuming locally-uniform density.
• Search cost $\propto r^3$
$T_{search} = k N \rho S^3$
• $N$: # of queries in a partition
• $\rho$: point density in a partition
• $S$: AABB size of the partition
• $k$: a constant regressed offline
[Plot: execution time (s) vs. # of IS shader calls (millions)]

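Cost Model
• When combining two partitions, the AABB size of the new partition must be the max of the two.
• The bundled search cost (left) therefore exceeds the sum of the two separate search costs (right):
$k (N_1 \rho_1 + N_2 \rho_2) [\max(S_1, S_2)]^3 > k (N_1 \rho_1 S_1^3 + N_2 \rho_2 S_2^3)$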
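A quick numeric check makes the inequality tangible. This is a sketch with made-up partition values; the function names, the tuple layout `(N, rho, S)`, and the default `k=1.0` are assumptions for illustration.

```python
def search_cost(n_queries, density, aabb_size, k=1.0):
    """T_search = k * N * rho * S^3 for one partition."""
    return k * n_queries * density * aabb_size ** 3

def bundled_cost(p1, p2, k=1.0):
    """Search cost after bundling: both partitions use the larger AABB size."""
    (n1, rho1, s1), (n2, rho2, s2) = p1, p2
    return k * (n1 * rho1 + n2 * rho2) * max(s1, s2) ** 3

# Illustrative partitions as (N, rho, S): many queries with a small AABB,
# and a handful of queries with a large AABB.
p1 = (1000, 2.0, 0.5)
p2 = (10, 0.5, 2.0)

separate = search_cost(*p1) + search_cost(*p2)
bundled = bundled_cost(p1, p2)
print(separate, bundled)
```

The many queries of the small-AABB partition are forced onto the large AABB, so the bundled search cost is far higher than the separate one.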

Optimal Bundling
• Bundling increases search cost, but reduces BVH construction cost. What’s the optimal bundling?
• Combinatorial optimization, but we have to solve it at run-time.
• We leverage an empirical observation to simplify the problem structure, which yields an efficient linear-time solution.
[Figure: two alternative groupings of partitions p1, p2, p3, p4 into bundles b1, b2, b3]

Empirical Observation
• Empirically: AABB size and # of queries are inversely correlated.
  • Intuitively, only a handful of sparsely located queries need a large AABB to find K neighbors.
• Given this empirical observation, we can derive the optimal bundling in linear time.
  • Proof omitted; see paper.
[Plot: AABB size vs. number of queries (log scale, 10^4 to 10^7)]

Optimal Bundling Algorithm
• Algorithm:
  • Sort partitions according to the ascending order of their AABB sizes.
  • Start from the last partition and scan backward; at each step, bundle all partitions that have been scanned, leave the rest unbundled.
  • Pick the one with the lowest search cost.
[Figure: partitions p1, p2, p3, p4 sorted toward larger AABBs and fewer queries, grouped into bundles b1, b2, b3]
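The backward scan can be sketched as below. Two things here are assumptions rather than something the slides state: each candidate is scored as search cost plus a fixed `build_cost` per BVH (otherwise "no bundling" would always win on search cost alone), and the name `best_bundling` and the tuple layout `(N, rho, S)` are hypothetical.

```python
def best_bundling(partitions, k=1.0, build_cost=1.0):
    """partitions: non-empty list of (n_queries, density, aabb_size).
    Returns (unbundled partitions, bundled partitions, total cost)."""
    parts = sorted(partitions, key=lambda p: p[2])  # ascending AABB size
    # prefix[i]: search cost of the first i partitions when left unbundled.
    prefix = [0.0]
    for nq, rho, s in parts:
        prefix.append(prefix[-1] + k * nq * rho * s ** 3)
    s_max = parts[-1][2]  # every bundle in the scan contains the largest AABB
    best, weight = None, 0.0
    for i in range(len(parts) - 1, -1, -1):  # scan backward from the last partition
        weight += parts[i][0] * parts[i][1]  # running sum of N * rho in the bundle
        bundle_search = k * weight * s_max ** 3
        n_bvh = i + 1  # i unbundled partitions plus one bundle
        total = prefix[i] + bundle_search + build_cost * n_bvh
        if best is None or total < best[0]:
            best = (total, i)
    total, split = best
    return parts[:split], parts[split:], total

partitions = [(1000, 1.0, 1.0), (10, 1.0, 2.0), (1, 1.0, 3.0)]
unbundled, bundle, total = best_bundling(partitions, k=1.0, build_cost=500.0)
print(unbundled, bundle, total)
```

The running sums keep the scan linear after the sort. With an expensive BVH build, the two small tail partitions are merged; with `build_cost=0.0` nothing is merged.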

Final Search Algorithm
    foreach q in queries:
        AABBSize ← findSmallestAABBSize(q);
        partitions.add(AABBSize, q); // assuming a hash table
    bundle(partitions);
    foreach p in partitions:
        queries ← all queries in p;
        r ← AABBSize of p;
        bvh ← buildBVH(points, r);
        firstHitAABBs ← traceRays(bvh, queries);
        reorderQueries(queries, firstHitAABBs);
        traceRays(bvh, queries);

Results

Experimental Setup
• OptiX 7.1, CUDA 11; RTX 2080.
• Baselines:
  • cuNSearch: grid search in CUDA; used in SPlisHSPlasH fluid simulator.
  • FRNN: grid search in CUDA.
  • PCLOctree: octree-search in CUDA (i.e., use octree, as opposed to BVH, to prune search).
  • FastRNN: KNN search in RT cores without our optimizations.
• Datasets:
  • KITTI: self-driving car dataset; points are surface samples; mostly confined in 2D (ground).
  • Stanford 3D Scanning Repo: Bunny, Dragon, Buddha.
  • N-body simulation: non-uniform distribution in 3D.

Speedups over Baselines
[Plots: speedup (log scale, 10^-1 to 10^3) for range search and KNN search over the baselines PCLOctree, cuNSearch, FRNN, and FastRNN on KITTI (1M, 6M, 12M, 25M), N-body (9M, 10M), and 3D scans (Bunny-360K, Dragon-3.6M, Buddha-4.6M); some bars are marked OOM or DNF]
• Range search speedup: 2.2X to 44.0X
• KNN search speedup: 3.5X to 65.0X
1. higher speedups on larger inputs.
2. higher speedups on KNN search.

Time Distribution
[Plots: two panels breaking down time (%) into Data, Opt, BVH, FS, and Search for KITTI (1M, 6M, 12M, 25M), N-body (9M, 10M), and 3D scans (Bunny-360K, Dragon-3.6M, Buddha-4.6M)]
• Range search: much of the time is spent on optimization, data transfer, BVH construction.
• KNN search: time is mostly dominated by the actual search.
• N-body: galaxy (point) distribution in universe is very non-uniform; so a lot of time spent on partitioning.

Optimization Effects
[Plots: log-scale time (s) for KNN and range search on N-body (9M) and KITTI (12M), comparing NoOpt, Sched., Sched. + Partition, Sched. + Partition + Bundle, and Oracle; annotated values: 18.6% and 161.3 (N-body), 18.8% (KITTI)]

Concluding Remarks

General-Purpose Irregular Processor
• Conventional GPUs evolved to support general-purpose regular applications; will the same happen to RT cores?
• A few examples of using RT cores for non-graphics workloads.
  • Key: formulate your problem as a BVH search.
• But very limited, because RT cores are built to support only BVH search, which has a very specific branching logic (ray-AABB test).
• Relax the hardware? Does it make sense? Will Nvidia do it?

Approximate Neighbor Search
• Most applications don’t need precise search.
• Many natural opportunities for approximation in our algorithm.
  • Use a smaller-than-necessary AABB to build the BVH.
  • Elide ray-sphere test (skip IS shader calls); provides an error bound.
• Even better: many applications that use neighbor search are differentiable (e.g., neural network). We could integrate approximate neighbor search into the training process to tolerate end-to-end accuracy loss.
  • See Yu Feng’s ISCA 2022 paper.
