Slide 1

Slide 1 text

How do folks use Dask + RAPIDS?
Dask Demo Day, Oct 2023

Slide 2

Slide 2 text

25% of the Fortune 100 use RAPIDS

Slide 3

Slide 3 text

Growing community

Slide 4

Slide 4 text

Common RAPIDS use cases
▸ Workloads
  ○ Dask-cuDF + XGBoost risk assessment on >1TB datasets
  ○ LLM text preprocessing on >10TB datasets
  ○ Apache Beam pipelines
  ○ Graph neural networks
▸ Sectors
  ○ Retail
  ○ Financial services
  ○ Cyber security
  ○ Telecoms
  ○ Automotive
▸ Market size
  ○ Some users spending 6-7 figures per month on GPU Dask clusters
  ○ Clusters with up to 100 GPU workers

Where do we see people using RAPIDS?
▸ Platforms
  ○ Google Cloud
  ○ AWS
  ○ Azure
  ○ Oracle
  ○ On Prem (often SLURM)
▸ Dask often paired with
  ○ XGBoost
  ○ Optuna
  ○ Spark
  ○ Apache Beam
  ○ Numba
  ○ PyTorch
  ○ TensorFlow

Slide 5

Slide 5 text

Model evaluation in recommender systems
https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32017/

Slide 6

Slide 6 text

Feature processing in recommender systems
https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s32017/

Slide 7

Slide 7 text

Signal processing in autonomous vehicles
https://www.nvidia.com/en-us/on-demand/session/gtcspring23-s51336/

Slide 8

Slide 8 text

LLM Data Preprocessing (10s of TBs)
https://developer.nvidia.com/blog/curating-trillion-token-datasets-introducing-nemo-data-curator/

Slide 9

Slide 9 text

Apache Beam Pipelines
https://www.youtube.com/watch?v=uGEQkws1Low

Slide 10

Slide 10 text

Open Source Community
Why don’t we see RAPIDS discussed more in OSS Dask?
▸ NVIDIA targets large enterprises with RAPIDS.
▸ Large enterprise users are less likely to open GitHub issues than academic or SME users.
▸ RAPIDS users typically work for companies with direct links to NVIDIA for support.
▸ Dask issues are often discussed in high-level projects like dask-cudf, nvtabular, NeMo, etc.
Maybe we could do better at communicating back to the Dask community…