
PyData Cardiff - RAPIDS 0.11: Open GPU Data Science

A remix of the RAPIDS 0.11 release deck targeted at the PyData Cardiff community.

All release decks
https://docs.rapids.ai/overview

Meetup
https://www.meetup.com/PyData-Cardiff-Meetup/events/268066478/

Abstract
The RAPIDS suite of open source software libraries (https://rapids.ai/) allows you to run data science and analytics pipelines entirely on GPUs, while following familiar Python APIs including NumPy, Pandas and scikit-learn.

RAPIDS relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

RAPIDS also focuses on common data preparation tasks for analytics and data science. This includes a familiar DataFrame API that integrates with a variety of machine learning algorithms for end-to-end pipeline acceleration without paying typical serialization costs. RAPIDS also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger datasets.
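
As a minimal illustration of that familiar DataFrame API (a hedged sketch, not taken from the deck; the column names and data are made up), the same groupby code you would write with pandas runs on the GPU simply by swapping the import:

    import cudf  # assumes a CUDA-capable GPU and the cudf package

    gdf = cudf.DataFrame({"key": [0, 1, 0, 1], "value": [1.0, 2.0, 3.0, 4.0]})
    means = gdf.groupby("key")["value"].mean()  # executed on the GPU
    print(means.to_pandas())                    # copy the small result back to pandas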

Jacob Tomlinson

February 11, 2020

Transcript

  1. Data Processing Evolution: faster data access, less data movement

     Hadoop processing, reading from disk: HDFS Read -> Query -> HDFS Write -> HDFS Read -> ETL -> HDFS Write -> HDFS Read -> ML Train
     Spark in-memory processing (25-100x improvement, less code, language flexible, primarily in-memory): HDFS Read -> Query -> ETL -> ML Train
     Traditional GPU processing (5-10x improvement, more code, language rigid, substantially on GPU): HDFS Read -> GPU Read -> Query -> CPU Write -> GPU Read -> ETL -> CPU Write -> GPU Read -> ML Train
     RAPIDS (50-100x improvement, same code, language flexible, primarily on GPU): Arrow Read -> ETL -> ML Train
  2. Open Source Data Science Ecosystem: familiar Python APIs

     Pipeline: Data Preparation -> Model Training -> Visualization, all in CPU memory, scaled with Dask
     Pandas (analytics), Scikit-Learn (machine learning), NetworkX (graph analytics), PyTorch / Chainer / MxNet (deep learning), Matplotlib / Plotly (visualization)
  3. RAPIDS: End-to-End Accelerated GPU Data Science

     Pipeline: Data Preparation -> Model Training -> Visualization, all in GPU memory, scaled with Dask
     cuDF / cuIO (analytics), cuML (machine learning), cuGraph (graph analytics), PyTorch / Chainer / MxNet (deep learning), cuXfilter <> pyViz (visualization)
  4. Faster Speeds, Real-World Benefits

     End-to-end benchmark, time in seconds (shorter is better), split into cuIO/cuDF load and data preparation, data conversion, and XGBoost training.
     200GB CSV dataset; data prep includes joins and variable transformations.
     CPU cluster configuration: CPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform), Apache Spark.
     DGX cluster configuration: 5x DGX-1 on an InfiniBand network.
     End-to-end times shown: 8762, 6148, 3925, 3221, 322 and 213 seconds.
  5. RAPIDS: End-to-End Accelerated GPU Data Science

     (Same ecosystem diagram as above: cuDF / cuIO, cuML, cuGraph, the deep learning frameworks and cuXfilter <> pyViz on GPU memory, scaled with Dask)
  6. RAPIDS: GPU-accelerated data wrangling and feature engineering

     (Ecosystem diagram, focusing on the cuDF / cuIO data preparation stage)
  7. ETL - the Backbone of Data Science

     cuDF is a Python library:
     • A Python library for manipulating GPU DataFrames following the Pandas API
     • A Python interface to the CUDA C++ library, with additional functionality
     • Creates GPU DataFrames from NumPy arrays, Pandas DataFrames, and PyArrow Tables
     • JIT compilation of user-defined functions (UDFs) using Numba
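
     A hedged sketch of those interoperability points (column names and the UDF are illustrative, and Series.applymap has been renamed in later cuDF releases):

     import numpy as np
     import pandas as pd
     import cudf

     pdf = pd.DataFrame({"a": np.arange(5), "b": np.linspace(0.0, 1.0, 5)})
     gdf = cudf.from_pandas(pdf)            # Pandas DataFrame -> GPU DataFrame
     gdf["c"] = np.ones(5)                  # NumPy array -> GPU column

     # element-wise UDF, JIT-compiled for the GPU via Numba
     gdf["scaled"] = gdf["b"].applymap(lambda x: x * 2.0 + 1.0)
     print(gdf.to_pandas())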
  8. Benchmarks: single-GPU speedup vs. Pandas

     cuDF v0.10 vs Pandas 0.24.2, running on an NVIDIA DGX-1
     GPU: NVIDIA Tesla V100 32GB; CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
     Benchmark setup: DataFrames with 2x int32 key columns and 3x int32 value columns; Merge: inner; GroupBy: count, sum, min, max calculated for each value column
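
     The benchmarked operations roughly correspond to the following cuDF calls (a sketch with placeholder sizes and column names, not the benchmark harness itself):

     import numpy as np
     import cudf

     n = 100_000
     gdf = cudf.DataFrame({
         "key0": np.random.randint(0, 100, n).astype("int32"),
         "key1": np.random.randint(0, 100, n).astype("int32"),
         "val0": np.random.randint(0, 1000, n).astype("int32"),
         "val1": np.random.randint(0, 1000, n).astype("int32"),
         "val2": np.random.randint(0, 1000, n).astype("int32"),
     })

     merged = gdf.merge(gdf, on=["key0", "key1"], how="inner")  # inner merge on the key columns
     aggs = gdf.groupby(["key0", "key1"]).agg(
         {c: ["count", "sum", "min", "max"] for c in ["val0", "val1", "val2"]})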
  9. Extraction is the Cornerstone: cuIO for faster data loading

     • Follows Pandas APIs and provides >10x speedup
     • CSV reader (v0.2), CSV writer (v0.8)
     • Parquet reader (v0.7), Parquet writer (v0.12)
     • ORC reader (v0.7), ORC writer (v0.10)
     • JSON reader (v0.8)
     • Avro reader (v0.9)
     • GPU Direct Storage integration in progress for bypassing PCIe bottlenecks!
     • Key is GPU-accelerating both parsing and decompression wherever possible
     Source: Apache Crail blog: SQL Performance: Part 1 - Input File Formats
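
     Each of those readers is exposed through cudf and returns a GPU DataFrame, with parsing (and, where applicable, decompression) done on the GPU. A short sketch with placeholder file paths:

     import cudf

     df_csv     = cudf.read_csv("data.csv")
     df_parquet = cudf.read_parquet("data.parquet")
     df_orc     = cudf.read_orc("data.orc")
     df_json    = cudf.read_json("data.jsonl", lines=True)
     df_avro    = cudf.read_avro("data.avro")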
  10. RAPIDS: building bridges into the array ecosystem

     (Ecosystem diagram, focusing on the PyTorch / Chainer / MxNet deep learning frameworks alongside cuDF / cuIO, cuML, cuGraph and cuXfilter <> pyViz, scaled with Dask)
  11. (Image-only slide)

  12. Machine Learning: more models, more problems

     (Ecosystem diagram, focusing on the cuML model training stage)
  13. Algorithms: GPU-accelerated Scikit-Learn

     Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors, Support Vector Machine Classification
     Inference: Random Forest / GBDT inference
     Clustering: K-Means, DBSCAN, Spectral Clustering
     Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding, T-SNE
     Time Series: Holt-Winters, Kalman Filtering, ARIMA
     Plus Cross Validation and Hyper-parameter Tuning - more to come!
     Key: preexisting vs. NEW or enhanced for 0.11
  14. RAPIDS matches common Python APIs: CPU-based clustering

     from sklearn.datasets import make_moons
     import pandas

     X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
     X = pandas.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

     # Find clusters
     from sklearn.cluster import DBSCAN

     dbscan = DBSCAN(eps=0.3, min_samples=5)
     y_hat = dbscan.fit_predict(X)
  15. RAPIDS matches common Python APIs: GPU-accelerated clustering

     from sklearn.datasets import make_moons
     import cudf

     X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
     X = cudf.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

     # Find clusters
     from cuml import DBSCAN

     dbscan = DBSCAN(eps=0.3, min_samples=5)
     y_hat = dbscan.fit_predict(X)
  16. Forest Inference: taking models from training to production

     cuML's Forest Inference Library (FIL) accelerates prediction (inference) for random forests and boosted decision trees:
     • Works with existing saved models (XGBoost, LightGBM, scikit-learn RF; cuML RF soon)
     • Lightweight Python API
     • A single V100 GPU can infer up to 34x faster than XGBoost on a dual-CPU node
     • Over 100 million forest inferences per second (with 1000 trees) on a DGX-1
     (Chart: speedups of 23x, 36x, 34x and 23x across the benchmarked models)
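
     A sketch of that lightweight Python API, assuming a pre-trained XGBoost binary classifier saved at "xgb.model" (the path, feature file and keyword arguments are illustrative; check the cuML docs for your version):

     import cudf
     from cuml import ForestInference

     fil = ForestInference.load("xgb.model",        # saved XGBoost model file
                                model_type="xgboost",
                                output_class=True)  # classification rather than regression
     X = cudf.read_csv("features.csv")
     preds = fil.predict(X)                         # batched inference on the GPU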
  17. Graph Analytics: more connections, more insights

     (Ecosystem diagram, focusing on the cuGraph graph analytics stage)
  18. Algorithms: GPU-accelerated NetworkX

     Community: Spectral Clustering (Balanced-Cut, Modularity Maximization), Louvain, Subgraph Extraction, KCore and KCore Number, Triangle Counting
     Components: Weakly Connected Components, Strongly Connected Components
     Link Analysis: Page Rank (Multi-GPU), Personal Page Rank, Katz
     Link Prediction: Jaccard, Weighted Jaccard, Overlap Coefficient
     Traversal: Single Source Shortest Path (SSSP), Breadth First Search (BFS)
     Structure: COO-to-CSR (Multi-GPU), Transpose, Renumbering
     Plus a query language and utilities, with multi-GPU support - more to come!
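
     As a hedged sketch of the NetworkX-like API (column names are illustrative, and loader names have shifted slightly across cuGraph releases), PageRank on a small cuDF edge list looks like:

     import cudf
     import cugraph

     edges = cudf.DataFrame({"src": [0, 1, 2, 2], "dst": [1, 2, 0, 3]})
     G = cugraph.Graph()
     G.from_cudf_edgelist(edges, source="src", destination="dst")

     pr = cugraph.pagerank(G)   # cuDF DataFrame with vertex and pagerank columns
     print(pr.sort_values("pagerank", ascending=False).to_pandas())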
  19. Goals and Benefits of cuGraph

     Focus on features and user experience:
     • Seamless integration with cuDF and cuML - property graph support via DataFrames
     • Breakthrough performance - up to 500 million edges on a single 32GB GPU, with multi-GPU support for scaling into the billions of edges
     • Multiple APIs - Python: familiar NetworkX-like API; C/C++: lower-level granular control for application developers
     • Growing functionality - extensive collection of algorithm, primitive, and utility functions
  20. Benchmarks: single-GPU cuGraph vs NetworkX

     Dataset                  Nodes        Edges
     preferentialAttachment   100,000      999,970
     caidaRouterLevel         192,244      1,218,132
     coAuthorsDBLP            299,067      299,067
     dblp-2010                326,186      1,615,400
     citationCiteseer         268,495      2,313,294
     coPapersDBLP             540,486      30,491,458
     coPapersCiteseer         434,102      32,073,440
     as-Skitter               1,696,415    22,190,596
  21. cuSpatial

     • Seamless integration into RAPIDS - cuDF for data loading, cuGraph for routing optimization, and cuML for clustering are just a few examples
     • Growing functionality - extensive collection of algorithm, primitive, and utility functions for spatial analytics
     • Breakthrough performance and ease of use - up to 1000x faster than CPU spatial libraries, with Python and C++ APIs for maximum usability and integration
  22. cuSpatial: performance at a glance

     Point-in-Polygon Test - 1.3+ million vehicle point locations, 27 Regions of Interest:
       cuSpatial 1.11 ms (C++), 1.50 ms (Python) [NVIDIA Titan V] vs 334 ms (C++, optimized serial), 130,468.2 ms (Python Shapely API, serial) [Intel i7-7800X] - speedup 301X (C++), 86,978X (Python)
     Haversine Distance Computation - 13+ million monthly NYC taxi trip pickup and drop-off locations:
       cuSpatial 7.61 ms (Python) [NVIDIA T4] vs 416.9 ms (Numba) [NVIDIA T4] - speedup 54.7X (Python)
     Hausdorff Distance Computation (for clustering) - 52,800 trajectories with 1.3+ million points:
       cuSpatial 13.5 s [Quadro V100] vs 19,227.5 s (Python SciPy API, serial) [Intel i7-6700K] - speedup 1,400X (Python)
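
     A hedged sketch of the haversine case above, assuming cuspatial.haversine_distance accepts pickup/drop-off longitude and latitude columns as cuDF Series (the argument order, column names and file path are assumptions; consult the cuSpatial docs for your version):

     import cudf
     import cuspatial

     trips = cudf.read_csv("nyc_taxi.csv")   # placeholder path and schema
     dist_km = cuspatial.haversine_distance(
         trips["pickup_longitude"],  trips["pickup_latitude"],
         trips["dropoff_longitude"], trips["dropoff_latitude"])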
  23. Scaling RAPIDS with Dask

     (Ecosystem diagram, focusing on Dask as the scaling layer across the whole pipeline)
  24. Why Dask?

     PyData native: easy migration (built on top of NumPy, Pandas, Scikit-Learn, etc.), easy training (with the same APIs), and trusted (with the same developer community)
     Easy scalability: easy to install and use on a laptop, scales out to thousand-node clusters
     Popular: the most common parallelism framework today in the PyData and SciPy community
     Deployable: HPC (SLURM, PBS, LSF, SGE), cloud (Kubernetes, AWS, Azure), Hadoop/Spark (YARN)
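
     A minimal sketch of combining Dask with RAPIDS: dask-cuda starts one worker per GPU, and the same DataFrame code runs partitioned across them (the CSV glob and column names are placeholders):

     from dask_cuda import LocalCUDACluster
     from dask.distributed import Client
     import dask_cudf

     cluster = LocalCUDACluster()   # one Dask worker per visible GPU
     client = Client(cluster)

     ddf = dask_cudf.read_csv("data-*.csv")
     result = ddf.groupby("key")["value"].mean().compute()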
  25. Scale up with RAPIDS

     Scale up / accelerate:
     PyData: NumPy, Pandas, Scikit-Learn, Numba and many more - single CPU core, in-memory data
     RAPIDS and others - accelerated on a single GPU: NumPy -> CuPy/PyTorch/..., Pandas -> cuDF, Scikit-Learn -> cuML, Numba -> Numba
  26. Scale out with RAPIDS + Dask with OpenUCX

     Scale up / accelerate and scale out / parallelize:
     PyData: NumPy, Pandas, Scikit-Learn, Numba and many more - single CPU core, in-memory data
     Multi-core and distributed PyData: NumPy -> Dask Array, Pandas -> Dask DataFrame, Scikit-Learn -> Dask-ML, ... -> Dask Futures
     RAPIDS and others - accelerated on a single GPU: NumPy -> CuPy/PyTorch/..., Pandas -> cuDF, Scikit-Learn -> cuML, Numba -> Numba
     RAPIDS + Dask with OpenUCX: multi-GPU, on a single node (DGX) or across a cluster
  27. Data Movement and Transformation: the bane of productivity and performance

     (Diagram: APP A and APP B each hold GPU data, but sharing it goes via the CPU - read data, load data, and repeated copy & convert steps between host and device)
  28. Data Movement and Transformation: what if we could keep data on the GPU?

     (Same diagram, with the goal of handing data directly between APP A and APP B in GPU memory, avoiding the CPU copy & convert steps)
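
     A hedged sketch of what keeping data on the GPU looks like in practice: a CuPy array already in device memory is handed to cuDF via the __cuda_array_interface__ protocol, so no host copy-and-convert step is needed (whether a given hand-off is fully zero-copy depends on the library versions involved):

     import cupy
     import cudf

     x = cupy.random.random(1_000_000)   # data already in GPU memory ("APP A")
     gdf = cudf.DataFrame({"x": x})      # cuDF adopts the device data ("APP B")
     gdf["y"] = gdf["x"] * 2             # keep computing on the GPU, no host round trip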
  29. NEP18: the NumPy __array_function__ protocol

     import numpy as np
     import cupy

     x = np.random.random((10000, 1000))
     y = cupy.array(x)

     u, s, v = np.linalg.svd(x)  # 3min 11s
     u, s, v = np.linalg.svd(y)  # 19.1 s
  30. Why OpenUCX? Bringing hardware-accelerated communications to Dask

     • TCP sockets are slow!
     • UCX provides uniform access to transports (TCP, InfiniBand, shared memory, NVLink)
     • Alpha Python bindings for UCX (ucx-py)
     • Will provide the best communication performance to Dask, based on the hardware available on the nodes/cluster

     conda install -c conda-forge -c rapidsai \
         cudatoolkit=<CUDA version> ucx-proc=*=gpu ucx ucx-py
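
     With ucx-py installed, dask-cuda can be asked to use UCX instead of TCP; a hedged sketch (which transports actually help depends on the hardware, e.g. NVLink within a DGX or InfiniBand between nodes):

     from dask_cuda import LocalCUDACluster
     from dask.distributed import Client

     cluster = LocalCUDACluster(protocol="ucx",         # UCX instead of TCP sockets
                                enable_nvlink=True,     # NVLink for intra-node GPU transfers
                                enable_infiniband=False)
     client = Client(cluster)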
  31. Benchmarks: distributed cuDF random merge

     cuDF v0.11, ucx-py 0.11, running on an NVIDIA DGX-2
     GPU: NVIDIA Tesla V100 32GB; CPU: Intel(R) Xeon(R) CPU 8168 @ 2.70GHz
     Benchmark setup: left/right DataFrames with 1x int64 key column and 1x int64 value column; Merge: inner, with 30% of matching data balanced across each partition
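
     The benchmarked operation corresponds roughly to an inner merge of two distributed GPU DataFrames on an int64 key; a sketch with placeholder sizes (not the benchmark harness itself):

     import cudf
     import dask_cudf

     left = dask_cudf.from_cudf(
         cudf.datasets.randomdata(nrows=1_000_000, dtypes={"key": int, "value": int}),
         npartitions=8)
     right = dask_cudf.from_cudf(
         cudf.datasets.randomdata(nrows=1_000_000, dtypes={"key": int, "value": int}),
         npartitions=8)
     merged = left.merge(right, on="key", how="inner")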
  32. (Image-only slide)

  33. Deploy RAPIDS Everywhere

     Focused on robust functionality, deployment, and user experience
     • Integration with major cloud providers: both containers and cloud-specific machine instances (including Cloud Dataproc and Azure Machine Learning)
     • Support for enterprise and HPC orchestration layers
  34. Join the Movement: everyone can help!

     Integrations, feedback, documentation support, pull requests, new issues, or code donations welcomed!
     RAPIDS: https://rapids.ai - @RAPIDSAI
     Dask: https://dask.org - @Dask_dev
     Apache Arrow: https://arrow.apache.org/ - @ApacheArrow
     GPU Open Analytics Initiative: http://gpuopenanalytics.com/ - @GPUOAI