PyData Cardiff - RAPIDS 0.11: Open GPU Data Science

A remix of the RAPIDS 0.11 release deck targeted at the PyData Cardiff community.

All release decks
https://docs.rapids.ai/overview

Meetup
https://www.meetup.com/PyData-Cardiff-Meetup/events/268066478/

Abstract
The RAPIDS suite of open source software libraries (https://rapids.ai/) allows you to run data science and analytics pipelines entirely on GPUs, while following familiar Python APIs including NumPy, Pandas and scikit-learn.

RAPIDS relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

RAPIDS also focuses on common data preparation tasks for analytics and data science. This includes a familiar DataFrame API that integrates with a variety of machine learning algorithms for end-to-end pipeline acceleration without paying typical serialization costs. RAPIDS also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger datasets.

Jacob Tomlinson

February 11, 2020

Transcript

  1. Jacob Tomlinson
    Senior Software Engineer, RAPIDS Engineering
    Open GPU Data Science


  2. 2
    Jacob Tomlinson


  3. 3
    What is RAPIDS?


  4. 4
    RAPIDS
    https://github.com/rapidsai


  5. 5
    Data Processing Evolution
    Faster data access, less data movement

    Hadoop Processing, Reading from disk:
    HDFS Read -> Query -> HDFS Write -> HDFS Read -> ETL -> HDFS Write -> HDFS Read -> ML Train

    Spark In-Memory Processing:
    HDFS Read -> Query -> ETL -> ML Train
    25-100x Improvement, less code, language flexible, primarily in-memory

    Traditional GPU Processing:
    HDFS Read -> GPU Read -> Query -> CPU Write -> GPU Read -> ETL -> CPU Write -> GPU Read -> ML Train
    5-10x Improvement, more code, language rigid, substantially on GPU

    RAPIDS:
    Arrow Read -> ETL -> ML Train
    50-100x Improvement, same code, language flexible, primarily on GPU


  6. 6
    Jake VanderPlas - PyCon 2017


  7. 7
    Open Source Data Science Ecosystem
    Familiar Python APIs

    [Stack diagram: Data Preparation -> Model Training -> Visualization, all on CPU Memory, with Dask spanning the stack]
    Analytics: Pandas
    Machine Learning: Scikit-Learn
    Graph Analytics: NetworkX
    Deep Learning: PyTorch, Chainer, MxNet
    Visualization: Matplotlib/Plotly


  8. 8
    RAPIDS
    End-to-End Accelerated GPU Data Science

    [Stack diagram: Data Preparation -> Model Training -> Visualization, all on GPU Memory, with Dask spanning the stack]
    Analytics: cuDF, cuIO
    Machine Learning: cuML
    Graph Analytics: cuGraph
    Deep Learning: PyTorch, Chainer, MxNet
    Visualization: cuXfilter <> pyViz


  9. 9
    Ecosystem Partners


  10. 10
    Faster Speeds, Real-World Benefits
    cuIO/cuDF - Load and Data Preparation, XGBoost Machine Learning, End-to-End

    [Bar chart: time in seconds (shorter is better), split into cuIO/cuDF (Load and Data Prep), Data Conversion and XGBoost; values shown: 8762, 6148, 3925, 3221, 322, 213]

    Benchmark: 200GB CSV dataset; data prep includes joins, variable transformations
    CPU Cluster Configuration: CPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform), Apache Spark
    DGX Cluster Configuration: 5x DGX-1 on InfiniBand network


  11. 11
    Technologies


  12. 12
    RAPIDS
    End-to-End Accelerated GPU Data Science

    [Stack diagram: Data Preparation -> Model Training -> Visualization, all on GPU Memory, with Dask spanning the stack]
    Analytics: cuDF, cuIO
    Machine Learning: cuML
    Graph Analytics: cuGraph
    Deep Learning: PyTorch, Chainer, MxNet
    Visualization: cuXfilter <> pyViz


  13. 13
    cuDF


  14. 14
    RAPIDS
    GPU Accelerated data wrangling and feature engineering

    [Stack diagram: Data Preparation -> Model Training -> Visualization, all on GPU Memory, with Dask spanning the stack]
    Analytics: cuDF, cuIO
    Machine Learning: cuML
    Graph Analytics: cuGraph
    Deep Learning: PyTorch, Chainer, MxNet
    Visualization: cuXfilter <> pyViz


  15. 15
    ETL - the Backbone of Data Science

    cuDF is… a Python Library
    ● A Python library for manipulating GPU DataFrames following the Pandas API
    ● Python interface to the CUDA C++ library with additional functionality
    ● Creates GPU DataFrames from NumPy arrays, Pandas DataFrames, and PyArrow Tables
    ● JIT compilation of User-Defined Functions (UDFs) using Numba
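
    A minimal sketch of this workflow (not from the deck), assuming a CUDA-capable GPU with cudf installed; the column names are made up:

    import pandas as pd
    import cudf

    # Build a small Pandas DataFrame on the CPU
    pdf = pd.DataFrame({"key": [0, 1, 0, 1, 2],
                        "value": [1.0, 2.0, 3.0, 4.0, 5.0]})

    # Copy it to the GPU and use the familiar Pandas-style API
    gdf = cudf.from_pandas(pdf)
    means = gdf.groupby("key").mean()

    # Move results back to the CPU when needed
    print(means.to_pandas())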


  16. 16
    cuDF v0.10, Pandas 0.24.2
    Running on NVIDIA DGX-1:
    GPU: NVIDIA Tesla V100 32GB
    CPU: Intel(R) Xeon(R) CPU E5-2698 v4
    @ 2.20GHz
    Benchmark Setup:
    DataFrames: 2x int32 key columns, 3x int32 value columns
    Merge: inner
    GroupBy: count, sum, min, max calculated
    for each value column
    Benchmarks: single-GPU Speedup vs. Pandas


  17. 17
    Extraction is the Cornerstone
    cuIO for Faster Data Loading

    • Follows Pandas APIs and provides >10x speedup
    • CSV Reader - v0.2, CSV Writer - v0.8
    • Parquet Reader - v0.7, Parquet Writer - v0.12
    • ORC Reader - v0.7, ORC Writer - v0.10
    • JSON Reader - v0.8
    • Avro Reader - v0.9
    • GPU Direct Storage integration in progress for bypassing PCIe bottlenecks!
    • Key is GPU-accelerating both parsing and decompression wherever possible

    Source: Apache Crail blog: SQL Performance: Part 1 - Input File Formats
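
    A short sketch of the readers listed above (file names are placeholders):

    import cudf

    # Each reader parses (and decompresses) directly on the GPU
    df_csv = cudf.read_csv("data.csv")
    df_parquet = cudf.read_parquet("data.parquet")
    df_orc = cudf.read_orc("data.orc")

    # Writers exist for several formats as well
    df_csv.to_csv("data_out.csv")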


  18. 18
    CuPy


  19. 19
    RAPIDS
    Building bridges into the array ecosystem

    [Stack diagram: Data Preparation -> Model Training -> Visualization, all on GPU Memory, with Dask spanning the stack]
    Analytics: cuDF, cuIO
    Machine Learning: cuML
    Graph Analytics: cuGraph
    Deep Learning: PyTorch, Chainer, MxNet
    Visualization: cuXfilter <> pyViz


  20. 20


  21. 21
    Benchmark: single-GPU CuPy vs NumPy
    More details: https://blog.dask.org/2019/06/27/single-gpu-cupy-benchmarks


  22. 22
    SVD Benchmark
    Dask and CuPy Doing Complex Workflows
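
    A rough sketch of the kind of Dask + CuPy workflow benchmarked here (array and chunk sizes are illustrative):

    import cupy
    import dask.array as da

    # A Dask array whose chunks are CuPy arrays, so each block lives on the GPU
    rs = da.random.RandomState(RandomState=cupy.random.RandomState)
    x = rs.random((1_000_000, 1_000), chunks=(10_000, 1_000))

    # Tall-and-skinny SVD computed block-wise on the GPU
    u, s, v = da.linalg.svd(x)
    s = s.compute()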


  23. 23
    cuML


  24. 24
    Machine Learning
    More models more problems

    [Stack diagram: Data Preparation -> Model Training -> Visualization, all on GPU Memory, with Dask spanning the stack]
    Analytics: cuDF, cuIO
    Machine Learning: cuML
    Graph Analytics: cuGraph
    Deep Learning: PyTorch, Chainer, MxNet
    Visualization: cuXfilter <> pyViz


  25. 25
    Algorithms
    GPU-accelerated Scikit-Learn

    Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors, Support Vector Machine Classification
    Inference: Random forest / GBDT inference
    Clustering: K-Means, DBSCAN, Spectral Clustering
    Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding, T-SNE
    Time Series: Holt-Winters, Kalman Filtering, ARIMA
    Cross Validation
    Hyper-parameter Tuning
    More to come!

    Key: ● Preexisting ● NEW or enhanced for 0.11


  26. 26
    RAPIDS matches common Python APIs

    CPU-Based Clustering

    from sklearn.datasets import make_moons
    import pandas

    X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
    X = pandas.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

    Find Clusters

    from sklearn.cluster import DBSCAN

    dbscan = DBSCAN(eps=0.3, min_samples=5)
    dbscan.fit(X)
    y_hat = dbscan.labels_


  27. 27
    RAPIDS matches common Python APIs

    GPU-Accelerated Clustering

    from sklearn.datasets import make_moons
    import cudf

    X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
    X = cudf.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

    Find Clusters

    from cuml import DBSCAN

    dbscan = DBSCAN(eps=0.3, min_samples=5)
    dbscan.fit(X)
    y_hat = dbscan.labels_


  28. 28
    Benchmarks: single-GPU cuML vs scikit-learn
    1x V100 vs. 2x 20-core CPU


  29. 29
    Forest Inference
    Taking models from training to production

    cuML's Forest Inference Library accelerates prediction (inference) for random forests and boosted decision trees:
    ● Works with existing saved models (XGBoost, LightGBM; scikit-learn RF and cuML RF soon)
    ● Lightweight Python API
    ● A single V100 GPU can infer up to 34x faster than a dual-CPU XGBoost node
    ● Over 100 million forest inferences per second (with 1000 trees) on a DGX-1

    [Chart: speedups of 23x, 36x, 34x, 23x]
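
    A minimal sketch of using the Forest Inference Library (the model file name and feature count are placeholders; keyword arguments may differ between cuML versions):

    import numpy as np
    from cuml import ForestInference

    # Load a model previously trained and saved with XGBoost
    fil_model = ForestInference.load("xgboost_model.bst", model_type="xgboost")

    # Rows of features to score; the column count must match the trained model
    X_test = np.random.rand(1000, 20).astype(np.float32)
    predictions = fil_model.predict(X_test)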


  30. 30
    cuGraph


  31. 31
    Graph Analytics
    More connections more insights

    [Stack diagram: Data Preparation -> Model Training -> Visualization, all on GPU Memory, with Dask spanning the stack]
    Analytics: cuDF, cuIO
    Machine Learning: cuML
    Graph Analytics: cuGraph
    Deep Learning: PyTorch, Chainer, MxNet
    Visualization: cuXfilter <> pyViz


  32. 32
    Algorithms
    GPU-accelerated NetworkX

    Categories: Community, Components, Link Analysis, Link Prediction, Traversal, Structure, Multi-GPU, Utilities, Query Language

    Algorithms include: Spectral Clustering (Balanced-Cut and Modularity Maximization), Louvain, Subgraph Extraction, KCore and KCore Number, Jaccard, Weighted Jaccard, Overlap Coefficient, Single Source Shortest Path (SSSP), Breadth First Search (BFS), Triangle Counting, COO-to-CSR (Multi-GPU), Transpose, Weakly Connected Components, Strongly Connected Components, Page Rank (Multi-GPU), Personal Page Rank, Katz, Renumbering

    More to come!
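
    A minimal sketch of the NetworkX-like Python API (the edge-list column names are made up; details may vary between cuGraph versions):

    import cudf
    import cugraph

    # A tiny edge list held in a GPU DataFrame
    edges = cudf.DataFrame({"src": [0, 1, 2, 2],
                            "dst": [1, 2, 0, 3]})

    # Build a graph and run PageRank entirely on the GPU
    G = cugraph.Graph()
    G.from_cudf_edgelist(edges, source="src", destination="dst")
    ranks = cugraph.pagerank(G)   # cuDF DataFrame with vertex and pagerank columns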


  33. 33
    GOALS AND BENEFITS OF CUGRAPH
    Focus on Features and User Experience

    Seamless Integration with cuDF and cuML
    • Property Graph support via DataFrames

    Breakthrough Performance
    • Up to 500 million edges on a single 32GB GPU
    • Multi-GPU support for scaling into the billions of edges

    Multiple APIs
    • Python: Familiar NetworkX-like API
    • C/C++: lower-level granular control for application developers

    Growing Functionality
    • Extensive collection of algorithm, primitive, and utility functions


  34. 34
    Benchmarks: single-GPU cuGraph vs NetworkX
    Dataset Nodes Edges
    preferentialAttachment 100,000 999,970
    caidaRouterLevel 192,244 1,218,132
    coAuthorsDBLP 299,067 299,067
    dblp-2010 326,186 1,615,400
    citationCiteseer 268,495 2,313,294
    coPapersDBLP 540,486 30,491,458
    coPapersCiteseer 434,102 32,073,440
    as-Skitter 1,696,415 22,190,596


  35. 35
    cuSpatial


  36. 36
    cuSpatial

    Seamless Integration into RAPIDS
    • cuDF for data loading, cuGraph for routing optimization, and cuML for clustering are just a few examples

    Growing Functionality
    • Extensive collection of algorithm, primitive, and utility functions for spatial analytics

    Breakthrough Performance & Ease of Use
    • Up to 1000x faster than CPU spatial libraries
    • Python and C++ APIs for maximum usability and integration
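
    A rough sketch of calling one of these operations from Python; the haversine_distance signature here is an assumption based on early cuSpatial releases, so check the cuSpatial docs for the current form:

    import cudf
    import cuspatial

    # Pickup and drop-off coordinates as GPU columns (values are made up)
    pickup_lon = cudf.Series([-73.99, -73.98])
    pickup_lat = cudf.Series([40.73, 40.75])
    dropoff_lon = cudf.Series([-73.97, -73.96])
    dropoff_lat = cudf.Series([40.76, 40.77])

    # Per-trip great-circle distance computed on the GPU
    dist = cuspatial.haversine_distance(pickup_lon, pickup_lat,
                                        dropoff_lon, dropoff_lat)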


  37. 37
    cuSpatial
    Performance at a Glance

    Point-in-Polygon Test
    Input data: 1.3+ million vehicle point locations and 27 Regions of Interest
    cuSpatial runtime: 1.11 ms (C++), 1.50 ms (Python) [Nvidia Titan V]
    Reference runtime: 334 ms (C++, optimized serial), 130,468.2 ms (Python Shapely API, serial) [Intel i7-7800X]
    Speedup: 301X (C++), 86,978X (Python)

    Haversine Distance Computation
    Input data: 13+ million monthly NYC taxi trip pickup and drop-off locations
    cuSpatial runtime: 7.61 ms (Python) [Nvidia T4]
    Reference runtime: 416.9 ms (Numba) [Nvidia T4]
    Speedup: 54.7X (Python)

    Hausdorff Distance Computation (for clustering)
    Input data: 52,800 trajectories with 1.3+ million points
    cuSpatial runtime: 13.5 s [Quadro V100]
    Reference runtime: 19,227.5 s (Python SciPy API, serial) [Intel i7-6700K]
    Speedup: 1,400X (Python)


  38. 38
    Scaling and interoperability


  39. 39
    Dask


  40. 40
    RAPIDS
    Scaling RAPIDS with Dask

    [Stack diagram: Data Preparation -> Model Training -> Visualization, all on GPU Memory, with Dask spanning the stack]
    Analytics: cuDF, cuIO
    Machine Learning: cuML
    Graph Analytics: cuGraph
    Deep Learning: PyTorch, Chainer, MxNet
    Visualization: cuXfilter <> pyViz


  41. 41
    Why Dask?

    PyData Native
    • Easy Migration: Built on top of NumPy, Pandas, Scikit-Learn, etc.
    • Easy Training: With the same APIs
    • Trusted: With the same developer community

    Easy Scalability
    • Easy to install and use on a laptop
    • Scales out to thousand-node clusters

    Popular
    • Most common parallelism framework today in the PyData and SciPy community

    Deployable
    • HPC: SLURM, PBS, LSF, SGE
    • Cloud: Kubernetes, AWS, Azure
    • Hadoop/Spark: Yarn


  42. 42
    Dask
    Dask scales arrays, dataframes and ML APIs
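
    A minimal sketch of scaling a cuDF workflow across the GPUs in one machine (assuming dask-cuda and dask_cudf are installed; the file pattern and column names are placeholders):

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    import dask_cudf

    # One Dask worker per visible GPU on this machine
    cluster = LocalCUDACluster()
    client = Client(cluster)

    # A partitioned GPU DataFrame spread across those workers
    ddf = dask_cudf.read_csv("data-*.csv")
    result = ddf.groupby("key")["value"].mean().compute()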


  43. 43
    Scale up with RAPIDS
    (Scale Up / Accelerate)

    PyData: NumPy, Pandas, Scikit-Learn, Numba and many more - single CPU core, in-memory data

    RAPIDS and Others: accelerated on a single GPU
    NumPy -> CuPy/PyTorch/..
    Pandas -> cuDF
    Scikit-Learn -> cuML
    Numba -> Numba


  44. 44
    Scale out with RAPIDS + Dask with OpenUCX
    (Scale Up / Accelerate and Scale out / Parallelize)

    PyData: NumPy, Pandas, Scikit-Learn, Numba and many more - single CPU core, in-memory data

    RAPIDS and Others: accelerated on a single GPU
    NumPy -> CuPy/PyTorch/..
    Pandas -> cuDF
    Scikit-Learn -> cuML
    Numba -> Numba

    Multi-core and Distributed PyData: Dask
    NumPy -> Dask Array
    Pandas -> Dask DataFrame
    Scikit-Learn -> Dask-ML
    … -> Dask Futures

    RAPIDS + Dask with OpenUCX: Multi-GPU, on a single node (DGX) or across a cluster


  45. 45
    CUDA Array Interface


  46. 46
    Data Movement and Transformation
    The bane of productivity and performance

    [Diagram: APP A and APP B each read and load data, with repeated "Copy & Convert" steps between CPU and GPU memory whenever data moves from one application to the other]


  47. 47
    Data Movement and Transformation
    What if we could keep data on the GPU?

    [Diagram: the same APP A / APP B pipeline, with data kept resident in GPU memory so the CPU "Copy & Convert" steps are avoided]


  48. 48
    Interoperability for the Win
    DLPack and __cuda_array_interface__
    mpi4py
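
    A small sketch (not from the deck) of what this zero-copy interoperability looks like in practice:

    import cudf
    import cupy
    import numba.cuda

    # A cuDF column lives in GPU memory and exposes __cuda_array_interface__
    s = cudf.Series([1.0, 2.0, 3.0])

    # CuPy and Numba can wrap that same device memory without copying
    as_cupy = cupy.asarray(s)
    as_numba = numba.cuda.as_cuda_array(as_cupy)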


  49. 49
    Array function and dispatching


  50. 50
    NEP18
    NumPy __array_function__ protocol

    import numpy as np
    import cupy

    x = np.random.random((10000, 1000))
    y = cupy.array(x)

    u, s, v = np.linalg.svd(x)  # 3min 11s
    u, s, v = np.linalg.svd(y)  # 19.1 s


  51. 51
    UCX


  52. 52
    Why OpenUCX?
    Bringing hardware accelerated communications to Dask

    • TCP sockets are slow!
    • UCX provides uniform access to transports (TCP, InfiniBand, shared memory, NVLink)
    • Alpha Python bindings for UCX (ucx-py)
    • Will provide the best communication performance available to Dask, based on the hardware in the nodes/cluster

    conda install -c conda-forge -c rapidsai \
        cudatoolkit= ucx-proc=*=gpu ucx ucx-py
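
    A sketch of how UCX is typically enabled from Python; the protocol and NVLink flags shown here are assumptions based on dask-cuda of this era and may differ in current releases:

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    # Ask the cluster to communicate over UCX rather than TCP sockets,
    # enabling NVLink between GPUs on the same node
    cluster = LocalCUDACluster(protocol="ucx", enable_nvlink=True)
    client = Client(cluster)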


  53. 53
    cuDF v0.11, UCX-PY 0.11
    Running on NVIDIA DGX-2:
    GPU: NVIDIA Tesla V100 32GB
    CPU: Intel(R) Xeon(R) CPU 8168
    @ 2.70GHz
    Benchmark Setup:
    DataFrames: Left/Right 1x int64 key column, 1x int64 value column
    Merge: inner
    30% of matching data balanced across each partition
    Benchmarks: Distributed cuDF Random Merge


  54. 54
    Getting started


  55. 55
    RAPIDS Docs
    https://docs.rapids.ai


  56. 56
    Easy Installation
    Interactive Installation Guide
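
    The interactive guide produces an install command tailored to your environment; a rough sketch of what such a conda command looked like for this release (channels and versions here are illustrative guesses, not copied from the guide):

    conda create -n rapids-0.11 -c rapidsai -c nvidia -c conda-forge \
        rapids=0.11 python=3.7 cudatoolkit=10.1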


  57. 57


  58. 58
    Deploy RAPIDS Everywhere
    Focused on robust functionality, deployment, and user experience
    Integration with major cloud providers
    Both containers and cloud specific machine instances
    Support for Enterprise and HPC Orchestration Layers
    Cloud Dataproc
    Azure Machine Learning


  59. 59
    Join the Movement
    Everyone can help!
    Integrations, feedback, documentation support, pull requests, new issues, or code donations welcomed!
    Apache Arrow
    https://arrow.apache.org/
    @ApacheArrow

    GPU Open Analytics Initiative
    http://gpuopenanalytics.com/
    @GPUOAI

    RAPIDS
    https://rapids.ai
    @RAPIDSAI

    Dask
    https://dask.org
    @Dask_dev


  60. THANK YOU
    Jacob Tomlinson @_jacobtomlinson
    [email protected]
