
RAPIDS and cuDF: accelerating DataFrames on GPUs

Keith Kraus
November 16, 2019


The Python data science stack is composed of a rich set of powerful libraries that work wonderfully well together, providing coherent, beautiful, Pythonic APIs that let the data scientist think less about programming and more about the data. However, many of these libraries are largely single-threaded (e.g., Pandas, Scikit-Learn), and as data workflows grow larger, they quickly run up against this limitation. RAPIDS is a suite of open-source libraries that provide APIs nearly identical to existing popular Python libraries. By leveraging the massively parallel processing capabilities of GPUs, RAPIDS libraries can provide speedups of 50x or more over their CPU-only counterparts. cuDF is a GPU DataFrame library following the Pandas API, cuML is a GPU machine learning library following the Scikit-Learn API, and cuGraph is a GPU graph analytics library with an API inspired by NetworkX. This talk provides an overview of the RAPIDS ecosystem, with a focus on the cuDF library, its features, and its design. We'll show how cuDF combines Numba, Cython, modern C++, CUDA, and Apache Arrow to build a DataFrame library that is both highly performant and highly interoperable with the rest of the PyData ecosystem. We'll show examples of workflows using cuDF both on a single GPU and across multiple GPUs in conjunction with the Dask library, and we'll share performance results, best practices, tips, and tricks.



Transcript

  1. Data Processing Evolution: faster data access, less data movement
    Hadoop processing, reading from disk: each stage of the pipeline (Query, ETL, ML Train) is bracketed by an HDFS read and an HDFS write.
  2. Data Processing Evolution: faster data access, less data movement
    Spark in-memory processing removes the intermediate HDFS writes and reads between Query, ETL, and ML Train: 25-100x improvement, less code, language flexible, primarily in-memory.
  3. Spark is not Enough: basic workloads are bottlenecked by the CPU
    • In a simple benchmark consisting of aggregating data, the CPU is the bottleneck
    • This is after the data is parsed and cached into memory, which is another common bottleneck
    • The CPU bottleneck is even worse in more complex workloads!
    SELECT cab_type, count(*) FROM trips_orc GROUP BY cab_type;
    Source: Mark Litwintschik's blog, "1.1 Billion Taxi Rides: EC2 versus EMR"
  4. Data Processing Evolution: faster data access, less data movement
    Traditional GPU processing wraps each stage (Query, ETL, ML Train) in a GPU read and a CPU write: 5-10x improvement over Spark, but more code, language rigid, and only substantially (not entirely) on GPU.
  5. Why GPUs? Numerous hardware advantages
    • Thousands of cores with up to ~15 TFLOPS of general-purpose compute performance
    • Up to 1 TB/s of memory bandwidth
    • Hardware interconnects for up to 300 GB/s bidirectional GPU <-> GPU bandwidth
    • Can scale up to 16 GPUs in a single node
    Almost never run out of compute relative to memory bandwidth!
  6. Data Movement and Transformation: the bane of productivity and performance
    Diagram: two applications, APP A and APP B, each read and load data, then repeatedly copy and convert it between CPU memory and their own GPU data formats.
  7. Data Movement and Transformation: what if we could keep data on the GPU?
    Diagram: the same two applications, but with the copy-and-convert steps between CPU and GPU eliminated; data stays in GPU memory as it moves between APP A and APP B.
  8. Learning from Apache Arrow (from the Apache Arrow home page, https://arrow.apache.org/)
    Without Arrow:
    • Each system has its own internal memory format
    • 70-80% of computation wasted on serialization and deserialization
    • Similar functionality implemented in multiple projects
    With Arrow:
    • All systems utilize the same memory format
    • No overhead for cross-system communication
    • Projects can share functionality (e.g., a Parquet-to-Arrow reader)
  9. Data Processing Evolution: faster data access, less data movement
    RAPIDS, built on Arrow, runs Query, ETL, and ML Train entirely on the GPU after a single Arrow read: 50-100x improvement, same code, language flexible, primarily on GPU (versus 25-100x for Spark in-memory processing and 5-10x for traditional GPU processing).
  10. The Python Data Science Stack
    • Mostly single-threaded: unused potential of computing power
    • Deep learning frameworks are a notable exception
    • Parallel computing: joblib, Numba, Dask, etc., and many, many others
  11. GPUs are hard
    • Too much data movement
    • Writing GPU code is hard
    • Fragmented ecosystem
    • No Python API for data manipulation
  12. Python + GPUs: a winning combination
    • Python for productivity, GPUs for performance
    • Hides GPU programming from application developers
    • Success demonstrated by various deep learning frameworks
    • Other examples: Numba, CuPy, PyCUDA/PyOpenCL
  13. Open Source Data Science Ecosystem: familiar Python APIs
    Diagram: data lives in CPU memory; Pandas handles data preparation and analytics, Scikit-Learn machine learning, NetworkX graph analytics, PyTorch/Chainer/MxNet deep learning, and Matplotlib/Seaborn visualization, with Dask spanning the stack.
  14. RAPIDS: end-to-end GPU-accelerated data science
    Diagram: the same stack with data in GPU memory; cuDF + cuIO handle data preparation and analytics, cuML machine learning, cuGraph graph analytics, PyTorch/Chainer/MxNet deep learning, and cuXfilter <> pyViz visualization, with Dask spanning the stack.
  15. RAPIDS matches common Python APIs: GPU-accelerated clustering (CPU baseline)
      from sklearn.cluster import DBSCAN
      from sklearn.datasets import make_moons
      import pandas

      X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
      X = pandas.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

      dbscan = DBSCAN(eps=0.3, min_samples=5)
      y_hat = dbscan.fit_predict(X)  # sklearn's DBSCAN has no predict(); fit_predict returns the labels
  16. RAPIDS matches common Python APIs: GPU-accelerated clustering (with RAPIDS)
      from cuml import DBSCAN
      from sklearn.datasets import make_moons
      import cudf

      X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
      X = cudf.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

      dbscan = DBSCAN(eps=0.3, min_samples=5)
      y_hat = dbscan.fit_predict(X)  # same API as the sklearn version, running on the GPU
  17. GPU-Accelerated ETL
    The average data scientist spends 90+% of their time in ETL as opposed to training models.
  18. GPU-Accelerated ETL: cuDF is…
    • A Python library for manipulating GPU DataFrames, following the Pandas API
    • A Python interface to a CUDA C++ library with additional functionality
    • Creates GPU DataFrames from NumPy arrays, Pandas DataFrames, and PyArrow Tables (see the sketch below)
    • Keeps data on the GPU through the entire workflow
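    To make that interoperability concrete, a minimal sketch (not from the slides; column names are illustrative) of moving data between Pandas, cuDF, and Arrow:

      import pandas as pd
      import cudf

      pdf = pd.DataFrame({"id": [1, 2, 3], "val": [0.1, 0.2, 0.3]})

      gdf = cudf.DataFrame.from_pandas(pdf)   # copy host data into GPU memory
      gdf["val2"] = gdf["val"] * 2            # computed on the GPU

      table = gdf.to_arrow()                  # export as a PyArrow Table
      back = gdf.to_pandas()                  # copy back to host only when needed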
  19. GPU-Accelerated ETL: what to expect
      In [1]: import cudf
      In [2]: cudf.set_allocator(pool=True)  # pre-allocate a GPU memory pool to avoid per-allocation overhead
      In [3]: df = cudf.datasets.randomdata(nrows=10_000_000)
      In [4]: pdf = df.to_pandas()
      In [5]: %timeit -n 5 -r 5 df.groupby('id').sum()
      18.9 ms ± 12.5 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
      In [6]: %timeit -n 5 -r 5 pdf.groupby('id').sum()
      140 ms ± 9.18 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
  20. GPU-Accelerated ETL: what to expect
      In [1]: import cudf
      In [2]: cudf.set_allocator(pool=True)
      In [3]: df1 = cudf.datasets.randomdata(nrows=10_000_000)
      In [4]: df2 = cudf.datasets.randomdata(nrows=1_000_000)
      In [5]: pdf1 = df1.to_pandas(); pdf2 = df2.to_pandas()
      In [6]: %timeit -n5 -r5 df1.join(df2, on='id', lsuffix='left')
      17.7 ms ± 8.41 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
      In [7]: %timeit -n5 -r5 pdf1.join(pdf2, on='id', lsuffix='left')
      788 ms ± 1.81 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
  21. GPU-Accelerated ETL: what to expect
    Benchmark setup: cuDF v0.10, Pandas 0.24.2, running on an NVIDIA DGX-1
    GPU: NVIDIA Tesla V100 32GB; CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
    DataFrames: 2 int32 key columns, 3 int32 value columns
    Merge: inner; GroupBy: count, sum, min, max calculated for each value column
  22. GPU-Accelerated ETL: cuIO, GPU-accelerated I/O
    • Follows Pandas APIs and provides >10x speedup (see the example below)
    • The key is GPU-accelerating both parsing and decompression wherever possible
    • Supported formats: Avro, CSV, JSON, ORC, Parquet
    • GPUDirect Storage integration in progress for bypassing PCIe bottlenecks!
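    Reading a file is the same one-liner as in Pandas; a minimal sketch (file names are placeholders):

      import cudf

      # parsing and decompression run on the GPU
      df_csv = cudf.read_csv("trips.csv")
      df_orc = cudf.read_orc("trips.orc")
      df_parquet = cudf.read_parquet("trips.parquet")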
  23. GPU-Accelerated ETL: string support
    Current v0.10 string support:
    • Regular expressions
    • Element-wise operations: split, find, extract, cat, typecasting, etc.
    • String GroupBys, Joins
    • Categorical columns fully on GPU
    Future v0.11+ string support:
    • Native string columns in libcudf (C++ layer)
    • Extensive performance optimization
    • More Pandas String API compatibility
    • JIT-compiled string UDFs
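    A short usage sketch of the Pandas-style .str accessor described above (not from the slides; exact return types varied across early cuDF releases):

      import cudf

      s = cudf.Series(["GPU DataFrames", "RAPIDS cuDF", None])

      print(s.str.lower())          # element-wise lowercase (nulls propagate)
      print(s.str.contains("cuDF")) # element-wise pattern match
      print(s.str.split(" "))       # split on whitespace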
  24. GPU-Accelerated ETL: user-defined functions
      >>> import cudf
      >>> import numpy as np
      >>> import math
      >>> b = cudf.Series([16, 25, 36, 49, 64, 81], dtype=np.float64)
      >>> def some_func(A):
      ...     b = 0
      ...     for a in A:
      ...         b = b + math.sqrt(a)
      ...     return b
      >>> print(b.rolling(3, min_periods=1).apply(some_func))
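    Rolling windows are one of several UDF entry points; element-wise UDFs work too, via applymap, which cuDF JIT-compiles with Numba. A minimal sketch (the function name is illustrative):

      >>> import cudf
      >>> s = cudf.Series([1.0, 2.0, 3.0, 4.0])
      >>> def squared_plus_one(x):
      ...     return x * x + 1.0
      >>> # compiled with Numba and executed on the GPU
      >>> print(s.applymap(squared_plus_one))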
  25. GPU-Accelerated ETL: scaling out to multiple GPUs
    Scale up / accelerate:
    • PyData (single CPU core, in-memory data): NumPy, Pandas, Scikit-Learn, Numba, and many more
    • RAPIDS (accelerated on a single GPU): NumPy -> CuPy/PyTorch/…, Pandas -> cuDF, Scikit-Learn -> cuML, Numba -> Numba
    Scale out / distribute:
    • Dask (multi-core and distributed PyData): NumPy -> Dask Array, Pandas -> Dask DataFrame, Scikit-Learn -> Dask-ML, … -> Dask Futures
    • RAPIDS + Dask with OpenUCX: multi-GPU on a single node (such as a DGX) or across a cluster
  26. Apache Arrow: performant, flexible, cross-platform data format
    • Columnar memory format optimized for performance and parallel processing (on GPUs as well as CPUs)
    • Supports a number of programming languages (C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust)
    • arrow.apache.org/
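    Because cuDF uses the Arrow memory layout internally, round-tripping with PyArrow is direct; a minimal sketch (column names are illustrative):

      import pyarrow as pa
      import cudf

      # an Arrow table in host memory
      table = pa.Table.from_pydict({"id": [1, 2, 3], "val": [0.1, 0.2, 0.3]})

      gdf = cudf.DataFrame.from_arrow(table)  # import into GPU memory
      roundtrip = gdf.to_arrow()              # export back as an Arrow Table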
  27. libcudf: C++ layer for cuDF
    • Low-level library containing function implementations and a C/C++ API
    • CUDA kernels to perform element-wise math operations on GPU DataFrame columns
    • CUDA sort, join, groupby, reduction, etc. operations on GPU DataFrames
    • Importing/exporting Apache Arrow in GPU memory using CUDA IPC

      // C++ API
      cudf::column some_function(
          cudf::column input_1,
          cudf::column input_2) {
        // do something with inputs
        return output;
      }
  28. Cython: Python wrapper around libcudf
    • Cython is a superset of Python that allows wrapping C/C++ APIs, exposing them as Python APIs
    • Supports modern C++ features such as templates and smart pointers
    • Integrates well with PyData; used in libraries like Pandas and scikit-learn

      # cython declaration:
      cdef extern from "cudf/function.hpp":
          column some_function(
              column input_1,
              column input_2)

      # python wrapper:
      def py_some_function(a, b):
          return some_function(a.c_obj, b.c_obj)
  29. Numba
    Numba enables GPU computing from Python by compiling functions into code that executes on the GPU: JIT compilation of user-defined functions.

      from numba import cuda
      import numpy as np

      @cuda.jit(device=True)
      def a_device_function(a, b):
          return a + b

      # JITed functions can be passed to
      # libcudf APIs
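    For context, a complete compile-and-launch example in the same vein (not from the slides; the kernel name and sizes are illustrative):

      from numba import cuda
      import numpy as np

      @cuda.jit
      def add_one(arr):
          i = cuda.grid(1)        # global thread index
          if i < arr.size:        # guard against extra threads
              arr[i] += 1.0

      data = np.arange(8, dtype=np.float64)
      d_data = cuda.to_device(data)            # copy input to the GPU
      threads = 32
      blocks = (data.size + threads - 1) // threads
      add_one[blocks, threads](d_data)         # launch the kernel
      print(d_data.copy_to_host())             # [1. 2. ... 8.]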
  30. dask-cudf and OpenUCX: scaling cuDF up/out to multiple GPUs
    • Dask extends arrays and DataFrames (including cuDF DataFrames) to distributed arrays and DataFrames
    • Constructs a task graph which can execute on multiple GPUs on a single node (scaling up) or across nodes (scaling out)
    • OpenUCX provides optimized communication across GPUs
    • Operating on DataFrames larger than available GPU memory "just works" with Dask

      import dask_cudf
      ddf = dask_cudf.read_csv("*.csv")
      # operations like join() or groupby()
      # on ddf are distributed across GPUs!
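    Spinning up one Dask worker per GPU takes a few lines; a minimal sketch, assuming the dask-cuda package is installed (file names are placeholders):

      from dask.distributed import Client
      from dask_cuda import LocalCUDACluster
      import dask_cudf

      cluster = LocalCUDACluster()   # one worker per visible GPU on this node
      client = Client(cluster)

      ddf = dask_cudf.read_csv("data-*.csv")       # placeholder file pattern
      result = ddf.groupby("id").sum().compute()   # executes across all GPUs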
  31. Explore: RAPIDS code and blogs
    Check out our code and how we use it:
    https://github.com/rapidsai
    https://medium.com/rapids-ai
  32. Explore: notebooks
    The Notebooks Contrib repo has tutorials, examples, and various end-to-end demos. The RAPIDS YouTube channel has explanations, code walkthroughs, and use cases.
  33. Join the Movement: everyone can help!
    Integrations, feedback, documentation support, pull requests, new issues, and code donations are all welcome!
    • RAPIDS: https://rapids.ai (@RAPIDSAI)
    • Dask: https://dask.org (@Dask_dev)
    • Apache Arrow: https://arrow.apache.org/ (@ApacheArrow)
    • GPU Open Analytics Initiative: http://gpuopenanalytics.com/ (@GPUOAI)