Slide 1

Slide 1 text

cudf.pandas: the Zero Code Change GPU Accelerator for pandas
Jacob Tomlinson, Senior Software Engineer | PyData Exeter Meetup Feb 2023

Slide 2

Slide 2 text

NVIDIA RAPIDS PyData Team
OSS PyData maintainers hired by NVIDIA to make GPU acceleration ubiquitous
and many more…

Slide 3

Slide 3 text

Outline
• Pandas and alternatives
• Demo
• How it works
• FAQs and limitations
• Conclusion

Slide 4

Slide 4 text

Pandas and alternatives
• Pandas is great (but slow)
• Why is it slow?
  o Largely single-threaded
  o Not a query engine!
• Many alternatives:
  o Faster underlying implementation (C++, Rust, CUDA)
  o Query engines
  o SQL-inspired
  o Distributed computing
  o Hardware accelerated (GPUs)
Figure: Results of the H2O.ai benchmark maintained by DuckDB: https://duckdblabs.github.io/db-benchmark/

Slide 5

Slide 5 text

cuDF: GPU DataFrames
• Pandas-like API, runs on the GPU
• Powered by CUDA and libcudf, a C++ DataFrame library for GPUs
• Operations are ~10-100x faster than pandas
• Provides 60-75% of the pandas API
• Not what this talk is about!
Figure: cuDF speedups relative to pandas for a number of different operations (NVIDIA A100, AMD EPYC 7642 48-Core Processor)

Slide 6

Slide 6 text

"Should I switch from pandas to something else?"
• Reasons to use something other than pandas:
  o Performance above all
  o Data size
  o Rewriting code is not a problem
• Reasons to use pandas:
  o API flexibility
  o Collaboration
  o Ecosystem built on pandas
  o pandas is getting faster

Slide 7

Slide 7 text

What is cudf.pandas?
• Lets you keep using pandas
  o Accelerates it on the GPU with no changes
• 100% of the pandas API
  o Uses the GPU for supported operations
  o Falls back to using the CPU otherwise
• 3rd-party code acceleration
  o Everything is accelerated. No one changes their code
Jupyter/IPython: %load_ext cudf.pandas
Command line: python -m cudf.pandas script.py
Direct import: import cudf.pandas, then cudf.pandas.install(), before pandas is imported
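The direct-import route can be sketched as below. This is a minimal sketch, not code from the talk: `enable_cudf_pandas` is a hypothetical helper name, and the checks for whether cuDF and pandas are installed are an added assumption so that the same script also runs on a machine without RAPIDS. The one ordering rule it relies on is that `cudf.pandas.install()` has to run before the first `import pandas`.

```python
import importlib.util


def enable_cudf_pandas() -> bool:
    """Hypothetical helper: turn on cudf.pandas when cuDF is installed.

    Returns True if the accelerator was installed, False if cuDF is absent
    and the script will simply run on plain pandas.
    """
    if importlib.util.find_spec("cudf") is None:
        return False
    import cudf.pandas

    # Must happen before the first `import pandas` anywhere in the process.
    cudf.pandas.install()
    return True


accelerated = enable_cudf_pandas()

# Ordinary pandas code follows; nothing below knows about the GPU.
# The guard only exists so the sketch also runs where pandas is missing.
if importlib.util.find_spec("pandas") is not None:
    import pandas as pd

    df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
    totals = df.groupby("key")["value"].sum()
    print(totals.to_dict(), "| GPU accelerated:", accelerated)
```

In a notebook the `%load_ext cudf.pandas` line does the same job, and for a whole script `python -m cudf.pandas script.py` avoids touching the source at all.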

Slide 8

Slide 8 text

Demo Time
https://github.com/shwina/pydata-global-2023-demo

Slide 9

Slide 9 text

Demo Recap
• Accelerates your code on GPUs with no changes
• Key to getting good performance is to minimize CPU execution
• 3rd-party libraries written to use pandas can be accelerated on GPUs
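Why minimizing CPU execution matters so much can be estimated with Amdahl's law. The sketch below is an illustration rather than a measurement from the demo: `overall_speedup` is a hypothetical helper, the 100x GPU speedup is an assumed figure taken from the upper end of the range quoted earlier in the talk, and the model ignores the cost of moving data between CPU and GPU on each fallback.

```python
def overall_speedup(gpu_fraction: float, gpu_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the work is accelerated.

    gpu_fraction: share of the original runtime spent in operations that run
                  on the GPU (0.0 to 1.0); the rest falls back to the CPU.
    gpu_speedup:  how much faster those operations become on the GPU.
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    if gpu_speedup <= 0:
        raise ValueError("gpu_speedup must be positive")
    return 1.0 / ((1.0 - gpu_fraction) + gpu_fraction / gpu_speedup)


# Assumed 100x speedup for the accelerated part; vary how much work falls back.
for fraction in (0.5, 0.9, 0.99, 1.0):
    print(f"{fraction:.0%} on GPU -> {overall_speedup(fraction, 100.0):.1f}x overall")
```

Under these assumptions, a workload that spends half of its time in CPU fallbacks cannot get much past 2x overall, however fast the GPU part is.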

Slide 10

Slide 10 text

Under the hood
• How does it work?

Slide 11

Slide 11 text

Under the hood
• How does it work?
  o Proxy objects that dispatch to cudf or pandas
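The proxy idea can be sketched with a toy example. This is an assumption-laden illustration, not the cudf.pandas implementation: `FastBackend`, `SlowBackend` and `Proxy` are hypothetical stand-ins for cuDF, pandas and the proxy type, and a real proxy also has to move data between GPU and CPU memory when it falls back.

```python
class FastBackend:
    """Stand-in for the GPU library: fast, but implements only part of the API."""

    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(self.values)


class SlowBackend:
    """Stand-in for the CPU library: implements the full API."""

    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(self.values)

    def median(self):
        ordered = sorted(self.values)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2


class Proxy:
    """Tries the fast backend first and falls back to the slow one."""

    def __init__(self, values):
        self._fast = FastBackend(values)
        self._slow = None    # built lazily on the first fallback
        self.fallbacks = []  # names of the methods that ran on the slow path

    def __getattr__(self, name):
        # Only reached for attributes that normal lookup did not find.
        if name.startswith("_"):
            raise AttributeError(name)

        def call(*args, **kwargs):
            try:
                return getattr(self._fast, name)(*args, **kwargs)
            except (AttributeError, NotImplementedError):
                self.fallbacks.append(name)
                if self._slow is None:
                    # Stands in for copying the data from GPU to CPU.
                    self._slow = SlowBackend(self._fast.values)
                return getattr(self._slow, name)(*args, **kwargs)

        return call


frame = Proxy([3, 1, 2])
print(frame.total())    # served by the fast backend
print(frame.median())   # not available on the fast backend, so it falls back
print(frame.fallbacks)
```

Recording which calls fell back, as `fallbacks` does here, is also the kind of information a profiler needs in order to point at slow spots.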

Slide 12

Slide 12 text

Under the hood
• How does it work?
  o Proxy objects that dispatch to cudf or pandas
  o Deep import customization to hijack pandas imports
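Import customization of this kind can be built on Python's standard import hooks. The sketch below shows only the basic mechanism, under stated assumptions: `RedirectFinder`, `toy_frames` and `toy_fast_frames` are hypothetical names, and the real cudf.pandas machinery is considerably deeper (it also has to wrap the contents of the module in proxies).

```python
import importlib.abc
import importlib.util
import sys
import types


class RedirectFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Answer imports of selected names with a replacement module object."""

    def __init__(self, redirects):
        # Mapping of requested module name -> module object to hand back.
        self.redirects = redirects

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.redirects:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # let the normal import system handle everything else

    def create_module(self, spec):
        return self.redirects[spec.name]

    def exec_module(self, module):
        pass  # the replacement module is already fully initialised


# A hypothetical accelerated module standing in for the wrapped library.
fast = types.ModuleType("toy_fast_frames")
fast.BACKEND = "accelerated"

# Put the finder first so it sees the import before the default finders.
sys.meta_path.insert(0, RedirectFinder({"toy_frames": fast}))

import toy_frames  # no file with this name exists; the finder answers instead

print(toy_frames.BACKEND)  # accelerated
```

Because the hook lives in `sys.meta_path`, it applies to every later import in the process, including imports made inside third-party libraries, which is why it has to be installed before the target module is first imported.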

Slide 13

Slide 13 text

Under the hood
• How does it work?
  o Proxy objects that dispatch to cudf or pandas
  o Deep import customization to hijack pandas imports
• What about...

Slide 14

Slide 14 text

Under the hood
• How does it work?
  o Proxy objects that dispatch to cudf or pandas
  o Deep import customization to hijack pandas imports
• What about...
  o Duck typing?
    ▪ Doesn't work for free functions like pd.read_csv
    ▪ Lots of code doing hard isinstance checks
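Both duck-typing problems can be reproduced in a few lines. This is a toy illustration with hypothetical names (`frames` plays the role of the pandas module, `FastFrame` the role of a look-alike GPU class); it is not a description of how cudf.pandas is implemented, only of why swapping what the module itself exposes sidesteps the two issues.

```python
import types


class Frame:
    """The type that third-party code expects."""

    def __init__(self, rows):
        self.rows = list(rows)

    def count(self):
        return len(self.rows)


class FastFrame:
    """Duck-typed look-alike: same methods, but an unrelated type."""

    def __init__(self, rows):
        self.rows = list(rows)

    def count(self):
        return len(self.rows)


# Stand-in for the library module, with a class and a free function.
frames = types.SimpleNamespace(
    Frame=Frame,
    read_rows=lambda text: Frame(text.splitlines()),
)


def strict_count(obj):
    """Third-party style function with a hard isinstance check."""
    if not isinstance(obj, frames.Frame):
        raise TypeError("expected a frames.Frame")
    return obj.count()


duck = FastFrame(["a", "b"])

# Problem 1: the look-alike is rejected by the isinstance check.
try:
    strict_count(duck)
    rejected = False
except TypeError:
    rejected = True

# Problem 2: free functions keep returning the original type.
loaded_type_before = type(frames.read_rows("a\nb")).__name__

# Replacing what the module exposes addresses both at once.
frames.Frame = FastFrame
frames.read_rows = lambda text: FastFrame(text.splitlines())
accepted_count = strict_count(duck)
loaded_type_after = type(frames.read_rows("a\nb")).__name__

print(rejected, loaded_type_before, accepted_count, loaded_type_after)
```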

Slide 15

Slide 15 text

Under the hood
• How does it work?
  o Proxy objects that dispatch to cudf or pandas
  o Deep import customization to hijack pandas imports
• What about...
  o Duck typing?
    ▪ Doesn't work for free functions like pd.read_csv
    ▪ Lots of code doing hard isinstance checks
  o DataFrame Standard API?
    ▪ Solves a different problem (developer-focused API)
    ▪ Exciting possibilities!
      • Fallback to a faster DataFrame library like Polars?

Slide 16

Slide 16 text

FAQs

Slide 17

Slide 17 text

FAQs
• Will my code run up to 100x faster with no code changes?
  o Yes, with idiomatic pandas usage
  o The profiler helps you identify where it's falling back to the CPU
    § As a bonus, you'll likely improve performance on CPUs

Slide 18

Slide 18 text

FAQs
• Will my code run up to 100x faster with no code changes?
  o Yes, with idiomatic pandas usage
  o The profiler helps you identify where it's falling back to the CPU
    § As a bonus, you'll likely improve performance on CPUs
• How much of the pandas API does this support?
  o 100%, with the following caveats
    § Some operations fall back to using the CPU via pandas
    § There may be small differences from pandas
  o We test against the pandas unit test suite (94% of tests passing)

Slide 19

Slide 19 text

FAQs
• Will my code run up to 100x faster with no code changes?
  o Yes, with idiomatic pandas usage
  o The profiler helps you identify where it's falling back to the CPU
    § As a bonus, you'll likely improve performance on CPUs
• How much of the pandas API does this support?
  o 100%, with the following caveats
    § Some operations fall back to using the CPU via pandas
    § There may be small differences from pandas
  o We test against the pandas unit test suite (94% of tests passing)
• Will cudf.pandas work with <library>?
  o Yes, if the library uses pandas in a standard way
  o Some known limitations:
    § isinstance() checks for NumPy arrays
    § Use of the C-API to talk to NumPy or pandas
    § Subclassing pd.DataFrame (this kinda works)

Slide 20

Slide 20 text

FAQs
• Will my code run up to 100x faster with no code changes?
  o Yes, with idiomatic pandas usage
  o The profiler helps you identify where it's falling back to the CPU
    § As a bonus, you'll likely improve performance on CPUs
• How much of the pandas API does this support?
  o 100%, with the following caveats
    § Some operations fall back to using the CPU via pandas
    § There may be small differences from pandas
  o We test against the pandas unit test suite (94% of tests passing)
• Will cudf.pandas work with <library>?
  o Yes, if the library uses pandas in a standard way
  o Some known limitations:
    § isinstance() checks for NumPy arrays
    § Use of the C-API to talk to NumPy or pandas
    § Subclassing pd.DataFrame (this kinda works)
• What about working with data larger than GPU memory?
  o Right now, this will fall back to using the CPU

Slide 21

Slide 21 text

Get started with cudf.pandas
• Code for today's talk:
  o https://github.com/shwina/pydata-global-2023-demo
• Try it on Google Colab:
  o https://nvda.ws/rapids-cudf
• Report issues or feedback on our GitHub repo!
  o https://github.com/rapidsai/cudf

Slide 22

Slide 22 text

Thank you!
Social links at https://jacobtomlinson.dev