
The State of Higher Performance Python

ianozsvald
November 23, 2022


We’ll review the state of the art in the data science world for common number-crunching tasks on small to big data. Topics we’ll cover include profiling, compilation and data manipulation. We’ll also review the near future for Python, Numba, Pandas, Dask and Polars and I’ll help you make some pragmatic choices about tools you might invest time in. We’ll have plenty of time to discuss your use cases and problems you might have encountered.

By: https://ianozsvald.com/


Transcript

  1. Expert Briefing – The State of Higher Performance Python
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    PyDataGlobal 2022


  2. •Compiling/Pandas/Scaling briefing
    •Find bottlenecks so you can go faster - efficiently
    •Q&A – what’s blocking you?
    Today’s goal
    By [ian]@ianozsvald[.com] Ian Ozsvald



  3. •Interim Chief Data Scientist
    •20+ years experience
    •Team coaching & public courses
    –I'm sharing from my Higher Performance Python course
    Introductions
    By [ian]@ianozsvald[.com] Ian Ozsvald
    2nd Edition!


  4. •Pandas is default (but 5X memory usage hurts!)
    •Introspection into tools like Pandas is possible but rare
    •Python is interpreted, single-core, the GIL is bad, it can't scale – switch to another language!? (obviously - no)
    Where we’re at
    By [ian]@ianozsvald[.com] Ian Ozsvald


  5. •Python 3.11 runs code in 40-80% of the time of 3.10
    •A JIT is planned for a later CPython release
    •This doesn't affect NumPy, Pandas and friends (sadly)
    Python 3.11
    By [ian]@ianozsvald[.com] Ian Ozsvald


  6. •Time-based fn viewer
    •Notebook & module support
    •Great for drilling into complex code – e.g. Pandas!
    •See what's happening down the call-stack by time-taken
    VizTracer
    By [ian]@ianozsvald[.com] Ian Ozsvald
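
A minimal sketch of tracing a slow block with VizTracer as described above; the traced function and the output file name are illustrative, not from the talk. Open the resulting report with the vizviewer command to walk the call stack ordered by time taken.

import pandas as pd
from viztracer import VizTracer

def expensive_groupby(df):
    # Stand-in for the "complex code" you want to drill into.
    return df.groupby("key")["value"].mean()

df = pd.DataFrame({"key": list("ababab") * 1000, "value": range(6000)})

# Trace a block of code and write a JSON report for the viewer.
with VizTracer(output_file="trace.json"):
    expensive_groupby(df)

# Then inspect the time-ordered call stack with:
#   vizviewer trace.json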


  7. •Just-in-time compiler for math (NumPy & pure Python)
    •No direct Pandas support (though Pandas itself has some Numba integration)
    •Specialises to your CPU, or general pre-compiled library (ahead-of-time); optional strict type declarations
    •GPU and parallel-CPU support
    Compilation with Numba
    By [ian]@ianozsvald[.com] Ian Ozsvald
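
A hedged Numba sketch of the kind of NumPy/pure-Python loop it compiles; the function and data are invented for illustration. The first call pays the compilation cost, later calls run machine code specialised to your CPU, and prange spreads the loop over cores.

import numpy as np
from numba import njit, prange

@njit(parallel=True)  # JIT-compile on first call; parallel loop via prange
def mean_abs_diff(arr):
    total = 0.0
    for i in prange(arr.shape[0] - 1):
        total += abs(arr[i + 1] - arr[i])  # scalar reduction across threads
    return total / (arr.shape[0] - 1)

x = np.random.default_rng(0).standard_normal(1_000_000)
mean_abs_diff(x)         # first call: compile + run
print(mean_abs_diff(x))  # later calls: compiled code only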


  8. Compilation with Numba
    By [ian]@ianozsvald[.com] Ian Ozsvald


  9. •Platform – e.g. integrated with sklearn, plotting tools
    •Increasing integration of Numba
    •Making e.g. groupby.apply fast is hard/painful
    •Arrow datatypes offer faster string dtype (experimental)
    •Parallelisation is hard (don’t hold your breath, see Dask)
    Pandas v1.5+
    By [ian]@ianozsvald[.com] Ian Ozsvald


  10. Pandas v1.5+
    By [ian]@ianozsvald[.com] Ian Ozsvald
    670MB vs 140MB for unique strings
    Bigger savings for repeated strings
    6x speed-up for this operation
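
A hedged sketch of the experimental Arrow-backed string dtype referred to above, assuming pyarrow is installed; the data and the sizes you will see are illustrative, not the figures from the slide.

import numpy as np
import pandas as pd

words = np.array([f"user_{i}" for i in range(1_000_000)])  # unique strings

s_object = pd.Series(words)                          # default object-backed strings
s_arrow = pd.Series(words, dtype="string[pyarrow]")  # Arrow-backed strings

print(s_object.memory_usage(deep=True) // 1_000_000, "MB")
print(s_arrow.memory_usage(deep=True) // 1_000_000, "MB")

# String methods such as .str.startswith("user_1") can also run faster
# on the Arrow-backed dtype.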



  11. •.rolling lets us make a "rolling window" on our data
    •Great for time-series e.g. rolling 1w or 10min
    •User-defined functions traditionally crazy-slow
    •Numba can make faster functions for little effort
    Rolling + Numba
    By [ian]@ianozsvald[.com] Ian Ozsvald


  12. Rolling operations
    By [ian]@ianozsvald[.com] Ian Ozsvald
    "raw arrays" are the underlying NumPy arrays – this only works for NumPy arrays, not Pandas Extension types
    40x speed-up to Cython
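
A hedged sketch of the pattern from the last two slides: the same user-defined rolling function run with the default engine and with engine="numba". The Numba engine requires raw=True so the function receives the underlying NumPy arrays; the data and window size are invented.

import numpy as np
import pandas as pd

def mad(window):
    # Mean absolute deviation of one raw NumPy window.
    return np.mean(np.abs(window - np.mean(window)))

idx = pd.date_range("2022-01-01", periods=100_000, freq="min")
s = pd.Series(np.random.default_rng(0).standard_normal(len(idx)), index=idx)

slow = s.rolling(60).apply(mad, raw=True)                  # pure-Python per window
fast = s.rolling(60).apply(mad, raw=True, engine="numba")  # Numba-compiled
assert np.allclose(slow, fast, equal_nan=True)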



  13. •Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)
    •Popular for scaling Pandas
    •Wraps Pandas DataFrames
    Dask for larger data
    By [ian]@ianozsvald[.com] Ian Ozsvald
    4 Dask talks at PyDataGlobal!



  14. •Bigger than RAM or "I want to use all my cores"
    •Generally you'll use Parquet (or CSV or many choices)
    •The Dashboard gives rich diagnostics
    •Write Pandas-like code
    •Lazy – use ".compute()"
    Dask & Pandas for larger datasets
    By [ian]@ianozsvald[.com] Ian Ozsvald
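
A minimal Dask sketch of the lazy, Pandas-like workflow described above; the Parquet path and column names are invented. Creating a distributed Client is also what gives you the diagnostics Dashboard.

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster; client.dashboard_link opens the diagnostics Dashboard

ddf = dd.read_parquet("trades/*.parquet")       # larger-than-RAM dataset, read lazily
result = ddf.groupby("symbol")["price"].mean()  # builds a task graph, nothing runs yet

print(result.compute())  # .compute() executes the graph across all cores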


  15. •Vaex – medium data, HDF5 data, similar Pandas API
    •Polars – in-memory data, Arrow dtypes, less-similar Pandas API
    •Modin – extends Pandas (Ray/Dask) but low adoption
    •Bodo - commercial Pandas compiler+paralleliser
    Vaex/Polars/Modin/Bodo
    By [ian]@ianozsvald[.com] Ian Ozsvald
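
For comparison, a hedged Polars sketch of its lazy, Arrow-backed query API; the file and column names are invented, and the group_by method was spelled groupby in older Polars releases.

import polars as pl

query = (
    pl.scan_csv("trades.csv")  # lazy scan – nothing is read yet
    .filter(pl.col("price") > 0)
    .group_by("symbol")
    .agg(pl.col("price").mean().alias("mean_price"))
)

print(query.collect())  # the optimiser plans and runs the whole query in parallel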


  16. Modin example with Ray (the Dask backend performed worse)
    By [ian]@ianozsvald[.com] Ian Ozsvald
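
A hedged sketch of Modin as a drop-in Pandas replacement on the Ray engine; the file and column names are invented.

import os
os.environ["MODIN_ENGINE"] = "ray"  # "dask" is the other common backend

import modin.pandas as pd  # same API surface as Pandas for most calls

df = pd.read_csv("big_file.csv")         # the read itself is parallelised across cores
print(df.groupby("key")["value"].sum())  # unchanged Pandas-style code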


  17. By [ian]@ianozsvald[.com] Ian Ozsvald



  18. •Take advantage of the new tools
    •Newsletter:
    •See blog for my classes + many past talks
    •I'd love a postcard if you learned something new!
    Summary
    By [ian]@ianozsvald[.com] Ian Ozsvald
    Get my cheatsheet: http://bit.ly/hpp_cheatsheet


  19. •What blockers do you have?
    •Have you had success with Vaex/Modin/Polars/Rapids/Bodo/…?
    •What holds you up with Pandas?
    Q&A...
    By [ian]@ianozsvald[.com] Ian Ozsvald
