
Higher Performance Python for Data Science

ianozsvald
November 30, 2021

Interview with Richard at Coiled.io on how we can make data science with Python go faster. I give an update on the state of Python virtual machines (so many!), profiling options (line_profiler, viztracer, memory_profiler, ipython_memory_usage), compilation (numba), vectorisations, faster pandas ideas (including bottleneck & numexpr) and then scaling with Dask/Vaex/Polars. More details are often shared on my newsletter https://buttondown.email/NotANumber and all my past public talks are on my blog: https://ianozsvald.com/about-me/


Transcript

  1. Today's goal
     • Profiling/Compiling/Pandas/Scaling briefing
     • Find bottlenecks so you can go faster – efficiently
     • Q&A – what's blocking you?
  2. Questions? Please ask!
     • Add questions and your stories to the chat
     • We'll do live Q&A at the end
  3. Introductions
     • Interim Chief Data Scientist
     • 20+ years experience
     • Team coaching & public courses – I'm sharing from my Higher Performance Python course (2nd Edition!)
  4. Where we're at
     • Pandas is the default (but 5x memory usage hurts!)
     • Profiling is ignored by many (to their misfortune!)
     • "Python is interpreted, single-core, the GIL is bad, it can't scale – switch to another language!?" (obviously – no)
  5. The "Mark Shannon Plan"
     • 5x faster by 2027, Guido + Mark at Microsoft
     • Won't change normal behaviour
     • 3.10 has internal speedups, 3.11 (in ~1 year) a JIT, ...
     https://github.com/markshannon/faster-cpython/blob/master/funding.md
  6. Stick with what we know – CPython
     • numpy/pandas compatible, most mindshare
     • numpy/pandas/sklearn now "foundational"
     • Other libraries (e.g. Polars) are interesting but hard to displace when e.g. pandas is a foundation for DataFrames
  7. Audience
     • Think on how you currently profile
     • Can you share your tips, successes or war stories?
     • Richard – when do you profile regular Python code? What's been most useful?
  8. line_profiler
     • Deterministic line-by-line profiler – great for numpy/pandas
     • Notebook & module-based, does have an overhead
     • >10 yrs old, 2M+ downloads/yr (est.)
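
     A minimal sketch of driving line_profiler from a script (the function name and workload are illustrative, not from the talk); in a Notebook the equivalent is %load_ext line_profiler followed by %lprun -f expensive expensive(1_000_000):

        from line_profiler import LineProfiler

        def expensive(n):
            # Python-level loop: expect most of the reported time on this line
            total = 0.0
            for x in range(n):
                total += x ** 0.5
            return total

        if __name__ == "__main__":
            lp = LineProfiler()
            lp.add_function(expensive)        # register the function to be timed
            lp.runcall(expensive, 1_000_000)  # run it under the profiler
            lp.print_stats()                  # per-line hits, time and % time
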
  9. VizTracer
     • Time-based function viewer
     • Notebook & module support
     • Great for drilling into complex code – e.g. Pandas!
     • See what's happening down the call-stack by time taken
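
     A small, hedged example of tracing a pandas workload with VizTracer (the DataFrame here is invented for illustration); the saved trace is then browsed with the vizviewer tool to see the call-stack by time taken:

        import numpy as np
        import pandas as pd
        from viztracer import VizTracer

        df = pd.DataFrame({"a": np.random.rand(100_000),
                           "b": np.random.rand(100_000)})

        # Everything inside the block is traced; open the result with `vizviewer trace.json`
        with VizTracer(output_file="trace.json"):
            df["c"] = df["a"] * df["b"]
            summary = df.groupby(df["a"] > 0.5).sum()
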
  10. memory_profiler + extension
      • It is less common to profile RAM, but it is sometimes rewarding
      • Useful for finding slowdowns due to "too much work happening" – especially in Pandas in a Notebook
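
     A sketch of module-based use of memory_profiler (the function and array sizes are illustrative); run it with `python -m memory_profiler yourscript.py` to get a line-by-line RAM report. The ipython_memory_usage extension mentioned above gives the per-cell equivalent in a Notebook:

        import numpy as np
        from memory_profiler import profile

        @profile                               # prints a line-by-line memory report
        def build_arrays():
            a = np.ones((1_000, 10_000))       # roughly 80 MB allocated here
            b = a * 2                          # a temporary roughly doubles the footprint
            return b.sum()

        if __name__ == "__main__":
            build_arrays()
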
  11. Audience
      • Think on how you write NumPy and Pandas code
      • Can you share your tips, successes or war stories?
      • Richard – any observations on how to write Pandas "well" for support and execution speed?
  12. Prefer vectorisation for faster code
      Circa 0GB allocated on the left but many, many object creations; on the right, with numpy, an 800MB allocation followed by vector operations (note: ipython_memory_usage to track memory & time usage)
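
     The slide's figure isn't reproduced here, but a minimal illustrative comparison of the two approaches (sizes invented to make the allocation visible) looks like this:

        import numpy as np

        n = 10_000_000

        # Loop version: millions of temporary Python float objects, little NumPy memory
        total = 0.0
        for i in range(n):
            total += i * 0.5

        # Vectorised version: one large (~80 MB) float64 allocation, then vector operations
        arr = np.arange(n, dtype=np.float64)
        total_np = (arr * 0.5).sum()
        print(total, total_np)
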
  13. Compilation with Numba
      • Just-in-time compiler for math (NumPy & pure Python)
      • No direct Pandas support (though Pandas has some Numba support)
      • Specialises to your CPU, or builds a general pre-compiled library (ahead-of-time); optional strict type declarations
      • GPU and parallel-CPU support
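
     A minimal Numba sketch (the function is illustrative, not from the talk): the first call compiles the function for the argument types it sees, later calls run as machine code:

        import numpy as np
        from numba import njit

        @njit                                  # compile to machine code on first call
        def sum_of_squares(a):
            total = 0.0
            for i in range(a.shape[0]):        # explicit loops are cheap once compiled
                total += a[i] * a[i]
            return total

        arr = np.random.rand(1_000_000)
        sum_of_squares(arr)                    # first call pays the compilation cost
        result = sum_of_squares(arr)           # subsequent calls run at compiled speed
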
  14. Pandas v1+
      • Platform – e.g. integrated with sklearn, plotting tools
      • Increasing integration of Numba
      • Making e.g. groupby.apply fast is hard/painful
      • Arrow datatypes offer a faster string dtype (experimental – see the sketch below)
      • Parallelisation is hard (don't hold your breath, see Dask)
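
     The sketch below shows the experimental Arrow-backed string dtype mentioned above; it assumes pandas 1.3+ with pyarrow installed:

        import pandas as pd

        s_object = pd.Series(["fast", "strings", "please"])      # classic object-backed strings
        s_arrow = pd.Series(["fast", "strings", "please"],
                            dtype="string[pyarrow]")              # Arrow-backed StringDtype
        print(s_object.dtype, s_arrow.dtype)
        print(s_arrow.str.upper())                                # the usual .str methods still work
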
  15. Rolling and Numba
      • .rolling lets us make a "rolling window" on our data
      • Great for time-series, e.g. rolling 1w or 10min windows
      • User-defined functions are traditionally crazy-slow
      • Numba can make these functions faster for little effort
  16. Numba on rolling operations
      “Raw arrays” are the underlying NumPy arrays – this only works for NumPy arrays, not Pandas Extension types
      20x speed-up
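
     A hedged sketch of the pattern the two slides above describe (the window length and data are invented): raw=True hands the user-defined function the underlying NumPy array, and engine="numba" compiles it:

        import numpy as np
        import pandas as pd

        ser = pd.Series(np.random.rand(1_000_000))

        def udf_mean(window):                  # plain NumPy maths only, no Pandas objects
            return window.mean()

        # Traditional path: the UDF is called once per window in Python (slow)
        slow = ser.rolling(60).apply(udf_mean, raw=True)

        # Numba path: the UDF is compiled; the first call pays a compilation cost
        fast = ser.rolling(60).apply(udf_mean, raw=True, engine="numba")
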
  17. Vectorised pre-calculation is faster
      1.5x faster – but maybe harder to read, and to remember in a month what was happening?
  18. Bottleneck & numexpr – install these!
      • Some helpful dependencies aren't installed by default
      • You should install these – especially bottleneck (a quick availability check follows below)
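
     A quick check (a sketch, not from the talk) of whether the optional accelerators are importable in the current environment:

        import importlib.util

        for pkg in ("bottleneck", "numexpr"):
            found = importlib.util.find_spec(pkg) is not None
            print(pkg, "installed" if found else "missing - install it for free speed-ups")
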
  19. Use “numexpr” via “eval”
      Bottleneck (and numexpr) are not installed by default and they offer free speed-ups, so install them! If numexpr is not installed then pd.eval still works – but it runs as slowly as the non-eval equivalent (and you don't get any warnings).
      ipython_memory_usage for diagnostics
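
     A small sketch of the eval route (column names invented): pd.eval and DataFrame.eval evaluate the whole expression in one pass, via numexpr when it is installed, and silently fall back to the slow path when it is not, as the slide warns:

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({"a": np.random.rand(1_000_000),
                           "b": np.random.rand(1_000_000)})

        # Regular Pandas: each intermediate result is materialised as a full array
        regular = df["a"] * df["b"] + df["a"] ** 2

        # eval route: one expression, evaluated by numexpr when available
        via_eval = pd.eval("df.a * df.b + df.a ** 2")
        df = df.eval("c = a * b + a ** 2")
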
  20. Audience
      • How do you deal with "bigger than RAM" data?
      • What happens when Pandas throws an OOM error?
      • Richard – I'd love to hear how people transition into Dask – especially war stories! It isn't all smooth, but then it isn't so difficult either
  21. Dask for larger data
      • Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)
      • Popular for scaling Pandas
      • Wraps Pandas DataFrames
      6 Dask talks at PyDataGlobal!
  22. Dask & Pandas for larger datasets
      • Bigger than RAM, or "I want to use all my cores"
      • Generally you'll use Parquet (or CSV or many choices)
      • The Dashboard gives rich diagnostics
      • Write Pandas-like code
      • Lazy – use ".compute()"
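
     A hedged sketch of the workflow described above; the Parquet path and column names are placeholders, not from the talk:

        import dask.dataframe as dd
        from dask.distributed import Client

        client = Client()                                  # local cluster; its dashboard gives the rich diagnostics

        ddf = dd.read_parquet("data/*.parquet")            # lazy: builds a task graph, nothing is read yet
        result = ddf.groupby("category")["value"].mean()   # Pandas-like and still lazy
        print(result.compute())                            # .compute() triggers the actual work
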
  23. Vaex/Polars/Modin
      • Vaex – medium data, HDF5 data, similar Pandas API
      • Polars – in-memory data, Arrow dtypes, less-similar Pandas API
      • Modin – extends Pandas (Ray/Dask) but low adoption
      • Each takes you off the beaten path – good to experiment with, but you'll lose the wide Pandas ecosystem (a short Polars example follows below)
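
     A tiny illustrative Polars snippet (column names invented) showing the "less-similar", expression-based API the slide refers to:

        import polars as pl

        df = pl.DataFrame({"category": ["a", "b", "a"],
                           "value": [1.0, 2.0, 3.0]})

        # Expressions built with pl.col(...) replace much of the label/index-based Pandas style
        out = df.filter(pl.col("value") > 1.5).select([
            pl.col("category"),
            (pl.col("value") * 2).alias("value_doubled"),
        ])
        print(out)
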
  24. Summary
      • Reflect, debug, profile – go slow to go fast
      • My newsletter has tips (and jobs)
      • See my blog for my classes + many past talks
      • I'd love a postcard if you learned something new!
      Get my cheatsheet: http://bit.ly/hpp_cheatsheet
  25. Q&A...
      • What blockers do you have?
      • Have you had success with Vaex/Modin/Polars/Rapids/…?
      • What holds you up with Pandas?