Slide 1

Higher Performance Python for Data Science
Ian Ozsvald (@IanOzsvald – ianozsvald.com), Coiled.io, 2021

Slide 2

Today’s goal
•Profiling/Compiling/Pandas/Scaling briefing
•Find bottlenecks so you can go faster - efficiently
•Q&A – what’s blocking you?

Slide 3

Questions? Please ask!
•Add questions and your stories to the chat
•We’ll do live Q&A at the end

Slide 4

Introductions
•Interim Chief Data Scientist
•20+ years experience
•Team coaching & public courses – I’m sharing from my Higher Performance Python course
•High Performance Python book – 2nd Edition!

Slide 5

Where we’re at
•Pandas is the default (but 5x memory usage hurts!)
•Profiling is ignored by many (to their misfortune!)
•Python is interpreted, single-core, the GIL is bad, it can’t scale – switch to another language!? (obviously - no)

Slide 6

TIOBE number 1
“2nd best at everything”

Slide 7

Other Pythons?

Slide 8

The “Mark Shannon Plan”
•5x faster by 2027, Guido + Mark at Microsoft
•Won’t change normal behaviour
•3.10 has internal speedups, 3.11 (in 1 yr) a JIT, ...
https://github.com/markshannon/faster-cpython/blob/master/funding.md

Slide 9

Stick with what we know - CPython
•numpy/pandas compatible, most mindshare
•numpy/pandas/sklearn now “foundational”
•Other libraries (e.g. Polars) interesting but hard to displace when e.g. pandas is a foundation for DataFrames

Slide 10

Audience
•Think on how you currently profile
•Can you share your tips, success stories or war stories?
•Richard – when do you profile regular Python code? What’s been most useful?

Slide 11

line_profiler
•Deterministic line-by-line profiler – great for numpy/pandas
•Notebook & module-based, does have an overhead
•>10 yrs, 2M+ downloads/yr (est.)
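A minimal sketch of driving line_profiler programmatically; the slow_mean function and the array size are made-up examples, not from the talk:

import numpy as np
from line_profiler import LineProfiler

def slow_mean(values):
    total = 0.0
    for v in values:               # per-line timings will show this loop dominating
        total += v
    return total / len(values)

values = np.random.random(1_000_000)
profiler = LineProfiler()
profiled_mean = profiler(slow_mean)    # wrap the function so each line is timed
profiled_mean(values)
profiler.print_stats()

In a notebook the same result comes from %load_ext line_profiler followed by %lprun -f slow_mean slow_mean(values).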

Slide 12

VizTracer
•Time-based function viewer
•Notebook & module support
•Great for drilling into complex code – e.g. Pandas!
•See what’s happening down the call-stack by time taken
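A minimal VizTracer sketch, assuming a made-up summarise function as the code under investigation:

import pandas as pd
from viztracer import VizTracer

def summarise(df):
    return df.groupby("key")["value"].mean()

df = pd.DataFrame({"key": ["a", "b"] * 50_000, "value": range(100_000)})

with VizTracer(output_file="trace.json"):   # records the call stack with timings
    summarise(df)

Open the resulting timeline with vizviewer trace.json, or trace an unmodified script from the command line with viztracer my_script.py.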

Slide 13

memory_profiler + extension
•It is less common to profile RAM, but it is sometimes rewarding
•Useful to find slowdowns due to “too much work happening” – especially with Pandas in a Notebook
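A small memory_profiler sketch; build_frame and its sizes are invented for illustration:

import numpy as np
import pandas as pd
from memory_profiler import profile

@profile                    # prints per-line memory increments when the function runs
def build_frame():
    raw = np.random.random((1_000_000, 5))           # roughly 40 MB of float64
    df = pd.DataFrame(raw, columns=list("abcde"))
    doubled = df * 2                                  # roughly another 40 MB allocated here
    return doubled

if __name__ == "__main__":
    build_frame()

In a notebook, %load_ext memory_profiler and %memit build_frame() give quick one-off measurements.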

Slide 14

Audience
•Think on how you write Numpy and Pandas code
•Can you share your tips, success stories or war stories?
•Richard – any observations on how to write Pandas “well” for support and execution speed?

Slide 15

Prefer vectorisation for faster code
On the left: circa 0 GB allocated but many, many object creations. On the right, with numpy: an 800 MB allocation and then vector operations. (Note ipython_memory_usage to track memory & time usage.)
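A sketch of the contrast, with invented sizes (the talk's own figures are in the slide above):

import numpy as np

values = np.random.random(10_000_000)     # one ~80 MB array

# Loop version: every element is boxed into a Python float, millions of tiny objects
squares_slow = [v * v for v in values]

# Vectorised version: a single ~80 MB result array, the multiply runs in C
squares_fast = values * values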

Slide 16

Compilation with Numba
•Just-in-time compiler for math (NumPy & pure Python)
•No direct Pandas support (some support in Pandas)
•Specialises to your CPU, or a general pre-compiled library (ahead-of-time); optional strict type declarations
•GPU and parallel-CPU support
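A minimal Numba sketch; the pairwise function is a made-up example of loop-heavy numeric code:

import numpy as np
from numba import njit

@njit                                   # compiled on first call, machine code afterwards
def pairwise_max_diff(arr):
    best = 0.0
    for i in range(arr.shape[0]):       # explicit loops are fine under Numba
        for j in range(i + 1, arr.shape[0]):
            d = abs(arr[i] - arr[j])
            if d > best:
                best = d
    return best

values = np.random.random(2_000)
pairwise_max_diff(values)               # first call includes compilation time
pairwise_max_diff(values)               # subsequent calls take the fast path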

Slide 17

Compilation with Numba

Slide 18

Pandas v1+
•Platform – e.g. integrated with sklearn, plotting tools
•Increasing integration of Numba
•Making e.g. groupby.apply fast is hard/painful
•Arrow datatypes offer a faster string dtype (experimental; see the sketch below)
•Parallelisation is hard (don’t hold your breath, see Dask)
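A hedged sketch of the experimental Arrow-backed string dtype, assuming pandas 1.3+ with pyarrow installed; the data is invented:

import pandas as pd

s = pd.Series(["apple", "banana", "cherry"] * 1_000_000)
s_arrow = s.astype("string[pyarrow]")     # Arrow-backed strings: less memory, faster string ops
print(s.memory_usage(deep=True), s_arrow.memory_usage(deep=True))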

Slide 19

Rolling and Numba
•.rolling lets us make a “rolling window” on our data
•Great for time-series, e.g. rolling 1w or 10min
•User-defined functions are traditionally crazy-slow
•Numba can make these functions faster for little effort

Slide 20

Numba on rolling operations
“Raw arrays” are the underlying NumPy arrays – this only works for NumPy arrays, not Pandas extension types. 20x speed-up.
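A hedged sketch of this pattern, assuming pandas 1.0+ with numba installed; the series and the mean_abs_dev function are invented for illustration:

import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=1_000_000, freq="min")
s = pd.Series(np.random.random(len(idx)), index=idx)

def mean_abs_dev(window):
    return np.abs(window - window.mean()).mean()

# raw=True hands the UDF the underlying NumPy array (no extension types),
# engine="numba" JIT-compiles the UDF instead of calling back into Python per window
result = s.rolling(60).apply(mean_abs_dev, raw=True, engine="numba")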

Slide 21

GroupBy and apply is easy but slow
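A made-up example of the easy-but-slow pattern, where a Python UDF is called once per group:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "shop": np.random.choice(["a", "b", "c", "d"], size=1_000_000),
    "price": np.random.random(1_000_000),
    "quantity": np.random.randint(1, 10, size=1_000_000),
})

def revenue(group):
    return (group["price"] * group["quantity"]).sum()

per_shop = df.groupby("shop").apply(revenue)   # calls back into Python for every group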

Slide 22

Vectorised version will be faster

Slide 23

Vectorised precalculation is faster
1.5x faster – but maybe harder to read, and harder to remember in a month what was happening?
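The same made-up data as the groupby.apply sketch above, reworked with a vectorised precalculation:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "shop": np.random.choice(["a", "b", "c", "d"], size=1_000_000),
    "price": np.random.random(1_000_000),
    "quantity": np.random.randint(1, 10, size=1_000_000),
})

df["revenue"] = df["price"] * df["quantity"]          # one vectorised multiply over all rows
per_shop_fast = df.groupby("shop")["revenue"].sum()   # built-in aggregation, no Python callback

The speed-up comes from replacing a per-group Python function call with a single column operation plus a built-in aggregation.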

Slide 24

Bottleneck & numexpr – install these!
•Some helpful dependencies aren’t installed by default
•You should install these – especially bottleneck

Slide 25

Use “numexpr” via “eval”
Bottleneck (and numexpr) are not installed by default and they offer free speed-ups, so install them! If numexpr is not installed then pd.eval still works, but it runs as slowly as the non-eval equivalent (and you don’t get any warnings). ipython_memory_usage is used for diagnostics.
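A small sketch of eval; the DataFrame shape is invented, and the fast path only applies when numexpr is installed:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((5_000_000, 3)), columns=["a", "b", "c"])

plain = df["a"] + df["b"] * df["c"]   # creates several temporary arrays
fast = df.eval("a + b * c")           # numexpr evaluates the expression in one pass
assert np.allclose(plain, fast)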

Slide 26

Audience
•How do you deal with “bigger than RAM” data?
•What happens when Pandas throws an OOM?
•Richard – I’d love to hear how people transition into Dask – especially war stories! It isn’t all smooth, but then it isn’t so difficult either

Slide 27

Dask for larger data
•Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)
•Popular for scaling Pandas
•Wraps Pandas DataFrames
6 Dask talks at PyDataGlobal!

Slide 28

Dask & Pandas for larger datasets
•Bigger than RAM, or “I want to use all my cores”
•Generally you’ll use Parquet (or CSV, or many other choices)
•The Dashboard gives rich diagnostics
•Write Pandas-like code
•Lazy – use “.compute()”

Slide 29

Tasks and completion state

Slide 30

Task graph for describe operation
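A hedged Dask sketch along these lines; the Parquet path is made up, and Client() starts a local cluster whose Dashboard shows the task stream and graphs like the one above:

import dask.dataframe as dd
from dask.distributed import Client

client = Client()                                  # local cluster; prints a Dashboard link

ddf = dd.read_parquet("transactions/*.parquet")    # lazy: nothing is read yet
summary = ddf.describe()                           # still lazy, only builds a task graph
result = summary.compute()                         # triggers parallel, out-of-core execution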

Slide 31

Vaex/Polars/Modin
•Vaex – medium data, HDF5 data, similar Pandas API
•Polars – in-memory data, Arrow dtypes, less-similar Pandas API
•Modin – extends Pandas (Ray/Dask) but low adoption
•Each takes you off the beaten path – good to experiment with, but you’ll lose the wide Pandas ecosystem

Slide 32

Summary
•Reflect, debug, profile – go slow to go fast
•My newsletter has tips (and jobs)
•See my blog for my classes + many past talks
•I’d love a postcard if you learned something new!
Get my cheatsheet: http://bit.ly/hpp_cheatsheet

Slide 33

Q&A...
•What blockers do you have?
•Have you had success with Vaex/Modin/Polars/Rapids/…?
•What holds you up with Pandas?