
Higher Performance Python for Data Science

ianozsvald
November 30, 2021


Interview with Richard at Coiled.io on how we can make data science with Python go faster. I give an update on the state of Python virtual machines (so many!), profiling options (line_profiler, viztracer, memory_profiler, ipython_memory_usage), compilation (Numba), vectorisation, faster Pandas ideas (including bottleneck & numexpr) and then scaling with Dask/Vaex/Polars. More details are often shared on my newsletter https://buttondown.email/NotANumber and all my past public talks are on my blog: https://ianozsvald.com/about-me/


Transcript

  1. Higher Performance Python for Data Science – @IanOzsvald – ianozsvald.com – Ian Ozsvald – Coiled.io 2021
  2. Today's goal: •Profiling/Compiling/Pandas/Scaling briefing •Find bottlenecks so you can go faster, efficiently •Q&A – what's blocking you?
  3. Questions? Please ask! •Add questions and your stories to the chat •We'll do live Q&A at the end
  4. Introductions: •Interim Chief Data Scientist •20+ years' experience •Team coaching & public courses – I'm sharing from my Higher Performance Python course •2nd Edition!
  5. Where we're at: •Pandas is the default (but 5x memory usage hurts!) •Profiling is ignored by many (to their misfortune!) •Python is interpreted, single-core, the GIL is bad, it can't scale – switch to another language!? (obviously: no)
  6. TIOBE number 1 – "2nd best at everything"
  7. Other Pythons?

  8. The "Mark Shannon Plan": •5x faster by 2027, Guido + Mark, backed by Microsoft •Won't change normal behaviour •3.10 has internal speedups, 3.11 (1 year away) a JIT, ... https://github.com/markshannon/faster-cpython/blob/master/funding.md
  9. Stick with what we know – CPython: •numpy/pandas compatible, most mindshare •numpy/pandas/sklearn now "foundational" •Other libraries (e.g. Polars) are interesting but hard to displace when e.g. pandas is a foundation for DataFrames
  10. Audience: •Think about how you currently profile •Can you share your tips, successes or war stories? •Richard – when do you profile regular Python code? What's been most useful?
  11. line_profiler: •Deterministic line-by-line profiler – great for numpy/pandas •Notebook & module based; does have an overhead •>10 years old, 2M+ downloads/yr (est.)
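
(Not from the slides – a minimal line_profiler sketch using the programmatic API; the function and data are invented for illustration.)

```python
import numpy as np
from line_profiler import LineProfiler

def normalise(arr):
    # two candidate hotspots; line_profiler reports the cost of each line
    centred = arr - arr.mean()
    return centred / arr.std()

profiler = LineProfiler()
profiled_normalise = profiler(normalise)     # wrap the function of interest
profiled_normalise(np.random.randn(10_000_000))
profiler.print_stats()                       # per-line hits, time and % time
```

In a notebook the equivalent is %load_ext line_profiler followed by %lprun -f normalise normalise(data).
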
  12. VizTracer: •Time-based function viewer •Notebook & module support •Great for drilling into complex code – e.g. Pandas! •See what's happening down the call stack by time taken
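
(Illustrative VizTracer sketch, not from the slides; file names are made up.) Trace a block of code, then open the result in the bundled viewer to walk the call stack by time:

```python
import numpy as np
import pandas as pd
from viztracer import VizTracer

df = pd.DataFrame({"a": np.random.randn(1_000_000),
                   "b": np.random.randint(0, 10, 1_000_000)})

# record everything that happens inside the block to a JSON trace
with VizTracer(output_file="groupby_trace.json"):
    df.groupby("b")["a"].mean()

# then, from a shell:  vizviewer groupby_trace.json
```
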
  13. memory_profiler + extension: •Profiling RAM is less common, but sometimes rewarding •Useful for finding slowdowns due to "too much work happening" – especially with Pandas in a Notebook
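
(A small memory_profiler sketch, assumed rather than taken from the talk.) The @profile decorator reports per-line RAM growth, which is how "too much work happening" shows up in Pandas code:

```python
import numpy as np
import pandas as pd
from memory_profiler import profile

@profile  # prints a line-by-line memory report when the function runs
def wasteful():
    df = pd.DataFrame(np.random.randn(5_000_000, 4), columns=list("abcd"))
    tmp = df[df["a"] > 0].copy()        # large intermediate copy
    return (tmp["b"] * tmp["c"]).sum()

if __name__ == "__main__":
    wasteful()
```

In a notebook, %load_ext memory_profiler plus %memit (or the ipython_memory_usage extension) gives similar per-cell feedback.
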
  14. Audience: •Think about how you write NumPy and Pandas code •Can you share your tips, successes or war stories? •Richard – any observations on how to write Pandas "well" for support and execution speed?
  15. Prefer vectorisation for faster code: circa 0 GB allocated on the left but many, many object creations; on the right, with numpy, 800 MB of allocations then vector operations (note ipython_memory_usage to track memory & time usage)
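
(The notebook from this slide isn't reproduced; the sketch below illustrates the same contrast with invented data.) A Python-level loop creates and destroys millions of small objects, while the vectorised version allocates one large array and does the work in C:

```python
import numpy as np

values = np.random.randn(10_000_000)

# loop version: one Python float object per iteration
total = 0.0
for v in values:
    total += v * v

# vectorised version: one big temporary array, then a single reduction
total_vec = (values * values).sum()
```
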
  16. Compilation with Numba: •Just-in-time compiler for maths (NumPy & pure Python) •No direct Pandas support (though some support inside Pandas) •Specialises to your CPU, or builds a general pre-compiled (ahead-of-time) library; optional strict type declarations •GPU and parallel-CPU support
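
(A minimal Numba sketch with an invented example, not the slide's code.) The first call pays a compilation cost specialised to your CPU; subsequent calls run at compiled speed:

```python
import numpy as np
from numba import njit

@njit  # nopython JIT: handles NumPy arrays and plain Python maths
def pairwise_max_distance(points):
    n = points.shape[0]
    best = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.0
            for k in range(points.shape[1]):
                diff = points[i, k] - points[j, k]
                d += diff * diff
            if d > best:
                best = d
    return np.sqrt(best)

points = np.random.rand(2_000, 3)
print(pairwise_max_distance(points))  # first call compiles, later calls are fast
```
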
  17. Compilation with Numba

  18. Pandas v1+: •A platform – e.g. integrated with sklearn, plotting tools •Increasing integration of Numba •Making e.g. groupby.apply fast is hard/painful •Arrow datatypes offer a faster string dtype (experimental) •Parallelisation is hard (don't hold your breath; see Dask)
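
(A tiny illustration of the experimental Arrow-backed string dtype mentioned above; requires pyarrow to be installed, and the data is invented.)

```python
import pandas as pd

# "string[pyarrow]" stores the strings in Arrow memory rather than Python objects
s = pd.Series(["spam", "ham", "eggs"] * 1_000_000, dtype="string[pyarrow]")
print(s.str.contains("spam").sum())
```
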
  19. Rolling and Numba: •.rolling lets us make a "rolling window" on our data •Great for time series, e.g. rolling 1w or 10min •User-defined functions are traditionally crazy-slow •Numba can make these functions faster for little effort
  20. Numba on rolling operations: "raw arrays" are the underlying NumPy arrays – this only works for NumPy arrays, not Pandas extension types. 20x speed-up.
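
(A sketch of the rolling-plus-Numba pattern; the series, window size and UDF are invented.) raw=True hands the function plain NumPy arrays, and engine="numba" (pandas >= 1.0) JIT-compiles it:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=500_000, freq="min")
s = pd.Series(np.random.randn(500_000), index=idx)

def mean_abs(window):
    # receives a raw NumPy array because raw=True
    return np.mean(np.abs(window))

slow = s.rolling(600).apply(mean_abs, raw=True)                   # Python call per window
fast = s.rolling(600).apply(mean_abs, raw=True, engine="numba")   # JIT-compiled UDF
```
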
  21. GroupBy and apply is easy but slow
  22. Vectorised version will be faster

  23. Vectorised precalculation is faster: 1.5x faster – but maybe harder to read, and to remember in a month what was happening?
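
(The slides' own DataFrame isn't shown here; the pattern looks roughly like this, with invented column names.) Do the row-wise arithmetic once, vectorised, then reduce with a built-in groupby aggregation instead of calling a Python function per group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": np.random.randint(0, 1_000, 1_000_000),
    "price": np.random.rand(1_000_000),
    "quantity": np.random.randint(1, 10, 1_000_000),
})

# easy but slow: a Python-level function runs once per group
slow = df.groupby("group").apply(lambda g: (g["price"] * g["quantity"]).sum())

# vectorised precalculation: one array multiply, then a cheap built-in reduction
df["revenue"] = df["price"] * df["quantity"]
fast = df.groupby("group")["revenue"].sum()
```
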
  24. Bottleneck & numexpr – install these! •Some helpful dependencies aren't installed by default •You should install these – especially bottleneck
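
(A quick check, not from the slides, that the optional accelerators are actually present in your environment.)

```python
# pip install bottleneck numexpr
import importlib.util

for lib in ("bottleneck", "numexpr"):
    installed = importlib.util.find_spec(lib) is not None
    print(f"{lib}: {'installed' if installed else 'MISSING - pandas silently falls back'}")
```
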
  25. Use "numexpr" via "eval": Bottleneck (and numexpr) are not installed by default and they offer free speed-ups, so install them! If numexpr is not installed then pd.eval still works – but runs as slowly as the non-eval equivalent (and you don't get any warnings). ipython_memory_usage used for diagnostics.
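
(A hedged df.eval sketch with made-up columns.) With numexpr installed the whole expression is evaluated in one pass with fewer temporaries; without it the same call silently falls back to ordinary speed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5_000_000, 4), columns=list("abcd"))

# plain pandas: each intermediate result allocates a full-size temporary array
plain = df["a"] + df["b"] * df["c"] - df["d"]

# numexpr-backed evaluation of the whole expression in one pass
fast = df.eval("a + b * c - d")
```
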
  26. Audience: •How do you deal with "bigger than RAM" data? •What happens when Pandas throws an OOM? •Richard – I'd love to hear how people transition to Dask – especially war stories! It isn't all smooth, but then it isn't so difficult either
  27. Dask for larger data: •Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex) •Popular for scaling Pandas •Wraps Pandas DataFrames •6 Dask talks at PyData Global!
  28. Dask & Pandas for larger datasets: •Bigger than RAM, or "I want to use all my cores" •Generally you'll use Parquet (or CSV, or many other choices) •The Dashboard gives rich diagnostics •Write Pandas-like code •Lazy – use ".compute()"
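
(A minimal Dask sketch; the file pattern and column names are placeholders.) The code reads like Pandas, stays lazy until .compute(), and the distributed Client provides the diagnostic Dashboard:

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()                 # local cluster; prints a link to the Dashboard

ddf = dd.read_parquet("trips_*.parquet")                        # lazy, partitioned read
result = ddf.groupby("passenger_count")["tip_amount"].mean()    # still lazy

print(result.compute())           # now the task graph runs across all cores
```
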
  29. Tasks and completion state

  30. Task graph for a describe operation

  31. Vaex/Polars/Modin: •Vaex – medium data, HDF5 data, similar Pandas API •Polars – in-memory data, Arrow dtypes, less-similar Pandas API •Modin – extends Pandas (Ray/Dask) but low adoption •Each takes you off the beaten path – good to experiment with, but you'll lose the wide Pandas ecosystem
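
(For flavour, a small Polars sketch; the file and column names are invented, and the expression API shown here follows recent releases – older versions spell group_by as groupby.)

```python
import polars as pl

df = pl.read_csv("trips.csv")     # Arrow-backed dtypes, multi-threaded read

out = (
    df.group_by("passenger_count")
      .agg(pl.col("tip_amount").mean())
)
print(out)
```
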
  32. Summary: •Reflect, debug, profile – go slow to go fast •My newsletter has tips (and jobs) •See my blog for my classes + many past talks •I'd love a postcard if you learned something new! Get my cheatsheet: http://bit.ly/hpp_cheatsheet
  33. Q&A... •What blockers do you have? •Have you had success with Vaex/Modin/Polars/Rapids/…? •What holds you up with Pandas?