
Higher Performance Python for Data Science

ianozsvald
November 30, 2021

Interview with Richard at Coiled.io on how we can make data science with Python go faster. I give an update on the state of Python virtual machines (so many!), profiling options (line_profiler, viztracer, memory_profiler, ipython_memory_usage), compilation (numba), vectorisations, faster pandas ideas (including bottleneck & numexpr) and then scaling with Dask/Vaex/Polars. More details are often shared on my newsletter https://buttondown.email/NotANumber and all my past public talks are on my blog: https://ianozsvald.com/about-me/


Transcript

  1. Today's goal
     • Profiling/Compiling/Pandas/Scaling briefing
     • Find bottlenecks so you can go faster – efficiently
     • Q&A – what's blocking you?
  2. Questions? Please ask!
     • Add questions and your stories to the chat
     • We'll do live Q&A at the end
  3. Introductions
     • Interim Chief Data Scientist
     • 20+ years experience
     • Team coaching & public courses – I'm sharing from my Higher Performance Python course (2nd Edition!)
  4. Where we're at
     • Pandas is the default (but 5x memory usage hurts!)
     • Profiling is ignored by many (to their misfortune!)
     • "Python is interpreted, single-core, the GIL is bad, it can't scale – switch to another language!?" (obviously – no)
  5. The "Mark Shannon Plan"
     • 5x faster by 2027, Guido + Mark at Microsoft
     • Won't change normal behaviour
     • 3.10 has internal speedups, 3.11 (in ~1 year) a JIT, ...
     https://github.com/markshannon/faster-cpython/blob/master/funding.md
  6. Stick with what we know – CPython
     • numpy/pandas compatible, most mindshare
     • numpy/pandas/sklearn now "foundational"
     • Other libraries (e.g. Polars) are interesting but hard to displace when e.g. pandas is a foundation for DataFrames
  7. Audience
     • Think on how you currently profile
     • Can you share your tips, successes or war stories?
     • Richard – when do you profile regular Python code? What's been most useful?
  8. line_profiler
     • Deterministic line-by-line profiler – great for numpy/pandas
     • Notebook & module-based, does have an overhead
     • >10 yrs old, 2M+ downloads/yr (est.)
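
     A minimal sketch of driving line_profiler from a script (the function name and workload are illustrative, not from the talk); in a Notebook the equivalent is %load_ext line_profiler followed by %lprun -f expensive expensive(1_000_000):

        from line_profiler import LineProfiler

        def expensive(n):
            # Python-level loop: expect most of the reported time on this line
            total = 0.0
            for x in range(n):
                total += x ** 0.5
            return total

        if __name__ == "__main__":
            lp = LineProfiler()
            lp.add_function(expensive)        # register the function to be timed
            lp.runcall(expensive, 1_000_000)  # run it under the profiler
            lp.print_stats()                  # per-line hits, time and % time
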
  9. VizTracer
     • Time-based function viewer
     • Notebook & module support
     • Great for drilling into complex code – e.g. Pandas!
     • See what's happening down the call-stack by time taken
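
     A small, hedged example of tracing a pandas workload with VizTracer (the DataFrame here is invented for illustration); the saved trace is then browsed with the vizviewer tool to see the call-stack by time taken:

        import numpy as np
        import pandas as pd
        from viztracer import VizTracer

        df = pd.DataFrame({"a": np.random.rand(100_000),
                           "b": np.random.rand(100_000)})

        # Everything inside the block is traced; open the result with `vizviewer trace.json`
        with VizTracer(output_file="trace.json"):
            df["c"] = df["a"] * df["b"]
            summary = df.groupby(df["a"] > 0.5).sum()
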
  10. memory_profiler + extension
      • It is less common to profile RAM, but it is sometimes rewarding
      • Useful for finding slowdowns due to "too much work happening" – especially in Pandas in a Notebook
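
     A sketch of module-based use of memory_profiler (the function and array sizes are illustrative); run it with `python -m memory_profiler yourscript.py` to get a line-by-line RAM report. The ipython_memory_usage extension mentioned above gives the per-cell equivalent in a Notebook:

        import numpy as np
        from memory_profiler import profile

        @profile                               # prints a line-by-line memory report
        def build_arrays():
            a = np.ones((1_000, 10_000))       # roughly 80 MB allocated here
            b = a * 2                          # a temporary roughly doubles the footprint
            return b.sum()

        if __name__ == "__main__":
            build_arrays()
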
  11. Audience
      • Think on how you write NumPy and Pandas code
      • Can you share your tips, successes or war stories?
      • Richard – any observations on how to write Pandas "well" for support and execution speed?
  12. Prefer vectorisation for faster code
      Circa 0GB allocated on the left but many, many object creations; on the right, with numpy, an 800MB allocation followed by vector operations (note: ipython_memory_usage to track memory & time usage)
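
     The slide's figure isn't reproduced here, but a minimal illustrative comparison of the two approaches (sizes invented to make the allocation visible) looks like this:

        import numpy as np

        n = 10_000_000

        # Loop version: millions of temporary Python float objects, little NumPy memory
        total = 0.0
        for i in range(n):
            total += i * 0.5

        # Vectorised version: one large (~80 MB) float64 allocation, then vector operations
        arr = np.arange(n, dtype=np.float64)
        total_np = (arr * 0.5).sum()
        print(total, total_np)
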
  13. Compilation with Numba
      • Just-in-time compiler for math (NumPy & pure Python)
      • No direct Pandas support (though Pandas has some Numba support)
      • Specialises to your CPU, or builds a general pre-compiled library (ahead-of-time); optional strict type declarations
      • GPU and parallel-CPU support
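
     A minimal Numba sketch (the function is illustrative, not from the talk): the first call compiles the function for the argument types it sees, later calls run as machine code:

        import numpy as np
        from numba import njit

        @njit                                  # compile to machine code on first call
        def sum_of_squares(a):
            total = 0.0
            for i in range(a.shape[0]):        # explicit loops are cheap once compiled
                total += a[i] * a[i]
            return total

        arr = np.random.rand(1_000_000)
        sum_of_squares(arr)                    # first call pays the compilation cost
        result = sum_of_squares(arr)           # subsequent calls run at compiled speed
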
  14. Pandas v1+
      • Platform – e.g. integrated with sklearn, plotting tools
      • Increasing integration of Numba
      • Making e.g. groupby.apply fast is hard/painful
      • Arrow datatypes offer a faster string dtype (experimental – see the sketch below)
      • Parallelisation is hard (don't hold your breath, see Dask)
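
     The sketch below shows the experimental Arrow-backed string dtype mentioned above; it assumes pandas 1.3+ with pyarrow installed:

        import pandas as pd

        s_object = pd.Series(["fast", "strings", "please"])      # classic object-backed strings
        s_arrow = pd.Series(["fast", "strings", "please"],
                            dtype="string[pyarrow]")              # Arrow-backed StringDtype
        print(s_object.dtype, s_arrow.dtype)
        print(s_arrow.str.upper())                                # the usual .str methods still work
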
  15. Rolling and Numba
      • .rolling lets us make a "rolling window" on our data
      • Great for time-series, e.g. rolling 1w or 10min windows
      • User-defined functions are traditionally crazy-slow
      • Numba can make these functions faster for little effort
  16. Numba on rolling operations
      “Raw arrays” are the underlying NumPy arrays – this only works for NumPy arrays, not Pandas Extension types
      20x speed-up
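
     A hedged sketch of the pattern the two slides above describe (the window length and data are invented): raw=True hands the user-defined function the underlying NumPy array, and engine="numba" compiles it:

        import numpy as np
        import pandas as pd

        ser = pd.Series(np.random.rand(1_000_000))

        def udf_mean(window):                  # plain NumPy maths only, no Pandas objects
            return window.mean()

        # Traditional path: the UDF is called once per window in Python (slow)
        slow = ser.rolling(60).apply(udf_mean, raw=True)

        # Numba path: the UDF is compiled; the first call pays a compilation cost
        fast = ser.rolling(60).apply(udf_mean, raw=True, engine="numba")
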
  17. Vectorised pre-calculation is faster
      1.5x faster – but maybe harder to read, and to remember in a month what was happening?
  18. Bottleneck & numexpr – install these!
      • Some helpful dependencies aren't installed by default
      • You should install these – especially bottleneck (a quick availability check follows below)
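
     A quick check (a sketch, not from the talk) of whether the optional accelerators are importable in the current environment:

        import importlib.util

        for pkg in ("bottleneck", "numexpr"):
            found = importlib.util.find_spec(pkg) is not None
            print(pkg, "installed" if found else "missing - install it for free speed-ups")
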
  19. Use “numexpr” via “eval”
      Bottleneck (and numexpr) are not installed by default and they offer free speed-ups, so install them! If numexpr is not installed then pd.eval still works – but it runs as slowly as the non-eval equivalent (and you don't get any warnings).
      ipython_memory_usage for diagnostics
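
     A small sketch of the eval route (column names invented): pd.eval and DataFrame.eval evaluate the whole expression in one pass, via numexpr when it is installed, and silently fall back to the slow path when it is not, as the slide warns:

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({"a": np.random.rand(1_000_000),
                           "b": np.random.rand(1_000_000)})

        # Regular Pandas: each intermediate result is materialised as a full array
        regular = df["a"] * df["b"] + df["a"] ** 2

        # eval route: one expression, evaluated by numexpr when available
        via_eval = pd.eval("df.a * df.b + df.a ** 2")
        df = df.eval("c = a * b + a ** 2")
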
  20. Audience
      • How do you deal with "bigger than RAM" data?
      • What happens when Pandas throws an OOM error?
      • Richard – I'd love to hear how people transition into Dask – especially war stories! It isn't all smooth, but then it isn't so difficult either
  21. Dask for larger data
      • Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)
      • Popular for scaling Pandas
      • Wraps Pandas DataFrames
      6 Dask talks at PyDataGlobal!
  22. Dask & Pandas for larger datasets
      • Bigger than RAM, or "I want to use all my cores"
      • Generally you'll use Parquet (or CSV or many choices)
      • The Dashboard gives rich diagnostics
      • Write Pandas-like code
      • Lazy – use ".compute()"
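
     A hedged sketch of the workflow described above; the Parquet path and column names are placeholders, not from the talk:

        import dask.dataframe as dd
        from dask.distributed import Client

        client = Client()                                  # local cluster; its dashboard gives the rich diagnostics

        ddf = dd.read_parquet("data/*.parquet")            # lazy: builds a task graph, nothing is read yet
        result = ddf.groupby("category")["value"].mean()   # Pandas-like and still lazy
        print(result.compute())                            # .compute() triggers the actual work
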
  23. Vaex/Polars/Modin
      • Vaex – medium data, HDF5 data, similar Pandas API
      • Polars – in-memory data, Arrow dtypes, less-similar Pandas API
      • Modin – extends Pandas (Ray/Dask) but low adoption
      • Each takes you off the beaten path – good to experiment with, but you'll lose the wide Pandas ecosystem (a short Polars example follows below)
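
     A tiny illustrative Polars snippet (column names invented) showing the "less-similar", expression-based API the slide refers to:

        import polars as pl

        df = pl.DataFrame({"category": ["a", "b", "a"],
                           "value": [1.0, 2.0, 3.0]})

        # Expressions built with pl.col(...) replace much of the label/index-based Pandas style
        out = df.filter(pl.col("value") > 1.5).select([
            pl.col("category"),
            (pl.col("value") * 2).alias("value_doubled"),
        ])
        print(out)
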
  24. Summary
      • Reflect, debug, profile – go slow to go fast
      • My newsletter has tips (and jobs)
      • See my blog for my classes + many past talks
      • I'd love a postcard if you learned something new!
      Get my cheatsheet: http://bit.ly/hpp_cheatsheet
  25. Q&A...
      • What blockers do you have?
      • Have you had success with Vaex/Modin/Polars/Rapids/…?
      • What holds you up with Pandas?