The State of Higher Performance Python

ianozsvald
November 23, 2022

We’ll review the state of the art in the data science world for common number-crunching tasks on small to big data. Topics we’ll cover include profiling, compilation and data manipulation. We’ll also review the near future for Python, Numba, Pandas, Dask and Polars, and I’ll help you make some pragmatic choices about which tools to invest time in. We’ll have plenty of time to discuss your use cases and any problems you have encountered.

By: https://ianozsvald.com/

Transcript

  1. Expert Briefing – The State of Higher Performance Python
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald, PyDataGlobal 2022
  2. Today’s goal
    •Compiling/Pandas/Scaling briefing
    •Find bottlenecks so you can go faster, efficiently
    •Q&A – what’s blocking you?
    By [ian]@ianozsvald[.com] Ian Ozsvald
  3. Introductions
    •Interim Chief Data Scientist
    •20+ years experience
    •Team coaching & public courses – I’m sharing from my Higher Performance Python course
    2nd Edition!
  4. Where we’re at
    •Pandas is default (but 5X memory usage hurts!)
    •Introspection into tools like Pandas is possible but rare
    •Python is interpreted, single-core, the GIL is bad, it can’t scale – switch to another language!? (obviously, no)
  5. Python 3.11
    •Python 3.11 runs code in 40-80% of the time of 3.10
    •3.12 will have a JIT
    •This doesn’t affect NumPy, Pandas and friends (sadly)
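
A minimal sketch for checking the Python 3.11 claim on your own machine: run the same file under 3.10 and 3.11 and compare the timings. The function `pure_python_sum` and the loop size are illustrative assumptions, not taken from the slides; only the standard library is used.

```python
import sys
import timeit


def pure_python_sum(n):
    """Sum of squares in a plain Python loop (no NumPy involved)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Take the best of three repeats, each running the function 10 times,
# to reduce noise from other processes.
calls_per_repeat = 10
best = min(
    timeit.repeat(
        lambda: pure_python_sum(100_000),
        number=calls_per_repeat,
        repeat=3,
    )
)

version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python {version}: {best / calls_per_repeat * 1e3:.2f} ms per call")
```

Because the loop is pure Python, it is the kind of code the interpreter speed-up targets. The same arithmetic done inside NumPy runs in compiled code, so it should show little difference between interpreter versions.
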
  6. VizTracer
    •Time-based function viewer
    •Notebook & module support
    •Great for drilling into complex code – e.g. Pandas!
    •See what’s happening down the call-stack by time-taken
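
A minimal sketch of tracing a call stack with VizTracer, assuming the `viztracer` package is installed (if it is not, the sketch simply runs the code untraced). The functions `slow_inner` and `outer` and the file name `trace.json` are hypothetical stand-ins for your own code.

```python
def slow_inner(n):
    """Deliberately slow pure-Python loop so it shows up in the trace."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def outer():
    """Calls slow_inner several times, giving a small call stack to inspect."""
    return [slow_inner(20_000) for _ in range(5)]


try:
    from viztracer import VizTracer
except ImportError:  # assumption: tracing is optional for this sketch
    VizTracer = None

if VizTracer is not None:
    # Record the function calls made inside the block and save them on exit.
    with VizTracer(output_file="trace.json"):
        results = outer()
else:
    results = outer()
```

The saved trace can then be opened with `vizviewer trace.json`, which shows each call on a timeline so you can see where the time goes down the call stack.
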
  7. Compilation with Numba
    •Just-in-time compiler for math (NumPy & pure Python)
    •No direct Pandas support in Numba (some Numba support inside Pandas)
    •Specialises to your CPU, or general pre-compiled library (ahead-of-time), optional strict type declarations
    •GPU and parallel-CPU support
  8. Compilation with Numba
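
A minimal sketch of the kind of function Numba handles well: a numeric loop over a NumPy array compiled with `@njit`. The function name `sum_of_squares` and the array size are illustrative assumptions, not taken from the slide. If Numba is not installed, the sketch substitutes a do-nothing decorator so the code still runs (slowly) as plain Python.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # assumption: fall back to plain Python without Numba
    def njit(func):
        return func


@njit
def sum_of_squares(arr):
    """Explicit loop over a 1D float array; Numba compiles this to machine code."""
    total = 0.0
    for x in arr:
        total += x * x
    return total


values = np.arange(100_000, dtype=np.float64)
result = sum_of_squares(values)  # first call triggers compilation under Numba
print(result)
```

The first call includes compilation time, so time a second call to see the steady-state speed.
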

  9. Pandas v1.5+
    •Platform – e.g. integrated with sklearn, plotting tools
    •Increasing integration of Numba
    •Making e.g. groupby.apply fast is hard/painful
    •Arrow datatypes offer faster string dtype (experimental)
    •Parallelisation is hard (don’t hold your breath, see Dask)
  10. Pandas v1.5+
    670MB vs 140MB for unique strings
    Bigger savings for repeated strings
    6x speed-up for this operation
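
A minimal sketch of the memory comparison this slide describes, assuming Pandas 1.3+ with the optional `pyarrow` package for the Arrow-backed string dtype. The generated `user_…` strings and the row count are made up for illustration, so the byte counts will not match the figures on the slide.

```python
import pandas as pd

# Hypothetical data: 100,000 unique short strings.
words = [f"user_{i:06d}" for i in range(100_000)]

# Default storage: one Python str object per row.
object_series = pd.Series(words, dtype="object")
object_bytes = object_series.memory_usage(deep=True)

try:
    # Arrow-backed storage: the strings live in contiguous Arrow buffers.
    arrow_series = pd.Series(words, dtype="string[pyarrow]")
    arrow_bytes = arrow_series.memory_usage(deep=True)
except ImportError:  # pyarrow is an optional dependency of Pandas
    arrow_bytes = None

print(f"object dtype:    {object_bytes / 1e6:.1f} MB")
if arrow_bytes is not None:
    print(f"string[pyarrow]: {arrow_bytes / 1e6:.1f} MB")
```

Passing `deep=True` matters here: without it, Pandas counts only the pointers in an object column and not the strings they refer to.
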
  11. Rolling + Numba
    •.rolling lets us make a “rolling window” on our data
    •Great for time-series e.g. rolling 1w or 10min
    •User-defined functions traditionally crazy-slow
    •Numba can make faster functions for little effort
  12. Rolling operations
    “raw arrays” are the underlying NumPy arrays – this only works for NumPy arrays, not Pandas Extension types
    40x speed-up to Cython
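
A minimal sketch of what “raw arrays” means for a rolling user-defined function, assuming Pandas and NumPy are installed. The function `value_range`, the data and the window size are hypothetical. With `raw=False` each window arrives as a Pandas Series; with `raw=True` it arrives as a plain NumPy array.

```python
import numpy as np
import pandas as pd


def value_range(window):
    """Spread of one window; works on both a Series and a NumPy array."""
    return window.max() - window.min()


series = pd.Series(np.arange(20, dtype=np.float64) ** 2)

# Each window is wrapped in a Series before the function sees it.
rolled_series = series.rolling(window=5).apply(value_range, raw=False)

# Each window is passed as the underlying NumPy array, avoiding that overhead.
rolled_raw = series.rolling(window=5).apply(value_range, raw=True)

# If Numba is installed, adding engine="numba" to the raw=True call asks
# Pandas to JIT-compile value_range; that engine requires raw=True.
print(rolled_raw.tail())
```

Both calls give the same result; only the type handed to the function differs, and that is what allows the compiled path.
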
  13. Dask for larger data
    •Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)
    •Popular for scaling Pandas
    •Wraps Pandas DataFrames
    4 Dask talks at PyDataGlobal!
  14. Dask & Pandas for larger datasets
    •Bigger than RAM or “I want to use all my cores”
    •Generally you’ll use Parquet (or CSV or many choices)
    •The Dashboard gives rich diagnostics
    •Write Pandas-like code
    •Lazy – use “.compute()”
  15. Vaex/Polars/Modin/Bodo
    •Vaex – medium data, HDF5 data, similar Pandas API
    •Polars – in-memory data, Arrow dtypes, less-similar Pandas API
    •Modin – extends Pandas (Ray/Dask) but low adoption
    •Bodo – commercial Pandas compiler+paralleliser
  16. Modin example with Ray (Dask worse)

  18. Summary
    •Take advantage of the new tools
    •Newsletter:
    •See blog for my classes + many past talks
    •I’d love a postcard if you learned something new!
    Get my cheatsheet: http://bit.ly/hpp_cheatsheet
  19. Q&A...
    •What blockers do you have?
    •Have you had success with Vaex/Modin/Polars/Rapids/Bodo/…?
    •What holds you up with Pandas?