
The State of Higher Performance Python

ianozsvald
November 23, 2022


We’ll review the state of the art in the data science world for common number-crunching tasks on small to big data. Topics we’ll cover include profiling, compilation and data manipulation. We’ll also review the near future for Python, Numba, Pandas, Dask and Polars and I’ll help you make some pragmatic choices about tools you might invest time in. We’ll have plenty of time to discuss your use cases and problems you might have encountered.

By: https://ianozsvald.com/


Transcript

  1. Expert Briefing – The State of Higher Performance Python
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    PyDataGlobal 2022


  2. •Compiling/Pandas/Scaling briefing
    •Find bottlenecks so you can go faster - efficiently
    •Q&A – what’s blocking you?
    Today’s goal
    By [ian]@ianozsvald[.com] Ian Ozsvald



  3. •Interim Chief Data Scientist
    •20+ years experience
    •Team coaching & public courses
    –I'm sharing from my Higher Performance Python course
    Introductions
    By [ian]@ianozsvald[.com] Ian Ozsvald
    2nd Edition!


  4. •Pandas is default (but 5X memory usage hurts!)
    •Introspection into tools like Pandas is possible but rare
    •Python is interpreted, single-core, the GIL is bad, it can't scale – switch to another language!? (obviously - no)
    Where we’re at
    By [ian]@ianozsvald[.com] Ian Ozsvald


  5. •Python 3.11 runs code in 40-80% of the time of 3.10
    •A JIT is planned for a later CPython release
    •This doesn't affect NumPy, Pandas and friends (sadly)
    Python 3.11
    By [ian]@ianozsvald[.com] Ian Ozsvald


  6. •Time-based fn viewer
    •Notebook & module support
    •Great for drilling into complex code – e.g. Pandas!
    •See what's happening down the call-stack by time-taken
    VizTracer
    By [ian]@ianozsvald[.com] Ian Ozsvald
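
A minimal sketch of tracing a slow block with VizTracer as described above; the traced function and the output file name are illustrative, not from the talk. Open the resulting report with the vizviewer command to walk the call stack ordered by time taken.

import pandas as pd
from viztracer import VizTracer

def expensive_groupby(df):
    # Stand-in for the "complex code" you want to drill into.
    return df.groupby("key")["value"].mean()

df = pd.DataFrame({"key": list("ababab") * 1000, "value": range(6000)})

# Trace a block of code and write a JSON report for the viewer.
with VizTracer(output_file="trace.json"):
    expensive_groupby(df)

# Then inspect the time-ordered call stack with:
#   vizviewer trace.json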


  7. •Just-in-time compiler for math (NumPy & pure Python)
    •No direct Pandas support (though Pandas itself has some Numba integration)
    •Specialises to your CPU, or general pre-compiled library (ahead-of-time); optional strict type declarations
    •GPU and parallel-CPU support
    Compilation with Numba
    By [ian]@ianozsvald[.com] Ian Ozsvald
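
A hedged Numba sketch of the kind of NumPy/pure-Python loop it compiles; the function and data are invented for illustration. The first call pays the compilation cost, later calls run machine code specialised to your CPU, and prange spreads the loop over cores.

import numpy as np
from numba import njit, prange

@njit(parallel=True)  # JIT-compile on first call; parallel loop via prange
def mean_abs_diff(arr):
    total = 0.0
    for i in prange(arr.shape[0] - 1):
        total += abs(arr[i + 1] - arr[i])  # scalar reduction across threads
    return total / (arr.shape[0] - 1)

x = np.random.default_rng(0).standard_normal(1_000_000)
mean_abs_diff(x)         # first call: compile + run
print(mean_abs_diff(x))  # later calls: compiled code only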


  8. Compilation with Numba
    By [ian]@ianozsvald[.com] Ian Ozsvald


  9. •Platform – e.g. integrated with sklearn, plotting tools
    •Increasing integration of Numba
    •Making e.g. groupby.apply fast is hard/painful
    •Arrow datatypes offer faster string dtype (experimental)
    •Parallelisation is hard (don’t hold your breath, see Dask)
    Pandas v1.5+
    By [ian]@ianozsvald[.com] Ian Ozsvald


  10. Pandas v1.5+
    By [ian]@ianozsvald[.com] Ian Ozsvald
    670MB vs 140MB for unique strings
    Bigger savings for repeated strings
    6x speed-up for this operation
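
A hedged sketch of the experimental Arrow-backed string dtype referred to above, assuming pyarrow is installed; the data and the sizes you will see are illustrative, not the figures from the slide.

import numpy as np
import pandas as pd

words = np.array([f"user_{i}" for i in range(1_000_000)])  # unique strings

s_object = pd.Series(words)                          # default object-backed strings
s_arrow = pd.Series(words, dtype="string[pyarrow]")  # Arrow-backed strings

print(s_object.memory_usage(deep=True) // 1_000_000, "MB")
print(s_arrow.memory_usage(deep=True) // 1_000_000, "MB")

# String methods such as .str.startswith("user_1") can also run faster
# on the Arrow-backed dtype.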



  11. •.rolling lets us make a "rolling window" on our data
    •Great for time-series e.g. rolling 1w or 10min
    •User-defined functions traditionally crazy-slow
    •Numba can make faster functions for little effort
    Rolling + Numba
    By [ian]@ianozsvald[.com] Ian Ozsvald


  12. Rolling operations
    By [ian]@ianozsvald[.com] Ian Ozsvald
    "raw arrays" are the underlying NumPy arrays – this only works for NumPy arrays, not Pandas Extension types
    40x speed-up to Cython
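
A hedged sketch of the pattern from the last two slides: the same user-defined rolling function run with the default engine and with engine="numba". The Numba engine requires raw=True so the function receives the underlying NumPy arrays; the data and window size are invented.

import numpy as np
import pandas as pd

def mad(window):
    # Mean absolute deviation of one raw NumPy window.
    return np.mean(np.abs(window - np.mean(window)))

idx = pd.date_range("2022-01-01", periods=100_000, freq="min")
s = pd.Series(np.random.default_rng(0).standard_normal(len(idx)), index=idx)

slow = s.rolling(60).apply(mad, raw=True)                  # pure-Python per window
fast = s.rolling(60).apply(mad, raw=True, engine="numba")  # Numba-compiled
assert np.allclose(slow, fast, equal_nan=True)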



  13. •Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)
    •Popular for scaling Pandas
    •Wraps Pandas DataFrames
    Dask for larger data
    By [ian]@ianozsvald[.com] Ian Ozsvald
    4 Dask talks at PyDataGlobal!



  14. •Bigger than RAM or "I want to use all my cores"
    •Generally you'll use Parquet (or CSV or many choices)
    •The Dashboard gives rich diagnostics
    •Write Pandas-like code
    •Lazy – use ".compute()"
    Dask & Pandas for larger datasets
    By [ian]@ianozsvald[.com] Ian Ozsvald
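
A minimal Dask sketch of the lazy, Pandas-like workflow described above; the Parquet path and column names are invented. Creating a distributed Client is also what gives you the diagnostics Dashboard.

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster; client.dashboard_link opens the diagnostics Dashboard

ddf = dd.read_parquet("trades/*.parquet")       # larger-than-RAM dataset, read lazily
result = ddf.groupby("symbol")["price"].mean()  # builds a task graph, nothing runs yet

print(result.compute())  # .compute() executes the graph across all cores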


  15. •Vaex – medium data, HDF5 data, similar Pandas API
    •Polars – in-memory data, Arrow dtypes, less-similar Pandas API
    •Modin – extends Pandas (Ray/Dask) but low adoption
    •Bodo - commercial Pandas compiler+paralleliser
    Vaex/Polars/Modin/Bodo
    By [ian]@ianozsvald[.com] Ian Ozsvald
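
For comparison, a hedged Polars sketch of its lazy, Arrow-backed query API; the file and column names are invented, and the group_by method was spelled groupby in older Polars releases.

import polars as pl

query = (
    pl.scan_csv("trades.csv")  # lazy scan – nothing is read yet
    .filter(pl.col("price") > 0)
    .group_by("symbol")
    .agg(pl.col("price").mean().alias("mean_price"))
)

print(query.collect())  # the optimiser plans and runs the whole query in parallel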


  16. Modin example with Ray (the Dask backend performed worse)
    By [ian]@ianozsvald[.com] Ian Ozsvald
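
A hedged sketch of Modin as a drop-in Pandas replacement on the Ray engine; the file and column names are invented.

import os
os.environ["MODIN_ENGINE"] = "ray"  # "dask" is the other common backend

import modin.pandas as pd  # same API surface as Pandas for most calls

df = pd.read_csv("big_file.csv")         # the read itself is parallelised across cores
print(df.groupby("key")["value"].sum())  # unchanged Pandas-style code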


  17. By [ian]@ianozsvald[.com] Ian Ozsvald



  18. •Take advantage of the new tools
    •Newsletter:
    •See blog for my classes + many past talks
    •I'd love a postcard if you learned something new!
    Summary
    By [ian]@ianozsvald[.com] Ian Ozsvald
    Get my cheatsheet: http://bit.ly/hpp_cheatsheet


  19. •What blockers do you have?
    •Have you had success with Vaex/Modin/Polars/Rapids/Bodo/…?
    •What holds you up with Pandas?
    Q&A...
    By [ian]@ianozsvald[.com] Ian Ozsvald
