
Higher Performance Python for Data Science

ianozsvald
November 30, 2021

Interview with Richard at Coiled.io on how we can make data science with Python go faster. I give an update on the state of Python virtual machines (so many!), profiling options (line_profiler, viztracer, memory_profiler, ipython_memory_usage), compilation (numba), vectorisation, faster pandas ideas (including bottleneck & numexpr) and then scaling with Dask/Vaex/Polars. More details are often shared in my newsletter https://buttondown.email/NotANumber and all my past public talks are on my blog: https://ianozsvald.com/about-me/

Transcript

  1. Higher Performance Python for Data Science
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    Coiled.io 2021

  2. Today’s goal
    •Profiling/Compiling/Pandas/Scaling briefing
    •Find bottlenecks so you can go faster - efficiently
    •Q&A – what’s blocking you?
    By [ian]@ianozsvald[.com] Ian Ozsvald

  3. Questions? Please ask!
    •Add questions and your stories to the chat
    •We’ll do live Q&A at the end


  4. Introductions
    •Interim Chief Data Scientist
    •20+ years experience
    •Team coaching & public courses
      –I’m sharing from my Higher Performance Python course
    2nd Edition!

  5. Where we’re at
    •Pandas is default (but 5X memory usage hurts!)
    •Profiling is ignored by many (to their misfortune!)
    •Python is interpreted, single-core, the GIL is bad, it can’t scale – switch to another language!? (obviously - no)

  6. TIOBE number 1
    “2nd best at everything”

  7. Other Pythons?

  8. The “Mark Shannon Plan”
    •5x faster by 2027, Guido+Mark by Microsoft
    •Won’t change normal behaviour
    •3.10 has internal speedups, 3.11 (1yr) a JIT, ...
    https://github.com/markshannon/faster-cpython/blob/master/funding.md

  9. Stick with what we know - CPython
    •numpy/pandas compatible, most mindshare
    •numpy/pandas/sklearn now “foundational”
    •Other libraries (e.g. Polars) interesting but hard to displace when e.g. pandas is a foundation for DataFrames

  10. Audience
    •Think on how you currently profile
    •Can you share your tips, success or war stories?
    •Richard – when do you profile regular Python code? What’s been most useful?

  11. line_profiler
    •Deterministic line-by-line profiler – great for numpy/pandas
    •Notebook & module-based, does have an overhead
    •>10 yrs, 2M+ downloads/yr (est.)
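
The line-by-line profiling described on this slide can be sketched as follows. This is a minimal illustration rather than code from the talk: `normalise` and `data` are hypothetical names, and the import is guarded (an assumption made here) so the sketch still runs where line_profiler is not installed. In a Notebook the equivalent is `%load_ext line_profiler` followed by `%lprun -f normalise normalise(data)`.

```python
import numpy as np

# Guarded import: an assumption so the sketch runs without line_profiler installed.
try:
    from line_profiler import LineProfiler
except ImportError:
    LineProfiler = None


def normalise(values):
    """Hypothetical function to profile: scale to zero mean and unit variance."""
    centred = values - values.mean()
    return centred / centred.std()


data = np.arange(1_000_000, dtype=np.float64)

if LineProfiler is not None:
    profiler = LineProfiler()
    profiled = profiler(normalise)  # wrap the function to record per-line timings
    result = profiled(data)
    profiler.print_stats()          # prints hits and time for each line of normalise
else:
    result = normalise(data)        # no profiler available, just run the function
```

The report lists time per line, which is usually enough to spot the single numpy or pandas call that dominates a function.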

  12. VizTracer
    •Time-based fn viewer
    •Notebook & module support
    •Great for drilling into complex code – e.g. Pandas!
    •See what’s happening down the call-stack by time-taken

  13. memory_profiler + extension
    •It is less common to profile RAM, sometimes rewarding
    •Useful to find slowdowns due to “too much work happening” – especially in Pandas in a Notebook
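
As a rough illustration of why RAM profiling can be rewarding, the sketch below measures peak allocations around a small numpy operation. It uses the standard library's `tracemalloc` as a stand-in, not memory_profiler or ipython_memory_usage from the slide, and the array size is an arbitrary assumption.

```python
import tracemalloc

import numpy as np

tracemalloc.start()

values = np.ones(1_000_000, dtype=np.float64)  # roughly 8 MB
doubled = values * 2                           # a second array of roughly 8 MB

current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current_bytes / 1e6:.1f} MB, peak: {peak_bytes / 1e6:.1f} MB")
```

A peak well above the size of the input data is a hint that intermediate copies are being made, which often shows up as slower code too.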

  14. Audience
    •Think on how you write Numpy and Pandas code
    •Can you share your tips, success or war stories?
    •Richard – any observations on how to write Pandas “well” for support and execution speed?

  15. Prefer vectorisation for faster code
    Circa 0GB on the left but many, many object creations; on the right, numpy makes 800MB of allocations and then runs vector operations (note ipython_memory_usage to track memory & time usage)
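
The slide's code is an image and is not reproduced in this transcript, so the comparison below is a stand-in sketch of the same idea with an assumed, much smaller problem size: a pure Python loop that creates one object per step versus a single numpy allocation followed by vector operations.

```python
import numpy as np

n = 1_000_000  # assumed size for illustration, smaller than the slide's example

# Pure Python: little memory held at once, but one interpreter step per element
total_loop = 0
for i in range(n):
    total_loop += i * i

# numpy: allocate the array up front, then do the arithmetic in compiled code
values = np.arange(n, dtype=np.int64)
total_vec = int((values * values).sum())

print(total_loop == total_vec)
```

Wrapping each half in `%timeit` (or ipython_memory_usage, as the slide suggests) shows the time and memory trade-off on your own machine.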

  16. Compilation with Numba
    •Just-in-time compiler for math (NumPy & pure Python)
    •Numba has no direct Pandas support (though Pandas itself has some Numba support)
    •Specialises to your CPU, or general pre-compiled library (Ahead-of-time), optional strict type declarations
    •GPU and parallel-CPU support
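
A minimal sketch of the just-in-time compilation described above, using Numba's `njit` decorator on a plain loop over a NumPy array. The function name and data are hypothetical, and the fallback decorator is an assumption added so the example still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumed fallback: run the plain Python function if Numba is missing.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    """Explicit loop: slow in pure Python, compiled to machine code by Numba."""
    total = 0.0
    for v in values:
        total += v * v
    return total


data = np.arange(10_000, dtype=np.float64)
result = sum_of_squares(data)  # first call triggers compilation for float64 arrays
print(result)
```

The first call pays the compilation cost, so time the second call when comparing against the numpy equivalent.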

  17. Compilation with Numba

  18. Pandas v1+
    •Platform – e.g. integrated with sklearn, plotting tools
    •Increasing integration of Numba
    •Making e.g. groupby.apply fast is hard/painful
    •Arrow datatypes offer faster string dtype (experimental)
    •Parallelisation is hard (don’t hold your breath, see Dask)


  19. Rolling and Numba
    •.rolling lets us make a “rolling window” on our data
    •Great for time-series e.g. rolling 1w or 10min
    •User-defined functions traditionally crazy-slow
    •Numba can make faster functions for little effort

  20. Numba on rolling operations
    “raw arrays” are the underlying NumPy arrays – this only works for NumPy arrays, not Pandas Extension types
    20x speed-up
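
The slide's code is an image, so the following is a stand-in sketch of the “raw arrays” idea using a hypothetical `value_range` function. With `raw=True` the user-defined function receives the underlying NumPy array for each window instead of a Series. Passing `engine="numba"` to `apply` additionally compiles the function; it is left out here because it needs Numba installed and requires `raw=True`.

```python
import numpy as np
import pandas as pd

series = pd.Series(np.arange(10, dtype=np.float64))


def value_range(window):
    """Hypothetical user-defined function: max minus min of each window."""
    return window.max() - window.min()


# raw=False hands each window to the function as a pandas Series (slower)
as_series = series.rolling(3).apply(value_range, raw=False)

# raw=True hands each window over as a plain NumPy array (faster)
as_array = series.rolling(3).apply(value_range, raw=True)

print(as_series.equals(as_array))
```

Both calls give the same result; only the type passed to the function, and therefore the overhead per window, differs.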

  21. GroupBy and apply is easy but slow

  22. Vectorised version will be faster

  23. Vectorised-precalculation is faster
    1.5x faster – but maybe harder to read and to remember in a month what was happening?
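
The code on slides 21 to 23 is not in this transcript, so here is a stand-in sketch of the pattern with hypothetical data and column names: a `groupby(...).apply` that does row-wise arithmetic inside each group, against a version that precalculates the column once and then uses a built-in aggregation.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "shop": ["a", "a", "b", "b", "b"],
    "price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "qty": [10, 20, 30, 40, 50],
})

# Easy to write, but calls a Python function once per group
by_apply = df.groupby("shop")[["price", "qty"]].apply(
    lambda g: (g["price"] * g["qty"]).sum()
)

# Vectorised precalculation: one column-wide multiply, then a built-in sum
df["revenue"] = df["price"] * df["qty"]
by_vector = df.groupby("shop")["revenue"].sum()

print(by_apply.tolist(), by_vector.tolist())
```

The gap between the two grows with the number of groups, and the extra intermediate column is the readability cost the slide mentions.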


  24. Bottleneck & numexpr – install these!
    •Some helpful dependencies aren’t installed by default
    •You should install these – especially bottleneck

  25. Use “numexpr” via “eval”
    Bottleneck (and numexpr) are not installed by default and they offer free speed-ups, so install them!
    If numexpr is not installed then pd.eval still works – but goes as slow as non-eval equivalent (and you don’t get any warnings)
    ipython_memory_usage for diagnostics
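
Because a missing numexpr gives no warning, it can help to check for the optional packages explicitly. The sketch below is an illustration with a hypothetical DataFrame: it reports whether numexpr and bottleneck can be imported, then evaluates the same expression with plain operators and with `DataFrame.eval`.

```python
import importlib.util

import numpy as np
import pandas as pd

# Check for the optional accelerators; pandas will not warn if they are absent
has_numexpr = importlib.util.find_spec("numexpr") is not None
has_bottleneck = importlib.util.find_spec("bottleneck") is not None
print(f"numexpr installed: {has_numexpr}, bottleneck installed: {has_bottleneck}")

# Hypothetical data for illustration
df = pd.DataFrame({"a": np.arange(5.0), "b": np.arange(5.0) * 2})

plain = df["a"] + df["b"] * 3    # ordinary pandas arithmetic
via_eval = df.eval("a + b * 3")  # uses numexpr when available, else plain Python

print(np.allclose(plain, via_eval))
```

Any speed-up from `eval` tends to appear only on large frames, so time it on data of realistic size.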

  26. Audience
    •How do you deal with “bigger than RAM data”?
    •What happens when Pandas throws an OOM?
    •Richard – I’d love to hear how people transition into Dask – especially war stories! It isn’t all smooth, but then it isn’t so difficult either


  27. Dask for larger data
    •Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)
    •Popular for scaling Pandas
    •Wraps Pandas DataFrames
    6 Dask talks at PyDataGlobal!


  28. Dask & Pandas for larger datasets
    •Bigger than RAM or “I want to use all my cores”
    •Generally you’ll use Parquet (or CSV or many choices)
    •The Dashboard gives rich diagnostics
    •Write Pandas-like code
    •Lazy – use “.compute()”

  29. Tasks and completion state

  30. Task graph for describe operation

  31. Vaex/Polars/Modin
    •Vaex – medium data, HDF5 data, similar Pandas API
    •Polars – in-memory data, Arrow dtypes, less-similar Pandas API
    •Modin – extends Pandas (Ray/Dask) but low adoption
    •Each takes you off the beaten path – good to experiment with but you’ll lose the wide Pandas ecosystem


  32. Summary
    •Reflect, debug, profile – go slow to go fast
    •My newsletter has tips (and jobs)
    •See blog for my classes + many past talks
    •I’d love a postcard if you learned something new!
    Get my cheatsheet: http://bit.ly/hpp_cheatsheet

  33. Q&A...
    •What blockers do you have?
    •Have you had success with Vaex/Modin/Polars/Rapids/…?
    •What holds you up with Pandas?