Higher Performance Python

ianozsvald
November 16, 2019

A talk given at PyDataCambridge 2019. It evaluates two different OLS approaches with line_profiler, applies one of them through a set of Pandas options (iloc, apply, apply with raw=True), compiles it with Numba and runs it on multiple cores with Dask. It closes with some "being a highly performant developer" advice.

Transcript

  1. Tools for Higher Performance Python
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    PyDataCambridge 2019


  2. Introductions

    Interim Chief Data Scientist

    19+ years experience

    Quickly build strategic data science plans

    Team coaching & public courses
    By [ian]@ianozsvald[.com] Ian Ozsvald
    2nd Edition, April 2020


  3. Today’s goal

    Introduce profiling, faster Pandas, multi-core

    Reflect on “good practice” so you can be highly performant


  4. A typical higher-performance task

    Calculate features including slope

    Ordinary Least Squares on time series

    100,000 rows in a DataFrame to trial

    Estimate for 100x (52wk*2 time windows)

    Lots of CPU time – can we do better?

  5. Introduce 2 solutions
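
The code on this slide is not captured in the transcript. Judging by the later slides, the two solutions are probably scikit-learn's linear regression and NumPy's `lstsq`. Below is a minimal sketch of the `lstsq` variant only, assuming the time index is simply 0, 1, 2, ... and that the slope is the feature of interest. The name `ols_lstsq` is hypothetical.

```python
import numpy as np


def ols_lstsq(y):
    """Return the OLS slope of y against x = 0, 1, ..., len(y) - 1."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Design matrix: one column for the slope, one column of ones for the intercept.
    A = np.vstack((x, np.ones(len(y)))).T
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    return slope


# A perfectly linear series y = 2x + 1 should give a slope of 2.
example_slope = ols_lstsq([1.0, 3.0, 5.0, 7.0])
```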

  6. Sklearn is slow?!

  7. line_profiler
    Sklearn is safe
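
The profiler output shown on this slide is not in the transcript. As a rough illustration of how line_profiler can be driven from Python, the sketch below wraps a function with `LineProfiler` and prints per-line timings. It profiles a hypothetical `ols_lstsq` helper rather than the scikit-learn code from the talk, and it falls back to a plain call if line_profiler is not installed.

```python
import numpy as np

try:
    from line_profiler import LineProfiler
except ImportError:  # line_profiler is a third-party package and may be absent
    LineProfiler = None


def ols_lstsq(y):
    """Hypothetical helper: OLS slope of y against x = 0, 1, ..., len(y) - 1."""
    y = np.asarray(y, dtype=float)
    A = np.vstack((np.arange(len(y), dtype=float), np.ones(len(y)))).T
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    return slope


def profile_call(func, *args):
    """Run func(*args) under line_profiler (if available) and return its result."""
    if LineProfiler is None:
        return func(*args)
    profiler = LineProfiler()
    wrapped = profiler(func)   # register func for line-by-line timing
    result = wrapped(*args)
    profiler.print_stats()     # per-line hit counts and timings go to stdout
    return result


slope = profile_call(ols_lstsq, np.arange(50) * 0.5)
```

Reading the per-line report before changing anything follows the talk's later advice to profile rather than guess.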

  8. Pandas iloc & looping (with lstsq)
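
The slide's code is not in the transcript. A minimal sketch of the iloc-and-loop pattern, assuming each row of the DataFrame holds one time series and that a hypothetical `ols_lstsq` helper returns its slope:

```python
import numpy as np
import pandas as pd


def ols_lstsq(row):
    """Hypothetical helper: OLS slope of one row against x = 0, 1, 2, ..."""
    y = np.asarray(row, dtype=float)
    A = np.vstack((np.arange(len(y), dtype=float), np.ones(len(y)))).T
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    return slope


def slopes_iloc(df):
    """Loop over row positions, pulling each row out with iloc."""
    slopes = np.empty(len(df))
    for i in range(len(df)):
        row = df.iloc[i]  # builds a new Series for every row
        slopes[i] = ols_lstsq(row)
    return slopes


demo = pd.DataFrame([[0, 1, 2], [0, 2, 4]])
demo_slopes = slopes_iloc(demo)
```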

  9. Pandas apply
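
A minimal sketch of the same calculation with `DataFrame.apply`, assuming one time series per row. With `axis=1` and the default `raw=False`, Pandas passes each row to the function as a Series. The helper name `ols_lstsq` is hypothetical.

```python
import numpy as np
import pandas as pd


def ols_lstsq(row):
    """Hypothetical helper: OLS slope of one row (received here as a Series)."""
    y = row.to_numpy(dtype=float)
    A = np.vstack((np.arange(len(y), dtype=float), np.ones(len(y)))).T
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    return slope


def slopes_apply(df):
    # axis=1 means "call the function once per row".
    return df.apply(ols_lstsq, axis=1)


demo = pd.DataFrame([[0, 1, 2], [0, 2, 4]])
demo_slopes = slopes_apply(demo)
```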

  10. Pandas apply with raw=True
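
With `raw=True`, `DataFrame.apply` hands each row to the function as a NumPy array instead of building a Series, which avoids per-row object construction. A minimal sketch under the same assumptions as before (one series per row, hypothetical helper name `ols_lstsq_raw`):

```python
import numpy as np
import pandas as pd


def ols_lstsq_raw(row):
    """Hypothetical helper: OLS slope of one row (received here as an ndarray)."""
    y = row.astype(float)  # row is a NumPy array, so no .to_numpy() is needed
    A = np.vstack((np.arange(len(y), dtype=float), np.ones(len(y)))).T
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    return slope


def slopes_apply_raw(df):
    # raw=True skips Series creation and passes plain ndarrays to the function.
    return df.apply(ols_lstsq_raw, axis=1, raw=True)


demo = pd.DataFrame([[0, 1, 2], [0, 2, 4]])
demo_slopes = slopes_apply_raw(demo)
```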

  11. Numba when raw=True
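
Because `raw=True` delivers plain NumPy arrays, the row function becomes a candidate for Numba compilation. The sketch below is an assumption, not the code from the slide: it swaps `lstsq` for the closed-form slope formula, which is easy to compile, and falls back to ordinary Python if Numba is not installed. The names `ols_slope_closed_form` and `slopes_numba` are hypothetical.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:  # without Numba the sketch still runs, only uncompiled
    def njit(func):
        return func


@njit
def ols_slope_closed_form(y):
    """Slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2), x = 0..n-1.

    Needs at least two points, otherwise the denominator is zero.
    """
    n = y.shape[0]
    x_mean = (n - 1) / 2.0
    y_mean = 0.0
    for i in range(n):
        y_mean += y[i]
    y_mean /= n
    num = 0.0
    den = 0.0
    for i in range(n):
        dx = i - x_mean
        num += dx * (y[i] - y_mean)
        den += dx * dx
    return num / den


def slopes_numba(df):
    # raw=True passes ndarrays, which the compiled function can accept.
    return df.apply(ols_slope_closed_form, axis=1, raw=True)


demo = pd.DataFrame([[0.0, 1.0, 2.0], [0.0, 2.0, 4.0]])
demo_slopes = slopes_numba(demo)
```

The first call includes compilation time, so timing comparisons are fairer on a second call.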


  12. Dask

    Pandas and NumPy distributed computing

    Bag (standard Python collections), Array (NumPy) and Distributed DataFrame (Pandas)

    Super-easy parallelised Pandas functions

  13. Dask with Numba


  14. Costs on the “big problem”

    Sklearn & iloc – 90 minutes

    Lstsq & apply raw – 10 minutes

    With Numba – 1 minute

    With Dask – 30 seconds (180x theoretical speed-up)

    Don’t go crazy – remember maintenance
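
The 180x figure follows from the quoted timings: 90 minutes is 5,400 seconds, and 5,400 / 30 = 180. A short sketch of that arithmetic, using only the numbers on the slide:

```python
# Timings quoted on the slide, converted to seconds.
timings_seconds = {
    "Sklearn & iloc": 90 * 60,
    "Lstsq & apply raw": 10 * 60,
    "With Numba": 1 * 60,
    "With Dask": 30,
}

baseline = timings_seconds["Sklearn & iloc"]
# Speed-up of each approach relative to the slowest one.
speedups = {name: baseline / secs for name, secs in timings_seconds.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x")
```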


  15. Being “highly performant”

    iloc & apply are fine

    Sklearn in dev, lstsq for prod? Maintenance cost...

    Numba (and Dask) need team buy-in

    Profile before trying ideas (else you’re guessing)

    Test everything! Bulwark is nice


  16. Thank your organisers

    Your organisers are volunteers

    Thank your volunteers & speakers please

    Get a free (1st ed) signed book later


  17. Upcoming public courses

    Jan: Successful Data Science Projects

    Feb: Software Engineering for Data Scientists (2 day)

    Mar: (planned) High Performance Python

    https://IanOzsvald.com/training


  18. Summary

    Measure – don’t guess

    Test everything

    I’d love a postcard if you learned something new

    Join my thoughts+jobs list for tips and my training list

    Lots of past talks on ianozsvald.com