Higher Performance Python

ianozsvald
November 16, 2019

Higher Performance Python, a talk given at PyDataCambridge 2019. It covers evaluating two different OLS approaches using line_profiler, applying one of them with a set of Pandas options (iloc, apply, apply with raw=True), compiling with Numba and going multi-core with Dask, along with some advice on "being a highly performant developer".

Transcript

  1. Tools for Higher Performance Python
     Ian Ozsvald (@IanOzsvald, ianozsvald.com)
     PyDataCambridge 2019
  2. Introductions
     • Interim Chief Data Scientist
     • 19+ years experience
     • Quickly build strategic data science plans
     • Team coaching & public courses
     • 2nd Edition April 2020
     By [ian]@ianozsvald[.com] Ian Ozsvald
  3. Today’s goal
     • Introduce profiling, faster Pandas, multi-core
     • Reflect on “good practice” so you can be highly performant
  4. A typical higher-performance task
     • Calculate features including slope
     • Ordinary Least Squares on time series
     • 100,000 rows in a DataFrame to trial
     • Estimate for 100x (52wk*2 time windows)
     • Lots of CPU time: can we do better?
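The slope feature named on this slide can be sketched with NumPy's `np.linalg.lstsq`. This is a minimal illustration rather than the talk's own code: the function name `ols_slope`, the evenly spaced x axis and the sample values are all assumptions.

```python
import numpy as np

def ols_slope(y):
    """Return the OLS slope of y against x = 0, 1, 2, ... (assumed spacing)."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Design matrix with columns [x, 1] so lstsq fits y = slope * x + intercept.
    A = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
    return slope

# Hypothetical series that rises by 2 per step, so the slope is close to 2.
print(ols_slope([1.0, 3.0, 5.0, 7.0]))
```

On the real task this function would be called once per row of the DataFrame, which is where the CPU time goes.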
  5. Introduce 2 solutions

  6. Sklearn is slow?!

  7. line_profiler: Sklearn is safe
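One way to get per-line timings like those discussed on these slides is line_profiler's `LineProfiler` class. The sketch below is an assumption about how such a measurement could be set up, not the talk's code: `ols_lstsq` and `run_profiled` are hypothetical names, and the import is guarded so the example still runs (without timings) where line_profiler is not installed.

```python
import numpy as np

try:
    from line_profiler import LineProfiler  # third-party; may not be installed
except ImportError:
    LineProfiler = None

def ols_lstsq(y):
    """Hypothetical function under test: OLS slope via np.linalg.lstsq."""
    x = np.arange(len(y), dtype=float)
    A = np.column_stack([x, np.ones_like(x)])
    coeffs = np.linalg.lstsq(A, y, rcond=None)[0]
    return coeffs[0]

def run_profiled(func, *args):
    """Call func(*args), printing per-line timings when line_profiler is available."""
    if LineProfiler is None:
        return func(*args)
    profiler = LineProfiler()
    wrapped = profiler(func)  # registers func and returns a timed wrapper
    result = wrapped(*args)
    profiler.print_stats()
    return result

slope = run_profiled(ols_lstsq, np.array([1.0, 3.0, 5.0, 7.0]))
```

The same helper could wrap an sklearn-based fit to compare where each approach spends its time.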

  8. Pandas iloc & looping (with lstsq)

  9. Pandas apply

  10. Pandas apply with raw=True
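The three Pandas options on slides 8 to 10 can be compared side by side. This is a hedged sketch under stated assumptions: the random data, the row length of 14 and the helper `ols_slope` are hypothetical, chosen only to show that all three call patterns give the same slopes.

```python
import numpy as np
import pandas as pd

def ols_slope(row):
    """OLS slope of one row against x = 0, 1, 2, ...; accepts a Series or ndarray."""
    y = np.asarray(row, dtype=float)
    x = np.arange(len(y), dtype=float)
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(A, y, rcond=None)[0][0]

# Hypothetical data: 1,000 rows of 14 noisy observations each.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1_000, 14)))

# 1. Explicit loop with iloc: builds a Series for every row.
slopes_iloc = pd.Series(
    [ols_slope(df.iloc[i]) for i in range(len(df))], index=df.index
)

# 2. apply along rows: still passes a Series per row.
slopes_apply = df.apply(ols_slope, axis=1)

# 3. apply with raw=True: passes a plain NumPy array per row instead of a Series.
slopes_raw = df.apply(ols_slope, axis=1, raw=True)
```

Skipping the per-row Series construction is typically why `raw=True` runs faster, though the size of the gain should be measured rather than assumed.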

  11. Numba when raw=True
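Because `raw=True` hands each row over as a NumPy array, the per-row function can be compiled with Numba. The sketch below is an assumption, not the talk's implementation: it swaps `lstsq` for the closed-form slope, sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2), written as plain loops, and it falls back to uncompiled Python when Numba is not installed. The name `ols_slope_numba` and the data are hypothetical.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit  # third-party JIT compiler; may not be installed
except ImportError:
    def njit(func):
        # Fallback (assumption for this sketch): run the function uncompiled.
        return func

@njit
def ols_slope_numba(y):
    """Closed-form OLS slope of y against x = 0, 1, 2, ... using plain loops."""
    n = y.shape[0]
    x_mean = (n - 1) / 2.0
    y_mean = 0.0
    for i in range(n):
        y_mean += y[i]
    y_mean /= n
    cov = 0.0
    var = 0.0
    for i in range(n):
        dx = i - x_mean
        cov += dx * (y[i] - y_mean)
        var += dx * dx
    return cov / var

# Hypothetical data: 1,000 rows of 14 noisy observations each.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1_000, 14)))

# raw=True passes ndarrays, which Numba can type and compile for.
slopes = df.apply(ols_slope_numba, axis=1, raw=True)
```

The first call includes compilation time, so any timing comparison should exclude or separately report that warm-up.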

  12. Dask
     • Pandas and NumPy distributed computing
     • Bag (standard Python collections), Array (NumPy) and Distributed DataFrame (Pandas)
     • Super-easy parallelised Pandas functions

  13. Dask with Numba

  14. Costs on the “big problem”
     • Sklearn & iloc: 90 minutes
     • Lstsq & apply raw: 10 minutes
     • With Numba: 1 minute
     • With Dask: 30 seconds (180x theoretical speed-up)
     • Don’t go crazy, remember maintenance
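The 180x figure follows directly from the timings quoted on the slide. A small sketch of the arithmetic, using only those numbers (the dictionary and variable names are illustrative):

```python
# Timings quoted on the slide, converted to seconds.
timings_s = {
    "Sklearn & iloc": 90 * 60,
    "Lstsq & apply raw": 10 * 60,
    "With Numba": 60,
    "With Dask": 30,
}

# Speed-up of each approach relative to the slowest one (Sklearn & iloc).
baseline = timings_s["Sklearn & iloc"]
speedups = {name: baseline / seconds for name, seconds in timings_s.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x")
# Sklearn & iloc: 1x
# Lstsq & apply raw: 9x
# With Numba: 90x
# With Dask: 180x
```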
  15. Being “highly performant”
     • iloc & apply are fine
     • Sklearn in dev, lstsq for prod? Maintenance cost...
     • Numba (and Dask) need team buy-in
     • Profile before trying ideas (else you’re guessing)
     • Test everything! Bulwark is nice
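In the spirit of "test everything", a result can be validated before it moves down a pipeline. The sketch below uses plain assertions and is not the Bulwark API; `check_slopes` and its rules are hypothetical examples of the kind of check meant here.

```python
import numpy as np
import pandas as pd

def check_slopes(slopes, expected_len):
    """Hypothetical hand-rolled checks on a slope Series; raises AssertionError on failure."""
    assert len(slopes) == expected_len, "one slope per input row expected"
    assert not slopes.isna().any(), "slopes must not contain NaN"
    assert np.isfinite(slopes.to_numpy()).all(), "slopes must be finite"
    return slopes  # returning the input lets the check sit inside a pipeline

good = check_slopes(pd.Series([0.5, -1.2, 0.0]), expected_len=3)
```

Running the same checks against the output of each faster variant gives some assurance that an optimisation has not changed the answers.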
  16. Thank your organisers
     • Your organisers are volunteers
     • Thank your volunteers & speakers please
     • Get a free (1st ed) signed book later
  17. Upcoming public courses
     • Jan: Successful Data Science Projects
     • Feb: Software Engineering for Data Scientists (2 day)
     • Mar: (planned) High Performance Python
     • https://IanOzsvald.com/training
  18. Summary
     • Measure, don’t guess
     • Test everything
     • I’d love a postcard if you learned something new
     • Join my thoughts+jobs list for tips and my training list
     • Lots of past talks on ianozsvald.com