Higher Performance Python (ODSC 2019)

Tools for Higher Performance Python
Ian Ozsvald, @IanOzsvald, ianozsvald.com
ODSC 2019

Introductions
• Interim Chief Data Scientist
• 19+ years experience
• Quickly build strategic data science plans
• Team coaching & public courses
• Book: 2nd Edition, April 2020
By [ian]@ianozsvald[.com] Ian Ozsvald

Today’s goal
• Introduce profiling, faster Pandas, multi-core
• Reflect on “good practice” so you can be highly performant

A typical higher-performance task
• Calculate features including slope
• Ordinary Least Squares on time series
• 100,000 rows in a DataFrame to trial
• Estimate for 100x (52wk*2 time windows)
• Lots of CPU time – can we do better?

A typical task – need slope of the line

A typical task – add m (the slope) as a new column
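The slope m on these slides is the gradient of the least-squares line through one row of time-series values. A minimal sketch of that calculation with NumPy's `lstsq` follows. The function name `ols_slope`, the evenly spaced x values and the sample data are assumptions made for illustration; they are not taken from the talk.

```python
import numpy as np


def ols_slope(y):
    """Slope m of the least-squares line y = m*x + c, assuming x = 0, 1, 2, ..."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Design matrix with a column for the slope and a column of ones for the intercept.
    A = np.vstack([x, np.ones(len(x))]).T
    (m, c), *_ = np.linalg.lstsq(A, y, rcond=None)
    return m


# Hypothetical row of values that rises by 2 per time step.
weekly_values = [1.0, 3.0, 5.0, 7.0]
slope = ols_slope(weekly_values)
print(slope)
```

Storing the result of this function for every row is what "add m" amounts to.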

Introduce 2 solutions

Sklearn is slow?!

line_profiler
• Sklearn is safe
• These overheads apply to this small data case
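The slide uses line_profiler, which reports time per source line. As a stand-in that needs only the standard library, the sketch below uses `cProfile` instead, which reports time per function rather than per line. The helper names and the random 2000 x 14 array are hypothetical and only serve to give the profiler something to measure.

```python
import cProfile
import io
import pstats

import numpy as np


def ols_slope(y):
    """Least-squares slope of y against x = 0, 1, 2, ..."""
    x = np.arange(len(y), dtype=float)
    A = np.vstack([x, np.ones(len(x))]).T
    return np.linalg.lstsq(A, y, rcond=None)[0][0]


def slopes_for_all_rows(data):
    """Call ols_slope once per row, as a plain Python loop."""
    return [ols_slope(row) for row in data]


# Hypothetical data: many short rows, so per-call overheads dominate.
data = np.random.default_rng(0).random((2000, 14))

profiler = cProfile.Profile()
profiler.enable()
slopes = slopes_for_all_rows(data)
profiler.disable()

# Collect the ten most expensive entries by cumulative time into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

On rows this short, much of the reported time tends to sit in setup work around the solver rather than in the arithmetic, which is the kind of overhead the slide points at.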

Pandas iloc & looping (with lstsq)

Pandas iterrows

Pandas apply

Pandas apply with raw=True
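With `raw=True`, `DataFrame.apply` passes each row to the function as a NumPy array instead of building a Series per row, which removes a per-row overhead. A minimal sketch, assuming a hypothetical DataFrame of random values and an `ols_slope` helper like the one the earlier slides describe:

```python
import numpy as np
import pandas as pd


def ols_slope(row):
    """Least-squares slope of one row against x = 0, 1, 2, ..."""
    y = np.asarray(row, dtype=float)  # accepts a Series or an ndarray
    x = np.arange(len(y), dtype=float)
    A = np.vstack([x, np.ones(len(x))]).T
    m, _c = np.linalg.lstsq(A, y, rcond=None)[0]
    return m


# Hypothetical data: 1000 rows of 14 observations each.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 14)))

# Default: each row arrives as a pandas Series.
slopes_series = df.apply(ols_slope, axis=1)

# raw=True: each row arrives as a plain NumPy array.
slopes_raw = df.apply(ols_slope, axis=1, raw=True)
```

Both calls return the same values; only the per-row cost differs, so timing them (for example with `%timeit` in a notebook) is the way to see the gain on real data.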

Swifter

Numba when raw=True
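Because `raw=True` hands the function a NumPy array, the function can be compiled with Numba. The sketch below is an illustration under assumptions, not the code from the talk: it swaps `lstsq` for the closed-form slope formula written as plain loops, which Numba compiles easily, and it falls back to ordinary Python if Numba is not installed. The function name and the data are hypothetical.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: run uncompiled if Numba is missing.
    def njit(func):
        return func


@njit
def slope_closed_form(y):
    """Closed-form least-squares slope against x = 0, 1, 2, ... (needs len(y) >= 2)."""
    n = y.shape[0]
    x_mean = (n - 1) / 2.0
    y_mean = 0.0
    for i in range(n):
        y_mean += y[i]
    y_mean /= n
    num = 0.0
    den = 0.0
    for i in range(n):
        dx = i - x_mean
        num += dx * (y[i] - y_mean)
        den += dx * dx
    return num / den


# Hypothetical data: every row rises by 0.5 per step.
df = pd.DataFrame(np.arange(40, dtype=np.float64).reshape(4, 10) * 0.5)
slopes = df.apply(slope_closed_form, axis=1, raw=True)
print(slopes)
```

The first call includes compilation time, so any timing comparison should exclude it.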

Dask
• Python, Pandas and NumPy distributed computing
• Bag (standard Python collections), Array (NumPy) and Distributed DataFrame (Pandas)
• dask-ml for distributed sklearn machine learning
• Super-easy parallelised Pandas functions

Dask with Numba

Costs on the “big problem”
• Sklearn & iloc – 90 minutes
• Lstsq & apply raw – 10 minutes
• With Numba – 1 minute
• Add Dask – 30 seconds (180x theoretical speed-up)
• Don’t go crazy – remember maintenance
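The 180x figure follows from dividing the slowest estimate by the fastest. A short sketch of that arithmetic, using only the timings listed on the slide (the variable names are illustrative):

```python
# Estimated run times from the slide, converted to seconds.
timings_seconds = {
    "Sklearn & iloc": 90 * 60,
    "Lstsq & apply raw": 10 * 60,
    "With Numba": 1 * 60,
    "Add Dask": 30,
}

baseline = timings_seconds["Sklearn & iloc"]
speedups = {name: baseline / seconds for name, seconds in timings_seconds.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x")
# Sklearn & iloc: 1x
# Lstsq & apply raw: 9x
# With Numba: 90x
# Add Dask: 180x
```

These are estimates scaled up from a trial, which is why the slide calls the final figure theoretical.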

On being “highly performant”
• iloc & apply are fine, cache where possible
• Sklearn in dev, lstsq for prod? Maintenance cost...
• Numba (and Dask) need team buy-in
• Profile before trying ideas (else you’re guessing)
• Test everything! Bulwark is nice

Book signing (1st ed) later today
• We publish the 2nd edition next April
• Thanks to O’Reilly for free copies to sign
• I run training courses – come chat and tell me your needs please!

Summary
• Measure – don’t guess
• Test everything
• I’d love a postcard if you learned something new
• Join my Thoughts & Jobs email list for tips via my blog
• Lots of past talks on ianozsvald.com
