Slide 1

Slide 1 text

Tools for Higher Performance Python
Ian Ozsvald
@IanOzsvald, ianozsvald.com
PyDataCambridge 2019

Slide 2

Slide 2 text

Introductions

- Interim Chief Data Scientist
- 19+ years experience
- Quickly build strategic data science plans
- Team coaching & public courses

2nd Edition April 2020

By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 3

Slide 3 text

Today’s goal

- Introduce profiling, faster Pandas, multi-core
- Reflect on “good practice” so you can be highly performant

Slide 4

Slide 4 text

A typical higher-performance task

- Calculate features including slope
- Ordinary Least Squares on time series
- 100,000 rows in a DataFrame to trial
- Estimate for 100x (52 weeks × 2 time windows)
- Lots of CPU time: can we do better?
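A minimal sketch of the slope feature, assuming each time series is a 1-D array of evenly spaced observations. The function name `ols_slope` and the sample data are hypothetical and do not come from the slides.

```python
import numpy as np


def ols_slope(y):
    """Return the Ordinary Least Squares slope of y against 0..len(y)-1."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    # Design matrix with columns [x, 1] so the fit is y = m*x + c
    A = np.column_stack([x, np.ones_like(x)])
    coeffs = np.linalg.lstsq(A, y, rcond=None)[0]
    return coeffs[0]  # m, the slope


# Hypothetical series: an exact straight line with slope 3
series = 3.0 * np.arange(14) + 2.0
slope = ols_slope(series)
print(slope)
```

Running this once per row over 100,000 rows, and then over roughly 100 time windows, is where the CPU time goes.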

Slide 5

Slide 5 text

Introduce 2 solutions

Slide 6

Slide 6 text

Sklearn is slow?!

Slide 7

Slide 7 text

line_profiler
Sklearn is safe
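The profiler output from this slide is not in the transcript. As a rough sketch of how line_profiler can be driven from Python, assuming the third-party `line_profiler` package is installed (the code falls back to an unprofiled run if it is not); `ols_lstsq` and `fit_many` are hypothetical names:

```python
import numpy as np

try:
    from line_profiler import LineProfiler  # third-party, assumed installed
except ImportError:
    LineProfiler = None  # fall back to an unprofiled run


def ols_lstsq(y):
    """OLS slope of y against 0..len(y)-1 using NumPy's lstsq."""
    x = np.arange(len(y), dtype=float)
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(A, y, rcond=None)[0][0]


def fit_many(rows):
    """Fit one slope per row."""
    return [ols_lstsq(row) for row in rows]


rng = np.random.default_rng(0)
rows = rng.normal(size=(1_000, 14))

if LineProfiler is not None:
    profiler = LineProfiler()
    profiler.add_function(ols_lstsq)  # also time the inner function line by line
    profiled_fit_many = profiler(fit_many)
    slopes = profiled_fit_many(rows)
    profiler.print_stats()  # per-line hits and time for both functions
else:
    slopes = fit_many(rows)
```

The per-line timings show which lines dominate, which is the evidence needed before deciding what to optimise.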

Slide 8

Slide 8 text

Pandas iloc & looping (with lstsq)
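The code shown on this slide is not in the transcript. A sketch of the pattern, assuming a DataFrame with one time series per row; the function name `ols_lstsq` and the random data are hypothetical:

```python
import numpy as np
import pandas as pd


def ols_lstsq(row):
    """OLS slope of one row (a pandas Series) against 0..n-1."""
    y = row.to_numpy(dtype=float)
    x = np.arange(len(y), dtype=float)
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(A, y, rcond=None)[0][0]


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 14)))

slopes = []
for i in range(df.shape[0]):
    row = df.iloc[i]  # builds a new Series on every iteration
    slopes.append(ols_lstsq(row))

results = pd.Series(slopes, index=df.index)
```

The explicit loop is easy to read and debug, at the cost of constructing a Series for each row.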

Slide 9

Slide 9 text

Pandas apply
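A sketch of the same per-row fit written with `DataFrame.apply`, assuming one time series per row; `ols_lstsq` and the random data are hypothetical and not taken from the slide:

```python
import numpy as np
import pandas as pd


def ols_lstsq(row):
    """OLS slope of one row (a pandas Series) against 0..n-1."""
    y = row.to_numpy(dtype=float)
    x = np.arange(len(y), dtype=float)
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(A, y, rcond=None)[0][0]


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 14)))

# axis=1 calls the function once per row; each row arrives as a Series
results = df.apply(ols_lstsq, axis=1)
```

`apply` removes the loop bookkeeping but still hands each row over as a Series.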

Slide 10

Slide 10 text

Pandas apply with raw=True
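With `raw=True`, pandas passes each row to the function as a NumPy array instead of a Series, which avoids the per-row Series construction. A sketch under that assumption; `ols_lstsq_raw` and the random data are hypothetical:

```python
import numpy as np
import pandas as pd


def ols_lstsq_raw(row):
    """OLS slope; accepts a NumPy array (raw=True) or a Series."""
    y = np.asarray(row, dtype=float)
    x = np.arange(len(y), dtype=float)
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(A, y, rcond=None)[0][0]


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 14)))

results_series = df.apply(ols_lstsq_raw, axis=1)         # rows arrive as Series
results_raw = df.apply(ols_lstsq_raw, axis=1, raw=True)  # rows arrive as ndarrays
```

Both calls return the same slopes; only the type handed to the function changes.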

Slide 11

Slide 11 text

Numba when raw=True
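Because `raw=True` hands over plain NumPy arrays, the per-row function can be compiled with Numba. The sketch below is an assumption, not the slide's code: it uses the closed-form simple-regression slope rather than `lstsq`, so that it does not depend on Numba's linear-algebra support, and it falls back to plain Python if Numba is not installed. The name `ols_slope_raw` is hypothetical.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit  # third-party, assumed installed
except ImportError:
    def njit(func):
        """Fallback so the sketch still runs (uncompiled) without Numba."""
        return func


@njit
def ols_slope_raw(row):
    """Closed-form OLS slope of row against 0..n-1."""
    n = row.shape[0]
    x_mean = (n - 1) / 2.0
    y_mean = 0.0
    for i in range(n):
        y_mean += row[i]
    y_mean /= n
    num = 0.0
    den = 0.0
    for i in range(n):
        dx = i - x_mean
        num += dx * (row[i] - y_mean)
        den += dx * dx
    return num / den


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 14)))

# raw=True is required: the compiled function expects an ndarray, not a Series
results = df.apply(ols_slope_raw, axis=1, raw=True)
```

The first call includes compilation time, so timings are best taken on a second run.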

Slide 12

Slide 12 text

Dask

- Pandas and NumPy distributed computing
- Bag (standard Python collections), Array (NumPy) and Distributed DataFrame (Pandas)
- Super-easy parallelised Pandas functions

Slide 13

Slide 13 text

Dask with Numba

Slide 14

Slide 14 text

Costs on the “big problem”

- Sklearn & iloc: 90 minutes
- Lstsq & apply raw: 10 minutes
- With Numba: 1 minute
- With Dask: 30 seconds (180x theoretical speed-up)
- Don’t go crazy, remember maintenance
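The 180x figure follows directly from the timings listed on this slide. A small worked calculation, with the baseline taken to be the Sklearn & iloc run:

```python
# Timings from the slide, converted to seconds
timings_s = {
    "Sklearn & iloc": 90 * 60,
    "Lstsq & apply raw": 10 * 60,
    "With Numba": 1 * 60,
    "With Dask": 30,
}

baseline = timings_s["Sklearn & iloc"]
speedups = {name: baseline / seconds for name, seconds in timings_s.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x")
```

So the step from Sklearn & iloc to lstsq with apply raw gives 9x, Numba takes it to 90x, and Dask doubles that to 180x.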

Slide 15

Slide 15 text

Being “highly performant”

- iloc & apply are fine
- Sklearn in dev, lstsq for prod? Maintenance cost...
- Numba (and Dask) need team buy-in
- Profile before trying ideas (else you’re guessing)
- Test everything! Bulwark is nice
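As an illustration of "test everything", a faster implementation can be checked against an independent reference before it replaces the slower one. This sketch uses plain assertions and `np.polyfit` as the reference rather than Bulwark; the function names are hypothetical.

```python
import numpy as np


def ols_lstsq(y):
    """OLS slope of y against 0..len(y)-1 using NumPy's lstsq."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(A, y, rcond=None)[0][0]


def check_against_polyfit(n_rows=100, n_cols=14, seed=0):
    """Assert that ols_lstsq agrees with np.polyfit on random rows."""
    rng = np.random.default_rng(seed)
    data = rng.normal(size=(n_rows, n_cols))
    x = np.arange(n_cols)
    for row in data:
        expected = np.polyfit(x, row, 1)[0]  # degree-1 fit, slope first
        assert np.isclose(ols_lstsq(row), expected)
    return n_rows


checked = check_against_polyfit()
```

The same check can be repeated each time the implementation changes, for example after adding Numba or Dask.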

Slide 16

Slide 16 text

Thank your organisers

- Your organisers are volunteers
- Thank your volunteers & speakers please
- Get a free (1st ed) signed book later

Slide 17

Slide 17 text

Upcoming public courses

- Jan: Successful Data Science Projects
- Feb: Software Engineering for Data Scientists (2 day)
- Mar: (planned) High Performance Python
- https://IanOzsvald.com/training

Slide 18

Slide 18 text

Summary

- Measure, don’t guess
- Test everything
- I’d love a postcard if you learned something new
- Join my thoughts+jobs list for tips and my training list
- Lots of past talks on ianozsvald.com