Slide 1

Machine learning libraries you'd wish you'd known about
Ian Ozsvald – @IanOzsvald – ModelInsight.io
London Python, 2018-01

Slide 2

Introductions
● I’m an engineering data scientist
● Consulting in AI + Data Science for 15+ years
● Blog: IanOzsvald.com

Slide 3

Goals today
● Can I calculate on Pandas in parallel?
● Can I automate my machine learning?
● Is my regression working?
● Why did it make that decision?
● GitHub for examples:
● Builds on PyConUK 2016 – my introduction to Random Forests as a worked process with examples and graphs

Slide 4

watermark for reproducibility
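
A minimal sketch of how the watermark IPython extension is typically used in a notebook; the package list here is illustrative, not from the talk.

# In a Jupyter/IPython session: record interpreter, machine and library
# versions alongside your results so they can be reproduced later
%load_ext watermark
%watermark -v -m -p numpy,pandas,sklearn,dask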

Slide 5

Dask for Medium Data Tasks
● Pandas-compatible parallel processor
● Also see: Automated data exploration by Víctor Zabalza at PyConUK 2017
● http://ianozsvald.com/2017/06/07/kaggles-quora-question-paris-competition/
● In top 40% in < 6 days of effort
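
A minimal sketch of the "Pandas in parallel" idea with Dask; the DataFrame contents and partition count are illustrative, not from the talk.

import pandas as pd
import dask.dataframe as dd

# Illustrative pandas DataFrame with a text column to featurise
df = pd.DataFrame({"question": ["how do I parallelise pandas?"] * 100_000})

# Wrap it as a Dask DataFrame, one partition per core/worker
ddf = dd.from_pandas(df, npartitions=4)

# map_partitions runs an ordinary pandas function on each chunk in parallel;
# nothing executes until .compute() is called
lengths = ddf.map_partitions(lambda part: part["question"].str.len())
print(lengths.compute().describe())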

Slide 6

Dask – 3.5x speedup

Slide 7

TPOT – automated ML

Slide 8

TPOT
● Used on Kaggle Mercedes (6-week competition, 5 days of my effort)
● Top 40% result with little more than TPOT and a few days
● Ensembled 3 estimators (2 from TPOT)
● http://ianozsvald.com/2017/07/01/kaggles-mercedes-benz-greener-manufacturing
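
A minimal TPOT sketch on synthetic data as a stand-in for the competition data; the generations and population size are illustrative and far smaller than a real run.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor

# Synthetic regression problem standing in for the Kaggle data
X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# TPOT evolves whole scikit-learn pipelines (preprocessing, estimator, hyperparameters)
tpot = TPOTRegressor(generations=5, population_size=20, verbosity=2, random_state=0)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the winning pipeline as plain scikit-learn code
tpot.export("tpot_best_pipeline.py")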

Slide 9

pandas_profiling for EDA
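
A minimal sketch; the filename is illustrative, and the exact API has shifted between releases (the package is now published as ydata-profiling).

import pandas as pd
import pandas_profiling

# One call builds an HTML report: per-column stats, correlations,
# missing-value counts and warnings about suspicious columns
df = pd.read_csv("boston.csv")  # illustrative filename
profile = pandas_profiling.ProfileReport(df)
profile.to_file("boston_profile.html")  # or simply display `profile` in a notebook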

Slide 10

Why explain our models?
● Check that our model works as we’d expect in the real world – are the “important features” really important? Are they noise?
● Help colleagues gain confidence in the model
● Diagnose whether certain examples are poorly understood

Slide 11

Boston housing data
● Regress median value (MEDV) from the other features
● LSTAT – ‘% lower status of the population’
● RM – ‘average number of rooms per dwelling’
● 13 features overall
● 506 rows
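
A minimal setup sketch for the slides that follow; the variable names (boston, est, X_train, ...) are mine and are reused in the later sketches. load_boston shipped with scikit-learn at the time of the talk (it has since been removed).

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# 506 rows, 13 features, target MEDV (median house value)
boston = load_boston()
X, y = boston.data, boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

est = RandomForestRegressor(n_estimators=100, random_state=0)
est.fit(X_train, y_train)
print(est.score(X_test, y_test))  # R^2 on held-out rows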

Slide 12

Yellowbrick
● Lots of visualisations that plug into sklearn
● Classification – class balance, confusion matrix
● Regression – y vs ŷ, residual errors
● Presented at PyDataLondon 2017
● http://www.scikit-yb.org/en/latest/
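
A minimal regression sketch with Yellowbrick's ResidualsPlot, reusing the Boston split from the sketch above; the Ridge estimator is just an example.

from sklearn.linear_model import Ridge
from yellowbrick.regressor import ResidualsPlot

# Wrap any sklearn regressor, fit and score it, then render the residuals plot
visualizer = ResidualsPlot(Ridge())
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()  # renamed .show() in later Yellowbrick releases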

Slide 13

Yellowbrick

Slide 14

ELI5
● “Explain it like I’m 5!”
● Feature importance via Permutation Importance
● Prediction explanations, including for text
● sklearn, XGBoost, LightGBM
● http://eli5.readthedocs.io/en/latest/

Slide 15

ELI5 – Permutation Importance
● Model agnostic, hopefully not skewed
● Useful with both RF and linear models
(chart: RandomForest's feature importances)
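
A minimal sketch, reusing the fitted est and boston.feature_names from the Boston sketch above.

import eli5
from eli5.sklearn import PermutationImportance

# Shuffle each column in turn on held-out data and measure how much the
# score drops: a (mostly) model-agnostic feature importance
perm = PermutationImportance(est, random_state=0).fit(X_test, y_test)

# In a notebook this renders a table of importances with error estimates
eli5.show_weights(perm, feature_names=list(boston.feature_names))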

Slide 16

Explaining the regression
● ELI5 & LIME can explain single examples
● Expensive house – many rooms, low LSTAT %, good pupil/teacher ratio
● Cheap house – high LSTAT %, few rooms, maybe high nitric oxide pollution and lower pupil/teacher ratio
● These local interpretations are different from the global feature importances
● Also see Kat Jarmul’s keynote @ PyDataWarsaw 2017: https://blog.kjamistan.com/towards-interpretable-reliable-models
● Michał Łopuszyński @ PyDataWarsaw: https://www.slideshare.net/lopusz/debugging-machinelearning

Slide 17

ELI5 explanation
● Model specific
● Explaining the prediction “46.8”
● An expensive property
● RM & LSTAT
● Some PTRATIO
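
A minimal sketch of a single-example explanation with ELI5, reusing the fitted est from the Boston sketch above; the row index is illustrative.

import eli5

# Show which features pushed this one prediction above or below the average
# price; for tree ensembles ELI5 reports per-feature contributions
eli5.show_prediction(est, X_test[0], feature_names=list(boston.feature_names))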

Slide 18

ELI5 explain many examples

Slide 19

ELI5 explain many examples
(chart annotations: “Few rooms, close to employment centres, lower LSTAT%” / “Many rooms (big houses!)”)

Slide 20

LIME (circa 2016)
● Locally linear classifiers built around the one data point you want to explain
● Model agnostic – even images & text!
● https://github.com/marcotcr/lime
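
A minimal tabular-regression sketch with LIME, reusing est and the Boston split from the sketch above.

from lime.lime_tabular import LimeTabularExplainer

# Fit a small local linear model around one row to explain est.predict there
explainer = LimeTabularExplainer(
    X_train,
    feature_names=list(boston.feature_names),
    mode="regression",
)
explanation = explainer.explain_instance(X_test[0], est.predict, num_features=5)
print(explanation.as_list())  # (feature condition, signed local weight) pairs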

Slide 21

LIME
● 10.99 predicted (cheap property)
● Strong negative influences (from the mean price) – LSTAT, RM, NOX, ...
● Caveats: http://eli5.readthedocs.io/en/latest/blackbox/lime.html

Slide 22

Closing...
● Diagnose your ML just like you debug your code – explain its workings to colleagues
● Write-up: http://ianozsvald.com/
● Data science team coaching – can I help?
● Questions in exchange for beer :-)
● Learned something? Please send me a postcard!
● See my longer diagnosis Notebook on GitHub:

Slide 23

Appendix: Dask – 3.5x speedup
https://twitter.com/ianozsvald/status/870643737097056259