Slide 1

Slide 1 text

Machine learning libraries you'd wish you'd known about
PyConUK 2017
Ian Ozsvald @IanOzsvald
ModelInsight.io

Slide 2

Slide 2 text

[email protected] @IanOzsvald PyConUK 2017 ianozsvald.com
Introductions
● I’m an engineering data scientist
● Consulting in AI + Data Science for 15+ years
Blog: IanOzsvald.com

Slide 3

Slide 3 text

Goals today
● Is my regression working?
● Why did it make that decision?
● Can I calculate on Pandas in parallel?
● Can I automate my machine learning?
● GitHub for examples:
Last year – my introduction to Random Forests as a worked process with examples and graphs

Slide 4

Slide 4 text

Why explain?
● Check that our model works as we’d expect in the real world – are the “important features” really important? Are they noise?
● Help colleagues gain confidence in the model
● Diagnose if certain examples are poorly understood

Slide 5

Slide 5 text

Boston housing data
● Regress median-value (MEDV) from other features
● LSTAT - ‘low status %’
● RM - ‘average rooms per dwelling’
● 13 features overall
● 506 rows

Slide 6

Slide 6 text

Yellowbrick
● Lots of visualisations that plug into sklearn
● Classification – class balance, confusion matrix
● Regression – y vs ŷ, residual errors
● Presented at PyDataLondon 2017
● http://www.scikit-yb.org/en/latest/
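
The two regression checks named above (y vs ŷ and residual errors) can be sketched with numpy alone. This is a minimal illustration on synthetic data, not the Boston set and not Yellowbrick's own code: the data, the least-squares model and the helper names `fit_linear` and `predict_linear` are all assumptions made for the example. Yellowbrick draws these quantities as plots; the sketch only computes the numbers behind them.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, w1, w2, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the fitted intercept and weights to new rows."""
    return coef[0] + X @ coef[1:]

rng = np.random.default_rng(0)
# Synthetic stand-in for a housing table (hypothetical): 506 rows, 3 features.
X = rng.normal(size=(506, 3))
y = 22.0 + 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=2.0, size=506)

coef = fit_linear(X, y)
y_hat = predict_linear(coef, X)   # the "ŷ" axis of a y vs ŷ plot
residuals = y - y_hat             # what a residuals plot shows against ŷ

# R^2 summarises how closely the points hug the y = ŷ diagonal.
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot

# A healthy fit has residuals centred on zero with no trend against ŷ.
trend = float(np.corrcoef(y_hat, residuals)[0, 1])
print(f"R^2={r2:.3f}  mean residual={residuals.mean():.2e}  trend={trend:.2e}")
```

On real data the same quantities would be computed on a held-out split, where a visible trend or funnel shape in the residuals suggests the model is missing structure.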

Slide 7

Slide 7 text

Yellowbrick

Slide 8

Slide 8 text

ELI5
● “Explain it like I’m 5!”
● Feature Importance via Permutation Importance
● Prediction explanations including text
● Sklearn, XGBoost, LightGBM
● http://eli5.readthedocs.io/en/latest/

Slide 9

Slide 9 text

ELI5 - Permutation Importance
● Model agnostic, hopefully not skewed
● Useful with both RF and linear models
[Figure: RandomForest's feature importances]
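
Permutation importance itself is simple enough to sketch by hand: shuffle one column at a time on held-out data and record how much the score drops. The version below is an illustration under stated assumptions, not ELI5's implementation: it uses synthetic data, a least-squares linear model as the stand-in model, and R² as the score; `r2` and `permutation_importance` are names defined here for the example.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in R^2 when each column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = r2(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between column j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - r2(y, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

# Synthetic data (hypothetical): two informative features and one noise feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(506, 3))
y = 22.0 + 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=2.0, size=506)
X_train, X_test, y_train, y_test = X[:253], X[253:], y[:253], y[253:]

# Least-squares linear model as the stand-in for any fitted estimator.
A = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
predict = lambda data: coef[0] + data @ coef[1:]

# Only predict() is needed, which is why the method is model agnostic.
importances = permutation_importance(predict, X_test, y_test)
print(importances)
```

Because the function only calls `predict`, the same code works unchanged for a Random Forest or any other estimator; the noise feature should score close to zero.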

Slide 10

Slide 10 text

Explaining the regression
● ELI5 & LIME can explain single examples
● Expensive house – many rooms, low LSTAT %, good pupil/teacher ratio
● Cheap house – high LSTAT %, few rooms, maybe high nitric oxide pollution and lower pupil/teacher ratio
● These interpretations are different to the global feature importances
● Also see Kat Jarmul’s keynote: https://blog.kjamistan.com/towards-interpretable-reliable-models/
● Michał Łopuszyński @ PyDataWarsaw https://www.slideshare.net/lopusz/debugging-machinelearning
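
To see why a single-example explanation differs from a global importance, consider the simplest case, a linear model, where one prediction splits exactly into a baseline (the prediction for the average row) plus one contribution per feature. The sketch below assumes a least-squares linear model and synthetic values; the feature names are borrowed from the slides only as labels, and `explain_row` is a hypothetical helper, not an ELI5 or LIME function.

```python
import numpy as np

rng = np.random.default_rng(1)
feature_names = ["RM", "LSTAT", "PTRATIO"]  # labels from the slides; the values are synthetic
X = rng.normal(size=(506, 3))
y = 22.0 + 5.0 * X[:, 0] - 4.0 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(scale=2.0, size=506)

# Fit ordinary least squares with an intercept.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, weights = coef[0], coef[1:]

def explain_row(x):
    """Split one linear prediction into a baseline plus per-feature contributions."""
    means = X.mean(axis=0)
    baseline = intercept + weights @ means   # prediction for the average row
    contributions = weights * (x - means)    # how far each feature moves this row from it
    return baseline, contributions

# Explain the row with the highest predicted value (the "expensive house").
row = X[np.argmax(A @ coef)]
baseline, contributions = explain_row(row)
prediction = intercept + weights @ row

for name, value in zip(feature_names, contributions):
    print(f"{name:8s} {value:+.2f}")
print(f"baseline {baseline:.2f} -> prediction {prediction:.2f}")
```

The weights are the same for every row, but the contributions depend on how unusual each row's values are, so a feature that matters little globally can still dominate one explanation. Tree ensembles need a different decomposition, which is what the libraries provide.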

Slide 11

Slide 11 text

ELI5 explanation
● Model specific
● Explain “46.8”
● Expensive property
● RM & LSTAT
● Some PTRATIO

Slide 12

Slide 12 text

ELI5 explain many examples

Slide 13

Slide 13 text

LIME (circa 2016)
● Locally linear models built around the single data point you want to explain
● Model agnostic, even images & text!
● https://github.com/marcotcr/lime
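
The core idea behind LIME can be sketched in a few lines: perturb the point of interest, ask the black-box model for predictions, weight the samples by closeness, and fit a weighted linear model whose slopes serve as the explanation. This is a simplified illustration, not the `lime` package's algorithm, which also discretises features and selects a subset of them. The `black_box` function, `explain_locally`, and the `scale` and `kernel_width` values are all assumptions chosen for the example.

```python
import numpy as np

def black_box(X):
    """Hypothetical nonlinear model standing in for any fitted regressor."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def explain_locally(predict, x, n_samples=2000, scale=0.3, kernel_width=0.5, seed=0):
    """Fit a proximity-weighted linear surrogate around the single point x."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the point of interest with Gaussian noise.
    Z = x + rng.normal(scale=scale, size=(n_samples, x.size))
    # 2. Query the black box on the perturbed rows.
    target = predict(Z)
    # 3. Weight each sample by its closeness to x (exponential kernel).
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Weighted least squares on the offsets from x, so the intercept
    #    approximates the prediction at x and the slopes are local effects.
    A = np.column_stack([np.ones(n_samples), Z - x])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], target * sw, rcond=None)
    return coef[0], coef[1:]

x = np.array([0.0, 1.0])
intercept, slopes = explain_locally(black_box, x)
print("local intercept:", intercept)
print("local slopes:", slopes)
```

At this point the true gradient of the toy function is (cos 0, 2 × 1) = (1, 2), so the fitted slopes should land near those values. The result depends on the sampling scale and kernel width, which is one reason to read the caveats before trusting a single explanation.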

Slide 14

Slide 14 text

LIME
● 10.99 predicted (cheap property)
● Strong negative influences (from the mean price) – LSTAT, RM, NOX, ...
Caveats: http://eli5.readthedocs.io/en/latest/blackbox/lime.html

Slide 15

Slide 15 text

Dask for Medium Data Tasks
● Pandas-compatible parallel processor
● Runs on many cores and machines
● Also see: Automated data exploration by Víctor Zabalza from Friday
● http://ianozsvald.com/2017/06/07/kaggles-quora-question-paris-competition/

Slide 16

Slide 16 text

Dask – 3.5x speedup

Slide 17

Slide 17 text

TPOT – automated ML

Slide 18

Slide 18 text

TPOT
● Used on Kaggle Mercedes (6 week competition, 5 days of my effort)
● Top-50% result with little more than TPOT and a few days of work
● Ensembled 3 estimators (2 from TPOT)
● http://ianozsvald.com/2017/07/01/kaggles-mercedes-benz-greener-manufacturing/

Slide 19

Slide 19 text

Closing...
● Diagnose your ML just like you debug your code – explain its working to colleagues
● Write-up: http://ianozsvald.com/
● Training next year – what do you need?
● Questions in exchange for beer :-)
● Please send me a postcard if this is useful
● See my longer diagnosis Notebook on GitHub:

Slide 20

Slide 20 text

Appendix: Dask – 3.5x speedup
https://twitter.com/ianozsvald/status/870643737097056259