
Machine Learning Libraries You'd Wish You'd Known About

ianozsvald
October 30, 2017

Transcript

  1. Machine learning libraries you'd wish you'd known about PyConUK 2017

    Ian Ozsvald @IanOzsvald ModelInsight.io
  2. Ian.Ozsvald@ModelInsight.io @IanOzsvald PyConUK 2017 ianozsvald.com

    Introductions
    • I’m an engineering data scientist
    • Consulting in AI + Data Science for 15+ years
    • Blog: IanOzsvald.com
  3. Goals today

    • Is my regression working?
    • Why did it make that decision?
    • Can I calculate on Pandas in parallel?
    • Can I automate my machine learning?
    • Github for examples: Last year – my introduction to Random Forests as a worked process with examples and graphs
  4. Why explain?

    • Check that our model works as we’d expect in the real world – are the “important features” really important? Are they noise?
    • Help colleagues gain confidence in the model
    • Diagnose if certain examples are poorly understood
  5. Boston housing data

    • Regress median-value (MEDV) from other features
    • LSTAT - ‘low status %’
    • RM - ‘average rooms per dwelling’
    • 13 features overall
    • 506 rows
  6. Yellowbrick

    • Lots of visualisations that plug into sklearn
    • Classification – class balance, confusion matrix
    • Regression – y vs ŷ, residual errors
    • Presented at PyDataLondon 2017
    • http://www.scikit-yb.org/en/latest/
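The regression plots mentioned above (y vs ŷ, residual errors) rest on two simple quantities. The following is a minimal sketch in plain Python rather than Yellowbrick's own API; the values in `y` and `y_hat` are made up for illustration and are not taken from the Boston data.

```python
# Hypothetical observed values (y) and model predictions (y_hat).
y = [24.0, 21.6, 34.7, 33.4, 36.2]
y_hat = [25.1, 22.0, 31.9, 32.8, 35.0]


def residuals(y_true, y_pred):
    """Residual = observed minus predicted, one value per example."""
    return [t - p for t, p in zip(y_true, y_pred)]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


res = residuals(y, y_hat)
score = r_squared(y, y_hat)
print(res)
print(round(score, 3))
```

A residual plot is a scatter of `res` against `y_hat`; a healthy regression shows no obvious pattern in it.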
  7. Yellowbrick

  8. ELI5

    • “Explain it like I’m 5!”
    • Feature Importance via Permutation Importance
    • Prediction explanations including text
    • Sklearn, XGBoost, LightGBM
    • http://eli5.readthedocs.io/en/latest/
  9. ELI5 - Permutation Importance

    • Model agnostic, hopefully not skewed
    • Useful with both RF and linear models

    RandomForest's feature importances:
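The idea behind permutation importance can be sketched without any library: shuffle one feature column at a time and see how much the error grows. The code below is an illustration under stated assumptions, not ELI5's implementation; `toy_model`, the synthetic data and the choice of mean squared error are all hypothetical.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of model over the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_features, seed=0):
    """Increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(model, shuffled, targets) - baseline)
    return importances


def toy_model(row):
    """Hypothetical fitted model: uses features 0 and 1, ignores feature 2."""
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(r) for r in X]

imp = permutation_importance(toy_model, X, y, n_features=3)
print(imp)
```

Because the procedure only calls the model as a black box, the same function works for a Random Forest or a linear model, which is the "model agnostic" point on the slide.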
  10. Explaining the regression

    • ELI5 & LIME can explain single examples
    • Expensive house – many rooms, low LSTAT %, good pupil/teacher ratio
    • Cheap house – high LSTAT %, few rooms, maybe high nitric oxide pollution and lower pupil/teacher ratio
    • These interpretations are different to the global feature importances
    • Also see Kat Jarmul’s keynote: https://blog.kjamistan.com/towards-interpretable-reliable-models/
    • Michał Łopuszyński @ PyDataWarsaw https://www.slideshare.net/lopusz/debugging-machinelearning
  11. ELI5 explanation

    • Model specific
    • Explain “46.8”
    • Expensive property
    • RM & LSTAT
    • Some PTRATIO
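A per-prediction explanation splits one predicted value into a bias plus per-feature contributions. For a linear model that split is exact, which makes it an easy case to sketch; for tree ensembles the contributions would instead come from the decision paths. The coefficients and feature values below are hypothetical and do not reproduce the 46.8 example from the talk.

```python
# Hypothetical linear-model coefficients for three Boston-style features.
bias = 22.5
weights = {"RM": 4.0, "LSTAT": -0.6, "PTRATIO": -0.9}

# Hypothetical feature values, expressed relative to the training mean.
example = {"RM": 2.0, "LSTAT": -8.0, "PTRATIO": -3.0}


def explain(bias, weights, example):
    """Return the prediction and each feature's additive contribution."""
    contributions = {name: weights[name] * example[name] for name in weights}
    prediction = bias + sum(contributions.values())
    return prediction, contributions


prediction, contributions = explain(bias, weights, example)

# Print features from the largest to the smallest absolute contribution.
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:8s} {value:+.2f}")
print(f"{'<BIAS>':8s} {bias:+.2f}")
print(f"prediction {prediction:.2f}")
```

Here more rooms than average and a lower LSTAT both push the price up, matching the pattern the slide describes for an expensive property.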
  12. ELI5 explain many examples

  13. LIME (circa 2016)

    • Locally linear classifiers built around the 1 data point you want to explain
    • Model agnostic, even images & text!
    • https://github.com/marcotcr/lime
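The "locally linear" idea can be sketched in one dimension: perturb the point of interest, weight the perturbed samples by their closeness to it, and fit a weighted straight line to the black-box outputs. This is not the `lime` package's API. It is a simplified illustration that assumes a single feature, a Gaussian proximity kernel and a fixed symmetric grid of perturbations (instead of random sampling) so the result is deterministic; `black_box` is a hypothetical stand-in for a trained model.

```python
import math


def black_box(x):
    """Hypothetical non-linear model we want to explain."""
    return x * x


def local_linear_fit(f, x0, width=0.5, n=101, radius=1.0):
    """Weighted least-squares line through f, centred on x0."""
    # Perturbed inputs on a symmetric grid around x0.
    xs = [x0 - radius + 2 * radius * i / (n - 1) for i in range(n)]
    ys = [f(x) for x in xs]
    # Gaussian kernel: samples near x0 count more.
    ws = [math.exp(-((x - x0) ** 2) / (2 * width ** 2)) for x in xs]

    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))

    slope = cov / var
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = local_linear_fit(black_box, x0=2.0)
print(slope, intercept)
```

The slope of the local line plays the role of the feature's influence near that one data point; a different point would give a different slope, which is why local explanations can disagree with global feature importances.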
  14. LIME

    • 10.99 predicted (cheap property)
    • Strong negative influences (from the mean price) – LSTAT, RM, NOX, ...

    Caveats: http://eli5.readthedocs.io/en/latest/blackbox/lime.html
  15. Dask for Medium Data Tasks

    • Pandas-compatible parallel processor
    • Runs on many cores and machines
    • Also see: Automated data exploration by Víctor Zabalza from Friday
    • http://ianozsvald.com/2017/06/07/kaggles-quora-question-paris-competition/
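The pattern behind parallel dataframe work is split, apply, combine: cut the data into partitions, compute a partial result on each, then merge the partials. The sketch below shows that structure using only the standard library; it is not Dask's API, and the helper names are hypothetical. It uses threads for portability, so it demonstrates the shape of the computation rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(values, n_chunks):
    """Split values into at most n_chunks contiguous partitions."""
    if not values:
        raise ValueError("values must not be empty")
    size = (len(values) + n_chunks - 1) // n_chunks
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_sum_count(part):
    """Per-partition work: return (sum, count) so means combine correctly."""
    return sum(part), len(part)


def parallel_mean(values, n_chunks=4):
    """Mean computed from per-partition partial results."""
    parts = chunk(values, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(partial_sum_count, parts))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


data = list(range(1, 1001))
result = parallel_mean(data)
print(result)
```

Note that each partition returns a sum and a count rather than its own mean: averaging the partition means would be wrong whenever partitions differ in size.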
  16. Dask – 3.5x speedup

  17. TPOT – automated ML

  18. TPOT

    • Used on Kaggle Mercedes (6 week competition, 5 days of my effort)
    • In top 50% result with little more than TPOT and a few days
    • Ensembled 3 estimators (2 from TPOT)
    • http://ianozsvald.com/2017/07/01/kaggles-mercedes-benz-greener-manufacturing/
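The slide does not say how the three estimators were combined. One common option for regression, assumed here purely for illustration, is a plain average of each model's predictions. The three prediction lists below are made up.

```python
def ensemble_average(predictions):
    """Average several models' predictions, example by example."""
    n_models = len(predictions)
    return [sum(column) / n_models for column in zip(*predictions)]


# Hypothetical predictions from three estimators on the same three examples.
model_a = [10.0, 20.0, 30.0]
model_b = [12.0, 18.0, 33.0]
model_c = [11.0, 22.0, 27.0]

blended = ensemble_average([model_a, model_b, model_c])
print(blended)
```

Averaging tends to help most when the individual models make different kinds of errors, since those errors partly cancel.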
  19. Closing...

    • Diagnose your ML just like you debug your code – explain its working to colleagues
    • Write-up: http://ianozsvald.com/
    • Training next year – what do you need?
    • Questions in exchange for beer :-)
    • Please send me a postcard if this is useful
    • See my longer diagnosis Notebook on github:
  20. Appendix: Dask – 3.5x speedup

    https://twitter.com/ianozsvald/status/870643737097056259