
Machine Learning Libraries You'd Wish You'd Known About

ianozsvald
November 15, 2017


Transcript

  1. Machine learning libraries you'd wish you'd known about
    PyDataBudapest 2017
    Ian Ozsvald, @IanOzsvald, ModelInsight.io
  2. Introductions
    Ian.Ozsvald@ModelInsight.io @IanOzsvald[.com] PyDataBudapest / BudapestBI 2017
    • I’m an engineering data scientist
    • Consulting in AI + Data Science for 15+ years
    • Blog: IanOzsvald.com
  3. Goals today
    • Can I calculate on Pandas in parallel?
    • Can I automate my machine learning?
    • Is my regression working?
    • Why did it make that decision?
    • GitHub for examples:
    Builds on PyConUK 2016 – my introduction to Random Forests as a worked process with examples and graphs
  4. Dask for Medium Data Tasks
    • Pandas-compatible parallel processor
    • Runs on many cores and machines
    • Also see: Automated data exploration by Víctor Zabalza at PyConUK 2017
    • http://ianozsvald.com/2017/06/07/kaggles-quora-question-paris-competition/
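The idea behind a partitioned, Pandas-style parallel computation can be sketched without Dask itself. The example below is not Dask's API: it uses only Python's standard library, and the names `partition`, `partial_sum_count` and `parallel_mean` are hypothetical. It splits a column into chunks, reduces each chunk in a worker, and combines the partial results, which is the general split-apply-combine pattern that a partitioned dataframe relies on.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_sum_count(chunk):
    """Per-partition reduction: return (sum, count) for one chunk."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_parts=4):
    """Reduce each partition in a worker, then combine the partial results."""
    chunks = [c for c in partition(values, n_parts) if c]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_sum_count, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Illustrative column of values; a real workload would be far larger.
column = [float(x) for x in range(1, 101)]
print(parallel_mean(column))
```

Threads are used here only to keep the sketch self-contained. For CPU-bound work in CPython they will not give a speedup; spreading the partitions over processes or machines is what a library such as Dask adds.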
  5. Dask – 3.5x speedup

  6. TPOT – automated ML

  7. TPOT
    • Used on Kaggle Mercedes (6-week competition, 5 days of my effort)
    • Top 50% result with little more than TPOT and a few days of work
    • Ensembled 3 estimators (2 from TPOT)
    • http://ianozsvald.com/2017/07/01/kaggles-mercedes-benz-greener-manufacturing
  8. Why explain our models?
    • Check that our model works as we’d expect in the real world – are the “important features” really important? Are they noise?
    • Help colleagues gain confidence in the model
    • Diagnose if certain examples are poorly understood
  9. Boston housing data
    • Regress median-value (MEDV) from other features
    • LSTAT - ‘low status %’
    • RM - ‘average rooms per dwelling’
    • 13 features overall
    • 506 rows
  10. Yellowbrick
    • Lots of visualisations that plug into sklearn
    • Classification – class balance, confusion matrix
    • Regression – y vs ŷ, residual errors
    • Presented at PyDataLondon 2017
    • http://www.scikit-yb.org/en/latest/
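The two regression checks named above, y vs ŷ and residual errors, come down to simple arithmetic that a plotting library then visualises. A minimal sketch in plain Python follows. It does not use Yellowbrick, the numbers are made up for illustration, and the helper names `residuals` and `r_squared` are hypothetical.

```python
def residuals(y_true, y_pred):
    """Residual for each example: actual minus predicted."""
    return [t - p for t, p in zip(y_true, y_pred)]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up house prices and model predictions, for illustration only.
y_true = [24.0, 21.5, 34.5, 33.5, 36.0]
y_pred = [25.0, 21.0, 33.0, 34.0, 35.0]

for actual, predicted, err in zip(y_true, y_pred, residuals(y_true, y_pred)):
    print(f"actual={actual:5.1f}  predicted={predicted:5.1f}  residual={err:+.1f}")
print(f"R^2 = {r_squared(y_true, y_pred):.3f}")
```

In a healthy regression the residuals scatter around zero with no visible trend against ŷ; a curve or funnel shape in a residual plot suggests the model is missing structure in the data.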
  11. Yellowbrick

  12. ELI5
    • “Explain it like I’m 5!”
    • Feature Importance via Permutation Importance
    • Prediction explanations including text
    • Sklearn, XGBoost, LightGBM
    • http://eli5.readthedocs.io/en/latest/
  13. ELI5 - Permutation Importance
    • Model agnostic, hopefully not skewed
    • Useful with both RF and linear models
    RandomForest's feature importances:
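Permutation importance can be illustrated in a few lines: shuffle one feature column, re-score the model, and see how much the error grows. The sketch below is a from-scratch illustration in plain Python, not ELI5's implementation or API. The data is synthetic, and `mse`, `permutation_importance` and the stand-in `model` are hypothetical names.

```python
import random


def mse(model, X, y):
    """Mean squared error of the model's predictions on rows X."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic data: the target depends on feature 0 only; feature 1 is noise.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]
y = [3.0 * row[0] for row in X]


def model(row):
    """Stand-in for a fitted model: any callable mapping a row to a prediction."""
    return 3.0 * row[0]


importances = permutation_importance(model, X, y)
print(importances)
```

Because the procedure only calls the model's prediction function, it works the same way for a random forest or a linear model, which is what "model agnostic" means here. Shuffling the noise feature leaves the error unchanged, so its importance is zero.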
  14. Explaining the regression
    • ELI5 & LIME can explain single examples
    • Expensive house – many rooms, low LSTAT %, good pupil/teacher ratio
    • Cheap house – high LSTAT %, few rooms, maybe high nitric oxide pollution and lower pupil/teacher ratio
    • These interpretations are different to the global feature importances
    • Also see Kat Jarmul’s keynote @ PyDataWarsaw 2017: https://blog.kjamistan.com/towards-interpretable-reliable-models
    • Michał Łopuszyński @ PyDataWarsaw https://www.slideshare.net/lopusz/debugging-machinelearning
  15. ELI5 explanation
    • Model specific
    • Explain “46.8”
    • Expensive property
    • RM & LSTAT
    • Some PTRATIO
  16. ELI5 explain many examples

  17. ELI5 explain many examples
    Few rooms, close to employment centres, lower LSTAT%
    Many rooms (big houses!)
  18. LIME (circa 2016)
    • Locally linear classifiers built around the single data point you want to explain
    • Model agnostic, even images & text!
    • https://github.com/marcotcr/lime
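The "locally linear" idea can be sketched in one dimension: sample points near the example to be explained, weight them by proximity, and fit a straight line to the black-box predictions. The code below is an illustration of that idea in plain Python and is not the `lime` package's API. The function names, the `black_box` stand-in, and the default sampling and kernel parameters are all assumptions chosen for the example.

```python
import math
import random


def black_box(x):
    """Stand-in for an opaque model's prediction function."""
    return x ** 2


def local_linear_explanation(predict, x0, n_samples=500, scale=0.5,
                             kernel_width=0.5, seed=0):
    """Fit a proximity-weighted straight line to predict() around x0.

    Returns (slope, intercept) of the local surrogate model.
    """
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0.0, scale) for _ in range(n_samples)]
    ys = [predict(x) for x in xs]
    # Samples closer to x0 get more weight in the fit.
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]

    # Weighted least squares for a single feature, in closed form.
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var_x = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    slope = cov_xy / var_x
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = local_linear_explanation(black_box, x0=3.0)
print(f"local slope near x=3: {slope:.2f}")
```

The fitted slope approximates how the black-box prediction changes near the chosen point (close to 6 for x squared at x = 3), even though no single straight line describes the model globally. The explanation depends on the sampling scale and kernel width, which is one reason such explanations come with caveats.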
  19. LIME
    • 10.99 predicted (cheap property)
    • Strong negative influences (from the mean price) – LSTAT, RM, NOX, ...
    Caveats: http://eli5.readthedocs.io/en/latest/blackbox/lime.html
  20. Closing...
    • Diagnose your ML just like you debug your code – explain its working to colleagues
    • Write-up: http://ianozsvald.com/
    • Training next year – what do you need?
    • Questions in exchange for beer :-)
    • Learn something? Please send me a postcard!
    • See my longer diagnosis Notebook on GitHub:
  21. Appendix: Dask – 3.5x speedup
    https://twitter.com/ianozsvald/status/870643737097056259