
Machine Learning Libraries You'd Wish You'd Known About

ianozsvald
November 15, 2017


Transcript

  1. [email protected] @IanOzsvald[.com] PyDataBudapest / BudapestBI 2017
    Introductions
    • I’m an engineering data scientist
    • Consulting in AI + Data Science for 15+ years
    • Blog: IanOzsvald.com
  2. Goals today
    • Can I calculate on Pandas in parallel?
    • Can I automate my machine learning?
    • Is my regression working?
    • Why did it make that decision?
    • GitHub for examples:
    • Builds on PyConUK 2016: my introduction to Random Forests as a worked process with examples and graphs
  3. Dask for Medium Data Tasks
    • Pandas-compatible parallel processor
    • Runs on many cores and machines
    • Also see: Automated data exploration by Víctor Zabalza at PyConUK 2017
    • http://ianozsvald.com/2017/06/07/kaggles-quora-question-paris-competition/
  4. TPOT
    • Used on Kaggle Mercedes (6 week competition, 5 days of my effort)
    • Finished in the top 50% with little more than TPOT and a few days of work
    • Ensembled 3 estimators (2 from TPOT)
    • http://ianozsvald.com/2017/07/01/kaggles-mercedes-benz-greener-manufacturing
  5. Why explain our models?
    • Check that our model works as we’d expect in the real world: are the “important features” really important? Are they noise?
    • Help colleagues gain confidence in the model
    • Diagnose if certain examples are poorly understood
  6. Boston housing data
    • Regress median value (MEDV) from the other features
    • LSTAT: percentage of ‘lower status’ population
    • RM: average number of rooms per dwelling
    • 13 features overall
    • 506 rows
  7. Yellowbrick
    • Lots of visualisations that plug into sklearn
    • Classification: class balance, confusion matrix
    • Regression: y vs ŷ, residual errors
    • Presented at PyDataLondon 2017
    • http://www.scikit-yb.org/en/latest/
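The two regression diagnostics named on this slide (y vs ŷ and residual errors) can be computed by hand before plotting them. The following is a minimal NumPy-only sketch on synthetic data; it does not use Yellowbrick or the Boston data, and the coefficients and noise level are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 506  # same row count as the Boston data, but these values are synthetic
X = rng.normal(size=(n, 2))
y = 22.0 + 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=2.0, size=n)

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

# Residual errors: what a residuals plot puts on its vertical axis.
residuals = y - y_hat

# R^2 summarises how closely y_hat tracks y (the "y vs y-hat" view).
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# For a well-specified linear fit the residuals should show no trend
# against the predictions.
resid_vs_pred = np.corrcoef(y_hat, residuals)[0, 1]
```

A visible curve or funnel shape in the residuals against `y_hat` would suggest that the model is missing structure, which is the kind of problem these plots are meant to expose.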
  8. ELI5
    • “Explain it like I’m 5!”
    • Feature Importance via Permutation Importance
    • Prediction explanations including text
    • Sklearn, XGBoost, LightGBM
    • http://eli5.readthedocs.io/en/latest/
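The idea behind permutation importance can be sketched without any library: shuffle one feature at a time and measure how much the model's error grows. The sketch below is NumPy-only and is not ELI5's API; the synthetic data, the least-squares model standing in for a fitted estimator, and the helper names `predict`, `mse` and `permutation_importances` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Feature 0 matters a lot, feature 1 a little, feature 2 not at all.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Stand-in model: ordinary least squares with an intercept.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)


def predict(features):
    """Predict with the fitted linear model."""
    return np.column_stack([np.ones(len(features)), features]) @ coef


def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))


baseline = mse(y, predict(X))


def permutation_importances(X, y, n_repeats=10):
    """Mean increase in MSE when each column is shuffled in turn."""
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(mse(y, predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return importances


importances = permutation_importances(X, y)
```

Because the procedure only calls `predict`, it works the same way for any model. For simplicity this sketch scores on the training rows; scoring on held-out rows is usually the better choice.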
  9. ELI5 - Permutation Importance
    • Model agnostic, hopefully not skewed
    • Useful with both RF and linear models
    • [Figure: RandomForest's feature importances]
  10. Explaining the regression
    • ELI5 & LIME can explain single examples
    • Expensive house: many rooms, low LSTAT %, good pupil/teacher ratio
    • Cheap house: high LSTAT %, few rooms, maybe high nitric oxide pollution and lower pupil/teacher ratio
    • These interpretations are different to the global feature importances
    • Also see Kat Jarmul’s keynote @ PyDataWarsaw 2017: https://blog.kjamistan.com/towards-interpretable-reliable-models
    • Michał Łopuszyński @ PyDataWarsaw https://www.slideshare.net/lopusz/debugging-machinelearning
  11. ELI5 explanation
    • Model specific
    • Explain “46.8”
    • Expensive property
    • RM & LSTAT
    • Some PTRATIO
  12. ELI5 explain many examples
    • Few rooms, close to employment centres, lower LSTAT%
    • Many rooms (big houses!)
  13. LIME (circa 2016)
    • Locally linear classifiers built around the single data point you want to explain
    • Model agnostic, even images & text!
    • https://github.com/marcotcr/lime
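The "locally linear model around one data point" idea can be sketched in a few lines of NumPy. This is not the `lime` package's API and it fits a regression rather than a classifier; the black-box function, the sampling scale and the kernel width are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)


def black_box(X):
    """Hypothetical non-linear model we want to explain (feature 2 is unused)."""
    return X[:, 0] ** 2 + 3.0 * X[:, 1]


# The single data point to explain.
x0 = np.array([2.0, 1.0, 0.0])

# 1. Sample perturbed points in the neighbourhood of x0.
n_samples = 2000
Z = x0 + rng.normal(scale=0.3, size=(n_samples, 3))

# 2. Weight each sample by its proximity to x0 (exponential kernel).
kernel_width = 0.75
dist = np.linalg.norm(Z - x0, axis=1)
weights = np.exp(-(dist ** 2) / kernel_width ** 2)

# 3. Fit a weighted linear model to the black box's outputs.
#    Scaling rows by sqrt(weight) turns weighted least squares into ordinary least squares.
A = np.column_stack([np.ones(n_samples), Z - x0])
sw = np.sqrt(weights)
coef, *_ = np.linalg.lstsq(A * sw[:, None], black_box(Z) * sw, rcond=None)

# The slopes are the local explanation: how the prediction moves per unit
# change in each feature near x0.
local_effects = coef[1:]
```

Near `x0` the black box has slopes of about 4, 3 and 0 for the three features, and `local_effects` should land close to those values. A different `x0` would give a different explanation, which is why local explanations can disagree with global feature importances.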
  14. LIME
    • 10.99 predicted (cheap property)
    • Strong negative influences (relative to the mean price): LSTAT, RM, NOX, ...
    • Caveats: http://eli5.readthedocs.io/en/latest/blackbox/lime.html
  15. Closing...
    • Diagnose your ML just like you debug your code: explain its working to colleagues
    • Write-up: http://ianozsvald.com/
    • Training next year: what do you need?
    • Questions in exchange for beer :-)
    • Learn something? Please send me a postcard!
    • See my longer diagnosis Notebook on GitHub:
  16. Appendix: Dask - 3.5× speedup
    • https://twitter.com/ianozsvald/status/870643737097056259