Machine Learning Libraries You'd Wish You'd Known About

ianozsvald
October 30, 2017

Transcript

  1. Machine learning libraries you'd wish
    you'd known about
    PyConUK 2017
    Ian Ozsvald @IanOzsvald ModelInsight.io

  2. [email protected] @IanOzsvald PyConUK 2017
    ianozsvald.com
    Introductions

    I’m an engineering data scientist

    Consulting in AI + Data Science for 15+
    years
    Blog->IanOzsvald.com

  3. Goals today

    Is my regression working?

    Why did it make that decision?

    Can I calculate on Pandas in parallel?

    Can I automate my machine learning?

    GitHub for examples:
    Last year – my introduction to Random Forests as a worked
    process with examples and graphs

  4. Why explain?

    Check that our model works as we’d
    expect in the real world – are the
    “important features” really important? Are
    they noise?

    Help colleagues gain confidence in the
    model

    Diagnose if certain examples are poorly
    understood

  5. Boston housing data

    Regress median-value (MEDV) from
    other features

    LSTAT - ‘low status %’

    RM - ‘average rooms per dwelling’

    13 features overall

    506 rows

  6. Yellowbrick

    Lots of visualisations that plug into
    sklearn

    Classification – class balance, confusion
    matrix

    Regression – y vs ŷ, residual errors

    Presented at PyDataLondon 2017

    http://www.scikit-yb.org/en/latest/
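
The "y vs ŷ" and residual views listed above can be reproduced by hand on a small scale. The sketch below is plain Python rather than Yellowbrick's API, and the six data points are invented for illustration. It fits a one-feature least-squares line, then computes the predictions and residuals, which are the quantities a residuals plot displays.

```python
# Sketch only: plain Python, not Yellowbrick's API. The data is made up.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

slope, intercept = fit_line(xs, ys)
preds = [slope * x + intercept for x in xs]
residuals = [y - p for y, p in zip(ys, preds)]  # y minus y_hat
score = r_squared(ys, preds)

for y, p, r in zip(ys, preds, residuals):
    print(f"y={y:5.2f}  y_hat={p:5.2f}  residual={r:+.2f}")
print(f"R^2 = {score:.3f}")
```

If the residuals show a trend or a funnel shape rather than scatter around zero, the regression is probably missing structure in the data.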

  7. Yellowbrick

  8. ELI5

    “Explain it like I’m 5!”

    Feature Importance via Permutation
    Importance

    Prediction explanations including text

    Sklearn, XGBoost, LightGBM

    http://eli5.readthedocs.io/en/latest/

  9. ELI5 - Permutation Importance

    Model agnostic, hopefully not skewed

    Useful with both RF and linear models
    RandomForest's feature importances:
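
The idea behind permutation importance can be sketched in a few lines. This is a minimal illustration of the technique, not ELI5's implementation: the `model` function is a stand-in for a fitted estimator's predict method, and the three-feature dataset is synthetic. Each column is shuffled in turn and the resulting increase in mean squared error is recorded, so a feature the model ignores scores zero.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of model predictions against targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=10, seed=0):
    """Mean increase in MSE when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild every row with only this one column replaced.
            permuted = [r[:col] + [v] + r[col + 1:]
                        for r, v in zip(rows, shuffled)]
            increases.append(mse(model, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic data (an assumption for illustration): feature 0 matters a lot,
# feature 1 a little, feature 2 not at all.
data_rng = random.Random(42)
rows = [[data_rng.uniform(0, 1) for _ in range(3)] for _ in range(200)]
targets = [5.0 * r[0] + 1.0 * r[1] for r in rows]


def model(row):
    """Stand-in for a fitted estimator's predict()."""
    return 5.0 * row[0] + 1.0 * row[1]


importances = permutation_importance(model, rows, targets)
for i, value in enumerate(importances):
    print(f"feature {i}: {value:.3f}")
```

Because the method only needs predictions and a score, the same loop works for a Random Forest or a linear model.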

  10. Explaining the regression

    ELI5 & LIME can explain single examples

    Expensive house – many rooms, low LSTAT %, good
    pupil/teacher ratio

    Cheap house – high LSTAT %, few rooms, maybe high nitric
    oxide pollution and lower pupil/teacher ratio

    These interpretations are different to the global feature
    importances

    Also see Kat Jarmul’s keynote:
    https://blog.kjamistan.com/towards-interpretable-reliable-models/

    Michał Łopuszyński @ PyDataWarsaw
    https://www.slideshare.net/lopusz/debugging-machinelearning

  11. ELI5 explanation

    Model specific

    Explain “46.8”

    Expensive property

    RM & LSTAT

    Some PTRATIO
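
A single-prediction explanation of this kind is an additive breakdown: a bias term plus one contribution per feature. The sketch below shows the simplest case, a linear model, where each contribution is weight times feature value. It does not use ELI5, it is not how the slide's model was explained, and the coefficients and feature values are hypothetical numbers invented for illustration.

```python
def explain_linear(bias, weights, row):
    """Return (prediction, per-feature contributions) for a linear model."""
    contributions = {name: weights[name] * value for name, value in row.items()}
    return bias + sum(contributions.values()), contributions


# Hypothetical coefficients and feature values, invented for illustration.
bias = 20.0
weights = {"RM": 5.0, "LSTAT": -0.5, "PTRATIO": -1.0}
row = {"RM": 8.0, "LSTAT": 4.0, "PTRATIO": 15.0}

prediction, contributions = explain_linear(bias, weights, row)

# Rank features by the size of their effect on this one prediction.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

print(f"prediction = {prediction}")
for name, value in ranked:
    print(f"{name:8s} {value:+.1f}")
```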

  12. ELI5 explain many examples

  13. LIME (circa 2016)

    Locally linear classifiers built around the
    single data point you want to explain

    Model agnostic, even images & text!

    https://github.com/marcotcr/lime
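
The local-surrogate idea can be sketched without the library. The code below is an illustration under stated assumptions, not LIME's implementation: it samples Gaussian perturbations around one point, weights them with an exponential proximity kernel, and fits an unregularised weighted least-squares line. The `black_box` function, the sampling scale and the kernel width are all hypothetical choices.

```python
import math
import random


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def explain_locally(predict, point, n_samples=2000, scale=0.3, width=0.5, seed=0):
    """Fit a proximity-weighted linear model around `point`.

    Returns (intercept, weights); the weights approximate the local slope
    of `predict` with respect to each feature.
    """
    rng = random.Random(seed)
    k = len(point) + 1
    xtwx = [[0.0] * k for _ in range(k)]
    xtwy = [0.0] * k
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, scale) for _ in point]
        sample = [p + d for p, d in zip(point, delta)]
        # Nearby samples count more than distant ones.
        weight = math.exp(-sum(d * d for d in delta) / width ** 2)
        features = [1.0] + delta  # intercept column plus offsets from the point
        y = predict(sample)
        for i in range(k):
            xtwy[i] += weight * features[i] * y
            for j in range(k):
                xtwx[i][j] += weight * features[i] * features[j]
    coefs = solve(xtwx, xtwy)
    return coefs[0], coefs[1:]


def black_box(x):
    """Hypothetical non-linear model; any predict function would do."""
    return x[0] ** 2 + 3.0 * x[1]


intercept, weights = explain_locally(black_box, [2.0, 1.0])
print([round(w, 2) for w in weights])
```

Around the point (2, 1) the true local slopes of `black_box` are 4 and 3, so the fitted weights should land close to those values. The explanation is only valid near that point, which is why it can differ from global feature importances.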

  14. LIME

    10.99 predicted (cheap property)

    Strong negative influences (from the
    mean price) – LSTAT, RM, NOX, ...
    Caveats: http://eli5.readthedocs.io/en/latest/blackbox/lime.html

  15. Dask for Medium Data Tasks

    Pandas-compatible parallel processor

    Runs on many cores and machines

    Also see: Automated data exploration by
    Víctor Zabalza from Friday

    http://ianozsvald.com/2017/06/07/kaggles-quora-question-paris-competition/
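
The partition-then-combine pattern behind a parallel dataframe can be sketched with the standard library. This does not use Dask: it splits a list into chunks, computes a partial result per chunk in a worker pool, and combines the partials into a mean. Threads are used here only to keep the example simple and portable; for CPU-bound pure-Python work a process pool or a cluster would be needed to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_stats(chunk):
    """Per-partition work: return (sum, count) so results can be combined."""
    return sum(chunk), len(chunk)


def parallel_mean(values, n_parts=4):
    """Mean computed from per-partition partial results."""
    chunks = [c for c in partition(values, n_parts) if c]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


values = [float(v) for v in range(1, 1001)]
result = parallel_mean(values)
print(result)
```

Returning (sum, count) rather than a per-chunk mean matters: averaging the chunk means would be wrong whenever the chunks differ in size.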

  16. Dask – 3.5x speedup

  17. TPOT – automated ML

  18. TPOT

    Used on Kaggle Mercedes (6 week
    competition, 5 days of my effort)

    Top 50% result with little more than
    TPOT and a few days

    Ensembled 3 estimators (2 from TPOT)

    http://ianozsvald.com/2017/07/01/kaggles-mercedes-benz-greener-manufacturing/
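
TPOT searches over scikit-learn pipelines using genetic programming. The sketch below does not use TPOT and shows only the simplest form of the underlying idea: score each candidate by cross-validation and keep the best. The three candidate models and the synthetic data are assumptions chosen for illustration.

```python
import random


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """One-feature least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """1-nearest-neighbour regression."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def cv_mse(fit, xs, ys, k=5):
    """Mean squared error over all held-out points of a k-fold split."""
    errors = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                 if i not in test_idx]
        model = fit([x for x, _ in train], [y for _, y in train])
        errors.extend((model(xs[i]) - ys[i]) ** 2 for i in test_idx)
    return sum(errors) / len(errors)


# Synthetic, roughly linear data (an assumption for illustration).
rng = random.Random(0)
xs = [i * 0.25 for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]

candidates = {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest}
scores = {name: cv_mse(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

for name, value in scores.items():
    print(f"{name:8s} cv_mse={value:.4f}")
print("best:", best)
```

An automated tool extends this loop to preprocessing steps and hyperparameters, which is where the search space becomes too large to enumerate by hand.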

  19. Closing...

    Diagnose your ML just like you debug your code –
    explain its workings to colleagues

    Write-up: http://ianozsvald.com/

    Training next year – what do you need?

    Questions in exchange for beer :-)

    Please send me a postcard if this is useful

    See my longer diagnosis
    notebook on GitHub:

  20. Appendix: Dask – 3.5x speedup
    https://twitter.com/ianozsvald/status/870643737097056259
