Machine Learning Libraries You'd Wish You'd Known About

ianozsvald
November 15, 2017

Transcript

  1. Machine learning libraries you'd wish
    you'd known about
    PyDataBudapest 2017
    Ian Ozsvald @IanOzsvald ModelInsight.io

  2. [email protected] @IanOzsvald[.com]
    PyDataBudapest / BudapestBI 2017
    Introductions

    I’m an engineering data scientist

    Consulting in AI + Data Science for 15+
    years
    Blog->IanOzsvald.com

  3. Goals today

    Can I calculate on Pandas in parallel?

    Can I automate my machine learning?

    Is my regression working?

    Why did it make that decision?

    Github for examples:
    Builds on PyConUK 2016 – my introduction to Random
    Forests as a worked process with examples and graphs

  4. Dask for Medium Data Tasks

    Pandas-compatible parallel processor

    Runs on many cores and machines

    Also see: Automated data exploration by
    Víctor Zabalza at PyConUK 2017

    http://ianozsvald.com/2017/06/07/kaggles-quora-question-paris-competition/

  5. Dask – 3.5x speedup

  6. TPOT – automated ML

  7. TPOT

    Used on Kaggle Mercedes (6-week
    competition, 5 days of my effort)

    Finished in the top 50% with little more
    than TPOT and a few days of work

    Ensembled 3 estimators (2 from TPOT)

    http://ianozsvald.com/2017/07/01/kaggles-mercedes-benz-greener-manufacturing

  8. Why explain our models?

    Check that our model works as we’d
    expect in the real world – are the
    “important features” really important? Are
    they noise?

    Help colleagues gain confidence in the
    model

    Diagnose if certain examples are poorly
    understood

  9. Boston housing data

    Regress median-value (MEDV) from
    other features

    LSTAT - ‘low status %’

    RM - ‘average rooms per dwelling’

    13 features overall

    506 rows

  10. Yellowbrick

    Lots of visualisations that plug into
    sklearn

    Classification – class balance, confusion
    matrix

    Regression – y vs ŷ, residual errors

    Presented at PyDataLondon 2017

    http://www.scikit-yb.org/en/latest/
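
The quantities behind the "y vs ŷ" and residual plots can be computed directly with scikit-learn, which may help when checking a regression before plotting anything. This is a sketch, not Yellowbrick's API: the data is synthetic (`make_regression`, shaped like the 506 x 13 Boston table) and the random forest settings are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston housing table.
X, y = make_regression(n_samples=506, n_features=13, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# y_hat against y_test is the "y vs ŷ" view; residuals are what is left over.
y_hat = model.predict(X_test)
residuals = y_test - y_hat

print("mean residual:", residuals.mean())
print("residual std: ", residuals.std())
# A strong correlation between predictions and residuals hints that the
# model is systematically over- or under-predicting in part of the range.
print("corr(y_hat, residual):", np.corrcoef(y_hat, residuals)[0, 1])
```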

  11. Yellowbrick

  12. ELI5

    “Explain it like I’m 5!”

    Feature Importance via Permutation
    Importance

    Prediction explanations including text

    Sklearn, XGBoost, LightGBM

    http://eli5.readthedocs.io/en/latest/

  13. ELI5 - Permutation Importance

    Model agnostic, hopefully not skewed

    Useful with both RF and linear models
    RandomForest's feature importances:
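
Both kinds of importance can be put side by side with a short sketch. The helper below is a hand-rolled version of the permutation idea (shuffle one column, measure how much the score drops), not ELI5's implementation. The name `permutation_importance_sketch`, the synthetic data, and the use of R² as the score are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def permutation_importance_sketch(model, X, y, n_repeats=5, seed=0):
    """Mean drop in R^2 when each column is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, model.predict(X))
    drops = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between this column and the target.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops[col] += baseline - r2_score(y, model.predict(X_perm))
    return drops / n_repeats


# Synthetic stand-in data: with shuffle=False only the first 3 columns
# carry signal, the other 10 are noise.
X, y = make_regression(n_samples=506, n_features=13, n_informative=3,
                       noise=5.0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

builtin = model.feature_importances_  # the forest's own importances
perm = permutation_importance_sketch(model, X_test, y_test)  # on held-out data

for col in np.argsort(perm)[::-1][:5]:
    print(f"feature {col}: permutation={perm[col]:.3f} builtin={builtin[col]:.3f}")
```

Because the helper only calls `model.predict`, the same function can be pointed at a linear model or any other fitted regressor.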

  14. Explaining the regression

    ELI5 & LIME can explain single examples

    Expensive house – many rooms, low LSTAT %, good
    pupil/teacher ratio

    Cheap house – high LSTAT %, few rooms, maybe high nitric
    oxide pollution and a poorer pupil/teacher ratio

    These interpretations are different to the global feature
    importances

    Also see Kat Jarmul’s keynote @ PyDataWarsaw 2017:
    https://blog.kjamistan.com/towards-interpretable-reliable-models

    Michał Łopuszyński @ PyDataWarsaw
    https://www.slideshare.net/lopusz/debugging-machinelearning

  15. ELI5 explanation

    Model specific

    Explain “46.8”

    Expensive property

    RM & LSTAT

    Some PTRATIO

  16. ELI5 explain many examples

  17. ELI5 explain many examples

    Few rooms, close to employment centres, lower LSTAT%

    Many rooms (big houses!)

  18. LIME (circa 2016)

    Locally linear classifiers built around the
    single data point you want to explain

    Model agnostic, even images & text!

    https://github.com/marcotcr/lime
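
The local-surrogate idea can be sketched by hand for a regression model: perturb the point of interest, ask the black box for predictions, weight the perturbed samples by how close they are to the original, and fit a weighted linear model. This is a simplified illustration, not the `lime` package's API. The function name, the Gaussian perturbations, the kernel width, and the Ridge surrogate are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge


def local_linear_explanation(model, X_train, x, n_samples=2000, seed=0):
    """Fit a proximity-weighted linear surrogate around x (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    scale = X_train.std(axis=0)

    # Perturb x with Gaussian noise, expressed in units of each feature's std.
    offsets = rng.normal(size=(n_samples, x.shape[0]))
    samples = x + offsets * scale
    preds = model.predict(samples)

    # Closer samples get more weight (assumed exponential kernel and width).
    kernel_width = 0.75 * np.sqrt(x.shape[0])
    dist = np.linalg.norm(offsets, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)

    surrogate = Ridge(alpha=1.0)
    surrogate.fit(offsets, preds, sample_weight=weights)
    # Each coefficient is the local change in prediction per 1 std of a feature.
    return surrogate.coef_, surrogate.intercept_


# Synthetic stand-in data: only the first 3 of 13 columns carry signal.
X, y = make_regression(n_samples=506, n_features=13, n_informative=3,
                       noise=5.0, shuffle=False, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

coefs, intercept = local_linear_explanation(model, X, X[0])
for col in np.argsort(np.abs(coefs))[::-1][:3]:
    print(f"feature {col}: local weight {coefs[col]:+.2f}")
```

Only `model.predict` is used, which is what makes the approach model agnostic. The explanation depends on the perturbation scale and kernel width, so different settings can give different weights for the same point.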

  19. LIME

    10.99 predicted (cheap property)

    Strong negative influences (from the
    mean price) – LSTAT, RM, NOX, ...
    Caveats: http://eli5.readthedocs.io/en/latest/blackbox/lime.html

  20. Closing...

    Diagnose your ML just like you debug your code –
    explain its working to colleagues

    Write-up: http://ianozsvald.com/

    Training next year – what do you need?

    Questions in exchange for beer :-)

    Learn something? Please send me a postcard!

    See my longer diagnosis
    Notebook on github:

  21. Appendix: Dask – 3.5x speedup
    https://twitter.com/ianozsvald/status/870643737097056259
