[email protected] @IanOzsvald[.com] PyDataBudapest / BudapestBI 2017

Introductions
● I’m an engineering data scientist
● Consulting in AI + Data Science for 15+ years
Blog -> IanOzsvald.com

Goals today
● Can I calculate on Pandas in parallel?
● Can I automate my machine learning?
● Is my regression working?
● Why did it make that decision?
● GitHub for examples:
Builds on PyConUK 2016 – my introduction to Random Forests as a worked process with examples and graphs

Dask for Medium Data Tasks
● Pandas-compatible parallel processor
● Runs on many cores and machines
● Also see: Automated data exploration by Víctor Zabalza at PyConUK 2017
● http://ianozsvald.com/2017/06/07/kaggles-quora-question-paris-competition/
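
A minimal sketch of the split-apply-combine idea behind "Pandas-compatible parallel processor", using only the Python standard library. This is not Dask's API: `partial_sum_count` and `parallel_mean` are hypothetical names, and a thread pool stands in for Dask's scheduler. Dask applies the same pattern to a DataFrame split into Pandas partitions, and can run the partitions on separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(partition):
    """Per-partition work: return (sum, count) so partial results can be combined."""
    return sum(partition), len(partition)


def parallel_mean(values, n_partitions=4):
    """Mean of `values`, computed partition by partition (hypothetical helper)."""
    if not values:
        raise ValueError("values must not be empty")
    size = -(-len(values) // n_partitions)  # ceiling division
    partitions = [values[i:i + size] for i in range(0, len(values), size)]
    # Threads keep the sketch simple; they illustrate the structure, not a speed-up.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sum_count, partitions))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 101))
result = parallel_mean(values)
print(result)
```

The point of returning `(sum, count)` rather than a per-partition mean is that partial means cannot be averaged safely when partitions differ in size.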

TPOT
● Used on Kaggle Mercedes (6 week competition, 5 days of my effort)
● Top-50% result with little more than TPOT and a few days of work
● Ensembled 3 estimators (2 from TPOT)
● http://ianozsvald.com/2017/07/01/kaggles-mercedes-benz-greener-manufacturing

Why explain our models?
● Check that our model works as we’d expect in the real world – are the “important features” really important? Are they noise?
● Help colleagues gain confidence in the model
● Diagnose whether certain examples are poorly understood

Boston housing data
● Regress median value (MEDV) from the other features
● LSTAT - ‘low status %’
● RM - average number of rooms per dwelling
● 13 features overall
● 506 rows

ELI5 - Permutation Importance
● Model agnostic, hopefully not skewed
● Useful with both RF and linear models
[Figure: RandomForest's feature importances]
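
The idea behind permutation importance can be sketched in a few lines: shuffle one feature column, re-score the model, and see how much the error grows. This is a standard-library illustration only, not ELI5's implementation; `permutation_importance`, `toy_predict` and the toy data are all hypothetical. Because it only calls a `predict` function, the same code works for a Random Forest or a linear model.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Per feature: mean increase in MSE after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            score = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(score - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Stand-in for a fitted model: uses feature 0 and ignores feature 1."""
    return 3.0 * row[0]


data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance(toy_predict, X, y)
print(importances)  # feature 0 gets a positive score, feature 1 scores zero
```

A feature the model ignores gets an importance of zero, which is the "are they noise?" check from the earlier slide. In practice the scoring should use held-out data rather than the training set.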

Explaining the regression
● ELI5 & LIME can explain single examples
● Expensive house – many rooms, low LSTAT %, good pupil/teacher ratio
● Cheap house – high LSTAT %, few rooms, maybe high nitric oxide pollution and lower pupil/teacher ratio
● These per-example interpretations differ from the global feature importances
● Also see Kat Jarmul’s keynote @ PyDataWarsaw 2017: https://blog.kjamistan.com/towards-interpretable-reliable-models
● Michał Łopuszyński @ PyDataWarsaw https://www.slideshare.net/lopusz/debugging-machinelearning

ELI5 explain many examples
[Figure annotations: "Few rooms, close to employment centres, lower LSTAT%"; "Many rooms (big houses!)"]

LIME (circa 2016)
● Locally linear models built around the one data point you want to explain
● Model agnostic, even images & text!
● https://github.com/marcotcr/lime
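
A rough sketch of the "locally linear" idea for tabular data: sample points near the example, weight them by closeness, and fit a weighted linear model whose slopes serve as the explanation. This is not the `lime` package's API; `local_linear_explanation`, `black_box`, the Gaussian sampling and the exponential kernel are assumptions chosen for illustration.

```python
import math
import random


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def local_linear_explanation(predict, x0, n_samples=500, scale=0.1,
                             kernel_width=0.25, seed=0):
    """Return (intercept, slopes) of a weighted linear fit around x0."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x0]  # perturbed neighbour
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x0))
        weights.append(math.exp(-dist2 / kernel_width ** 2))  # closer = heavier
        rows.append([1.0] + [zi - xi for zi, xi in zip(z, x0)])
        targets.append(predict(z))
    # Weighted normal equations: (X^T W X) beta = X^T W y
    p = len(x0) + 1
    A = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(p)]
         for i in range(p)]
    b = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets))
         for i in range(p)]
    beta = solve_linear_system(A, b)
    return beta[0], beta[1:]


def black_box(x):
    """Stand-in for any fitted model: non-linear in x[0], linear in x[1]."""
    return x[0] ** 2 + 5.0 * x[1]


intercept, slopes = local_linear_explanation(black_box, [1.0, 2.0])
print(intercept, slopes)
```

Near the point (1, 2) the slopes approximate the model's local sensitivities (about 2 and 5 here), which is why an explanation for one example can differ from the global feature importances.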

Closing...
● Diagnose your ML just like you debug your code – explain its workings to colleagues
● Write-up: http://ianozsvald.com/
● Training next year – what do you need?
● Questions in exchange for beer :-)
● Learn something? Please send me a postcard!
● See my longer diagnosis Notebook on GitHub: