Slide 1

Slide 1 text

Creating Correct Classifiers
PyDataLondon 2018
Ian Ozsvald @IanOzsvald ModelInsight.io

Slide 2

Slide 2 text

Ian.Ozsvald@ModelInsight.io @IanOzsvald[.com] PyDataLondon 2018
Introductions
● I’m an engineering data scientist
● Consulting in AI + Data Science for 15+ years
Blog: IanOzsvald.com

Slide 3

Slide 3 text

NumFOCUS
● Have you thanked a speaker, a volunteer and a NumFOCUS organiser yet? Lots of volunteered time – please say thanks
● Leah can’t make it due to illness – please Tweet “@numfocus Leah get well soon from London!”
● Book signing (High Performance Python) at lunch

Slide 4

Slide 4 text

Goals today
● Get a baseline model
● Visualise errors & diagnose problem areas
● Explain decisions
● GitHub for examples:

Slide 5

Slide 5 text

pandas_profiling

Slide 6

Slide 6 text

pandas_profiling

Slide 7

Slide 7 text

DummyClassifier

Slide 8

Slide 8 text

DummyClassifier
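
The slide images are not part of this transcript, so here is a minimal sketch of what a DummyClassifier baseline could look like. The synthetic data from make_classification and every variable name are assumptions for illustration, not the dataset or code used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in data (an assumption, not the talk's dataset).
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.6, 0.4], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Always predict the most common training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_predictions = baseline.predict(X_test)
baseline_accuracy = baseline.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any real model should beat this score; if it does not, the features or the evaluation setup probably need attention before the model does.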

Slide 9

Slide 9 text

Eyeball imputed results

Slide 10

Slide 10 text

RandomForest

Slide 11

Slide 11 text

RandomForest
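
As a rough illustration of the step from baseline to RandomForest, the sketch below cross-validates both models on the same data. The synthetic dataset, the hyperparameters and the variable names are assumptions; the talk's own settings are not recorded in this transcript.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the talk's dataset).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)

# Score the baseline and the forest with the same 5-fold cross-validation.
dummy_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)

print(f"Dummy:  {dummy_scores.mean():.3f} +/- {dummy_scores.std():.3f}")
print(f"Forest: {forest_scores.mean():.3f} +/- {forest_scores.std():.3f}")
```

Reporting the spread across folds alongside the mean makes it easier to judge whether the gap over the baseline is real.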

Slide 12

Slide 12 text

ConfusionMatrix (YellowBrick)

Slide 13

Slide 13 text

Confusion’s Probabilities
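
One possible reading of this slide is to look at the predicted probabilities inside each cell of the confusion matrix. The sketch below does that with scikit-learn's confusion_matrix and predict_proba rather than YellowBrick, on synthetic data; both choices are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the talk's dataset).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
proba_positive = clf.predict_proba(X_test)[:, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Mean predicted probability of class 1 within each cell of the matrix.
for true_label in (0, 1):
    for pred_label in (0, 1):
        mask = (y_test == true_label) & (y_pred == pred_label)
        if mask.any():
            mean_proba = proba_positive[mask].mean()
            print(f"true={true_label} pred={pred_label} n={mask.sum()} mean_p1={mean_proba:.3f}")
```

Misclassified rows whose probabilities sit close to 0.5 are borderline cases; confidently wrong rows are usually the more interesting ones to inspect.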

Slide 14

Slide 14 text

PermutationImportance (ELI5)
https://github.com/TeamHG-Memex/eli5/issues/256
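
The talk uses ELI5's PermutationImportance. As a stand-in, the sketch below uses scikit-learn's own permutation_importance (available in later scikit-learn releases), which follows the same idea: shuffle one feature at a time on held-out data and measure how much the score drops. The data and feature names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the talk's dataset).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out set and record the score drop.
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)

# Print features from most to least important (names are hypothetical).
for idx in result.importances_mean.argsort()[::-1]:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}: {mean:.3f} +/- {std:.3f}")
```

Measuring on held-out data rather than the training set avoids rewarding features the model has merely memorised.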

Slide 15

Slide 15 text

Worst Errors by Row

Slide 16

Slide 16 text

Worst Errors by Row
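
A simple way to list the worst errors by row, assuming a binary target, is to rank held-out rows by the gap between the true label and the predicted probability. The column names and synthetic data below are hypothetical, and this is a sketch of the idea rather than the notebook shown in the talk.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the talk's dataset).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# One row per held-out example, with the truth and the model's probability.
errors = pd.DataFrame(X_test, columns=feature_names)
errors["y_true"] = y_test
errors["proba_1"] = clf.predict_proba(X_test)[:, 1]

# Absolute error is large when the model was confident and wrong.
errors["abs_error"] = (errors["y_true"] - errors["proba_1"]).abs()
worst = errors.sort_values("abs_error", ascending=False).head(10)
print(worst)
```

Reading the feature values of these rows side by side often reveals a shared cause, such as an imputed or mislabelled value.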

Slide 17

Slide 17 text

Errors by Major Feature

Slide 18

Slide 18 text

TSNE by features

Slide 19

Slide 19 text

TSNE by features
Oddly similar cluster? Conflicted?

Slide 20

Slide 20 text

TSNE by features
Features for this cluster - lots of imputed ages!
We’ve filtered by x, y region on the TSNE plot
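
Filtering by an x, y region of the TSNE plot can be sketched as follows. The data is synthetic, and the region bounds, column names and box size are hypothetical; in practice the bounds would be read off the scatter plot by eye.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

# Synthetic stand-in data (an assumption, not the talk's dataset).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

# Project the feature matrix down to two dimensions.
embedding = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["target"] = y
df["tsne_x"] = embedding[:, 0]
df["tsne_y"] = embedding[:, 1]

# Hypothetical region of interest: a box centred on the first row's position.
centre_x, centre_y = embedding[0]
half_width = 5.0
in_region = (
    df["tsne_x"].between(centre_x - half_width, centre_x + half_width)
    & df["tsne_y"].between(centre_y - half_width, centre_y + half_width)
)
region = df[in_region]

# Compare the selected rows against the whole dataset.
print(f"{len(region)} rows in region")
print(region.describe().loc[["mean", "std"]])
print(df.describe().loc[["mean", "std"]])
```

Comparing the region's summary statistics with those of the full dataset is what surfaces patterns like a cluster dominated by imputed values.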

Slide 21

Slide 21 text

Examine conflicted area
Oddly similar cluster?

Slide 22

Slide 22 text

ELI5 show_prediction

Slide 23

Slide 23 text

ELI5 show_prediction

Slide 24

Slide 24 text

Last mentions
● skopt’s BayesSearchCV perhaps “beats” RandomizedSearchCV & GridSearchCV
● New iteration of this talk for PyDataAmsterdam 2018 in 1 month (with SHAP replacing ELI5 + other tools)
● If you can’t reliably explain why a prediction happens – do you really understand your model?

Slide 25

Slide 25 text

Closing...
● Diagnose your ML just like you debug your code – explain how it works to colleagues
● Do you want training on topics like this?
● Write-up + more: http://ianozsvald.com/
● Questions in exchange for beer :-)
● Learnt something? Please send me a postcard!
● See my longer diagnosis Notebook on GitHub:

Slide 26

Slide 26 text

Appendix
● Ian’s “Machine Learning Libraries You’d Wish You’d Known About” @ PyConUK 2017
● Ian’s “Using Machine Learning to solve a classification problem with scikit-learn” @ PyConUK 2016
● Gael Varoquaux’s tutorial “Understanding and diagnosing your machine-learning models” @ PyDataLondon 2018 http://gael-varoquaux.info/interpreting_ml_tuto/
● Also see Kat Jarmul’s keynote @ PyDataWarsaw 2017: https://blog.kjamistan.com/towards-interpretable-reliable-models
● Michał Łopuszyński @ PyDataWarsaw https://www.slideshare.net/lopusz/debugging-machinelearning