Slide 1

Slide 1 text

Creating Correct Classifiers PyDataAmsterdam 2018 Ian Ozsvald @IanOzsvald ModelInsight.io

Slide 2

Slide 2 text

[email protected] @IanOzsvald[.com] PyDataAmsterdam 2018
Introductions
● I’m an engineering data scientist
● Consulting in AI + Data Science for 15+ years
Blog->IanOzsvald.com

Slide 3

Slide 3 text

NumFOCUS
● Have you thanked a speaker, a volunteer and a NumFOCUS organiser yet? Lots of volunteered time – please say thanks
● Thank contributors too!

Slide 4

Slide 4 text

Goals today
● Get a baseline model
● Visualise errors & diagnose problem areas
● Explain decisions
● Github for examples:

Slide 5

Slide 5 text

pandas_profiling

Slide 6

Slide 6 text

pandas_profiling

Slide 7

Slide 7 text

DummyClassifier

Slide 8

Slide 8 text

DummyClassifier
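
A baseline of the kind named on this slide can be sketched with scikit-learn's DummyClassifier. The synthetic data from make_classification, the "most_frequent" strategy and the 5-fold accuracy scoring are assumptions for illustration, not the dataset or settings used in the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): roughly a 60/40 class balance.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.6, 0.4], random_state=0
)

# Always predict the majority class seen in training.
dummy = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(dummy, X, y, cv=5, scoring="accuracy")

# The baseline accuracy should sit close to the majority-class share.
majority_share = np.bincount(y).max() / len(y)
print(f"baseline accuracy: {baseline_scores.mean():.3f}")
print(f"majority share:    {majority_share:.3f}")
```

Any real model has to beat this number before its score means much.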

Slide 9

Slide 9 text

Eyeball imputed results

Slide 10

Slide 10 text

RandomForest

Slide 11

Slide 11 text

RandomForest
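
One way to compare a RandomForest against the dummy baseline is to score both with the same cross-validation. This is a minimal sketch: the synthetic data, n_estimators=100 and the accuracy metric are assumptions, not the configuration shown in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.6, 0.4], random_state=0,
)

models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score both models on the same 5 folds with the same metric.
mean_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```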

Slide 12

Slide 12 text

ConfusionMatrix (YellowBrick)
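
The slide uses YellowBrick for the plot. The counts behind such a plot can be sketched with plain scikit-learn, which is swapped in here to avoid guessing at the plotting API. The data, the train/test split and the model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and split (assumptions).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"false positives: {fp}, false negatives: {fn}")
```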

Slide 13

Slide 13 text

Confusion’s Probabilities

Slide 14

Slide 14 text

ROC Curve (YellowBrick)

Slide 15

Slide 15 text

ROC Curve (YellowBrick)
LogisticRegression classifier to show a contrast with lower AUC
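
The contrast between two classifiers can be sketched by computing ROC AUC for each on the same held-out data. scikit-learn's metrics are used here in place of the YellowBrick plot, and the data and model settings are assumptions. Which model scores higher depends on the data, so this sketch does not reproduce the slide's result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and split (assumptions).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

aucs = {}
curves = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Probability of the positive class drives the ROC curve.
    proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    curves[name] = (fpr, tpr)
    aucs[name] = roc_auc_score(y_test, proba)
    print(f"{name}: AUC = {aucs[name]:.3f}")
```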

Slide 16

Slide 16 text

Worst Errors by Row

Slide 17

Slide 17 text

Worst Errors by Row
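
One reading of "worst errors by row" is to rank test rows by the gap between the true label and the predicted probability. The sketch below takes that reading. The error definition, the data and the top-10 cutoff are all assumptions rather than the talk's own code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data and split (assumptions).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Error per row: distance between the true label (0 or 1) and P(class 1).
proba = clf.predict_proba(X_test)[:, 1]
abs_error = np.abs(y_test - proba)

# Indices of the 10 rows the model got most confidently wrong, worst first.
worst_rows = np.argsort(abs_error)[::-1][:10]
for row in worst_rows:
    print(f"row {row}: true={y_test[row]}, p(1)={proba[row]:.2f}, error={abs_error[row]:.2f}")
```

Reading the feature values of these rows by eye is often the quickest way to spot a data problem.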

Slide 18

Slide 18 text

Errors by Major Feature

Slide 19

Slide 19 text

TSNE by features

Slide 20

Slide 20 text

TSNE by features
Oddly similar cluster? Conflicted?

Slide 21

Slide 21 text

TSNE by features
Features for this cluster - lots of imputed ages!
We’ve filtered by x, y region on the TSNE plot
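
Filtering by an x, y region of a t-SNE plot can be sketched as a boolean mask over the 2D embedding, which is then used to pull out the original feature rows. The data, the perplexity and the box (centred on the first point, a hypothetical choice) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Project the feature matrix down to 2D for plotting.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Hypothetical region of interest: a box around the first embedded point,
# extending a quarter of each axis range in both directions.
center = embedding[0]
half_width = 0.25 * (embedding.max(axis=0) - embedding.min(axis=0))
in_region = np.all(np.abs(embedding - center) <= half_width, axis=1)

# Pull the original features for the rows that fall inside the box.
region_features = X[in_region]
print(f"{in_region.sum()} of {len(X)} rows fall in the region")
print("region feature means:", region_features.mean(axis=0).round(2))
print("overall feature means:", X.mean(axis=0).round(2))
```

Comparing the region's feature means with the overall means is one quick way to see what makes a cluster distinctive.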

Slide 22

Slide 22 text

Examine conflicted area
Oddly similar cluster?

Slide 23

Slide 23 text

SHAPley explanations

Slide 24

Slide 24 text

SHAPley – model-wide behaviour

Slide 25

Slide 25 text

SHAPley – model-wide behaviour

Slide 26

Slide 26 text

SHAP summary_plot

Slide 27

Slide 27 text

Closing...
● Diagnose your ML just like you debug your code – explain its working to colleagues
● Do you want training on topics like this?
● Write-up + more: http://ianozsvald.com/
● Questions in exchange for beer :-)
● Learnt something? Please send me a postcard!
● See my longer diagnosis Notebook on github:

Slide 28

Slide 28 text

Appendix
● Ian’s “Machine Learning Libraries You’d Wish You’d Known About” @ PyConUK 2017
● Ian’s “Using Machine Learning to solve a classification problem with scikit-learn” @ PyConUK 2016
● Gael Varoquaux’s tutorial “Understanding and diagnosing your machine-learning models” @ PyDataLondon 2018 http://gael-varoquaux.info/interpreting_ml_tuto/
● Also see Kat Jarmul’s keynote @ PyDataWarsaw 2017: https://blog.kjamistan.com/towards-interpretable-reliable-models
● Michał Łopuszyński @ PyDataWarsaw https://www.slideshare.net/lopusz/debugging-machinelearning

Slide 29

Slide 29 text

ROC Curve (YellowBrick)
LogisticRegression classifier to show a contrast with lower AUC