Slide 1

Slide 1 text

Using Machine Learning to solve a classification problem with scikit-learn - a practical walkthrough
London Python 2017-01
Ian Ozsvald @IanOzsvald
ModelInsight.io

Slide 2

Slide 2 text

[email protected] @IanOzsvald London Python 2017-01
Introductions
● I’m an engineering data scientist
● AI/Data Science consulting 15+ years
● Data science team coach – I observe that engineers have the data
Blog->IanOzsvald.com

Slide 3

Slide 3 text

We’ll briefly cover...
● Why? My hypothesis about you
● Two class classification
● A process to build an ML model
● Train/Test and Cross validation
● Debugging the model
● Deployment
Fully worked process, more examples, more graphs (this London Python talk builds on my PyConUK 2016 conference talk)

Slide 4

Slide 4 text

Process
● Exploratory Data Analysis (EDA)
● Build a DummyClassifier model
● Build a RandomForest with several features
● Use cross-validation (Notebook) in favour of Train/Test sets
● Find worst errors and improve
● Stop when ‘good enough’ for your needs

Slide 5

Slide 5 text

Data overview
Nice, fairly tidy data – usually you have to work hard here!

Slide 6

Slide 6 text

Seaborn plots for EDA
Classifier’s best guess is ‘they died’ unless you introduce new information e.g. ‘Sex’
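The numbers behind a plot like this can be sketched with a pandas groupby. This is only an illustration: the column names `Sex` and `Survived` and every value in the table are made up here, not taken from the talk's data.

```python
import pandas as pd

# Hypothetical hand-made rows, invented purely for illustration.
df = pd.DataFrame({
    "Sex": ["male"] * 6 + ["female"] * 4,
    "Survived": [0, 0, 0, 0, 1, 0, 1, 1, 0, 1],
})

# With no extra information, the best single guess is the majority class.
overall_rate = df["Survived"].mean()

# Splitting by Sex exposes a different survival rate per group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()

print("Overall survival rate:", overall_rate)
print(rate_by_sex)
```

When the overall rate is below 0.5 the majority guess is ‘they died’; the per-group rates show why a feature such as ‘Sex’ gives a classifier something to work with.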

Slide 7

Slide 7 text

Seaborn plots

Slide 8

Slide 8 text

Training and Testing
● Features (X) and Target (y)
● Training and test splits of each
● Like lessons and exams
● Clever algs can memorize the answers!
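A minimal sketch of the split, assuming scikit-learn's `train_test_split` and made-up random data (the 25% test size and the seed are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, one binary feature (X) and a binary target (y).
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(20, 1))
y = rng.randint(0, 2, size=20)

# Hold back 25% of the rows as the "exam"; the model only trains on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print("Training rows:", len(X_train), "Test rows:", len(X_test))
```

Scoring only on the held-back rows is what stops a model that has memorized its training answers from looking better than it is.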

Slide 9

Slide 9 text

Simplest sklearn
● Do the dumbest thing first – no ML, just a majority-class guess to make a baseline
● ‘Train’ and predict on test set
Here we ignore is_female; it just makes an appropriately sized input matrix
X: ‘stuff to learn’
y: ‘target to learn’
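A baseline along these lines could look as follows. It is a sketch on synthetic data: the survival rates used to generate `survived` are invented, and `strategy="most_frequent"` is one way to get a majority-class guess from `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the 75% / 20% survival rates are assumptions.
rng = np.random.RandomState(0)
n = 1000
is_female = rng.randint(0, 2, size=n)
survived = (rng.rand(n) < np.where(is_female == 1, 0.75, 0.20)).astype(int)

X = is_female.reshape(-1, 1)  # ignored by the dummy model, it only fixes the shape
y = survived

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 'Train' just records the most frequent class in y_train.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_predictions = baseline.predict(X_test)
baseline_accuracy = baseline.score(X_test, y_test)
print("Baseline accuracy:", baseline_accuracy)
```

Every prediction is the same class, so any real model has to beat this score to be worth keeping.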

Slide 10

Slide 10 text

Random Forests
● Treat as a ‘black box’
● Very powerful and robust
● Doesn’t require scaling
● Handles non-linear responses
● Handles relationships between parameters
● Not (too) fooled if you give many noise features

Slide 11

Slide 11 text

RandomForestClassifier
● Build RF using 1 feature (is_female)
● We outperform a majority guess :-)
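The comparison can be sketched like this on synthetic data. The generated survival rates, the seed and `n_estimators=10` are assumptions for illustration, so the scores will not match the talk's.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the 75% / 20% survival rates are assumptions.
rng = np.random.RandomState(0)
n = 1000
is_female = rng.randint(0, 2, size=n)
survived = (rng.rand(n) < np.where(is_female == 1, 0.75, 0.20)).astype(int)

X = is_female.reshape(-1, 1)  # a single feature column
y = survived

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Majority-class baseline and a small forest trained on the same rows.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
forest_accuracy = forest.score(X_test, y_test)
print("Baseline:", baseline_accuracy, "Random Forest:", forest_accuracy)
```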

Slide 12

Slide 12 text

RandomForestClassifier
● Build RF using 2 features
● No significant improvement...we’ll push on (this is the usual state…)
● General rule – add more features

Slide 13

Slide 13 text

Dealing with NaN/Null
● Sklearn won’t work with NaN values
● You must replace or delete these rows
● RF works “ok” with a sentinel value
Note - sklearn issue 5870 discusses a NaN-friendly way to build trees – go contribute and help this discussion!
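Both options can be sketched with pandas. The column names, the rows and the sentinel value `-1` are assumptions chosen for illustration; any value that cannot occur in the real data would serve as a sentinel.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with missing ages.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 54.0],
    "is_female": [0, 1, 1, 0, 0],
})

# Option 1: delete every row that contains a NaN.
df_dropped = df.dropna()

# Option 2: replace NaN with a sentinel that cannot be a real age.
df_sentinel = df.fillna({"Age": -1})

print(len(df), "rows originally,", len(df_dropped), "after dropping NaN rows")
print(df_sentinel)
```

Dropping rows throws away data, while the sentinel keeps every row and lets a tree split the "missing" group off on its own.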

Slide 14

Slide 14 text

Cross validation
Ref: http://blog.kaggle.com/2015/06/29/scikit-learn-video-7-optimizing-your-model-with-cross-validation/
It is a bit like “taking many exams” rather than just having 1
Sklearn does 3-fold by default (not 5-fold shown here)
3-fold is a sensible starting point
More folds give a better estimate of mean & take longer to run
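A minimal sketch with `cross_val_score` on synthetic data (the generated survival rates are invented). `cv=3` is passed explicitly because the default depends on the scikit-learn version: it was 3-fold at the time of this talk and became 5-fold in version 0.22.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the 75% / 20% survival rates are assumptions.
rng = np.random.RandomState(0)
n = 300
is_female = rng.randint(0, 2, size=n)
survived = (rng.rand(n) < np.where(is_female == 1, 0.75, 0.20)).astype(int)

X = is_female.reshape(-1, 1)
y = survived

clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Each fold takes a turn as the test set, giving one score per fold.
scores = cross_val_score(clf, X, y, cv=3)

print("Fold scores:", scores)
print("Mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

The spread across folds is as informative as the mean: a large standard deviation suggests the single train/test score could have been lucky or unlucky.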

Slide 15

Slide 15 text

RandomForestClassifier
● Build RF using many features
● With bigger RF we may also classify better
● n_estimators only param worth tuning
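One way to try several forest sizes is a plain loop over `n_estimators` scored by cross-validation. This is a sketch under assumptions: the features `pclass` and `age`, the survival probabilities and the candidate sizes are all invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with three hypothetical features.
rng = np.random.RandomState(0)
n = 500
is_female = rng.randint(0, 2, size=n)
pclass = rng.randint(1, 4, size=n)   # values 1, 2 or 3
age = rng.uniform(1, 80, size=n)

# Assumed survival probabilities; age is deliberately left uninformative.
p_survive = 0.2 + 0.5 * is_female + 0.1 * (pclass == 1)
survived = (rng.rand(n) < p_survive).astype(int)

X = np.column_stack([is_female, pclass, age])
y = survived

# Mean 3-fold cross-validation score for each forest size.
mean_scores = {}
for n_estimators in (10, 50, 100):
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    mean_scores[n_estimators] = cross_val_score(clf, X, y, cv=3).mean()

for n_estimators, score in sorted(mean_scores.items()):
    print(n_estimators, "trees -> mean CV score %.3f" % score)
```

Bigger forests take longer to train, so it is worth checking whether the score is still moving before adding more trees.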

Slide 16

Slide 16 text

Debugging
● Confusion matrix – does it look sensible?
● Cross validation scores – are they stable? (Notebook)
● Feature importances
● Find ‘worst’ errors and eyeball (see Notebook)
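Three of these checks can be sketched on synthetic data. The features and survival rates are invented, and ranking ‘worst’ errors by the probability the model gave to the true class is one possible reading of that bullet, not necessarily what the Notebook does.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; 'age' is a hypothetical, uninformative feature.
rng = np.random.RandomState(0)
n = 500
is_female = rng.randint(0, 2, size=n)
age = rng.uniform(1, 80, size=n)
survived = (rng.rand(n) < np.where(is_female == 1, 0.75, 0.20)).astype(int)

feature_names = ["is_female", "age"]
X = np.column_stack([is_female, age])
y = survived

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))

# Feature importances, paired with their names for readability.
importances = dict(zip(feature_names, clf.feature_importances_))

# 'Worst' errors: test rows where the true class got the lowest probability.
proba_of_truth = clf.predict_proba(X_test)[np.arange(len(y_test)), y_test]
worst_rows = np.argsort(proba_of_truth)[:5]

print(cm)
print(importances)
print("Rows to eyeball:", worst_rows)
```

Impurity-based importances can flatter continuous features even when they carry no signal, so treat them as a prompt for questions and not as a ranking to trust blindly.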

Slide 17

Slide 17 text

Deployment
● Pickle your models, reload them
● Ad-hoc scripts → reports or db
● Microservices
● Flask
● My featherweight API on github (built on Flask)
● New Jupyter microservices
● Do please have unit tests & reproducible environments
● Use conda environments in Anaconda
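The pickle round trip can be sketched in memory. The data is synthetic, and `pickle.dumps`/`pickle.loads` stand in for writing to and reading from a file with `pickle.dump`/`pickle.load`.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the 75% / 20% survival rates are assumptions.
rng = np.random.RandomState(0)
n = 200
is_female = rng.randint(0, 2, size=n)
survived = (rng.rand(n) < np.where(is_female == 1, 0.75, 0.20)).astype(int)
X = is_female.reshape(-1, 1)
y = survived

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Serialize the trained model to bytes, then rebuild it from those bytes.
blob = pickle.dumps(clf)
restored = pickle.loads(blob)

# The reloaded model should predict exactly what the original does.
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print("Predictions match after reload:", same_predictions)
```

A pickle is tied to the library versions that created it, which is one practical reason for the reproducible environments bullet above.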

Slide 18

Slide 18 text

Background material
● Sklearn has a lovely forum
● PyDataTV on YouTube for conf videos
● PyDataLondon monthly meetup

Slide 19

Slide 19 text

Closing...
● Random Forest + good data gives you a great start (don’t be seduced by Deep Learning’s hype!)
● Write-up: http://ianozsvald.com/
● Use github repo to try this yourself
● https://github.com/savarin/pyconuk-introtutorial
● Longer great tutorial from PyConUK 2014 (Ezzeri)
● Take an engineering mindset and go slow
● Questions<->beer

Slide 20

Slide 20 text

Community
● Python 3.6+ is the way to go
● How many of you contribute to the open source community?
● Bug fix, document, talk, answer questions, sponsor
● Thank your organisers and do nice things for them

Slide 21

Slide 21 text

PyDataLondon Conf 2017
● May 5-7th at Bloomberg (pydata.org)
● Call for Proposals running until the end of February
● 330 data scientists + engineers