Introduction to Random Forests for Machine Learning at London Python 2017-01

Using Machine Learning to solve a classification problem with scikit-learn - a practical walkthrough
London Python 2017-01
Ian Ozsvald @IanOzsvald ModelInsight.io

[email protected] @IanOzsvald London Python 2017-01

Introductions
• I’m an engineering data scientist
• AI/Data Science consulting 15+ years
• Data science team coach: I observe that engineers have the data
• Blog -> IanOzsvald.com

We’ll briefly cover...
• Why? My hypothesis about you
• Two class classification
• A process to build an ML model
• Train/Test and Cross validation
• Debugging the model
• Deployment
Fully worked process, more examples, more graphs (this London Python talk builds on my PyConUK 2016 conference talk)

Process
• Exploratory Data Analysis (EDA)
• Build a DummyClassifier model
• Build a RandomForest with several features
• Use cross-validation (Notebook) in favour of Train/Test sets
• Find worst errors and improve
• Stop when ‘good enough’ for your needs

Data overview
Nice, fairly tidy data; usually you have to work hard here!

Seaborn plots for EDA
Classifier’s best guess is ‘they died’ unless you introduce new information, e.g. ‘Sex’
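
The point about the classifier's best guess can be checked with a couple of lines of pandas before any plotting. This is only a sketch on a tiny made-up table: the column name `Survived` and the ten rows are assumptions for illustration, not the data used in the talk.

```python
import pandas as pd

# Hypothetical toy table (not the talk's dataset): 6 men, 4 women.
df = pd.DataFrame({
    "Sex": ["male"] * 6 + ["female"] * 4,
    "Survived": [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
})

# With no extra information, the best guess is the majority class.
overall_rate = df["Survived"].mean()

# Splitting by 'Sex' shows the new information changes the best guess.
by_sex = df.groupby("Sex")["Survived"].mean()

print(overall_rate)
print(by_sex)
```

Less than half of this toy table survived, so 'they died' is the best blanket guess, while the per-group rates differ sharply once 'Sex' is taken into account.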

Seaborn plots

Training and Testing
• Features (X) and Target (y)
• Training and test splits of each
• Like lessons and exams
• Clever algorithms can memorize the answers!
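
A minimal sketch of the lessons-and-exams split, assuming scikit-learn's `train_test_split` and a synthetic X and y standing in for the real data (the 25% test size is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 2 features, binary target.
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = (X[:, 0] > 0.5).astype(int)

# Hold back 25% of the rows as the "exam"; the rest are the "lessons".
# stratify=y keeps the class balance similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

The model only ever sees the training rows during fitting, so its score on the test rows shows whether it learned something general or just memorized the answers.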

Simplest sklearn
• Do the dumbest thing first: no ML, just a majority-class guess to make a baseline
• ‘Train’ and predict on test set
Here we ignore is_female; it just makes an appropriately sized input matrix
X is the ‘stuff to learn’, y is the ‘target to learn’
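
The baseline can be sketched with scikit-learn's `DummyClassifier`. The ten labels below are made up for illustration, and for brevity the sketch scores on the same rows it was fitted on rather than on a separate test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy target: 6 of 10 passengers died (0), 4 survived (1).
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Placeholder feature column; the dummy model ignores its values and
# only needs X to have one row per label.
X = np.zeros((len(y), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

predictions = baseline.predict(X)         # always the majority class
baseline_accuracy = baseline.score(X, y)  # fraction of rows in the majority class

print(predictions, baseline_accuracy)
```

Any real model has to beat this number to be worth keeping.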

Random Forests
• Treat as a ‘black box’
• Very powerful and robust
• Doesn’t require scaling
• Handles non-linear responses
• Handles relationships between parameters
• Not (too) fooled if you give many noise features

RandomForestClassifier
• Build RF using 1 feature (is_female)
• We outperform a majority guess :-)
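
A sketch of the one-feature comparison, assuming a made-up table of 60 men and 40 women rather than the talk's data. To keep it short, both models are scored on the rows they were fitted on; the notebook's train/test or cross-validation setup is the better habit.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data, not the talk's dataset:
# 60 men (10 survived) followed by 40 women (30 survived).
is_female = np.array([0] * 60 + [1] * 40)
survived = np.array([1] * 10 + [0] * 50 + [1] * 30 + [0] * 10)

X = is_female.reshape(-1, 1)  # sklearn expects a 2D feature matrix
y = survived

# Majority-class baseline: always predicts 'died'.
dummy_score = DummyClassifier(strategy="most_frequent").fit(X, y).score(X, y)

# Random Forest that only knows is_female.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_score = rf.score(X, y)

print(dummy_score, rf_score)
```

On this toy table the majority guess scores 0.6, and a forest that learns 'women mostly survived, men mostly died' should score higher.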

RandomForestClassifier
• Build RF using 2 features
• No significant improvement... we’ll push on (this is the usual state…)
• General rule: add more features

Dealing with NaN/Null
• Sklearn won’t work with NaN values
• You must replace or delete these rows
• RF works “ok” with a sentinel value
Note: sklearn issue 5870 discusses a NaN-friendly way to build trees. Go contribute and help this discussion!
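
The two options, deleting rows or substituting a sentinel, can be sketched in pandas. The `Age` column and the sentinel value -1 are assumptions for illustration; the sentinel just needs to sit outside the real range of the feature.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing ages.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "is_female": [0, 1, 1, 0],
})

# Option 1: delete the rows that contain NaN.
df_dropped = df.dropna()

# Option 2: replace NaN with a sentinel value outside the real range,
# so a tree can split 'missing' away from genuine ages.
df_filled = df.fillna({"Age": -1})

print(df_dropped)
print(df_filled)
```

Dropping rows throws away the other columns' information for those passengers, which is one reason to try the sentinel first with a tree-based model.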

Cross validation
Ref: http://blog.kaggle.com/2015/06/29/scikit-learn-video-7-optimizing-your-model-with-cross-validation/
• It is a bit like “taking many exams” rather than just having 1
• Sklearn does 3-fold by default (not 5-fold shown here); this was true when the talk was given, and since scikit-learn 0.22 the default is 5-fold
• 3-fold is a sensible starting point
• More folds give a better estimate of the mean and take longer to run
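
A minimal sketch of taking many exams with `cross_val_score`, assuming synthetic data in place of the notebook's features. The fold count is passed explicitly so the result does not depend on the installed version's default.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 150 rows, 3 features, binary target.
rng = np.random.RandomState(0)
X = rng.rand(150, 3)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Five "exams": each fold is held out once while the rest is used to train.
scores = cross_val_score(clf, X, y, cv=5)

print(scores)
print(scores.mean(), scores.std())
```

The spread of the per-fold scores is as useful as the mean: a wide spread suggests the estimate from a single train/test split could have been misleading.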

RandomForestClassifier
• Build RF using many features
• With bigger RF we may also classify better
• n_estimators only param worth tuning
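
One way to try a bigger forest is to cross-validate a few values of `n_estimators` and compare. This sketch assumes synthetic data and two arbitrary forest sizes; whether the larger forest actually scores better depends on the data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 2 informative features plus 3 noise features.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Mean cross-validated accuracy for a small and a larger forest.
results = {}
for n in (10, 100):
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    results[n] = cross_val_score(clf, X, y, cv=5).mean()

for n, score in results.items():
    print(n, round(score, 3))
```

More trees cost more training time, so it is worth stopping once the cross-validated score stops moving.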

Debugging
• Confusion matrix: does it look sensible?
• Cross validation scores: are they stable? (Notebook)
• Feature importances
• Find ‘worst’ errors and eyeball (see Notebook)
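
The debugging checks above can be sketched as follows, assuming synthetic data where only the first feature matters. Ranking the 'worst' rows by the predicted probability of the true class is one reasonable approach, not necessarily the one used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data: column 0 is informative, columns 1 and 2 are noise.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, clf.predict(X_test))

# One importance per feature; they sum to 1.
importances = clf.feature_importances_

# 'Worst' test rows: lowest predicted probability for the true class.
proba_true = clf.predict_proba(X_test)[np.arange(len(y_test)), y_test]
worst = np.argsort(proba_true)[:5]

print(cm)
print(importances)
print(X_test[worst], y_test[worst])
```

Eyeballing the rows picked out by `worst` is often where a missing feature or a data-quality problem shows up.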

Deployment
• Pickle your models, reload them
• Ad-hoc scripts → reports or db
• Microservices
• Flask
• My featherweight API on github (built on Flask)
• New Jupyter microservices
• Do please have unit tests & reproducible environments
• Use conda environments in Anaconda
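
A minimal sketch of the pickle round trip, assuming a small model trained on synthetic data. It serializes to bytes in memory; a deployment would write those bytes to a file and reload them in the serving process, under the same library versions.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Train a small model on synthetic stand-in data.
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = (X[:, 0] > 0.5).astype(int)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Serialize the fitted model, then restore it as a new object.
blob = pickle.dumps(clf)
restored = pickle.loads(blob)

# The restored model should make identical predictions.
same = np.array_equal(clf.predict(X), restored.predict(X))
print(same)
```

A check like the final comparison makes a natural unit test, which ties in with the advice about tests and reproducible environments.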

Background material
• Sklearn has a lovely forum
• PyDataTV on YouTube for conf videos
• PyDataLondon monthly meetup

Closing...
• Random Forest + good data gives you a great start (don’t be seduced by Deep Learning’s hype!)
• Write-up: http://ianozsvald.com/
• Use github repo to try this yourself
• https://github.com/savarin/pyconuk-introtutorial
• Longer great tutorial from PyConUK 2014 (Ezzeri)
• Take an engineering mindset and go slow
• Questions<->beer

Community
• Python 3.6+ is the way to go
• How many of you contribute to the open source community?
• Bug fix, document, talk, answer questions, sponsor
• Thank your organisers and do nice things for them

PyDataLondon Conf 2017
• May 5-7th at Bloomberg (pydata.org)
• Call for Proposals running until end of February
• 330 data scientists + engineers
