Introduction to Random Forests for Machine Learning at London Python 2017-01

ianozsvald
January 27, 2017

Transcript

  1. Using Machine Learning to solve a classification problem with scikit-learn - a practical walkthrough
    London Python 2017-01
    Ian Ozsvald, @IanOzsvald, ModelInsight.io
  2. Introductions
    • I’m an engineering data scientist
    • AI/Data Science consulting 15+ years
    • Data science team coach – I observe that engineers have the data
    Blog->IanOzsvald.com
    [email protected] @IanOzsvald London Python 2017-01
  3. We’ll briefly cover...
    • Why? My hypothesis about you
    • Two class classification
    • A process to build an ML model
    • Train/Test and Cross validation
    • Debugging the model
    • Deployment
    Fully worked process, more examples, more graphs (this London Python talk builds on my PyConUK 2016 conference talk)
  4. Process
    • Exploratory Data Analysis (EDA)
    • Build a DummyClassifier model
    • Build a RandomForest with several features
    • Use cross-validation (Notebook) in favour of Train/Test sets
    • Find worst errors and improve
    • Stop when ‘good enough’ for your needs
  5. Seaborn plots for EDA
    Classifier’s best guess is ‘they died’ unless you introduce new information e.g. ‘Sex’
  6. Training and Testing
    • Features (X) and Target (y)
    • Training and test splits of each
    • Like lessons and exams
    • Clever algs can memorize the answers!
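The split described on this slide can be sketched as follows. This is a minimal illustration on made-up data, not the notebook from the talk; the array sizes, the 25% test fraction and the random seed are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n_rows = 200

# Hypothetical stand-in data: one binary feature (X) and a binary target (y)
X = rng.randint(0, 2, size=(n_rows, 1))
y = rng.randint(0, 2, size=n_rows)

# Hold back a quarter of the rows as the "exam"; the rest are the "lessons"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print("train:", X_train.shape, "test:", X_test.shape)
```

The model only ever sees the training rows while fitting, so a score on the test rows says something about rows it could not have memorized.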
  7. Simplest sklearn
    • Do the dumbest thing first – no ML, just a majority-class guess to make a baseline
    • ‘Train’ and predict on test set
    Here we ignore is_female, it just makes an appropriately sized input matrix
    X: ‘stuff to learn’
    y: ‘target to learn’
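A majority-class baseline of this kind might look like the sketch below. It uses scikit-learn's `DummyClassifier` with `strategy="most_frequent"`; the data is synthetic with made-up survival rates, standing in for the dataset used in the talk.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n_rows = 2000

# Synthetic stand-in for the passenger data (made-up rates, not the real dataset)
is_female = (rng.rand(n_rows) < 0.35).astype(int)
survive_prob = np.where(is_female == 1, 0.75, 0.20)
y = (rng.rand(n_rows) < survive_prob).astype(int)

# The baseline ignores the feature values; X only has to be the right shape
X = is_female.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# "Train" just records the most frequent class in y_train
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

predictions = baseline.predict(X_test)
baseline_score = baseline.score(X_test, y_test)
print("always predicts:", predictions[0], "accuracy:", baseline_score)
```

Any real model has to beat this accuracy to be worth keeping.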
  8. Random Forests
    • Treat as a ‘black box’
    • Very powerful and robust
    • Doesn’t require scaling
    • Handles non-linear responses
    • Handles relationships between parameters
    • Not (too) fooled if you give many noise features
  9. RandomForestClassifier
    • Build RF using 1 feature (is_female)
    • We outperform a majority guess :-)
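As a rough sketch of this comparison, the code below fits a `RandomForestClassifier` on a single `is_female` column and scores it against the majority-class baseline. The data is synthetic with assumed survival rates, so the exact scores will differ from those in the talk.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n_rows = 2000

# Synthetic stand-in data: survival depends strongly on is_female (made-up rates)
is_female = (rng.rand(n_rows) < 0.35).astype(int)
survive_prob = np.where(is_female == 1, 0.75, 0.20)
y = (rng.rand(n_rows) < survive_prob).astype(int)
X = is_female.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline: always guess the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)

# Random Forest using the single feature
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
rf_score = rf.score(X_test, y_test)

print("baseline accuracy:", baseline_score)
print("random forest accuracy:", rf_score)
```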
  10. RandomForestClassifier
    • Build RF using 2 features
    • No significant improvement...we’ll push on (this is the usual state…)
    • General rule – add more features
  11. Dealing with NaN/Null
    • Sklearn won’t work with NaN values
    • You must replace or delete these rows
    • RF works “ok” with a sentinel value
    Note - sklearn issue 5870 discusses a NaN-friendly way to build trees – go contribute and help this discussion!
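The two options on this slide (delete the rows, or replace NaN with a sentinel) can be sketched with pandas as below. The tiny DataFrame, the column names and the sentinel value of -1 are illustrative assumptions, not values taken from the talk's notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data with two missing ages
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 35.0, np.nan, 54.0, 2.0, 27.0],
    "is_female": [0, 0, 1, 1, 0, 0, 1, 1],
    "Survived": [0, 0, 1, 1, 0, 0, 1, 1],
})

# Option 1: delete the rows that contain NaN (loses data)
df_dropped = df.dropna()

# Option 2: replace NaN with a sentinel that cannot be a real age
SENTINEL = -1  # assumed value; pick one outside the feature's valid range
df_filled = df.fillna({"Age": SENTINEL})

features = ["Age", "is_female"]
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(df_filled[features], df_filled["Survived"])
predictions = clf.predict(df_filled[features])

print("rows after dropna:", len(df_dropped))
print("rows after sentinel fill:", len(df_filled))
```

Because trees split on thresholds, an out-of-range sentinel lets the forest separate "missing" rows from real values, which is why this tends to work acceptably for tree models and poorly for linear ones.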
  12. Cross validation
    Ref: http://blog.kaggle.com/2015/06/29/scikit-learn-video-7-optimizing-your-model-with-cross-validation/
    • It is a bit like “taking many exams” rather than just having 1
    • Sklearn does 3-fold by default (not 5-fold shown here)
    • 3-fold is a sensible starting point
    • More folds give a better estimate of mean & take longer to run
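A minimal sketch of k-fold cross validation with `cross_val_score` follows. The data comes from `make_classification` and is purely illustrative. The fold count is passed explicitly because the default has changed since this talk: newer scikit-learn releases default to 5-fold rather than 3-fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for the real features and target
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Each fold takes a turn as the "exam"; the other folds are the "lessons"
scores_3 = cross_val_score(clf, X, y, cv=3)
scores_5 = cross_val_score(clf, X, y, cv=5)

print("3-fold scores:", scores_3, "mean:", scores_3.mean())
print("5-fold scores:", scores_5, "mean:", scores_5.mean())
```

Reporting the mean together with the spread across folds gives a feel for how stable the estimate is.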
  13. RandomForestClassifier
    • Build RF using many features
    • With bigger RF we may also classify better
    • n_estimators only param worth tuning
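One simple way to explore `n_estimators` is to cross-validate a few forest sizes and compare the mean scores, as sketched here. The candidate sizes and the synthetic data are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with several features, some informative and some noise
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

mean_scores = {}
for n_estimators in [10, 50, 200]:  # assumed candidate sizes
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    scores = cross_val_score(clf, X, y, cv=3)
    mean_scores[n_estimators] = scores.mean()
    print(n_estimators, "trees -> mean accuracy", mean_scores[n_estimators])
```

Larger forests cost more time to train and predict, so it is reasonable to stop increasing the size once the mean score stops moving.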
  14. Debugging
    • Confusion matrix – does it look sensible?
    • Cross validation scores – are they stable? (Notebook)
    • Feature importances
    • Find ‘worst’ errors and eyeball (see Notebook)
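The debugging checks listed here might be sketched as below, on synthetic data. Ranking "worst" errors by the predicted probability of the true class is one possible interpretation, not necessarily the approach used in the talk's notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)

# Feature importances: one value per column of X
importances = clf.feature_importances_
print(importances)

# "Worst" errors (assumed definition): test rows where the model gave the
# lowest probability to the class that turned out to be correct
proba = clf.predict_proba(X_test)
prob_of_true_class = proba[np.arange(len(y_test)), y_test]
worst = np.argsort(prob_of_true_class)[:5]
print("rows to eyeball:", worst)
```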
  15. Deployment
    • Pickle your models, reload them
    • Ad-hoc scripts → reports or db
    • Microservices
    • Flask
    • My featherweight API on github (built on Flask)
    • New Jupyter microservices
    • Do please have unit tests & reproducible environments
    • Use conda environments in Anaconda
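Pickling and reloading a fitted model can be sketched as follows. The file name `model.pkl` and the toy data are assumptions. A pickle should be loaded with the same library versions that wrote it, and only from a trusted source, which is one reason reproducible environments matter.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = (X[:, 0] > 0.5).astype(int)  # toy target derived from the first column

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name

    # Save the fitted model to disk
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)

    # Reload it, as a deployed script or service would
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print("restored model agrees with original:", same_predictions)
```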
  16. Background material
    • Sklearn has lovely forum
    • PyDataTV on YouTube for conf videos
    • PyDataLondon monthly meetup
  17. Closing...
    • Random Forest + good data gives you a great start (don’t be seduced by Deep Learning’s hype!)
    • Write-up: http://ianozsvald.com/
    • Use github repo to try this yourself
    • https://github.com/savarin/pyconuk-introtutorial
    • Longer great tutorial from PyConUK 2014 (Ezzeri)
    • Take an engineering mindset and go slow
    • Questions<->beer
  18. Community
    • Python 3.6+ is the way to go
    • How many of you contribute to the open source community?
    • Bug fix, document, talk, answer questions, sponsor
    • Thank your organisers and do nice things for them
  19. PyDataLondon Conf 2017
    • May 5-7th at Bloomberg (pydata.org)
    • Call for Proposals running until the end of February
    • 330 data scientists+engineers