ianozsvald
January 27, 2017

Introduction to Random Forests for Machine Learning at London Python 2017-01


Transcript

  1. Using Machine Learning to solve a classification problem with scikit-learn: a practical walkthrough

    London Python 2017-01, Ian Ozsvald, @IanOzsvald, ModelInsight.io
  2. Introductions

    [email protected] @IanOzsvald London Python 2017-01
    • I’m an engineering data scientist
    • AI/Data Science consulting 15+ years
    • Data science team coach – I observe that engineers have the data
    Blog: IanOzsvald.com
  3. We’ll briefly cover...

    • Why? My hypothesis about you
    • Two class classification
    • A process to build an ML model
    • Train/Test and Cross validation
    • Debugging the model
    • Deployment
    Fully worked process, more examples, more graphs (this London Python talk builds on my PyConUK 2016 conference talk)
  4. Process

    • Exploratory Data Analysis (EDA)
    • Build a DummyClassifier model
    • Build a RandomForest with several features
    • Use cross-validation (Notebook) in favour of Train/Test sets
    • Find worst errors and improve
    • Stop when ‘good enough’ for your needs
  5. Seaborn plots for EDA

    Classifier’s best guess is ‘they died’ unless you introduce new information, e.g. ‘Sex’
  6. Training and Testing

    • Features (X) and Target (y)
    • Training and test splits of each
    • Like lessons and exams
    • Clever algs can memorize the answers!
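The split into features, target, training set and test set can be sketched roughly as follows. This is not the code from the talk's notebook: the data is synthetic, and the survival rates (75% for women, 20% for men) and the 25% test share are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n_rows = 200

# Hypothetical stand-in for the Titanic data: one binary feature, one binary target.
is_female = rng.randint(0, 2, size=n_rows)
# Assumed survival rates, for illustration only.
survived = (rng.rand(n_rows) < np.where(is_female == 1, 0.75, 0.20)).astype(int)

X = is_female.reshape(-1, 1)  # features: 2D, one row per passenger
y = survived                  # target: 1D, one label per passenger

# Hold back 25% of the rows as the "exam"; the model only ever sees the "lessons".
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```

Scoring only on `X_test` and `y_test` is what exposes a model that has memorized its training rows.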
  7. Simplest sklearn

    • Do the dumbest thing first – no ML, just a majority-class guess to make a baseline
    • ‘Train’ and predict on test set
    Here we ignore is_female; it just makes an appropriately sized input matrix
    X: ‘stuff to learn’
    y: ‘target to learn’
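A majority-class baseline of this kind might look like the sketch below. The target is a made-up repeating pattern in which 40% of passengers survive, and the feature matrix is all zeros, since a majority guess never looks at it; neither comes from the talk.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical target: a repeating pattern in which 40% of passengers survive.
y = np.array([0, 0, 0, 1, 1] * 40)
# The feature values are irrelevant to a majority-class guess; X only needs
# one row per label.
X = np.zeros((len(y), 1))

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)  # "training" only records the majority class
baseline_accuracy = baseline.score(X_test, y_test)
print(baseline_accuracy)
```

Any later model has to beat `baseline_accuracy` to justify its extra complexity.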
  8. Random Forests

    • Treat as a ‘black box’
    • Very powerful and robust
    • Doesn’t require scaling
    • Handles non-linear responses
    • Handles relationships between parameters
    • Not (too) fooled if you give many noise features
  9. RandomForestClassifier

    • Build RF using 1 feature (is_female)
    • We outperform a majority guess :-)
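As an illustration of a one-feature forest beating the majority guess, here is a minimal sketch on made-up data. The repeating pattern (3 of 4 women survive, 1 of 6 men survive) is an assumption, not the Titanic figures, and `n_estimators=100` is set explicitly rather than taken from the talk.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

# Hypothetical repeating pattern: 3 of 4 women survive, 1 of 6 men survive.
is_female = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0] * 20)
survived = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1] * 20)

X = is_female.reshape(-1, 1)
y = survived
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
forest_accuracy = forest.score(X_test, y_test)
print(baseline_accuracy, forest_accuracy)
```

With a single binary feature the forest can only learn "predict survived for women, died for men", which is already enough to beat the baseline on this data.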
  10. RandomForestClassifier

    • Build RF using 2 features
    • No significant improvement...we’ll push on (this is the usual state…)
    • General rule – add more features
  11. Dealing with NaN/Null

    • Sklearn won’t work with NaN values
    • You must replace or delete these rows
    • RF works “ok” with a sentinel value
    Note: sklearn issue 5870 discusses a NaN-friendly way to build trees – go contribute and help this discussion!
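The two options on this slide, replacing NaN with a sentinel or deleting the affected rows, could be sketched like this. The tiny feature matrix and the sentinel value of -1 are assumptions for illustration; any value that cannot occur in the real data would serve.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature matrix: columns are is_female and age; some ages are missing.
X = np.array([
    [1, 29.0],
    [0, np.nan],
    [1, 4.0],
    [0, 40.0],
    [1, np.nan],
    [0, 62.0],
])
y = np.array([1, 0, 1, 0, 1, 0])

SENTINEL = -1.0  # assumed sentinel: a value no real age can take

# Option 1: replace NaN with the sentinel so the trees can split it off.
X_filled = np.where(np.isnan(X), SENTINEL, X)

# Option 2: drop any row that contains a NaN.
complete_rows = ~np.isnan(X).any(axis=1)
X_dropped, y_dropped = X[complete_rows], y[complete_rows]

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_filled, y)
print(X_filled[:, 1], X_dropped.shape)
```

A sentinel keeps every row, at the cost of asking the trees to learn that -1 means "unknown"; dropping rows is simpler but discards data.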
  12. Cross validation

    Ref: http://blog.kaggle.com/2015/06/29/scikit-learn-video-7-optimizing-your-model-with-cross-validation/
    It is a bit like “taking many exams” rather than just having 1
    Sklearn does 3-fold by default (not 5-fold shown here)
    3-fold is a sensible starting point
    More folds give a better estimate of mean & take longer to run
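A cross-validated score could be computed along these lines. The data is synthetic and the feature names are placeholders. The 3-fold default on the slide reflects scikit-learn at the time of the talk; releases from 0.22 onward default to 5 folds, so the sketch sets `cv` explicitly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_rows = 300

# Hypothetical features and target; the survival rates are assumed.
is_female = rng.randint(0, 2, size=n_rows)
age = rng.uniform(1, 80, size=n_rows)
survived = (rng.rand(n_rows) < np.where(is_female == 1, 0.75, 0.20)).astype(int)

X = np.column_stack([is_female, age])
y = survived

forest = RandomForestClassifier(n_estimators=100, random_state=0)
# Each fold takes a turn as the held-out "exam"; we get one score per fold.
scores = cross_val_score(forest, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

The spread of `scores` is as informative as the mean: widely varying folds suggest the estimate is not yet stable.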
  13. RandomForestClassifier

    • Build RF using many features
    • With bigger RF we may also classify better
    • n_estimators only param worth tuning
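One way to explore `n_estimators` is to cross-validate a few forest sizes and compare the mean scores, as in this sketch. The four features and the candidate sizes (10, 50, 200) are illustrative assumptions, not values from the talk.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_rows = 300

# Hypothetical features; only is_female carries signal in this made-up data.
is_female = rng.randint(0, 2, size=n_rows)
age = rng.uniform(1, 80, size=n_rows)
fare = rng.uniform(5, 100, size=n_rows)
pclass = rng.randint(1, 4, size=n_rows)
survived = (rng.rand(n_rows) < np.where(is_female == 1, 0.75, 0.20)).astype(int)

X = np.column_stack([is_female, age, fare, pclass])
y = survived

mean_scores = {}
for n_estimators in (10, 50, 200):
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    mean_scores[n_estimators] = cross_val_score(forest, X, y, cv=3).mean()

for n_estimators, score in mean_scores.items():
    print(n_estimators, round(score, 3))
```

Larger forests take longer to train, so a reasonable stopping point is where the mean score stops improving.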
  14. Debugging

    • Confusion matrix – does it look sensible?
    • Cross validation scores – are they stable? (Notebook)
    • Feature importances
    • Find ‘worst’ errors and eyeball (see Notebook)
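The debugging checks listed here might be sketched as follows. The data is synthetic, and ranking errors by the probability the model gave to the true class is one possible reading of "worst errors", not necessarily the notebook's method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
n_rows = 300
feature_names = ["is_female", "age"]  # hypothetical feature names

is_female = rng.randint(0, 2, size=n_rows)
age = rng.uniform(1, 80, size=n_rows)
y = (rng.rand(n_rows) < np.where(is_female == 1, 0.75, 0.20)).astype(int)
X = np.column_stack([is_female, age])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, forest.predict(X_test), labels=[0, 1])

# Impurity-based importances; they sum to 1 across the features.
importances = dict(zip(feature_names, forest.feature_importances_))

# "Worst" errors: test rows where the model gave the true class the least probability.
proba = forest.predict_proba(X_test)
p_true = proba[np.arange(len(y_test)), y_test]
worst = np.argsort(p_true)[:5]

print(cm)
print(importances)
print(X_test[worst], y_test[worst], p_true[worst])
```

Eyeballing the rows in `worst` often suggests a missing feature or a data-quality problem.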
  15. Deployment

    • Pickle your models, reload them
    • Ad-hoc scripts → reports or db
    • Microservices
    • Flask
    • My featherweight API on github (built on Flask)
    • New Jupyter microservices
    • Do please have unit tests & reproducible environments
    • Use conda environments in Anaconda
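Pickling and reloading a trained model can be sketched with the standard library alone. The toy data is made up, and the round trip goes through an in-memory byte string; a real deployment would more likely write to a file with `pickle.dump` and read it back with `pickle.load`.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: is_female as the only feature.
X = np.array([[1], [1], [1], [0], [0], [0], [0], [1]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Serialize the fitted model to bytes, then restore it as a new object.
blob = pickle.dumps(forest)
restored = pickle.loads(blob)

# The restored model should reproduce the original's predictions exactly.
same_predictions = np.array_equal(forest.predict(X), restored.predict(X))
print(same_predictions)
```

Pickles should only be loaded from trusted sources and with the same scikit-learn version that wrote them, which is one reason reproducible environments matter here.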
  16. Background material

    • Sklearn has lovely forum
    • PyDataTV on YouTube for conf videos
    • PyDataLondon monthly meetup
  17. Closing...

    • Random Forest + good data gives you a great start (don’t be seduced by Deep Learning’s hype!)
    • Write-up: http://ianozsvald.com/
    • Use github repo to try this yourself
    • https://github.com/savarin/pyconuk-introtutorial
    • Longer great tutorial from PyConUK 2014 (Ezzeri)
    • Take an engineering mindset and go slow
    • Questions<->beer
  18. Community

    • Python 3.6+ is the way to go
    • How many of you contribute to the open source community?
    • Bug fix, document, talk, answer questions, sponsor
    • Thank your organisers and do nice things for them
  19. PyDataLondon Conf 2017

    • May 5-7th at Bloomberg (pydata.org)
    • Call for Proposals running until end of Feb
    • 330 data scientists + engineers