Test-Driven Machine Learning with Scikit-learn @ PyCon Sei (2015)

TEST-DRIVEN MACHINE LEARNING
TDD MACHINE LEARNING
[email protected] @leriomaggio +ValerioMaggio

PLEASE ANSWER THREE QUESTIONS
• Do you already know what Machine Learning is?
• Do you already know, use, or have you heard about TDD?
• Have you ever used Scikit-Learn?

Machine Learning Tools

ML AT A GLANCE

WHAT IS MACHINE LEARNING?
"Machine learning teaches machines how to carry out tasks by themselves. It is that simple. The complexity comes with the details."
W. Richert & L.P. Coelho, 2013, Building Machine Learning Systems with Python

MACHINE LEARNING & DATA ANALYSIS
ML IS ABOUT MAKING PREDICTIONS

THE ESSENCE OF MACHINE LEARNING
• A pattern exists
• We cannot pin it down mathematically
• We have data on it
Learning by Examples

EXAMPLE(2): CLASSIFICATION

EXAMPLE(3): CLUSTERING

@jakevdp

WHAT ABOUT PYTHON?

PYTHON and DATA SCIENCE (DATA SCIENCE IN PYTHON)
• Python: the language of choice for Data Science
• Displacing R
experfy.com/blog/python-data-science/

PYTHON & SCIENCE

ONE PYTHON to rule them all
"One of the biggest benefits of doing data science in Python is added efficiency of using one programming language across different applications."
T. Yarkoni, Univ. Texas
http://goo.gl/17XA5J

PYTHON vs MATLAB/R

MACHINE LEARNING IN PYTHON

ML PYTHON POWERED (MACHINE LEARNING PYTHON)
github.com/kevincobain2000/awesome-machine-learning

MACHINE LEARNING PYTHON
Scala early stage + C++ PyWrap + Python Powered

SO WHY CHOOSE SCIKIT-LEARN?

WHY SCIKIT-LEARN

pip install numpy
pip install scipy
pip install ipython
pip install scikit-learn
pip install matplotlib
https://store.continuum.io/cshop/anaconda/

DESIGN PHILOSOPHY (SCIKIT-LEARN)
• Includes all the batteries necessary for (general purpose) Machine Learning code:
• Data (and datasets)
• Feature selection and extraction algorithms
• ML algorithms (classification, regression, clustering, ...)
• Evaluation functions (cross-validation, confusion matrix)
Algorithm Selection Philosophy: try to keep the core as light as possible, including only well-known and widely used ML methods
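The "batteries" listed above can be seen together in a few lines. This is a minimal sketch, not taken from the slides: the choice of `SelectKBest`, a linear `SVC`, and the split sizes are assumptions made for illustration. It also uses the `sklearn.model_selection` module of recent releases (releases from around 2015 exposed the same helpers under `sklearn.cross_validation`).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Data (and datasets): a small dataset bundled with the library.
iris = load_iris()
X, y = iris.data, iris.target

# Feature selection and an ML algorithm chained into one estimator.
# k=2 and the linear kernel are illustrative choices, not recommendations.
model = make_pipeline(SelectKBest(f_classif, k=2), SVC(kernel="linear"))

# Evaluation, part 1: 5-fold cross-validation over the whole dataset.
scores = cross_val_score(model, X, y, cv=5)

# Evaluation, part 2: confusion matrix on a held-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model.fit(X_train, y_train)
cm = confusion_matrix(y_test, model.predict(X_test))

print("mean CV accuracy:", scores.mean())
print(cm)
```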

SCIKIT AT A GLANCE

4. TDD Mantra

THINK STEP (TDD MANTRA)
Think: think about what we want the code to do

THINK STEP (TDD MANTRA)
Think: set up a Walking Skeleton

RED BAR (TDD MANTRA)
Think → Red bar
FAIL: testFoo (__main__.FooTests)
------------------------------------
Traceback (most recent call last):
self.failUnless(False)
AssertionError
------------------------------------
Ran 1 test in 0.003s
FAILED (failures=1)

THINK STEP (TDD EXAMPLE)
Think: we want to create objects that can say whether two given dates "match".

THINK STEP (TDD EXAMPLE)
Think

RED BAR (TDD MANTRA)
Think → Red bar
===================================
ERROR: testMatches
-----------------------------------
Traceback (most recent call last):
line 8, in testMatches
p = DatePattern(2004, 9, 28)
NameError: global name 'DatePattern' is not defined
-----------------------------------
Ran 1 test in 0.000s
FAILED (errors=1)

GREEN BAR (TDD MANTRA)
Think → Red bar → Green bar
Write just enough code to pass the test

GREEN BAR (TDD MANTRA)
Think → Red bar → Green bar
==========================
--------------------------
Ran 1 test in 0.000s
OK
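To reach the green bar, the smallest useful `DatePattern` only has to satisfy the single test. The sketch below is one hypothetical way to do it; the `matches` method name and the attribute layout are assumptions, not code shown in the slides.

```python
import datetime
import unittest


class DatePattern:
    """Matches a date against a fixed year, month, and day (illustrative)."""

    def __init__(self, year, month, day):
        self.year = year
        self.month = month
        self.day = day

    def matches(self, date):
        # Just enough logic to make the current test pass.
        return (date.year, date.month, date.day) == (
            self.year,
            self.month,
            self.day,
        )


class DatePatternTests(unittest.TestCase):
    def testMatches(self):
        p = DatePattern(2004, 9, 28)
        self.assertTrue(p.matches(datetime.date(2004, 9, 28)))


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DatePatternTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

With the bar green, the refactoring step can change the implementation freely as long as `run_tests()` keeps succeeding.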

TDD MANTRA
Think → Red bar → Green bar → Refactoring

FRAMEWORKS (PYTHON TESTING)
• unittest
• nose
• py.test
• numpy.testing
• assert_array_equal, assert_array_almost_equal
https://wiki.python.org/moin/PythonTestingToolsTaxonomy
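The array assertions matter because `==` on NumPy arrays is element-wise and floating-point results rarely match exactly. A minimal sketch using `numpy.testing` follows; the `normalise` helper is hypothetical and exists only to have something to test.

```python
import numpy as np
from numpy.testing import assert_array_almost_equal, assert_array_equal


def normalise(v):
    """Scale a vector to unit Euclidean length (hypothetical helper)."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def test_normalise():
    result = normalise([3, 4])
    # Floating-point output: compare up to a given number of decimals.
    assert_array_almost_equal(result, [0.6, 0.8], decimal=6)


def test_argsort():
    # Integer output: exact element-wise comparison is appropriate.
    assert_array_equal(np.argsort([3, 1, 2]), [1, 2, 0])
```

Both functions follow the `test_*` naming convention, so runners such as nose or py.test would be expected to collect them without extra configuration.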

4. TDD Mantra

SO WHAT ABOUT TDD AND MACHINE LEARNING?

PROS: Great reference on the topic
CONS: Examples in Ruby

TDD & SCIENTIFIC METHOD
• TDD is (basically) about: 1. Hypothesising; 2. Testing; 3. Theorising
• The Scientific Method follows a trial-and-error approach: 1. Come up with a hypothesis; 2. Test the hypothesis; 3. Create a theory
• TDD also:
• Makes logical propositions of validity
• Documents progress (i.e., Write the Paper)
• Works in feedback loops

LOGICAL PROPOSITIONS OF VALIDITY

RISKS WITH MACHINE LEARNING (a.k.a. What to test?)
• Unstable data: programming faults (despite outlier reduction)
• Underfitting: the learning function does not take into account enough information to accurately model the phenomenon
• Overfitting: the learning function does not generalise enough to properly model the phenomenon
• Unpredictable future: we don't actually know if our model is working or not (run-time checking)
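Two of these risks translate fairly directly into assertions: underfitting shows up as a low training score, overfitting as a large gap between training and held-out scores. The sketch below is one possible way to express that, assuming the Iris data and a linear `SVC`; the function names and the thresholds (`0.9` and `0.15`) are illustrative assumptions, not values from the talk.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def fit_and_score(random_state=0):
    """Train on one split and return (train accuracy, test accuracy)."""
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state, stratify=y
    )
    clf = SVC(kernel="linear").fit(X_train, y_train)
    return clf.score(X_train, y_train), clf.score(X_test, y_test)


def test_no_underfitting(min_train_accuracy=0.9):
    # A model that cannot even fit its training data is underfitting.
    train_acc, _ = fit_and_score()
    assert train_acc >= min_train_accuracy


def test_no_overfitting(max_gap=0.15):
    # A large train/test gap suggests the model does not generalise.
    train_acc, test_acc = fit_and_score()
    assert train_acc - test_acc <= max_gap
```

Fixing `random_state` keeps the tests deterministic. The "unpredictable future" risk is not covered by tests like these and would need monitoring at run time.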

UNDER VS OVER FITTING

HOW TO COPE WITH RISKS? TESTING SOLUTIONS

RUNNING EXAMPLE: IRIS Dataset & SVM
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/tests/test_svm.py
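A test in the spirit of the linked module might look like the following. It is a minimal sketch written for this transcript, not a copy of the scikit-learn test suite; the kernels tried and the `0.9` accuracy threshold are assumptions.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()


def test_svc_iris():
    for kernel in ("linear", "rbf"):
        clf = svm.SVC(kernel=kernel).fit(iris.data, iris.target)
        pred = clf.predict(iris.data)
        # Sanity check: the classifier should fit the training data well.
        assert np.mean(pred == iris.target) > 0.9
        # The three Iris classes should all be known to the model.
        assert list(clf.classes_) == [0, 1, 2]
```

Checking accuracy on the training data only guards against a broken classifier; it says nothing about generalisation, which needs a held-out split or cross-validation.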

IRIS DATASET

DISCRIMINATIVE FEATURES

THANKS FOR YOUR KIND ATTENTION [email protected] @leriomaggio +ValerioMaggio
