Test-Driven Machine Learning with Scikit-learn @ PyCon Sei (2015)

Machine learning is an amazing research and application field that combines math skills with coding abilities to build programs that are able to learn from data.

Therefore, once we have defined our own (mathematical) model, machine learning is about writing code, sometimes a lot of it, to actually make the model work.

However, one point that is usually underestimated or omitted when dealing with machine learning algorithms is how to write good-quality code.

Test-driven development (TDD) is one of the most popular agile methods, specifically designed to help developers produce (potentially) less buggy code by writing tests before the actual code under test.

Applying test-first programming principles to the implementation of Naive Bayes classifiers or neural networks looks like a daunting challenge. Yet the test-code-refactor cycle is grounded in the scientific method: make a proposition of validity, share results, work in feedback loops. Moreover, tackling problems this way also leads to a better understanding of how the whole learning model works under the hood.
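As a minimal sketch of the test-first cycle described above, the snippet below states the expected behaviour in a test before the code under test is written. `MajorityClassifier` and its `fit`/`predict` methods are hypothetical names chosen for illustration (they mimic the scikit-learn estimator convention but are not part of the library), and the learner is a deliberately trivial stand-in, far simpler than Naive Bayes.

```python
import unittest
from collections import Counter


class MajorityClassifierTest(unittest.TestCase):
    """Written first: states what the (hypothetical) learner must do."""

    def test_predicts_most_frequent_training_label(self):
        clf = MajorityClassifier().fit([[0], [1], [2]], ["spam", "ham", "spam"])
        self.assertEqual(clf.predict([[5], [6]]), ["spam", "spam"])

    def test_predict_before_fit_raises(self):
        with self.assertRaises(ValueError):
            MajorityClassifier().predict([[0]])


class MajorityClassifier:
    """Written second: just enough code to make the tests pass."""

    def __init__(self):
        self.label_ = None

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        if self.label_ is None:
            raise ValueError("call fit() before predict()")
        # Every sample receives the same (majority) label.
        return [self.label_ for _ in X]


def run_tests():
    """Run the suite programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MajorityClassifierTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

Running the tests while the `MajorityClassifier` class is still missing fails with a `NameError`; adding the class makes them pass, after which the code can be refactored with the tests as a safety net.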

In this talk, examples of Test-Driven implementations of some of the most famous machine learning algorithms will be presented using scikit-learn.

The talk is intended for an intermediate audience. The content is meant to be mostly practical and code oriented, so good proficiency with the Python language is required. However, no prior knowledge of TDD or machine learning algorithms is necessary to attend this talk.

Valerio Maggio

April 18, 2015

Transcript

  1. TEST-DRIVEN MACHINE LEARNING (TDD MACHINE LEARNING) [email protected] @leriomaggio +ValerioMaggio
  2. PLEASE ANSWER TO FIVE QUESTIONS THREE QUESTIONS • Do you already know what Machine Learning is? • Do you already know/use/hear about TDD? • Have you ever used Scikit-Learn?
  3. WHAT IS MACHINE LEARNING? "Machine learning teaches machines how to carry out tasks by themselves. It is that simple. The complexity comes with the details." (W. Richert & L.P. Coelho, 2013, Building Machine Learning Systems with Python)
  4. THE ESSENCE OF MACHINE LEARNING • A pattern exists • We cannot pin it down mathematically • We have data on it. Learning by Examples
  5. PYTHON and DATA SCIENCE (DATA SCIENCE IN PYTHON) • Python: The language of choice for Data Science • Displacing R. experfy.com/blog/python-data-science/
  6. ONE PYTHON to rule them all (DATA SCIENCE PYTHON) http://goo.gl/17XA5J "One of the biggest benefits of doing data science in Python is added efficiency of using one programming language across different applications." (T. Yarkoni, Univ. Texas)
  7. ML PYTHON POWERED (MACHINE LEARNING PYTHON) github.com/kevincobain2000/awesome-machine-learning
  8. MACHINE LEARNING PYTHON • Scala early stage + C++ PyWrap + Python Powered
  9. pip install numpy + pip install scipy + pip install ipython + pip install scikit-learn + pip install matplotlib https://store.continuum.io/cshop/anaconda/
  10. DESIGN PHILOSOPHY (SCIKIT LEARN) • Includes all the batteries necessary for (general purpose) Machine Learning code • Data (and Datasets) • Feature Selection, Extraction algorithms • ML Algorithms (Classification, Regression, Clustering, ...) • Evaluation functions (Cross Validation, Confusion Matrix). Algorithm Selection Philosophy: Try to keep the core as light as possible, including well-known and largely used ML methods
  11. THINK STEP (TDD MANTRA): Think about what we want the code to do. Diagram: Think
  12. THINK STEP (TDD MANTRA): Set up a Walking Skeleton. Diagram: Think
  13. RED BAR (TDD MANTRA). Diagram: Think, Red bar
      FAIL: testFoo (__main__.FooTests)
      ------------------------------------
      Traceback (most recent call last):
        self.failUnless(False)
      AssertionError
      ------------------------------------
      1 test in 0.003s
      FAILED (failures=1)
  14. THINK STEP (TDD EXAMPLE): We want to create objects that can say whether two given dates "match". Diagram: Think
  15. RED BAR (TDD MANTRA). Diagram: Think, Red bar
      ===================================
      ERROR: testMatches
      -----------------------------------
      Traceback (most recent call last):
        line 8, in testMatches
          p = DatePattern(2004, 9, 28)
      NameError: global name 'DatePattern' is not defined
      -----------------------------------
      Ran 1 test in 0.000s
      FAILED (errors=1)
  16. GREEN BAR (TDD MANTRA): Write the code just to pass the test. Diagram: Think, Red bar, Green bar
  17. GREEN BAR (TDD MANTRA). Diagram: Think, Red bar, Green bar
      ==========================
      --------------------------
      Ran 1 test in 0.000s
      OK
  18. TDD MANTRA. Diagram: Think, Red bar, Green bar, Refactoring
  19. FRAMEWORKS (PYTHON TESTING) • unittest • nose • py.test • nose.testing • assert_array_equal, assert_almost_array_equal. https://wiki.python.org/moin/PythonTestingToolsTaxonomy
  20. TDD & SCIENTIFIC METHOD (TDD MACHINE LEARNING) • TDD is (basically) about: 1. Hypothesising; 2. Testing; 3. Theorising • The Scientific Method goes through a trial-and-error approach: 1. Come up with a hypothesis; 2. Test the hypothesis; 3. Create a theory • TDD also: makes logical propositions of validity; documents progress (i.e., Write the Paper); works in feedback loops
  21. RISKS WITH MACHINE LEARNING, a.k.a. What to test? (TDD MACHINE LEARNING) • Unstable data: programming fault (despite outlier reduction) • Underfitting: the learning function does not take into account enough information to accurately model the phenomenon • Overfitting: the learning function does not generalise enough to properly model the phenomenon • Unpredictable future: we don’t actually know if our model is working or not! (running time checking)
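
One way to turn some of the risks listed above into executable checks is sketched below with scikit-learn's bundled iris data. The helper name `fit_and_score`, the choice of `GaussianNB`, and the thresholds (`0.80` minimum accuracy, `0.15` maximum train/test gap) are illustrative assumptions, not values taken from the talk.

```python
import unittest

from numpy.testing import assert_array_equal
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def fit_and_score(estimator, random_state=0):
    """Fit on a fixed split of iris.

    Returns (train accuracy, test accuracy, test predictions).
    """
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state, stratify=y
    )
    estimator.fit(X_train, y_train)
    return (
        estimator.score(X_train, y_train),
        estimator.score(X_test, y_test),
        estimator.predict(X_test),
    )


class ModelRiskTests(unittest.TestCase):
    def test_not_underfitting(self):
        # The model should clearly beat a baseline that ignores the features.
        _, baseline_acc, _ = fit_and_score(DummyClassifier(strategy="most_frequent"))
        _, test_acc, _ = fit_and_score(GaussianNB())
        self.assertGreater(test_acc, baseline_acc)
        self.assertGreaterEqual(test_acc, 0.80)  # assumed threshold

    def test_not_overfitting(self):
        # A large gap between train and test accuracy hints at poor generalisation.
        train_acc, test_acc, _ = fit_and_score(GaussianNB())
        self.assertLess(train_acc - test_acc, 0.15)  # assumed threshold

    def test_reproducible_on_same_split(self):
        # Same data and same split must give identical predictions;
        # a difference would point to a programming fault.
        _, _, first = fit_and_score(GaussianNB())
        _, _, second = fit_and_score(GaussianNB())
        assert_array_equal(first, second)


if __name__ == "__main__":
    unittest.main()
```

Checks like these do not address the "unpredictable future" risk: they only say how the model behaves on data available today, so the thresholds need revisiting as new data arrives.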