
The Case for Testing in Data Science

Chris
August 14, 2019

Data Scientists' work often focuses more on research and exploration than on the development side of building software. However, in order to demonstrate value from the analytics we do, we have to be able to deliver reliably. This often takes the form of a reproducible research report/project, or a built prototype that developers and stakeholders can experiment with further. Testing is a fundamental component of any software project, though unfortunately it can become a bit of an afterthought in the Data Science world. This talk aims to highlight why testing is so important and beneficial in a Data Scientist's workflow, and gives some pointers on how we can do it more often.

Transcript

  1. Goal of Data Science? Make valuable use of data for the business:
     • By deriving insights from it to help drive decisions.
     • By solving problems with it to provide new or improved services and products.
  2. Different Needs at Different Times
     • Exploration: EDA, Stats / ML, Visualisation, Fast Feedback, Prototyping
     • Construction (Software Development): Delivery, Robustness, Automation, Reliability
  3. Prolonged Exploration can be Harmful
     • Analyses are often complex
     • Coding well is hard (a conscious effort)
     • Issues with:
       – Reproducibility
       – Reliability
       – Maintainability
     • Erodes trust
  4. So…
     • How can we go about making our DS projects more Reproducible, Reliable and Maintainable? Borrow good practices from Software Development (Construction).
     • Testing is a fundamental cornerstone.
  5. How can we test more?
     • Motivation to test
     • Knowledge of how to test
     • Note our resistances, and opportunities
  6. What Do I Mean By Testing? An automated set of checks that, when run, prove a specific feature or function of your software works as you expected.
  7. Why is it Valuable?
     • Automated testing provides confidence that the code works, and quick feedback when it doesn’t!
     • Allows you to make changes with confidence (refactoring / library upgrades).
     • Only then is it possible to refine and improve the codebase.
  8. Allows You to Play Well with Others
     • Tests help ensure reproducibility.
     • Tests help document / share knowledge.
     • Allows for much easier collaboration.
     • Includes “Future You”, they love it when you write tests!
  9. Libraries A few to choose from, but pytest is a great choice:
     • Minimal boilerplate
     • Detailed failing test output
     • Active community, many plugins
  10. Given, When, Then A good structure to follow is:
     • Given a starting set of conditions
     • When I perform these actions
     • Then I expect these results
  11. Pytest Example

     # test_pipeline.py
     import my_analysis_pkg as lib

     def test_pipeline_output_exists():
         # Given
         config = lib.get_default_config()
         data = lib.load_data('path/to/test/data')

         # When
         pipeline = lib.Pipeline(data, config)
         pipeline.run()

         # Then
         assert pipeline.output_report
  12. Pytest Example Run all tests from the command line with:
     pytest path/to/test/directory
     • Test discovery, execution and reporting.
     • See the docs at: https://docs.pytest.org
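
Beyond pointing pytest at a directory, the same command-line interface can target a single file or a single test, and filter by keyword. These are standard pytest options; the tests/ layout below is hypothetical:

     # Run one test file, or one test inside it, using pytest node IDs
     pytest tests/test_pipeline.py
     pytest tests/test_pipeline.py::test_pipeline_output_exists

     # Verbose output; select tests whose names match a keyword expression
     pytest -v
     pytest -k "pipeline"
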
  13. Anything You Have Written In the Analysis:
     • Data Cleaning Pipelines
     • Model Preprocessing Steps
     • Summary Reports / Tables
     • Don’t test what has already been tested.
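
To make this concrete, here is a minimal sketch of a test for a data cleaning step, written in the Given/When/Then style from slide 10. The function clean_ages and its behaviour are hypothetical, invented for illustration; only the pattern comes from the talk:

     # test_cleaning.py
     import pandas as pd
     import my_analysis_pkg as lib  # package name reused from the earlier example

     def test_clean_ages_drops_impossible_values():
         # Given a frame containing out-of-range ages
         raw = pd.DataFrame({'age': [25, -3, 200, 40]})
         # When the (hypothetical) cleaning step runs
         cleaned = lib.clean_ages(raw)
         # Then only plausible ages remain
         assert cleaned['age'].between(0, 120).all()
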
  14. Anything You Have Written In constructing the Deliverable / Service:
     • User facing APIs
     • Data validation
     • Utility functions
     • Configuration Setup
     • Logging
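
A data validation check from this list might look like the sketch below. validate_schema is again a hypothetical function; the point is that rejecting bad input is itself behaviour worth testing, and pytest.raises makes that easy:

     # test_validation.py
     import pytest
     import my_analysis_pkg as lib

     def test_validate_schema_rejects_missing_columns():
         # Given a record missing a required field (hypothetical schema)
         record = {'customer_id': 42}  # no 'signup_date'
         # When / Then: validation should raise rather than silently pass
         with pytest.raises(ValueError):
             lib.validate_schema(record)
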
  15. When Might we Resist Testing?
     • Time pressures and deadlines
       – Testing is time consuming and takes effort
     • May be perceived as low value work
       – By management / stakeholders / clients
       – By yourself (not as interesting as analytics)
     • Not familiar with it, don’t know where to start
     • No energy or mental RAM remaining
     • “If it breaks I’ll know and I’ll fix it”
  16. Great Opportunities for Tests
     • After manually testing in the REPL
     • After scripting a set of operations
     • After finding a bug in the code (regression testing)
     • After exploration in the Jupyter Notebook
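
The regression testing opportunity deserves a sketch: once a bug is found and fixed, a test pinned to the input that triggered it stops the bug from coming back. Both the bug and the summarise function here are hypothetical:

     # test_regressions.py
     import pandas as pd
     import my_analysis_pkg as lib

     def test_summarise_handles_empty_input():
         # Given the input that previously crashed summarise() (hypothetical bug)
         empty = pd.DataFrame(columns=['value'])
         # When / Then: an empty summary should come back, not an exception
         result = lib.summarise(empty)
         assert result.empty
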
  17. Ultimately a Trade-Off You Make
     • More tests mean more upfront investment, but more benefits over time.
     • Fewer tests mean less effort now, but more risk to manage, and even more over time.
  18. Summary
     • DS needs to be a balance of exploration and construction activities.
     • Testing is fundamental to good software.
     • Pytest makes it easy to get started.
     • Focus on your code over libraries.
     • Resist temptations to take shortcuts.
  19. Questions? Resources:
     • Clean Code - Robert Martin
     • Python 3 Object Oriented Programming - Dusty Phillips
     • Python Testing with pytest - Brian Okken
     Chris Musselle, Senior Data Scientist, [email protected]
  20. We are Hiring!
     • Data Scientists
     • Senior Data Scientists
     • Data Engineers
     https://www.mango-solutions.com/about/careers
     [email protected]
  21. Test Driven Data Analysis (TDDA)
     • Combining TDD approaches with Data Analysis
     • Nick Radcliffe
       – Talk at PyData Berlin 2017 (YouTube)
       – Tutorial at PyData London 2017 (YouTube)
       – Blog at http://tdda.info
     • Install with: pip install tdda
  22. TDDA: Core Features Provides ways to test your code and data.
     • Reference Tests: against “known to be correct” reference results.
     • Constraint Tests: does the data look similar to what you have seen before?
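
As a rough sketch of the constraint testing workflow, using the pandas-level API (discover_df / verify_df) described in the tdda documentation at tdda.info; treat the exact names and return values as indicative rather than definitive:

     import pandas as pd
     from tdda.constraints import discover_df, verify_df

     # Discover constraints (types, ranges, nullability, ...) from known-good data
     good = pd.read_csv('path/to/data.csv')
     constraints = discover_df(good)
     with open('data_constraints.tdda', 'w') as f:
         f.write(constraints.to_json())

     # Later: verify that newly arrived data still satisfies those constraints
     new = pd.read_csv('path/to/new_data.csv')
     verification = verify_df(new, 'data_constraints.tdda')
     assert verification.failures == 0
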
  23. Test Driven Data Analysis
     • Supports pytest integration
     • Has a CLI tool
     • Run checks with:
       tdda discover path/to/data.csv
       tdda verify path/to/data.csv
     • Get started with:
       tdda examples