
The Case for Testing in Data Science

Chris
August 14, 2019

Data Scientists' work often focuses more on research and exploration than on the development side of building software. However, in order to demonstrate value from the analytics we do, we have to be able to deliver reliably. This often takes the form of a reproducible research report/project, or a built prototype that developers and stakeholders can experiment with further. Testing is a fundamental component of any software project, though unfortunately it can become a bit of an afterthought in the Data Science world. This talk aims to highlight why testing is so important and beneficial in a Data Scientist's workflow, and gives some pointers on how we can do it more often.

Transcript

  1. Goal of Data Science? Make valuable use of data for the business:
     • By deriving insights from it to help drive decisions.
     • By solving problems with it to provide new or improved services and products.
  2. Different Needs at Different Times
     • Exploration: EDA, Stats / ML, Visualisation, Fast Feedback, Prototyping
     • Construction (Software Development): Delivery, Robustness, Automation, Reliability
  3. Prolonged Exploration can be Harmful
     • Analyses are often complex
     • Coding well is hard (a conscious effort)
     • Issues with:
       – Reproducibility
       – Reliability
       – Maintainability
     • Erodes trust
  4. So…
     • How can we go about making our DS projects more Reproducible, Reliable and Maintainable? Borrow good practices from Software Development (Construction).
     • Testing is a fundamental cornerstone.
  5. How can we test more?
     • Motivation to test
     • Knowledge of how to test
     • Note our resistances, and opportunities
  6. What Do I Mean By Testing? An automated set of checks that, when run, prove a specific feature or function of your software works as you expected.
  7. Why is it Valuable?
     • Automated testing provides confidence that the code works, and quick feedback when it doesn’t!
     • Allows you to make changes with confidence (refactoring / library upgrades).
     • Only then is it possible to refine and improve the codebase.
  8. Allows You to Play Well with Others
     • Tests help ensure reproducibility.
     • Tests help document / share knowledge.
     • Allows for much easier collaboration.
     • Includes “Future You”, they love it when you write tests!
  9. Libraries A few to choose from, but pytest is a great choice:
     • Minimal boilerplate
     • Detailed failing test output
     • Active community, many plugins
  10. Given, When, Then A good structure to follow is:
     • Given a starting set of conditions
     • When I perform these actions
     • Then I expect these results
  11. Pytest Example

     # test_pipeline.py
     import my_analysis_pkg as lib

     def test_pipeline_output_exists():
         # Given
         config = lib.get_default_config()
         data = lib.load_data('path/to/test/data')

         # When
         pipeline = lib.Pipeline(data, config)
         pipeline.run()

         # Then
         assert pipeline.output_report
  12. Pytest Example Run all tests from the command line with:
     pytest path/to/test/directory
     • Test discovery, execution and reporting.
     • See the docs at: https://docs.pytest.org
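
Beyond pointing pytest at a directory, the same command-line interface can target a single file or a single test, and filter by keyword. These are standard pytest options; the tests/ layout below is hypothetical:

     # Run one test file, or one test inside it, using pytest node IDs
     pytest tests/test_pipeline.py
     pytest tests/test_pipeline.py::test_pipeline_output_exists

     # Verbose output; select tests whose names match a keyword expression
     pytest -v
     pytest -k "pipeline"
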
  13. Anything You Have Written In the Analysis:
     • Data Cleaning Pipelines
     • Model Preprocessing Steps
     • Summary Reports / Tables
     • Don’t test what has already been tested.
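
To make this concrete, here is a minimal sketch of a test for a data cleaning step, written in the Given/When/Then style from slide 10. The function clean_ages and its behaviour are hypothetical, invented for illustration; only the pattern comes from the talk:

     # test_cleaning.py
     import pandas as pd
     import my_analysis_pkg as lib  # package name reused from the earlier example

     def test_clean_ages_drops_impossible_values():
         # Given a frame containing out-of-range ages
         raw = pd.DataFrame({'age': [25, -3, 200, 40]})
         # When the (hypothetical) cleaning step runs
         cleaned = lib.clean_ages(raw)
         # Then only plausible ages remain
         assert cleaned['age'].between(0, 120).all()
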
  14. Anything You Have Written In constructing the Deliverable / Service:
     • User facing APIs
     • Data validation
     • Utility functions
     • Configuration Setup
     • Logging
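
A data validation check from this list might look like the sketch below. validate_schema is again a hypothetical function; the point is that rejecting bad input is itself behaviour worth testing, and pytest.raises makes that easy:

     # test_validation.py
     import pytest
     import my_analysis_pkg as lib

     def test_validate_schema_rejects_missing_columns():
         # Given a record missing a required field (hypothetical schema)
         record = {'customer_id': 42}  # no 'signup_date'
         # When / Then: validation should raise rather than silently pass
         with pytest.raises(ValueError):
             lib.validate_schema(record)
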
  15. When Might we Resist Testing?
     • Time pressures and deadlines
       – Testing is time consuming and takes effort
     • May be perceived as low value work
       – By management / stakeholders / clients
       – By yourself (not as interesting as analytics)
     • Not familiar with it, don’t know where to start
     • No energy or mental RAM remaining
     • “If it breaks I’ll know and I’ll fix it”
  16. Great Opportunities for Tests
     • After manually testing in the REPL
     • After scripting a set of operations
     • After finding a bug in the code (regression testing)
     • After exploration in the Jupyter Notebook
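
The regression testing opportunity deserves a sketch: once a bug is found and fixed, a test pinned to the input that triggered it stops the bug from coming back. Both the bug and the summarise function here are hypothetical:

     # test_regressions.py
     import pandas as pd
     import my_analysis_pkg as lib

     def test_summarise_handles_empty_input():
         # Given the input that previously crashed summarise() (hypothetical bug)
         empty = pd.DataFrame(columns=['value'])
         # When / Then: an empty summary should come back, not an exception
         result = lib.summarise(empty)
         assert result.empty
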
  17. Ultimately a Trade-Off You Make
     • More tests mean more upfront investment, but more benefits over time.
     • Fewer tests mean less effort now, but more risk to manage, and even more over time.
  18. Summary
     • DS needs to be a balance of exploration and construction activities.
     • Testing is fundamental to good software.
     • Pytest makes it easy to get started.
     • Focus on your code over libraries.
     • Resist temptations to take shortcuts.
  19. Questions? Resources:
     • Clean Code - Robert Martin
     • Python 3 Object Oriented Programming - Dusty Phillips
     • Python Testing with pytest - Brian Okken
     Chris Musselle, Senior Data Scientist, [email protected]
  20. We are Hiring!
     • Data Scientists
     • Senior Data Scientists
     • Data Engineers
     https://www.mango-solutions.com/about/careers
     [email protected]
  21. Test Driven Data Analysis (TDDA)
     • Combining TDD approaches with Data Analysis
     • Nick Radcliffe
       – Talk at PyData Berlin 2017 (YouTube)
       – Tutorial at PyData London 2017 (YouTube)
       – Blog at http://tdda.info
     • Install with: pip install tdda
  22. TDDA: Core Features Provides ways to test your code and data.
     • Reference Tests: against “known to be correct” reference results.
     • Constraint Tests: does the data look similar to what you have seen before?
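
As a rough sketch of the constraint testing workflow, using the pandas-level API (discover_df / verify_df) described in the tdda documentation at tdda.info; treat the exact names and return values as indicative rather than definitive:

     import pandas as pd
     from tdda.constraints import discover_df, verify_df

     # Discover constraints (types, ranges, nullability, ...) from known-good data
     good = pd.read_csv('path/to/data.csv')
     constraints = discover_df(good)
     with open('data_constraints.tdda', 'w') as f:
         f.write(constraints.to_json())

     # Later: verify that newly arrived data still satisfies those constraints
     new = pd.read_csv('path/to/new_data.csv')
     verification = verify_df(new, 'data_constraints.tdda')
     assert verification.failures == 0
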
  23. Test Driven Data Analysis
     • Supports pytest integration
     • Has a CLI tool
     • Run checks with:
       tdda discover path/to/data.csv
       tdda verify path/to/data.csv
     • Get started with:
       tdda examples