Slide 1

Slide 1 text

Building a Data Pipeline with Testing in Mind
Jiaqi Liu
Software Engineer at Button
Director, @WomenWhoCodeNYC
@jiaqicodes

Slide 2

Slide 2 text

usebutton.com

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Agenda
• Data Pipelines
• Challenges with Testing Data Pipelines
• Designing Features:
  • Well Defined Schemas
  • Dry Run Mode
  • Storing Metadata
• Testing, Monitoring & Alerting

Slide 5

Slide 5 text

Data Pipeline

Slide 6

Slide 6 text

ETL Pipeline
• Extract data from a source, such as a scraped site, a large file, or a real-time stream of data feeds.
• Transform the data, for example by joining it with additional information to produce an enhanced data set, running it through a machine learning model, or aggregating it in some way.
• Load the data into a data warehouse or a user-facing dashboard, wherever the data is ultimately stored and displayed.
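
The three ETL stages can be sketched as three small functions. This is a minimal illustration, not the pipeline from the talk: the CSV input, the field names (user_id, amount), and the dict standing in for a warehouse are all assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input standing in for a scraped site, large file, or feed.
RAW_CSV = """user_id,amount
u1,10.50
u2,3.25
u1,4.50
"""


def extract(raw_text):
    """Extract: parse rows from the source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: aggregate the amount per user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user_id"]] += float(row["amount"])
    return dict(totals)


def load(totals, warehouse):
    """Load: write results to the destination (here, a plain dict)."""
    warehouse.update(totals)
    return warehouse


def run_pipeline(raw_text, warehouse):
    """Chain the three stages in order."""
    return load(transform(extract(raw_text)), warehouse)
```

Keeping each stage a separate function means each one can be tested on its own, which is the theme of the rest of the deck.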

Slide 7

Slide 7 text

Data Pipeline Example

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

PyCon → Convert HTML to structured data → Write to file → Select Random Talk → Write to File

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Batch: Periodic process that reads data in bulk (typically from a filesystem or a database)
Stream: High-throughput, low-latency system that reads data from a stream or a queue
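
The difference between the two modes can be sketched with the standard library alone. The process function and the None sentinel used to end the stream are assumptions made for illustration; a production stream would typically read from a message queue rather than queue.Queue.

```python
import queue


def process(record):
    """Hypothetical per-record transformation shared by both modes."""
    return record.upper()


def run_batch(records):
    """Batch: read everything in bulk, then process it in one periodic run."""
    return [process(r) for r in records]


def run_stream(q, sentinel=None):
    """Stream: process each record as it arrives, until the sentinel is seen."""
    while True:
        record = q.get()
        if record is sentinel:
            break
        yield process(record)
```

The batch function needs the whole input up front, while the stream function yields results one record at a time and never holds the full data set in memory.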

Slide 16

Slide 16 text

Apache Airflow, Luigi, Google Dataflow, Pachyderm

Slide 17

Slide 17 text

What Could Go Wrong?

Slide 18

Slide 18 text

Problems
• Batch Job is never scheduled
• Batch Job takes too long to run
• Data is malformed or corrupt
• Data is lost
• Stream is backed-up, Stream data is lost
• Non-deterministic models

Slide 19

Slide 19 text

Batch Jobs

Slide 20

Slide 20 text

Confidential Stream Jobs

Slide 21

Slide 21 text

Data Pipeline Concerns
• Data Integrity: Data is exposed, lost, or malformed. A statistical model is producing highly inaccurate results.
• Delayed Processing: Speed in data processing could be core to the business.

Slide 22

Slide 22 text

It’s not enough to know that the pipeline is healthy; you also have to know that the data being processed is accurate.

Slide 23

Slide 23 text

Build data pipelines that support interpretability and observability

Slide 24

Slide 24 text

Interpretability
Understand not just what a model predicted but also why. This allows for debugging and auditing machine learning models.

Slide 25

Slide 25 text

Because Button is a marketplace, we see the side effects of user behavior in our data and have to decipher what assumptions are safe to make.

Slide 26

Slide 26 text

Observability Cindy Sridharan - https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c

Slide 27

Slide 27 text

Mystery Free Prod

Slide 28

Slide 28 text

Features
Build features to support interpretability and observability

Slide 29

Slide 29 text

Features to Include
• Well Defined Schemas
• Capturing Metadata about the Pipeline
• Having a Test Run Feature
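
Two of these features, capturing metadata and a test run (dry run) mode, can be sketched together. The function name, the metadata fields, and the doubling transformation are all hypothetical; the point is only that a dry run exercises the same logic while skipping the write, and that every run reports what it did.

```python
import time
import uuid


def run_job(records, sink, dry_run=False):
    """Process records; in dry-run mode, skip the write but still report metadata."""
    started = time.time()
    output = [r * 2 for r in records]  # placeholder transformation

    if not dry_run:
        sink.extend(output)  # the only side effect of the job

    metadata = {
        "run_id": str(uuid.uuid4()),
        "dry_run": dry_run,
        "records_in": len(records),
        "records_out": len(output),
        "records_written": 0 if dry_run else len(output),
        "duration_seconds": time.time() - started,
    }
    return output, metadata
```

Storing the returned metadata alongside each run makes it possible to answer later which run produced a given record and how long that run took.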

Slide 30

Slide 30 text

Schema

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Best Practices Protocol buffers are a language-neutral, platform-neutral extensible mechanism for serializing structured data.
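
The idea of a well-defined schema can be illustrated without any protobuf tooling. The sketch below uses a Python dataclass as a stand-in, not Protocol Buffers itself; the Talk record and its field names are hypothetical. It rejects records with missing, extra, or wrongly typed fields at the pipeline boundary.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Talk:
    """Hypothetical record schema; field names are illustrative only."""
    title: str
    speaker: str
    duration_minutes: int


def parse_record(raw):
    """Validate a raw dict against the Talk schema, failing loudly on mismatch."""
    expected = {f.name: f.type for f in fields(Talk)}
    missing = expected.keys() - raw.keys()
    extra = raw.keys() - expected.keys()
    if missing or extra:
        raise ValueError(
            f"schema mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    for name, expected_type in expected.items():
        if not isinstance(raw[name], expected_type):
            raise TypeError(f"{name} should be {expected_type.__name__}")
    return Talk(**raw)
```

Failing at parse time keeps malformed data from travelling further down the pipeline, where it is harder to trace.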

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

Metadata

Slide 35

Slide 35 text

Test in Dry Run Mode

Slide 36

Slide 36 text

Monitoring & Testing

Slide 37

Slide 37 text

Testing

Slide 38

Slide 38 text

Test Pyramid https://martinfowler.com/articles/practical-test-pyramid.html

Slide 39

Slide 39 text

Different Types of Tests
• Unit Test
• Functional
• Regression
• System

Slide 40

Slide 40 text

Unit Tests

Slide 41

Slide 41 text

Functional Tests
• Also known as integration tests
• For data pipelines, these are the golden tests
• They set the gold standard for data in and data out
• They don’t need to be logic-specific the way unit tests are
• Build a golden test framework and define fixtures (expected input and expected output)
Failed Gold Test
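
A golden test framework can be very small. In this sketch, each fixture pairs a known input with its expected output, and the runner reports which fixtures fail. The transform stage, the fixture names, and the fixture format are assumptions for illustration only.

```python
def transform(rows):
    """Hypothetical pipeline stage under test: total amount per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return totals


# Golden fixtures: each pairs a known input with its expected output.
GOLDEN_FIXTURES = [
    {
        "name": "two_users",
        "input": [
            {"user": "a", "amount": 2},
            {"user": "b", "amount": 5},
            {"user": "a", "amount": 1},
        ],
        "expected": {"a": 3, "b": 5},
    },
    {"name": "empty", "input": [], "expected": {}},
]


def run_golden_tests(stage, fixtures):
    """Return the names of fixtures whose actual output differs from expected."""
    return [f["name"] for f in fixtures if stage(f["input"]) != f["expected"]]
```

Because the runner only compares data in with data out, the same fixtures keep working when the internals of the stage are rewritten.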

Slide 42

Slide 42 text

Regression Tests https://www.ibeta.com/regression-testing-nutshell/

Slide 43

Slide 43 text

Champion/Challenger Model
• Model A Precision: 93%
• Model B Precision: 95%
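
A champion/challenger comparison on precision can be sketched as follows. The function names and the min_gain promotion rule are assumptions; the slide only shows the two precision figures, not how the decision is made.

```python
def precision(predictions, labels):
    """Precision = true positives / all predicted positives."""
    true_pos = sum(1 for p, y in zip(predictions, labels) if p and y)
    predicted_pos = sum(1 for p in predictions if p)
    return true_pos / predicted_pos if predicted_pos else 0.0


def pick_winner(champion_preds, challenger_preds, labels, min_gain=0.0):
    """Promote the challenger only if it beats the champion by more than min_gain."""
    champ = precision(champion_preds, labels)
    chall = precision(challenger_preds, labels)
    return "challenger" if chall - champ > min_gain else "champion"
```

Both models are scored against the same labeled data, so a regression test can fail the build when a new model does not at least match the current one.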

Slide 44

Slide 44 text

System Tests
• Black Box
• Measures Health of Whole System
• Continuous
• Relies on Other Signals

Slide 45

Slide 45 text

Monitoring

Slide 46

Slide 46 text

Monitoring Tools

Slide 47

Slide 47 text

No content

Slide 48

Slide 48 text

Time Series Metrics

Slide 49

Slide 49 text

Alerting

Slide 50

Slide 50 text

Monitoring with Time Series Data

Slide 51

Slide 51 text

Best Practices

Slide 52

Slide 52 text

Best Practices
• Set a threshold that works for you.
• Establish a baseline and go from there.
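
One way to derive a threshold from a baseline is to summarize past values of a metric and alert when a new value strays too far from them. This is a sketch under assumptions: the metric (records processed per run), the use of mean and standard deviation, and the default tolerance of 3.0 are illustrative choices, not recommendations from the talk.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize past metric values (e.g. records processed per run)."""
    return {"mean": mean(history), "stdev": stdev(history)}


def should_alert(value, baseline, tolerance=3.0):
    """Alert when a value is more than `tolerance` standard deviations from the mean."""
    return abs(value - baseline["mean"]) > tolerance * baseline["stdev"]
```

The tolerance is the knob to tune: a tighter value catches problems sooner at the cost of more false alarms.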

Slide 53

Slide 53 text

Thanks! We’re hiring! www.usebutton.com