
Data Tests

Abstract

Data science projects in commercial companies often face a challenge arising from evolving data sources. As a project progresses, new signals and information sources are added incrementally. In practice, each change in a data source creates a need to change the application source code.

With no design up front, some application modules, such as a dashboard or a machine learning model, unwittingly become dependent on the data source. In this case, accommodating the evolving data source is not simply a matter of changing the code related to the data source. Rather, preserving the rest of the existing application in a working condition involves further code changes in distant elements of the application.

An alternative way of dealing with evolving data sources is to introduce a small design up front. Such a design lets programmers manage the source-code dependencies throughout the project life-cycle.

This talk suggests a design that (1) separates data sources from analytic applications and (2) prevents analytic applications from knowing about the data sources.

While the challenge of evolving data sources is programming-language agnostic, this talk demonstrates an implementation of the suggested design in R.

The takeaway of this talk is a design of a system in which data sources are plugins to the system’s analytic modules.
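As a rough sketch of the plugin idea (all names and data below are illustrative, not taken from the talk), the analytic module depends only on an accessor function, and the accessor is the only code that knows where the data comes from:

```r
# A minimal sketch of the suggested plugin design. All names and
# data here are illustrative, not taken from the talk.

# data-source.R -- the plugin: the only code that knows where the
# data actually comes from. Changing sources means changing this
# function, and nothing else.
get_cars_data <- function() {
  data.frame(car_model = c("A", "B"), price = c(50, 60))
}

# app.R -- the analytic module: it depends only on the accessor,
# never on the underlying source.
cars_data <- get_cars_data()
mean_price <- mean(cars_data$price)
```

Swapping a mock source for a database-backed one then amounts to replacing the body of `get_cars_data()`, leaving the analytic module untouched.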

---

This presentation was given at satRday Auckland, February 2020.
See event details at https://auckland2020.satrdays.org/#portfolioModal7

Harel Lustiger

February 22, 2020


Transcript

  1. Data Tests
Separating Data Sources from the Rest
    satRday, February 2020
    Harel Lustiger
    Data Scientist, Harmonic Analytics


  2. What are we doing?
We are separating the data source from the rest


  3. Dividing the analytic application into modules
    # main.R
    ## data source
    [Insert 500 lines here]
    data <- read.csv("../input/novel-corona-virus-2019-dataset.csv")
    data <- data[, c(1:2, 4, 6:8)]
    data$isChina <- ifelse(data$Country %in% c("Mainland China", "China"),
                           "China", "Not China")
    data$Date <- as.Date(data$Date, format = "%m/%d/%y")
    ## analytic module
    [Insert 100 lines here]
    ggplot(Tog3[-c(1,28),], aes(Date, Rec2All, colour = isChina)) +
      geom_line(size = 2.2, alpha = 0.5)


  4. Making the data source's existence depend on what matters
    [Diagram: the Analytic Module depends on a Data Source with Data
    Tests; the Data Source has a Database Access component, which
    talks to the Database]


  5. What’s the Story?
    We are helping our clients by fitting a solution to their needs


  6. Getting to know our client: Euel Cheatam, the Mercedes
    dealership manager


  7. The proposed solution is a weekly fact sheet with popular Q&A


  8. The data scientists encounter real-world impediments


  9. Iteration Zero
    developing a data-driven app without real data


  10. Step 01: Generate the dataset assumption triplet


  11. The analytic application expects tidy data [1] as its input
    [1] https://r4ds.had.co.nz/tidy-data.html


  12. We formulate a tidy dataset by making three assumptions: it has
    an observation unique identifier (UID), a target variable, and
    salient features.
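For instance, a hypothetical tidy dataset satisfying the assumption triplet (all values invented for illustration) might look like:

```r
# A hypothetical tidy dataset satisfying the assumption triplet
# (values are invented for illustration):
cars_data <- data.frame(
  car_model = c("Mazda RX4", "Datsun 710"),  # observation UID
  price     = c(52.3, 47.1),                 # target variable
  mpg       = c(21.0, 22.8),                 # salient feature
  gear      = c(4, 4)                        # salient feature
)
# The UID assumption means no two rows may share a car_model
stopifnot(!anyDuplicated(cars_data$car_model))
```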


  13. Step 02: Convert assumptions to assertions with data-tests


  14. # data-tests.R
    ## 1. Check that the dataset exists
    stopifnot(exists("cars_data"), is.data.frame(cars_data))
    ## 2. Check that the necessary columns exist
    expected_cols <- c("car_model", "price", "gear", "mpg")
    stopifnot(all(expected_cols %in% colnames(cars_data)))
    ## 3. Check that the records are unique
    is.distinct <- function(x) dplyr::n_distinct(x) == length(x)
    stopifnot(is.distinct(cars_data$car_model))
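To see what these assertions buy us, here is a small self-contained demonstration (the data are invented, and base R's `anyDuplicated()` stands in for `dplyr::n_distinct()` to keep the example dependency-free): a data source that violates an assumption fails fast at load time, instead of silently breaking the analytic module downstream.

```r
# Invented dataset that violates the uniqueness assumption.
bad_cars_data <- data.frame(
  car_model = c("Mazda RX4", "Mazda RX4"),  # duplicate UID!
  price = c(52, 47), gear = c(4, 4), mpg = c(21.0, 22.8)
)
# Base-R stand-in for dplyr::n_distinct(x) == length(x)
is.distinct <- function(x) !anyDuplicated(x)
# The data-test converts the assumption into a hard failure
ok <- tryCatch({
  stopifnot(is.distinct(bad_cars_data$car_model))
  TRUE
}, error = function(e) FALSE)
ok  # FALSE: the duplicated UID is caught by the data-test
```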


  15. Step 03: Implement a data source plugin


  16. Quick glimpse over datasets::mtcars
    [Table excerpt: the row names serve as the observation unique
    identifier; columns such as mpg and gear are salient features]


  17. # data-source.R
    get_cars_data <- function() {
      ## 1. Generate records
      data(mtcars, package = "datasets")
      cars_data <- mtcars %>% tibble::rownames_to_column("car_model")
      ## 2. Generate price
      set.seed(2020)
      price <- runif(n = nrow(cars_data), min = 41, max = 75)
      cars_data <- cars_data %>% tibble::add_column(price = price)
      ## Run the data-tests in the function's own environment
      source("data-tests.R", local = TRUE)
      return(cars_data)
    }
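A sketch of a second, interchangeable plugin (assumed here, not shown in the talk): the same `get_cars_data()` contract, now backed by a CSV file. Because both plugins must satisfy the same data-tests, app.R needs no changes when the source is swapped.

```r
# Write an invented CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(car_model = c("Mazda RX4", "Datsun 710"),
             price = c(52, 47), gear = c(4, 4), mpg = c(21.0, 22.8)),
  csv_path, row.names = FALSE
)

# Alternative plugin: same contract, different backing store.
get_cars_data <- function() {
  cars_data <- read.csv(csv_path)
  ## Inline versions of the checks from data-tests.R (base R only)
  stopifnot(is.data.frame(cars_data))
  expected_cols <- c("car_model", "price", "gear", "mpg")
  stopifnot(all(expected_cols %in% colnames(cars_data)))
  stopifnot(!anyDuplicated(cars_data$car_model))
  cars_data
}
```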


  18. Step 04: Develop an analytic module


  19. # app.R
    ## 1. Get the data
    source("data-source.R")
    cars_data <- get_cars_data()
    ## 2. Render booklet
    print(cars_data %>% dplyr::select(car_model, mpg, gear, price))
    boxplot(price ~ gear, cars_data)
    lm(price ~ mpg + gear, cars_data) %>% summary()


  20. Conclusion
    Why and how we separate the data source from the rest


  21. Recall why we have separated the data source from the rest
    [Diagram: the Analytic Module depends on a Data Source; the Data
    Source has a Database Access component, which talks to the
    Database]


  22. Recall why we have separated the data source from the rest
    [Diagram: as before, now with Data Tests: the Analytic Module
    depends on the Data Source, and the Data Tests dictate what the
    Data Source must provide]


  23. Recall how we have separated the data source from the rest:
    assumption triplet → data-tests.R → data-source.R → app.R
    Harel Lustiger
    [email protected]
