Slide 1

Slide 1 text

Data Tests: Separating Data Sources from the Rest
satRday, February 2020
Harel Lustiger, Data Scientist, Harmonic Analytics

Slide 2

Slide 2 text

What are we doing? We are separating the data source from the rest.

Slide 3

Slide 3 text

Dividing the analytic application into modules
# main.R
Data source:
data <- read.csv("../input/novel-corona-virus-2019-dataset.csv")
data <- data[,c(1:2,4,6:8)]
data$isChina <- ifelse(data$Country %in% c("Mainland China", "China"),"China", "Not China")
data$Date <- as.Date(data$Date, format = "%m/%d/%y")
# ... insert 100 lines here ...
Analytic module:
# ... insert 500 lines here ...
ggplot(Tog3[-c(1,28),], aes(Date,Rec2All, colour = isChina))+
  geom_line(size = 2.2, alpha = 0.5)+
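One way to picture the split, as a minimal sketch only: the function names `prepare_outbreak_data()` and `count_by_region()` are hypothetical, and the toy data frame stands in for the CSV read in main.R, whose full column layout is not shown on the slide. Only the `Country` and `Date` columns are taken from the slide's code.

```r
# Data source side: everything that loads and reshapes the raw records.
# `prepare_outbreak_data` is a hypothetical name, not part of the original main.R.
prepare_outbreak_data <- function(raw) {
  raw$isChina <- ifelse(raw$Country %in% c("Mainland China", "China"),
                        "China", "Not China")
  raw$Date <- as.Date(raw$Date, format = "%m/%d/%y")
  raw
}

# Analytic module side: only consumes the prepared data frame.
# `count_by_region` is likewise a hypothetical stand-in for the real analysis.
count_by_region <- function(data) {
  table(data$isChina)
}

# Toy input standing in for the CSV (assumed columns: Country, Date).
raw <- data.frame(
  Country = c("Mainland China", "Italy", "China"),
  Date = c("01/22/20", "02/01/20", "02/02/20"),
  stringsAsFactors = FALSE
)
outbreak <- prepare_outbreak_data(raw)
region_counts <- count_by_region(outbreak)
```

With this shape, the analytic module never touches the file path or the raw column layout; it only sees what the data source hands over.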

Slide 4

Slide 4 text

Making the data source's existence dependent on what matters
Diagram labels: Analytic Module, Database, Database Access, Data Source, Data Tests. Relationship labels: Depends On, Has a.

Slide 5

Slide 5 text

What’s the Story? We are helping our clients by fitting a solution to their needs

Slide 6

Slide 6 text

Getting to know our client: Euel Cheatam, the Mercedes dealership manager

Slide 7

Slide 7 text

The proposed solution is a weekly fact sheet with popular Q&A

Slide 8

Slide 8 text

The data scientists encounter real-world impediments

Slide 9

Slide 9 text

Iteration Zero: developing a data-driven app without real data

Slide 10

Slide 10 text

01. Generate a dataset assumption triplet

Slide 11

Slide 11 text

The analytic application expects tidy data [1] as its input
[1] https://r4ds.had.co.nz/tidy-data.html

Slide 12

Slide 12 text

We formulate a tidy dataset by making three assumptions:
Observation Unique Identifier (UID)
Target variable
Salient features

Slide 13

Slide 13 text

02. Convert assumptions to assertions with data-tests

Slide 14

Slide 14 text

# data-tests.R
## 1. Check if the dataset exists
stopifnot(exists("cars_data"), is.data.frame(cars_data))
## 2. Check if the necessary columns exist
expected_cols <- c("car_model", "price", "gear", "mpg")
stopifnot(all(expected_cols %in% colnames(cars_data)))
## 3. Check if the records are unique
is.distinct <- function(x) dplyr::n_distinct(x) == length(x)
stopifnot(is.distinct(cars_data$car_model))
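The same three assertions can also be packaged as a function, which makes it easy to see them pass and fail. This is a sketch under stated assumptions: `run_data_tests()` is a hypothetical wrapper, the two toy data frames are made up for illustration, and base R's `anyDuplicated()` is swapped in for `dplyr::n_distinct()` so the example has no package dependencies.

```r
# `run_data_tests` is a hypothetical wrapper around the three checks above.
run_data_tests <- function(data,
                           uid = "car_model",
                           expected_cols = c("car_model", "price", "gear", "mpg")) {
  stopifnot(
    is.data.frame(data),                     # 1. the dataset exists
    all(expected_cols %in% colnames(data)),  # 2. the necessary columns exist
    anyDuplicated(data[[uid]]) == 0          # 3. the records are unique
  )
  invisible(data)
}

# Toy inputs (made-up values): one valid, one with a duplicated record.
good <- data.frame(car_model = c("A", "B"), price = c(50, 60),
                   gear = c(4, 5), mpg = c(21, 30))
bad <- rbind(good, good[1, ])

# try() captures the error raised by stopifnot() instead of halting the script.
good_ok <- !inherits(try(run_data_tests(good), silent = TRUE), "try-error")
bad_ok  <- !inherits(try(run_data_tests(bad),  silent = TRUE), "try-error")
```

Here `good_ok` is `TRUE` and `bad_ok` is `FALSE`, since the duplicated `car_model` violates the uniqueness assumption.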

Slide 15

Slide 15 text

03. Implement a data source plugin

Slide 16

Slide 16 text

A quick glimpse of datasets::mtcars
Annotations: Observation Unique Identifier, Salient Feature, Salient Feature

Slide 17

Slide 17 text

# data-source.R
library(magrittr) # provides the %>% pipe
get_cars_data <- function(){
  ## 1. Generate records
  data(mtcars, package = "datasets")
  cars_data <- mtcars %>% tibble::rownames_to_column("car_model")
  ## 2. Generate price
  set.seed(2020)
  price <- runif(n = nrow(cars_data), min = 41, max = 75)
  cars_data <- cars_data %>% tibble::add_column(price = price)
  ## Run data-tests; local = TRUE evaluates them in this function's
  ## environment, where cars_data is defined
  source("data-tests.R", local = TRUE)
  return(cars_data)
}
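Because the data source is a plugin, a different implementation can be swapped in as long as it satisfies the same data tests. The sketch below assumes a CSV export as the alternative source: `get_cars_data_from_csv()` is a hypothetical name, the file contents are made-up values, and the checks are inlined in base R as a stand-in for `source("data-tests.R")` so the example runs on its own.

```r
# Hypothetical second plugin: same return contract as get_cars_data(),
# but the records come from a CSV file instead of datasets::mtcars.
get_cars_data_from_csv <- function(path) {
  cars_data <- utils::read.csv(path, stringsAsFactors = FALSE)
  # Inline stand-in for the data tests (base R only).
  expected_cols <- c("car_model", "price", "gear", "mpg")
  stopifnot(
    is.data.frame(cars_data),
    all(expected_cols %in% colnames(cars_data)),
    anyDuplicated(cars_data$car_model) == 0
  )
  cars_data
}

# Toy file standing in for a real export (values are made up for illustration).
csv_path <- tempfile(fileext = ".csv")
utils::write.csv(
  data.frame(car_model = c("Model A", "Model B"),
             price = c(45, 70), gear = c(4, 5), mpg = c(22.8, 15.0)),
  csv_path, row.names = FALSE
)
csv_cars <- get_cars_data_from_csv(csv_path)
```

Downstream code that only relies on the tested columns should not need to change when the source is replaced.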

Slide 18

Slide 18 text

04. Develop an analytic module

Slide 19

Slide 19 text

# app.R
library(magrittr) # provides the %>% pipe
## 1. Get the data
source("data-source.R")
cars_data <- get_cars_data()
## 2. Render booklet
print(cars_data %>% dplyr::select(car_model, mpg, gear, price))
boxplot(price ~ gear, cars_data)
lm(price ~ mpg + gear, cars_data) %>% summary()
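The analytic steps can also be wrapped in a function that accepts any data frame meeting the assumptions, which keeps the module independent of where the data came from. In this sketch `render_booklet()` is a hypothetical name, and the data frame is built with base R as a stand-in for `get_cars_data()` so the block is self-contained.

```r
# `render_booklet` is a hypothetical name for the analytic steps in app.R.
render_booklet <- function(cars_data) {
  fit <- lm(price ~ mpg + gear, data = cars_data)
  list(
    table = cars_data[, c("car_model", "mpg", "gear", "price")],
    model = summary(fit)
  )
}

# Base-R stand-in for get_cars_data(): mtcars plus a simulated price column.
cars_data <- data.frame(car_model = rownames(datasets::mtcars),
                        datasets::mtcars, row.names = NULL)
set.seed(2020)
cars_data$price <- runif(n = nrow(cars_data), min = 41, max = 75)

booklet <- render_booklet(cars_data)
```

The function depends only on the `car_model`, `mpg`, `gear`, and `price` columns, which are exactly the ones the data tests guarantee.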

Slide 20

Slide 20 text

Conclusion: why and how we separate the data source from the rest

Slide 21

Slide 21 text

Recall why we have separated the data source from the rest
Diagram labels: Database, Database Access, Analytic Module, Data Source. Relationship labels: Depends On, Has a.

Slide 22

Slide 22 text

Recall why we have separated the data source from the rest
Diagram labels: Database, Database Access, Data Tests, Analytic Module, Data Source. Relationship labels: Depends On, Dictates, Has a.

Slide 23

Slide 23 text

Recall how we have separated the data source from the rest
assumption triplet, data-tests.R, data-source.R, app.R
Harel Lustiger [email protected]