Data Tests

Abstract

Data science projects in commercial companies often face a challenge arising from evolving data sources. As the project progresses, new signals and information sources are added incrementally. In practice, each change to a data source forces a change to the application source code.

With no design up front, some application modules, such as a dashboard or a machine learning model, unwittingly become dependent on the data source. In this case, accommodating the evolving data source is not simply a matter of changing the code related to the data source. Rather, preserving the rest of the existing application in a working condition involves further code changes in distant elements of the application.

An alternative way of dealing with evolving data sources is to introduce a small amount of design up front. Such a design lets programmers manage the source code dependencies throughout the project life cycle.

This talk suggests a design that (1) separates data sources from analytic applications and (2) restricts analytic applications from knowing about the data sources.

While the challenge of evolving data sources is programming-language agnostic, this talk demonstrates an implementation of the suggested design in R.

The takeaway of this talk is a system design in which data sources are plugins to the system’s analytic modules.
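The plugin idea can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the code from the talk: the function names `validate_cars_data()`, `get_cars_data_from_mtcars()` and `summarise_price_by_gear()` are hypothetical, and the prices are simulated. The analytic function receives a data-source function as an argument, so it never needs to know where the data comes from.

```r
# Data tests: the assumptions the analytic module makes about its input.
# (Hypothetical helper; the column names follow the slides in the transcript.)
validate_cars_data <- function(cars_data) {
  expected_cols <- c("car_model", "price", "gear", "mpg")
  stopifnot(
    is.data.frame(cars_data),
    all(expected_cols %in% colnames(cars_data)),
    anyDuplicated(cars_data$car_model) == 0
  )
  invisible(cars_data)
}

# One data source "plugin": built from mtcars with simulated prices.
get_cars_data_from_mtcars <- function() {
  cars_data <- datasets::mtcars
  cars_data$car_model <- rownames(cars_data)
  rownames(cars_data) <- NULL
  set.seed(2020)
  cars_data$price <- runif(n = nrow(cars_data), min = 41, max = 75)
  validate_cars_data(cars_data)
}

# Analytic module: relies only on the validated columns, not on the source.
summarise_price_by_gear <- function(get_data) {
  cars_data <- get_data()
  aggregate(price ~ gear, data = cars_data, FUN = mean)
}

price_by_gear <- summarise_price_by_gear(get_cars_data_from_mtcars)
```

Swapping the data source then means passing a different function to `summarise_price_by_gear()`, provided its output passes the same data tests.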

---

This presentation was given at satRday Auckland, February 2020.
See event details at https://auckland2020.satrdays.org/#portfolioModal7

Harel Lustiger

February 22, 2020

Transcript

  1. Data Tests: Separating between Data Sources and the Rest

    satRday, February 2020
    Harel Lustiger, Data Scientist, Harmonic Analytics
  2. Dividing the analytic application into modules

    # main.R

    Analytic module:

        # Insert 500 hundred lines here
        ggplot(Tog3[-c(1,28),], aes(Date, Rec2All, colour = isChina)) +
          geom_line(size = 2.2, alpha = 0.5) +

    Data source:

        data <- read.csv("../input/novel-corona-virus-2019-dataset.csv")
        data <- data[,c(1:2,4,6:8)]
        data$isChina <- ifelse(data$Country %in% c("Mainland China", "China"), "China", "Not China")
        data$Date <- as.Date(data$Date, format = "%m/%d/%y")
        # Insert 100 hundred lines here
  3. Making the data source existence dependent on what matters

    [Diagram: Analytic Module, Database, Database Access, Data Source, Data Tests; relations labelled "Depends On" and "Has a"]
  4. # data-tests.R

        ## 1. Check if the dataset exists
        stopifnot(exists("cars_data"), is.data.frame(cars_data))

        ## 2. Check if the necessary columns exist
        expected_cols <- c("car_model", "price", "gear", "mpg")
        stopifnot(all(expected_cols %in% colnames(cars_data)))

        ## 3. Check if the records are unique
        is.distinct <- function(x) dplyr::n_distinct(x) == length(x)
        stopifnot(is.distinct(cars_data$car_model))
  5. # data-source.R

        library(magrittr)  # provides the %>% pipe

        get_cars_data <- function(){
          ## 1. Generate records
          data(mtcars, package = "datasets")
          cars_data <- mtcars %>% tibble::rownames_to_column("car_model")

          ## 2. Generate price
          set.seed(2020)
          price <- runif(n = nrow(cars_data), min = 41, max = 75)
          cars_data <- cars_data %>% tibble::add_column(price = price)

          ## Run data-tests inside this function, where cars_data is defined
          source("data-tests.R", local = TRUE)

          return(cars_data)
        }
  6. # app.R

        ## 1. Get the data
        source("data-source.R")
        cars_data <- get_cars_data()

        ## 2. Render booklet
        print(cars_data %>% dplyr::select(car_model, mpg, gear, price))
        boxplot(price ~ gear, cars_data)
        lm(price ~ mpg + gear, cars_data) %>% summary()
  7. Recall why we have separated the data source from the rest

    [Diagram: Database, Database Access, Data Source, Analytic Module; relations labelled "Depends On" and "Has a"]
  8. Recall why we have separated the data source from the rest

    [Diagram: Database, Database Access, Data Source, Data Tests, Analytic Module; relations labelled "Depends On", "Dictates" and "Has a"]
  9. Recall how we have separated the data source from the rest

    The assumption triplet: data-tests.R, data-source.R, app.R

    Harel Lustiger [email protected]
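To illustrate what the data tests buy when a data source evolves, here is a small sketch. It is an assumption-laden example rather than material from the slides: `run_data_tests()` is a hypothetical function form of the checks in data-tests.R, and `get_cars_data_v2()` is a made-up replacement source whose `price` column has been renamed to `cost`.

```r
# Hypothetical function form of the checks in data-tests.R.
run_data_tests <- function(cars_data) {
  expected_cols <- c("car_model", "price", "gear", "mpg")
  stopifnot(
    is.data.frame(cars_data),
    all(expected_cols %in% colnames(cars_data)),
    anyDuplicated(cars_data$car_model) == 0
  )
  invisible(TRUE)
}

# A made-up replacement data source whose schema has drifted:
# the price column is now called "cost".
get_cars_data_v2 <- function() {
  data.frame(
    car_model = c("Model A", "Model B"),
    cost = c(50, 60),
    gear = c(4, 5),
    mpg = c(21.0, 30.4)
  )
}

# The data tests stop the drifted source before it reaches the analytic module.
drift_detected <- tryCatch(
  {
    run_data_tests(get_cars_data_v2())
    FALSE
  },
  error = function(e) TRUE
)
```

Because the failure surfaces at the data-source boundary, the fix stays in the data source (for example, renaming `cost` back to `price`) and app.R does not have to change.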