
Data Tests


Abstract

Data science projects in commercial companies often experience a challenge arising from evolving data sources. As the project progresses, new signals and information sources are added incrementally. In practice, when the data source changes, it creates a need to change the application source code.

With no design up front, some application modules, such as a dashboard or a machine learning model, unwittingly become dependent on the data source. In this case, accommodating the evolving data source is not simply a matter of changing the code related to the data source. Rather, preserving the rest of the existing application in a working condition involves further code changes in distant elements of the application.

An alternative way of dealing with evolving data sources is to introduce a small design up front. Such design lets programmers manage the source code dependencies throughout the project life-cycle.

This talk suggests a design that (1) separates data sources from analytic applications and (2) prevents analytic applications from knowing about the data sources.

Although the challenge of evolving data sources is programming-language agnostic, this talk demonstrates an implementation of the suggested design in R.

The takeaway of this talk is a system design in which data sources are plugins to the system’s analytic modules.
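
As a minimal sketch of this plugin idea, assuming base R only: the analytic module below receives a data-source function as an argument and checks only the columns it needs, so it never learns where the data came from. The names `fit_price_model()`, `source_simulated()`, and `source_stub()` and the stub table are illustrative assumptions, not taken from the talk; the column names and the simulated price range follow the slides in the transcript.

```r
# Illustrative sketch; all function names here are hypothetical.

# Data source plugin A: mtcars plus a simulated price column.
source_simulated <- function() {
  cars_data <- datasets::mtcars
  cars_data$car_model <- rownames(cars_data)
  rownames(cars_data) <- NULL
  set.seed(2020)
  cars_data$price <- runif(n = nrow(cars_data), min = 41, max = 75)
  cars_data
}

# Data source plugin B: a small hand-written table (made-up values)
# standing in for, say, a database extract.
source_stub <- function() {
  data.frame(
    car_model = c("A", "B", "C", "D"),
    price     = c(50, 60, 70, 55),
    gear      = c(3, 4, 5, 4),
    mpg       = c(21, 24, 30, 18)
  )
}

# Analytic module: depends only on the expected columns,
# not on any particular data source.
fit_price_model <- function(get_data) {
  cars_data <- get_data()
  expected_cols <- c("car_model", "price", "gear", "mpg")
  stopifnot(
    is.data.frame(cars_data),
    all(expected_cols %in% colnames(cars_data))
  )
  lm(price ~ mpg + gear, data = cars_data)
}

# Swapping the data source requires no change to the analytic module.
model_a <- fit_price_model(source_simulated)
model_b <- fit_price_model(source_stub)
```

Passing the data source in as a function is one way to keep the dependency pointing from the data source toward the analytic module; sourcing a file that defines a getter, as the slides do, achieves a similar effect.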

---

This presentation was given at satRday Auckland, February 2020.
See event details at https://auckland2020.satrdays.org/#portfolioModal7


Harel Lustiger

February 22, 2020

Transcript

  1. Data Tests: Separating between Data Sources and the Rest
    satRday, February 2020. Harel Lustiger, Data Scientist, Harmonic Analytics
  2. What are we doing? We are separating between the data source and the rest
  3. Dividing the analytic application into modules
    [Slide labels: analytic module, data source]

      # main.R
      Insert 500 hundred lines here
      ggplot(Tog3[-c(1,28),], aes(Date,Rec2All, colour = isChina))+
        geom_line(size = 2.2, alpha = 0.5)+
      data <- read.csv("../input/novel-corona-virus-2019-dataset.csv")
      data <- data[,c(1:2,4,6:8)]
      data$isChina <- ifelse(data$Country %in% c("Mainland China", "China"),"China", "Not China")
      data$Date <- as.Date(data$Date, format = "%m/%d/%y")
      Insert 100 hundred lines here
  4. Making the data source existence dependent on what matters
    [Diagram: Analytic Module, Database, Database Access, Data Source, Data Tests; relations: Depends On, Has a]
  5. What’s the Story? We are helping our clients by fitting a solution to their needs
  6. Getting to know our client: Euel Cheatam, the Mercedes dealership manager
  7. The proposed solution is a weekly fact sheet with popular Q&A
  8. The data scientists encounter real-world impediments

  9. Iteration Zero developing a data-driven app without real data

  10. Generate dataset assumption triplet 01

  11. The analytic application expects tidy data as its input (see https://r4ds.had.co.nz/tidy-data.html)

  12. We formulate a tidy dataset by making three assumptions: Observation Unique Identifier (UID), Target variable, Salient features
  13. Convert assumptions to assertions with data-tests 02

  14. # data-tests.R

      ## 1. Check if the dataset exists
      stopifnot(exists("cars_data"), is.data.frame(cars_data))
      ## 2. Check if the necessary columns exist
      expected_cols <- c("car_model", "price", "gear", "mpg")
      stopifnot(all(expected_cols %in% colnames(cars_data)))
      ## 3. Check if the records are unique
      is.distinct <- function(x) dplyr::n_distinct(x) == length(x)
      stopifnot(is.distinct(cars_data$car_model))
  15. Implement a data source plugin 03

  16. Quick glimpse over datasets::mtcars
    [Table annotations: Observation Unique Identifier, Salient Feature, Salient Feature]
  17. # data-source.R

      library(magrittr)  # provides the %>% pipe
      get_cars_data <- function(){
        ## 1. Generate records
        data(mtcars, package = "datasets")
        cars_data <- mtcars %>% tibble::rownames_to_column("car_model")
        ## 2. Generate price
        set.seed(2020)
        price <- runif(n = nrow(cars_data), min = 41, max = 75)
        cars_data <- cars_data %>% tibble::add_column(price = price)
        ## Run data-tests (local = TRUE so the tests see this function's cars_data)
        source("data-tests.R", local = TRUE)
        return(cars_data)
      }
  18. Develop an analytic module 04

  19. # app.R

      library(magrittr)  # provides the %>% pipe
      ## 1. Get the data
      source("data-source.R")
      cars_data <- get_cars_data()
      ## 2. Render booklet
      print(cars_data %>% dplyr::select(car_model, mpg, gear, price))
      boxplot(price ~ gear, cars_data)
      lm(price ~ mpg + gear, cars_data) %>% summary()
  20. Conclusion: Why and how we separate the data source from the rest
  21. Recall why we have separated the data source from the rest
    [Diagram: Database, Database Access, Analytic Module, Data Source; relations: Depends On, Has a]
  22. Recall why we have separated the data source from the rest
    [Diagram: Database, Database Access, Data Tests, Analytic Module, Data Source; relations: Depends On, Dictates, Has a]
  23. Recall how we have separated the data source from the rest: assumption triplet, data-tests.R, data-source.R, app.R
    Harel Lustiger, harel.lustiger@gmail.com
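
The three assumptions on slide 12 (a unique identifier, a target variable, and salient features) can also be checked in a way that reports every violated assumption instead of stopping at the first one. The following is a hedged sketch in base R; `check_assumptions()` and its arguments are hypothetical names and are not part of the talk's code.

```r
# Hypothetical helper: returns a character vector describing each
# violated assumption (empty when all three assumptions hold).
check_assumptions <- function(data, uid, target, features) {
  stopifnot(is.data.frame(data))
  failures <- character(0)

  # Assumption 1: the UID column exists and identifies each record uniquely.
  if (!uid %in% colnames(data)) {
    failures <- c(failures, sprintf("UID column '%s' is missing", uid))
  } else if (anyDuplicated(data[[uid]]) > 0) {
    failures <- c(failures, sprintf("UID column '%s' has duplicated values", uid))
  }

  # Assumption 2: the target variable exists.
  if (!target %in% colnames(data)) {
    failures <- c(failures, sprintf("target column '%s' is missing", target))
  }

  # Assumption 3: the salient features exist.
  missing_features <- setdiff(features, colnames(data))
  if (length(missing_features) > 0) {
    failures <- c(
      failures,
      sprintf("feature columns missing: %s", paste(missing_features, collapse = ", "))
    )
  }

  failures
}

# Usage with made-up data: a duplicated UID, no target, one missing feature.
broken <- data.frame(car_model = c("A", "A"), gear = c(3, 4))
problems <- check_assumptions(broken, "car_model", "price", c("gear", "mpg"))
print(problems)
```

Collecting the failures in one vector lets a data source plugin show the whole list at once, which may be convenient when a new source violates several assumptions at the same time.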