R from Data Analysis to Production
R User Group / MLDM Monday
2021-01-04
leoluyi@

About me
• Leo Lu
• Serve a data team in retail banking
• Build data products
• Data Engineering
• Data Viz
• Models
• Text Mining
• ...

What we'll discuss today
1. ❓ How hard is it to serve your data products
2. Things that should be noticed in the workflow
3. Packaging your data product

We are always on the road of hitting pitfalls and filling them in
©leoluyi, 2021

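Goal
In a data analysis environment with limited resources, how can internal mechanisms and a few practical tools make an R-based data analysis or service project more efficient, with fewer pitfalls?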

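As data analysts, we often...
get asked to rerun a complex analysis from half a year ago
• The original data is gone? There is new data: how should it be used?
• You forgot what on earth you wrote half a year ago ...
• The person who handed the work over to you has left the company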

虻碘獤ຉ窕纷ጱ物 蕦褾௔牏加粬௔牏脀୧௔ ©leoluyi, 2021 8

Stage 1: Static Reports
Pain points:
• At the start, we work hard to clean the data and make charts
• REPL (RStudio): after an interactive session, nothing is left behind
The ability to repeat experiments is part of the foundation for all science, and reproducible work is also critical for business applications.
Lessons:
• Keep every line of code that operates on the data, recorded in the order it was run

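Stage 2: Analysis Routines
Pain points:
• Repetitive manual copy and paste
• Facing a pile of source code with no idea where to start
Lessons:
• Reproducible reports with R Markdown
• Structure Your Code: know exactly where the entrypoint, data sources, parameters, and outputs are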

Stage 3: Interactive Reports
Pain points:
• What the boss wants to see is never quite what you built
• Unless someone in the business is able to use the insight we generate to make a better decision, our teams won't add any value.
Lessons:
• Data-as-a-Service: Interactive documents (Rmd) or Shiny Apps
• Interactive reports allow us to create true 'data products' that go beyond standard BI dashboards.
• (How about Power BI?)

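Stage 4: Shiny Server
Pain points:
• Work that runs only on one machine cannot be deployed
Lessons:
• Alternatives: ShinyProxy, a free and fully open-source Shiny app server framework developed by the company Open Analytics
• Covers all the features of Shiny Server Open Source
• Provides more management and control mechanisms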

Stage 5: APIs
Pain points:
• Every time, you have to start a run inside the environment before you get a result
Lessons:
• You need a daemon that keeps running
• REST API with plumber
• models, trigger jobs, ...

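Stage 6: Containerization
Pain points:
• Multiple environments and setups (there is always a colleague whose setup differs from yours)
Lessons:
• Builds take a very long time
• Images are huge
• But it is still worth it
dive rocker/shiny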

A Spectrum of Analysis and Production

Part 2. Things that should be noticed in the workflow

Things that should be noticed in the workflow
• Reproducible workflow: from zero to one
• Environments: R and Package Versioning, Dependency Management

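Structure Your Code
• A question of roles: in the past, a data analyst's work rarely covered the engineering side, yet it affects later management and deployment
• Some teams choose to bundle the project into a Package
Keep your code tidy.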

Code Organization
Manage your Data Science project structure at an early stage.
The actual layout depends on your needs.

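Don't save your data in memory or on disk
You have surely seen files like these in a directory...
├── raw_data_20201221.csv
├── raw_data_20201221_v2.csv
└── raw_data_20201223.csv
How to make sure you get the same, correct data from the source is an important question, but forcibly saving the data to disk lets you ignore that question later on.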

Entrypoint: startup.R or run.R
This file will...
1. build your environment
2. set global variables
3. source in all other code files
4. (pull data)
5. render your report / model / data
Reproduce from start to finish.
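The five steps above could be sketched as a single run.R along these lines. This is only an illustration, not the speaker's actual file: the R/ folder, the config list, pull_data(), and the output path are all assumptions, and the built-in iris dataset stands in for a real data pull.

```r
# run.R: hypothetical entrypoint; every path and name below is an assumption.

# 1. Build the environment: restore the project library if renv is in use.
if (file.exists("renv.lock") && requireNamespace("renv", quietly = TRUE)) {
  renv::restore(prompt = FALSE)
}

# 2. Set global variables in one place.
config <- list(
  report_date = Sys.Date(),
  output_dir  = file.path(tempdir(), "output")
)
dir.create(config$output_dir, showWarnings = FALSE, recursive = TRUE)

# 3. Source every other code file (an "R/" folder is assumed here).
for (f in list.files("R", pattern = "\\.[Rr]$", full.names = TRUE)) {
  source(f)
}

# 4. Pull data. A built-in dataset stands in for a real query.
pull_data <- function() datasets::iris
raw <- pull_data()

# 5. Render the output: here, a small summary table written to disk.
summary_tbl <- aggregate(Sepal.Length ~ Species, data = raw, FUN = mean)
out_file <- file.path(config$output_dir, "summary.csv")
write.csv(summary_tbl, out_file, row.names = FALSE)
```

Running Rscript run.R from the project root then reproduces the output from start to finish, without relying on anything left over in an interactive session.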

Environment & Dependency Management
• Isolated: Each project gets its own library of R packages.
• Portable: Captures the state of your R packages.
• Reproducible: Later restore your R library exactly as specified.
Source: renv

Install R requirements
Use a requirements.txt file (gist), just as we would in Python.
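The gist itself is not reproduced on the slide, so the following is only a guess at the idea: a minimal sketch that assumes requirements.txt lists one package name per line, with "#" comments allowed. The helper names read_requirements() and install_requirements() are hypothetical.

```r
# Hypothetical helper: read package names, one per line,
# skipping blank lines and "#" comments.
read_requirements <- function(path = "requirements.txt") {
  lines <- trimws(readLines(path, warn = FALSE))
  lines[nzchar(lines) & !startsWith(lines, "#")]
}

# Hypothetical helper: install only the packages that are not present yet.
install_requirements <- function(path = "requirements.txt") {
  wanted     <- read_requirements(path)
  to_install <- setdiff(wanted, rownames(installed.packages()))
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}
```

Unlike a lockfile, a plain list like this does not pin versions, which is one reason to look at renv next.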

The renv Workflow
1. renv::init() to initialize a new project-local environment,
2. Work in the project as normal, installing and removing new R packages as they are needed in the project,
3. renv::snapshot() to save the state of the project library to the lockfile (called renv.lock),
4. Continue working on your project, installing and updating R packages as needed.
5. renv::snapshot() again to save the state of your project library if your attempts to update R packages were successful, or call renv::restore() to revert to the previous state as encoded in the lockfile if your attempts to update packages introduced some new problems.
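Step 5 is the decision point of the workflow, and it can be sketched as a small helper. The function name finish_update() and its snapshot/restore arguments are hypothetical and exist only so the logic can be tried without touching a real library; the defaults are the real renv::snapshot() and renv::restore() calls.

```r
# Hypothetical helper for step 5. The default arguments are only evaluated
# when they are actually used, so renv is needed only for real runs.
finish_update <- function(update_ok,
                          snapshot = renv::snapshot,
                          restore  = renv::restore) {
  if (update_ok) {
    snapshot()   # record the new package versions in renv.lock
    "snapshot"
  } else {
    restore()    # roll the library back to what renv.lock describes
    "restore"
  }
}
```

In a real project you would call renv::init() once, work as usual, and then call finish_update(TRUE) or finish_update(FALSE) depending on whether the package update behaved.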

ls -l ./renv/library/R-4.0/x86_64-apple-darwin17.0

renv::snapshot()
## The following package(s) will be updated in the lockfile:
##
## # CRAN ===============================
## - KernSmooth [* -> 2.23-16]
## - boot [* -> 1.3-24]
## - class [* -> 7.3-16]
## - cluster [* -> 2.1.0]
## - codetools [* -> 0.2-16]
## - fastmap [* -> 1.0.1]
## - foreign [* -> 0.8-75]
## - miniUI [* ->]
## - nnet [* -> 7.3-13]
## - rpart [* -> 4.1-15]
## - shiny [* ->]
## - sourcetools [* -> 0.1.7]
## - spatial [* -> 7.3-11]
## - survival [* -> 3.1-11]
## - xtable [* -> 1.8-4]
##
## * Lockfile written to '~/Documents/website-source/renv.lock'.

Dotfiles and activate function used by renv
File               Usage
.Rprofile          Used to activate renv for new R sessions launched in the project.
renv.lock          The lockfile, describing the state of your project's library at some point in time.
renv/activate.R    The activation script run by the project .Rprofile.
renv/library       The private project library. (symlink)

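Full reproducibility
• renv already meets most package versioning needs, provided you work in a consistent environment
• Some R packages depend on the system environment
• More often, other people do not want to use renv to reinstall every package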

Part 3. Packaging your data product

Packaging your data product
• APIs
• Containerize
• Service Management, CI/CD Funnel

Plumber
Plumber allows you to create a web API by merely decorating your existing R source code with special comments.

Plumber - Getting Started
plumber.R
#' @get /hello
#' @serializer html
function(){
  "<html><h1>hello world</h1></html>"
}

Plumber - Getting Started
Endpoints:
#' @get /hi
#' @post /hi
#' @put /hi
#' @delete /hi
#' @head /hi
function(){
  ...
}

Plumber - Getting Started
plumber.R
#' Echo the parameter sent in
#' @param msg The message to echo back.
#' @get /echo
function(msg=""){
  list(msg = paste0("The message is: '", msg, "'"))
}

Plumber - Getting Started
plumber.R
#' Plot out data from the iris dataset
#' @param spec If provided, filter the data to only this species (e.g. 'setosa')
#' @get /plot
#' @serializer png
function(spec){
  my_data <- iris
  title <- "All Species"
  # Filter if the species was specified
  if (!missing(spec)){
    title <- paste0("Only the '", spec, "' Species")
    my_data <- subset(iris, Species == spec)
  }
  plot(my_data$Sepal.Length, my_data$Petal.Length,
       main=title, xlab="Sepal Length", ylab="Petal Length")
}

Plumber - Getting Started
library(plumber)  # provides plumb(), pr_run(), and the %>% pipe
pr <- plumb("plumber.R")
pr %>% pr_run(port = 8000)
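Once the API is running, any HTTP client can call it. A minimal R client for the /echo endpoint might look like the sketch below. It assumes the server is listening on localhost:8000 as in the pr_run() call and that the jsonlite package is installed; echo_url() and call_echo() are hypothetical helper names.

```r
# Hypothetical helper: build the request URL for the /echo endpoint.
echo_url <- function(msg, base_url = "http://localhost:8000") {
  paste0(base_url, "/echo?msg=", utils::URLencode(msg, reserved = TRUE))
}

# Hypothetical helper: call the endpoint and parse the JSON response.
# Needs a running plumber server and the jsonlite package.
call_echo <- function(msg, base_url = "http://localhost:8000") {
  jsonlite::fromJSON(echo_url(msg, base_url))
}
```

With the server up, call_echo("hello")$msg should return the echoed message; from a shell, the same URL can be requested with curl.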

Advanced Plumber - Multiple Applications on One Port
nginx:
  image: nginx:1.9
  ports:
    - "80:80"
  volumes:
    - ./nginx.conf:/etc/nginx/nginx.conf:ro
  restart: always
  depends_on:
    - app1
    - app2

Service Management - Logging
At the moment, there is still no native library for logging.
futile.logger
• actively maintained
• supports JSON error logging
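A minimal futile.logger setup for a service could look like the sketch below. The log file location and the messages are made up for illustration, and the JSON layout is switched on only if jsonlite is available, since layout.json depends on it.

```r
log_file <- file.path(tempdir(), "service.log")  # hypothetical log location

if (requireNamespace("futile.logger", quietly = TRUE)) {
  library(futile.logger)
  flog.threshold(INFO)                     # ignore DEBUG and TRACE messages
  flog.appender(appender.file(log_file))   # write to a file, not the console
  if (requireNamespace("jsonlite", quietly = TRUE)) {
    flog.layout(layout.json)               # emit each entry as JSON
  }
  # Messages are formatted up front so they carry no stray format codes.
  flog.info(sprintf("Service started on port %d", 8000L))
  flog.error(sprintf("Could not reach data source: %s", "timeout"))
}
```

Writing to a file (or to stdout inside a container) is what makes the entries available to whatever log retention mechanism the organization requires.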

Dockerize!

What's the cost?
• Base image size: 1.6 GB
• Packages: ???
• Build time elapsed: 30 mins (the first build, on my MBA)

Testing for Shiny: Shinytest
Why test Shiny applications? There are many possible reasons for an application to stop working:
• You make modifications to your application.
• An external data source stops working.
• An R package is upgraded.
Shinytest uses a snapshot-based testing strategy.
library(shinytest)
recordTest("simple-app/")
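After recordTest() has saved a test script and its expected snapshots, the recorded tests can be replayed, for example from a CI job. The wrapper below is a hedged sketch: run_snapshot_tests() is a hypothetical name, and it assumes tests were already recorded for the app in "simple-app/".

```r
# Hypothetical wrapper: replay recorded shinytest tests for one app.
run_snapshot_tests <- function(app_dir = "simple-app/") {
  if (!requireNamespace("shinytest", quietly = TRUE)) {
    message("shinytest is not installed; skipping snapshot tests")
    return(invisible(FALSE))
  }
  # testApp() reruns the recorded scripts and compares the results
  # with the snapshots saved at recording time.
  shinytest::testApp(app_dir)
  invisible(TRUE)
}
```

If the app's output changed on purpose, the stored snapshots have to be updated before the tests pass again.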

CI/CD - TravisCI (.travis.yml)

Deploy to cloud!
plumber_example

Summary

Constraints we have:
• Package security controls
• Separate test and production environments
• Release process
• Service monitoring and log retention
• Isolated internal network

What we've talked about today
1. ❓ How hard is it to serve your data products
• A Spectrum of Analysis and Production
2. Things that should be noticed in the workflow
• Reproducible workflow: from zero to one
• Environments: R and Package Versioning, Dependency Management
3. Packaging your data product
• APIs
• Containerize
• Service Management, CI/CD Funnel

Q&A and Discussion
Leo Lu
leoluyi@github

Thanks for listening