
R from Data Analysis to Production

Leo Lu
January 04, 2021

This meetup is held at the 三創育成基金會 venue (11F).

2021-01-04 19:30-20:30

Talk: R from Data Analysis to Production

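R is a highly productive language for data analysis: a few lines of script are enough for data processing and profiling, computation, analysis, modeling, and data visualization, and even for building web apps (Shiny) and API services. Beyond its strength in statistical analysis, its rich package ecosystem and framework tooling let companies put data applications into practice quickly, which makes R one of the options for production deployment. However, a data analysis team that uses R as its main tool for fast-iterating research and experiments will eventually face operational work such as building automated data pipelines, reports, and model services. Production-level concerns are rarely considered at the initial stage of development, let alone the day-to-day problems of keeping an R analysis environment isolated, reproducible, and robust.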

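This talk shares how, in a data analysis environment with limited resources, in-house organizational mechanisms and a few practical tools can help an R-based data analysis or service project hit fewer pitfalls. The content includes: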

- Development Principles
- Environments: R and Package Versioning, Dependency Management
- Packaging your Apps
- Service Management, CI/CD Funnel

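Target audience:

- Analysis teams that develop with R, alone or mixed with other languages
- Teams with limited development and operations processes or resources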


Transcript

  1. R from Data Analysis to Production
    R User Group / MLDM Monday, 2021-01-04
    leoluyi
    ©leoluyi, 2021
  2. About me
    • Leo Lu
    • Serve a data team in retail banking
    • Build data products
    • Data Engineering
    • Data Viz
    • Models
    • Text Mining
    • ...
  3. What we'll discuss today
    1. ❓ How hard is it to serve your data products
    2. Things that should be noticed in the workflow
    3. Packaging your data product
  4. Stage 1: Static Reports
    Pain points:
    • At first we work very hard at processing data and building reports
    • REPL (RStudio): after an interactive session, nothing is left behind
    The ability to repeat experiments is part of the foundation for all science, and reproducible work is also critical for business applications.
    Lessons: Keep every line of code you run against the data, and keep a record of the processing steps in order.
  5. Stage 2: Analysis Routines
    Pain points:
    • Repetitive manual copy-and-paste
    • Facing a pile of source code without knowing where to start
    Lessons:
    • Reproducible reports with R Markdown
    • Structure Your Code: know clearly the entrypoint, data sources, parameters, and outputs
  6. Stage 3: Interactive Reports
    Pain points:
    • What the boss wants to see is never quite what you built
    • Unless someone in the business is able to use the insight we generate to make a better decision, our teams won't add any value.
    Lessons:
    • Data-as-a-Service: Interactive documents (Rmd) or Shiny Apps
    • Interactive reports allow us to create true 'data products' that go beyond standard BI dashboards.
    • (How about PowerBI?)
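  7. Stage 4: Shiny Server
    Pain points:
    • Single-machine operation that cannot be deployed
    Lessons:
    • Alternatives: ShinyProxy, a free and fully open-source Shiny App server framework developed by the company Open Analytics
    • Covers all the features of Shiny Server Open Source
    • Provides more management and control mechanisms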
  8. Stage 5: APIs
    Pain points:
    • Every time, you have to start and run the code inside the environment to get a result
    Lessons:
    • Need a continuously running daemon
    • REST API with plumber
    • models, trigger jobs, ...
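Serving a model behind a REST endpoint can be sketched roughly as follows. This is an illustration, not code from the talk: the `/predict` route, the `predict_petal_length` name, and the linear model on the built-in `iris` data are all assumptions, chosen so the function can be tried in a plain R session. The `#'` lines are the kind of annotations plumber reads when the file is served.

```r
# Hypothetical example: fit a small model once at startup so that each
# request only has to call predict().
model <- lm(Petal.Length ~ Sepal.Length, data = iris)

#' Predict petal length from sepal length
#' @param sepal_length Sepal length in cm (query parameters arrive as strings)
#' @get /predict
predict_petal_length <- function(sepal_length) {
  x <- suppressWarnings(as.numeric(sepal_length))
  if (length(x) != 1 || is.na(x)) {
    stop("sepal_length must be a single number")
  }
  pred <- predict(model, newdata = data.frame(Sepal.Length = x))
  list(sepal_length = x, petal_length = unname(pred))
}
```

Calling `predict_petal_length("5.0")` directly returns the same list that would be serialized in the HTTP response, which makes the handler easy to check before it is put behind a daemon.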
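  9. Stage 6: Containerization
    Pain points:
    • Multiple environments, multiple languages (there is always a colleague who uses something different from you)
    Lessons:
    • Builds take a very long time
    • Images are very large
    • But still worth it
    dive rocker/shiny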
  10. Things that should be noticed in the workflow
    • Reproducible workflow: from zero to one
    • Environments: R and Package Versioning, Dependency Management
  11. Code Organization
    Manage your Data Science project structure at an early stage.
    The actual layout depends on your needs.
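  12. Don't save your data in memory or on disk
    You have surely seen files like these in a directory...
        ├── raw_data_20201221.csv
        ├── raw_data_20201221_v2.csv
        └── raw_data_20201223.csv
    How to make sure you get the same, correct data source is an important question, but forcibly saving copies of the data lets you overlook that question later.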
  13. Entrypoint: startup.R or run.R
    This file will...
    1. build your environment
    2. set global variables
    3. source in all other code files
    4. (pulling data)
    5. render your report / model / data
    Reproduce from start to finish.
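The five steps above could be laid out along these lines. This is a sketch under assumptions, not the speaker's file: the `R/` and `output/` directory names, the `run_project` wrapper, and the convention that the project defines a `main(config)` function are all hypothetical.

```r
# run.R: hypothetical entrypoint sketch. Directory names and the main()
# convention are assumptions made for illustration.

run_project <- function(project_dir = ".") {
  # 1. Build the environment. Here we only check that required packages
  #    are installed (base packages, so the sketch runs anywhere).
  required <- c("stats", "utils")
  missing_pkgs <- required[!vapply(required, requireNamespace,
                                   logical(1), quietly = TRUE)]
  if (length(missing_pkgs) > 0) {
    stop("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  }

  # 2. Set global variables in one place.
  config <- list(
    project_dir = project_dir,
    output_dir  = file.path(project_dir, "output"),
    run_date    = Sys.Date()
  )
  dir.create(config$output_dir, showWarnings = FALSE, recursive = TRUE)

  # 3. Source all other code files (assumed to live under R/).
  env <- new.env(parent = globalenv())
  scripts <- sort(list.files(file.path(project_dir, "R"),
                             pattern = "\\.[Rr]$", full.names = TRUE))
  for (f in scripts) sys.source(f, envir = env)

  # 4-5. Pull data and render the output through a function the project
  #      defines itself (assumed to be called main).
  if (!exists("main", envir = env, inherits = FALSE)) {
    stop("Expected a main(config) function defined under R/")
  }
  env$main(config)
}
```

With this layout, `Rscript -e 'source("run.R"); run_project()'` would rebuild every output from a clean session, which is the point of having a single entrypoint.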
  14. Environment & Dependency Management
    • Isolated: Each project gets its own library of R packages.
    • Portable: Captures the state of your R packages.
    • Reproducible: Later restore your R library exactly as specified.
    Source: renv (https://blog.rstudio.com/2019/11/06/renv-project-environments-for-r/)
  15. Install R requirements
    Use a requirements.txt file (see the gist), just as we do in Python.
    gist: https://gist.github.com/leoluyi/10888517e7833971ae0d375f40afdbb0
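One way such a file could be consumed from R is sketched below. The contents of the linked gist are not reproduced here, so the file format (`pkg` or `pkg==version` per line, with `#` comments, modelled on pip) and both function names are assumptions.

```r
# Hypothetical sketch: the "pkg" / "pkg==version" format is an assumption
# modelled on pip, not taken from the gist.

parse_requirements <- function(lines) {
  lines <- trimws(sub("#.*$", "", lines))   # drop comments and whitespace
  lines <- lines[nzchar(lines)]             # drop empty lines
  parts <- strsplit(lines, "==", fixed = TRUE)
  data.frame(
    package = vapply(parts, function(p) trimws(p[1]), character(1)),
    version = vapply(parts, function(p) {
      if (length(p) > 1) trimws(p[2]) else NA_character_
    }, character(1)),
    stringsAsFactors = FALSE
  )
}

install_requirements <- function(path = "requirements.txt") {
  reqs <- parse_requirements(readLines(path, warn = FALSE))
  to_install <- setdiff(reqs$package, rownames(installed.packages()))
  if (length(to_install) > 0) {
    # Installs the current repository version; pinned versions are only
    # parsed here, not enforced.
    install.packages(to_install)
  }
  invisible(reqs)
}
```

Because this sketch does not enforce the pinned versions, it covers only the "install what is missing" half of the problem; exact version pinning is what a lockfile-based tool such as renv is for.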
  16. The renv Workflow
    1. renv::init() to initialize a new project-local environment,
    2. Work in the project as normal, installing and removing new R packages as they are needed in the project,
    3. renv::snapshot() to save the state of the project library to the lockfile (called renv.lock),
    4. Continue working on your project, installing and updating R packages as needed.
    5. renv::snapshot() again to save the state of your project library if your attempts to update R packages were successful, or call renv::restore() to revert to the previous state as encoded in the lockfile if your attempts to update packages introduced some new problems.
    Source: https://rstudio.github.io/renv/articles/renv.html
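The snapshot-or-restore decision in step 5 can be wrapped in a small helper, sketched here. It assumes the renv package is installed; `start_project` and `try_update` are hypothetical names, and the sketch treats only an R error as a failed update, which is narrower than "introduced some new problems".

```r
# Sketch only: the renv calls are wrapped in functions so nothing runs
# until you call them. Assumes the renv package is installed.

start_project <- function(project = ".") {
  # Step 1: create the project-local library and renv.lock.
  renv::init(project = project, restart = FALSE)
}

try_update <- function(update_fn, project = ".") {
  # Steps 4-5: attempt an update, then snapshot on success or
  # restore the lockfile state on failure.
  ok <- tryCatch({
    update_fn()
    TRUE
  }, error = function(e) {
    message("Update failed: ", conditionMessage(e))
    FALSE
  })
  if (ok) {
    renv::snapshot(project = project, prompt = FALSE)
  } else {
    renv::restore(project = project, prompt = FALSE)
  }
  invisible(ok)
}
```

A call such as `try_update(function() install.packages("shiny"))` would then leave the lockfile either updated or untouched, never half-changed.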
  17. renv::snapshot()
        ## The following package(s) will be updated in the lockfile:
        ##
        ## # CRAN ===============================
        ## - KernSmooth    [* -> 2.23-16]
        ## - boot          [* -> 1.3-24]
        ## - class         [* -> 7.3-16]
        ## - cluster       [* -> 2.1.0]
        ## - codetools     [* -> 0.2-16]
        ## - fastmap       [* -> 1.0.1]
        ## - foreign       [* -> 0.8-75]
        ## - miniUI        [* -> 0.1.1.1]
        ## - nnet          [* -> 7.3-13]
        ## - rpart         [* -> 4.1-15]
        ## - shiny         [* -> 1.4.0.2]
        ## - sourcetools   [* -> 0.1.7]
        ## - spatial       [* -> 7.3-11]
        ## - survival      [* -> 3.1-11]
        ## - xtable        [* -> 1.8-4]
        ##
        ## * Lockfile written to '~/Documents/website-source/renv.lock'.
  18. Dotfiles and activate function used by renv
    File               Usage
    .Rprofile          Used to activate renv for new R sessions launched in the project.
    renv.lock          The lockfile, describing the state of your project's library at some point in time.
    renv/activate.R    The activation script run by the project .Rprofile.
    renv/library       The private project library. (symlink)
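  19. Full reproducibility
    • renv already covers most package version control needs, provided the environment is consistent
    • Some R packages have system-level dependencies
    • More often, other people do not want to use renv to reinstall every package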
  20. Packaging your data product
    • APIs
    • Containerize
    • Service Management, CI/CD Funnel
  21. Plumber
    Plumber allows you to create a web API by merely decorating your existing R source code with special comments.
  22. Plumber - Getting Started
    plumber.R
        #' @get /hello
        #' @serializer html
        function(){
          "<html><h1>hello world</h1></html>"
        }
  23. Plumber - Getting Started
    Endpoints:
        #' @get /hi
        #' @post /hi
        #' @put /hi
        #' @delete /hi
        #' @head /hi
        function(){
          ...
        }
  24. Plumber - Getting Started
    plumber.R
        #' Echo the parameter sent in
        #' @param msg The message to echo back.
        #' @get /echo
        function(msg=""){
          list(msg = paste0("The message is: '", msg, "'"))
        }
  25. Plumber - Getting Started
    plumber.R
        #' Plot out data from the iris dataset
        #' @param spec If provided, filter the data to only this species (e.g. 'setosa')
        #' @get /plot
        #' @serializer png
        function(spec){
          my_data <- iris
          title <- "All Species"

          # Filter if the species was specified
          if (!missing(spec)){
            title <- paste0("Only the '", spec, "' Species")
            my_data <- subset(iris, Species == spec)
          }

          plot(my_data$Sepal.Length, my_data$Petal.Length,
               main=title, xlab="Sepal Length", ylab="Petal Length")
        }
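The slides show the annotated endpoints but not how the file is served. A minimal launcher might look like the sketch below, assuming plumber 1.0 or later is installed and the endpoints are saved as `plumber.R`; the `serve_api` name, host, and port are arbitrary choices for illustration.

```r
# Sketch: serve an annotated plumber file. Assumes plumber (>= 1.0) is
# installed; the file name and port are assumptions.
serve_api <- function(file = "plumber.R", port = 8000) {
  api <- plumber::pr(file)                              # parse annotations into a router
  plumber::pr_run(api, host = "0.0.0.0", port = port)   # blocks while serving
}
```

Once `serve_api()` is running, a request such as `curl "http://localhost:8000/echo?msg=hello"` should reach the `/echo` endpoint defined earlier.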
  26. Advanced Plumber - Multiple Applications on One Port
        nginx:
          image: nginx:1.9
          ports:
            - "80:80"
          volumes:
            - ./nginx.conf:/etc/nginx/nginx.conf:ro
          restart: always
          depends_on:
            - app1
            - app2
  27. Service Management - Logging
    At the moment, there is still no native library for logging.
    futile.logger
    • actively maintained
    • supports JSON error logging
    • semantics similar to Python's logging and to log4j
    • may be complicated
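A basic futile.logger setup for a service might look like the following sketch. It assumes the futile.logger package is installed; the log file location and the messages are made up for illustration.

```r
# Sketch: assumes the futile.logger package is installed. The log file
# name and the messages are made up for illustration.
library(futile.logger)

log_file <- file.path(tempdir(), "service.log")

# Send messages at INFO and above to both the console and a file.
flog.threshold(INFO)
flog.appender(appender.tee(log_file))

flog.info("Service started on port %d", 8000)
flog.debug("This message is below the threshold and is dropped")

# Log an error from a failing step instead of letting it pass silently.
result <- tryCatch(
  stop("model file not found"),
  error = function(e) {
    flog.error("Request failed: %s", conditionMessage(e))
    NULL
  }
)
```

The threshold and appender calls mirror the level and handler concepts in Python's logging, which is the similarity the slide points to.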
  28. What's the cost?
    • Base image size: 1.6 GB
    • Packages: ???
    • Build time elapsed: 30 mins (first build, on my MBA)
  29. Testing for Shiny: Shinytest
    Why test Shiny applications? There are many possible reasons for an application to stop working:
    • You make modifications to your application.
    • An external data source stops working.
    • An R package is upgraded.
    Shinytest uses a snapshot-based testing strategy.
        library(shinytest)
        recordTest("simple-app/")
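  30. Constraints we have:
    • Package security controls
    • Separate test and production environments
    • Verification and release process
    • Service monitoring and log retention mechanisms
    • Isolated internal network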
  31. What we've talked about today
    1. ❓ How hard is it to serve your data products
        • A Spectrum of Analysis and Production
    2. Things that should be noticed in the workflow
        • Reproducible workflow: from zero to one
        • Environments: R and Package Versioning, Dependency Management
    3. Packaging your data product
        • APIs
        • Containerize
        • Service Management, CI/CD Funnel