Slide 1

Slide 1 text

R from Data Analysis to Production R User Group / MLDM Monday 2021-01-04 leoluyi@ӣ獺 ©leoluyi, 2021 1

Slide 2

Slide 2 text

About me
• 㸎瓽 Leo Lu
• Serve a data team in retail banking
• Build data products
• Data Engineering
• Data Viz
• Models
• Text Mining
• ...

Slide 3

Slide 3 text

What we'll discuss today
1. How hard is it to serve your data products
2. Things that should be noticed in the workflow
3. Packaging your data product

Slide 4

Slide 4 text

౯㮉Ӟፗ᮷ࣁ ᪴襊膏霘ࣗጱ᪠Ӥ ©leoluyi, 2021 4

Slide 5

Slide 5 text

ፓ秂 ࣁ磪褖虻რጱ虻碘獤ຉ絑हӾ牧ই֜蝚螂奲而獉蟂ጱ秚ګ牧现Ӟ 犚䋿֢Ӥጱૡٍ牧虏Ӟ㮆犥 R 傶च器ጱ虻碘獤ຉ౲๐率䌕礯ݢ犥 ๅ磪硳ሲ ੝Ӟ犚襊 ©leoluyi, 2021 5

Slide 6

Slide 6 text

As data analysts, we often... ᤩᥝ穩᯿碝䁆ᤈӞ㮆܎ଙ獮ጱ蕦褾獤ຉ礯 • ܻত虻碘犋憎ԧ牪磪碝ጱ虻碘ᥝ㶓獈ֵአ • 盛ԧ܎ଙ獮ጱᛔ૩ࣁ䌃ӣX ... • Ի矑妔౯ጱᮎ㮆Ո櫝肬ԧ ©leoluyi, 2021 6

Slide 7

Slide 7 text


Slide 8

Slide 8 text

虻碘獤ຉ窕纷ጱ物 蕦褾௔牏加粬௔牏脀୧௔ ©leoluyi, 2021 8

Slide 9

Slide 9 text

Stage 1: Static Reports Pain points: • Ӟ樄ত౯㮉盄ۘێጱ碉ቘ虻碘牏狶瑽蔭 • REPL (RStudio): 聜䙼԰㵕ୗጱ砺֢盅牧Ջ 讕᮷䷱磪ኸӥ The ability to repeat experiments is part of the foundation for all science, and reproducible work is also critical for business applications. Lessons: ྯӞᤈ֦䌘虻碘砺֢ጱ code ᮷ᥝኸӥ牧㪔ᆙ 殼ଧ碉ቘኸӥ戢懿 ©leoluyi, 2021 9

Slide 10

Slide 10 text

Stage 2: Analysis Routines Pain points: • ᯿蕦ጱ蕦蕣揳Ӥಋૡ • ፡茐Ӟञ source code 犋Ꭳই֜ӥಋ Lessons: • Reproducible reports with R Markdown • Structure Your Code: 竃༩Ꭳ螇 Entrypoint牏虻碘რ牏㷢碍牏叨ڊ ©leoluyi, 2021 10

Slide 11

Slide 11 text


Slide 12

Slide 12 text

Stage 3: Interactive Reports
Pain points:
• 聲樿ᥝ፡ጱ࿞螐蚤֦狶ڊ㬵ጱ犋Ӟ䰬
• Unless someone in the business is able to use the insight we generate to make a better decision, our teams won't add any value.
Lessons:
• Data-as-a-Service: Interactive documents (Rmd) or Shiny Apps
• Interactive reports allow us to create true 'data products' that go beyond standard BI dashboards.
• (How about PowerBI?)

Slide 13

Slide 13 text

Stage 4: Shiny Server Pain points: • 㻌秚֢禂篷ဩ蟂ᗟ Lessons: • Alternatives: ShinyProxy Ӟ樌֡ஞ獍ݪ Open Analytics 樄咳ጱӞ㮆 ع揲Ӭਠ獊樄რጱ Shiny App ֑๐瑊໛礍 • 窩荠ԧ Shiny Server Open Source ጱ ಅ磪ۑ胼 • ൉׀ๅग़ጱᓕ矒秚ګ牐 ©leoluyi, 2021 13

Slide 14

Slide 14 text


Slide 15

Slide 15 text

Stage 5: APIs Pain points: • ྯ稞᮷ᥝࣁ絑ह愊ᶎ䁆ᤈ珸㵕಍胼஑ ک奾ຎ Lessons: • 襑ᥝӞ㮆瞱媲䁆ᤈጱ daemon • REST API with plumber • models, trigger jobs, ... ©leoluyi, 2021 15

Slide 16

Slide 16 text

Stage 6: Containerization Pain points: • ग़絑हग़承᥺ (者ฎ䨝磪ݶԪ蚤֦አጱ 䩚ᥜ犋Ӟ䰬) Lessons: • Build ᩻粁ԋ • Image ᩻胅य़ • ֕螭ฎ꧊஑ dive rocker/shiny ©leoluyi, 2021 16

Slide 17

Slide 17 text

A Spectrum of Analysis and Production

Slide 18

Slide 18 text

Part 2. Things that should be noticed in the workflow

Slide 19

Slide 19 text

Things that should be noticed in the workflow
• Reproducible workflow: from zero to one
• Environments: R and Package Versioning, Dependency Management

Slide 20

Slide 20 text

Structure Your Code • 肬胼㺔氂: 犥ஃ虻碘獤ຉ䒍ጱૡ֢犋ॡ 䨝窩荠کૡ纷ᶎ牧֕㶴䨝୽段ک෭盅 ᓕቘ޾蟂ᗟ • 磪犚㿁褧䨝螡䢔಩䌕礯۱౮ Package Keep your code tidy. ©leoluyi, 2021 20

Slide 21

Slide 21 text

Code Organization Manage your Data Science project structure in early stage. 䋿褬蟴ᗝ狅襑穩ᘒ吖 ©leoluyi, 2021 21

Slide 22

Slide 22 text


Slide 23

Slide 23 text

Don't save your data in memory or on disk
֦Ӟਧ፡螂蝡䰬ጱ䲆礯ࣁፓ袅愊ᶎ...
├── raw_data_20201221.csv
├── raw_data_20201221_v2.csv
└── raw_data_20201223.csv
ই֜嘦狒玲஑ݶ䰬/ྋ嘦ጱ虻碘㬵რฎӞ㮆᯿ᥝጱ㺔氂牧֕Ꮭ಩虻 碘ਂӥ㬵䨝虏֦ԏ盅஺ኼധ蝡㮆㺔氂

Slide 24

Slide 24 text

Entrypoint: startup.R or run.R
This file will...
1. build your environment
2. set global variables
3. source in all other code files
4. (pull data)
5. render your report / model / data
Reproduce from start to finish.
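A minimal sketch of such an entrypoint, following the five steps listed. The directory names, the environment variables, and the functions pull_data() and render_report() are assumptions made for illustration; they are not from the talk.

```r
# run.R: hypothetical entrypoint sketch (all names below are assumptions)

# 1. Build the environment: restore the package library if renv is in use.
if (file.exists("renv.lock") && requireNamespace("renv", quietly = TRUE)) {
  renv::restore(prompt = FALSE)
}

# 2. Set global variables in one place.
config <- list(
  data_dir   = Sys.getenv("DATA_DIR", unset = "data"),
  output_dir = Sys.getenv("OUTPUT_DIR", unset = "output"),
  run_date   = Sys.Date()
)

# 3. Source all other code files in a stable (alphabetical) order.
r_files <- sort(list.files("R", pattern = "\\.[Rr]$", full.names = TRUE))
for (f in r_files) source(f, local = FALSE)

# 4. Pull data and 5. render the output, but only if the sourced files
#    define the (hypothetical) functions pull_data() and render_report().
if (exists("pull_data") && exists("render_report")) {
  dat <- pull_data(config)
  render_report(dat, config)
}
```

Running `Rscript run.R` from the project root would then replay the whole analysis in one command.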

Slide 25

Slide 25 text

Environment & Dependency Management
• Isolated: Each project gets its own library of R packages.
• Portable: Captures the state of your R packages.
• Reproducible: Later restore your R library exactly as specified.
Source: renv, https://blog.rstudio.com/2019/11/06/renv-project-environments-for-r/

Slide 26

Slide 26 text

Install R requirements
Use a requirements.txt file, just as we do in Python.
Gist: https://gist.github.com/leoluyi/10888517e7833971ae0d375f40afdbb0
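One way this idea could look in base R is sketched below. It is not the content of the gist: the file format ("pkg" or "pkg==version", with "#" comments) and both function names are assumptions modelled on pip.

```r
# Parse lines such as "dplyr" or "shiny==1.4.0.2".
# Blank lines and "#" comments are skipped.
parse_requirements <- function(lines) {
  lines <- trimws(sub("#.*$", "", lines))
  lines <- lines[nzchar(lines)]
  parts <- strsplit(lines, "==", fixed = TRUE)
  data.frame(
    package = vapply(parts, function(p) p[1], character(1)),
    version = vapply(
      parts,
      function(p) if (length(p) > 1) p[2] else NA_character_,
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}

# Install only the packages that are not yet in the library.
# Pinned versions are parsed but not enforced in this sketch.
install_requirements <- function(path = "requirements.txt") {
  reqs <- parse_requirements(readLines(path))
  to_install <- setdiff(reqs$package, rownames(installed.packages()))
  if (length(to_install) > 0) install.packages(to_install)
  invisible(reqs)
}

reqs <- parse_requirements(c("# core", "shiny==1.4.0.2", "", "plumber"))
print(reqs)
```

For exact version pinning, a lockfile-based tool such as renv is the more robust choice.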

Slide 27

Slide 27 text

The renv Workflow
1. renv::init() to initialize a new project-local environment,
2. Work in the project as normal, installing and removing new R packages as they are needed in the project,
3. renv::snapshot() to save the state of the project library to the lockfile (called renv.lock),
4. Continue working on your project, installing and updating R packages as needed.
5. renv::snapshot() again to save the state of your project library if your attempts to update R packages were successful, or call renv::restore() to revert to the previous state as encoded in the lockfile if your attempts to update packages introduced some new problems.
Source: https://rstudio.github.io/renv/articles/renv.html
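The steps above can be sketched as a few thin wrappers. The wrapper names and the try_update() helper are hypothetical; only renv::init(), renv::snapshot(), renv::restore() and renv::install() are actual renv functions.

```r
# Sketch of the renv workflow; call these from the project root.
# Everything is wrapped in functions so nothing runs until called.

# Step 1: create the project-local library.
setup_env <- function() renv::init()

# Steps 3 and 5 (success): record the library state in renv.lock.
save_env <- function() renv::snapshot()

# Step 5 (failure): roll the library back to what renv.lock describes.
rollback_env <- function() renv::restore()

# Hypothetical helper: install or update a package, then keep the change
# only if a user-supplied check function returns TRUE.
try_update <- function(pkg, check) {
  renv::install(pkg)
  if (isTRUE(check())) save_env() else rollback_env()
}
```

A check function could be as simple as re-running the project entrypoint and returning TRUE when it finishes without error.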

Slide 28

Slide 28 text


Slide 29

Slide 29 text


Slide 30

Slide 30 text

ls -l ./renv/library/R-4.0/x86_64-apple-darwin17.0

Slide 31

Slide 31 text

renv::snapshot()
## The following package(s) will be updated in the lockfile:
##
## # CRAN ===============================
## - KernSmooth [* -> 2.23-16]
## - boot [* -> 1.3-24]
## - class [* -> 7.3-16]
## - cluster [* -> 2.1.0]
## - codetools [* -> 0.2-16]
## - fastmap [* -> 1.0.1]
## - foreign [* -> 0.8-75]
## - miniUI [* -> 0.1.1.1]
## - nnet [* -> 7.3-13]
## - rpart [* -> 4.1-15]
## - shiny [* -> 1.4.0.2]
## - sourcetools [* -> 0.1.7]
## - spatial [* -> 7.3-11]
## - survival [* -> 3.1-11]
## - xtable [* -> 1.8-4]
##
## * Lockfile written to '~/Documents/website-source/renv.lock'.

Slide 32

Slide 32 text

Dotfiles and activate function used by renv
File              Usage
.Rprofile         Used to activate renv for new R sessions launched in the project.
renv.lock         The lockfile, describing the state of your project's library at some point in time.
renv/activate.R   The activation script run by the project .Rprofile.
renv/library      The private project library. (symlink)

Slide 33

Slide 33 text

Full reproducibility • renv ૪妿笕᪃य़蟂獤䌘ෝॺկ粚๜眐丆ጱ襑穩牧獮൉ฎࣁӞ膌 ጱ絑ह愊ᶎ • 磪犚 R packages 磪羬翄絑हӤጱ狅蚅 • ๅग़碻狡獨Ո犋మአ renv ಩ಅ磪ॺկ斉ࢧ㬵 ©leoluyi, 2021 33

Slide 34

Slide 34 text

Part 3. Packaging your data product

Slide 35

Slide 35 text

Packaging your data product
• APIs
• Containerize
• Service Management, CI/CD Funnel

Slide 36

Slide 36 text

Plumber
Plumber allows you to create a web API by merely decorating your existing R source code with special comments.

Slide 37

Slide 37 text

Plumber - Getting Started
plumber.R
#' @get /hello
#' @serializer html
function(){
  "<html><h1>hello world</h1></html>"
}

Slide 38

Slide 38 text

Plumber - Getting Started
Endpoints:
#' @get /hi
#' @post /hi
#' @put /hi
#' @delete /hi
#' @head /hi
function(){
  ...
}

Slide 39

Slide 39 text

Plumber - Getting Started
plumber.R
#' Echo the parameter sent in
#' @param msg The message to echo back.
#' @get /echo
function(msg=""){
  list(msg = paste0("The message is: '", msg, "'"))
}
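Because the /echo handler is an ordinary R function, its logic can be checked without starting a server. The sketch below copies the handler body into a named function; the name echo is an assumption used only for illustration.

```r
# The /echo handler body, given a name so it can be called directly.
echo <- function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}

# Plain assertions on the return value; no HTTP request is involved.
res <- echo("hi")
stopifnot(identical(res$msg, "The message is: 'hi'"))
stopifnot(identical(echo()$msg, "The message is: ''"))
```

Plumber serializes the returned list to JSON by default, so a client would receive the same msg field in the response body.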

Slide 40

Slide 40 text

Plumber - Getting Started
plumber.R
#' Plot out data from the iris dataset
#' @param spec If provided, filter the data to only this species (e.g. 'setosa')
#' @get /plot
#' @serializer png
function(spec){
  my_data <- iris
  title <- "All Species"

  # Filter if the species was specified
  if (!missing(spec)){
    title <- paste0("Only the '", spec, "' Species")
    my_data <- subset(iris, Species == spec)
  }

  plot(my_data$Sepal.Length, my_data$Petal.Length,
       main=title, xlab="Sepal Length", ylab="Petal Length")
}

Slide 41

Slide 41 text

Plumber - Getting Started
# Attach plumber so that %>% and pr_run() are available
library(plumber)
pr <- plumber::plumb("plumber.R")
pr %>% pr_run(port=8000)

Slide 42

Slide 42 text

Advanced Plumber - Multiple Applications on One Port
nginx:
  image: nginx:1.9
  ports:
    - "80:80"
  volumes:
    - ./nginx.conf:/etc/nginx/nginx.conf:ro
  restart: always
  depends_on:
    - app1
    - app2

Slide 43

Slide 43 text

Service Management - Logging
At the moment, R still has no native library for logging.
futile.logger
• actively maintained
• supports JSON error logging
• semantics similar to Python's logging and to log4j
• may be complicated
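A minimal sketch of futile.logger in a service, assuming the package is installed. The log path and the messages are made up for illustration.

```r
library(futile.logger)

# Write log records to a file instead of the console.
# A temp file keeps the sketch self-contained; a real service would
# use a fixed path.
log_file <- tempfile(fileext = ".log")
flog.appender(appender.file(log_file))

# Only INFO and above are written; DEBUG messages are dropped.
flog.threshold(INFO)

# Messages use sprintf-style formatting.
flog.info("Service started on port %d", 8000L)
flog.debug("This line is filtered out by the threshold")
flog.error("Scoring failed for request %s", "abc-123")

# Inspect what was written.
readLines(log_file)
```

The threshold plays the same role as a log level in Python's logging module: raising it to WARN or ERROR quiets a production service without touching the call sites.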

Slide 44

Slide 44 text

Dockerize!

Slide 45

Slide 45 text


Slide 46

Slide 46 text


Slide 47

Slide 47 text

What's the cost?
• Base image size: 1.6 GB
• Packages: ???
• Build time elapsed: 30 mins (first build, on my MBA)

Slide 48

Slide 48 text

Testing for Shiny: Shinytest
Why test Shiny applications? There are many possible reasons for an application to stop working:
• You make modifications to your application.
• An external data source stops working.
• An R package is upgraded.
Shinytest uses a snapshot-based testing strategy.
library(shinytest)
recordTest("simple-app/")
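Once a test has been recorded, it can be replayed against the stored snapshots. A minimal sketch, assuming shinytest is installed and that simple-app/ already contains a recorded test; the wrapper name check_app is hypothetical.

```r
# Replay the recorded scripts for an app and fail if any snapshot differs.
# Wrapped in a function so it only runs when called.
check_app <- function(app_dir = "simple-app/") {
  # compareImages = FALSE skips screenshot comparison, which tends to
  # differ between machines; the recorded JSON state is still compared.
  shinytest::expect_pass(
    shinytest::testApp(app_dir, compareImages = FALSE)
  )
}
```

Calling check_app() inside a testthat test file is one way to run the same check from a CI job.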

Slide 49

Slide 49 text

Testing for Shiny: Shinytest

Slide 50

Slide 50 text

Testing for Shiny: Shinytest

Slide 51

Slide 51 text

CI/CD - TravisCI (.travis.yml)

Slide 52

Slide 52 text

Deploy to cloud!
plumber_example

Slide 53

Slide 53 text

Summary

Slide 54

Slide 54 text

Constraints we have: • ॺկ虻ਞ矒ᓕ • 介手牏ྋୗ絑ह獤櫝 • 涢挨Ӥ粚窕纷 • ๐率緳矒膏෭扮 (Log) ኸਂ秚ګ • ᵍ櫝獉翕 ©leoluyi, 2021 54

Slide 55

Slide 55 text

What we've talked about today
1. How hard is it to serve your data products
• A Spectrum of Analysis and Production
2. Things that should be noticed in the workflow
• Reproducible workflow: from zero to one
• Environments: R and Package Versioning, Dependency Management
3. Packaging your data product
• APIs
• Containerize
• Service Management, CI/CD Funnel

Slide 56

Slide 56 text

൉㺔膏Ի窕 㸎瓽 leoluyi@github ©leoluyi, 2021 56

Slide 57

Slide 57 text

Thanks for listening