how, why & where?
3) Cloud R Stack
   - Example Cloud Infra
   - Choice of Tools
4) Production Mindset
   - BetteR habits
   - Workflow
5) Buy or Build?

This talk
Born on Twitter
we do: use ML models with alternative financial signals to help deserving but underserved populations gain access to fair credit, starting with international students (2 products in market).
Differentiator: Measure a borrower's credit potential instead of their credit history (SSN, FICO, etc.)
Team: ~25 & growing (hiring local Sr R Data Engineer, Summer Intern)
Funding: Series B
Founded: 2013
"Full Stack" Data Science with R

Goal: Scalable, Timely, Economic Decisions

Technology
- rocker, EMIs, ECS, GCE, other cloud tools
- DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), Spark etc.
- Your internal pkgs, RServer, CI, Git, Cron, (most R packages), SparkR
- shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox
- httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool
- Generic: rapache, opencpu, plumber
- ML: h2o, Domino Data Lab

Function
- Multi-Channel Engagement
- Optimal Service Delivery
- Platform-agnostic function & information availability
- Business Logic
- Identities, Attribs, Relations
- Scalable Services & Contingencies
dev shops, but no serious bandwidth for hire
- Memory mgmt (still?)

Why R? 2011 vs Now

Drivers
1. Large ecosystem of packages
2. Most other things are available as HTTP APIs
3. Fantastic IDE (RStudio), 1-pt access to stack
4. Great recruiting tool

Good news: it works!
(small, low memory, public IP)
2) Scheduler (do things triggered by time & events)
   (medium, run CI tools, invoke compute slaves)
3) Workers (heavy computations)
   (highmem, multi core, stateless)
4) Storage (data source & sink)
   (external services or internal clusters)
5) Modeler (H2O Cluster, or similar)
   (cluster, available on demand)
6) Reporting (scalable web / Shiny server)
   (medium, autoscaled containers)
Jenkins Sync Repo on Commit
3. Sync triggers the next Jenkins job, which creates a Docker container
4. Next job: AWS CLI tools deploy the Docker container to ECS
5. "Dev" Shiny app live on staging
6. API call to notify Slack channel

Sample Workflows

Marketing Cost Monitor
1. Jenkins Rscript fetches today's AdWords spend & internal sales data every 5 minutes.
2. Rscript runs anomaly detection & threshold checks.
3. When a check fails, API calls from R alert via Slack and Email (e.g. SendGrid).
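The Marketing Cost Monitor steps could look roughly like the sketch below. It is an illustration under assumptions, not the talk's actual script: `check_spend()`, `alert_slack()`, the z-score rule, the `SLACK_WEBHOOK_URL` environment variable, and the spend figures are all hypothetical. The talk does not say which anomaly detection method is used. The only external call is `httr::POST()` to a Slack incoming webhook, and it is skipped when no webhook is configured.

```r
# Hypothetical check: flag today's spend if it exceeds a hard cap or sits
# far from recent history (simple z-score rule, chosen for illustration).
check_spend <- function(today, history, cap = Inf, z_max = 3) {
  stopifnot(is.numeric(today), length(today) == 1, length(history) >= 2)
  mu <- mean(history)
  sigma <- sd(history)
  # With zero variance in the history, the z-score rule is skipped.
  z <- if (sigma > 0) (today - mu) / sigma else 0
  reasons <- character(0)
  if (today > cap) {
    reasons <- c(reasons, sprintf("spend %.2f exceeds cap %.2f", today, cap))
  }
  if (abs(z) > z_max) {
    reasons <- c(reasons, sprintf("spend %.2f is %.1f sd from mean %.2f", today, z, mu))
  }
  list(ok = length(reasons) == 0, z = z, reasons = reasons)
}

# Hypothetical alert helper: posts to a Slack incoming webhook if one is
# configured (env var name is an assumption), otherwise prints the alert.
alert_slack <- function(text, webhook = Sys.getenv("SLACK_WEBHOOK_URL")) {
  if (!nzchar(webhook) || !requireNamespace("httr", quietly = TRUE)) {
    message("ALERT (not sent): ", text)
    return(invisible(FALSE))
  }
  resp <- httr::POST(webhook, body = list(text = text), encode = "json")
  invisible(!httr::http_error(resp))
}

history <- c(100, 110, 95, 105, 102, 98, 107)  # made-up daily spend
result <- check_spend(today = 180, history = history, cap = 150)
if (!result$ok) alert_slack(paste(result$reasons, collapse = "; "))
```

Run from Jenkins with `Rscript` on a 5-minute schedule, a script like this stays stateless: each run fetches fresh data, checks it, and alerts only on failure.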
The Production Mindset

code
2. Generate informational logs so you can debug later (futile.logger)
3. Add versioning (GitHub)
4. Save business logic as functions in package (selfscoRe)
5. Add examples (Rmd)
6. Write documentation (Rmd)
7. Create a web service (Shiny apps)
8. Put the service in a Docker container
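Steps 2 and 4 above (informational logs, business logic as functions) might be combined as in this minimal sketch. The function `debt_to_income()` and its rule are hypothetical and are not taken from the selfscoRe package. The sketch uses `futile.logger::flog.info()` when the package is installed and falls back to base `message()` otherwise, so it runs either way.

```r
# Use futile.logger when installed; otherwise fall back to base message().
log_info <- if (requireNamespace("futile.logger", quietly = TRUE)) {
  futile.logger::flog.info
} else {
  function(fmt, ...) message(sprintf(fmt, ...))
}

# Hypothetical business rule, kept as a small function with input checks
# so that it can live in an internal package and be tested.
debt_to_income <- function(monthly_debt, monthly_income) {
  stopifnot(is.numeric(monthly_debt), is.numeric(monthly_income))
  if (any(monthly_income <= 0)) {
    stop("monthly_income must be positive")
  }
  ratio <- monthly_debt / monthly_income
  log_info("computed debt-to-income for %d record(s)", length(ratio))
  ratio
}

ratios <- debt_to_income(c(500, 1200), c(4000, 3000))
```

Keeping the rule in a function means the same code can be called from a scheduled Rscript, a Shiny app, or a web service, and the log line records each call for later debugging.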
- If time-to-market is key
requires:
- Custom Development
- Higher Cost Tolerance
- Niche engagements

BUILD
- If compliance is a major factor (HIPAA, PCI)
- If cost control is key
- Full control of features required
requires:
- In-house devops competence
- Longer time-to-market
- Ongoing maintenance