"Full Stack" Data Science with R

Talk at EARL 2017, San Francisco 2017-06-06

- https://earlconf.com/2017/sanfrancisco/

Gabriela de Queiroz

June 06, 2017

Transcript

  1. "Full Stack" Data Science with R
    Production data science & engineering
    with open source tools
    Gabriela de Queiroz
    Ajay Gopal

  2. Agenda
    1) Data (Science) Stack Needs
    2) Using R - how, why & where?
    3) Cloud R Stack
    - Example Cloud Infra
    - Choice of Tools
    4) Production Mindset
    - BetteR habits
    - Workflow
    5) Buy or Build?
    This talk
    Born on Twitter

  3. Gabriela de Queiroz
    Lead Data Scientist
    / : gdequeiroz
    First met R: 2007
    Ajay Gopal, PhD
    Chief Data Scientist
    : ajzz : @aj2z
    First met R: 2011
    About us

  4. What’s SelfScore?
    Industry: FinTech Startup, Menlo Park, CA
    What we do: use ML models with alternative financial signals to help deserving but underserved
    populations gain access to fair credit, starting with international students (2 products in market).
    Differentiator: Measure borrower’s credit potential instead of their credit history (SSN / FICO etc)
    Team: ~25 & growing (hiring local Sr R Data Engineer, Summer Intern)
    Funding: Series B
    Founded: 2013

  5. The “Full Stack” Analogy
    UX
    - Function: Multi-Channel Engagement
    - Technology: Email (SendGrid), SMS (Twilio), Push (SNS, Firebase), Msg Frmwks
    Front End
    - Function: Optimal Service Delivery
    - Technology: HTML/CSS, JS (Node, React), Bootstrap, iOS, Android, Ionic, Cordova
    APIs
    - Function: Platform-agnostic function & information availability
    - Technology: Restify, Django, Rails, ASP.net, Lambda
    Back End
    - Function: Business Logic
    - Technology: PHP, JS, Python, Ruby, ORMs, CI, Git
    Data Store
    - Function: Identities, Attribs, Relations
    - Technology: MySQL, Postgres, MongoDB, Redis, Memcached etc.
    Devops
    - Function: Scalable Services & Contingencies
    - Technology: Puppet, Chef, Ansible, AWS EC2, Docker, ECS/GCE, Heroku
    Goal: Scalable, Engaging, Valuable Service

  6. “Full Stack” Data Science with R
    UX
    - Function: Multi-Channel Engagement
    - Technology: httr, curl: API interactions for Email, SMS, Push, Slack, or via CI tool
    Front End
    - Function: Optimal Service Delivery
    - Technology: shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox
    APIs
    - Function: Platform-agnostic function & information availability
    - Technology: Generic: rapache, opencpu, plumber; ML: h2o, Domino Data Lab
    Back End
    - Function: Business Logic
    - Technology: Your internal pkgs, RServer, CI, Git, Chron, (most R packages), sparkR
    Data Store
    - Function: Identities, Attribs, Relations
    - Technology: DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), Spark etc.
    Devops
    - Function: Scalable Services & Contingencies
    - Technology: rocker, EMIs, ECS, GCE, other cloud tools
    Goal: Scalable, Timely, Economic Decisions

  7. Why R? 2011 vs Now
    Drivers
    1. Large ecosystem of packages
    2. Most other things are available as HTTP APIs
    3. Fantastic IDE (RStudio): single-point access to stack
    4. Great recruiting tool
    Detractors
    - Very few hard-core devs
    - Only two major dev shops, but no serious bandwidth for hire
    - Memory mgmt (still?)
    Good news: it works!

  8. Minimum Functions Infra Should Support
    1) Retrieve Data
    - Ad / Marketing
    - Sales
    - Transaction
    - 3rd Party / Behavioral
    2) Process (ETL)
    - Fetch, clean up, store
    3) Analyze
    - Cross-Connectivity
    - Aggregation
    - Algorithms
    4) Predict
    - Models in batch
    - REST APIs
    5) Inform
    - Customers (Services & API)
    - Partners
    Eg: Marketing, fulfillment
    - Internal Stakeholders
    Eg: Reporting / Dashboards

  9. Infra Resource Roles
    1) Bastion (to connect to external world)
    (small, low memory, public IP)
    2) Scheduler (do things triggered by time & events)
    (medium, run CI tools, invoke compute workers)
    3) Workers (heavy computations)
    (highmem, multi core, stateless)
    4) Storage (data source & sink)
    (external services or internal clusters)
    5) Modeler (H2O Cluster, or similar)
    (cluster, available on demand)
    6) Reporting (scalable web / Shiny server)
    (medium, autoscaled containers)

  10. Sample AWS Infra

  11. Choice of Tools

  12. Sample Workflows
    “Staging” Shiny App
    1. Git commit to “Dev” branch
    2. Jenkins syncs repo on commit
    3. Sync triggers next Jenkins job, which creates a Docker container
    4. Next job: AWS CLI tools deploy Docker container to ECS
    5. “Dev” Shiny app live on staging
    6. API call to notify Slack channel
    Marketing Cost Monitor
    1. Jenkins Rscript fetches today’s AdWords spend & internal sales data every 5 minutes
    2. Rscript runs anomaly detection & threshold checks
    3. When a check fails, API calls from R alert via Slack and email (e.g. SendGrid)
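
The Marketing Cost Monitor steps can be sketched in base R roughly as follows. This is a minimal illustration, not the speakers' actual script: the function names `check_spend` and `notify`, the z-score rule, the cutoff values, and the spend figures are all assumptions made for the example. The notifier only prints to the console; in a setup like the one described, it could instead be an HTTP call from R to Slack or an email service.

```r
# Hypothetical helper: flag the latest spend reading if it breaches a hard
# threshold or sits far from recent history (simple z-score check).
check_spend <- function(history, latest, max_spend, z_cutoff = 3) {
  stopifnot(is.numeric(history), length(history) >= 2, is.numeric(latest))
  alerts <- character(0)
  # Threshold check: absolute cap on spend.
  if (latest > max_spend) {
    alerts <- c(alerts,
                sprintf("spend %.2f exceeds threshold %.2f", latest, max_spend))
  }
  # Anomaly check: distance from the historical mean in standard deviations.
  s <- sd(history)
  if (s > 0) {
    z <- (latest - mean(history)) / s
    if (abs(z) > z_cutoff) {
      alerts <- c(alerts,
                  sprintf("spend %.2f looks anomalous (z = %.1f)", latest, z))
    }
  }
  alerts
}

# Stand-in notifier (assumption): sends each alert through `send`, which
# defaults to printing a message. Returns the number of alerts sent.
notify <- function(alerts, send = message) {
  for (a in alerts) send(paste("ALERT:", a))
  invisible(length(alerts))
}

history <- c(100, 104, 98, 101, 97, 103)  # made-up spend readings
alerts <- check_spend(history, latest = 180, max_spend = 150)
notify(alerts)
```

Keeping the checks in a pure function and the alerting in a separate one makes the checks easy to test without sending any messages; a scheduler such as Jenkins would only need to run the script at the chosen interval.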

  13. Full-Stack Data Science
    People
    - Data / Backend Engineer
    - Data Scientist
    - Modeller / Statistician
    - Product Manager
    - Devops Engineer
    Output
    - EDA / ad-hoc
    - Scheduled Reporting
    - Batch Predictions
    - Stream Processing
    - Real-Time Prediction APIs
    Our “product” is scalable, actionable intelligence
    … let’s adopt good software development practices

  14. The Production Mindset
    BetteR habits:
    1. Write inline and offline tests for your code
    2. Generate informational logs so you can debug later (futile.logger)
    3. Add version control (GitHub)
    4. Save business logic as functions in a package (selfscoRe)
    5. Add examples (Rmd)
    6. Write documentation (Rmd)
    7. Create a web service (Shiny apps)
    8. Put the service in a Docker container
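
Habits 1 and 2 (tests and informational logs) might look like the sketch below. It uses base R only so that it runs without extra packages; `log_info` is a hypothetical stand-in for a logging package such as futile.logger, and `cost_per_sale` is an invented example of a business-logic function, not something taken from the talk.

```r
# Hypothetical logger: a base-R stand-in that prefixes a timestamp and level.
log_info <- function(fmt, ...) {
  message(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), " INFO ", sprintf(fmt, ...))
}

# Hypothetical business-logic function with inline checks on its inputs.
cost_per_sale <- function(spend, sales) {
  stopifnot(is.numeric(spend), is.numeric(sales), all(sales > 0))
  log_info("computing cost per sale for %d record(s)", length(spend))
  spend / sales
}

# Offline test: run separately from the main code path, for example in CI.
test_cost_per_sale <- function() {
  stopifnot(isTRUE(all.equal(cost_per_sale(50, 200), 0.25)))
  # Invalid input should raise an error rather than return a bad number.
  failed <- inherits(try(cost_per_sale(50, 0), silent = TRUE), "try-error")
  stopifnot(failed)
  invisible(TRUE)
}

test_cost_per_sale()
```

The inline `stopifnot` guards fail fast on bad input at run time, while the separate test function documents expected behaviour and can be run on every commit.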

  15. Workflow
    Create and Prioritize Issues/Tickets
    Write Code
    Test Code in Staging
    Review the Code
    Continuous Integration

  16. Should we buy or should we build?
    Should my company buy the infra? Should my team build it?

  17. Buy vs Build Considerations
    BUY
    - If no dev/tech in-house
    - If time-to-market is key
    requires:
    - Custom Development
    - Higher Cost Tolerance
    - Niche engagements
    BUILD
    - If compliance is major factor
    (HIPAA, PCI)
    - If cost control is key
    - If full control of features is required
    requires:
    - In-house devops competence
    - Longer time-to-market
    - Ongoing maintenance

  18. “Full Stack” Data Engineer
    More info: bit.ly/LeadDataEngineer
    WE ARE HIRING!
    Img Credits: http://daemon.co.za/2014/04/what-does-full-stack-mean
