
DevOps for Data Science

Slides of the talk presented at ODSC India 2018.

Anand Chitipothu

August 31, 2018

Transcript

  1. DevOps for Data Science
     Experiences from building a cloud-based data science platform ... Anand Chitipothu, rorodata
  2. Who is Speaking?
     Anand Chitipothu @anandology, co-founder and platform architect of @rorodata. Worked at Internet Archive & Strand Life Sciences. Teaches advanced programming courses at @pipalacademy.
  3. Managing Data Science in Production is Hard!
     • The tools and practices are not very mature
     • Everyone ends up building their own solutions
     • Building your own solution requires careful system architecture and complex devops
  4. The Goal: The Effective Data Science Team
     • The data science team is self-sufficient to build end-to-end ML applications
     • No steep learning curve
  5. Launching Notebooks - Challenges
     • Switching between different compute needs
     • Installing required software dependencies
     • Data storage
     • GPU support
  6. Instance Size
     Different instance sizes to pick from:
     • S1 - 1 CPU core, 1 GB RAM
     • S2 - 1 CPU core, 3.5 GB RAM
     • M1 - 2 CPU cores, 15 GB RAM
     • X1 - 64 CPU cores, 1024 GB RAM
     • G1 - 4 CPU cores, 60 GB RAM, 1 K100 GPU
  7. How to specify additional dependencies?
     • runtime.txt: specifies the runtime
     • environment.yml: conda environment file with Python dependencies
     • requirements.txt: Python dependencies to be installed from pip
     • apt.txt: system packages that need to be installed
     • postBuild: script for custom needs
     Writing a Dockerfile is too low-level.
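     For illustration, here is what a project's dependency files might contain. The package names and versions are hypothetical, not from the talk:

     ```
     # runtime.txt -- which runtime to use
     python-3.6

     # requirements.txt -- Python packages installed with pip
     scikit-learn==0.19.1
     joblib

     # apt.txt -- system packages installed with apt-get
     libgomp1
     ```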
  8. Behind the Scenes
     • Two Docker images are built for each project - one for CPU and another for GPU
     • Runtimes are also built using the same approach
     • Manages compute instances
     • Pools the compute resources to optimize resource consumption
     • Uses a network file system to persist data and notebooks
     • Automatic endpoint and HTTPS management
  9. Challenges
     • Designing and documenting APIs
     • Running the service and configuring URL endpoints
     • Scaling to meet the usage
     • Client library to use the API
     • Authentication
     • Tracking the usage and performance
  10. The Right Level of Abstraction! Step 1: Write your function

      # sq.py
      def square(n):
          """Compute the square of a number."""
          return n * n
  11. Step 2: Run it as an API

      $ firefly sq.square
      http://127.0.0.1:8000/
      ...

      Firefly is the open-source library that we built to solve this problem.
  12. Step 3: Use it

      >>> import firefly
      >>> api = firefly.Client("http://127.0.0.1:8000/")
      >>> api.square(n=4)
      16

      An out-of-the-box client library to access the API.
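      Under the hood, the client call in Step 3 is an HTTP request carrying the keyword arguments as JSON. A minimal sketch of that idea, with a toy dispatcher standing in for Firefly (this illustrates the concept, not Firefly's actual implementation):

      ```python
      import json

      def square(n):
          """The same function exposed as an API in Step 2."""
          return n * n

      # Toy registry: the function name becomes the URL path,
      # and keyword arguments travel as a JSON body.
      FUNCTIONS = {"square": square}

      def handle_request(path, body):
          """Dispatch a JSON request to the registered function."""
          func = FUNCTIONS[path.strip("/")]
          kwargs = json.loads(body)
          return json.dumps(func(**kwargs))

      print(handle_request("/square", '{"n": 4}'))  # prints 16
      ```

      The client side then reduces to a single POST per call, which is what makes a generic client library like firefly.Client possible.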
  13. What about ML models? Write your predict function and run it as an API.

      # face_detection.py
      import joblib

      model = joblib.load("model.pkl")

      def predict(image):
          ...
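      The body of predict is elided on the slide. As a hedged sketch of the pattern, with a stub standing in for the joblib-loaded model so the example is self-contained (the feature extraction and threshold are made up for illustration):

      ```python
      class StubModel:
          """Stands in for a trained classifier loaded via joblib;
          mimics the scikit-learn predict() interface."""
          def predict(self, features):
              return [1 if sum(f) > 1.0 else 0 for f in features]

      # In the real project this would be: model = joblib.load("model.pkl")
      model = StubModel()

      def extract_features(image):
          # Hypothetical feature extraction; a real service would decode
          # the image bytes and compute model inputs here.
          return [len(image) / 10.0, 0.5]

      def predict(image):
          """The function exposed as an API endpoint, as on the slide."""
          features = extract_features(image)
          label = model.predict([features])[0]
          return {"face_detected": bool(label)}

      print(predict(b"fake-image-bytes"))  # prints {'face_detected': True}
      ```

      Because the service boundary is just a function, the model-loading cost is paid once at import time and each API call only runs predict.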
  14. Integration
      Write a config file in the project specifying which services to run.

      services:
        - name: api
          function: face_detection.predict
          size: S2
  15. The Push Button
      The deploy command submits the code to the platform, which starts the required services in that project.

      $ roro deploy
      ...
  16. Behind the Scenes
      • It builds the Docker images
      • Starts the specified services
      • Provides URL endpoints with HTTPS
  17. Summary
      • Making the data science team self-sufficient is key to their productivity
      • Optimize for developer experience
      • The right level of abstraction is key!