Scaling your data infrastructure

Scaling your data infrastructure @ PyConNove

barrachri

April 20, 2018

Transcript

  1. Scaling your data infrastructure

    CHRISTIAN BARRA @ PYCONNOVE
  2. THE AGENDA

    START; 1. THE DATA SCIENCE WORKFLOW; 2. SCALING IS NOT JUST A MATTER OF MACHINES; 3. WHEN THE SIZE OF YOUR DATA MATTERS
  3. THE AGENDA

    4. CONTAINERIZED DATA SCIENCE; 5. CASSINY: PUT ALL THE THINGS TOGETHER; END
  4. THE DATA SCIENCE WORKFLOW

  5. HEXAGON PRESENTATION TEMPLATE

  6. HOW YOU BUILD, ITERATE AND SHARE DEPENDS ON MANY THINGS

    Your Users, Your Product, Your Team, Your Company, Your Tech Stack, Your Domain
  7. DATA SCIENCE TOOLBELT: SCIKIT-LEARN, DOCKER, PANDAS, JUPYTER, RAY

  8. SCALING IS NOT JUST A MATTER OF MACHINES

  9. We all use it.

  10. We really care about versioning. We have Untitled_1.ipynb, Untitled_2.ipynb and Untitled_3.ipynb.

    HOMER SIMPSON, CHIEF DATA SCIENTIST, DATABEER INC
  11. Since JSON is a plain text format, they can be version-controlled and shared with colleagues.

    EX IPYTHON NOTEBOOK DOCUMENTATION
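Slide 11's claim can be illustrated with the standard library alone. The sketch below builds a small notebook-shaped dictionary (the cell content is invented for illustration), then strips outputs and execution counts so a diff in version control shows only source changes. `strip_outputs` is a hypothetical helper written for this example, not part of Jupyter.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict without outputs or execution counts."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy via JSON
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A minimal notebook in the nbformat 4 layout; the cell content is made up.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        }
    ],
}

cleaned = strip_outputs(notebook)
# Sorted keys and indentation give stable, line-oriented text that diffs well.
text = json.dumps(cleaned, indent=1, sort_keys=True)
```

Writing `text` to an `.ipynb` file before committing keeps noisy output blobs out of the history, at the cost of losing rendered results in the stored copy.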
  12. THEY GOT IT RIGHT

  13. BUT WE KEEP IMPROVING

  14. 90% OF JUPITER IS MADE OF HYDROGEN

  15. THE HARD THING ABOUT STORAGE

  16. PARQUET

    PARQUET + OBJECT STORAGE = YOU CAN QUERY IT USING SQL; PANDAS HAS NATIVE SUPPORT; FORGET ABOUT CSV
  17. WHEN THE SIZE OF YOUR DATA MATTERS

  18. IT’S TOO SLOW. DOESN’T FIT IN YOUR RAM.

  19. SCALING FROM DIFFERENT SIDES

    CODE OPTIMIZATION; APPROACH; A BIGGER MACHINE; USE MULTIPLE CORES; MORE MACHINES. FRAMEWORKS: DASK, RAY, SPARK. PANDAS: READ BY CHUNKS. SCIKIT-LEARN: PARTIAL FIT.
  20. chunks & partial_fit

    1 MACHINE
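The "chunks & partial_fit" idea from slide 20 might look like the sketch below: pandas reads a CSV in fixed-size chunks and a scikit-learn estimator is updated one chunk at a time, so the full dataset never has to sit in memory. The synthetic data, the column names `f1`, `f2` and `label`, and the choice of `SGDClassifier` are assumptions made for this example, not details from the talk.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Build a synthetic CSV in memory; a real workload would read a large file.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(int)
frame = pd.DataFrame({"f1": x[:, 0], "f2": x[:, 1], "label": y})
csv_data = io.StringIO(frame.to_csv(index=False))

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs every class label up front

rows_seen = 0
# With chunksize set, read_csv yields DataFrames of at most 100 rows each.
for chunk in pd.read_csv(csv_data, chunksize=100):
    features = chunk[["f1", "f2"]].to_numpy()
    labels = chunk["label"].to_numpy()
    model.partial_fit(features, labels, classes=classes)
    rows_seen += len(chunk)

accuracy = model.score(x, y)
```

Only estimators that implement `partial_fit` can be trained this way, which is one reason the approach is listed separately from simply buying a bigger machine.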
  21. Multiple machines.

    n MACHINES
  22. I don’t want to use Spark/JVM, what do you have for me?

    HAPPY PYTHON USER
  23. WHAT IS RAY?

  24. A high-performance distributed execution engine

    REDIS, SCHEDULER, WORKER, ARROW & PLASMA
  25. Use pandas through ray to query parquet files in an object storage.

    WORK IN PROGRESS
  26. CONTAINERIZED DATA SCIENCE

  27. If you trained a model with scikit-learn 0.18.1, will the same model work with 0.19.1?

    PROBLEM #1
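One way to make Problem #1 visible, short of containerizing, is to store the library version next to the serialized model and check it before loading. The sketch below is an assumption-laden illustration: `dump_with_version` and `load_checked` are hypothetical helpers, and the strict equality check is a deliberately conservative policy rather than anything scikit-learn prescribes.

```python
import pickle

import sklearn
from sklearn.linear_model import LogisticRegression


def dump_with_version(model) -> bytes:
    """Pickle a model next to the scikit-learn version that produced it."""
    bundle = {
        "sklearn_version": sklearn.__version__,
        # The model is pickled separately so the version can be checked
        # before any estimator code is unpickled.
        "model_bytes": pickle.dumps(model),
    }
    return pickle.dumps(bundle)


def load_checked(payload: bytes):
    """Load a model only if it was saved with the installed version."""
    bundle = pickle.loads(payload)
    saved = bundle["sklearn_version"]
    if saved != sklearn.__version__:
        raise RuntimeError(
            f"model saved with scikit-learn {saved}, "
            f"but {sklearn.__version__} is installed"
        )
    return pickle.loads(bundle["model_bytes"])


# Tiny invented training set, only to have a fitted model to round-trip.
samples = [[0.0], [1.0], [2.0], [3.0]]
targets = [0, 0, 1, 1]
model = LogisticRegression().fit(samples, targets)

payload = dump_with_version(model)
restored = load_checked(payload)
```

A container image pins the whole environment instead of one version string, which is the direction the following slides take.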
  28. How do you share your models?

    PROBLEM #2
  29. How do you put your models in production?

    PROBLEM #3
  30. Containerize everything.

    THE ANSWER
  31. TO RECAP

    1. It’s damn easy to move things around; 2. You get versioning for free; 3. Stack agnostic; 4. Move Docker images around
  32. CASSINY: PUT ALL THE THINGS TOGETHER

  33. CLEAR REQUIREMENTS

    CONTAINERIZED; EASY; OBJECT STORAGE; JUPYTER + IPYTHON; PLATFORM AGNOSTIC
  34. OPEN SOURCE

  35. DEMO

  36. TAKEAWAYS

    UNIFIED DATA WAREHOUSE; KEEP YOUR CODE RUNNING ON ONE MACHINE; USE DOCKER; TRY RAY; BRING CI/CD TO YOUR DATA SCIENCE WORKFLOW; OBJECT STORAGE IS COOL; DISTRIBUTED COMPUTING IS HARD; I DIDN’T HAVE ANOTHER POINT