Scaling your data infrastructure

Scaling your data infrastructure @ PyConNove

barrachri

April 20, 2018

Transcript

  1. Scaling your
    data infrastructure
    CHRISTIAN BARRA @ PYCONNOVE

  2. THE AGENDA
    START
    1. THE DATA SCIENCE WORKFLOW
    2. SCALING IS NOT JUST A MATTER OF MACHINES
    3. WHEN THE SIZE OF YOUR DATA MATTERS

  3. THE AGENDA
    4. CONTAINERIZED DATA SCIENCE
    5. CASSINY: PUT ALL THE THINGS TOGETHER
    END

  4. THE DATA SCIENCE WORKFLOW

  6. HOW YOU BUILD, ITERATE AND SHARE DEPENDS ON MANY THINGS
    Your Users
    Your Product
    Your Team
    Your Company
    Your Tech Stack
    Your Domain

  7. DATA SCIENCE TOOLBELT
    SCIKIT-LEARN
    DOCKER
    PANDAS
    JUPYTER
    RAY

  8. SCALING IS NOT JUST A MATTER OF MACHINES

  9. We all use it.

  10. We really care about versioning.
    We have Untitled_1.ipynb,
    Untitled_2.ipynb and Untitled_3.ipynb.
    HOMER SIMPSON
    CHIEF DATA SCIENTIST
    DATABEER INC

  11. Since JSON is a plain text format, they can be
    version-controlled and shared with colleagues.
    EX IPYTHON NOTEBOOK DOCUMENTATION
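
The quote above rests on the fact that an `.ipynb` file is a JSON document. As a rough sketch of what that makes possible, the snippet below builds a tiny notebook as a plain dictionary and clears its outputs and execution counts before it would be committed, which is one common way to keep diffs small. The helper `strip_outputs` and the sample notebook are hypothetical illustrations, not something shown in the talk.

```python
import json

# A tiny notebook expressed as the JSON structure stored in an .ipynb file.
# The content is made up for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Exploration"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}


def strip_outputs(nb):
    """Return a copy of the notebook with code outputs and counters cleared."""
    cleaned = json.loads(json.dumps(nb))  # cheap deep copy via a JSON round trip
    for cell in cleaned["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


clean = strip_outputs(notebook)
# Stable key order and indentation make line-based diffs easier to read.
clean_text = json.dumps(clean, indent=1, sort_keys=True)
```

Writing `clean_text` to the `.ipynb` file before a commit would leave only the cell sources and metadata under version control.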

  12. THEY GOT IT RIGHT

  13. BUT WE KEEP IMPROVING

  14. 90% OF JUPITER IS MADE BY HYDROGEN

  15. THE HARD THING ABOUT STORAGE

  16. PARQUET
    PARQUET + OBJECT STORAGE =
    YOU CAN QUERY IT USING SQL
    PANDAS HAS NATIVE SUPPORT
    FORGET ABOUT CSV

  17. WHEN THE SIZE OF YOUR DATA MATTERS

  18. IT’S TOO SLOW
    DOESN’T FIT IN YOUR RAM

  19. APPROACH SCALING FROM DIFFERENT SIDES
    CODE OPTIMIZATION
    A BIGGER MACHINE
    USE MULTIPLE CORES
    MORE MACHINES
    FRAMEWORKS: DASK, RAY, SPARK
    PANDAS: READ BY CHUNKS
    SCIKIT-LEARN: PARTIAL FIT

  20. chunks & partial_fit
    1 MACHINE
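
A minimal sketch of how chunked reading and `partial_fit` can be combined on a single machine, assuming pandas and scikit-learn are installed. The synthetic data, the in-memory CSV buffer (standing in for a file too large for RAM), the chunk size and the choice of `SGDClassifier` are all assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Synthetic, linearly separable data written to an in-memory CSV.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
csv_buffer = io.StringIO()
pd.DataFrame({"x1": X[:, 0], "x2": X[:, 1], "label": y}).to_csv(csv_buffer, index=False)
csv_buffer.seek(0)

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs the full set of labels up front
n_chunks = 0

# chunksize makes read_csv yield DataFrames of at most 200 rows each,
# so only one chunk has to be held in memory at a time.
for chunk in pd.read_csv(csv_buffer, chunksize=200):
    features = chunk[["x1", "x2"]].to_numpy()
    labels = chunk["label"].to_numpy()
    model.partial_fit(features, labels, classes=classes)
    n_chunks += 1

accuracy = model.score(X, y)
```

Only estimators that implement `partial_fit` can be trained this way, and the model sees each chunk once per pass, so several passes over the file may be needed on harder problems.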

  21. Multiple machines.
    n MACHINES

  22. I don’t want to use Spark/JVM,
    what do you have for me?
    HAPPY PYTHON USER

  23. WHAT IS RAY?

  24. A high-performance distributed execution engine
    REDIS, SCHEDULER, WORKER
    ARROW & PLASMA

  25. Use pandas through ray to query parquet files in
    an object storage.
    WORK IN PROGRESS

  26. CONTAINERIZED DATA SCIENCE

  27. If you trained a model with scikit-learn 0.18.1,
    will the same model work with 0.19.1?
    PROBLEM #1
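
One way to make this problem visible early is to store the library version next to the pickled model and refuse to load it under a different version. The sketch below uses only the standard library; `save_model`, `load_model` and the dictionary standing in for a trained estimator are hypothetical, and in real use the version string would come from `sklearn.__version__`.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path, library_version):
    """Pickle the model together with the library version used to train it."""
    with open(path, "wb") as f:
        pickle.dump({"library_version": library_version, "model": model}, f)


def load_model(path, library_version):
    """Load the model only if it was saved under the same library version."""
    with open(path, "rb") as f:
        payload = pickle.load(f)
    if payload["library_version"] != library_version:
        raise RuntimeError(
            f"model was saved with version {payload['library_version']}, "
            f"but the running version is {library_version}"
        )
    return payload["model"]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    # A plain dict stands in for a trained estimator in this sketch.
    save_model({"coef": [0.5, -1.2]}, path, library_version="0.18.1")

    same_version_model = load_model(path, library_version="0.18.1")

    try:
        load_model(path, library_version="0.19.1")
        mismatch_detected = False
    except RuntimeError:
        mismatch_detected = True
```

A check like this only reports the mismatch; it does not make the pickled model portable across versions.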

  28. How do you share your models?
    PROBLEM #2

  29. How do you put your models in production?
    PROBLEM #3

  30. Containerize everything.
    THE ANSWER

  31. TO RECAP
    1. It’s damn easy to move things around
    2. You get versioning for free
    3. Stack agnostic
    4. Move Docker images around

  32. CASSINY: PUT ALL THE THINGS TOGETHER

  33. CLEAR REQUIREMENTS
    CONTAINERIZED EASY OBJECT STORAGE JUPYTER + IPYTHON PLATFORM AGNOSTIC

  34. OPEN SOURCE

  35. DEMO

  36. TAKEAWAYS
    UNIFIED DATA WAREHOUSE
    KEEP YOUR CODE RUNNING ON ONE MACHINE
    USE DOCKER
    TRY RAY
    BRING CI/CD TO YOUR DATA SCIENCE WORKFLOW
    OBJECT STORAGE IS COOL
    DISTRIBUTED COMPUTING IS HARD
    I DIDN’T HAVE ANOTHER POINT
