EuroSciPy17 - Distributed compute with Dask, Kubernetes and AWS

Video - https://www.youtube.com/watch?v=6nFllDegCTY

Dask is a flexible, Python-based parallel computing library. It enables interactive data analysis on large datasets and scales from your laptop to a cluster. This parallelization can speed up your analysis, but if your compute nodes sit idle you can end up burning a lot of money.

Dask has an interface through which it can ask for more or fewer resources. We've built an example system, using cluster management tools like Kubernetes and public cloud infrastructure, that lets you maximize the compute you get when you need it while minimizing cost.
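The idea of asking for more or fewer resources can be illustrated with a small, self-contained sketch. This is not Dask's adaptive API and not the system described in the talk: the `ScalingPolicy` class, the `scaling_action` helper, and every parameter and default below are hypothetical, chosen only to show the kind of decision an adaptive cluster has to make from the size of its task backlog.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical scaling policy; names and defaults are illustrative only."""

    tasks_per_worker: int = 10  # assumed backlog one worker can absorb
    min_workers: int = 0        # allow scaling to zero so idle nodes cost nothing
    max_workers: int = 20       # upper bound to keep spending under control

    def target_workers(self, pending_tasks: int) -> int:
        """Return the number of workers wanted for the given task backlog."""
        if pending_tasks < 0:
            raise ValueError("pending_tasks must be non-negative")
        wanted = math.ceil(pending_tasks / self.tasks_per_worker)
        # Clamp the request between the configured minimum and maximum.
        return max(self.min_workers, min(self.max_workers, wanted))


def scaling_action(current_workers: int, target: int) -> str:
    """Describe the request that would be sent to the cluster manager."""
    if target > current_workers:
        return f"scale up by {target - current_workers}"
    if target < current_workers:
        return f"scale down by {current_workers - target}"
    return "no change"


policy = ScalingPolicy()
for backlog, current in [(0, 3), (45, 3), (1000, 5)]:
    target = policy.target_workers(backlog)
    print(f"backlog={backlog} current={current} -> {scaling_action(current, target)}")
```

In a real deployment, the request would be translated into calls to a cluster manager such as Kubernetes, which in turn adds or removes cloud instances. The cap on workers is one simple way to bound cost, and scaling to zero addresses the idle-node problem mentioned above.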

Jacob Tomlinson

August 31, 2017

Transcript

  1. Distributed compute with Dask, Kubernetes and AWS. Jacob Tomlinson, Alex Hilson. Met Office Informatics Lab
  2. Who

  3. None
  4. None
  5. Our Problem

  6. A typical workflow...

  7. A typical workflow...

  8. Distributed Compute

  9. http://dask.pydata.org

  10. http://dask.pydata.org

  11. Adaptive Clusters

  12. $ dask-scheduler --preload adaptive.py

  13. IaaS PaaS

  14. None
  15. www.informaticslab.co.uk data.informaticslab.co.uk github.com/met-office-lab @informatics_lab