Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Scalable Scientific Computing using Dask

Scalable Scientific Computing using Dask

Pandas and NumPy are great tools to dive through data, do analysis and train machine learning models. They provide intuitive APIs and superb performance. Sadly they are both restricted to the main memory of a single machine and mostly also to a single CPU. Dask is a flexible tools for parallelizing NumPy and Pandas code on a single machine or a cluster.

Uwe L. Korn

October 24, 2018

More Decks by Uwe L. Korn

Other Decks in Programming


  1. 2 • Senior Data Scientist at Blue Yonder (@BlueYonderTech) •

    Apache {Arrow, Parquet} PMC • Data Engineer and Architect with heavy focus around Pandas About me xhochy [email protected]
  2. 3 • Execution and definition of task graphs • a

    parallel computing library that scales the existing Python ecosystem. • scales down to your laptop laptop • sclaes up to a cluster What is Dask?
  3. 4 • multi-core and distributed parallel execution • low-level: task

    schedulers for computation graphs • high-level: Array, Bag and DataFrame More than a single CPU
  4. 5 Dask is • More light-weight • In Python, operates

    well with C/C++/Fortran/LLVM or other natively compiled code • Part of the Python ecosystem What about Spark?
  5. 6 Spark is • Written in Scala and works well

    within the JVM • Python support is very limited • Brings its own ecosystem • Able to provide more higher level optimizations What about Spark?