Scalable Scientific Computing using Dask

Pandas and NumPy are great tools for exploring data, performing analysis, and training machine learning models. They provide intuitive APIs and superb performance. Sadly, both are restricted to the main memory of a single machine and, for the most part, to a single CPU. Dask is a flexible tool for parallelizing NumPy and Pandas code on a single machine or a cluster.

Uwe L. Korn

October 24, 2018

Transcript

  1. PyCon.DE / PyData Karlsruhe 2018 — Uwe L. Korn — Scalable Scientific Computing with Dask
  2. About me • Senior Data Scientist at Blue Yonder (@BlueYonderTech) • Apache {Arrow, Parquet} PMC • Data Engineer and Architect with a heavy focus on Pandas • xhochy • mail@uwekorn.com
  3. What is Dask? • Execution and definition of task graphs • a parallel computing library that scales the existing Python ecosystem • scales down to your laptop • scales up to a cluster
  4. More than a single CPU • multi-core and distributed parallel execution • low-level: task schedulers for computation graphs • high-level: Array, Bag and DataFrame
  5. What about Spark? Dask is • more lightweight • in Python, operates well with C/C++/Fortran/LLVM or other natively compiled code • part of the Python ecosystem
  6. What about Spark? Spark is • written in Scala and works well within the JVM • Python support is very limited • brings its own ecosystem • able to provide more high-level optimizations
  7. https://github.com/mrocklin/pydata-nyc-2018-tutorial
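The low-level task-graph model the slides describe can be sketched with `dask.delayed` (assuming Dask is installed; the functions here are trivial placeholders):

```python
import dask

# dask.delayed wraps plain functions; calling them builds a task
# graph instead of executing immediately
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# inc(1) and inc(2) are independent nodes in the graph, so a
# scheduler is free to run them in parallel; add() depends on both
total = add(inc(1), inc(2))
result = total.compute()  # → 5
```

The same graph runs unchanged on the single-machine schedulers or on a distributed cluster, which is what lets Dask scale "down to your laptop" and "up to a cluster".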