Slide 1

Scalable Scientific Computing with Dask
Uwe L. Korn
PyCon.DE / PyData Karlsruhe 2018

Slide 2

About me
• Senior Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with a heavy focus on pandas
xhochy
[email protected]

Slide 3

What is Dask?
• A parallel computing library that scales the existing Python ecosystem
• Definition and execution of task graphs (see the sketch below)
• Scales down to your laptop
• Scales up to a cluster
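
A minimal sketch of how such a task graph is defined and executed, using dask.delayed; the functions inc and add below are made-up placeholders for illustration, not part of Dask:

import dask

@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

# Calling delayed functions only records tasks; nothing runs yet.
a = inc(1)
b = inc(2)
total = add(a, b)  # depends on both previous tasks

# .compute() executes the graph, running independent tasks in parallel.
print(total.compute())  # 5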

Slide 4

More than a single CPU
• Multi-core and distributed parallel execution
• Low-level: task schedulers for computation graphs
• High-level: Array, Bag and DataFrame (see the sketch below)
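
A short sketch of the three high-level collections, assuming NumPy, pandas and the dask[array,dataframe] extras are installed; all data below is made up for illustration:

import dask.array as da
import dask.bag as db
import dask.dataframe as dd
import pandas as pd

# Array: a large array split into chunks; the reduction runs chunk-wise in parallel.
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())

# Bag: a parallel collection of generic Python objects.
b = db.from_sequence(range(10), npartitions=2)
print(b.map(lambda i: i * 2).sum().compute())  # 90

# DataFrame: the familiar pandas API over a partitioned DataFrame.
pdf = pd.DataFrame({"key": [1, 2, 1, 2], "value": [10.0, 20.0, 30.0, 40.0]})
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.groupby("key")["value"].sum().compute())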

Slide 5

What about Spark?
Dask is
• More lightweight
• In Python; interoperates well with C/C++/Fortran/LLVM and other natively compiled code
• Part of the Python ecosystem

Slide 6

What about Spark?
Spark is
• Written in Scala and works well within the JVM
• Python support is very limited
• Brings its own ecosystem
• Able to provide more high-level optimizations

Slide 7

https://github.com/mrocklin/pydata-nyc-2018-tutorial