PyCon.DE / PyData Karlsruhe 2018
Uwe L. Korn
Scalable Scientific Computing with
Dask
• Senior Data Scientist at Blue Yonder
(@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with a heavy
focus on Pandas
About me
xhochy
[email protected]
• Execution and definition of task graphs
• a parallel computing library that scales the existing Python ecosystem.
• scales down to your laptop
• scales up to a cluster
What is Dask?
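The "definition and execution of task graphs" mentioned above can be sketched with `dask.delayed` (a minimal illustration, assuming Dask is installed; the function names `inc` and `add` are just examples):

```python
import dask

# dask.delayed wraps ordinary Python functions so that calling
# them builds a task graph instead of running immediately.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

a = inc(1)         # not executed yet, just a node in the graph
b = inc(2)
total = add(a, b)  # graph: two independent inc tasks feeding add

result = total.compute()  # now the scheduler executes the graph
```

The two `inc` calls are independent, so a Dask scheduler is free to run them in parallel before combining them in `add` — the same model scales from a laptop thread pool to a distributed cluster.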
• multi-core and distributed parallel execution
• low-level: task schedulers for computation graphs
• high-level: Array, Bag and DataFrame
More than a single CPU
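As a taste of the high-level collections listed above, here is a small `dask.array` sketch (assuming Dask with NumPy support is installed; the shapes and chunk sizes are arbitrary examples):

```python
import dask.array as da

# A chunked array: each (100, 100) block is a separate task,
# so per-chunk operations can execute on multiple cores.
x = da.ones((1000, 1000), chunks=(100, 100))

total = (x + 1).sum()   # lazy expression, builds a task graph
value = total.compute() # triggers parallel execution
```

`dask.bag` and `dask.dataframe` follow the same pattern: a familiar, NumPy- or Pandas-like API on top, a task graph executed by a scheduler underneath.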
Dask is
• More light-weight
• In Python, operates well with C/C++/Fortran/LLVM or other natively
compiled code
• Part of the Python ecosystem
What about Spark?
Spark is
• Written in Scala and works well within the JVM
• Python support is very limited
• Brings its own ecosystem
• Able to provide more high-level optimizations
What about Spark?