Scalable Scientific Computing using Dask

Pandas and NumPy are great tools for digging through data, doing analysis, and training machine learning models. They provide intuitive APIs and superb performance. Sadly, both are restricted to the main memory of a single machine and, for the most part, to a single CPU. Dask is a flexible tool for parallelizing NumPy and Pandas code on a single machine or a cluster.
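
A minimal sketch of what this looks like in practice (the file pattern and column names below are made up for illustration):

import dask.dataframe as dd

# dd.read_csv accepts a glob pattern and loads the files lazily,
# in partitions, instead of pulling everything into RAM at once.
df = dd.read_csv("data/2018-*.csv")

# The API mirrors Pandas; nothing runs yet, Dask only builds
# a task graph describing the computation.
result = df.groupby("customer_id")["revenue"].mean()

# .compute() executes the graph in parallel and returns a
# plain Pandas object.
print(result.compute())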

Uwe L. Korn

October 24, 2018

Transcript

  1. PyCon.DE / PyData Karlsruhe 2018
     Uwe L. Korn
     Scalable Scientific Computing with Dask

  2. About me
     • Senior Data Scientist at Blue Yonder (@BlueYonderTech)
     • Apache {Arrow, Parquet} PMC
     • Data Engineer and Architect with a heavy focus on Pandas
     xhochy
     [email protected]

  3. What is Dask?
     • A parallel computing library that scales the existing Python ecosystem
     • Definition and execution of task graphs (see the sketch below)
     • Scales down to your laptop
     • Scales up to a cluster
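
     A minimal sketch of defining and executing such a task graph with
     dask.delayed (inc and add are toy stand-in functions):

     from dask import delayed

     def inc(x):
         return x + 1

     def add(x, y):
         return x + y

     # Wrapping calls in delayed() records them in a task graph
     # instead of executing them immediately.
     a = delayed(inc)(1)
     b = delayed(inc)(2)
     total = delayed(add)(a, b)

     # compute() hands the graph to a scheduler, which can run the
     # two independent inc calls in parallel.
     print(total.compute())  # 5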

  4. More than a single CPU
     • Multi-core and distributed parallel execution
     • Low-level: task schedulers for computation graphs
     • High-level: Array, Bag and DataFrame (Array sketched below)
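
     A minimal sketch of the high-level Array collection (the shape and
     chunk size are arbitrary):

     import dask.array as da

     # A 10000x10000 array split into 1000x1000 chunks; each chunk is
     # a NumPy array, and operations run chunk-by-chunk in parallel.
     x = da.random.random((10000, 10000), chunks=(1000, 1000))

     # NumPy-style expressions build a task graph over the chunks.
     y = (x + x.T).mean(axis=0)

     # compute() executes the graph, by default on a local thread
     # pool using all available cores.
     print(y.compute()[:5])

     Pointing a dask.distributed Client at a scheduler address would run
     the same graph on a cluster instead of the local thread pool.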

  5. What about Spark?
     Dask is
     • More light-weight
     • In Python, operates well with C/C++/Fortran/LLVM or other natively compiled code
     • Part of the Python ecosystem

  6. What about Spark?
     Spark is
     • Written in Scala and works well within the JVM
     • Python support is very limited
     • Brings its own ecosystem
     • Able to provide more high-level optimizations

  7. https://github.com/mrocklin/pydata-nyc-2018-tutorial
