Scalable Scientific Computing using Dask

Pandas and NumPy are great tools for exploring data, performing analysis, and training machine learning models. They provide intuitive APIs and superb performance. Sadly, both are restricted to the main memory of a single machine and, for the most part, to a single CPU. Dask is a flexible tool for parallelizing NumPy and Pandas code on a single machine or a cluster.

Uwe L. Korn

October 24, 2018

Transcript

  1. PyCon.DE / PyData Karlsruhe 2018 — Uwe L. Korn — Scalable Scientific Computing with Dask
  2. About me • Senior Data Scientist at Blue Yonder (@BlueYonderTech) • Apache {Arrow, Parquet} PMC • Data Engineer and Architect with a heavy focus on Pandas • xhochy • mail@uwekorn.com
  3. What is Dask? • Execution and definition of task graphs • a parallel computing library that scales the existing Python ecosystem • scales down to your laptop • scales up to a cluster
  4. More than a single CPU • multi-core and distributed parallel execution • low-level: task schedulers for computation graphs • high-level: Array, Bag and DataFrame
  5. What about Spark? Dask is • more lightweight • in Python, operates well with C/C++/Fortran/LLVM or other natively compiled code • part of the Python ecosystem
  6. What about Spark? Spark is • written in Scala and works well within the JVM • Python support is very limited • brings its own ecosystem • able to provide more high-level optimizations
  7. https://github.com/mrocklin/pydata-nyc-2018-tutorial
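The low-level task-graph model the slides describe can be sketched with `dask.delayed` (assuming Dask is installed; the functions here are trivial placeholders):

```python
import dask

# dask.delayed wraps plain functions; calling them builds a task
# graph instead of executing immediately
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# inc(1) and inc(2) are independent nodes in the graph, so a
# scheduler is free to run them in parallel; add() depends on both
total = add(inc(1), inc(2))
result = total.compute()  # → 5
```

The same graph runs unchanged on the single-machine schedulers or on a distributed cluster, which is what lets Dask scale "down to your laptop" and "up to a cluster".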