Scale Up Your Data Science Work Flow Using Dask

Scale Up Data Science Workflow using Dask
Arnab Biswas
twitter: @arnabbiswas1
arnab.blog

Agenda
- Fundamentals of Computer Architecture
- Parallelism
- Challenges with Large Data
- Distributed Computing
- Dask: What & Why?
- Dask: Big Data Collection (DataFrame & Array)
- Dask ML
https://github.com/arnabbiswas1/dask_workshop

Fundamentals of Computer System
• Components
  • Computing Unit: CPU, GPU
  • Memory Unit: RAM, Hard Drive, CPU Cache (L1/L2)
  • Bus: Connector between Computing & Memory Unit
  • Network Connection: Very slow link to other computing and memory units

Computing Unit
• Properties
  • Instructions per cycle (IPC): Number of operations the CPU can perform in one cycle
  • Clock Speed: Number of cycles the CPU can execute per second
• Multicore Architecture
  • Individual processors are no longer getting much faster
  • Chip makers put multiple CPUs ("cores") within the same physical unit, increasing total capacity

Parallelism
• Parallelism as a process
  • Break a problem into many smaller independent problems
  • Solve each smaller problem independently using a single core
  • Combine the results
• Parallelism is hard
  • Computer time is needed to break the problem into pieces, distribute them across cores, and recombine them into the final result
  • Developing parallel code
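The break / solve / combine steps on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names (`split_into_chunks`, `sum_of_squares`, `parallel_sum_of_squares`) are made up for this example and do not come from the workshop repository. A thread pool is used so the sketch runs anywhere; for CPU-bound pure-Python work, passing a `concurrent.futures.ProcessPoolExecutor` instead (it offers the same `map` method) is what would actually spread the chunks across cores.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(numbers, n_chunks):
    """Break the problem: cut the list into roughly equal pieces."""
    size = max(1, -(-len(numbers) // n_chunks))  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def sum_of_squares(chunk):
    """Solve one small, independent problem."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(numbers, n_chunks, executor):
    """Distribute the chunks to workers, then combine the partial results."""
    chunks = split_into_chunks(numbers, n_chunks)
    partial_results = executor.map(sum_of_squares, chunks)
    return sum(partial_results)


numbers = list(range(1_000))
with ThreadPoolExecutor(max_workers=4) as pool:
    total = parallel_sum_of_squares(numbers, n_chunks=4, executor=pool)

print(total)
```

Even in this toy case the splitting, scheduling and recombining are extra work on top of the computation itself, which is the overhead the slide refers to.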

Parallelism using Multicore
• Amdahl’s Law
  • If a program has some part that must be executed by a single core (a "bottleneck"), the speedup achievable with multiple cores is limited
  • Most Data Science code parallelizes well
• Python's Global Interpreter Lock (GIL)
  • Only one thread in a Python process can execute Python bytecode at any given time, so pure-Python code effectively uses one core
  • Can be avoided using
    • Multiprocessing
    • Tools: numpy or numexpr, Cython etc.
    • Distributed models of computing
  • Not a problem for the numeric Python ecosystem (NumPy, Pandas etc.)
https://en.wikipedia.org/wiki/Amdahl%27s_law
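Amdahl's Law can be written as speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that can be parallelized and n is the number of cores. The small function below is a sketch of that formula; the name `amdahl_speedup` and the 90% figure are chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Theoretical speedup from Amdahl's Law.

    parallel_fraction: share of the run time that can be parallelized (0 to 1).
    n_cores: number of cores working on the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# Illustrative case: 90% of the program can run in parallel.
for cores in (2, 8, 64, 1024):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

With 10% of the work stuck on a single core, the speedup stays below 10x no matter how many cores are added.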

Memory Units
• Properties
  • Read/Write Speed
  • Latency: Time taken by the device to find data
• Different Memory Units
  • Spinning Hard Drive: Extremely slow read/write, but large capacity (10 TB)
  • Solid State Hard Drive: Faster read/write, smaller capacity (1 TB Range)
  • RAM: Faster read/write. Limited Capacity (64 GB)
  • L1/L2 Cache
    • Extremely fast, very small capacity (MB Range)
    • Data going to CPU goes through here.

Memory Units…
• Tiered Approach
  • Data starts on the hard drive, moves to RAM, a smaller subset moves to L1/L2 cache, and finally arrives at the Processor
  • Moving data around is often the most expensive part (compared to the actual computation)
• With large data
  • If data doesn’t fit into RAM, moving data directly between Processor and Hard Disk is unimaginably slow (minutes -> days)

Large Data Strategies
• Simple Definition: Data that doesn’t fit into RAM
• Strategy
  • Slim & Trim the data, so that it fits into memory
  • Use more RAM
  • Out of Core Computing
    • Minimize penalties of streaming data out of hard drive
  • Use Distributed Computing
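To make the out-of-core idea concrete, the sketch below averages one column of a CSV file while holding only a single row in memory at a time. It uses only the standard library; the function name `streaming_mean`, the file name and the column name are hypothetical and exist just for this example. The same streaming pattern is available in pandas through the `chunksize` parameter of `read_csv`.

```python
import csv
import os
import tempfile


def streaming_mean(csv_path, column):
    """Average a numeric column by streaming rows from disk one at a time."""
    total = 0.0
    count = 0
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            total += float(row[column])
            count += 1
    if count == 0:
        raise ValueError("no rows to average")
    return total / count


# Tiny stand-in for a file that would normally be too large for RAM.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])
        writer.writerows([[1, 1.0], [2, 2.0], [3, 3.0], [4, 6.0]])
    mean_value = streaming_mean(sample_path, "value")

print(mean_value)
```

Memory use stays constant regardless of file size, at the cost of reading from the slower storage tier.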

Distributed Computing
• Break data into small pieces
• Send those to different computers
• Execute those on different computers
• Bring back the partial results from those computers
• Recombine the partial results to generate final output

Data Science with Large Data: Approach
• Single Computer
  • Parallelize across multiple cores
  • Try to fit data into RAM; if it doesn’t fit, stream data out of hard drive
• Distributed Computing
https://azure.microsoft.com/en-in/pricing/details/virtual-machines/linux/

Data Science with Large Data: Challenges
• Popular libraries (Numpy, Pandas etc.) are not designed to scale beyond a single core/processor
  • Scikit-Learn can (using JobLib)
• Not designed to scale beyond a single machine
• Can’t be used for larger than RAM data
  • Data needs to be loaded into RAM for computation

Dask to the rescue…
• Parallelizes Python Eco-System
  • Parallelize using Multiple Cores (single/multiple computers)
• Handles large data
  • Out of Core Computing: For larger than RAM data, streams from hard drive
• Familiar API
  • Uses data structures of Pandas, Numpy (etc.) internally
  • Dask copies most of their API
• Scales up on a cluster. Scales down to a laptop
• Enables transitioning from single machine workflows to parallel & distributed computing without learning new frameworks or changing much code

What is Dask?
• Parallel Computing Library in Python
• Open Source: 300+ contributors, 20 active maintainers
• Core maintainers of Numpy, Pandas, Scikit-Learn are the core maintainers of Dask
Reference: https://www.youtube.com/watch?v=nnndxbr_Xq4

Two Parts of Dask
• Dynamic Task Scheduling
• Big Data Collection
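The two parts can be seen side by side in a minimal sketch, assuming `dask` and `numpy` are installed. The functions `increment` and `add` are toy examples written for this illustration. `dask.delayed` exposes the task scheduler directly for arbitrary Python functions, while `dask.array` is one of the big data collections built on top of it.

```python
import dask.array as da
from dask import delayed


def increment(x):
    return x + 1


def add(a, b):
    return a + b


# Task scheduling: wrap plain functions so calls build a graph instead of running.
lazy_a = delayed(increment)(1)
lazy_b = delayed(increment)(2)
lazy_total = delayed(add)(lazy_a, lazy_b)
total = lazy_total.compute()  # the scheduler executes the three tasks

# Big data collection: a NumPy-like array stored as a grid of smaller chunks.
big = da.ones((1000, 1000), chunks=(250, 250))
array_sum = big.sum().compute()  # each chunk is reduced, then results are combined

print(total, array_sum)
```

In both cases nothing runs until `compute()` is called, which gives the scheduler the whole graph to work with.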

https://github.com/arnabbiswas1/dask_workshop

Reference
• High Performance Python, 2nd Edition
• Duke MIDS Fall 2020 Practical Data Science Course
• Dask Documentation & Videos
• Dask Video Tutorial 2020 by Jacob Tomlinson
• Parallel and Distributed Computing in Python with Dask, SciPy 2020
• Scalable Machine Learning with Dask by Tom Augspurger

Questions?
