Slide 1

Scale Up Data Science Workflow using Dask
Arnab Biswas
Twitter: @arnabbiswas1 | arnab.blog

Slide 2

Agenda
- Fundamentals of Computer Architecture
- Parallelism
- Challenges with Large Data
- Distributed Computing
- Dask: What & Why?
- Dask: Big Data Collection (DataFrame & Array)
- Dask ML

https://github.com/arnabbiswas1/dask_workshop

Slide 3

Fundamentals of a Computer System
• Components
  • Computing Unit: CPU, GPU
  • Memory Unit: RAM, Hard Drive, CPU Cache (L1/L2)
  • Bus: Connector between the Computing & Memory Units
  • Network Connection: Very slow link connecting to other computing and memory units

Slide 4

Computing Unit
• Properties
  • Instructions per cycle (IPC): Number of operations the CPU can perform in one cycle
  • Clock Speed: Number of cycles the CPU can execute in one second
• Multicore Architecture
  • Individual processors are not getting faster
  • Chip makers are putting multiple CPUs ("cores") into the same physical unit, increasing total capacity
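As a quick check of how many cores your own machine exposes, the standard library can report the count (a minimal sketch; the values printed are machine dependent):

    import os
    import multiprocessing

    # Logical cores visible to the OS (includes hyper-threads)
    print("os.cpu_count():             ", os.cpu_count())
    print("multiprocessing.cpu_count():", multiprocessing.cpu_count())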

Slide 5

Parallelism
• Parallelism as a process
  • Break a problem into many smaller, independent problems
  • Solve each smaller problem independently using a single core
  • Combine the results
• Parallelism is hard
  • Computer time is spent breaking the problem into pieces, distributing them across cores, and recombining them into the final result
  • Developer time is spent writing and debugging parallel code
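A minimal sketch of the break / solve / combine pattern using the standard library's multiprocessing module (the summing task and chunk size are illustrative assumptions, not from the slides):

    from multiprocessing import Pool

    def partial_sum(chunk):
        # Solve one small, independent problem on a single core
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(10_000_000))
        # Break the problem into smaller, independent pieces
        chunk_size = 1_000_000
        chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
        with Pool() as pool:  # one worker process per core by default
            partials = pool.map(partial_sum, chunks)
        # Combine the partial results
        print(sum(partials))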

Slide 6

Parallelism using Multicore
• Amdahl's Law
  • If a program has a part that must be executed by a single core (a "bottleneck"), the speed-up achievable with multiple cores is limited
  • Most Data Science code can be parallelized well
• Python's Global Interpreter Lock (GIL)
  • A Python process can utilize only one core at any given time
  • Can be avoided using
    • Multiprocessing
    • Tools such as NumPy, numexpr, Cython, etc.
    • Distributed models of computing
  • Not a problem for the numeric Python eco-system (NumPy, Pandas, etc.), which does most of its work in compiled code
https://en.wikipedia.org/wiki/Amdahl%27s_law
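The point about the numeric eco-system can be seen with threads: pure-Python bytecode is serialized by the GIL, while large NumPy operations release it inside compiled code, so several of them can run on different cores at once. A rough sketch (array sizes and thread count are arbitrary assumptions):

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def bound_by_gil(n):
        # Pure-Python loop: the GIL lets only one thread execute this at a time,
        # so running it in a thread pool gives no speed-up
        return sum(i * i for i in range(n))

    def releases_gil(a, b):
        # Large BLAS-backed matrix multiply: NumPy releases the GIL here,
        # so multiple calls can run concurrently on different cores
        return a @ b

    a = np.random.random((2000, 2000))
    b = np.random.random((2000, 2000))

    with ThreadPoolExecutor(max_workers=4) as ex:
        results = list(ex.map(lambda _: releases_gil(a, b), range(4)))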

Slide 7

Memory Units
• Properties
  • Read/Write Speed
  • Latency: Time taken by the device to find the data
• Different Memory Units
  • Spinning Hard Drive: Extremely slow read/write, but large capacity (~10 TB)
  • Solid State Drive (SSD): Faster read/write, smaller capacity (~1 TB range)
  • RAM: Much faster read/write, limited capacity (~64 GB)
  • L1/L2 Cache
    • Extremely fast, very small capacity (MB range)
    • Data going to the CPU passes through here

Slide 8

Memory Units…
• Tiered Approach
  • Data starts on the hard drive, moves to RAM, a smaller subset moves to the L1/L2 cache, and finally arrives at the processor
  • Moving data around is often the most expensive part (compared to the actual computation)
• With large data
  • If the data doesn't fit into RAM, moving it directly between the processor and the hard disk is unimaginably slow (minutes turn into days)

Slide 9

Large Data Strategies
• Simple definition: Data that doesn't fit into RAM
• Strategy
  • Slim & trim the data so that it fits into memory
  • Use more RAM
  • Out-of-core computing
    • Minimize the penalty of streaming data from the hard drive
  • Use distributed computing
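Out-of-core computing is possible even with plain pandas by streaming a file in chunks; a minimal sketch, assuming a hypothetical events.csv with a value column:

    import pandas as pd

    total, count = 0.0, 0
    # Stream the file from disk in pieces that fit comfortably in RAM
    for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
        total += chunk["value"].sum()
        count += len(chunk)

    print("mean value:", total / count)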

Slide 10

Distributed Computing
• Break the data into small pieces
• Send those pieces to different computers
• Execute the computation on each computer
• Bring back the partial results from those computers
• Recombine the partial results to generate the final output
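Dask's distributed scheduler (introduced later in this deck) follows exactly this pattern; a minimal sketch using a local "cluster" of worker processes (the function and inputs are illustrative):

    from dask.distributed import Client

    def square(x):
        return x * x

    if __name__ == "__main__":
        client = Client()                          # local cluster: one worker process per core
        futures = client.map(square, range(100))   # send small pieces of work to the workers
        partials = client.gather(futures)          # bring back the partial results
        print(sum(partials))                       # recombine into the final output
        client.close()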

Slide 11

Data Science with Large Data: Approach
• Single Computer
  • Parallelize using multiple cores
  • Try to fit the data into RAM; if it doesn't fit, stream data from the hard drive
• Distributed Computing
https://azure.microsoft.com/en-in/pricing/details/virtual-machines/linux/

Slide 12

Data Science with Large Data: Challenges
• Popular libraries (NumPy, Pandas, etc.) are not designed to scale beyond a single core/processor
  • Scikit-Learn can (using Joblib)
• Not designed to scale beyond a single machine
• Can't be used for larger-than-RAM data
  • Data needs to be loaded into RAM for computation
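For example, many Scikit-Learn estimators take an n_jobs argument, which Joblib uses to spread work across the local cores (still a single machine, and the data must fit in RAM); a small sketch with a toy dataset:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

    # n_jobs=-1: train the trees in parallel on all available cores (via Joblib)
    clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))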

Slide 13

Dask to the rescue…
• Parallelizes the Python eco-system
  • Parallelizes using multiple cores (on a single computer or across many)
• Handles large data
  • Out-of-core computing: for larger-than-RAM data, streams from the hard drive
• Familiar API
  • Uses the data structures of Pandas, NumPy, etc. internally
  • Dask copies most of their API
• Scales up to a cluster, scales down to a laptop
  • Enables transitioning from single-machine workflows to parallel & distributed computing without learning new frameworks and without changing much code
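A small sketch of the familiar API: the Dask DataFrame code mirrors pandas almost line for line, but reads the data in partitions and only computes when asked (the file pattern and column names are illustrative assumptions):

    import dask.dataframe as dd

    # Looks like pandas, but each partition is a pandas DataFrame processed lazily
    df = dd.read_csv("events-*.csv")
    result = df.groupby("user_id")["value"].mean()

    # Nothing has been computed yet; .compute() runs the work across the cores
    print(result.compute())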

Slide 14

What is Dask?
• A parallel computing library in Python
• Open source: 300+ contributors, 20 active maintainers
• Core maintainers of NumPy, Pandas, and Scikit-Learn are also core maintainers of Dask
Reference: https://www.youtube.com/watch?v=nnndxbr_Xq4

Slide 15

Two Parts of Dask
• Dynamic Task Scheduling
• Big Data Collections (DataFrame, Array, etc.)
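A tiny sketch of both parts: dask.delayed builds a task graph that the dynamic scheduler executes, and Dask Array is one of the big data collections built on top of that machinery (the sizes and functions are illustrative):

    import dask
    import dask.array as da

    # Part 1: dynamic task scheduling with dask.delayed
    @dask.delayed
    def inc(x):
        return x + 1

    @dask.delayed
    def add(x, y):
        return x + y

    total = add(inc(1), inc(2))   # builds a task graph; nothing runs yet
    print(total.compute())        # scheduler executes the graph -> 5

    # Part 2: a big data collection (Dask Array) backed by many NumPy chunks
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    print(x.mean().compute())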

Slide 16

https://github.com/arnabbiswas1/dask_workshop

Slide 17

References
• High Performance Python, 2nd Edition
• Duke MIDS Fall 2020 Practical Data Science course
• Dask documentation & videos
• Dask Video Tutorial 2020 by Jacob Tomlinson
• Parallel and Distributed Computing in Python with Dask, SciPy 2020
• Scalable Machine Learning with Dask by Tom Augspurger

Slide 18

Questions?