
Scale Up Your Data Science Work Flow Using Dask

Arnab Biswas
December 16, 2020


As Data Scientists, we encounter a few major challenges while dealing with large volumes of data:

1. Popular libraries like Numpy and Pandas are not designed to scale beyond a single core/processor. Scikit-Learn can utilize multiple cores.
2. Numpy, Pandas, Scikit-Learn are not designed to scale beyond a single machine.
3. On a laptop or workstation, RAM is often limited to 16 or 32 GB. Numpy, Pandas, and Scikit-Learn need the data to be loaded into RAM, so if the size of the data exceeds the size of main memory, these libraries can't be used.

This set of notebooks describes how these challenges can be addressed using Dask, an open source parallel computation library.
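For a flavour of what that looks like in practice, here is a minimal sketch (the file pattern and column name are invented for illustration): Dask reads a larger-than-RAM dataset as many pandas partitions, processes them in parallel across cores, and only the small final result is held in memory.

```python
import dask.dataframe as dd

# Lazily describe all CSV partitions; nothing is loaded into RAM yet
df = dd.read_csv("data/transactions-*.csv")   # hypothetical file pattern

# Same expression you would write in pandas; Dask splits the work across
# cores, and only the final scalar result is materialized in memory
mean_amount = df["amount"].mean().compute()   # "amount" is a hypothetical column
print(mean_amount)
```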


Transcript

1. Agenda - Fundamentals of Computer Architecture - Parallelism - Challenges with Large Data - Distributed Computing - Dask : What & Why? - Dask : Big Data Collection (DataFrame & Array) - Dask ML https://github.com/arnabbiswas1/dask_workshop
2. Fundamentals of Computer System • Components • Computing Unit: CPU, GPU • Memory Unit: RAM, Hard Drive, CPU Cache (L1/L2) • Bus: Connector between Computing & Memory Unit • Network Connection: Very slow connection connecting to other computing and memory units
3. Computing Unit • Properties • Instructions per cycle (IPC): Number of operations CPU can do in one cycle • Clock Speed: Number of cycles CPU can execute in one second • Multicore Architecture • Processors are not getting faster • Chip makers are putting multiple CPUs ("cores") within the same physical unit, increasing total capacity
4. Parallelism • Parallelism as a process • Break a problem into lots of smaller independent problems • Solve each smaller problem independently using a single core • Combine the results • Parallelism is hard • Computer time is needed to break the problem into pieces, distribute them across cores, and recombine them to get the final result • Developing parallel code is hard
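The break/solve/combine pattern described on this slide can be sketched with the Python standard library alone; the squaring function below is just a stand-in for real, independent work.

```python
from concurrent.futures import ProcessPoolExecutor

def solve_piece(x):
    # Stand-in for an expensive, independent computation
    return x * x

if __name__ == "__main__":
    pieces = range(10)                                      # 1. break the problem up
    with ProcessPoolExecutor() as executor:
        partials = list(executor.map(solve_piece, pieces))  # 2. solve each piece on a core
    print(sum(partials))                                    # 3. combine the results
```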
5. Parallelism using Multicore • Amdahl’s Law • If a program has some part which must be executed by a single core (a "bottleneck"), the gain from parallelization using multiple cores becomes limited • Most Data Science code can be parallelized well • Python's Global Interpreter Lock (GIL) • A Python process can utilize only one core at any given time • Can be avoided using • Multiprocessing • Tools: numpy or numexpr, Cython etc. • Distributed models of computing • Not a problem for the numeric Python ecosystem (NumPy, Pandas, etc.) https://en.wikipedia.org/wiki/Amdahl%27s_law
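As a quick illustration of Amdahl's Law, a tiny sketch of the speedup it predicts, where p is the parallelizable fraction of the program and n the number of cores:

```python
def amdahl_speedup(p, n):
    """Maximum speedup on n cores when a fraction p of the work is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 16 cores, a 10% serial bottleneck caps the speedup at 6.4x
print(amdahl_speedup(p=0.9, n=16))
```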
6. Memory Units • Properties • Read/Write Speed • Latency: Time taken by the device to find data • Different Memory Units • Spinning Hard Drive: Extremely slow read/write, but large capacity (10 TB) • Solid State Hard Drive: Faster read/write, smaller capacity (1 TB range) • RAM: Faster read/write, limited capacity (64 GB) • L1/L2 Cache: Extremely fast, very small capacity (MB range) • Data going to the CPU goes through here
7. Memory Units… • Tiered Approach • Data starts from the hard drive, moves to RAM, a smaller subset moves to L1/L2 cache, and finally arrives at the processor • Moving data around is often the most expensive part (compared to the actual computation) • With large data • If data doesn’t fit into RAM, moving data directly between the processor and the hard disk is unimaginably slow (minutes -> days)
8. Large Data Strategies • Simple Definition: Data that doesn’t fit into RAM • Strategy • Slim & Trim the data, so that it fits into memory • Use more RAM • Out of Core Computing • Minimize penalties of streaming data out of the hard drive • Use Distributed Computing
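One simple form of out-of-core computing, even before reaching for Dask, is pandas' chunked reading; a minimal sketch (the file name and column are hypothetical):

```python
import pandas as pd

total, count = 0.0, 0
# Stream the file in 1-million-row chunks instead of loading it all at once
for chunk in pd.read_csv("large_dataset.csv", chunksize=1_000_000):  # hypothetical file
    total += chunk["value"].sum()   # "value" is a hypothetical column
    count += len(chunk)

print(total / count)  # mean of the column without ever holding the full file in RAM
```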
9. Distributed Computing • Break data into small pieces • Send those to different computers • Execute the computation on those different computers • Bring back the partial results from those computers • Recombine the partial results to generate the final output
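This break/send/execute/gather/recombine pattern maps directly onto Dask's distributed scheduler; a sketch using a local cluster as a stand-in (pointing the Client at a real scheduler address would spread the work across machines):

```python
from dask.distributed import Client, LocalCluster

def solve_piece(x):
    # Stand-in for an expensive, independent computation
    return x * x

if __name__ == "__main__":
    cluster = LocalCluster()                        # local workers stand in for remote computers
    client = Client(cluster)
    futures = client.map(solve_piece, range(10))    # send the pieces to the workers
    partials = client.gather(futures)               # bring the partial results back
    print(sum(partials))                            # recombine into the final output
    client.close()
    cluster.close()
```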
10. Data Science with Large Data : Approach • Single Computer • Parallelize using multiple cores • Try to fit data into RAM; if not, stream data out of the hard drive • Distributed Computing https://azure.microsoft.com/en-in/pricing/details/virtual-machines/linux/
11. Data Science with Large Data : Challenges • Popular libraries (Numpy, Pandas etc.) are not designed to scale beyond a single core/processor • Scikit-Learn can (using Joblib) • Not designed to scale beyond a single machine • Can’t be used for larger-than-RAM data • Data needs to be loaded into RAM for computation
12. Dask to the rescue… • Parallelizes the Python ecosystem • Parallelizes using multiple cores (single/multiple computers) • Handles large data • Out of Core Computing: For larger-than-RAM data, streams from the hard drive • Familiar API • Uses the data structures of Pandas, Numpy (etc.) internally • Dask copies most of their API • Scales up on a cluster. Scales down to a laptop • Enables transitioning from single-machine workflows to parallel & distributed computing without learning new frameworks or changing much code
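To illustrate the "familiar API" point, a minimal sketch with Dask Array, which mirrors the NumPy interface while splitting the work into chunks:

```python
import dask.array as da

# A 10,000 x 10,000 array split into 1,000 x 1,000 chunks; each chunk is a
# NumPy array that Dask schedules across cores (or a cluster) in parallel
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

y = (x + x.T).mean(axis=0)   # lazy: builds a task graph, computes nothing yet
print(y.compute())           # triggers the parallel computation
```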
13. What is Dask? • Parallel Computing Library in Python • Open Source: 300+ contributors, 20 active maintainers • Core maintainers of Numpy, Pandas, Scikit-Learn are also core maintainers of Dask • Reference: https://www.youtube.com/watch?v=nnndxbr_Xq4
14. References • High Performance Python, 2nd Edition • Duke MIDS Fall 2020 Practical Data Science Course • Dask Documentation & Videos • Dask Video Tutorial 2020 by Jacob Tomlinson • Parallel and Distributed Computing in Python with Dask, SciPy 2020 • Scalable Machine Learning with Dask by Tom Augspurger