As data scientists, we encounter a few major challenges when dealing with large volumes of data:
1. Popular libraries like NumPy and Pandas are not designed to scale beyond a single core/processor (Scikit-Learn can utilize multiple cores).
2. NumPy, Pandas, and Scikit-Learn are not designed to scale beyond a single machine.
3. On a laptop or workstation, RAM is often limited to 16 or 32 GB. NumPy, Pandas, and Scikit-Learn require the data to be loaded into RAM, so if the data is larger than main memory, these libraries cannot be used directly.
This set of notebooks describes how these challenges can be addressed with Dask, an open-source library for parallel computing.
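To build intuition for the third challenge, the out-of-core pattern that Dask automates can be sketched in plain Python: process the data in fixed-size chunks and keep only one chunk in memory at a time. This is a hypothetical helper for illustration, not Dask's API; Dask does this (and much more) transparently across chunks, cores, and machines.

```python
import itertools


def chunked_mean(values, chunk_size=1_000):
    """Compute the mean of an iterable without materializing it in memory.

    Only `chunk_size` items are held in RAM at once, so `values` can be
    far larger than main memory (e.g. a generator streaming from disk).
    """
    total = 0.0
    count = 0
    it = iter(values)
    while True:
        # Pull the next chunk; an empty chunk means the data is exhausted.
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        total += sum(chunk)
        count += len(chunk)
    return total / count


# A generator stands in for a dataset too large to load at once.
result = chunked_mean(x * 0.5 for x in range(10_000_000))
```

Dask applies the same idea to arrays and dataframes: it splits them into chunks, builds a task graph over those chunks, and executes the graph in parallel, spilling to disk when needed.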