
2016 - Dillon Niederhut - What to do when your data is large, but not big

PyBay
August 20, 2016


Description
This talk presents strategies in Python for handling data that are too large to fit in memory, too slow to process in one thread, or both, but still small enough to fit on one machine.

Abstract
Unless you work at a large internet company, you probably don't have BIG data, but you might have LARGE data. Large data consume an unacceptable amount of time and memory when handled with strategies meant for medium-sized data, but incur unnecessary financial and latency costs when handled with strategies meant for big data. Two basic strategies for handling large data, chunking and parallelization, will be discussed with live-coded examples in Python.
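The two strategies named in the abstract can be illustrated with a minimal standard-library sketch. This is not the code from the talk: the function names (`read_chunks`, `partial_stats`, `combine`, `chunked_mean`, `parallel_mean`) and the one-number-per-line input format are assumptions made for illustration. Chunking reduces each fixed-size piece of the input to a small summary, and parallelization hands those pieces to worker processes.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def read_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size floats parsed from an iterable of lines."""
    iterator = iter(lines)
    while True:
        chunk = [float(line) for line in islice(iterator, chunk_size)]
        if not chunk:
            return
        yield chunk


def partial_stats(chunk):
    """Reduce one chunk to a small summary: (sum, count)."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Merge per-chunk (sum, count) summaries into an overall mean."""
    total, count = 0.0, 0
    for chunk_sum, chunk_count in partials:
        total += chunk_sum
        count += chunk_count
    return total / count if count else float("nan")


def chunked_mean(lines, chunk_size=1000):
    """Chunking: only one chunk is held in memory at a time."""
    return combine(partial_stats(c) for c in read_chunks(lines, chunk_size))


def parallel_mean(lines, chunk_size=1000, workers=2):
    """Parallelization: each chunk is summarized in a worker process.

    Note: Executor.map collects its input up front, so this version queues
    every chunk before results come back. It trades memory for speed.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return combine(pool.map(partial_stats, read_chunks(lines, chunk_size)))
```

Because a file object is itself an iterable of lines, `chunked_mean(open(path))` would stream a file without loading it whole, where `path` is a placeholder for a file with one number per line. `parallel_mean` should be called from a script protected by an `if __name__ == "__main__":` guard so that worker processes can import the helper functions.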

Bio
I'm a research scientist living in the Bay Area and working in neuroethology, human evolution, and natural language processing. I currently work at D-Lab, where I help researchers apply advances in computation to their research paradigms.

https://youtu.be/g-YCaX3ml2Q


Transcript

  1. Large data in python Dillon Niederhut Introduction Motivation Strategies Closing

    What to do when your data are large but not big
    Dillon Niederhut
    PyBay – the San Francisco Bay Area Python Conference
    20 August 2016
  2. about this talk

    • data at github.com/deniederhut/pybay 2016
    • python libraries: celery, h5py, numpy, pandas, pymongo
    • other libraries: mongodb, rabbitmq, sqlite
  3. contact

    • dillon.niederhut.us
    • @dillonniederhut