
Data Formats for Data Science @PyConDE

Talk @ **PyConDE**

CSV is the most widely adopted data format; it is used to
store and share *not-so-big* scientific data. However, this format is not particularly
well suited when data require any sort of internal
hierarchical structure, or when data are simply too big. To this end, other data formats must be considered.
These formats may include *general purpose* solutions, e.g. [No]SQL databases or HDFS (the Hadoop Distributed File System);
or may be specifically designed for scientific data, e.g. HDF5, ROOT, NetCDF.

In this talk, these different data formats will be presented and compared with respect to their
use for scientific computations, along with the corresponding Python libraries.

Finally, a few notes on the new trend of **columnar databases** for very fast
in-memory analytics (e.g. *MonetDB*) will be discussed.

Valerio Maggio

October 29, 2016

Transcript

  1. PyConDE: DATA FORMATS FOR DATA SCIENCE
     Valerio Maggio (@leriomaggio), Data Scientist and Researcher, Fondazione Bruno Kessler (FBK), Trento, Italy
     Remastered Director's Confidential: now with different jokes!
  2. Data Science is about data…
     • The first step in Data Science is Data Loading and Processing
     • Data Loading. Q: What's the format in which data are saved? Q: What's the most Pythonic way to do that?
     • Data Storage. Q: What's the best format to store (and so share) my data?
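
For the loading side, a minimal sketch of a Pythonic way to read tabular data with pandas (the file name is hypothetical):

```python
import pandas as pd

# Load a (not-so-big) CSV file into a DataFrame; "measurements.csv" is a placeholder.
df = pd.read_csv("measurements.csv")

# Quick look at the structure and inferred dtypes before any further processing.
print(df.head())
print(df.dtypes)
```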
  3. Storage matters: BINARY FORMAT
     • Space is not the only concern (for text). Speed matters!
     • Python conversions to int() and float() are slow
       • costly atoi()/atof() C functions
     * Integers and floats in native and string representations
     * A. Scopatz, K.D. Huff - Effective Computations in Physics - Field Guide to Research in Python, O'Reilly 2015
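
A rough, illustrative sketch of the text-vs-binary cost (the array size and timing setup are arbitrary assumptions, not figures from the talk):

```python
import timeit
import numpy as np

values = np.random.rand(1_000_000)

# Text representation: every value has to be parsed back with float().
text_lines = [str(v) for v in values]
t_text = timeit.timeit(lambda: [float(s) for s in text_lines], number=1)

# Binary representation: the bytes are reinterpreted directly, no parsing.
raw = values.tobytes()
t_binary = timeit.timeit(lambda: np.frombuffer(raw, dtype=np.float64), number=1)

print(f"parse from text: {t_text:.3f}s   read from binary: {t_binary:.6f}s")
```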
  4. HIERARCHICAL DATA FORMAT 5 (a.k.a. HDF5)
     • Free and open source file format specification
     • (+) Works great with both big and tiny datasets
     • (+) Storage friendly
       • Allows for compression
     • (+) Dev. friendly
       • Query DSL + multiple-language support
     • Python: PyTables, hdf5, h5py
  5. HIERARCHICAL DATA FORMAT 5 (a.k.a. HDF5)
     • Free and open source file format specification
     • (+) Works great with both big and tiny datasets
     • (+) Storage friendly
       • Allows for compression
     • (+) Dev. friendly
       • Query DSL + multiple-language support
     • Python: PyTables, hdf5, h5py
     (screenshot from Keras Documentation)
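
A minimal h5py sketch of writing and reading a compressed dataset (file and dataset names are made up for illustration):

```python
import numpy as np
import h5py

data = np.random.rand(1000, 1000)

# Write a gzip-compressed dataset into an HDF5 file.
with h5py.File("experiment.h5", "w") as f:
    f.create_dataset("measurements", data=data, compression="gzip")

# Read it back: slicing only pulls the requested part from disk.
with h5py.File("experiment.h5", "r") as f:
    first_rows = f["measurements"][:10]
    print(first_rows.shape)
```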
  6. DATA CHUNKING
     A. Scopatz, K.D. Huff - Effective Computations in Physics - Field Guide to Research in Python, O'Reilly 2015
  7. DATA CHUNKING
     A. Scopatz, K.D. Huff - Effective Computations in Physics - Field Guide to Research in Python, O'Reilly 2015
     • Small chunks are good for accessing only some of the data at a time.
     • Large chunks are good for accessing lots of data at a time.
     • Reading and writing chunks may happen in parallel.
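
A sketch of explicit chunking with h5py (the chunk shape below is an arbitrary choice, not a recommendation from the talk):

```python
import numpy as np
import h5py

with h5py.File("chunked.h5", "w") as f:
    # Each 100x100 block is stored (and compressed) as an independent chunk,
    # so reading a small slice only touches the chunks that overlap it.
    dset = f.create_dataset(
        "grid",
        shape=(10_000, 10_000),
        dtype="f8",
        chunks=(100, 100),
        compression="gzip",
    )
    dset[:100, :100] = np.random.rand(100, 100)
```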
  8. Particle Physics: ROOT
     • Data Analysis Framework (and tool) developed @ CERN
     • Written in C++; native extension in Python (a.k.a. PyROOT)
     • ROOT6 also ships a Jupyter kernel
     • Definition of a new binary data format (.root)
       • based on the serialisation of C++ objects
     • Speed and Storage Matter!!
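
A minimal PyROOT sketch writing a histogram into a .root file (assumes ROOT with its Python bindings is installed; names and binning are invented for illustration):

```python
import ROOT

# Create a .root file and serialise a histogram (a C++ object) into it.
f = ROOT.TFile("sample.root", "RECREATE")
h = ROOT.TH1F("h_energy", "Energy;E [GeV];events", 100, 0.0, 10.0)

rng = ROOT.TRandom3(42)
for _ in range(10_000):
    h.Fill(rng.Gaus(5.0, 1.0))

h.Write()   # the object is written via ROOT's C++ serialisation
f.Close()
```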
  9. Information Architecture: RELATIONAL DATABASES
     • Normalisation (no duplicates) & fixed structure
     • SQL: Structured Query Language
       • Many different dialects!
     • ORM is the way!
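
A hedged sketch of the ORM approach using SQLAlchemy (1.4+ style imports; table and column names are invented, and an in-memory SQLite database keeps it self-contained):

```python
from sqlalchemy import create_engine, Column, Integer, Float, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Measurement(Base):
    __tablename__ = "measurements"
    id = Column(Integer, primary_key=True)
    sensor = Column(String)
    value = Column(Float)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Measurement(sensor="t0", value=21.5))
    session.commit()
    # The ORM translates this query into dialect-specific SQL behind the scenes.
    warm = session.query(Measurement).filter(Measurement.value > 20.0).all()
```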
  10. SQL

  11. Flexibility: NO-SQL DATABASES
     • Your data requires a flexible (not fixed) structure
     • JSON-based data format (mostly)
     • e.g. MongoDB (Python: pymongo)
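
A short pymongo sketch (assumes a MongoDB server on localhost; the database, collection, and field names are placeholders):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["lab"]

# Documents in the same collection need not share a fixed schema.
db.runs.insert_one({"run": 1, "detector": "A", "hits": [1, 4, 2]})
db.runs.insert_one({"run": 2, "temperature": 298.0})

for doc in db.runs.find({"run": {"$gte": 1}}):
    print(doc)
```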
  12. Flexibility + Validation
     • Your data requires a flexible(~ish) structure
     • But you want to validate your data
     • XML-based data format
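
A sketch of validating an XML document against an XML Schema with lxml (both file names are hypothetical):

```python
from lxml import etree

# "experiment.xsd" and "experiment.xml" are placeholder file names.
schema = etree.XMLSchema(etree.parse("experiment.xsd"))
doc = etree.parse("experiment.xml")

# Flexible structure, but still checked against a declared schema.
if schema.validate(doc):
    print("document is valid")
else:
    print(schema.error_log)
```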
  13. Speed: IN-MEMORY DB ANALYTICS
     • Normalisation (no duplicates) & fixed structure
     • (Super effective) in-DB analytics
     • Column-oriented DB
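
A sketch of querying MonetDB from Python with pymonetdb (assumes a running MonetDB server; credentials and the table name are placeholders):

```python
import pymonetdb

# Connection details are placeholders for a locally running MonetDB instance.
conn = pymonetdb.connect(username="monetdb", password="monetdb",
                         hostname="localhost", database="demo")
cur = conn.cursor()

# The aggregation runs inside the column store, close to the data.
cur.execute("SELECT sensor, AVG(value) FROM measurements GROUP BY sensor")
print(cur.fetchall())
conn.close()
```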
  14. BIG DATA AND COLUMNAR DBs
     • The Big Data world is shifting towards columnar DBs
     • better suited to OLAP (analytics) than to OLTP