DS4DS workshop: Labeled Arrays and Datasets in Python

Stephan Hoyer
September 18, 2015

This was a 10-minute lightning talk on the labeled data features of xray and pandas, and on what we need from the rest of the SciPy stack.

Transcript

  1. Labeled arrays and datasets in Python

    Stephan Hoyer (@shoyer)
    Data Structures for Data Science Workshop
    UC Berkeley BIDS, 18 September 2015
  2. Xray vectorizes by dimension names, not order

    [Figure: two diagrams of array addition, with array axes labeled "time" and "space"]
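
To illustrate what vectorizing by dimension name means, here is a minimal sketch in plain NumPy. The helper `add_by_dims` is hypothetical and written only for this example; it is not xray's implementation or API. It assumes each array comes with a tuple of dimension names, and it lines axes up by name before adding.

```python
import numpy as np

def add_by_dims(a, a_dims, b, b_dims):
    """Add two arrays, matching axes by dimension name instead of position.

    Hypothetical helper for illustration only (not part of xray or NumPy).
    """
    a_dims, b_dims = list(a_dims), list(b_dims)
    # Result dimensions: those of `a`, then any that appear only in `b`.
    out_dims = a_dims + [d for d in b_dims if d not in a_dims]

    def expand(arr, dims):
        # Put the existing axes in result order...
        order = [dims.index(d) for d in out_dims if d in dims]
        arr = np.transpose(arr, order)
        # ...then add length-1 axes for the dimensions this array lacks,
        # so ordinary NumPy broadcasting can do the rest.
        sizes = iter(arr.shape)
        shape = [next(sizes) if d in dims else 1 for d in out_dims]
        return arr.reshape(shape)

    return expand(a, a_dims) + expand(b, b_dims), tuple(out_dims)

temps = np.arange(6).reshape(3, 2)   # dims: (time, space)
flipped = temps.T                    # same data, dims: (space, time)
total, dims = add_by_dims(temps, ("time", "space"), flipped, ("space", "time"))

offset = np.array([10, 20])          # dims: (space,)
shifted, _ = add_by_dims(temps, ("time", "space"), offset, ("space",))
```

Because the axes are matched by name, `total` equals `2 * temps` even though the two operands store their axes in opposite order.
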
  3. The ecosystem needs duck-typed ndarrays

    Dask
    Xray’s current backends
    Future(?) backends
    What’s the generic version of np.concatenate?
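
One possible shape for a "generic concatenate" is type-based dispatch. The sketch below is only an assumption about what such a function could look like: `ChunkedArray` and `concat_like` are hypothetical names invented for this example, not NumPy, Dask, or xray APIs.

```python
from functools import singledispatch

import numpy as np

class ChunkedArray:
    """Hypothetical duck array: a 1-D array stored as a list of NumPy chunks."""

    def __init__(self, chunks):
        self.chunks = [np.asarray(c) for c in chunks]

    @property
    def shape(self):
        return (sum(c.shape[0] for c in self.chunks),)

@singledispatch
def concat_like(first, *rest):
    # Fallback: treat everything as a plain NumPy array.
    return np.concatenate([first, *rest])

@concat_like.register(ChunkedArray)
def _(first, *rest):
    # Chunked arrays concatenate by joining their chunk lists;
    # no data is copied and the result stays a ChunkedArray.
    chunks = list(first.chunks)
    for other in rest:
        chunks.extend(other.chunks)
    return ChunkedArray(chunks)

joined = concat_like(ChunkedArray([[1, 2]]), ChunkedArray([[3]]))
plain = concat_like(np.array([1]), np.array([2]))
```

The point of the sketch is that the caller writes `concat_like(...)` once and each array type supplies its own behavior, which plain `np.concatenate` does not offer when it coerces its inputs to `ndarray`.
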
  4. We also need better data types

    Categorical
    Dates & times
    Missing data
    Physical Units
    We should be able to write new dtypes in Python
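
As a rough illustration of where these types stand, the sketch below uses only standard pandas and NumPy calls; the variable names and sample values are made up for the example. It shows pandas' own categorical type, NumPy's `datetime64`, and the way an integer column with a missing value falls back to `float64` in pandas.

```python
import numpy as np
import pandas as pd

# Categorical: pandas stores integer codes plus a table of categories.
land = pd.Categorical(["forest", "urban", "forest"])
codes = land.codes.tolist()
categories = list(land.categories)

# Dates and times: NumPy's datetime64 supports date arithmetic.
days = np.array(["2015-01-01", "2015-01-04"], dtype="datetime64[D]")
gap = days[1] - days[0]

# Missing data: integers with a missing entry are promoted to float64,
# because NaN is a floating-point value.
counts = pd.Series([1, 2, None])
```
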
  5. Indexes enable label-based alignment

    [Figure: adding an array labeled d e f g h i to an array labeled a b c d e f g. Xray style (intersection) gives d e f g; pandas style (union) gives a b c d e f g h i.]
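
Both alignment styles can be reproduced with pandas alone. This is a minimal sketch with made-up values: plain `+` gives the pandas-style union, and `Series.align(..., join="inner")` is used here to imitate the intersection behavior the slide attributes to xray.

```python
import pandas as pd

left = pd.Series(range(6), index=list("defghi"))
right = pd.Series(range(7), index=list("abcdefg"))

# pandas style (union): every label from either side is kept,
# and labels present on only one side become NaN.
union_sum = left + right

# Intersection style: keep only the labels both sides share, then add.
left_common, right_common = left.align(right, join="inner")
inner_sum = left_common + right_common
```

The union result has nine labels, five of them NaN; the intersection result has only the four shared labels and no missing values.
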
  6. Lots of other features for easy data analysis

    Feature               Pandas                            Xray
    IO                    read_csv(), .to_csv()             open_dataset(), .to_netcdf()
    Split-apply-combine   .groupby(...).mean()              .groupby(...).mean()
    Plotting              .plot()                           .plot()
    Missing values        .isnull(), .fillna(), .dropna()   .isnull(), .fillna(), .dropna()
    Convert to NumPy      .values                           .values
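
The pandas side of these features fits in a few lines. The sketch below reads CSV text from memory, so the column names and values are made up for the example; only standard pandas calls are used.

```python
import io

import pandas as pd

csv_text = "station,temp\nA,10\nA,14\nB,20\n"

# IO: read_csv accepts any file-like object, here an in-memory string.
df = pd.read_csv(io.StringIO(csv_text))

# Split-apply-combine: mean temperature per station.
means = df.groupby("station")["temp"].mean()

# Convert to NumPy: .values exposes the underlying array.
raw = df["temp"].values
```
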
  7. Printing a Dataset shows its contents

    >>> ds
    <xray.Dataset>
    Dimensions:      (time: 10, latitude: 8, longitude: 8)
    Coordinates:
      * time         (time) datetime64 2015-01-01 2015-01-02 2015-01-03 2015-01-04 ...
      * latitude     (latitude) float64 50.0 47.5 45.0 42.5 40.0 37.5 35.0 32.5
      * longitude    (longitude) float64 -105.0 -102.5 -100.0 -97.5 -95.0 -92.5 ...
        elevation    (longitude, latitude) int64 201 231 582 239 1848 1004 1004 ...
        land_cover   (longitude, latitude) object 'forest' 'urban' 'farmland'...
    Data variables:
        temperature  (time, longitude, latitude) float64 13.7 8.031 18.36 24.95 ...
        pressure     (time, longitude, latitude) float64 1.374 1.142 1.388 0.9992 ...
  8. Xray and pandas make missing data (mostly) easy

    Select it: .isnull()
    Skip it: .mean()
    Drop it: .dropna()
    Fill it: .fillna()
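
The four operations from the last slide, shown on a small pandas Series. The values are made up for the example, and pandas stands in for both libraries here.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mask = s.isnull()        # Select it: boolean mask of the missing entries
avg = s.mean()           # Skip it: NaN is ignored by default
dropped = s.dropna()     # Drop it: only the non-missing entries remain
filled = s.fillna(0.0)   # Fill it: replace NaN with a chosen value
```
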