
Pangeo Zarr Cloud Data Storage


Technical overview of zarr for cloud storage of netCDF-style datasets.

Presented at the 2018 Pangeo Developers Workshop: https://medium.com/pangeo/the-2018-pangeo-developers-workshop-1be359dac33c

Ryan Abernathey

August 13, 2018

Transcript

  1. Xarray and Zarr on the cloud: Pangeo experiments
  2. What is climate data?
     [Diagram: a data cube with coordinates time, longitude, latitude and data variables such as elevation and land_cover.]
     • Data variables: used for computation.
     • Coordinates: describe the data.
     • Attributes: metadata.
     Standard: http://cfconventions.org/
  3. How is climate data stored today?
     • Opaque binary file formats.
     • Access via dedicated C libraries; Python wrappers for the C libraries.
     • Optimized for parallel writes on HPC.
  4. File / block storage
     Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
     • The operating system provides the mechanism to read / write files and directories (e.g. POSIX).
     • Seeking and random access to bytes within files is fast (sketch below).
     • "Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data."
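     A minimal sketch of the cheap random access described above, using the standard POSIX-style file API (the file name is a placeholder):

        # Minimal sketch: random access within a local file on block/file storage.
        with open("data.bin", "rb") as f:
            f.seek(1024 * 1024)      # jump straight to byte offset 1 MiB; cheap on local disks
            chunk = f.read(4096)     # read 4 KiB starting at that offset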
  5. Object storage
     Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
     • An object is a collection of bytes associated with a unique identifier.
     • Bytes are read and written with HTTP calls (sketch below).
     • Significant overhead for each individual operation.
     • Application level (not OS dependent).
     • Implemented by S3, GCS, Azure, Ceph, etc.
     • Underlying hardware… who knows?
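     For contrast, a minimal sketch of reading the same byte range from cloud object storage through gcsfs, where every read becomes an HTTP request (anonymous access; the bucket and object names are placeholders):

        import gcsfs

        # Minimal sketch: byte-range reads from Google Cloud Storage via gcsfs.
        fs = gcsfs.GCSFileSystem(token="anon")       # anonymous access
        with fs.open("some-bucket/some-object.bin", "rb") as f:
            f.seek(1024 * 1024)
            chunk = f.read(4096)                     # issued as an HTTP range request, with per-call overhead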
  6. FUSE
     • Permits any storage layer (FTP, iRODS, S3) to look like a POSIX filesystem.
  7. HDF files in object storage?
     By Matt Rocklin (Anaconda): http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud
     • Cloud Optimized GeoTIFF. Pros: fast, well-established. Cons: data model not sophisticated enough.
     • HDF + FUSE. Pros: works with existing files, no changes to the HDF library needed. Cons: complex, low performance, brittle.
     • HDF + custom reader. Pros: works with existing files, no complex FUSE tricks. Cons: requires plugins to the HDF library and tweaks to downstream libraries.
     • Build a distributed service. Pros: offloads the problem to others, maintains a stable API. Cons: complex, introduces an intermediary, probably not free.
     • New storage format (e.g. zarr). Pros: fast, intuitive, modern. Cons: not a community standard.
  8. Sharing data in the cloud era
     Traditional approach: a data access portal.
     [Diagram: data granules (netCDF files: file.0001.nc, file.0002.nc, …) behind a data access server in the data center; clients connect over the internet.]
  9. Sharing data in the cloud era
     Direct access to cloud object storage.
     [Diagram: data granules (netCDF files or something new) stored as chunks (chunk.0.0.0, chunk.0.0.1, …) in cloud object storage with a catalog; clients run on cloud compute instances in the same cloud data center.]
  10. Zarr
      • Python library for storage of chunked, compressed N-D arrays.
      • Developed by Alistair Miles (Imperial) for genomics research (@alimanfoo).
      • Arrays are split into user-defined chunks; each chunk is optionally compressed (zlib, zstd, etc.).
      • Can store arrays in memory, directories, zip files, or any Python mutable mapping interface (dictionary).
      • External libraries (s3fs, gcsfs) provide a way to store directly into cloud object storage (sketch below).
      [Diagram: a Zarr group (group_name: .zgroup, .zattrs) containing a Zarr array (array_name: .zarray, .zattrs, chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1).]
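      A minimal sketch of the zarr API described on this slide, using the zarr v2 / numcodecs interfaces current at the time of the talk (the array name, path, and data are placeholders):

         import numpy as np
         import zarr
         from numcodecs import Blosc

         # Chunked, compressed array in a directory store on local disk.
         compressor = Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE)
         store = zarr.DirectoryStore("example.zarr")
         root = zarr.group(store=store)
         z = root.create_dataset("array_name", shape=(8901, 720, 1440),
                                 chunks=(5, 720, 1440), dtype="f8",
                                 compressor=compressor)
         z[0] = np.random.rand(720, 1440)    # only the chunks touched by the write are compressed and stored
         z.attrs["units"] = "m"              # attributes land in the array's .zattrs JSON file

         # The same API works against any mutable mapping, e.g. a plain dict
         # (or a gcsfs/s3fs mapping for cloud object storage).
         in_memory = zarr.group(store={})

      The directory store above produces exactly the .zgroup / .zattrs / .zarray files and numbered chunk objects shown in the diagram.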
  11. Zarr: example .zarray file (JSON)
      {
          "chunks": [5, 720, 1440],
          "compressor": {
              "blocksize": 0,
              "clevel": 3,
              "cname": "zstd",
              "id": "blosc",
              "shuffle": 2
          },
          "dtype": "<f8",
          "fill_value": "NaN",
          "filters": null,
          "order": "C",
          "shape": [8901, 720, 1440],
          "zarr_format": 2
      }
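      To make the layout concrete, a minimal sketch of decoding one chunk by hand from a store like the one above, driven by the .zarray metadata shown on this slide (key names follow the zarr v2 spec; the store path is a placeholder):

         import json
         import numpy as np
         import numcodecs
         import zarr

         store = zarr.DirectoryStore("example.zarr")
         meta = json.loads(store["array_name/.zarray"])     # the JSON document shown above
         codec = numcodecs.get_codec(meta["compressor"])    # blosc / zstd / shuffle=2
         raw = store["array_name/0.0.0"]                    # compressed bytes of chunk (0, 0, 0)
         block = np.frombuffer(codec.decode(raw), dtype=meta["dtype"]).reshape(meta["chunks"])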
  12. Zarr: example .zattrs file (JSON)
      {
          "_ARRAY_DIMENSIONS": ["time", "latitude", "longitude"],
          "comment": "The sea level anomaly is the sea surface height above mean sea surface; it is referenced to the [1993, 2012] period; see the product user manual for details",
          "coordinates": "crs",
          "grid_mapping": "crs",
          "long_name": "Sea level anomaly",
          "standard_name": "sea_surface_height_above_sea_level",
          "units": "m"
      }
  13. Xarray + Zarr
      • Developed a new xarray backend which allows xarray to read and write directly to a Zarr store (with @jhamman).
      • It was pretty easy! The data models are quite similar.
      • Automatic mapping between zarr chunks <-> dask chunks.
      • We needed to add a custom, "hidden" attribute (_ARRAY_DIMENSIONS) to give the zarr arrays dimensions (sketch below).
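      A minimal sketch of the round trip and the hidden dimension attribute (synthetic data; the path is a placeholder):

         import numpy as np
         import xarray as xr
         import zarr

         # Write a small synthetic dataset to a zarr store via the xarray backend.
         ds = xr.Dataset(
             {"sla": (("time", "latitude", "longitude"), np.zeros((10, 72, 144)))},
             coords={"time": np.arange(10)},
         )
         ds.to_zarr("example_ds.zarr", mode="w")

         # Opening the same store with plain zarr exposes the hidden attribute
         # that records each array's dimension names.
         root = zarr.open_group("example_ds.zarr", mode="r")
         print(root["sla"].attrs["_ARRAY_DIMENSIONS"])   # ['time', 'latitude', 'longitude']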
  14. Preparing datasets for zarr cloud storage
      1. Open the original data files into a single xarray dataset with reasonable chunks:
         ds = xr.open_mfdataset('bunch_o_files_*.nc', chunks={'time': 1})
      2. Export to zarr:
         ds.to_zarr('/path/to/zarr/directory')
         -- or --
         ds.to_zarr(gcsmap_object)
      3. [maybe] Upload to cloud storage:
         $ gsutil -m cp -r /path/to/zarr/directory gs://pangeo-data/path
      (An end-to-end sketch follows below.)
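      A rough end-to-end sketch of the workflow above, including writing straight into object storage through a gcsfs mapping (the file pattern, project, and bucket path are placeholders):

         import gcsfs
         import xarray as xr

         # 1. Open the original netCDF files as one lazily chunked dataset.
         ds = xr.open_mfdataset("bunch_o_files_*.nc", chunks={"time": 1})

         # 2a. Export to a local zarr directory (then upload with gsutil, step 3)...
         ds.to_zarr("/path/to/zarr/directory")

         # 2b. ...or write directly into cloud object storage via a mutable mapping.
         fs = gcsfs.GCSFileSystem(project="my-project")
         gcsmap_object = gcsfs.mapping.GCSMap("my-bucket/path/to/zarr", gcs=fs)
         ds.to_zarr(gcsmap_object)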
  15. Reading zarr data from Pangeo cloud environments
      gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
      ds = xr.open_zarr(gcsmap)

      <xarray.Dataset>
      Dimensions:    (latitude: 720, longitude: 1440, nv: 2, time: 8901)
      Coordinates:
          crs        int32 ...
          lat_bnds   (time, latitude, nv) float32 dask.array<shape=(8901, 720, 2), chunksize=(5, 720, 2)>
        * latitude   (latitude) float32 -89.875 -89.625 -89.375 -89.125 -88.875 ...
          lon_bnds   (longitude, nv) float32 dask.array<shape=(1440, 2), chunksize=(1440, 2)>
        * longitude  (longitude) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 ...
        * nv         (nv) int32 0 1
        * time       (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
      Data variables:
          adt        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          err        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          sla        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          ugos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          ugosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          vgos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          vgosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
      Attributes:
          Conventions:             CF-1.6
          Metadata_Conventions:    Unidata Dataset Discovery v1.0
          cdm_data_type:           Grid
          comment:                 Sea Surface Height measured by Altimetry...
          contact:                 [email protected]
          creator_email:           [email protected]
          creator_name:            CMEMS - Sea Level Thematic Assembly Center
          creator_url:             http://marine.copernicus.eu
          date_created:            2014-02-26T16:09:13Z
          date_issued:             2014-01-06T00:00:00Z
          date_modified:           2015-11-10T19:42:51Z
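      Once opened this way the variables are dask arrays, so analysis can be expressed lazily and only the required chunks are pulled from object storage when it is computed (the particular reduction below is just an illustration):

         # Lazy reduction over the dataset opened above; nothing is read yet.
         sla_mean = ds.sla.mean(dim=("latitude", "longitude"))

         # Triggering the computation fetches and processes the zarr chunks in parallel.
         sla_mean_values = sla_mean.compute()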
  16. [Image-only slide; no transcript text.]

  17. The future for Zarr
      • We consider this work a "proof of concept".
      • Zarr is not a community standard.
      • Zarr is not a single file (like GeoTIFF), so it can't really be "downloaded".
      • Only Python (for now).