
Pangeo Zarr Cloud Data Storage


Technical overview of zarr for cloud storage of netCDF-style datasets.

Presented at the 2018 Pangeo Developers Workshop: https://medium.com/pangeo/the-2018-pangeo-developers-workshop-1be359dac33c

Ryan Abernathey

August 13, 2018

Transcript

  1. Xarray and Zarr on the cloud: Pangeo experiments
  2. What is climate data?
     [Diagram: a data cube with coordinates time, longitude, latitude and data variables such as elevation and land_cover.]
     • Data variables: used for computation.
     • Coordinates: describe the data.
     • Attributes: metadata.
     Standard: http://cfconventions.org/
  3. How is climate data stored today?
     • Opaque binary file formats.
     • Access via dedicated C libraries; Python wrappers for the C libraries.
     • Optimized for parallel writes on HPC.
  4. File / block storage
     Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
     • The operating system provides the mechanism to read / write files and directories (e.g. POSIX).
     • Seeking and random access to bytes within files is fast (sketch below).
     • "Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data."
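     A minimal sketch of the cheap random access described above, using the standard POSIX-style file API (the file name is a placeholder):

        # Minimal sketch: random access within a local file on block/file storage.
        with open("data.bin", "rb") as f:
            f.seek(1024 * 1024)      # jump straight to byte offset 1 MiB; cheap on local disks
            chunk = f.read(4096)     # read 4 KiB starting at that offset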
  5. Object storage
     Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
     • An object is a collection of bytes associated with a unique identifier.
     • Bytes are read and written with HTTP calls (sketch below).
     • Significant overhead for each individual operation.
     • Application level (not OS dependent).
     • Implemented by S3, GCS, Azure, Ceph, etc.
     • Underlying hardware… who knows?
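     For contrast, a minimal sketch of reading the same byte range from cloud object storage through gcsfs, where every read becomes an HTTP request (anonymous access; the bucket and object names are placeholders):

        import gcsfs

        # Minimal sketch: byte-range reads from Google Cloud Storage via gcsfs.
        fs = gcsfs.GCSFileSystem(token="anon")       # anonymous access
        with fs.open("some-bucket/some-object.bin", "rb") as f:
            f.seek(1024 * 1024)
            chunk = f.read(4096)                     # issued as an HTTP range request, with per-call overhead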
  6. FUSE
     • Permits any storage layer (FTP, iRODS, S3) to look like a POSIX filesystem.
  7. HDF files in object storage?
     By Matt Rocklin (Anaconda): http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud
     • Cloud Optimized GeoTIFF. Pros: fast, well-established. Cons: data model not sophisticated enough.
     • HDF + FUSE. Pros: works with existing files, no changes to the HDF library needed. Cons: complex, low performance, brittle.
     • HDF + custom reader. Pros: works with existing files, no complex FUSE tricks. Cons: requires plugins to the HDF library and tweaks to downstream libraries.
     • Build a distributed service. Pros: offloads the problem to others, maintains a stable API. Cons: complex, introduces an intermediary, probably not free.
     • New storage format (e.g. zarr). Pros: fast, intuitive, modern. Cons: not a community standard.
  8. Sharing data in the cloud era
     Traditional approach: a data access portal.
     [Diagram: data granules (netCDF files: file.0001.nc, file.0002.nc, …) behind a data access server in the data center; clients connect over the internet.]
  9. Sharing data in the cloud era
     Direct access to cloud object storage.
     [Diagram: data granules (netCDF files or something new) stored as chunks (chunk.0.0.0, chunk.0.0.1, …) in cloud object storage with a catalog; clients run on cloud compute instances in the same cloud data center.]
  10. Zarr
      • Python library for storage of chunked, compressed N-D arrays.
      • Developed by Alistair Miles (Imperial) for genomics research (@alimanfoo).
      • Arrays are split into user-defined chunks; each chunk is optionally compressed (zlib, zstd, etc.).
      • Can store arrays in memory, directories, zip files, or any Python mutable mapping interface (dictionary).
      • External libraries (s3fs, gcsfs) provide a way to store directly into cloud object storage (sketch below).
      [Diagram: a Zarr group (group_name: .zgroup, .zattrs) containing a Zarr array (array_name: .zarray, .zattrs, chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1).]
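      A minimal sketch of the zarr API described on this slide, using the zarr v2 / numcodecs interfaces current at the time of the talk (the array name, path, and data are placeholders):

         import numpy as np
         import zarr
         from numcodecs import Blosc

         # Chunked, compressed array in a directory store on local disk.
         compressor = Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE)
         store = zarr.DirectoryStore("example.zarr")
         root = zarr.group(store=store)
         z = root.create_dataset("array_name", shape=(8901, 720, 1440),
                                 chunks=(5, 720, 1440), dtype="f8",
                                 compressor=compressor)
         z[0] = np.random.rand(720, 1440)    # only the chunks touched by the write are compressed and stored
         z.attrs["units"] = "m"              # attributes land in the array's .zattrs JSON file

         # The same API works against any mutable mapping, e.g. a plain dict
         # (or a gcsfs/s3fs mapping for cloud object storage).
         in_memory = zarr.group(store={})

      The directory store above produces exactly the .zgroup / .zattrs / .zarray files and numbered chunk objects shown in the diagram.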
  11. Zarr: example .zarray file (JSON)
      {
          "chunks": [5, 720, 1440],
          "compressor": {
              "blocksize": 0,
              "clevel": 3,
              "cname": "zstd",
              "id": "blosc",
              "shuffle": 2
          },
          "dtype": "<f8",
          "fill_value": "NaN",
          "filters": null,
          "order": "C",
          "shape": [8901, 720, 1440],
          "zarr_format": 2
      }
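      To make the layout concrete, a minimal sketch of decoding one chunk by hand from a store like the one above, driven by the .zarray metadata shown on this slide (key names follow the zarr v2 spec; the store path is a placeholder):

         import json
         import numpy as np
         import numcodecs
         import zarr

         store = zarr.DirectoryStore("example.zarr")
         meta = json.loads(store["array_name/.zarray"])     # the JSON document shown above
         codec = numcodecs.get_codec(meta["compressor"])    # blosc / zstd / shuffle=2
         raw = store["array_name/0.0.0"]                    # compressed bytes of chunk (0, 0, 0)
         block = np.frombuffer(codec.decode(raw), dtype=meta["dtype"]).reshape(meta["chunks"])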
  12. Zarr: example .zattrs file (JSON)
      {
          "_ARRAY_DIMENSIONS": ["time", "latitude", "longitude"],
          "comment": "The sea level anomaly is the sea surface height above mean sea surface; it is referenced to the [1993, 2012] period; see the product user manual for details",
          "coordinates": "crs",
          "grid_mapping": "crs",
          "long_name": "Sea level anomaly",
          "standard_name": "sea_surface_height_above_sea_level",
          "units": "m"
      }
  13. Xarray + Zarr
      • Developed a new xarray backend which allows xarray to read and write directly to a Zarr store (with @jhamman).
      • It was pretty easy! The data models are quite similar.
      • Automatic mapping between zarr chunks <-> dask chunks.
      • We needed to add a custom, "hidden" attribute (_ARRAY_DIMENSIONS) to give the zarr arrays dimensions (sketch below).
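      A minimal sketch of the round trip and the hidden dimension attribute (synthetic data; the path is a placeholder):

         import numpy as np
         import xarray as xr
         import zarr

         # Write a small synthetic dataset to a zarr store via the xarray backend.
         ds = xr.Dataset(
             {"sla": (("time", "latitude", "longitude"), np.zeros((10, 72, 144)))},
             coords={"time": np.arange(10)},
         )
         ds.to_zarr("example_ds.zarr", mode="w")

         # Opening the same store with plain zarr exposes the hidden attribute
         # that records each array's dimension names.
         root = zarr.open_group("example_ds.zarr", mode="r")
         print(root["sla"].attrs["_ARRAY_DIMENSIONS"])   # ['time', 'latitude', 'longitude']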
  14. Preparing datasets for zarr cloud storage
      1. Open the original data files into a single xarray dataset with reasonable chunks:
         ds = xr.open_mfdataset('bunch_o_files_*.nc', chunks={'time': 1})
      2. Export to zarr:
         ds.to_zarr('/path/to/zarr/directory')
         -- or --
         ds.to_zarr(gcsmap_object)
      3. [maybe] Upload to cloud storage:
         $ gsutil -m cp -r /path/to/zarr/directory gs://pangeo-data/path
      (An end-to-end sketch follows below.)
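      A rough end-to-end sketch of the workflow above, including writing straight into object storage through a gcsfs mapping (the file pattern, project, and bucket path are placeholders):

         import gcsfs
         import xarray as xr

         # 1. Open the original netCDF files as one lazily chunked dataset.
         ds = xr.open_mfdataset("bunch_o_files_*.nc", chunks={"time": 1})

         # 2a. Export to a local zarr directory (then upload with gsutil, step 3)...
         ds.to_zarr("/path/to/zarr/directory")

         # 2b. ...or write directly into cloud object storage via a mutable mapping.
         fs = gcsfs.GCSFileSystem(project="my-project")
         gcsmap_object = gcsfs.mapping.GCSMap("my-bucket/path/to/zarr", gcs=fs)
         ds.to_zarr(gcsmap_object)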
  15. Reading zarr data from Pangeo cloud environments
      gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
      ds = xr.open_zarr(gcsmap)

      <xarray.Dataset>
      Dimensions:    (latitude: 720, longitude: 1440, nv: 2, time: 8901)
      Coordinates:
          crs        int32 ...
          lat_bnds   (time, latitude, nv) float32 dask.array<shape=(8901, 720, 2), chunksize=(5, 720, 2)>
        * latitude   (latitude) float32 -89.875 -89.625 -89.375 -89.125 -88.875 ...
          lon_bnds   (longitude, nv) float32 dask.array<shape=(1440, 2), chunksize=(1440, 2)>
        * longitude  (longitude) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 ...
        * nv         (nv) int32 0 1
        * time       (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
      Data variables:
          adt        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          err        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          sla        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          ugos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          ugosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          vgos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
          vgosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
      Attributes:
          Conventions:             CF-1.6
          Metadata_Conventions:    Unidata Dataset Discovery v1.0
          cdm_data_type:           Grid
          comment:                 Sea Surface Height measured by Altimetry...
          contact:                 [email protected]
          creator_email:           [email protected]
          creator_name:            CMEMS - Sea Level Thematic Assembly Center
          creator_url:             http://marine.copernicus.eu
          date_created:            2014-02-26T16:09:13Z
          date_issued:             2014-01-06T00:00:00Z
          date_modified:           2015-11-10T19:42:51Z
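      Once opened this way the variables are dask arrays, so analysis can be expressed lazily and only the required chunks are pulled from object storage when it is computed (the particular reduction below is just an illustration):

         # Lazy reduction over the dataset opened above; nothing is read yet.
         sla_mean = ds.sla.mean(dim=("latitude", "longitude"))

         # Triggering the computation fetches and processes the zarr chunks in parallel.
         sla_mean_values = sla_mean.compute()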
  16. [Image-only slide; no transcript text.]

  17. The future for Zarr
      • We consider this work a "proof of concept".
      • Zarr is not a community standard.
      • Zarr is not a single file (like GeoTIFF), so it can't really be "downloaded".
      • Only Python (for now).