Cloud Native Climate Data with Zarr and XArray

Zarr: Beyond netCDF
Cloud-optimized storage formats for climate data

What is climate data?
• Data variables: used for computation
• Coordinates: describe data
• Attributes
• Metadata standard: http://cfconventions.org/
[Figure: xarray dataset diagram with labels time, longitude, latitude, elevation, land_cover. Image credit: Stephan Hoyer (xarray)]

What science do we want to do with climate data?
• Take the mean!
• Analyze spatiotemporal variability
• Machine learning! (Image credit: Berkeley Lab)

How is climate data stored today?
• Opaque binary file formats.
• Access via dedicated C libraries.
• Python wrappers for C libraries.
• Optimized for parallel writes on HPC.

File / Block storage
• The operating system provides a mechanism to read and write files and directories (e.g. POSIX).
• Seeking and random access to bytes within files is fast.
• “Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data”
Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file

Object storage
• An object is a collection of bytes associated with a unique identifier.
• Bytes are read and written with HTTP calls.
• Significant overhead for each individual operation.
• Application level (not OS dependent).
• Implemented by S3, GCS, Azure, Ceph, etc.
• Underlying hardware… who knows?
Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
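The per-operation overhead of object storage can be made concrete with a rough cost model. The sketch below is only an illustration: the latency and bandwidth figures are assumed round numbers, not measurements of any real service, and the function name transfer_time is hypothetical.

```python
def transfer_time(n_requests, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimate time for sequential requests: fixed overhead per request plus streaming time."""
    return n_requests * latency_s + total_bytes / bandwidth_bytes_per_s


# Illustrative assumptions, not measurements of any real object store.
LATENCY_S = 0.05           # assumed 50 ms of overhead per HTTP request
BANDWIDTH = 100e6          # assumed 100 MB/s of sustained throughput
TOTAL_BYTES = 1_000_000_000  # 1 GB of data in both scenarios

# Same data, stored as 100,000 objects of 10 kB or as 25 objects of 40 MB.
many_small = transfer_time(100_000, TOTAL_BYTES, LATENCY_S, BANDWIDTH)
few_large = transfer_time(25, TOTAL_BYTES, LATENCY_S, BANDWIDTH)

print(f"many small objects: {many_small:.0f} s")
print(f"few large objects:  {few_large:.2f} s")
```

Under these assumptions the request overhead dominates when objects are small, which is why the size of each stored object matters far more on object storage than on a local file system.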

Central problem:
• We don’t know how to issue an HTTP request to peek into an HDF file.
• The HDF client library does not support this.
• If we want to know what’s in a file, we have to download the whole thing.

zarr
• Open source Python library for storage of chunked, compressed ND-arrays
• Developed by Alistair Miles (Imperial) for genomics research (@alimanfoo)
• Arrays are split into user-defined chunks; each chunk is optionally compressed (zlib, zstd, etc.)
• Can store arrays in memory, directories, zip files, or any Python mutable mapping interface (dictionary)
• External libraries (s3fs, gcsfs) provide a way to store directly into cloud object storage
• https://zarr.readthedocs.io/

Store layout:
Zarr Group: group_name
    .zgroup
    .zattrs
    Zarr Array: array_name
        .zarray
        .zattrs
        0.0  0.1  1.0  1.1  2.0  2.1
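The idea of chunked, compressed storage in a mutable mapping can be sketched with the standard library alone. This is not Zarr's implementation: the helper names write_chunked and read_chunk are hypothetical, metadata files are omitted, and edge chunks are not padded. It only mirrors the "array_name/i.j" key layout shown on the slide.

```python
import struct
import zlib


def write_chunked(store, name, rows, chunk_shape):
    """Split a 2-D list of floats into chunks, compress each, and store it under 'name/i.j'."""
    n_rows, n_cols = len(rows), len(rows[0])
    c_rows, c_cols = chunk_shape
    for i in range(0, n_rows, c_rows):
        for j in range(0, n_cols, c_cols):
            # Flatten the block in row-major ("C") order.
            block = [v for row in rows[i:i + c_rows] for v in row[j:j + c_cols]]
            raw = struct.pack(f"<{len(block)}d", *block)  # little-endian float64
            store[f"{name}/{i // c_rows}.{j // c_cols}"] = zlib.compress(raw)


def read_chunk(store, name, i, j):
    """Fetch and decompress one chunk without touching any other key."""
    raw = zlib.decompress(store[f"{name}/{i}.{j}"])
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


store = {}  # any mutable mapping works: a dict here, an object-store wrapper in the cloud
data = [[float(r * 4 + c) for c in range(4)] for r in range(6)]  # a 6 x 4 array
write_chunked(store, "array_name", data, chunk_shape=(2, 2))

print(sorted(store))
print(read_chunk(store, "array_name", 1, 0))
```

Because each chunk lives under its own key, a reader needs only one lookup per chunk, which maps naturally onto one HTTP request per object.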

zarr: example .zarray file (JSON)

    {
        "chunks": [5, 720, 1440],
        "compressor": {
            "blocksize": 0,
            "clevel": 3,
            "cname": "zstd",
            "id": "blosc",
            "shuffle": 2
        },
        "dtype": "<f8",
        "fill_value": "NaN",
        "filters": null,
        "order": "C",
        "shape": [8901, 720, 1440],
        "zarr_format": 2
    }
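The chunk grid implied by this metadata can be worked out directly from the "shape", "chunks", and "dtype" fields. The following is a minimal sketch using only those three fields from the example; the dtype parsing is an assumption that holds for simple dtypes such as "<f8", and chunk_key is a hypothetical helper that follows the dot-separated key naming shown on the slides.

```python
import json
import math

zarray = json.loads("""
{
    "chunks": [5, 720, 1440],
    "dtype": "<f8",
    "shape": [8901, 720, 1440],
    "zarr_format": 2
}
""")

# "<f8" means little-endian float64; the trailing digits give the item size in bytes.
itemsize = int(zarray["dtype"][2:])

# Chunks along each dimension; the last chunk along an axis may be partial.
chunks_per_dim = [math.ceil(s / c) for s, c in zip(zarray["shape"], zarray["chunks"])]
n_chunks = math.prod(chunks_per_dim)

chunk_nbytes = math.prod(zarray["chunks"]) * itemsize  # uncompressed size of one chunk
array_nbytes = math.prod(zarray["shape"]) * itemsize   # uncompressed size of the array


def chunk_key(index, chunks):
    """Key of the chunk that holds the element at `index`."""
    return ".".join(str(i // c) for i, c in zip(index, chunks))


print(chunks_per_dim, n_chunks)
print(chunk_nbytes, array_nbytes)
print(chunk_key((8900, 0, 0), zarray["chunks"]))
```

Each chunk here holds five full time steps of the global grid, about 41 MB before compression, so reading one time step costs a single object request.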

zarr: example .zattrs file (JSON)

    {
        "_ARRAY_DIMENSIONS": ["time", "latitude", "longitude"],
        "comment": "The sea level anomaly is the sea surface height above mean sea surface; it is referenced to the [1993, 2012] period; see the product user manual for details",
        "coordinates": "crs",
        "grid_mapping": "crs",
        "long_name": "Sea level anomaly",
        "standard_name": "sea_surface_height_above_sea_level",
        "units": "m"
    }

Sharing data in the cloud era
Traditional approach: a data access portal
[Figure: inside the data center, data granules (netCDF files file.0001.nc, file.0002.nc, file.0003.nc, file.0004.nc) sit behind a data access server; clients reach the server over the internet.]

Sharing data in the cloud era
Direct access to cloud object storage
[Figure: inside the cloud data center, cloud object storage holds metadata and data granules (netCDF files or something new) as chunk.0.0.0, chunk.0.0.1, chunk.0.0.2, chunk.0.0.3; the clients are cloud compute instances that read the objects directly.]

Xarray + zarr
• Developed a new xarray backend that lets xarray read from and write to a Zarr store directly (with @jhamman)
• It was pretty easy! The data models are quite similar
• Automatic mapping between zarr chunks <-> dask chunks
• We needed to add a custom, “hidden” attribute (_ARRAY_DIMENSIONS) to give the zarr arrays named dimensions
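How the hidden attribute supplies dimension names can be sketched in a few lines. This is not xarray's backend code: the function describe is hypothetical, and the two dictionaries are trimmed copies of the example .zarray and .zattrs contents shown earlier in the deck.

```python
zarray = {"shape": [8901, 720, 1440], "chunks": [5, 720, 1440]}
zattrs = {
    "_ARRAY_DIMENSIONS": ["time", "latitude", "longitude"],
    "long_name": "Sea level anomaly",
    "units": "m",
}


def describe(zarray, zattrs, hidden="_ARRAY_DIMENSIONS"):
    """Pair dimension names with sizes and chunk sizes, and hide the helper attribute."""
    dims = zattrs[hidden]
    if len(dims) != len(zarray["shape"]):
        raise ValueError("dimension names do not match the array rank")
    sizes = dict(zip(dims, zarray["shape"]))
    chunks = dict(zip(dims, zarray["chunks"]))
    # Everything except the hidden key is passed on as ordinary user-facing attributes.
    attrs = {k: v for k, v in zattrs.items() if k != hidden}
    return sizes, chunks, attrs


sizes, chunks, attrs = describe(zarray, zattrs)
print(sizes)
print(chunks)
print(attrs)
```

Zarr arrays only know their shape, so storing the names as an ordinary attribute is a lightweight way to round-trip labelled dimensions without changing the storage format.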

Preparing datasets for zarr cloud storage
1. Open the original data files into a single xarray dataset with reasonable chunks
       ds = xr.open_mfdataset('bunch_o_files_*.nc', chunks={'time': 1})
2. Export to zarr
       ds.to_zarr('/path/to/zarr/directory')
   or
       ds.to_zarr(gcsmap_object)
3. [maybe] upload to cloud storage
       $ gsutil -m cp -r /path/to/zarr/directory gs://pangeo-data/path

Reading zarr data from pangeo cloud environments

    import gcsfs
    import xarray as xr

    # Wrap the bucket path in a mutable mapping that xarray can read from
    gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
    ds = xr.open_zarr(gcsmap)

    <xarray.Dataset>
    Dimensions:    (latitude: 720, longitude: 1440, nv: 2, time: 8901)
    Coordinates:
        crs        int32 ...
        lat_bnds   (time, latitude, nv) float32 dask.array<shape=(8901, 720, 2), chunksize=(5, 720, 2)>
      * latitude   (latitude) float32 -89.875 -89.625 -89.375 -89.125 -88.875 ...
        lon_bnds   (longitude, nv) float32 dask.array<shape=(1440, 2), chunksize=(1440, 2)>
      * longitude  (longitude) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 ...
      * nv         (nv) int32 0 1
      * time       (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
    Data variables:
        adt        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        err        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        sla        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        ugos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        ugosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        vgos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        vgosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
    Attributes:
        Conventions:           CF-1.6
        Metadata_Conventions:  Unidata Dataset Discovery v1.0
        cdm_data_type:         Grid
        comment:               Sea Surface Height measured by Altimetry...
        contact:               [email protected]
        creator_email:         [email protected]
        creator_name:          CMEMS - Sea Level Thematic Assembly Center
        creator_url:           http://marine.copernicus.eu
        date_created:          2014-02-26T16:09:13Z
        date_issued:           2014-01-06T00:00:00Z
        date_modified:         2015-11-10T19:42:51Z

Comparison with THREDDS
• Worked with Luca Cinquini, Hans Vahlenkamp, and Aparna Radhakrishnan to connect Pangeo to an ESGF server running in Google Cloud
• Used Dask to issue parallel OPeNDAP reads from a cluster

The future for Zarr
• Zarr started in Python; there are now implementations in C++, Java, and Julia.
• Unidata NetCDF is looking at Zarr as a potential backend for the next generation of NetCDF.
• You can use zarr today on pangeo cloud environments!
http://pangeo.io

Ryan Abernathey
