Slide 1

Slide 1 text

Zarr: Beyond netCDF. Cloud-optimized storage formats for climate data

Slide 2

Slide 2 text

Zarr: Beyond netCDF. Cloud-optimized storage formats for climate data. HDF

Slide 3

Slide 3 text

What is climate data?
(Diagram: an example dataset with dimensions time, longitude, and latitude; data variables such as elevation and land_cover used for computation; coordinates that describe the data; and attributes that hold metadata.)
Metadata standard: http://cfconventions.org/
Image credit: Stephan Hoyer (xarray)
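The data model on this slide maps directly onto an xarray Dataset. A minimal sketch, where the variable names, sizes, and values are illustrative rather than taken from the talk:

```python
import numpy as np
import xarray as xr

# Illustrative Dataset: one data variable on a (time, latitude, longitude)
# grid, coordinates describing each axis, and CF-style metadata attributes.
ds = xr.Dataset(
    data_vars={
        "elevation": (
            ("time", "latitude", "longitude"),
            np.zeros((2, 3, 4)),
            {"units": "m", "standard_name": "surface_altitude"},
        ),
    },
    coords={
        "time": np.array(["2000-01-01", "2000-02-01"], dtype="datetime64[ns]"),
        "latitude": [-30.0, 0.0, 30.0],
        "longitude": [0.0, 90.0, 180.0, 270.0],
    },
    attrs={"Conventions": "CF-1.6"},
)
```

The attribute names here follow the CF conventions linked above; any standard-compliant names would work the same way.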

Slide 4

Slide 4 text

What science do we want to do with climate data?

Slide 5

Slide 5 text

What science do we want to do with climate data? Take the mean!

Slide 6

Slide 6 text

What science do we want to do with climate data? Analyze spatiotemporal variability
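Both of these analyses reduce an array along named dimensions. A sketch with synthetic data (the `sst` variable and its sizes are invented for illustration):

```python
import numpy as np
import xarray as xr

# Synthetic sea-surface-temperature field: 12 time steps on a 3x4 grid.
sst = xr.DataArray(
    np.random.default_rng(42).normal(15.0, 2.0, size=(12, 3, 4)),
    dims=("time", "latitude", "longitude"),
    name="sst",
)

time_mean = sst.mean(dim="time")   # "take the mean" over time
variability = sst.std(dim="time")  # temporal variability at each grid point
```

Reducing by dimension name rather than axis number is what makes the same line of analysis code work across differently shaped datasets.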

Slide 7

Slide 7 text

What science do we want to do with climate data? Machine learning! (Image credit: Berkeley Lab)

Slide 8

Slide 8 text

How is climate data stored today?
• Opaque binary file formats.
• Access via dedicated C libraries.
• Python wrappers for C libraries.
• Optimized for parallel writes on HPC.

Slide 9

Slide 9 text

File / block storage
• The operating system provides mechanisms to read / write files and directories (e.g. POSIX).
• Seeking and random access to bytes within files is fast.
• “Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data”
Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
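The fast random access mentioned above is visible in plain POSIX-style file I/O (the file name and byte offsets here are arbitrary):

```python
import os
import tempfile

# Write 256 bytes to a scratch file, then read 4 bytes from the middle.
path = os.path.join(tempfile.mkdtemp(), "blob.bin")
with open(path, "wb") as f:
    f.write(bytes(range(256)))

with open(path, "rb") as f:
    f.seek(100)         # jump straight to byte offset 100: cheap on a file system
    middle = f.read(4)  # read only the 4 bytes we asked for
```

This ability to seek within a file is exactly what the HDF/netCDF C libraries rely on, and what object storage does not offer in the same form.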

Slide 10

Slide 10 text

Object storage
• An object is a collection of bytes associated with a unique identifier.
• Bytes are read and written with HTTP calls.
• There is significant overhead for each individual operation.
• Works at the application level (not OS dependent).
• Implemented by S3, GCS, Azure, Ceph, etc.
• Underlying hardware… who knows?
Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
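Conceptually, an object store is just a flat mapping from unique keys to byte strings, with each operation carried over HTTP. A sketch using a plain dict as a stand-in store (the bucket and key names are invented):

```python
# A dict stands in for the object store; real services (S3, GCS, Azure)
# expose the same key -> bytes model over HTTP.
store = {}

# "PUT": write an object's bytes under its unique key (one HTTP call each).
store["bucket/dataset/chunk.0.0"] = b"\x00\x01\x02\x03"
store["bucket/dataset/chunk.0.1"] = b"\x04\x05\x06\x07"

# "GET": retrieve a whole object by key; there is no seeking inside it.
data = store["bucket/dataset/chunk.0.0"]

# "LIST": enumerate objects under a key prefix.
keys = [k for k in store if k.startswith("bucket/dataset/")]
```

This mapping interface is the hook Zarr exploits on the next slides: anything that behaves like this dict can serve as a store.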

Slide 11

Slide 11 text

Central problem: We don’t know how to issue an HTTP request to peek into an HDF file; the HDF client library does not support this. If we want to know what’s in a file, we have to download the whole thing.

Slide 12

Slide 12 text

Zarr
• Open-source Python library for storage of chunked, compressed N-D arrays.
• Developed by Alistair Miles (Imperial) for genomics research (@alimanfoo).
• Arrays are split into user-defined chunks; each chunk is optionally compressed (zlib, zstd, etc.).
• Can store arrays in memory, directories, zip files, or anything with a Python mutable mapping interface (i.e. a dictionary).
• External libraries (s3fs, gcsfs) provide a way to store directly into cloud object storage.
(Diagram: a Zarr group “group_name” holding .zgroup and .zattrs files, and a Zarr array “array_name” holding .zarray, .zattrs, and chunk files 0.0, 0.1, 1.0, 1.1, 2.0, 2.1.)
https://zarr.readthedocs.io/
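The storage model can be sketched without the zarr library itself: split the array into chunks, compress each chunk independently, and write everything, JSON metadata included, into any mutable mapping. The array values and chunk sizes below are illustrative:

```python
import json
import zlib

import numpy as np

# A plain dict plays the role of the store (directory, zip file, or
# cloud bucket in real use).
store = {}
arr = np.arange(24, dtype="<i8").reshape(4, 6)
chunk_shape = (2, 3)

# Array metadata lives in a small JSON document (cf. the .zarray file).
store[".zarray"] = json.dumps(
    {"shape": list(arr.shape), "chunks": list(chunk_shape), "dtype": "<i8"}
).encode()

# Each chunk is compressed independently and keyed by its grid position.
for i in range(0, 4, 2):
    for j in range(0, 6, 3):
        chunk = arr[i:i + 2, j:j + 3]
        store[f"{i // 2}.{j // 3}"] = zlib.compress(chunk.tobytes())

# Reading one chunk touches only its key: no whole-file download needed.
meta = json.loads(store[".zarray"])
chunk = np.frombuffer(
    zlib.decompress(store["1.1"]), dtype=meta["dtype"]
).reshape(meta["chunks"])
```

Because every piece is an independent key, the same layout works unchanged on object storage, which is the point the following slides build on.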

Slide 13

Slide 13 text

Zarr
(Diagram: same Zarr group / array layout as before.)
Example .zarray file (json):
{
  "chunks": [5, 720, 1440],
  "compressor": {
    "blocksize": 0,
    "clevel": 3,
    "cname": "zstd",
    "id": "blosc",
    "shuffle": 2
  },
  "dtype": "

Slide 14

Slide 14 text

Zarr
(Diagram: same Zarr group / array layout as before.)
Example .zattrs file (json):
{
  "_ARRAY_DIMENSIONS": ["time", "latitude", "longitude"],
  "comment": "The sea level anomaly is the sea surface height above mean sea surface; it is referenced to the [1993, 2012] period; see the product user manual for details",
  "coordinates": "crs",
  "grid_mapping": "crs",
  "long_name": "Sea level anomaly",
  "standard_name": "sea_surface_height_above_sea_level",
  "units": "m"
}

Slide 15

Slide 15 text

Sharing data in the cloud era
Traditional approach: a data access portal.
(Diagram: a data center hosts data granules, i.e. netCDF files file.0001.nc through file.0004.nc, behind a data access server; clients reach the server over the internet.)

Slide 16

Slide 16 text

Sharing data in the cloud era
Direct access to cloud object storage.
(Diagram: a cloud data center holds data granules, netCDF files or something new, as metadata plus chunks chunk.0.0.0 through chunk.0.0.3 in cloud object storage; clients on cloud compute instances access the storage directly.)

Slide 17

Slide 17 text

xarray + Zarr
• Developed a new xarray backend which allows xarray to read and write directly to a Zarr store (with @jhamman).
• It was pretty easy! The data models are quite similar.
• Automatic mapping between Zarr chunks <-> Dask chunks.
• We needed to add a custom, “hidden” attribute (_ARRAY_DIMENSIONS) to give the Zarr arrays dimensions.

Slide 18

Slide 18 text

Preparing datasets for Zarr cloud storage

1. Open the original data files into a single xarray dataset with reasonable chunks:

   ds = xr.open_mfdataset('bunch_o_files_*.nc', chunks={'time': 1})

2. Export to Zarr:

   ds.to_zarr('/path/to/zarr/directory')
   or
   ds.to_zarr(gcsmap_object)

3. [maybe] Upload to cloud storage:

   $ gsutil -m cp -r /path/to/zarr/directory gs://pangeo-data/path

Slide 19

Slide 19 text

Reading Zarr data from Pangeo cloud environments

gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
ds = xr.open_zarr(gcsmap)

Dimensions:    (latitude: 720, longitude: 1440, nv: 2, time: 8901)
Coordinates:
    crs        int32 ...
    lat_bnds   (time, latitude, nv) float32 dask.array
  * latitude   (latitude) float32 -89.875 -89.625 -89.375 -89.125 -88.875 ...
    lon_bnds   (longitude, nv) float32 dask.array
  * longitude  (longitude) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 ...
  * nv         (nv) int32 0 1
  * time       (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
Data variables:
    adt        (time, latitude, longitude) float64 dask.array
    err        (time, latitude, longitude) float64 dask.array
    sla        (time, latitude, longitude) float64 dask.array
    ugos       (time, latitude, longitude) float64 dask.array
    ugosa      (time, latitude, longitude) float64 dask.array
    vgos       (time, latitude, longitude) float64 dask.array
    vgosa      (time, latitude, longitude) float64 dask.array
Attributes:
    Conventions:           CF-1.6
    Metadata_Conventions:  Unidata Dataset Discovery v1.0
    cdm_data_type:         Grid
    comment:               Sea Surface Height measured by Altimetry...
    contact:               [email protected]
    creator_email:         [email protected]
    creator_name:          CMEMS - Sea Level Thematic Assembly Center
    creator_url:           http://marine.copernicus.eu
    date_created:          2014-02-26T16:09:13Z
    date_issued:           2014-01-06T00:00:00Z
    date_modified:         2015-11-10T19:42:51Z

Slide 20

Slide 20 text


Slide 21

Slide 21 text

Comparison with THREDDS
• Worked with Luca Cinquini, Hans Vahlenkamp, and Aparna Radhakrishnan to connect Pangeo to an ESGF server running in Google Cloud.
• Used Dask to issue parallel OPeNDAP reads from a cluster.

Slide 22

Slide 22 text


Slide 23

Slide 23 text

The future for Zarr
• Zarr started in Python. There are now implementations in C++, Java, and Julia.
• Unidata is looking at Zarr as a potential backend for the next generation of netCDF.
• You can use Zarr today on Pangeo cloud environments!
http://pangeo.io