
Cloud Native Climate Data with Zarr and XArray

Ryan Abernathey
December 12, 2018


Presentation given by Ryan Abernathey at AGU 2018:
https://agu.confex.com/agu/fm18/meetingapp.cgi/Paper/390015

The Pangeo Project aims to empower scientists to work with very large datasets quickly and interactively. We do this by deploying open source technologies, such as
xarray, dask, jupyter, and kubernetes on conventional high-performance computers (HPC; like NCAR's Cheyenne and NASA's Pleiades) or on cloud platforms (like Google Cloud Platform or Amazon Web Services). Using the Pangeo software stack, users can distribute their computations over many compute nodes with minimal changes to their code, making it easy to scale analysis up to large datasets. Examples of datasets that Pangeo users want to analyze include remote sensing products (such as ocean infrared sea-surface temperature) as well as high resolution simulation outputs.
Using Pangeo, scientists are no longer limited by the rate at which a single CPU can perform calculations, since many compute nodes can be used in parallel for analysis. Instead, the main bottleneck in most workflows is the rate at which data can be delivered to the compute nodes. Consequently, the access mechanism and file format of the underlying data become a crucial consideration for performance.

The most common format for relevant datasets is an archive of hundreds to thousands of individual netCDF / HDF files. This format does not work particularly well on cloud object storage; the opaque nature of the files makes it nearly impossible to extract a single variable or piece of data or metadata without reading the whole file. The Pangeo project has instead been experimenting with the zarr format. When used in conjunction with xarray, reading data from zarr looks and feels identical to reading traditional netCDF files. However, the zarr format has major advantages:

Metadata is kept separate from data in a lightweight .json format
Arrays are stored in a flexible chunked / compressed binary format
Individual chunks can be retrieved independently in a thread-safe manner
When zarr data is placed in cloud object storage, the result is a high-performance, "analysis ready" data archive. The rate at which data can be extracted from such an archive scales linearly with the number of compute nodes which are reading from it simultaneously.

This talk will demonstrate the performance characteristics and scalability of zarr datasets in real-world workflows using a cloud-based Pangeo environment.
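
As a rough sketch of the access pattern described above, the following opens a Zarr store in cloud object storage with xarray; the bucket path is hypothetical, and gcsfs is assumed to be configured with appropriate (or anonymous) credentials.

import gcsfs
import xarray as xr

# Map a (hypothetical) Zarr store in Google Cloud Storage to a Python mapping interface.
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/example-zarr-store')

# Lazily open the dataset: only the lightweight JSON metadata is read at this point;
# chunks are fetched on demand when computations are triggered.
ds = xr.open_zarr(gcsmap)
print(ds)  # looks and feels like a netCDF-backed xarray.Dataset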


Transcript

  1. Zarr: Beyond netCDF. Cloud-optimized storage formats for climate data
  2. Zarr: Beyond netCDF. Cloud-optimized storage formats for climate data. HDF
  3. What is climate data?
    [Diagram, credit: Stephan Hoyer (xarray): data variables (e.g. elevation, land_cover) are used for computation; coordinates (time, longitude, latitude) describe the data; attributes hold metadata.] Metadata standard: http://cfconventions.org/
  4. What science do we want to do with climate data?
  5. What science do we want to do with climate data? Take the mean!
  6. What science do we want to do with climate data? Analyze spatiotemporal variability
  7. What science do we want to do with climate data? Machine learning! Credit: Berkeley Lab
  8. How is climate data stored today? Opaque binary file formats. Access via dedicated C libraries. Python wrappers for C libraries. Optimized for parallel writes on HPC.
  9. File / block storage
    Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
    • Operating system provides mechanism to read / write files and directories (e.g. POSIX).
    • Seeking and random access to bytes within files is fast.
    • “Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data”
  10. Object storage
    Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
    • An object is a collection of bytes associated with a unique identifier
    • Bytes are read and written with HTTP calls
    • Significant overhead for each individual operation
    • Application level (not OS dependent)
    • Implemented by S3, GCS, Azure, Ceph, etc.
    • Underlying hardware… who knows?
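
    To make "bytes are read and written with HTTP calls" concrete, here is a minimal sketch of a ranged read against an object store over plain HTTP; the bucket and object names are purely illustrative.

    import requests

    # Hypothetical public object in Google Cloud Storage (illustrative URL only).
    url = "https://storage.googleapis.com/example-bucket/example-object.bin"

    # Ask the object store for just the first 100 bytes via an HTTP Range header.
    # Each such call is a full HTTP round trip, hence the per-operation overhead.
    resp = requests.get(url, headers={"Range": "bytes=0-99"})
    resp.raise_for_status()
    first_bytes = resp.content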
  11. Central problem:
    We don't know how to issue an HTTP request to peek into an HDF file. The HDF client library does not support this. If we want to know what's in it, we have to download the whole thing.
  12. Zarr (https://zarr.readthedocs.io/)
    • Open source Python library for storage of chunked, compressed ND-arrays
    • Developed by Alistair Miles (Imperial) for genomics research (@alimanfoo)
    • Arrays are split into user-defined chunks; each chunk is optionally compressed (zlib, zstd, etc.)
    • Can store arrays in memory, directories, zip files, or any Python mutable mapping interface (dictionary)
    • External libraries (s3fs, gcsfs) provide a way to store directly into cloud object storage
    [Diagram: a Zarr group (group_name, with .zgroup and .zattrs files) containing a Zarr array (array_name, with .zarray, .zattrs, and chunk files 0.0, 0.1, 1.0, 1.1, 2.0, 2.1)]
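
    As an illustrative sketch of the chunked, compressed storage model described above, the following creates a small Zarr array in a local directory store; the shape, chunks, and compressor settings mirror the .zarray example on the next slide, and the file name is hypothetical.

    import numpy as np
    import zarr
    from numcodecs import Blosc

    # Blosc/zstd compressor, matching the "cname": "zstd", "clevel": 3, "shuffle": 2 metadata below.
    compressor = Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE)

    # Create a chunked, compressed array backed by a directory store on disk.
    z = zarr.open(
        "example_array.zarr", mode="w",
        shape=(8901, 720, 1440), chunks=(5, 720, 1440),
        dtype="<f8", compressor=compressor,
    )

    # Writing a slice touches only the chunks that overlap it.
    z[0, :, :] = np.random.rand(720, 1440)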
  13. Zarr: example .zarray file (JSON). [Same store-layout diagram as slide 12.]
    {
      "chunks": [ 5, 720, 1440 ],
      "compressor": { "blocksize": 0, "clevel": 3, "cname": "zstd", "id": "blosc", "shuffle": 2 },
      "dtype": "<f8",
      "fill_value": "NaN",
      "filters": null,
      "order": "C",
      "shape": [ 8901, 720, 1440 ],
      "zarr_format": 2
    }
  14. Zarr: example .zattrs file (JSON). [Same store-layout diagram as slide 12.]
    {
      "_ARRAY_DIMENSIONS": [ "time", "latitude", "longitude" ],
      "comment": "The sea level anomaly is the sea surface height above mean sea surface; it is referenced to the [1993, 2012] period; see the product user manual for details",
      "coordinates": "crs",
      "grid_mapping": "crs",
      "long_name": "Sea level anomaly",
      "standard_name": "sea_surface_height_above_sea_level",
      "units": "m"
    }
  15. Sharing data in the cloud era
    Traditional approach: a data access portal. [Diagram: clients on the internet request data granules (netCDF files: file.0001.nc, file.0002.nc, file.0003.nc, file.0004.nc) from a data access server inside the data center.]
  16. Sharing data in the cloud era
    Direct access to cloud object storage. [Diagram: clients on cloud compute instances read metadata and individual chunks (chunk.0.0.0, chunk.0.0.1, chunk.0.0.2, chunk.0.0.3) directly from cloud object storage in the cloud data center; data granules are netCDF files or something new.]
  17. Xarray + Zarr
    • Developed a new xarray backend that allows xarray to read and write directly to a Zarr store (with @jhamman)
    • It was pretty easy! The data models are quite similar
    • Automatic mapping between Zarr chunks <-> Dask chunks
    • We needed to add a custom, "hidden" attribute (_ARRAY_DIMENSIONS) to give the Zarr arrays dimensions
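
    A minimal sketch of the hidden attribute mentioned above, using a tiny synthetic dataset; the variable and path names are illustrative.

    import numpy as np
    import xarray as xr
    import zarr

    # A tiny dataset with named dimensions.
    ds = xr.Dataset({"sla": (("time", "latitude", "longitude"), np.zeros((3, 4, 5)))})
    ds.to_zarr("tiny.zarr", mode="w")

    # The Zarr array itself has no dimension names; the backend records them in a
    # hidden attribute that xarray reads back when the store is reopened.
    root = zarr.open("tiny.zarr", mode="r")
    print(root["sla"].attrs["_ARRAY_DIMENSIONS"])  # ['time', 'latitude', 'longitude']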
  18. Preparing datasets for Zarr cloud storage
    1. Open the original data files into a single xarray dataset with reasonable chunks:
       ds = xr.open_mfdataset('bunch_o_files_*.nc', chunks={'time': 1})
    2. Export to Zarr:
       ds.to_zarr('/path/to/zarr/directory')
       or
       ds.to_zarr(gcsmap_object)
    3. [maybe] Upload to cloud storage:
       $ gsutil -m cp -r /path/to/zarr/directory gs://pangeo-data/path
  19. Reading Zarr data from Pangeo cloud environments
    gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
    ds = xr.open_zarr(gcsmap)
    <xarray.Dataset>
    Dimensions:    (latitude: 720, longitude: 1440, nv: 2, time: 8901)
    Coordinates:
        crs        int32 ...
        lat_bnds   (time, latitude, nv) float32 dask.array<shape=(8901, 720, 2), chunksize=(5, 720, 2)>
      * latitude   (latitude) float32 -89.875 -89.625 -89.375 -89.125 -88.875 ...
        lon_bnds   (longitude, nv) float32 dask.array<shape=(1440, 2), chunksize=(1440, 2)>
      * longitude  (longitude) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 ...
      * nv         (nv) int32 0 1
      * time       (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
    Data variables:
        adt        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        err        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        sla        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        ugos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        ugosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        vgos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        vgosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
    Attributes:
        Conventions:            CF-1.6
        Metadata_Conventions:   Unidata Dataset Discovery v1.0
        cdm_data_type:          Grid
        comment:                Sea Surface Height measured by Altimetry...
        contact:                [email protected]
        creator_email:          [email protected]
        creator_name:           CMEMS - Sea Level Thematic Assembly Center
        creator_url:            http://marine.copernicus.eu
        date_created:           2014-02-26T16:09:13Z
        date_issued:            2014-01-06T00:00:00Z
        date_modified:          2015-11-10T19:42:51Z
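
    To illustrate what happens after such an open, here is a sketch of a simple distributed computation on the lazily opened dataset; the local Dask cluster is a stand-in for the dask-kubernetes clusters used in Pangeo cloud deployments.

    import gcsfs
    import xarray as xr
    from dask.distributed import Client

    # Local cluster as a stand-in for a dask-kubernetes cluster in the cloud.
    client = Client()

    gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
    ds = xr.open_zarr(gcsmap)

    # Each Zarr chunk becomes a Dask task; workers pull chunks from object storage in parallel.
    sla_mean = ds['sla'].mean(dim='time').compute()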
  20. [no transcript text]
  21. Comparison with THREDDS
    • Worked with Luca Cinquini, Hans Vahlenkamp, and Aparna Radhakrishnan to connect Pangeo to an ESGF server running in Google Cloud
    • Used Dask to issue parallel OPeNDAP reads from a cluster
  22. Comparison with THREDDS
    • Worked with Luca Cinquini, Hans Vahlenkamp, and Aparna Radhakrishnan to connect Pangeo to an ESGF server running in Google Cloud
    • Used Dask to issue parallel OPeNDAP reads from a cluster
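
    A rough sketch of the workflow behind this comparison, assuming a hypothetical OPeNDAP endpoint and variable name; the actual test used an ESGF/THREDDS server running in Google Cloud and a Dask cluster in the Pangeo environment.

    import xarray as xr
    from dask.distributed import Client

    # Dask cluster that will issue the parallel reads.
    client = Client()

    # Hypothetical OPeNDAP URL; chunking makes the access lazy and dask-backed.
    url = "https://example-esgf-server/thredds/dodsC/some/dataset"
    ds = xr.open_dataset(url, chunks={"time": 10})

    # Each chunk is fetched in a separate OPeNDAP request, executed in parallel by the workers.
    result = ds["tas"].mean(dim="time").compute()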
  23. The future for Zarr (http://pangeo.io)
    • Zarr started in Python; there are now implementations in C++, Java, and Julia
    • Unidata NetCDF is looking at Zarr as a potential backend for the next generation of NetCDF
    • You can use Zarr today on Pangeo cloud environments!