Pangeo Cloud Datastore (Lightning Talk)

A lightning talk from the August 2019 Pangeo Community Meeting

http://pangeo.io/meetings/2019_summer-meeting.html

Ryan Abernathey

August 21, 2019

Transcript

  1. Pangeo Cloud Datastore

    Ryan Abernathey
    August 2019
  2. HDF on the Cloud?

    We don’t know how to issue an HTTP request to peek into an HDF file; the HDF client library does not support this. If we want to know what’s in a file, we have to download the whole thing.

    • POSIX filesystem: system calls
    • Cloud object store: HTTP requests
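The contrast between the two access styles on this slide can be sketched with the Python standard library. This is only an illustration of "system calls versus HTTP requests", not a way to read HDF: the file contents, the URL, and the helper names `posix_peek` and `ranged_get` are all hypothetical, and the request is built but never sent.

```python
import os
import tempfile
import urllib.request


def posix_peek(path, offset, length):
    """Read `length` bytes at `offset` using filesystem calls (open, seek, read)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def ranged_get(url, offset, length):
    """Build (but do not send) an HTTP GET for the same byte range.

    HTTP byte ranges are inclusive, hence the `- 1` on the end position.
    """
    byte_range = f"bytes={offset}-{offset + length - 1}"
    return urllib.request.Request(url, headers={"Range": byte_range})


# Local file: a partial read is just a seek followed by a read.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789abcdef")
peeked = posix_peek(tmp.name, offset=4, length=4)
os.remove(tmp.name)

# Object store: the equivalent access is an HTTP request with a Range header.
# The URL is a placeholder; nothing is sent over the network here.
request = ranged_get("https://example.com/bucket/file.h5", offset=4, length=4)
range_header = request.get_header("Range")
```

The hard part the slide points at is not sending such a request but knowing which byte ranges to ask for, which is what a format-aware client library would have to work out.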
  3. Zarr

    • Open source library for storage of chunked, compressed N-dimensional arrays
    • Created by Alistair Miles (Imperial) for genomics research (@alimanfoo); now a community-supported standard
    • Arrays are split into user-defined chunks; each chunk is optionally compressed (zlib, zstd, etc.)
    • Can store arrays in memory, directories, zip files, or any Python mutable mapping interface (dictionary)
    • External libraries (s3fs, gcsfs) provide a way to store directly into cloud object storage
    • Implementations in Python, C++, Java (N5), Julia

    Diagram labels: Zarr Group: group_name (.zgroup, .zattrs); Zarr Array: array_name (.zarray, .zattrs); chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1

    https://zarr.readthedocs.io/
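The idea of chunked, compressed storage in a mutable mapping can be sketched without the Zarr library itself. The following is a minimal illustration using only the standard library, not Zarr's implementation: the function names are hypothetical, and the `.zarray` entry here holds a simplified metadata dict that is not a valid Zarr metadata document.

```python
import json
import zlib
from array import array


def write_chunked(store, name, rows, chunk_shape):
    """Split a 2-D list of floats into chunks; store each one compressed.

    Keys follow the "row.col" chunk-index pattern shown in the diagram.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    c_rows, c_cols = chunk_shape
    # Simplified metadata (assumption: real Zarr metadata has more fields).
    meta = {"shape": [n_rows, n_cols], "chunks": [c_rows, c_cols],
            "dtype": "float64", "compressor": "zlib"}
    store[f"{name}/.zarray"] = json.dumps(meta).encode()
    for i in range(0, n_rows, c_rows):
        for j in range(0, n_cols, c_cols):
            values = (v for row in rows[i:i + c_rows] for v in row[j:j + c_cols])
            key = f"{name}/{i // c_rows}.{j // c_cols}"
            store[key] = zlib.compress(array("d", values).tobytes())


def read_chunk(store, name, i, j):
    """Decompress a single chunk and return its values as a flat list."""
    block = array("d")
    block.frombytes(zlib.decompress(store[f"{name}/{i}.{j}"]))
    return list(block)


# A 6 x 4 array in 2 x 2 chunks gives a 3 x 2 chunk grid: 0.0 ... 2.1.
rows = [[float(r * 4 + c) for c in range(4)] for r in range(6)]
store = {}  # any mutable mapping works; a dict keeps everything in memory
write_chunked(store, "array_name", rows, chunk_shape=(2, 2))

keys = sorted(store)
last_chunk = read_chunk(store, "array_name", 2, 1)
```

Because each chunk lives under its own key, a reader can fetch one chunk without touching the others, which is what makes the layout a natural fit for object stores.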
  4. Reading Zarr Cloud Data

    import gcsfs
    import xarray as xr

    gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
    ds = xr.open_zarr(gcsmap)

    <xarray.Dataset>
    Dimensions:    (latitude: 720, longitude: 1440, nv: 2, time: 8901)
    Coordinates:
        crs        int32 ...
        lat_bnds   (time, latitude, nv) float32 dask.array<shape=(8901, 720, 2), chunksize=(5, 720, 2)>
      * latitude   (latitude) float32 -89.875 -89.625 -89.375 -89.125 -88.875 ...
        lon_bnds   (longitude, nv) float32 dask.array<shape=(1440, 2), chunksize=(1440, 2)>
      * longitude  (longitude) float32 0.125 0.375 0.625 0.875 1.125 1.375 1.625 ...
      * nv         (nv) int32 0 1
      * time       (time) datetime64[ns] 1993-01-01 1993-01-02 1993-01-03 ...
    Data variables:
        adt        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        err        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        sla        (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        ugos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        ugosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        vgos       (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
        vgosa      (time, latitude, longitude) float64 dask.array<shape=(8901, 720, 1440), chunksize=(5, 720, 1440)>
    Attributes:
        Conventions:           CF-1.6
        Metadata_Conventions:  Unidata Dataset Discovery v1.0
        cdm_data_type:         Grid
        comment:               Sea Surface Height measured by Altimetry...
        contact:               [email protected]
        creator_email:         [email protected]
        creator_name:          CMEMS - Sea Level Thematic Assembly Center
        creator_url:           http://marine.copernicus.eu
        date_created:          2014-02-26T16:09:13Z
        date_issued:           2014-01-06T00:00:00Z
        date_modified:         2015-11-10T19:42:51Z
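The shapes and chunk sizes in the dataset summary on this slide are enough to work out how large each chunk is. The arithmetic below is a small worked example for one data variable such as `sla`; it gives uncompressed sizes only, since the slide does not state the compression ratio, and the variable names are illustrative.

```python
import math

shape = (8901, 720, 1440)      # (time, latitude, longitude), from the summary
chunksize = (5, 720, 1440)     # chunking reported by dask
itemsize = 8                   # float64 is 8 bytes per value

# Uncompressed size of one full chunk in bytes.
chunk_bytes = math.prod(chunksize) * itemsize

# Chunks along each dimension; the last time chunk is partial (8901 = 1780 * 5 + 1).
chunks_per_dim = tuple(math.ceil(s / c) for s, c in zip(shape, chunksize))
n_chunks = math.prod(chunks_per_dim)

# Uncompressed size of the whole variable in bytes.
total_bytes = math.prod(shape) * itemsize

print(f"{chunk_bytes / 1e6:.1f} MB per chunk, {n_chunks} chunks, "
      f"{total_bytes / 1e9:.1f} GB per variable (uncompressed)")
```

Each chunk holds five daily global fields, so reading a short time window touches only a handful of objects rather than the whole variable.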
  5. Zarr in Pangeo

    Diagram labels: Chunks (0.0, 1.0, 2.0); Metadata (.zattrs); Dask worker (×3); Jupyter pod
  6. Comparison with THREDDS

    • Worked with Luca Cinquini, Hans Vahlenkamp, and Aparna Radhakrishnan to connect Pangeo to an ESGF server running in Google Cloud
    • Used Dask to issue parallel reads
  7. Comparison with THREDDS

    • Worked with Luca Cinquini, Hans Vahlenkamp, and Aparna Radhakrishnan to connect Pangeo to an ESGF server running in Google Cloud
    • Used Dask to issue parallel reads
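The pattern of issuing parallel reads can be sketched as follows. This is not the benchmark from the talk: a `ThreadPoolExecutor` stands in for the Dask workers, an in-memory dict stands in for the remote store, and the chunk contents are made up for illustration.

```python
import zlib
from array import array
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a remote store: chunk key -> compressed bytes.
# Chunk i holds 1000 copies of the value i (made-up data).
store = {
    f"{i}.0": zlib.compress(array("d", [float(i)] * 1000).tobytes())
    for i in range(8)
}


def load_and_sum(key):
    """Fetch one chunk, decompress it, and reduce it to a partial sum."""
    block = array("d")
    block.frombytes(zlib.decompress(store[key]))
    return sum(block)


# Three workers read chunks concurrently; each read is independent of the others.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(load_and_sum, sorted(store)))

# Combine the per-chunk results into the final answer.
total = sum(partial_sums)
```

With a real object store the per-chunk fetch is dominated by network latency, which is why issuing many reads at once tends to help.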
  8. Keeping Track of Cloud Datasets

    • Datasets are cataloged using an Intake catalog (YAML) stored in GitHub
    • Catalog is heavily nested
    • Continuous automated testing
    • I hacked Sphinx to render the catalog as a pretty HTML site
    • Future: let’s move to a STAC-based catalog and write an Intake plugin

    https://pangeo-data.github.io/pangeo-datastore/
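A heavily nested catalog combined with automated testing suggests walking every leaf entry and checking it. The sketch below uses plain nested dicts rather than Intake's own catalog objects or YAML schema; the category names, the `urlpath` key, and the second entry are hypothetical, and only the first path comes from the talk.

```python
def walk_catalog(node, prefix=()):
    """Yield (dotted_name, entry) for every dataset entry in a nested dict."""
    for name, child in node.items():
        path = prefix + (name,)
        if "urlpath" in child:
            # Leaf: a dataset entry.
            yield ".".join(path), child
        else:
            # Branch: a nested sub-catalog, so recurse into it.
            yield from walk_catalog(child, path)


# Hypothetical nested catalog; only the first urlpath appears in the talk.
catalog = {
    "ocean": {
        "sea_surface_height": {
            "urlpath": "pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt",
        },
    },
    "atmosphere": {
        "reanalysis": {
            "example": {"urlpath": "example-bucket/example-dataset"},
        },
    },
}

names = [name for name, _ in walk_catalog(catalog)]

# A trivial stand-in for a test: every entry must declare a non-empty path.
# A real test would go on to try opening each dataset.
missing = [name for name, entry in walk_catalog(catalog) if not entry["urlpath"]]
```

Running a check like this on every commit is one way a catalog stored in GitHub could be kept from drifting out of sync with the data it points to.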