
Pangeo Earthcube TAC Update

Ryan Abernathey
October 03, 2019


Transcript

  1. Pangeo Update
     Ryan Abernathey, Earthcube TAC Meeting, Oct. 2019
  2. What Drives Progress in Geoscience?
     New Ideas. New Observations. New Simulations.
     [Slide shows a journal excerpt on internal wave radiation from multiscale topography:]
     E = \rho_0 |U| \int_{|f|/|U|}^{N/|U|} P_{1D}(k) \sqrt{N^2 - |U|^2 k^2} \sqrt{|U|^2 k^2 - f^2} \, dk, \quad (3)
     where k = (k, l) is now the wavenumber in the reference frame along and across the mean flow U, and
     P_{1D}(k) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \frac{|k|}{|\mathbf{k}|} P_{2D}(k, l) \, dl \quad (4)
     is the effective one-dimensional (1D) topographic spectrum. Hence, wave radiation from 2D topography reduces to an equivalent problem of wave radiation from 1D topography with the effective spectrum P_{1D}(k).
     Simulations are configured with multiscale topography characterized by small-scale abyssal hills a few kilometers wide, based on multibeam observations from Drake Passage. The topographic spectrum associated with abyssal hills is well described by the anisotropic parametric representation proposed by Goff and Jordan (1988):
     P_{2D}(k, l) = \frac{2\pi H^2 (\mu - 2)}{k_0 l_0} \left(1 + \frac{k^2}{k_0^2} + \frac{l^2}{l_0^2}\right)^{-\mu/2}, \quad (5)
     where k_0 and l_0 set the wavenumbers of the large hills and \mu is the high-wavenumber spectral slope.
     [Fig. 3: Averaged profiles of (left) stratification (s^-1) and (right) flow speed (m s^-1) in the bottom 2 km from observations (gray), the initial condition of the simulations (black), and the final state of the 2D (blue) and 3D (red) simulations.]
  3. MITgcm LLC4320 Simulation
     Grid resolution: 1-2 km
     Single 3D scalar field: 80 GB
     Output frequency: 1 hour
     Simulation length: 1 year
     Output data volume: 2 PB
     Credit: NASA JPL / Dimitris Menemenlis
  4. Computer Architecture
     Image: CC BY 4.0 by Karl Rupp, via Willi Rath (GEOMAR)
  5. Computer Architecture
     Adapted from an image by Karl Rupp (CC BY 4.0), via Willi Rath (GEOMAR)
  6. Computer Architecture: Simulation
     NASA Pleiades Supercomputer
  7. Computer Architecture: Analysis and Visualization
  8. Computer Architecture
     Adapted from an image by Karl Rupp (CC BY 4.0), via Willi Rath (GEOMAR)
  9. What Science Do We Want to Do with Ocean Data?
     Take the mean! Analyze spatiotemporal variability. Machine learning! We need to process all the data!
  10. Pangeo Challenge
      How can we develop flexible analysis tools that meet our community's diverse needs and scale to petabyte-sized datasets?
  11. What Is Pangeo?
      "A community platform for Big Data geoscience"
      • Open Community
      • Open Source Software
      • Open Source Infrastructure
  12. Pangeo Community
      http://pangeo.io
  13. Pangeo Community
      We have a growing group of engaged participants in discussion.
      ✓ GitHub discussions
      ✓ Weekly check-in meetings
      ✓ Medium blog
      ✓ Working groups
      ✓ US / UK / Europe
      ✓ Students / postdocs / faculty / software devs / data scientists
      ✓ Academia / national labs / industry / NGOs
      ✓ Weather / climate / oceans / geoscience / neuroscience / bioinformatics? / astronomy?
  14. Pangeo Community
  15. 2019 Pangeo Community Meeting
      • Held in Seattle in August 2019
      • 60+ attendees: academia + industry + gov't labs
      • More details: https://medium.com/pangeo/pangeo2019-17f1a2a016e0
      • Lots of networking + hacking!
  16. Community Milestones
      ✓ Unidata's netCDF roadmap calls for a Zarr backend
      ✓ NCAR's NCL end-of-life plan and the "pivot to Python" cite Pangeo as a key technology for the future of data analysis
      ✓ NASA DAACs are publicly exploring Pangeo-style approaches to data distribution
      ✓ CSIRO adoption
      ✓ ECMWF adoption
  17. Community Milestones
  18. Publications
      • It is not easy to track the impact of Pangeo in publications (what should people cite?)
      • Volunteer reporting within our community collected 32 publications supported by Pangeo tools, including papers in Nature, Science, and PNAS: http://pangeo.io/publications.html
      • Is there a better way to do this?
  19. Upcoming: CMIP6 Hackathon
      • https://cmip6hack.github.io/
      • 70 participants, 2 locations
      • Using Pangeo on both HPC (Cheyenne) and cloud (Google)
      • CMIP6 cloud data accepted into the Google Public Datasets program!
  20. Pangeo Funding
      http://pangeo.io
  21. Scientific Python for Data Science
      Source: stackoverflow.com
  22. Scientific Python for Climate
      SciPy, aospy
      Credit: Stephan Hoyer, Jake Vanderplas (SciPy 2015)
  23. Software
      Intake
      Supporting community-driven open source
  24. Jupyter
      "Project Jupyter exists to develop open-source software, open standards, and services for interactive computing across dozens of programming languages."
  25. Xarray
      "netCDF meets pandas.DataFrame"
      Data variables (e.g. elevation, land_cover) are used for computation; coordinates (time, longitude, latitude) describe the data; indexes align data; attributes are metadata ignored by operations.
      https://github.com/pydata/xarray
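That data model can be sketched with a tiny, made-up dataset (the variable name, coordinate values, and units here are purely illustrative):

```python
import numpy as np
import xarray as xr

# A small hypothetical data variable with labeled coordinates and metadata
da = xr.DataArray(
    np.ones((3, 4)),
    dims=("time", "longitude"),
    coords={
        "time": np.array(
            ["2019-01-01", "2019-01-02", "2019-01-03"], dtype="datetime64[ns]"
        ),
        "longitude": [0.0, 90.0, 180.0, 270.0],
    },
    attrs={"units": "m"},  # attributes: metadata ignored by operations
    name="elevation",
)

# Indexes align data: selection is by coordinate label, not integer position.
# Label-based slices are inclusive, so this keeps longitudes 0, 90, and 180.
subset = da.sel(longitude=slice(0, 180))
print(subset.shape)  # (3, 3)
```

Operations like arithmetic between two such arrays align on these indexes automatically, which is the core convenience over raw NumPy arrays.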
  26. xarray: Expressive & High-Level
      sst_clim = sst.groupby('time.month').mean(dim='time')
      sst_anom = sst.groupby('time.month') - sst_clim
      nino34_index = (sst_anom.sel(lat=slice(-5, 5), lon=slice(190, 240))
                      .mean(dim=('lon', 'lat'))
                      .rolling(time=3).mean())
      nino34_index.plot()
  27. Dask
      Complex computations are represented as a graph of individual tasks; the scheduler optimizes execution of the graph. ND-arrays are split into chunks that comfortably fit in memory.
      https://github.com/dask/dask/
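A minimal illustration of the chunked-array idea (the array sizes here are arbitrary):

```python
import dask.array as da

# A 10,000 x 10,000 array split into 100 chunks of 1,000 x 1,000,
# each of which comfortably fits in memory
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))

# Building an expression only constructs the task graph...
total = (x + x.T).sum()

# ...compute() hands the graph to the scheduler, which executes the
# individual tasks (one per chunk) in parallel
result = total.compute()
print(result)  # 2 * 10_000 * 10_000 = 200,000,000
```

The same code runs unchanged on a laptop, an HPC cluster, or a cloud cluster; only the scheduler backing `compute()` changes.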
  28. Pangeo Infrastructure
  29. File-Based Approach
      [Diagram: (a) the file-based approach (step 1: download files; step 2: analyze them from local disk) and (b) the database approach, showing the split between the data provider's and the end user's responsibilities.]
  30. Cloud-Native Approach
      [Diagram: (c) the cloud approach: a notebook, Dask scheduler, and workers run in a compute cluster in the same cloud region as the object store and query it directly; the data provider's and end user's responsibilities are shown.]
  31. Pangeo Architecture
      • Jupyter for interactive access to remote systems (cloud / HPC)
      • Xarray provides data structures and an intuitive interface for interacting with datasets
      • A parallel computing system allows users to deploy clusters of compute nodes for data processing; Dask tells the nodes what to do
      • "Analysis-ready data" stored on globally available distributed storage
  32. Pangeo Deployments
      NASA Pleiades, pangeo.pydata.org, NCAR Cheyenne
      Over 1000 unique users since March
      http://pangeo.io/deployments.html
  33. File / Block Storage
      • The operating system provides a mechanism to read / write files and directories (e.g. POSIX).
      • Seeking and random access to bytes within files is fast.
      • "Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data."
      Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
  34. Object Storage
      • An object is a collection of bytes associated with a unique identifier.
      • Bytes are read and written with HTTP calls.
      • There is significant overhead for each individual operation.
      • Operates at the application level (not OS dependent).
      • Implemented by S3, GCS, Azure, Ceph, etc.
      • Underlying hardware... who knows?
      Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
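These key/value semantics can be sketched with a plain Python dict standing in for the object store (the bucket name and the ranged-read helper are hypothetical; a real store would serve these reads as HTTP requests):

```python
# Toy object store: unique identifier -> immutable blob of bytes
store = {}

def put(key, data):
    """Write a whole object, like an HTTP PUT."""
    store[key] = bytes(data)

def get(key, start=None, end=None):
    # A ranged read, analogous to an HTTP GET with a Range header.
    # On a real store each call carries real overhead, so readers
    # try to issue few, large requests rather than many small ones.
    blob = store[key]
    return blob if start is None else blob[start:end]

put("my-bucket/data/chunk-0", b"\x01" * 1024)
header = get("my-bucket/data/chunk-0", 0, 16)  # fetch only the first 16 bytes
print(len(header))  # 16
```

Libraries like s3fs and gcsfs expose exactly this mapping-style interface over real cloud object stores.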
  35. Moving Data to the Cloud
      Does legacy code work well with object storage? Hence the importance of cloud-optimized data formats.
      POSIX filesystem: system calls. Cloud object store: HTTP requests.
  36. Cloud-Optimized GeoTIFF
      • A regular GeoTIFF file...
      • aimed at being hosted on an HTTP file server...
      • with an internal organization that enables more efficient workflows on the cloud.
      • Rapid adoption in the open-source geospatial community (OGC, FOSS4G).
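The "internal organization" point can be illustrated with a toy tiled layout (entirely hypothetical; a real GeoTIFF records explicit per-tile offsets in its header rather than using fixed arithmetic like this): because tile locations are known up front, a reader can fetch just the byte ranges it needs over HTTP instead of downloading the whole file.

```python
# Hypothetical layout: fixed-size tiles stored in row-major order
TILE_BYTES = 256 * 256 * 2   # one 256x256 tile of 2-byte pixels
TILES_PER_ROW = 16
HEADER_BYTES = 4096

def tile_byte_range(row, col):
    """Byte range to request (e.g. via an HTTP Range header) for one tile."""
    index = row * TILES_PER_ROW + col
    start = HEADER_BYTES + index * TILE_BYTES
    return start, start + TILE_BYTES - 1

start, end = tile_byte_range(2, 3)
print(start, end)  # 4591616 4722687
```

Legacy readers that assume cheap POSIX seeks instead issue many tiny reads, which is exactly what makes them slow against an object store.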
  37. Zarr
      • Open-source library for storage of chunked, compressed ND-arrays
      • Created by Alistair Miles (Imperial) for genomics research (@alimanfoo); now a community-supported standard
      • Arrays are split into user-defined chunks; each chunk is optionally compressed (zlib, zstd, etc.)
      • Can store arrays in memory, directories, zip files, or any Python mutable mapping interface (dictionary)
      • External libraries (s3fs, gcsfs) provide a way to store directly into cloud object storage
      • Implementations in Python, C++, Java (N5), Julia
      [Diagram: a Zarr group (group_name: .zgroup, .zattrs) containing an array (array_name: .zarray, .zattrs, and chunk keys 0.0, 0.1, 1.0, 1.1, 2.0, 2.1).]
      https://zarr.readthedocs.io/
  38. Pangeo Data Flow in the Cloud
      [Diagram: Dask workers read data chunks (0.0, 1.0, 2.0) and metadata (.zattrs) directly from object storage, driven from a Jupyter pod.]
  39. Comparison with THREDDS
      • Worked with Luca Cinquini, Hans Vahlenkamp, and Aparna Radhakrishnan to connect Pangeo to an ESGF server running in Google Cloud
      • Used Dask to issue parallel reads
  40. Comparison with THREDDS
  41. Keeping Track of Cloud Datasets
      • Datasets are cataloged using an Intake catalog (YAML) stored in GitHub
      • The catalog is heavily nested
      • Continuous automated testing
      • Sphinx renders the catalog as a pretty HTML site
      • Future: let's move to a STAC-based catalog and write an Intake plugin
      https://pangeo-data.github.io/pangeo-datastore/
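An entry of the kind such a catalog contains might look like the following sketch (the dataset name, bucket path, and plugin module here are illustrative, not taken from the actual pangeo-datastore catalog):

```yaml
plugins:
  source:
    - module: intake_xarray   # provides the 'zarr' driver
sources:
  sea_surface_temperature:    # hypothetical dataset name
    description: Example SST dataset stored as Zarr in cloud object storage
    driver: zarr
    args:
      urlpath: gs://example-bucket/sst.zarr   # illustrative path
      consolidated: true
```

Opening the catalog with Intake then gives lazy, Dask-backed xarray objects without the user needing to know where or how the bytes are stored.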
  42. How to Get Involved
      • Use and contribute to xarray, dask, zarr, jupyterhub, etc.
      • Access an existing Pangeo deployment on an HPC cluster or cloud resources (http://pangeo.io/deployments.html)
      • Talk with us!
        - https://github.com/pangeo-data/pangeo (software dev)
        - https://discourse.pangeo.io (NEW! science user discussion)
      http://pangeo.io