Cloud Native Climate Data with Pangeo - NOAA NESDIS Cloud Technical Discussion

Ryan Abernathey

August 30, 2019

Transcript

  1. Ryan Abernathey

    NOAA NESDIS Cloud Technical Discussion: Cloud Native Climate Data with Pangeo
  2. What Drives Progress in Climate Science?
  3. What Drives Progress in Climate Science?

    New Ideas
  4. What Drives Progress in Climate Science?

    New Ideas

    Journal excerpt shown on the slide:

    $$E = \frac{\rho_0 |U|}{\pi} \int_{|f|/|U|}^{N/|U|} P_{1D}(k)\, \sqrt{N^2 - |U|^2 k^2}\, \sqrt{|U|^2 k^2 - f^2}\; dk, \qquad (3)$$

    where $\mathbf{k} = (k, l)$ is now the wavenumber in the reference frame along and across the mean flow $U$ and

    $$P_{1D}(k) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{|k|}{|\mathbf{k}|}\, P_{2D}(k, l)\, dl \qquad (4)$$

    is the effective one-dimensional (1D) topographic spectrum. Hence, the wave radiation from 2D topography reduces to an equivalent problem of wave radiation from 1D topography with the effective spectrum given by $P_{1D}(k)$. The effective 1D spectrum captures the effects of 2D …

    c. Bottom topography. Simulations are configured with multiscale topography characterized by small-scale abyssal hills a few kilometers wide based on multibeam observations from Drake Passage. The topographic spectrum associated with abyssal hills is well described by an anisotropic parametric representation proposed by Goff and Jordan (1988):

    $$P_{2D}(k, l) = \frac{2\pi H^2 (\mu - 2)}{k_0 l_0} \left( 1 + \frac{k^2}{k_0^2} + \frac{l^2}{l_0^2} \right)^{-\mu/2}, \qquad (5)$$

    where $k_0$ and $l_0$ set the wavenumbers of the large hills, $\mu$ is the high-wavenumber spectral slope, related to the pa- …

    FIG. 3. Averaged profiles of (left) stratification (s$^{-1}$) and (right) flow speed (m s$^{-1}$) in the bottom 2 km from observations (gray), initial condition in the simulations (black), and final state in 2D (blue) and 3D (red) simulations.
  5. What Drives Progress in Climate Science?

    New Ideas, New Observations (journal excerpt repeated from slide 4)
  8. What Drives Progress in Climate Science?

    New Ideas, New Observations, New Simulations (journal excerpt repeated from slide 4)
  11. MITgcm LLC4320 Simulation

    Grid resolution: 1 - 2 km
    Single 3D scalar field: 80 GB
    Output frequency: 1 hour
    Simulation length: 1 year
    Output data volume: 2 PB
    Credit: NASA JPL / Dimitris Menemenlis
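
The numbers on this slide can be cross-checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes decimal units (1 PB = 1,000,000 GB) and a 365-day year, neither of which is stated on the slide.

```python
# Back-of-the-envelope check of the slide's numbers (decimal units assumed).
GB_PER_SNAPSHOT = 80      # one 3D scalar field at one output time (from the slide)
SNAPSHOTS_PER_DAY = 24    # hourly output (from the slide)
DAYS = 365                # one simulated year; leap days ignored (assumption)

snapshots = SNAPSHOTS_PER_DAY * DAYS
per_field_gb = GB_PER_SNAPSHOT * snapshots
per_field_pb = per_field_gb / 1_000_000  # 1 PB = 1,000,000 GB in decimal units

print(f"{snapshots} snapshots, {per_field_pb:.2f} PB per 3D field")
```

Under these assumptions a single hourly 3D field already accounts for roughly 0.7 PB, which is the same order of magnitude as the 2 PB total quoted on the slide.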
  13. Computer Architecture

    Image: CC BY 4.0 by Karl Rupp, via Willi Rath (GEOMAR)
  14. Computer Architecture

    Adapted from image by Karl Rupp (CC BY 4.0), via Willi Rath (GEOMAR)
  15. Computer Architecture: Simulation

    NASA Pleiades Supercomputer
  16. Computer Architecture: Analysis and Visualization
  17. Computer Architecture

    Adapted from image by Karl Rupp (CC BY 4.0), via Willi Rath (GEOMAR)
  18. What Science do we want to do with Climate data?

    • Take the mean!
    • Analyze spatiotemporal variability
    • Machine learning!
    Need to process all the data!
  19. Pangeo Challenge

    How can we develop flexible analysis tools that meet our community’s diverse needs and scale to Petabyte-sized datasets?
  20. What is Pangeo?

    “A community platform for Big Data geoscience”
    • Open Community
    • Open Source Software
    • Open Source Infrastructure
  21. Pangeo Community

    http://pangeo.io
  22. Pangeo Funding

    http://pangeo.io
  23. Community Milestones

    ✓ NCAR's NCL end-of-life plan and “pivot to Python” cite Pangeo as a key technology for the future of data analysis
    ✓ NASA DAACs publicly exploring Pangeo-style approaches to data distribution
    ✓ CSIRO adoption
    ✓ ECMWF adoption
    ✓ Unidata NetCDF roadmap calls for Zarr backend
  26. Scientific Python for Data Science

    Source: stackoverflow.com
  27. Scientific Python for Climate

    Diagram labels: aospy, SciPy
    Credit: Stephan Hoyer, Jake Vanderplas (SciPy 2015)
  29. Jupyter

    “Project Jupyter exists to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages.”
  30. Xarray

    “netCDF meets pandas.DataFrame”
    • Data variables: used for computation
    • Coordinates: describe data
    • Indexes: align data
    • Attributes: metadata ignored by operations
    Diagram labels: time, longitude, latitude, elevation, land_cover
    https://github.com/pydata/xarray
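  31. xarray: Expressive & high-level

        # sst is assumed to be an xarray.DataArray with dimensions (time, lat, lon)
        # Monthly climatology: mean over all years for each calendar month
        sst_clim = sst.groupby('time.month').mean(dim='time')
        # Anomaly: departure of each time step from its month's climatology
        sst_anom = sst.groupby('time.month') - sst_clim
        # Nino 3.4 index: box average followed by a 3-step rolling mean in time
        nino34_index = (sst_anom.sel(lat=slice(-5, 5), lon=slice(190, 240))
                        .mean(dim=('lon', 'lat'))
                        .rolling(time=3).mean())
        nino34_index.plot()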
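
To make the climatology and anomaly steps on this slide concrete, here is a minimal pure-Python sketch of the same group-by-month logic. It does not use xarray, and the sample values are hypothetical numbers chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical monthly SST samples at one grid point: (year, month, value in deg C).
samples = [
    (2000, 1, 26.0), (2000, 2, 27.0),
    (2001, 1, 28.0), (2001, 2, 27.5),
]

# Step 1: group values by calendar month and average them (the climatology).
by_month = defaultdict(list)
for _year, month, value in samples:
    by_month[month].append(value)
climatology = {month: mean(values) for month, values in by_month.items()}

# Step 2: subtract each sample's monthly climatology (the anomaly).
anomalies = [
    (year, month, value - climatology[month])
    for year, month, value in samples
]

print(climatology)
print(anomalies)
```

The xarray version expresses the same two steps in two lines and applies them across every grid point at once, which is the point the slide is making.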
  32. dask

    • Complex computations represented as a graph of individual tasks.
    • Scheduler optimizes execution of graph.
    • ND-Arrays are split into chunks that comfortably fit in memory.
    https://github.com/dask/dask/
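
The idea of a computation expressed as a graph of tasks over chunks can be sketched in a few lines. This is a toy illustration only: the `evaluate` function and the key names are hypothetical, and the sketch runs tasks serially rather than scheduling them the way dask does.

```python
def evaluate(graph, key, cache=None):
    """Recursively compute graph[key]; a task is a tuple (function, *dependency_keys)."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and node and callable(node[0]):
        func, *deps = node
        result = func(*(evaluate(graph, dep, cache) for dep in deps))
    else:
        result = node  # plain data, e.g. one chunk of the array
    cache[key] = result
    return result

data = list(range(10))
chunk_len = 4  # each chunk is small enough to process on its own

# Build the graph: one data node and one partial-sum task per chunk.
graph = {}
partial_keys = []
for i, start in enumerate(range(0, len(data), chunk_len)):
    graph[f"chunk-{i}"] = data[start:start + chunk_len]
    graph[f"sum-{i}"] = (sum, f"chunk-{i}")
    partial_keys.append(f"sum-{i}")

# Combine the partial sums, then divide by the total length to get the mean.
graph["total"] = (lambda *parts: sum(parts), *partial_keys)
graph["mean"] = (lambda total: total / len(data), "total")

print(evaluate(graph, "mean"))
```

Because each `sum-i` task depends only on its own chunk, a real scheduler is free to run those tasks in parallel on different workers before combining the results.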
  34. Pangeo Infrastructure
  35. File-based Approach

    Diagram, panel a) file-based approach. Labels: files; step 1: download; local disk; step 2: analyze. The diagram separates the data provider’s responsibilities from the end user’s responsibilities.
  36. Server-Side Database

    Diagram, panel b) database approach. Labels: files; records; DBMS; query. The diagram separates the data provider’s responsibilities from the end user’s responsibilities.
  37. Cloud-Native Approach

    Diagram, panel c) cloud approach. Labels: object store (objects); cloud region; compute cluster (scheduler, workers); notebook. The diagram separates the data provider’s responsibilities from the end user’s responsibilities.
  38. Pangeo Architecture

    • Jupyter for interactive access to remote systems
    • Cloud / HPC
    • Xarray provides data structures and an intuitive interface for interacting with datasets
    • Parallel computing system allows users to deploy clusters of compute nodes for data processing. Dask tells the nodes what to do.
    • Distributed storage: “Analysis Ready Data” stored on globally-available distributed storage.
  39. Pangeo Deployments

    • NASA Pleiades
    • pangeo.pydata.org
    • NCAR Cheyenne
    Over 1000 unique users since March
    http://pangeo.io/deployments.html
  40. Government HPC and Commercial Cloud

    Access
      Government HPC: ✅ Available to all federally funded projects. ❌ Available only to federally funded projects.
      Commercial Cloud: ✅ Available globally to anyone with a credit card. ❌ Authentication is not integrated with existing research infrastructure.
    Cost
      Government HPC: ✅ Cost is hidden from researchers and billed by funding agencies. ❌ Allocations, quotas, limits.
      Commercial Cloud: ❌ Cost is borne by individual researchers and hidden from funding agencies. ✅ Economies of scale, unlimited resources.
    Compute
      Government HPC: ✅ Homogeneous, high performance nodes. ❌ Queues, batch scheduling, ssh access. ❌ Fixed-size compute.
      Commercial Cloud: ✅ Flexible hardware (big, small, GPU). ✅ Instant provisioning of unlimited resources. ✅ Spot market: burstable, volatile.
    Storage
      Government HPC: ✅ Fast parallel filesystems (e.g. GPFS).
      Commercial Cloud: ✅ Fast object storage.
  41. Cloud Climate Data
  42. How is climate data stored today?

    • Opaque binary file formats.
    • Access via dedicated C libraries.
    • Python wrappers for C libraries.
    • Optimized for HPC environment.
  43. File / Block storage

    • Operating system provides mechanism to read / write files and directories (e.g. POSIX).
    • Seeking and random access to bytes within files is fast.
    • “Most file systems are based on a block device, which is a level of abstraction for the hardware responsible for storing and retrieving specified blocks of data”
    Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
  44. Object storage

    • An object is a collection of bytes associated with a unique identifier
    • Bytes are read and written with HTTP calls
    • Significant overhead for each individual operation
    • Application level (not OS dependent)
    • Implemented by S3, GCS, Azure, Ceph, etc.
    • Underlying hardware…who knows?
    Image credit: https://blog.ubuntu.com/2015/05/18/what-are-the-different-types-of-storage-block-object-and-file
  45. Why Can’t we Just Dump our NetCDF File Archives into S3?

    • Can’t issue an HTTP request to peek into an HDF file. HDF client library does not support this*.
      ➡ If we want to know what’s in it, we have to download the whole thing.
    • Most netCDF datasets use very small “granules” (e.g. one netCDF file per day). Pangeo users want to look at the whole dataset. Scanning thousands of HDF files is very expensive in object storage.
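
Why many small reads are expensive on object storage can be illustrated with a simple cost model: each request pays a fixed latency on top of the transfer time. The latency and bandwidth figures below are hypothetical placeholders, not measurements of any particular service.

```python
def read_time_seconds(total_bytes, n_requests, latency_s, bandwidth_bytes_per_s):
    """Simple serial model: fixed overhead per request plus time to move the bytes."""
    return n_requests * latency_s + total_bytes / bandwidth_bytes_per_s

TOTAL_BYTES = 1_000_000_000   # 1 GB of data to read (hypothetical)
LATENCY_S = 0.05              # per-request overhead (hypothetical)
BANDWIDTH = 100_000_000       # bytes per second (hypothetical)

# Same data volume, split into many small requests or a few large ones.
many_small = read_time_seconds(TOTAL_BYTES, 10_000, LATENCY_S, BANDWIDTH)
few_large = read_time_seconds(TOTAL_BYTES, 10, LATENCY_S, BANDWIDTH)

print(f"10,000 small requests: {many_small:.1f} s")
print(f"10 large requests: {few_large:.1f} s")
```

In this model the transfer time is identical in both cases, so the difference comes entirely from per-request overhead, which is why access patterns that touch thousands of small files fare poorly.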
  46. zarr

    • Open source library for storage of chunked, compressed ND-arrays
    • Created by Alistair Miles (Imperial) for genomics research (@alimanfoo); now a community-supported standard
    • Arrays are split into user-defined chunks; each chunk is optionally compressed (zlib, zstd, etc.)
    • Can store arrays in memory, directories, zip files, or any Python mutable mapping interface (dictionary)
    • External libraries (s3fs, gcsfs) provide a way to store directly into cloud object storage
    • Implementations in Python, C++, Java (N5), Julia
    Diagram: Zarr Group group_name (.zgroup, .zattrs) containing Zarr Array array_name (.zarray, .zattrs; chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1)
    https://zarr.readthedocs.io/
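
The core idea described above (an array split into chunks, each chunk compressed and stored under its own key in a mutable mapping) can be sketched with the standard library alone. This is a toy illustration and not the zarr API: `write_chunked` and `read_chunk` are hypothetical helpers, the metadata is simplified rather than the real `.zarray` schema, and unlike zarr the last chunk is not padded to full size.

```python
import json
import struct
import zlib

def write_chunked(store, values, chunk_len):
    """Split a 1-D list of floats into chunks and store each one compressed."""
    meta = {"shape": [len(values)], "chunks": [chunk_len], "dtype": "<f8"}
    store[".zarray"] = json.dumps(meta).encode()  # simplified metadata document
    for start in range(0, len(values), chunk_len):
        part = values[start:start + chunk_len]
        raw = struct.pack(f"<{len(part)}d", *part)  # little-endian float64
        store[str(start // chunk_len)] = zlib.compress(raw)

def read_chunk(store, index):
    """Fetch and decompress a single chunk without touching the others."""
    raw = zlib.decompress(store[str(index)])
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))

store = {}  # any mutable mapping works; a dict stands in for a directory or object store
write_chunked(store, [float(i) for i in range(10)], chunk_len=4)

print(sorted(store))
print(read_chunk(store, 2))
```

Because the store is just a key-to-bytes mapping, swapping the dict for a mapping backed by cloud object storage changes where the bytes live without changing the read and write logic.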
  47. zarr

    Diagram: Zarr Group group_name (.zgroup, .zattrs) containing Zarr Array array_name (.zarray, .zattrs; chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1)

    Example .zarray file (json):

        {
            "chunks": [5, 720, 1440],
            "compressor": {
                "blocksize": 0,
                "clevel": 3,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 2
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [8901, 720, 1440],
            "zarr_format": 2
        }
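
The metadata above is enough to work out how the array is laid out. The sketch below uses only the `shape`, `chunks`, and `dtype` fields from the slide; the helper `chunk_key` is a hypothetical name, and the dot-separated key format follows the chunk names shown in the slide's diagram.

```python
import json
from math import ceil, prod

# The three fields of the slide's .zarray document that determine the layout.
zarray = json.loads(
    '{"chunks": [5, 720, 1440], "dtype": "<f8", "shape": [8901, 720, 1440]}'
)

itemsize = int(zarray["dtype"][2:])  # "<f8" means 8 bytes per value
chunks_per_dim = [ceil(s / c) for s, c in zip(zarray["shape"], zarray["chunks"])]
n_chunks = prod(chunks_per_dim)
chunk_bytes = prod(zarray["chunks"]) * itemsize   # uncompressed size of one chunk
total_bytes = prod(zarray["shape"]) * itemsize    # uncompressed size of the array

def chunk_key(index):
    """Map an element index such as (time, lat, lon) to the key of its chunk."""
    return ".".join(str(i // c) for i, c in zip(index, zarray["chunks"]))

print(chunks_per_dim, n_chunks)
print(chunk_bytes, total_bytes)
print(chunk_key((12, 100, 200)))
```

With these values each chunk holds five time steps of the full 720 x 1440 grid, about 41 MB uncompressed, so reading one time step requires fetching exactly one object.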
  48. zarr

    Diagram: Zarr Group group_name (.zgroup, .zattrs) containing Zarr Array array_name (.zarray, .zattrs; chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1)

    Example .zattrs file (json):

        {
            "_ARRAY_DIMENSIONS": ["time", "latitude", "longitude"],
            "comment": "The sea level anomaly is the sea surface height above mean sea surface; it is referenced to the [1993, 2012] period; see the product user manual for details",
            "coordinates": "crs",
            "grid_mapping": "crs",
            "long_name": "Sea level anomaly",
            "standard_name": "sea_surface_height_above_sea_level",
            "units": "m"
        }
  49. Sharing Data in the Cloud Era

    Pangeo Approach: Direct Access to Cloud Object Storage
    Diagram: a Cloud Object Store (metadata: .zattrs; chunks: 0.0, 1.0, 2.0) is read via HTTP GET by a Cloud Compute Cluster (Jupyter pod and Dask workers).
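
The access pattern in this diagram, several workers each fetching chunks directly from the object store, can be sketched with a thread pool. Everything here is a stand-in: the dict plays the role of the object store, and `get_object` is a hypothetical placeholder for one HTTP GET.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a cloud object store: key -> bytes. A real store is reached over HTTP.
object_store = {"0.0": b"aaaa", "1.0": b"bbbb", "2.0": b"cccc"}

def get_object(key):
    """Hypothetical stand-in for a single HTTP GET against the object store."""
    return object_store[key]

def fetch_chunks(keys, max_workers=3):
    """Fetch several chunks concurrently, one request per chunk, keeping key order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(keys, pool.map(get_object, keys)))

chunks = fetch_chunks(["0.0", "1.0", "2.0"])
print({key: len(value) for key, value in chunks.items()})
```

Since every chunk is an independent object, the requests do not contend for a single file handle, which is what lets throughput grow with the number of workers.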
  50. Comparison with THREDDS

    • Worked with Luca Cinquini, Hans Vahlenkamp, Aparna Radhakrishnan to connect Pangeo to ESGF server running in Google Cloud
    • Used Dask to issue parallel OPeNDAP reads from a cluster
  52. Cost Considerations

    • Cloud storage sticker price: $250 / TB / year. Few hidden costs.
    • Legacy data distribution approach: much cheaper per TB, but MANY hidden costs
      • Hardware, bandwidth, IT support, etc.
      • Data is not accessible to compute, so users have to download it to local “dark replicas”. Huge cost multiplier to funding agencies!
    • What is the cost of missed science opportunities?
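
As a rough illustration of the sticker price, the sketch below applies the quoted $250 per TB per year to a dataset of the size mentioned earlier in the talk (2 PB). It assumes decimal units (1 PB = 1000 TB) and ignores request and egress charges, which the slide does not quantify; `annual_storage_cost` is a hypothetical helper.

```python
PRICE_USD_PER_TB_YEAR = 250  # sticker price quoted on the slide

def annual_storage_cost(terabytes, price=PRICE_USD_PER_TB_YEAR):
    """Yearly storage cost in USD at a flat per-TB price (storage only)."""
    return terabytes * price

dataset_tb = 2 * 1000  # 2 PB expressed in TB, assuming decimal units
cost = annual_storage_cost(dataset_tb)
print(f"${cost:,} per year")
```

The same function can be used to compare against the legacy approach by multiplying a lower per-TB price by the number of downloaded replicas, although the slide gives no figures for either quantity.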
  53. How to get involved

    • Use and contribute to xarray, dask, zarr, jupyterhub, etc.
    • Access an existing Pangeo deployment on an HPC cluster, or cloud resources (http://pangeo.io/deployments.html)
    • Adapt Pangeo elements to meet your project's needs (data portals, etc.) and give feedback via GitHub: github.com/pangeo-data/pangeo
    http://pangeo.io
  54. Pangeo Principles for Cloud-Native Science Infrastructure

    • Community-driven - Our needs are no different from those of our peer institutions. By developing infrastructure collaboratively, we can accomplish much more than any one institution can alone.
    • Open source - Because infrastructure is code, the code should be licensed in a way that enables the entire research community to reuse and build upon it.
    • Modular - “All in one” solutions are impossible to maintain long term. Separation of concerns is a key principle of good software and systems engineering.
    • Vendor neutral - Academic research infrastructure should use only vendor-neutral service APIs. If this principle is followed, we can redeploy our infrastructure anywhere.
  55. Continuous Deployment

    • https://github.com/pangeo-data/pangeo-cloud-federation
    • Cloud-based clusters managed with Helm / Kubernetes
    • Deployment is completely automated via GitHub / CircleCI
    • Resources scale elastically with demand
  56. Cloud Data Catalog

    • https://pangeo-data.github.io/pangeo-datastore/
    • Datasets stored in zarr format (cloud-native HDF-replacement)
    • Cataloged using intake
    • Automated testing of datasets