unity Platform for Big Data Geoscience
OGC Member Meeting, March 2, 2022
http://pangeo.io
https://discourse.pangeo.io/
https://github.com/pangeo-data/
https://medium.com/pangeo
@pangeo_data
Ph.D. from MIT, 2012
Associate Prof. at Columbia / LDEO
https://ocean-transport.github.io/
Core developer of Xarray
Core developer of Zarr
Co-founder of Pangeo
Open Source Advocate
/ weather data analytics. We are not data providers. We are data users. Pangeo is the user-centric, open-source platform that was missing in 2017 when we started this work.
f Data Science
How a Data Scientist Spends Their Day
“Here’s where the popular view of data scientists diverges pretty significantly from reality. Generally, we think of data scientists building algorithms, exploring data, and doing predictive analysis. That’s actually not what they spend most of their time doing, however.”
How do data scientists spend their time?
• Cleaning and organizing data: 60%
• Collecting data sets: 19%
• Mining data for patterns: 9%
• Refining algorithms: 4%
• Other: 5%
• Building training sets: 3%
CrowdFlower Data Science Report (2016)
“Download” Model
[Diagram: a) file-based approach, where files are copied to local disk; b) database / API approach, where queried records are saved as files on local disk]
step 1: download
step 2: clean / organize
step 3: analyze
The Trouble with “Platforms”
go under the hood
• What if you want to access data that isn’t included? Data catalog is determined by provider, not users
• Platforms are “single instance”: Fear of lock-in, possibility platform will disappear
• Who pays?
What is Pangeo?
Founded in 2017. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
• Interoperable Software: Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning.
• Data and Computing Infrastructure: Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.
The Pangeo Community Process
cloud infrastructure
• Define science questions
• Use software / infrastructure
• Identify bugs / bottlenecks
• Provide feedback to developers
• Contribute widely to the open source scientific Python ecosystem
• Maintain / extend existing libraries, start new ones reluctantly
• Solve integration challenges
• Deploy interactive analysis environments
• Curate analysis-ready datasets
• Platform agnostic
Agile development
en-Source Cloud Stack
• Zarr: Cloud-optimized storage for multidimensional arrays.
• Dask: Flexible, general-purpose parallel computing framework.
• Xarray: High-level API for analysis of multidimensional labelled arrays.
• Jupyter: Rich interactive computing environment in the web browser.
• Cloud services: Kubernetes, Object Storage
• Domain-specific packages: xgcm, xrft, xhistogram, gcm-filters, climpred, etc.
Infrastructure
• Compute Services: Dask Gateway; Node Pools (Autoscaling), preemptible (spot instance) and normal
• Data Lakes: Zarr Datasets, each stored as a set of objects: .zarray, .zattrs, and chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1
http://catalog.pangeo.io
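The object names in the Zarr Datasets panel (`.zarray`, `.zattrs`, `0.0` ... `2.1`) follow from the array shape and chunk shape. The following is a minimal sketch, assuming the Zarr v2 layout with `.` as the dimension separator; the 30 x 20 array and 10 x 10 chunks are hypothetical values chosen to reproduce the six chunk keys on the slide, and the helper names are illustrative only.

```python
import itertools
import math


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (the last chunk may be partial)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def all_chunk_keys(shape, chunks):
    """Every chunk key in the store, e.g. '0.0', '0.1', ..."""
    grid = chunk_grid(shape, chunks)
    indices = itertools.product(*(range(n) for n in grid))
    return [".".join(map(str, idx)) for idx in indices]


def keys_for_selection(selection, chunks):
    """Chunk keys touched by a selection given as (start, stop) per dimension."""
    ranges = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(selection, chunks)
    ]
    return [".".join(map(str, idx)) for idx in itertools.product(*ranges)]


# Hypothetical array: 30 x 20 elements in 10 x 10 chunks -> 3 x 2 chunk grid
shape, chunks = (30, 20), (10, 10)
keys = all_chunk_keys(shape, chunks)

# Reading rows 5..14 and columns 0..9 only needs two of the six objects
touched = keys_for_selection(((5, 15), (0, 10)), chunks)
```

Because each chunk is its own object, a reader fetches only the keys in `touched`, which is what makes this layout suit object storage.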
CMIP6 Cloud Dataset
public dataset
• Data stored in Zarr format
• Google provides free hosting in GCS
• LDEO does the work of transferring the data from ESGF to GCS
• Mirrored on AWS
Message 1: The Pangeo Approach Works!
helped existing, community-developed tools work together. Open and community-driven from day 1. (Sustainability)
• “Power users” always just want direct, data-proximate access to the raw data. (Simplicity)
• The same stack is an effective base-layer for apps / dashboards / APIs, etc. (Modularity)
Message 2: Analysis-Ready, Cloud-Optimized (ARCO) Data is Great
obviates the need for some APIs / services.
• Legacy formats (netCDF / HDF5 / GRIB) don’t always play well with object storage.
• ARCO data production takes time and skill.
lenge: Data Gravity
“Data gravity is the ability of a body of data to attract applications, services and other data.” - Dave McCrory
[Diagram labels: NASA (200 PB), NOAA BDP, ASDI (incl. CMIP6), NCAR Datasets, etc…, Planetary Computer, NOAA BDP, Earth Engine, NOAA BDP, Descartes, Pangeo, SentinelHub, Climate Change, Atmosphere, Marine, ECMWF, DOE, XSEDE, HECC, NCAR]
Summary
embraced by both science “power users” and builders of Earth-System analytics platforms.
• Standards are extremely important for modular, open-source analytics platforms!
• Analysis-Ready, Cloud-Optimized data in object storage is the foundation of performant and flexible cloud Earth-System analytics platforms.
• We need a global scientific commons that lives outside the big cloud providers. Otherwise data gravity will suck all of science into AWS.
Thanks to our funders!
P (S3) Is More Performant and Flexible than a Bespoke API
https://xpublish.readthedocs.io/
Serve dynamically generated Zarr data over HTTP. Client can’t tell the difference.
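One way to see why the client can't tell the difference: a Zarr store is just a mapping from keys to bytes, so the bytes can be computed at request time instead of read from disk. The sketch below is not xpublish's implementation and does not use its API. `GeneratedStore` and its squared-index "data" are hypothetical, and it assumes a 1-D, uncompressed Zarr v2 array; an HTTP server would simply return these bytes for GET requests on the same keys.

```python
import json
import struct
from collections.abc import Mapping


class GeneratedStore(Mapping):
    """Read-only key-value store whose Zarr v2 chunks are computed per request."""

    def __init__(self, shape, chunks):
        self.shape, self.chunks = shape, chunks
        self.nchunks = -(-shape[0] // chunks[0])  # ceiling division

    def _keys(self):
        return [".zarray"] + [str(i) for i in range(self.nchunks)]

    def __getitem__(self, key):
        if key == ".zarray":
            meta = {
                "zarr_format": 2,
                "shape": list(self.shape),
                "chunks": list(self.chunks),
                "dtype": "<f8",
                "compressor": None,
                "fill_value": None,
                "filters": None,
                "order": "C",
            }
            return json.dumps(meta).encode()
        if key not in self._keys():
            raise KeyError(key)
        start = int(key) * self.chunks[0]
        # Hypothetical "dynamic" data: the square of each global index.
        # Zarr v2 chunks are always full-size, so the last chunk is padded
        # with values past the end of the array, which readers ignore.
        values = [float(i) ** 2 for i in range(start, start + self.chunks[0])]
        return struct.pack(f"<{len(values)}d", *values)

    def __iter__(self):
        return iter(self._keys())

    def __len__(self):
        return len(self._keys())


store = GeneratedStore(shape=(10,), chunks=(4,))
meta = json.loads(store[".zarray"])
second_chunk = struct.unpack("<4d", store["1"])  # indices 4..7
```

Nothing is stored in advance here: every chunk is produced when its key is requested, yet the keys and byte layout match what a static Zarr v2 array in object storage would expose.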
: Democratizing ARCO Data Production
https://pangeo-forge.org/
An open source platform for creating ARCO datasets. We crowdsource “recipes” for ARCO data from the global science community. Cloud automation builds the datasets in a scalable and reproducible way.
ake your Legacy Data Look and Feel Like Zarr
• Provides a unified way to represent a variety of chunked, compressed binary data formats (e.g. NetCDF/HDF5, GRIB2, TIFF, …)
• Allows efficient access to data from traditional file systems or cloud object storage.
• Create virtual datasets from multiple files by extracting the byte ranges, compression information, etc., and storing this metadata in a new, separate object.
• Open spec, Python implementation.
https://fsspec.github.io/kerchunk/
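The byte-range idea can be illustrated without kerchunk itself. This is a hand-rolled sketch, not the kerchunk API: it assumes a reference layout of `key -> [path, offset, length]`, loosely modeled on the shape of kerchunk's JSON references, and uses a made-up binary file (`legacy.bin`, with a 16-byte header and two uncompressed chunks) as a stand-in for a legacy format.

```python
import json
import os
import struct
import tempfile

# Build a stand-in "legacy" file: a 16-byte header followed by two raw chunks.
header = b"LEGACYFORMAT\x00\x00\x00\x00"
chunk0 = struct.pack("<3d", 1.0, 2.0, 3.0)
chunk1 = struct.pack("<3d", 4.0, 5.0, 6.0)

tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, "legacy.bin")
with open(data_path, "wb") as f:
    f.write(header + chunk0 + chunk1)

# The "virtual dataset": chunk key -> [file, byte offset, byte length],
# stored in a new, separate object next to the untouched data file.
refs = {
    "temp/0": [data_path, len(header), len(chunk0)],
    "temp/1": [data_path, len(header) + len(chunk0), len(chunk1)],
}
refs_path = os.path.join(tmpdir, "refs.json")
with open(refs_path, "w") as f:
    json.dump({"version": 1, "refs": refs}, f)


def read_chunk(refs_file, key):
    """Fetch one chunk by seeking to its byte range in the original file."""
    with open(refs_file) as f:
        path, offset, length = json.load(f)["refs"][key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


values = struct.unpack("<3d", read_chunk(refs_path, "temp/1"))
```

On object storage the `seek` and `read` pair would become an HTTP range request, so the legacy file is never copied or rewritten.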