currents for the climate and ecosystems. Associate Research Scientist - Columbia University Core-Developer of xgcm Core-Developer of cmip6_preprocessing Maintainer of the Pangeo CMIP6 Cache Pangeo Fan, User, and Member Open Source/Open Science Advocate
I n s t i t u t i o n s c r e a t e “ D a t a F o r t r e s s e s * ” 16 Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons *Coined by Chelle Gentemann
I n s t i t u t i o n s c r e a t e “ D a t a F o r t r e s s e s * ” 16 Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons *Coined by Chelle Gentemann
4.0, via Wikimedia Commons Data ❌ Results not reproducible outside fortress ❌ Barrier to collaboration ❌ Inefficient / duplicative ❌ Can’t scale to future data needs ❌ Limits inclusion and knowledge transfer *Coined by Chelle Gentemann
Desire to go under the hood What if you want to access data that isn’t included? Data catalog is determined by provider, not users Platforms are “single instance”: Fear of lock-in, possibility platform will disappear
Seasonal Forecasting Research Education Industry *Coined by Fernando Perez O p e n O c e a n C l o u d W i l l b e a “ D a t a W a t e r i n g H o l e * ” Data Library Compute Environment
Seasonal Forecasting Research Education Industry *Coined by Fernando Perez O p e n O c e a n C l o u d W i l l b e a “ D a t a W a t e r i n g H o l e * ” Data Library Compute Environment ✅ Faster science, more discoveries ✅ Inherently reproducible ✅ Allows seamless global collaboration ✅ Unleashes creativity ✅ Cost effective ✅ Accessible to all ✅ Connects with industry
processing. Founded in 2017. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc. Interoperable Software Foundation in Open Source Scienti f ic Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning. Data and Computing Infrastructure Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for Data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.
processing. Founded in 2017. Scientists and software developers coming together. http://pangeo.io/ Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc. Interoperable Software Foundation in Open Source Scienti f ic Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of interoperable packages for analysis, visualization, and machine learning. Data and Computing Infrastructure Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders for Data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public cloud (GCS, AWS) and OpenStorageNetwork.
Cloud to provide a new public dataset • > 1 PB and counting • Data stored in Zarr format • Google provides free hosting in GCS • Mirrored on AWS https://pangeo-data.github.io/pangeo-cmip6-cloud/
need for tedious homogenizing / cleaning steps • Curated and cataloged A R C O D a t a 56 Analysis Ready, Cloud Optimzed $VGDWDVFLHQFHEHFRPHVPRUHFRPPRQSODFHDQG VLPXOWDQHRXVO\DELWGHP\VWLĆHGZHH[SHFWWKLV WUHQGWRFRQWLQXHDVZHOO$IWHUDOOODVW\HDUèV respondents were just as excited about their ZRUN DERXWZHUHêVDWLVĆHGëRUEHWWHU How a Data Scientist Spends Their Day +HUHèVZKHUHWKHSRSXODUYLHZRIGDWDVFLHQWLVWVGLYHUJHVSUHWW\VLJQLĆFDQWO\IURPUHDOLW\*HQ ZHWKLQNRIGDWDVFLHQWLVWVEXLOGLQJDOJRULWKPVH[SORULQJGDWDDQGGRLQJSUHGLFWLYHDQDO\VLV7 actually not what they spend most of their time doing, however. $V\RXFDQVHHIURPWKHFKDUWDERYHRXWRIHYHU\GDWDVFLHQWLVWVZHVXUYH\HGDFWXDOO\VSHQ PRVWWLPHFOHDQLQJDQGRUJDQL]LQJGDWD<RXPD\KDYHKHDUGWKLVUHIHUUHGWRDVêGDWDZUDQJOLQ FRPSDUHGWRGLJLWDOMDQLWRUZRUN(YHU\WKLQJIURPOLVWYHULĆFDWLRQWRUHPRYLQJFRPPDVWRGHE databases–that time adds up and it adds up immensely. Messy data is by far the more time- con DVSHFWRIWKHW\SLFDOGDWDVFLHQWLVWèVZRUNćRZ$QGQHDUO\VDLGWKH\VLPSO\VSHQWWRRPXF Data scientist job satisfaction 60% 19% 9% 4% 5% 3% Building training sets: 3% Cleaning and organizing data: 60% Collecting data sets; 19% Mining data for patterns: 9% 5HĆQLQJDOJRULWKPV Other: 5% ,!;&!;!9$-'2ধ9;996'2&;,'139;ধ1'&3 2 1 How do data scientists spend their time? Crowd fl ower Data Science Report (2016) What is “Analysis Ready”?
Supports lazy access and intelligent subsetting • Integrates with high-level analysis libraries and distributed frameworks A R C O D a t a 58 Analysis Ready, Cloud Optimzed What is “Cloud Optimized”?
to fi nd, clean, and homogenize data Tech Knowledge: How to ef fi ciently produce cloud-optimized formats Compute Resources: A place where to stage and upload the ARCO data Communication Skills: To explain to others how to use the data To produce useful ARCO data, you must have: Data Scientist 😩
package for describing and running data pipelines (“recipes”) Cloud platform for automatically executing recipes stored in GitHub repos. https://github.com/pangeo-forge/pangeo-forge-recipes https://pangeo-forge.org/
al. Requirements for a global data infrastructure in support of CMIP6. Geosci. Model Dev. 11, 3659–3680 (2018). Datasets are directly downloaded from ESGF nodes and minimally processed Files -> Dataset Carry complete metadata, handle_id etc for provenance We regularly f ilter our catalog for retracted datasets WIP: Fully automate work f low with pangeo-forge