P r i v i l e g e d I n s t i t u t i o n s c r e at e “ D ata F o r t r e s s e s * ” 7 Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons *Coined by Chelle Gentemann
P r i v i l e g e d I n s t i t u t i o n s c r e at e “ D ata F o r t r e s s e s * ” 8 Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons Data *Coined by Chelle Gentemann
P r i v i l e g e d I n s t i t u t i o n s c r e at e “ D ata F o r t r e s s e s * ” 9 Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons Data *Coined by Chelle Gentemann # step 1: open data
🗂 Emphasis on fi les as a medium of data exchange creates lots of work for individual scientists (downloading, organizing, cleaning). Most fi le-based datasets are a mess—even simulation output.
😫 Yet the hard work of data wrangling is rarely collaborative. Outputs are not reusable and can’t be shared.
💰 Doing data-intensive science requires either expensive local infrastructure or access to a big agency supercomputer. This really limits participation.
🏰 Data intensive science is locked inside data fortresses. Limiting access to outsiders is a feature, not a bug. Restricts collaboration and reproducibility! p r o b l e m s w i t h t h e S tat u s Q u o 10
S e pa r at i o n o f S t o r a g e a n d C o m p u t e 12 Storage costs are steady.
Data provider pays for storage costs.
May be subsidized by cloud provider. (Thanks AWS, GCP, Azure! Or can live outside the cloud (e.g. Wasabi, OSN) Compute costs for interactive data analysis are bursty.
Can take advantage of spot pricing
Multi-tenancy: We can all use the same stack, but each institution pays for its own users. This is completely di ff erent from the status quo infrastructure!
T h e Pa n g e o C l o u d - N a i v e S ta c k 13 Cloud-optimized storage for multidimensional arrays. Flexible, general-purpose parallel computing framework. High-level API for analysis of multidimensional labelled arrays. Kubernetes Object Storage Rich interactive computing environment in the web browser. xgcm xr f xhistogram gcm-filters climpred Cloud Services Domain specific packages Etc.
• No need for tedious homogenizing / cleaning steps
• Curated and cataloged A R C O D ata 14 Analysis Ready, Cloud Optimzed ZRUNDERXWZHUHêVDWLVĆHGëRUEHWWHU How a Data Scientist Spends Their Day +HUHèVZKHUHWKHSRSXODUYLHZRIGDWDVFLHQWLVWVGLYHUJHVSUHWW\VLJQLĆFDQWO\IURP ZHWKLQNRIGDWDVFLHQWLVWVEXLOGLQJDOJRULWKPVH[SORULQJGDWDDQGGRLQJSUHGLFWLY actually not what they spend most of their time doing, however. $V\RXFDQVHHIURPWKHFKDUWDERYHRXWRIHYHU\GDWDVFLHQWLVWVZHVXUYH\HG PRVWWLPHFOHDQLQJDQGRUJDQL]LQJGDWDFRPSDUHGWRGLJLWDOMDQLWRUZRUN(YHU\WKLQJIURPOLVWYHULĆFDWLRQWRUHPRYLQJFRP Data scientist job satisfaction 60% 19% 9% 4% 5% 3% Building training sets: 3% Cleaning and organizing data: 60% Collecting data sets; 19% Mining data for patterns: 9% 5HĆQLQJDOJRULWKPV Other: 5% ,!;&!;!9$-'2ধ9;996'2&;,'1 How do data scientists spend their time? Crowd fl ower Data Science Report (2016) What is “Analysis Ready”?
E X A M P L E O F A R C O D ATA 15 Chunked appropriately for analysis Rich metadata Everything in one dataset object https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/
22 Pangeo Forge Recipes Pangeo Forge Cloud Open source python package for describing and running data pipelines (“recipes”) Cloud platform for automatically executing recipes stored in GitHub repos. https://github.com/pangeo-forge/pangeo-forge-recipes https://pangeo-forge.org/
Pa n g e o F o r g e R e c i p e s 23 FilePattern StorageConfig Recipe Executor Describes where to fi nd the source fi les which are the inputs to the recipe Describes where to store the outputs of our recipe A complete, self- contained representation of the pipeline Knows how to run the recipe. https://pangeo-forge.readthedocs.io/
Pa n g e o F o r g e C l o u d 24 Feedstock Contains the code and metadata for one or more Recipes Bakery https://pangeo-forge.org/ Storage Runs the recipes in the cloud using elastic scaling clusters GCS
V i s i o n : C o l l a b o r at i v e D ata C u r at i o n 25 Feedstock 🤓 Data User 🤓 Data Producer 🤓 Data Manager These data look weird… …Oh the metadata need an update. Ok I’ll make a PR to the recipe.