Joe Hamman
January 12, 2023

The evolution of the Pangeo project: from open source software to an open science community

The Pangeo project began as a community effort to tackle big data in the geosciences. Initially, the approach was to work at the software level, integrating and improving open source libraries like Xarray, Dask and Jupyter to work better on big data. However, in the years since the project began, the focus of the Pangeo community has branched out to include a cloud computing platform, open cloud-optimized dataset curation, machine learning, education and more. As a result, Pangeo has become a coordination point for many of the fundamental building blocks of open science. In this presentation, I will give an overview of how the focus of the Pangeo project has evolved from open source software to open science, detailing some of the key milestones along the way. I’ll also highlight a range of applications, from federal agencies to academic labs to tech startups, that are using advances made by the Pangeo project to enable more effective research and science-based action on climate change and other social and environmental issues. I’ll end with some perspectives on where the Pangeo project and the broader open science community should go from here to effectively address the pressing issues facing our planet and our society.

Transcript

  1. Pangeo

    The evolution of the Pangeo project: from open source software to an open science community
  2. Hello!

    ‣ I am now the CTO at Earthmover.
    ‣ Previously, I was a scientist at the National Center for Atmospheric Research (NCAR) and the technology director at the non-profit CarbonPlan.
    ‣ I contribute to open source scientific Python projects like Xarray, Dask, Intake and Zarr.
    ‣ I’ve been part of the Pangeo community since the beginning.

    @jhamman @jhamman @_jhamman
  3. Preamble!

    ‣ This is a non-technical talk
    ‣ My goals:
        ‣ Share some of the Pangeo community’s achievements and lessons learned
        ‣ Share my opinion of why Pangeo has been successful
        ‣ Lay out a vision for where Pangeo could go next*
  4. Community

    @pangeo-data: 490 members
    discourse.pangeo.io: 1.1k members
    @pangeo-data: 5.3k followers
  5. The Situation is Urgent

    https://www.epa.gov/climate-indicators/climate-change-indicators-heat-waves
    Earth Data Volume is Exploding
    Climate Change is Here
    https://www.nasa.gov/feature/amazing-earth-satellite-images-from-2018
  6. Signs of progress
  7. “Open source is everywhere. Its culture has demonstrated how transparent and collaborative innovation can transform modern society … Open source software accelerates the transition to a sustainable economy by supporting traceable decision-making, building capacity for localisation and customisation, providing new opportunities for participation, and preventing greenwashing by ensuring transparency and trust.”

    https://report.opensustain.tech/
  8. The Open Science Vision

    https://earthdata.nasa.gov/esds/open-science

    for 👩‍🔬 in everyone_in_the_world:
        for 📄 in all_scientific_knowledge:
            👩‍🔬.verify(📄)
            discovery = 👩‍🔬.extend(📄)

    This would transform the 🌎 by allowing all of humanity to participate in the scientific process.

    What are the barriers to realizing this vision?
  9. 2023 - the year of open science

    https://www.whitehouse.gov/ostp/news-updates/2023/01/11/fact-sheet-biden-harris-administration-announces-new-actions-to-advance-open-and-equitable-research
    https://www.nature.com/articles/d41586-023-00019-y
  10. The Private Sector is Here

    Climate Tech Investment is Growing
  11. Pangeo 1.0

    Jupyter: for interactive access to remote systems
    Cloud / HPC
    Xarray: provides data structures and an intuitive interface for interacting with datasets
    Parallel computing system: allows users to deploy clusters of compute nodes for data processing. Dask tells the nodes what to do.
    Distributed storage: “Analysis Ready Data” stored on globally-available distributed storage.
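
A minimal sketch of the split, compute, combine idea behind the parallel computing layer on the Pangeo 1.0 slide. It uses only the Python standard library as a stand-in; it is not Dask's API, and the names `split`, `partial_stats` and `parallel_mean` are hypothetical, chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, chunk_size):
    """Cut a list into consecutive chunks of at most chunk_size items."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def partial_stats(chunk):
    """Work done independently on one chunk: its sum and its length."""
    return sum(chunk), len(chunk)


def parallel_mean(values, chunk_size=4, workers=3):
    """Mean of values, computed chunk by chunk on a pool of workers."""
    if not values:
        raise ValueError("values must not be empty")
    chunks = split(values, chunk_size)
    # The pool plays the role of the compute nodes; map() hands out the chunks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    # Combine the per-chunk results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


temperatures = [float(t) for t in range(10)]  # made-up values 0.0 .. 9.0
print(parallel_mean(temperatures))  # 4.5
```

In the real stack, Xarray and Dask handle the chunking and scheduling, and the chunks are read from distributed storage rather than from a list in memory.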
  12. The Pangeo pattern

    Nov 7, 2019 - https://medium.com/pangeo/the-pangeo-pattern-9a81ca4bad42
  13. Pangeo cloud

    ‣ Data-proximate interactive computing in the cloud
    ‣ Combined JupyterHub and Dask in a highly scalable Kubernetes cluster.
    ‣ Large maintenance burden!
    ‣ Now available as part of the Dask Helm Chart.
  14. Xarray + Zarr

    Cloud optimized data storage backend for the NetCDF data model

    [Diagram: a Cloud Object Store holding chunks (0.0, 1.0, 2.0) and metadata (.zattrs), read over HTTP GET by a Cloud Compute Cluster made up of three Dask workers and a Jupyter pod]
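
The chunk labels on the Xarray + Zarr slide (0.0, 1.0, 2.0) follow the Zarr v2 convention of naming each chunk object by its grid indices joined with ".". The sketch below, under that assumption, works out which chunk objects a reader would need to GET for a given slice. The helper names `chunk_keys` and `object_urls`, the array name `temperature` and the `example.com` store URL are hypothetical; this is not the Zarr library's API.

```python
from itertools import product


def chunk_keys(shape, chunks, selection):
    """Return the Zarr v2 style chunk keys that overlap a selection.

    shape     -- array shape, e.g. (300, 100)
    chunks    -- chunk shape, e.g. (100, 100)
    selection -- one (start, stop) pair per dimension, stop exclusive
    """
    per_dim = []
    for size, chunk, (start, stop) in zip(shape, chunks, selection):
        if not 0 <= start < stop <= size:
            raise ValueError(f"selection {(start, stop)} outside 0..{size}")
        first = start // chunk       # index of the first chunk touched
        last = (stop - 1) // chunk   # index of the last chunk touched
        per_dim.append(range(first, last + 1))
    # Join the per-dimension chunk indices with ".", e.g. (2, 0) -> "2.0"
    return [".".join(map(str, idx)) for idx in product(*per_dim)]


def object_urls(store_url, array_name, keys):
    """Build one object URL per chunk key (hypothetical store layout)."""
    return [f"{store_url}/{array_name}/{key}" for key in keys]


# A (300, 100) array in (100, 100) chunks has three chunks: 0.0, 1.0, 2.0.
keys = chunk_keys(shape=(300, 100), chunks=(100, 100),
                  selection=[(150, 250), (0, 100)])
print(keys)  # ['1.0', '2.0']
print(object_urls("https://example.com/bucket", "temperature", keys))
```

Because each chunk is a separate object, workers can fetch only the chunks they need, in parallel, without reading the whole array.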
  16. Pangeo 2.0

    ‣ Pivoted away from running cloud infrastructure
    ‣ Doubled down on community and user-support
    ‣ Launched new efforts around:
        1. Education and training
        2. Cloud data curation and storage
        3. Machine learning

    Dec 22, 2020 - https://medium.com/pangeo/pangeo-2-0-2bedf099582d
  17. New community forums

    discourse.pangeo.io
    pangeo.io/pangeo-showcase
  18. Project Pythia

    ‣ Education and training hub for modern geoscience practices using Python / Pangeo
    ‣ Step-by-step tutorials to bring new users up to speed
    ‣ Cookbooks for real-world applications
  19. Pangeo Forge Recipes: open source Python package for describing and running data pipelines (“recipes”)
    https://github.com/pangeo-forge/pangeo-forge-recipes

    Pangeo Forge Cloud: cloud platform for automatically executing recipes stored in GitHub repos.
    https://pangeo-forge.org/
  20. Pangeo-ML

    ‣ Improve interoperability of the scientific Python ecosystem (Pangeo & ML Libraries)
    ‣ Development of new software interfaces connecting Xarray and ML libraries (e.g. xbatcher)
    ‣ Documenting common ML workflows using Pangeo tooling

    Check out Max Jones’s talk on Xbatcher from yesterday.
  21. What will Pangeo 3.0 be?
  22. Case study: CarbonPlan’s climate downscaling project

    ‣ Worked with 2i2c to deploy a JupyterHub on Azure
    ‣ Deployed our own Kubernetes cluster / Prefect Agent
    ‣ Managed our own Blob Storage, data versioning, transfers, etc.
    ‣ Built custom data portal / visualization tool.

    https://carbonplan.org/research/cmip6-downscaling
  23. What will Pangeo 3.0 be?

    Goal: We have to make it much, much, much easier to use Pangeo to do open science.
  24. Open Science needs a Platform

    ‣ A platform is something you can build on: specifically, new scientific discoveries and new translational applications. Let’s call these projects.
    ‣ For open science to take off at a global scale, everyone in the world needs access to the platform (like GitHub)
    ‣ This is why we are excited about cloud, but cloud as-is (e.g. AWS) is not itself an open-science platform.

    [Diagram labels: Infrastructure, Platform, Open-Science Project]
  25. The Modern Data Stack

    ‣ In the past 5 years, a platform has emerged for enterprise data science called the Modern Data Stack
    ‣ The MDS is centered around a “data lake” or “data warehouse”
    ‣ Different platform elements are provided by different SaaS companies; integration through standards and APIs
    ‣ No one in science uses any of this stuff

    https://continual.ai/post/the-modern-data-stack-ecosystem-fall-2021-edition
  26. How can we deliver an open science platform in a scalable, sustainable way?

    ‣ Embrace commercial SaaS: a Modern Data Stack for Science
    ‣ Cultivate community-operated SaaS: e.g. Wikipedia, Conda Forge, Binder, 2i2c Hubs
    ‣ We probably need a mix of both

    Community-operated SaaS for ETL (Extract / Transform / Load) of ARCO Data
    Our new startup. Building a commercial cloud data lake platform for scientific data.
  27. Final thoughts

    ‣ Pangeo today has many more users than developers (the inverse was true in the beginning). It is time to scale!
    ‣ Pangeo has broadened its focus to include supporting the open-geoscience movement. However, it is not clear yet what specific initiatives will be most impactful.
    ‣ Pangeo has provided the building blocks for the open science platform of the future.
    ‣ We are experimenting with new ways to scale the project, including various forms of service offerings.