Pangeo Forge: Crowdsourcing Open Data in the Cloud

Ryan Abernathey

July 14, 2022

Transcript

  1. Crowdsourcing Open Data in the Cloud


  2. Contributors
    • Charles Stern (Columbia / LDEO)
    • Joe Hamman (CarbonPlan)
    • Anderson Banihirwe (CarbonPlan)
    • Rachel Wegener (U. Maryland)
    • Chiara Lepore (GRO Intelligence)
    • Sean Harkins (Development Seed)
    • Alex Merose (Google Research)
    • Tom Augspurger (Microsoft)
    • Martin Durant (Anaconda)
    • Many recipe contributors

    Funding: NSF Earthcube Program ($1.5M for 3 years)

  3. The Open Science Vision
    https://earthdata.nasa.gov/esds/open-science

    for 👩‍🔬 in everyone_in_the_world:
        for 📄 in all_scientific_knowledge:
            👩‍🔬.verify(📄)
            discovery = 👩‍🔬.extend(📄)

    This would transform the 🌎 by allowing all of
    humanity to participate in the scientific process.

    What are the barriers to realizing this vision?

  4. What is Needed to Reproduce a Scientific Discovery
    The Code 📚
    The Environment 📦
    The Data 💾

  5. What is Needed to Reproduce a Scientific Discovery
    ✅ The Code 📚
    Git, GitHub, GitLab, Bitbucket, …

    ✅ The Environment 📦
    PyPI, Conda, Mamba, Conda Forge, Docker, …

    ✅ The Data 💾
    DOIs, Domain Data Repositories, Zenodo, Figshare, Dataverse, …

  6. What is Needed to Reproduce a Scientific Discovery
    ⚠ The Data 💾
    But what about big data?

    Data-Intensive Science
    Big Data (image: https://figshare.com/articles/figure/Earth_Data_Cube/4822930/2)
    ?
    💡 Insights
    💡 Understanding
    💡 Predictions

    • open-ended problem
    • exploratory analysis
    • “human in the loop”
    • visualization needed
    • highly varied computational patterns / algorithms
    • no standard architecture

  7. Privileged Institutions create “Data Fortresses*”
    *Coined by Chelle Gentemann
    Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons

  8. Privileged Institutions create “Data Fortresses*”
    Data
    *Coined by Chelle Gentemann
    Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons

  9. Privileged Institutions create “Data Fortresses*”
    Data
    *Coined by Chelle Gentemann
    Image credit: Moahim, CC BY-SA 4.0, via Wikimedia Commons

    # step 1: open data
    open("/some/random/files/on/my/cluster")
    # step 2: do mind-blowing AI!
    🤦

  10. Problems with the Status Quo
    🗂 Emphasis on files as a medium of data exchange creates lots of work for
    individual scientists (downloading, organizing, cleaning). Most file-based
    datasets are a mess, even simulation output.

    😫 Yet the hard work of data wrangling is rarely collaborative. Outputs are not
    reusable and can’t be shared.

    💰 Doing data-intensive science requires either expensive local infrastructure
    or access to a big agency supercomputer. This really limits participation.

    🏰 Data-intensive science is locked inside data fortresses. Limiting access to
    outsiders is a feature, not a bug. This restricts collaboration and reproducibility!

  11. Cloud Native Scientific Data Analytics
    1. Analysis-Ready, Cloud-Optimized Data
    2. Data-Proximate Computing
    3. Elastic Distributed Processing
    [Diagram: a Compute Environment made up of many compute nodes]

  12. Separation of Storage and Compute
    Storage costs are steady.
    Data provider pays for storage costs.
    May be subsidized by cloud provider. (Thanks AWS, GCP, Azure!)
    Or can live outside the cloud (e.g. Wasabi, OSN).

    Compute costs for interactive data analysis are bursty.
    Can take advantage of spot pricing.

    Multi-tenancy:
    We can all use the same stack, but each institution pays for its own users.

    This is completely different from the status quo infrastructure!

  13. The Pangeo Cloud-Native Stack
    Zarr: Cloud-optimized storage for multidimensional arrays.
    Dask: Flexible, general-purpose parallel computing framework.
    Xarray: High-level API for analysis of multidimensional labelled arrays.
    Jupyter: Rich interactive computing environment in the web browser.
    Domain specific packages: xgcm, xrft, xhistogram, gcm-filters, climpred, etc.
    Cloud Services: Kubernetes, Object Storage

  14. ARCO Data: Analysis Ready, Cloud Optimized
    What is “Analysis Ready”?
    • Think in “datasets” not “data files”
    • No need for tedious homogenizing / cleaning steps
    • Curated and cataloged

    How do data scientists spend their time? (CrowdFlower Data Science Report, 2016)
    • Cleaning and organizing data: 60%
    • Collecting data sets: 19%
    • Mining data for patterns: 9%
    • Refining algorithms: 4%
    • Building training sets: 3%
    • Other: 5%

  15. Example of ARCO Data
    • Chunked appropriately for analysis
    • Rich metadata
    • Everything in one dataset object
    https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/

  16. ARCO Data: Analysis Ready, Cloud Optimized
    What is “Cloud Optimized”?
    • Compatible with object storage (access via HTTP)
    • Supports lazy access and intelligent subsetting
    • Integrates with high-level analysis libraries and distributed frameworks
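
    Note on slide 16: a minimal sketch of why chunked storage makes lazy access
    and subsetting cheap. It assumes a one-dimensional array split into fixed-size
    chunks, each stored as a separate object, with non-negative indices. The
    function names and the key layout are hypothetical illustrations, not the API
    of Zarr or any other library.

```python
def chunks_for_slice(start: int, stop: int, chunk_size: int) -> list[int]:
    """Return the indices of chunks overlapping the half-open range [start, stop)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if stop <= start:
        return []  # empty selection: nothing needs to be fetched
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))


def object_keys(array_name: str, chunk_ids: list[int]) -> list[str]:
    """Build hypothetical object-store keys, one per chunk."""
    return [f"{array_name}/{i}" for i in chunk_ids]


# A year of daily data stored in 30-day chunks: reading days 45..74
# touches only two of the thirteen chunks.
needed = chunks_for_slice(45, 75, chunk_size=30)
print(object_keys("sea_surface_height", needed))
```

    Only the listed objects would need to be requested over HTTP; the rest of the
    dataset is never transferred.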

  17. ARCO Data is Fast!
    https://doi.org/10.1109/MCSE.2021.3059437
    This also demonstrates the potential of the “hybrid cloud” model with OSN.

  18. Problem: Making ARCO Data is Hard!
    To produce useful ARCO data, you must have:
    • Domain Expertise: How to find, clean, and homogenize data
    • Tech Knowledge: How to efficiently produce cloud-optimized formats
    • Compute Resources: A place to stage and upload the ARCO data
    • Analysis Skills: To validate and make use of the ARCO data
    😩 Data Scientist

  19. Whose Job is it to Make ARCO Data?
    Data providers are concerned with preservation and archival quality.
    Scientist users know what they need to make the data analysis-ready.

  20. Pangeo Forge
    Let’s democratize the production of ARCO data!

    Domain Expertise: How to find, clean, and homogenize data
    🤓 Data Scientist

  21. Inspiration: Conda Forge

  22. Pangeo Forge Recipes
    Open source Python package for describing and running data pipelines
    (“recipes”).
    https://github.com/pangeo-forge/pangeo-forge-recipes

    Pangeo Forge Cloud
    Cloud platform for automatically executing recipes stored in GitHub repos.
    https://pangeo-forge.org/

  23. Pangeo Forge Recipes
    • FilePattern: Describes where to find the source files which are the
      inputs to the recipe
    • StorageConfig: Describes where to store the outputs of our recipe
    • Recipe: A complete, self-contained representation of the pipeline
    • Executor: Knows how to run the recipe
    https://pangeo-forge.readthedocs.io/
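
    Note on slide 23: a toy sketch of how the four components might fit together.
    The classes below are hypothetical stand-ins that only borrow the names on the
    slide; they are not the pangeo-forge-recipes API. The URL template is made up,
    and a plain dict stands in for an object store.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FilePattern:
    """Where to find the source files (hypothetical stand-in)."""
    url_template: str
    keys: list[str]

    def urls(self) -> list[str]:
        return [self.url_template.format(key=k) for k in self.keys]


@dataclass
class StorageConfig:
    """Where to store outputs; a dict stands in for an object store."""
    target: dict[str, str] = field(default_factory=dict)


@dataclass
class Recipe:
    """Self-contained description of the pipeline: inputs, transform, outputs."""
    pattern: FilePattern
    storage: StorageConfig
    transform: Callable[[str], str]

    def tasks(self) -> list[Callable[[], None]]:
        # One independent task per source file, so an executor may run them in any order.
        def make_task(url: str) -> Callable[[], None]:
            def task() -> None:
                self.storage.target[url] = self.transform(url)
            return task
        return [make_task(u) for u in self.pattern.urls()]


def run_serially(recipe: Recipe) -> None:
    """The simplest possible executor: run every task in order."""
    for task in recipe.tasks():
        task()


pattern = FilePattern("https://example.org/data/{key}.nc", ["2020-01", "2020-02"])
recipe = Recipe(pattern, StorageConfig(), transform=lambda url: f"converted:{url}")
run_serially(recipe)
print(sorted(recipe.storage.target))
```

    Keeping the recipe separate from the executor is the point of the sketch: the
    same task list could be handed to a parallel or cloud-based runner without
    changing the recipe itself.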

  24. Pangeo Forge Cloud
    • Feedstock: Contains the code and metadata for one or more Recipes
    • Bakery: Runs the recipes in the cloud using elastic scaling clusters
    • Storage: GCS
    https://pangeo-forge.org/

  25. Vision: Collaborative Data Curation
    Feedstock
    🤓 Data User
    🤓 Data Producer
    🤓 Data Manager
    “These data look weird…”
    “…Oh the metadata need an update.”
    “Ok I’ll make a PR to the recipe.”

  26. Demo

  27. Oceanographers building a full-stack cloud SaaS automation platform.
    https://twitter.com/Colinoscopy/status/1255890780641689601
    Charles Stern

  28. Current Status
    🙌 Pangeo Forge Cloud is live and open for business!
    pangeo-forge.org

    💣 Our recipes often contain 10000+ tasks. We are hitting the limits of Prefect
    as a workflow engine. Currently refactoring to move to Apache Beam.

    😫 Data has lots of edge cases! This is really hard.

    🌎 But we remain very excited about the potential of Pangeo Forge to
    transform how scientists interact with data.

  29. Pangeo Forge Development
    https://github.com/pangeo-forge/pangeo-forge-recipes
    This is a 💯% open project! Join us!