
Pangeo Forge Tutorial Introduction

Ryan Abernathey

March 07, 2022

Transcript

  1. A cloud-native data library for ocean, weather, and climate science.

  2. Your Instructors
    Ryan Abernathey, Columbia / LDEO
    Rachel Wegener, U. Maryland
    Charles Stern, Columbia / LDEO

  3. Schedule
    11:30-11:50  Intro to Pangeo Forge (Ryan)
    11:50-12:15  Pangeo Forge Recipes Tutorial (Rachel)
    12:15-12:30  Pangeo Forge Cloud Tutorial (Charles, video)
    12:30-12:40  Break ➡ Breakouts
    12:40-1:20   Work on recipes in breakouts
    1:20-1:30    Reconvene / wrap up (Rachel & Ryan)

  4. Learning Goals
    Goal: Understand what Pangeo Forge Recipes does and how it relates to Pangeo Forge Cloud
    Evidence: Explain when someone would need Pangeo Forge Recipes and when someone would need Pangeo Forge Cloud

    Goal: Learn to use Pangeo Forge Recipes for simple (concat-only) recipes
    Evidence: Write a simple (concat-only) recipe and run it in Binder

    Goal: Learn to transition a recipe to Pangeo Forge Cloud
    Evidence: Make a PR to staged recipes

    Goal: Help other people to use Pangeo Forge on their own when the tutorial is done
    Evidence: Generate ideas of how Pangeo Forge could benefit their work

    Goal: Understand the path to becoming a tool contributor
    Evidence: Create actionable issues on the pangeo-forge-recipes issue tracker

  5. Cloud Native Data Analytics
    1. Analysis-Ready, Cloud-Optimized Data
    2. Data-Proximate Computing
    3. Elastic Distributed Processing
    [Figure: a compute environment made up of many compute nodes]

  7. ARCO Data
    Analysis Ready, Cloud Optimized
    What is “Analysis Ready”?
    • Think in “Datasets” not “data files”
    • No need for tedious homogenizing / cleaning steps
    • Curated and cataloged
    [Figure: How do data scientists spend their time? CrowdFlower Data Science Report (2016)]
    Cleaning and organizing data: 60%
    Collecting data sets: 19%
    Mining data for patterns: 9%
    Refining algorithms: 4%
    Other: 5%
    Building training sets: 3%

  8. Example of ARCO Data
    • Chunked appropriately for analysis
    • Rich metadata
    • Everything in one dataset object
    https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/

  9. ARCO Data
    Analysis Ready, Cloud Optimized
    What is “Cloud Optimized”?
    • Compatible with object storage (access via HTTP)
    • Supports lazy access and intelligent subsetting
    • Integrates with high-level analysis libraries and distributed frameworks
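To make "lazy access and intelligent subsetting" concrete, here is a minimal plain-Python sketch of the underlying idea: an array is split into fixed-size chunks, each chunk is stored as its own object, and a read fetches only the chunks that overlap the requested range. `FakeObjectStore` and `ChunkedArray` are hypothetical names invented for this illustration; they are not part of Zarr, Pangeo Forge, or any real library, and the read counter merely stands in for HTTP requests.

```python
from dataclasses import dataclass, field


@dataclass
class FakeObjectStore:
    """Hypothetical stand-in for an object store: a key -> value mapping that counts reads."""
    objects: dict = field(default_factory=dict)
    reads: int = 0

    def get(self, key):
        self.reads += 1  # each call stands in for one HTTP GET
        return self.objects[key]


@dataclass
class ChunkedArray:
    """Hypothetical 1-D array stored as fixed-size chunks, one object per chunk."""
    store: FakeObjectStore
    length: int
    chunk_size: int

    @classmethod
    def write(cls, store, values, chunk_size):
        # Write each chunk under its own key, as a chunked format would.
        for start in range(0, len(values), chunk_size):
            key = f"chunk/{start // chunk_size}"
            store.objects[key] = list(values[start:start + chunk_size])
        return cls(store, len(values), chunk_size)

    def read(self, start, stop):
        """Return values[start:stop], fetching only the chunks that overlap the range."""
        if not 0 <= start <= stop <= self.length:
            raise IndexError("slice out of bounds")
        if start == stop:
            return []
        first = start // self.chunk_size
        last = (stop - 1) // self.chunk_size
        out = []
        for i in range(first, last + 1):
            chunk = self.store.get(f"chunk/{i}")
            lo = max(start - i * self.chunk_size, 0)
            hi = min(stop - i * self.chunk_size, len(chunk))
            out.extend(chunk[lo:hi])
        return out


store = FakeObjectStore()
array = ChunkedArray.write(store, list(range(100)), chunk_size=10)
subset = array.read(25, 42)  # overlaps chunks 2, 3 and 4 only
print(subset, store.reads)
```

The point of the design is that the cost of a read scales with the size of the subset, not with the size of the whole dataset, which is what makes chunked formats a good fit for object storage.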

  11. ARCO Data is Fast!
    https://doi.org/10.1109/MCSE.2021.3059437

  12. CMIP6 Cloud Dataset
    • Pangeo partnered with ESGF and Google Cloud to provide a new public dataset
    • > 1 PB and counting
    • Data stored in Zarr format
    • Google provides free hosting in GCS
    • Mirrored on AWS

  13. Problem: Making ARCO Data is Hard!
    To produce useful ARCO data, you must have:
    😩 Data Scientist

  14. Problem: Making ARCO Data is Hard!
    To produce useful ARCO data, you must have:
    • Domain Expertise: How to find, clean, and homogenize data
    😩 Data Scientist

  15. Problem: Making ARCO Data is Hard!
    To produce useful ARCO data, you must have:
    • Domain Expertise: How to find, clean, and homogenize data
    • Tech Knowledge: How to efficiently produce cloud-optimized formats
    😩 Data Scientist

  16. Problem: Making ARCO Data is Hard!
    To produce useful ARCO data, you must have:
    • Domain Expertise: How to find, clean, and homogenize data
    • Tech Knowledge: How to efficiently produce cloud-optimized formats
    • Compute Resources: A place to stage and upload the ARCO data
    😩 Data Scientist

  17. Problem: Making ARCO Data is Hard!
    To produce useful ARCO data, you must have:
    • Domain Expertise: How to find, clean, and homogenize data
    • Tech Knowledge: How to efficiently produce cloud-optimized formats
    • Compute Resources: A place to stage and upload the ARCO data
    • Communication Skills: To explain to others how to use the data
    😩 Data Scientist

  18. Pangeo Forge
    Let’s democratize the production of ARCO data!
    🤓 Data Scientist

  19. Pangeo Forge
    Let’s democratize the production of ARCO data!
    • Domain Expertise: How to find, clean, and homogenize data
    🤓 Data Scientist

  20. Inspiration: Conda Forge

  21. Pangeo Forge Recipes | Pangeo Forge Cloud
    Pangeo Forge Recipes: Open source Python package for describing and running data pipelines (“recipes”).
      https://github.com/pangeo-forge/pangeo-forge-recipes
    Pangeo Forge Cloud: Cloud platform for automatically executing recipes stored in GitHub repos.
      https://pangeo-forge.org/

  22. Pangeo Forge Recipes
    • FilePattern: Describes where to find the source files which are the inputs to the recipe
    • StorageConfig: Describes where to store the outputs of our recipe
    • Recipe: A complete, self-contained representation of the pipeline
    • Executor: Knows how to run the recipe.
    https://pangeo-forge.readthedocs.io/
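The way these four pieces fit together can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the pangeo-forge-recipes API: the `Toy*` classes and `toy_executor` are hypothetical names, the "transformation" only records a file name, and the `humidity_01` and `humidity_02` URLs are made up to extend the single example URL shown in the deck.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFilePattern:
    """Where the inputs live (hypothetical stand-in for FilePattern)."""
    urls: list


@dataclass
class ToyStorageConfig:
    """Where the outputs go (hypothetical stand-in for StorageConfig)."""
    target: dict = field(default_factory=dict)


@dataclass
class ToyRecipe:
    """The whole pipeline: inputs, outputs, and the steps between them."""
    pattern: ToyFilePattern
    storage: ToyStorageConfig

    def tasks(self):
        # One task per input; each task is a zero-argument callable.
        return [lambda url=url: self.store_input(url) for url in self.pattern.urls]

    def store_input(self, url):
        # A real recipe would fetch and transform the file; here we only record its name.
        self.storage.target[url] = f"converted:{url.rsplit('/', 1)[-1]}"


def toy_executor(recipe):
    """Knows how to run a recipe: here, simply one task after another."""
    for task in recipe.tasks():
        task()


recipe = ToyRecipe(
    ToyFilePattern([f"http://data-provider.org/data/humidity_{i:02d}.txt" for i in range(1, 4)]),
    ToyStorageConfig(),
)
toy_executor(recipe)
print(recipe.storage.target)
```

Keeping the recipe as a description of tasks, separate from the code that runs them, is what lets the same recipe be run in different places.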

  23. File Patterns
    Describe where to find the source files which are the inputs to the recipe
    ConcatDim (time)
    MergeDim (variable): temperature, humidity
    http://data-provider.org/data/humidity_03.txt
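The idea of a file pattern, a grid of keys along a merge dimension and a concat dimension where each cell maps to one source URL, can be illustrated with the standard library alone. This sketch does not use the pangeo-forge-recipes classes; `make_url`, the list of time steps, and the assumption that every file follows the `{variable}_{NN}.txt` naming of the one URL shown are all illustrative.

```python
from itertools import product

# Assumed keys: the deck names the two dimensions and shows a single URL,
# so the full key lists below are illustrative.
variables = ["temperature", "humidity"]  # merge dimension (variable)
time_steps = [1, 2, 3]                   # concat dimension (time)


def make_url(variable, time):
    """Format function: map one (variable, time) key pair to a source file URL."""
    return f"http://data-provider.org/data/{variable}_{time:02d}.txt"


# The pattern is the full grid of keys, each mapped to a URL.
pattern = {
    (variable, time): make_url(variable, time)
    for variable, time in product(variables, time_steps)
}

for key, url in pattern.items():
    print(key, url)
```

Files that share a variable are meant to be concatenated along time, while files that share a time step are merged as separate variables of the same dataset.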

  24. Recipes
    A complete, self-contained representation of a data transformation pipeline
    (Sensible defaults + lots of options to customize the transformation.)

  25. Executors
    Executors know how to run recipes.
    Use whichever one makes sense for you!
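Why executors can be swapped is easiest to see with a small sketch: if a recipe boils down to a list of independent tasks, any runner that calls every task gives the same result. The code below uses only the Python standard library; `make_tasks`, `run_serial`, and `run_threaded` are hypothetical names for this illustration and are not the executors that ship with pangeo-forge-recipes.

```python
from concurrent.futures import ThreadPoolExecutor


def make_tasks(inputs, target):
    """Hypothetical recipe compiled to independent tasks, one per input."""
    def store(i, value):
        target[i] = value * 2  # placeholder "transformation"
    return [lambda i=i, v=v: store(i, v) for i, v in enumerate(inputs)]


def run_serial(tasks):
    """Simplest executor: run every task in order in the current process."""
    for task in tasks:
        task()


def run_threaded(tasks, max_workers=4):
    """Alternative executor: run the same tasks on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for completion and re-raises any task exception.
        list(pool.map(lambda task: task(), tasks))


inputs = [1, 2, 3, 4]
serial_target, threaded_target = {}, {}
run_serial(make_tasks(inputs, serial_target))
run_threaded(make_tasks(inputs, threaded_target))
print(serial_target == threaded_target)
```

Because each task writes to its own key, the order in which tasks finish does not matter, which is the property that lets the same work move from a laptop to a distributed cluster.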

  26. Pangeo Forge Cloud
    • Feedstock: Contains the code and metadata for one or more Recipes
    • Bakery: Runs the recipes in the cloud using elastic scaling clusters
    • Storage: GCS
    https://pangeo-forge.org/

  27. Vision: Collaborative Data Curation
    Feedstock
    🤓 Data User
    🤓 Data Producer
    🤓 Data Manager
    “These data look weird…”
    “…Oh the metadata need an update.”
    “Ok I’ll make a PR to the recipe.”

  28. Pangeo Forge Development
    https://github.com/pangeo-forge/roadmap
    This is a 💯% open project!

  29. You are a Guinea Pig!
    This is all brand new! You are the first people to try Pangeo Forge Cloud.
    It will almost certainly break.
    Your feedback will help us improve it.
    🙏