
Pangeo OGC March 2022

Presentation given at the March 2022 OGC meeting: https://portal.ogc.org/meet/?p=default&mid=88

Ryan Abernathey

March 09, 2022

Transcript

  1. Pangeo
    A Community Platform for Big Data Geoscience

    OGC Member Meeting
    March 2, 2022

    http://pangeo.io
    https://discourse.pangeo.io/
    https://github.com/pangeo-data/
    https://medium.com/pangeo
    @pangeo_data

  2. Who Am I?
    Physical Oceanographer
    Ph.D. from MIT, 2012
    Associate Prof. at Columbia / LDEO
    https://ocean-transport.github.io/
    Core developer of Xarray
    Core developer of Zarr
    Co-founder of Pangeo
    Open Source Advocate

  3. Pangeo represents a user-centric approach to climate / ocean /
    weather data analytics.

    We are not data providers. We are data users.

    Pangeo is the user-centric, open-source platform that was missing in
    2017 when we started this work.

  4. 80/20 Rule of Data Science
    How a Data Scientist Spends Their Day
    “Here’s where the popular view of data scientists diverges pretty significantly from reality. Generally,
    we think of data scientists building algorithms, exploring data, and doing predictive analysis. That’s
    actually not what they spend most of their time doing, however.”
    [Chart: What data scientists spend the most time doing]
    Cleaning and organizing data: 60%
    Collecting data sets: 19%
    Mining data for patterns: 9%
    Refining algorithms: 4%
    Building training sets: 3%
    Other: 5%
    [Other headings visible in the screenshot: “Data scientist job satisfaction”, “How do data scientists spend their time?”]
    Source: CrowdFlower Data Science Report (2016)

  5. Credit: JPL / NASA PO.DAAC
    SWOT
    NISAR


  6. The “Download” Model
    a) file-based approach
    b) database / api approach
    step 1: download
    step 2: clean / organize
    step 3: analyze
    [Diagram labels: file, record, query, local disk, files]

  7. The “Download” Model
    MB 😀
    [Same diagram as slide 6]

  8. The “Download” Model
    GB 😐
    [Same diagram as slide 6]

  9. The “Download” Model
    TB 😖
    [Same diagram as slide 6]

  10. The “Download” Model
    PB 😱
    [Same diagram as slide 6]
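
To make the escalation from MB to PB concrete, here is a rough back-of-the-envelope sketch of how long step 1 (download) would take at each scale. The 100 Mbit/s sustained bandwidth and the decimal (SI) units are assumptions for illustration, not figures from the talk.

```python
# Rough transfer-time estimate for the "download" model.
# ASSUMPTION: 100 Mbit/s sustained bandwidth; decimal (SI) units.
BANDWIDTH_BITS_PER_S = 100e6

SIZES_BYTES = {"MB": 1e6, "GB": 1e9, "TB": 1e12, "PB": 1e15}


def download_seconds(n_bytes, bandwidth_bits_per_s=BANDWIDTH_BITS_PER_S):
    """Seconds needed to move n_bytes over a link of the given speed."""
    return n_bytes * 8 / bandwidth_bits_per_s


def humanize(seconds):
    """Express a duration in the largest unit that keeps the value >= 1."""
    for unit, length in (("years", 365 * 86400), ("days", 86400),
                         ("hours", 3600), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.2f} seconds"


for label, n_bytes in SIZES_BYTES.items():
    print(f"1 {label}: {humanize(download_seconds(n_bytes))}")
```

Under these assumptions a terabyte takes roughly a day and a petabyte takes years, before any cleaning or analysis has started.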

  11. Never Mind …
    How?
    Let’s “bring the compute to the data”!

  12. Use a “Platform”

  13. The Trouble with “Platforms”
    • Scientists’ creativity often exceeds pre-baked capabilities.
      Desire to go under the hood
    • What if you want to access data that isn’t included?
      Data catalog is determined by provider, not users
    • Platforms are “single instance”:
      Fear of lock-in, possibility platform will disappear
    • Who pays?

  14. OPEN Cloud Architecture
    Data Provider’s $ Data Consumer’s $

  15. OPEN Cloud Architecture
    Data Provider’s $ Data Consumer’s $
    Interactive Computing

  16. OPEN Cloud Architecture
    Data Provider’s $ Data Consumer’s $
    Interactive Computing
    Parallel Computing

  17. OPEN Cloud Architecture
    Data Provider’s $ Data Consumer’s $
    Interactive Computing
    Parallel Computing
    Analysis Ready Data
    Cloud Optimized Formats

  18. OPEN Cloud Architecture
    Data Provider’s $ Data Consumer’s $
    Interactive Computing
    Parallel Computing
    Analysis Ready Data
    Cloud Optimized Formats

  19. What is Pangeo?
    • Community obsessed with efficient data processing.
      Founded in 2017. Scientists and software developers coming together. http://pangeo.io/
      Weekly meeting / seminar. Discourse Forum. Annual meeting. Workshops at AGU / AMS / etc.
    • Interoperable Software
      Foundation in Open Source Scientific Python: Jupyter, Xarray, Dask, Zarr. Broad ecosystem of
      interoperable packages for analysis, visualization, and machine learning.
    • Data and Computing Infrastructure
      Deployment recipes for cloud and HPC. Open, public, cloud-based JupyterHubs and Binders
      for data-proximate computing. PB of analysis-ready, cloud-optimized data stored in public
      cloud (GCS, AWS) and OpenStorageNetwork.

  20. The Pangeo Community Process
    Scientific users / use cases
    • Define science questions
    • Use software / infrastructure
    • Identify bugs / bottlenecks
    • Provide feedback to developers
    Open-source software libraries
    • Contribute widely to the open source scientific python ecosystem
    • Maintain / extend existing libraries, start new ones reluctantly
    • Solve integration challenges
    HPC and cloud infrastructure
    • Deploy interactive analysis environments
    • Curate analysis-ready datasets
    • Platform agnostic
    Agile development 👩💻

  21. Bringing together Academia, Gov. Agencies & Industry

  22. The Pangeo Open-Source Cloud Stack
    Jupyter: Rich interactive computing environment in the web browser.
    Xarray: High-level API for analysis of multidimensional labelled arrays.
    Dask: Flexible, general-purpose parallel computing framework.
    Zarr: Cloud-optimized storage for multidimensional arrays.
    Domain specific packages: xgcm, xrft, xhistogram, gcm-filters, climpred, etc.
    Cloud Services: Kubernetes, Object Storage

  23. Pangeo Cloud Infrastructure
    Compute Services
    Dask Gateway
    Node Pools (Autoscaling): preemptible (spot instance), normal
    Data Lakes
    Zarr Datasets
    [Diagram: each Zarr dataset is drawn as a set of objects: .zarray, .zattrs, and chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1]
    http://catalog.pangeo.io
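
The object names in the diagram (.zarray, .zattrs, 0.0 through 2.1) follow the Zarr v2 convention of metadata objects plus one object per chunk, keyed by chunk index. Below is a minimal standard-library sketch of that key layout. The array shape (30, 20) and chunk shape (10, 10) are assumptions chosen to reproduce the 3 x 2 grid on the slide, the helper names are hypothetical, and the metadata is abbreviated rather than a complete `.zarray` document.

```python
import itertools
import json
import math


def chunk_keys(shape, chunks, sep="."):
    """List the chunk object names for an array split into a regular chunk grid."""
    grid = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    return [sep.join(map(str, idx))
            for idx in itertools.product(*(range(g) for g in grid))]


def chunk_for_index(index, chunks, sep="."):
    """Name of the chunk object that holds a given array element."""
    return sep.join(str(i // c) for i, c in zip(index, chunks))


# ASSUMPTION: shape and chunk sizes picked to match the slide's 3 x 2 grid.
shape, chunks = (30, 20), (10, 10)

# Abbreviated metadata; a real .zarray also records dtype, compressor, etc.
zarray = json.dumps({"zarr_format": 2, "shape": shape, "chunks": chunks})

store_keys = [".zarray", ".zattrs"] + chunk_keys(shape, chunks)
print(store_keys)
print(chunk_for_index((25, 3), chunks))
```

Because every chunk is a separate object, a reader that needs element (25, 3) only has to fetch object 2.0, which is what lets many workers read the same dataset from object storage in parallel.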

  24. CMIP6 Cloud Dataset
    • Pangeo partnered with Google Cloud to provide a new public dataset
    • Data stored in Zarr format
    • Google provides free hosting in GCS
    • LDEO does the work of transferring the data from ESGF to GCS
    • Mirrored on AWS

  25. Message 1: The Pangeo Approach Works!
    • We didn’t build very much new stuff; we just helped existing, community-developed
      tools work together. Open and community-driven from day 1.
      Sustainability
    • “Power users” always just want direct, data-proximate access to the raw data.
      Simplicity
    • The same stack is an effective base-layer for apps / dashboards / APIs, etc.
      Modularity

  26. Agencies Using Pangeo
    Recent talks from the Pangeo Showcase Seminar: https://zenodo.org/communities/pangeo/

  27. Message 2: Analysis-Ready, Cloud-Optimized (ARCO) Data is Great
    • Good ARCO data (COG, Zarr, TileDB, Parquet) + S3 obviates the need for some APIs / services.
    • Legacy formats (netCDF / HDF5 / GRIB) don’t always play well with object storage.
    • ARCO data production takes time and skill.
    [Figure: ARCO Data]

  28. Future Challenge: Data Gravity
    “Data gravity is the ability of a body of data to attract applications, services and other data.” - Dave McCrory

    NASA (200 PB), NOAA BDP, ASDI (incl. CMIP6), NCAR Datasets, etc…

    Planetary Computer, NOAA BDP

    Earth Engine, NOAA BDP, Descartes, Pangeo

    SentinelHub, Climate Change, Atmosphere, Marine

    ECMWF

    DOE, XSEDE, HECC, NCAR

  29. Data Gravity
    What is the stable steady-state solution?
    DOE, XSEDE, HECC, NCAR
    ?

  30. We need a global Scientific Data Commons
    Edge storage, decentralized web, web3
    DOE, XSEDE, HECC, NCAR
    ?

  31. Summary
    • The Pangeo approach (open source, modular, collaborative) has been embraced by both
      science “power users” and builders of Earth-System analytics platforms.
    • Standards are extremely important for modular, open-source analytics platforms!
    • Analysis Ready, Cloud Optimized Data in object storage is the foundation of performant and
      flexible cloud Earth-System analytics platforms.
    • We need a global scientific commons that lives outside the big cloud providers. Otherwise
      data gravity will suck all of science into AWS.
    Thanks to our funders!

  32. Learn More
    http://pangeo.io
    https://discourse.pangeo.io/
    https://github.com/pangeo-data/
    https://medium.com/pangeo
    @pangeo_data

  33. Extra Slides


  34. ARCO Data + HTTP (S3) Is More Performant and Flexible than a Bespoke API
    https://xpublish.readthedocs.io/
    Serve dynamically generated Zarr data over HTTP.
    Client can’t tell the difference.
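
The claim that a client cannot tell stored data from generated data rests on the store abstraction: in Zarr v2 a reader asks a key-value store for keys and receives bytes, so it does not matter whether those bytes sat in a bucket or were computed on request. The sketch below is a hypothetical, standard-library-only illustration of that idea. It is not Xpublish's implementation or API; the class name `GeneratedStore` and the synthetic values are assumptions.

```python
import json
import struct
from collections.abc import Mapping


class GeneratedStore(Mapping):
    """Read-only key-value store whose chunk bytes are computed on request.

    HYPOTHETICAL illustration: it mimics the key layout of a 2-D Zarr v2
    array with uncompressed float64 chunks, nothing more.
    """

    def __init__(self, shape, chunks):
        self.shape, self.chunks = shape, chunks
        self.meta = {"zarr_format": 2, "shape": list(shape), "chunks": list(chunks),
                     "dtype": "<f8", "compressor": None, "fill_value": None,
                     "filters": None, "order": "C"}

    def _grid(self):
        # Number of chunks along each axis (ceiling division).
        return [-(-n // c) for n, c in zip(self.shape, self.chunks)]

    def __iter__(self):
        yield ".zarray"
        rows, cols = self._grid()
        for i in range(rows):
            for j in range(cols):
                yield f"{i}.{j}"

    def __len__(self):
        rows, cols = self._grid()
        return 1 + rows * cols

    def __getitem__(self, key):
        if key == ".zarray":
            return json.dumps(self.meta).encode()
        try:
            i, j = (int(part) for part in key.split("."))
        except ValueError:
            raise KeyError(key) from None
        rows, cols = self._grid()
        if not (0 <= i < rows and 0 <= j < cols):
            raise KeyError(key)
        r, c = self.chunks
        # Synthetic data generated on the fly: value = global_row * 1000 + global_col.
        values = [(i * r + y) * 1000 + (j * c + x) for y in range(r) for x in range(c)]
        return struct.pack(f"<{len(values)}d", *values)


store = GeneratedStore(shape=(4, 4), chunks=(2, 2))
print(list(store))
print(struct.unpack("<4d", store["1.0"]))
```

Served over HTTP, each key would simply become a URL path, so the same client code could read from a bucket or from a live service.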

  35. Pangeo Forge: Democratizing ARCO Data Production
    https://pangeo-forge.org/
    An open source platform for creating ARCO datasets.
    We crowdsource “recipes” for ARCO data from the
    global science community. Cloud automation builds
    the datasets in a scalable and reproducible way.

  36. Kerchunk: Make Your Legacy Data Look and Feel Like Zarr
    • Provides a unified way to represent a variety of chunked, compressed binary data formats
      (e.g. NetCDF/HDF5, GRIB2, TIFF, …)
    • Allows efficient access to data from traditional file systems or cloud object storage.
    • Create virtual datasets from multiple files by extracting the byte ranges, compression
      information etc. and storing this metadata in a new, separate object.
    • Open Spec, python implementation.
    https://fsspec.github.io/kerchunk/
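
The byte-range idea in the third bullet can be sketched with the standard library alone. The example below is an illustration modeled loosely on the reference layout Kerchunk writes (chunk keys mapped to a location, offset, and length, with small metadata stored inline); it does not call the kerchunk library, and the file contents, key names, and `read_ref` helper are assumptions made up for the example.

```python
import json
import os
import tempfile

# Build a stand-in "legacy" file: a header followed by two raw chunks.
header, chunk_a, chunk_b = b"HDR!" * 4, b"A" * 8, b"B" * 8
path = os.path.join(tempfile.mkdtemp(), "legacy.bin")
with open(path, "wb") as f:
    f.write(header + chunk_a + chunk_b)

# Reference set kept as a separate object: each chunk key maps to
# [location, byte offset, byte length]; small metadata is stored inline.
references = {
    "version": 1,
    "refs": {
        ".zarray": json.dumps({"zarr_format": 2, "shape": [16], "chunks": [8]}),
        "0": [path, len(header), len(chunk_a)],
        "1": [path, len(header) + len(chunk_a), len(chunk_b)],
    },
}


def read_ref(refs, key):
    """Resolve a key to bytes: inline strings directly, otherwise a byte-range read."""
    entry = refs["refs"][key]
    if isinstance(entry, str):
        return entry.encode()
    location, offset, length = entry
    with open(location, "rb") as f:
        f.seek(offset)
        return f.read(length)


sidecar = json.dumps(references)  # this JSON is the "new, separate object"
print(read_ref(json.loads(sidecar), "1"))
```

The original file is never rewritten; only the small reference document is new, and against object storage the `seek` and `read` pair would become an HTTP range request.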

  37. https://medium.com/pangeo

  38. Two Papers
    https://doi.org/10.1029/2020AV000354
    https://doi.org/10.1109/MCSE.2021.3059437
