Unlocking the Potential of Cloud Native Science with Pangeo

Ryan Abernathey
November 10, 2021

Keynote Talk given at JHU IDIES Symposium

https://idies.jhu.edu/news-events/events/idies-annual-symposium/

Transcript

  1. Unlocking the Potential of Cloud Native Science
    Ryan Abernathey
    IDIES 2021

  2. Who Am I?
    Physical Oceanographer
    Ph.D. from MIT, 2012
    Associate Prof. at Columbia / LDEO
    https://ocean-transport.github.io/
    Core developer of Xarray
    Core developer of Zarr
    Co-founder of Pangeo
    Open Source Advocate

  3. This Talk
    • Part I: What is “cloud native” science? What are its benefits?
    • Part II: Demo of Pangeo workflow in the Cloud
    • Part III: Deep dive on the Pangeo technology stack
    • Part IV: Future Challenges for Cloud Native Science

  4. Two Papers
    https://doi.org/10.1029/2020AV000354
    https://doi.org/10.1109/MCSE.2021.3059437

  5. Credit: NASA's Goddard Space Flight Center

  6. https://earthdata.nasa.gov/eosdis/cloud-evolution
    SWOT
    NISAR

  7. Data is Exploding in All Fields!
    James Webb Space Telescope
    Light Sheet Fluorescence Microscope

  8. What Science do we want to do with All This Data?

  9. What Science do we want to do with All This Data?
    Take the mean!

  10. What Science do we want to do with All This Data?
    Analyze spatiotemporal variability

  11. What Science do we want to do with All This Data?
    Machine learning!
    Credit: Berkeley Lab

  12. 80/20 Rule of Data Science
    [Figure: excerpt from the CrowdFlower report, section “How a Data Scientist Spends Their Day”, shown next to a “Data scientist job satisfaction” chart]
    “Here’s where the popular view of data scientists diverges pretty significantly from reality. Generally, we think of data scientists building algorithms, exploring data, and doing predictive analysis. That’s actually not what they spend most of their time doing, however.”
    Chart: What data scientists spend the most time doing
    • Cleaning and organizing data: 60%
    • Collecting data sets: 19%
    • Mining data for patterns: 9%
    • Other: 5%
    • Refining algorithms: 4%
    • Building training sets: 3%
    How do data scientists spend their time?
    CrowdFlower Data Science Report (2016)

  13. “Data Work” in AI
    “…incentivizing data excellence as a first-class citizen of AI…”

  14. The “Download” Model
    [Diagram: data reaches the local disk by one of two routes and is then processed]
    a) file-based approach: files
    b) database / API approach: query, records
    step 1: download
    step 2: clean / organize
    step 3: analyze

  15. The “Download” Model
    MB 😀

  16. The “Download” Model
    GB 😐

  17. The “Download” Model
    TB 😖

  18. The “Download” Model
    PB 😱
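
To put rough numbers on the MB-to-PB progression in the “Download” Model slides, the sketch below estimates how long step 1 (download) might take at each scale. The 100 Mbit/s link speed is an assumed, illustrative value and does not come from the talk; the function names are hypothetical.

```python
# Hypothetical sustained download speed; not a figure from the talk.
LINK_MBIT_PER_S = 100

# Decimal (SI) sizes for one unit of each scale shown on the slides.
SIZES_BYTES = {"MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15}


def download_seconds(n_bytes, mbit_per_s=LINK_MBIT_PER_S):
    """Seconds needed to move n_bytes over a link of the given speed."""
    bytes_per_s = mbit_per_s * 1_000_000 / 8
    return n_bytes / bytes_per_s


def humanize(seconds):
    """Render a duration using the largest unit that fits."""
    units = (("years", 365 * 86400), ("days", 86400), ("hours", 3600), ("minutes", 60))
    for unit, length in units:
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.2f} seconds"


for label, n_bytes in SIZES_BYTES.items():
    print(f"1 {label}: {humanize(download_seconds(n_bytes))}")
```

Under this assumption a gigabyte takes about a minute and a petabyte takes years, which is one way to read the change in emoji from slide to slide.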

  19. OPEN Cloud Architecture
    Data Provider’s $
    Data Consumer’s $

  20. OPEN Cloud Architecture
    Data Provider’s $
    Data Consumer’s $
    Interactive Computing

  21. OPEN Cloud Architecture
    Data Provider’s $
    Data Consumer’s $
    Interactive Computing
    Parallel Computing

  22. OPEN Cloud Architecture
    Data Provider’s $
    Data Consumer’s $
    Interactive Computing
    Parallel Computing
    Analysis Ready Data
    Cloud Optimized Formats

  23. OPEN Cloud Architecture
    Data Provider’s $
    Data Consumer’s $
    Interactive Computing
    Parallel Computing
    Analysis Ready Data
    Cloud Optimized Formats

  24. Why Cloud Native?
    • Performance
    • Reliability
    • Cost Effectiveness
    • Collaboration
    • Reproducibility
    • Creativity
    • Downstream Impacts
    • Access + Inclusion

  25. Why Cloud Native?
    Massive computational resources available on demand. Elastic scaling. High-throughput data storage for distributed processing.

  26. Why Cloud Native?
    All collaborators around the world can access the same computational environment and data.

  27. Why Cloud Native?
    Industry can exploit data more effectively if it’s already in the cloud.

  28. Why Cloud Native?
    Researchers are not constrained by local infrastructure.
    https://coessing.org/

  29. Pillars of Cloud Native
    Analysis-Ready, Cloud-Optimized Data
    Data-Proximate Computing
    On Demand, Scalable Distributed Computing
    [Diagram: cluster of compute nodes]

  30. DEMO
    http://gallery.pangeo.io/repos/pangeo-gallery/physical-oceanography/01_sea-surface-height.html
    https://tinyurl.com/y5fngdbh

  31. What is Pangeo?
    “A community platform for Big Data geoscience”
    • Open Community
    • Open Source Software
    • Open Source Infrastructure
    Funders

  32. Pangeo Community
    ✓ Students / Postdocs / Faculty / Software Devs / Data Scientists
    ✓ Academia / National Labs / Industry / NGO
    ✓ Weather / Climate / Oceans / Geoscience
    ✓ US / UK / Europe / Australia
    Participation in Pangeo is open to anyone!
    http://pangeo.io

  33. Pangeo Showcase
    https://pangeo.io/pangeo-showcase.html

  34. Pangeo Software Ecosystem
    Inspiration: Stephan Hoyer, Jake Vanderplas (SciPy 2015)
    SciPy

  35. Pangeo Cloud Stack
    Zarr: cloud-optimized storage for multidimensional arrays.
    Dask: flexible, general-purpose parallel computing framework.
    Xarray: high-level API for analysis of multidimensional labelled arrays.
    Kubernetes
    Object Storage
    Domain-Specific Packages
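
The combination of chunked storage and a parallel computing framework described on this slide can be illustrated with a small map-reduce mean. This is a minimal sketch that uses only Python's standard library as a stand-in; it is not the Dask, Zarr, or Xarray API, and the function names and toy data are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Map step: reduce one chunk to a (sum, count) pair."""
    return sum(chunk), len(chunk)


def chunked_mean(chunks, max_workers=4):
    """Reduce step: combine per-chunk partial results into a global mean."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Toy "dataset": the values 0..11 split into three chunks of four values each.
chunks = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(chunked_mean(chunks))
```

Because each chunk is reduced independently, the map step could in principle run on separate workers that each read only their own chunk, which is the property the chunked layout is meant to provide.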

  36. The Pangeo Cloud Stack
    [Diagram: a Cloud Object Store and a Cloud Compute Cluster connected by HTTP GET requests]
    Cloud Object Store: Metadata (.zattrs), Chunks (0.0, 1.0, 2.0)
    Cloud Compute Cluster: Jupyter pod, Dask workers
    http://pangeo.io/cloud.html

  37. Pangeo Cloud Infrastructure
    [Diagram labels]
    Compute: Dask Gateway; Node Pools (Autoscaling): preemptible (spot instance), normal
    Data: Zarr Datasets, each drawn with .zarray and .zattrs metadata and chunks 0.0, 0.1, 1.0, 1.1, 2.0, 2.1
    http://catalog.pangeo.io
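
The dataset layout drawn on this slide (a `.zarray` and a `.zattrs` metadata document plus chunk objects named `0.0` through `2.1`) can be mimicked with a plain dictionary standing in for an object store. This is a simplified sketch, not the Zarr library: the real `.zarray` document carries more fields (for example compressor and fill value), chunks are normally compressed, and the `units` attribute and function names here are hypothetical.

```python
import json
import struct


def write_array(store, rows, chunk_shape):
    """Write a 2-D list of floats into `store` using Zarr-v2-style keys."""
    n_rows, n_cols = len(rows), len(rows[0])
    c_rows, c_cols = chunk_shape
    # Metadata objects: array description and free-form attributes.
    store[".zarray"] = json.dumps(
        {"shape": [n_rows, n_cols], "chunks": [c_rows, c_cols], "dtype": "<f8"}
    )
    store[".zattrs"] = json.dumps({"units": "m"})  # hypothetical attribute
    # One object per chunk, keyed by "<chunk row>.<chunk column>".
    for i in range(0, n_rows, c_rows):
        for j in range(0, n_cols, c_cols):
            block = [v for row in rows[i:i + c_rows] for v in row[j:j + c_cols]]
            key = f"{i // c_rows}.{j // c_cols}"
            store[key] = struct.pack(f"<{len(block)}d", *block)


def read_chunk(store, i, j):
    """Fetch and decode one chunk; on object storage this would be one HTTP GET."""
    raw = store[f"{i}.{j}"]
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


store = {}
data = [[float(10 * r + c) for c in range(4)] for r in range(6)]  # 6 x 4 array
write_array(store, data, chunk_shape=(2, 2))
print(sorted(store))
print(read_chunk(store, 2, 1))
```

Every key maps to an independent object, so a reader needs only the small metadata documents plus whichever chunks it actually uses.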

  38. Pangeo Cloud Data Catalog
    catalog.pangeo.io

  39. ARCO Data: Analysis Ready, Cloud Optimized
    What is “Analysis Ready”?
    • Think in “Datasets” not “data files”
    • No need for tedious homogenizing / cleaning steps
    • Curated and cataloged
    [Figure: excerpt from the CrowdFlower report with the chart “What data scientists spend the most time doing”: cleaning and organizing data 60%, collecting data sets 19%, mining data for patterns 9%, other 5%, refining algorithms 4%, building training sets 3%]
    How do data scientists spend their time?
    CrowdFlower Data Science Report (2016)

  40. ARCO Data: Analysis Ready, Cloud Optimized
    What is “Cloud Optimized”?
    • Compatible with object storage (access via HTTP)
    • Supports lazy access and intelligent subsetting
    • Integrates with high-level analysis libraries and distributed frameworks
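
One way to picture “lazy access and intelligent subsetting” is that a reader can work out, from the chunk shape alone, which chunk objects a requested region touches and then fetch only those. The sketch below does that index arithmetic for a 2-D array; the array size, chunk shape, and function name are hypothetical, and real libraries handle this internally.

```python
def chunk_keys_for_region(row_range, col_range, chunk_shape):
    """Return the keys of the chunks that overlap a half-open index region."""
    (r0, r1), (c0, c1) = row_range, col_range
    c_rows, c_cols = chunk_shape
    return [
        f"{i}.{j}"
        for i in range(r0 // c_rows, (r1 - 1) // c_rows + 1)
        for j in range(c0 // c_cols, (c1 - 1) // c_cols + 1)
    ]


# Hypothetical array: 1000 x 1000 values stored as 100 x 100 chunks
# (100 chunk objects in total).
keys = chunk_keys_for_region((250, 300), (0, 150), chunk_shape=(100, 100))
print(keys)
```

In this example the requested region overlaps only two of the hundred chunks, so only two objects would need to be requested over HTTP.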

  41. Example of ARCO Data
    Chunked appropriately for analysis
    Rich metadata
    Everything in one dataset object
    https://catalog.pangeo.io/browse/master/ocean/sea_surface_height/

  42. Cloud Optimized Scales!
    [Figure with series labelled “Xarray + Dask + Zarr” and “Legacy Server”]

  43. Pillars of Cloud Native
    Analysis-Ready, Cloud-Optimized Data
    Data-Proximate Computing
    On Demand, Scalable Distributed Computing
    [Diagram: cluster of compute nodes]

  44. Challenges for Cloud Native Science
    😣 Legacy data formats
    😣 Storage and egress costs
    😣 Funding models

  45. Challenge: Legacy Data Formats
    • Most of our existing scientific data formats (e.g. HDF5, FITS, ROOT, etc.) are NOT cloud-optimized (inefficient access on object storage)
    • Adopting cloud-optimized (CO) formats (e.g. Parquet, Zarr) is confusing to users and data providers
    • Transcoding legacy data to ARCO format can be tedious and complicated
    • Some clever hacks, e.g. kerchunk: https://github.com/fsspec/kerchunk
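
The idea behind reference-based approaches such as kerchunk, as far as it can be summarized here, is to leave the legacy file untouched and publish an index that maps each logical chunk to a byte range inside it. The sketch below is a simplified illustration of that idea using a local temporary file; it does not use kerchunk's API, and the file contents, keys, and function name are hypothetical.

```python
import os
import tempfile


def read_reference(refs, key):
    """Read the bytes of one logical chunk via its (path, offset, length) reference."""
    path, offset, length = refs[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Stand-in for a legacy file: a header followed by two data blocks.
header, block_a, block_b = b"LEGACYHDR", b"AAAA", b"BBBBBB"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(header + block_a + block_b)
    legacy_path = tmp.name

# Hypothetical reference index: chunk key -> [file, byte offset, byte length].
refs = {
    "0.0": [legacy_path, len(header), len(block_a)],
    "0.1": [legacy_path, len(header) + len(block_a), len(block_b)],
}

chunk_a = read_reference(refs, "0.0")
chunk_b = read_reference(refs, "0.1")
os.remove(legacy_path)  # clean up the temporary stand-in file
print(chunk_a, chunk_b)
```

On object storage the seek-and-read would become an HTTP range request, so chunk-level access becomes possible without transcoding the original data.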

  46. https://pangeo-forge.org/
    Pangeo Forge is an open source platform for
    data Extraction, Transformation, and Loading
    (ETL). The goal of Pangeo Forge is to make it
    easy to extract data from traditional data
    repositories and deposit in cloud object storage
    in analysis-ready, cloud-optimized (ARCO)
    format. Pangeo Forge is inspired directly by
    Conda Forge, a community-led collection of
    recipes for building conda packages. We hope
    that Pangeo Forge can play the same role for
    datasets.

  47. Challenge: Storage and Egress Costs
    • S3 is very expensive! ($250K / PB / year)
    • Fear of “lock-in” to a specific cloud provider due to egress fees ($50K to download 1 PB)
    • Possible solutions
      • OSN / Internet2
      • Cloudflare
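
The two figures quoted on this slide can be turned into a quick budget estimate. The sketch below takes the slide's numbers ($250K per PB per year for storage, $50K per PB downloaded) at face value and applies them to a hypothetical 200 TB dataset; the dataset size, time span, download count, and function names are assumptions for illustration, and actual cloud pricing varies by provider and tier.

```python
STORAGE_USD_PER_PB_YEAR = 250_000  # figure quoted on the slide
EGRESS_USD_PER_PB = 50_000         # figure quoted on the slide
TB_PER_PB = 1_000                  # decimal units
GB_PER_PB = 1_000_000


def storage_cost_usd(terabytes, years):
    """Cost of keeping `terabytes` of data in storage for `years`."""
    return terabytes * years * STORAGE_USD_PER_PB_YEAR / TB_PER_PB


def egress_cost_usd(terabytes, full_downloads=1):
    """Cost of downloading the whole dataset `full_downloads` times."""
    return terabytes * full_downloads * EGRESS_USD_PER_PB / TB_PER_PB


# Per-gigabyte rates implied by the slide's figures.
print(STORAGE_USD_PER_PB_YEAR / GB_PER_PB)  # USD per GB per year
print(EGRESS_USD_PER_PB / GB_PER_PB)        # USD per GB downloaded

# Hypothetical 200 TB dataset kept for 5 years and fully downloaded 3 times.
print(storage_cost_usd(200, 5), egress_cost_usd(200, 3))
```

With these inputs storage dominates the total, while egress grows with every additional full download, which is the source of the “lock-in” concern.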

  48. (no text on slide)

  49. Challenge: Funding Model
    • Local infrastructure, used only by members, should clearly be paid for by the local institution
    • Since cloud infrastructure can scale to accommodate any number of users (up to the entire field), it’s not clear who should pay for it
    • My option: make it easily “franchisable”, allowing institutions to incrementally add capacity to a federation to support their users while still leveraging economies of scale

  50. VISION
    [Diagram labels: data pods, industry group, research group, HPC Centers]

  51. Learn More
    http://pangeo.io
    https://discourse.pangeo.io/
    https://github.com/pangeo-data/
    https://medium.com/pangeo
    @pangeo_data