Rechunker (ESIP Summer 2020)

Presentation of the rechunker library given at the 2020 ESIP Summer Meeting.

https://rechunker.readthedocs.io/

Ryan Abernathey

July 22, 2020

Transcript

  1. Rechunker: The Missing Link for Cloud Optimized Data
    Ryan Abernathey & Tom Augspurger

  2. OPEN Cloud Architecture
    Analysis Ready Data
    Cloud Optimized Formats
    Scalable Parallel Computing Frameworks
    Data Provider’s $
    Data Consumer’s $

  3. Chunked Array Formats
    • Break up large arrays into smaller, more-manageable sub-arrays: “chunks”
    • Chunks define a unit for parallelism
    • Play well with distributed computing frameworks (e.g. Dask)
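
A minimal sketch of the chunking idea on this slide, assuming a regular chunk grid. The function name `chunk_slices` and the example shape are hypothetical and are not taken from the talk.

```python
from itertools import product


def chunk_slices(shape, chunks):
    """Return one tuple of slices per chunk of a regularly chunked array."""
    per_axis = []
    for size, step in zip(shape, chunks):
        # Edge chunks are shorter when the chunk size does not divide the axis.
        per_axis.append(
            [slice(start, min(start + step, size)) for start in range(0, size, step)]
        )
    return list(product(*per_axis))


# Hypothetical array: 10 time steps on a 4 x 6 grid, chunked as (1, 4, 3).
slices = chunk_slices(shape=(10, 4, 6), chunks=(1, 4, 3))
print(len(slices))  # number of independent units of work
```

Each tuple of slices selects one chunk, so a parallel framework can hand each one to a separate task.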

  4. Problem: Chunks not Aligned with Analysis
    Figure axes: Space, Time
    ✅ Calculate global statistics at each point in time
    ❌ Calculate timeseries statistics at each point in space
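
To make the mismatch concrete, here is a rough count of how many chunks each kind of analysis has to read. The array shape, the chunk shape, and the helper `chunks_touched` are hypothetical and chosen only for illustration.

```python
from math import ceil, prod


def chunks_touched(selection, chunks):
    """Chunks overlapped by a selection that starts on a chunk boundary."""
    return prod(ceil(n / c) for n, c in zip(selection, chunks))


# Hypothetical dataset: (time, lat, lon), stored as one full map per time step.
shape = (1000, 180, 360)
chunks = (1, 180, 360)

one_map = chunks_touched((1, 180, 360), chunks)     # global statistics at one time
one_series = chunks_touched((1000, 1, 1), chunks)   # timeseries at one point in space
print(one_map, one_series)
```

With this layout a single map needs one chunk, while a single timeseries has to open every chunk in the array.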

  5. Problem: Chunks not Aligned with Analysis
    35 posts later…

  6. Lazy Rechunking (Dask)
    Figure labels: FROM, TO
    • Requires global communication
    • Possibly uses lots of memory
    • Possibly creates a huge number of tasks

  7. Rechunker
    https://github.com/pangeo-data/rechunker
    https://rechunker.readthedocs.io/
    Tom Augspurger

  8. Design Principles
    • Respect memory limits. Rechunker’s algorithm guarantees that
    worker processes will not exceed a user-specified memory threshold.
    • Minimize the number of required tasks. Specifically, for N source
    chunks and M target chunks, the number of tasks is always less than
    N + M.
    • Be embarrassingly parallel. The task graph should be as simple as
    possible, to make it easy to execute using different task scheduling
    frameworks. This also means avoiding write locks, which are complex
    to manage.
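
The first two principles can be checked with simple arithmetic. The sketch below is only an illustration under assumed values: the shapes, the `max_mem` figure, and the helper names are hypothetical, and it does not reproduce rechunker's own planning code.

```python
from math import ceil, prod


def chunk_nbytes(chunks, itemsize):
    """Bytes needed to hold one chunk in memory."""
    return prod(chunks) * itemsize


def n_chunks(shape, chunks):
    """Number of chunks in a regularly chunked array."""
    return prod(ceil(s / c) for s, c in zip(shape, chunks))


# All values below are hypothetical.
shape = (1000, 180, 360)        # (time, lat, lon)
itemsize = 8                    # bytes per float64 element
source_chunks = (1, 180, 360)
target_chunks = (1000, 10, 10)
max_mem = 100_000_000           # user-specified threshold, in bytes

# A worker must be able to hold at least one source or target chunk.
fits = all(
    chunk_nbytes(c, itemsize) <= max_mem for c in (source_chunks, target_chunks)
)
n_source = n_chunks(shape, source_chunks)   # N on the slide
n_target = n_chunks(shape, target_chunks)   # M on the slide
task_bound = n_source + n_target            # the N + M bound quoted on the slide
print(fits, n_source, n_target, task_bound)
```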

  9. Algorithms
    Push (figure: Source, Target)
    • Read each source chunk only once
    • Write source data to many target chunks
    • Problem: need to synchronize writes
    Pull (figure: Source, Target)
    • Write each target only once
    • Read each source file many times
    • Problem: can blow out memory; a serial loop over source chunks
    may be very slow
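
The two strategies can be sketched on a one-dimensional list, with chunks held in plain dictionaries. This is a toy model, not rechunker's implementation; every function name here is hypothetical. It counts chunk reads and writes so the trade-off on the slide shows up in the numbers.

```python
def bounds(length, chunk):
    """Half-open (start, stop) bounds of each chunk along one axis."""
    return [(s, min(s + chunk, length)) for s in range(0, length, chunk)]


def overlap(a, b):
    """Overlap of two half-open ranges, or None if they do not meet."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None


def push(data, src_chunk, dst_chunk):
    """Read each source chunk once; scatter its pieces into target chunks."""
    target = {b: [None] * (b[1] - b[0]) for b in bounds(len(data), dst_chunk)}
    reads = writes = 0
    for s in bounds(len(data), src_chunk):
        block = data[s[0]:s[1]]            # single read of this source chunk
        reads += 1
        for t in target:
            o = overlap(s, t)
            if o:
                # Partial write: other source chunks may hit the same target chunk.
                target[t][o[0] - t[0]:o[1] - t[0]] = block[o[0] - s[0]:o[1] - s[0]]
                writes += 1
    return target, reads, writes


def pull(data, src_chunk, dst_chunk):
    """Write each target chunk once; gather its pieces from source chunks."""
    target = {}
    reads = writes = 0
    for t in bounds(len(data), dst_chunk):
        pieces = []
        for s in bounds(len(data), src_chunk):
            o = overlap(s, t)
            if o:
                block = data[s[0]:s[1]]    # the same source chunk may be read again
                reads += 1
                pieces.extend(block[o[0] - s[0]:o[1] - s[0]])
        target[t] = pieces                 # single complete write
        writes += 1
    return target, reads, writes


data = list(range(12))
pushed, push_reads, push_writes = push(data, src_chunk=3, dst_chunk=4)
pulled, pull_reads, pull_writes = pull(data, src_chunk=3, dst_chunk=4)
print(push_reads, push_writes, pull_reads, pull_writes)
```

Both functions produce the same target chunks. Push does fewer reads but makes partial writes to shared target chunks, which is where synchronization would be needed; pull writes each target once but re-reads source chunks.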

  10. Algorithms
    Push / Pull
    Figure labels: Source, Intermediate, Target

  11. Algorithms
    Push / Pull Consolidated
    Figure labels: Source, Intermediate, Target, single read
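
One way to picture the intermediate stage is to choose intermediate chunks no larger than either the read chunks or the write chunks along every axis. The sketch below assumes an element-wise minimum and chunk sizes that divide evenly; it is a simplification for illustration, the helper names and shapes are hypothetical, and the library's actual planning logic may differ.

```python
from math import ceil, prod


def intermediate_chunks(read_chunks, write_chunks):
    """Assumed rule: element-wise minimum of read and write chunk shapes."""
    return tuple(min(r, w) for r, w in zip(read_chunks, write_chunks))


def n_chunks(shape, chunks):
    """Number of chunks in a regularly chunked array."""
    return prod(ceil(s / c) for s, c in zip(shape, chunks))


# Hypothetical (time, space) array.
shape = (1000, 1000)
read_chunks = (10, 1000)     # consolidated reads: contiguous in space
write_chunks = (1000, 10)    # consolidated writes: contiguous in time

int_chunks = intermediate_chunks(read_chunks, write_chunks)

# Stage 1: each task reads one read chunk and writes the intermediate
# chunks that lie entirely inside it, so no two tasks share a write.
stage1_tasks = n_chunks(shape, read_chunks)
int_per_read = n_chunks(read_chunks, int_chunks)

# Stage 2: each task reads the intermediate chunks inside one write chunk
# and writes that target chunk exactly once.
stage2_tasks = n_chunks(shape, write_chunks)
int_per_write = n_chunks(write_chunks, int_chunks)
print(int_chunks, stage1_tasks, stage2_tasks)
```

Under these assumptions neither stage needs a write lock, and the total task count is the number of read chunks plus the number of write chunks.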

  12. Usage

  13. Dask Graph
    Read Source + Write Intermediate
    Read Intermediate + Write Target

  14. Roadmap
    https://github.com/pangeo-data/rechunker
    https://rechunker.readthedocs.io/
    • Support other chunked array formats (e.g. TileDB)
    • Support other execution frameworks (e.g. lambda)
    • Command line interface
    • Please get involved!
