Rechunker (ESIP Summer 2020)

Presentation of the rechunker library, given at the 2020 ESIP Summer Meeting.

https://rechunker.readthedocs.io/

Ryan Abernathey

July 22, 2020

Transcript

  1. Rechunker:

    The Missing Link for Cloud Optimized Data. Ryan Abernathey & Tom Augspurger
  2. Open Cloud Architecture

    Analysis Ready Data; Cloud Optimized Formats; Scalable Parallel Computing Frameworks; Data Provider’s $; Data Consumer’s $
  3. Chunked Array Formats

    • Break up large arrays into smaller, more manageable sub-arrays: “chunks”
    • Chunks define a unit for parallelism
    • Play well with distributed computing frameworks (e.g. Dask)
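To make "chunks as a unit for parallelism" concrete, here is a minimal sketch in plain Python. The helper `chunk_slices` and the array shape are hypothetical illustrations, not part of any chunked array format's API.

```python
from itertools import product

def chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk, covering an array of `shape`."""
    per_axis = [
        [slice(start, min(start + size, length)) for start in range(0, length, size)]
        for length, size in zip(shape, chunks)
    ]
    yield from product(*per_axis)

# Hypothetical 2D array of shape (5, 6) stored in chunks of (2, 3).
regions = list(chunk_slices((5, 6), (2, 3)))
print(len(regions))  # 3 chunks along the first axis times 2 along the second
print(regions[-1])   # the trailing chunk is smaller where the shape does not divide evenly
```

Each tuple of slices addresses a region that can be read or written without touching any other chunk, which is what lets a framework hand one chunk to each worker.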
  4. Problem: Chunks not Aligned with Analysis

    Axes: Space, Time. ✅ Calculate global statistics at each point in time. ❌ Calculate timeseries statistics at each point in space.
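The mismatch can be quantified by counting how many chunks a query has to open. The sketch below is a rough illustration; the shape, the chunking (one chunk per time step) and the helper `chunks_touched` are assumptions made for the example.

```python
def chunks_touched(selection, chunks):
    """Count chunks intersecting a selection given as (start, stop) per axis."""
    count = 1
    for (start, stop), size in zip(selection, chunks):
        first = start // size
        last = (stop - 1) // size
        count *= last - first + 1
    return count

# Hypothetical (time, lat, lon) array of shape (1000, 180, 360),
# chunked so that each time step is one chunk.
chunks = (1, 180, 360)

one_map = ((0, 1), (0, 180), (0, 360))          # global field at one time
one_series = ((0, 1000), (90, 91), (180, 181))  # full timeseries at one point

map_cost = chunks_touched(one_map, chunks)        # 1
series_cost = chunks_touched(one_series, chunks)  # 1000
```

With this layout a global statistic at one time reads a single chunk, while a timeseries at a single point has to open every chunk in the dataset.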
  5. Problem: Chunks not Aligned with Analysis

    35 posts later…
  6. Lazy Rechunking (Dask)

    FROM → TO
    • Requires global communication
    • Possibly uses lots of memory
    • Possibly creates a huge number of tasks
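One rough proxy for the communication a direct rechunk implies is the number of (source chunk, target chunk) pairs that share data. The sketch below does not call Dask; it only counts those pairs for a hypothetical array, and the helper `overlap_pairs` is an illustration rather than anything Dask exposes.

```python
def overlap_pairs(shape, source_chunks, target_chunks):
    """Count (source chunk, target chunk) pairs that share at least one element."""
    pairs = 1
    for length, s, t in zip(shape, source_chunks, target_chunks):
        axis_pairs = 0
        for t_start in range(0, length, t):
            t_stop = min(t_start + t, length)
            # Number of source chunks along this axis that this target interval spans.
            axis_pairs += (t_stop - 1) // s - t_start // s + 1
        pairs *= axis_pairs  # chunk grids are separable, so per-axis counts multiply
    return pairs

shape = (1000, 180, 360)        # hypothetical (time, lat, lon)
source_chunks = (1, 180, 360)   # one chunk per time step
target_chunks = (1000, 10, 10)  # full timeseries per small spatial tile

pairs = overlap_pairs(shape, source_chunks, target_chunks)  # 648000
```

Here 1000 source chunks and 648 target chunks produce 648,000 overlapping pairs: every target chunk needs a piece of every source chunk, which is the "global communication" pattern named above.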
  7. Design Principles

    • Respect memory limits. Rechunker’s algorithm guarantees that worker processes will not exceed a user-specified memory threshold.
    • Minimize the number of required tasks. Specifically, for N source chunks and M target chunks, the number of tasks is always less than N + M.
    • Be embarrassingly parallel. The task graph should be as simple as possible, to make it easy to execute using different task scheduling frameworks. This also means avoiding write locks, which are complex to manage.
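The first two principles can be checked with simple arithmetic. The following sketch computes N, M and the size of the largest chunk for a hypothetical array; the shape, the 8-byte item size, the memory threshold and the helper names are all assumptions made for illustration.

```python
from math import ceil, prod

def n_chunks(shape, chunks):
    """Total number of chunks needed to cover an array of `shape`."""
    return prod(ceil(length / size) for length, size in zip(shape, chunks))

def chunk_nbytes(chunks, itemsize):
    """Bytes held in memory by one full chunk."""
    return prod(chunks) * itemsize

shape = (1000, 180, 360)
source_chunks = (1, 180, 360)
target_chunks = (1000, 10, 10)
itemsize = 8           # assuming float64 values
max_mem = 1_000_000    # hypothetical per-worker threshold, in bytes

N = n_chunks(shape, source_chunks)  # 1000
M = n_chunks(shape, target_chunks)  # 648
task_bound = N + M                  # 1648

largest = max(chunk_nbytes(source_chunks, itemsize),
              chunk_nbytes(target_chunks, itemsize))  # 800000
within_budget = largest <= max_mem                    # True
```

A worker must hold at least one whole source or target chunk, so the largest of the two sets a floor on any usable memory threshold.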
  8. Algorithms

    Push (Source → Target)
    • Read each source chunk only once
    • Write source data to many target chunks
    • Problem: need to synchronize writes
    Pull (Source → Target)
    • Write each target only once
    • Read each source file many times
    • Problem: can blow out memory; serial loop over source chunks may be very slow
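The two strategies can be contrasted on a toy 1D array held as lists of lists. This is a minimal sketch under simplifying assumptions (one dimension, uniform source chunk size, in-memory lists standing in for stored chunks); `push` and `pull` are hypothetical helper names.

```python
def split(data, size):
    """Cut a flat list into chunks of `size` elements."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def push(source, target_size):
    """Read each source chunk once and scatter it into the target chunks."""
    n = sum(len(chunk) for chunk in source)
    target = [[None] * min(target_size, n - start) for start in range(0, n, target_size)]
    writes = [0] * len(target)  # how many separate tasks wrote to each target chunk
    offset = 0
    for chunk in source:
        touched = set()
        for value in chunk:
            t, pos = divmod(offset, target_size)
            target[t][pos] = value
            touched.add(t)
            offset += 1
        for t in touched:
            writes[t] += 1
    return target, writes

def pull(source, source_size, target_size):
    """Build each target chunk once, reading every source chunk it overlaps."""
    n = sum(len(chunk) for chunk in source)
    target, reads = [], [0] * len(source)
    for start in range(0, n, target_size):
        stop = min(start + target_size, n)
        needed = range(start // source_size, (stop - 1) // source_size + 1)
        flat = []
        for s in needed:
            reads[s] += 1
            flat.extend(source[s])
        lo = start - needed[0] * source_size
        target.append(flat[lo:lo + (stop - start)])
    return target, reads

source = split(list(range(12)), 4)       # 3 source chunks of 4 elements
pushed, writes = push(source, 6)         # writes == [2, 2]
pulled, reads = pull(source, 4, 6)       # reads == [1, 2, 1]
```

Both produce the same target chunks. In `push`, each target chunk is written by two different source tasks, so concurrent execution would need write synchronization. In `pull`, the middle source chunk is read twice, and each task must hold every overlapping source chunk in memory at once.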
  9. Algorithms

    Source → Intermediate → Target; Push / Pull
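One way to pick an intermediate chunking, sketched below, is the elementwise minimum of the source and target chunk sizes. This is an assumption made for illustration, and it only gives conflict-free stages when the chunk sizes divide each other evenly along every axis, as they do in this hypothetical example.

```python
from math import ceil, prod

def intermediate_chunks(source_chunks, target_chunks):
    """Elementwise minimum of the two chunkings."""
    return tuple(min(s, t) for s, t in zip(source_chunks, target_chunks))

def nests(inner, outer):
    """True if chunks of size `inner` never straddle a boundary of `outer`."""
    return all(o % i == 0 for i, o in zip(inner, outer))

def n_chunks(shape, chunks):
    """Total number of chunks needed to cover an array of `shape`."""
    return prod(ceil(length / size) for length, size in zip(shape, chunks))

shape = (1000, 180, 360)
source_chunks = (1, 180, 360)
target_chunks = (1000, 10, 10)

int_chunks = intermediate_chunks(source_chunks, target_chunks)  # (1, 10, 10)
conflict_free = (nests(int_chunks, source_chunks)
                 and nests(int_chunks, target_chunks))          # True
n_int = n_chunks(shape, int_chunks)                             # 648000
```

Because every intermediate chunk lies inside exactly one source chunk and exactly one target chunk, the first stage can write without locks and the second stage can write each target once. The cost in this example is a very large number of very small intermediate chunks.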
  10. Algorithms

    Source → Intermediate → Target; Push / Pull; Consolidated; single read
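Consolidation can be sketched as reading several source chunks in one task, as many as the memory threshold allows. The version below grows the read only along the first axis, which is a simplification chosen for this example and not necessarily how the library plans its reads; the shape, item size, threshold and helper names are hypothetical.

```python
from math import ceil, prod

def consolidate_reads(shape, source_chunks, target_chunks, itemsize, max_mem):
    """Grow the read chunk along axis 0 toward the target chunk, within max_mem."""
    bytes_per_row = prod(source_chunks[1:]) * itemsize  # bytes per unit of axis 0
    limit = min(target_chunks[0], shape[0])
    k = max(source_chunks[0], min(limit, max_mem // bytes_per_row))
    k -= k % source_chunks[0]  # keep each read aligned to whole source chunks
    return (k,) + tuple(source_chunks[1:])

def n_chunks(shape, chunks):
    """Total number of chunks needed to cover an array of `shape`."""
    return prod(ceil(length / size) for length, size in zip(shape, chunks))

shape = (1000, 180, 360)
source_chunks = (1, 180, 360)
target_chunks = (1000, 10, 10)
itemsize = 8             # assuming float64 values
max_mem = 100_000_000    # hypothetical 100 MB threshold

read_chunks = consolidate_reads(shape, source_chunks, target_chunks, itemsize, max_mem)
int_chunks = tuple(min(r, t) for r, t in zip(read_chunks, target_chunks))

read_tasks = n_chunks(shape, read_chunks)      # 6
write_tasks = n_chunks(shape, target_chunks)   # 648
n_int = n_chunks(shape, int_chunks)            # 3888
```

In this example each read task loads 192 time steps at once, so 1000 source chunks are covered by 6 read tasks and the total of 654 tasks is below N + M = 1648.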
  11. Dask Graph

    Read Source + Write Intermediate; Read Intermediate + Write Target
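The two-stage shape of the graph can be imitated with the standard library alone. The sketch below uses a thread pool in place of Dask and plain dictionaries in place of chunk stores; the chunk sizes (source 4, target 6, intermediate 2, chosen so that it divides both) and the function names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

SRC, INT, TGT = 4, 2, 6  # chunk sizes; INT divides both SRC and TGT

# Dictionaries stand in for chunk stores: key = chunk index, value = chunk data.
source = {i: list(range(i * SRC, (i + 1) * SRC)) for i in range(3)}
intermediate, target = {}, {}

def copy_to_intermediate(s):
    """Stage 1 task: read one source chunk, write the intermediate chunks inside it."""
    chunk = source[s]
    for j in range(0, SRC, INT):
        intermediate[(s * SRC + j) // INT] = chunk[j:j + INT]

def copy_to_target(t):
    """Stage 2 task: read the intermediate chunks inside one target chunk, write it once."""
    first = t * TGT // INT
    parts = [intermediate[k] for k in range(first, first + TGT // INT)]
    target[t] = [value for part in parts for value in part]

with ThreadPoolExecutor() as pool:
    list(pool.map(copy_to_intermediate, source))  # stage 1: one task per source chunk
    list(pool.map(copy_to_target, range(2)))      # stage 2: starts after stage 1 finishes
```

No two tasks within a stage write to the same key, so no locks are needed; the only synchronization point is the barrier between the two stages.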
  12. Roadmap

    https://github.com/pangeo-data/rechunker https://rechunker.readthedocs.io/
    • Support other chunked array formats (e.g. TileDB)
    • Support other execution frameworks (e.g. lambda)
    • Command line interface
    • Please get involved!