Rechunker (ESIP Summer 2020)

Rechunker: The Missing Link for Cloud Optimized Data
Ryan Abernathey & Tom Augspurger

Open Cloud Architecture
• Analysis Ready Data
• Cloud Optimized Formats
• Scalable Parallel Computing Frameworks
• Data Provider’s $
• Data Consumer’s $

Chunked Array Formats
• Break up large arrays into smaller, more-manageable sub-arrays: “chunks”
• Chunks define a unit for parallelism
• Play well with distributed computing frameworks (e.g. Dask)
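To make "a unit for parallelism" concrete, here is a minimal standard-library sketch. The tiny 4 x 6 nested-list "array", the 2 x 3 chunk shape, and the helper names `chunk_slices` and `chunk_sum` are all assumptions chosen for illustration; real chunked formats store each chunk as a separate object or file.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def chunk_slices(shape, chunks):
    """Return one tuple of slices per chunk of a regular chunk grid."""
    per_axis = [
        [slice(start, min(start + c, n)) for start in range(0, n, c)]
        for n, c in zip(shape, chunks)
    ]
    return list(product(*per_axis))


# A small 4 x 6 "array" stored as nested lists (values 0..23).
data = [[row * 6 + col for col in range(6)] for row in range(4)]


def chunk_sum(slices):
    """Sum the values inside a single chunk."""
    rows, cols = slices
    return sum(sum(r[cols]) for r in data[rows])


# Split into 2 x 3 chunks and reduce each chunk independently.
blocks = chunk_slices((4, 6), (2, 3))
with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(chunk_sum, blocks))

# Combining the per-chunk results gives the global answer.
total = sum(partial_sums)
```

Each chunk is processed without reference to any other chunk, which is the property that lets a framework hand chunks to separate workers.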

Problem: Chunks not Aligned with Analysis
[Figure: chunked array with Space and Time axes]
✅ Calculate global statistics at each point in time
❌ Calculate timeseries statistics at each point in space
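A rough way to see the mismatch is to count how many chunks each kind of query must read. The sketch below assumes a hypothetical float32 array of shape (time, lat, lon) = (1000, 180, 360) stored with one chunk per time step; the numbers and the helper `chunks_read` are illustrative, not taken from the slides.

```python
from math import prod


def chunks_read(selection, chunks):
    """Count chunks intersecting a selection of half-open (start, stop) ranges."""
    return prod(
        (stop - 1) // c - start // c + 1
        for (start, stop), c in zip(selection, chunks)
    )


ITEMSIZE = 4                 # bytes per value, assuming float32
shape = (1000, 180, 360)     # (time, lat, lon), hypothetical
chunks = (1, 180, 360)       # one chunk per time step

chunk_bytes = prod(chunks) * ITEMSIZE

# A global map at one time step versus a full time series at one grid point.
global_map = [(0, 1), (0, 180), (0, 360)]
time_series = [(0, 1000), (90, 91), (180, 181)]

map_chunks = chunks_read(global_map, chunks)
series_chunks = chunks_read(time_series, chunks)

# Whole chunks must be read even when only one value per chunk is needed.
series_bytes_read = series_chunks * chunk_bytes
series_bytes_needed = shape[0] * ITEMSIZE
```

Under these assumptions the map touches a single chunk, while the time series touches every chunk in the dataset and reads about 259 MB to extract 4 kB of values.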

Problem: Chunks not Aligned with Analysis
35 posts later…

Lazy Rechunking (Dask)
[Figure: FROM chunk layout, TO chunk layout]
• Requires global communication
• Possibly uses lots of memory
• Possibly creates a huge number of tasks

Rechunker
https://github.com/pangeo-data/rechunker
https://rechunker.readthedocs.io/
Tom Augspurger

Design Principles
• Respect memory limits. Rechunker’s algorithm guarantees that worker processes will not exceed a user-specified memory threshold.
• Minimize the number of required tasks. Specifically, for N source chunks and M target chunks, the number of tasks is always less than N + M.
• Be embarrassingly parallel. The task graph should be as simple as possible, to make it easy to execute using different task scheduling frameworks. This also means avoiding write locks, which are complex to manage.
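The two quantitative principles above can be checked with simple arithmetic. This is only a sketch: the array shape, chunk shapes, float32 item size, and 100 MB limit are hypothetical values, and the helpers `count_chunks` and `chunk_nbytes` are not part of Rechunker.

```python
from math import ceil, prod


def count_chunks(shape, chunks):
    """Number of chunks in a regular grid (edge chunks may be partial)."""
    return prod(ceil(n / c) for n, c in zip(shape, chunks))


def chunk_nbytes(chunks, itemsize):
    """Uncompressed size in bytes of one full chunk."""
    return prod(chunks) * itemsize


shape = (1000, 180, 360)         # hypothetical (time, lat, lon) array
source_chunks = (1, 180, 360)    # one chunk per time step
target_chunks = (1000, 18, 36)   # full time series per spatial tile
itemsize = 4                     # assuming float32
max_mem = 100_000_000            # hypothetical per-worker limit in bytes

n_source = count_chunks(shape, source_chunks)   # N
n_target = count_chunks(shape, target_chunks)   # M
task_bound = n_source + n_target                # upper bound quoted above

# The largest single chunk a worker must hold should fit in the limit.
largest_chunk = max(
    chunk_nbytes(source_chunks, itemsize),
    chunk_nbytes(target_chunks, itemsize),
)
fits_in_memory = largest_chunk <= max_mem
```

For these numbers N = 1000 and M = 100, so the stated bound is 1100 tasks, and the largest chunk (about 2.6 MB) is well under the assumed limit.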

Algorithms
Push (Source → Target)
• Read each source chunk only once
• Write source data to many target chunks
• Problem: need to synchronize writes
Pull (Source → Target)
• Write each target only once
• Read each source file many times
• Problem: can blow out memory; serial loop over source chunks may be very slow
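The cost of each strategy depends on how many chunks of the other layout a single chunk overlaps. The sketch below computes that fan-out for regular chunk grids; the shapes are hypothetical and the helper names `chunks_touched` and `fan` are introduced here for illustration only.

```python
from math import prod


def chunks_touched(start, stop, chunk_len):
    """Number of chunks of length chunk_len intersecting [start, stop)."""
    return (stop - 1) // chunk_len - start // chunk_len + 1


def fan(chunk_index, own_chunks, other_chunks, shape):
    """How many chunks of the other grid overlap one chunk of this grid."""
    counts = []
    for i, c, o, n in zip(chunk_index, own_chunks, other_chunks, shape):
        start, stop = i * c, min((i + 1) * c, n)
        counts.append(chunks_touched(start, stop, o))
    return prod(counts)


shape = (1000, 180, 360)   # hypothetical (time, lat, lon) array
source = (1, 180, 360)     # one chunk per time step
target = (1000, 18, 36)    # full time series per spatial tile

# Push: one source chunk is scattered into this many target chunks,
# so many writers touch the same target chunk.
push_fan_out = fan((0, 0, 0), source, target, shape)

# Pull: one target chunk must gather data from this many source chunks,
# each of which is read in full.
pull_fan_in = fan((0, 0, 0), target, source, shape)
```

With these shapes every source chunk contributes to 100 target chunks (the synchronization problem), and every target chunk depends on all 1000 source chunks (the memory and repeated-read problem).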

Algorithms
Push / Pull: Source → Intermediate → Target
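One simple way to pick an intermediate chunk shape, assumed here for illustration, is the element-wise minimum of the source and target chunk shapes. When chunk lengths divide evenly, as in this hypothetical example, every intermediate chunk sits inside exactly one source chunk and exactly one target chunk, so neither stage needs write locks. The helper names are not Rechunker's API.

```python
from math import ceil, prod


def intermediate_chunks(source_chunks, target_chunks):
    """Element-wise minimum of two chunk shapes."""
    return tuple(min(s, t) for s, t in zip(source_chunks, target_chunks))


def nests_in(inner, outer):
    """True if each inner chunk lies inside one outer chunk (even division)."""
    return all(o % i == 0 for i, o in zip(inner, outer))


def count_chunks(shape, chunks):
    """Number of chunks in a regular grid."""
    return prod(ceil(n / c) for n, c in zip(shape, chunks))


shape = (1000, 180, 360)   # hypothetical (time, lat, lon) array
source = (1, 180, 360)
target = (1000, 18, 36)

inter = intermediate_chunks(source, target)

# Stage 1: each source chunk writes only whole intermediate chunks.
stage1_conflict_free = nests_in(inter, source)
# Stage 2: each target chunk is assembled from whole intermediate chunks.
stage2_conflict_free = nests_in(inter, target)

n_intermediate = count_chunks(shape, inter)
```

The price of this naive choice is a very large number of small intermediate chunks (100,000 here), which is the motivation for reading several source chunks at once before writing the intermediate.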

Algorithms
Push / Pull Consolidated: Source → Intermediate → Target
[Figure label: single read]
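Consolidation can be sketched as growing the read shape toward the target chunk shape, in whole multiples of the source chunk, until a memory limit is reached. This is a simplified illustration under assumed numbers (float32 data, a 100 MB limit), not Rechunker's actual implementation; `consolidate_read_chunks` is a hypothetical helper.

```python
from math import ceil, prod


def consolidate_read_chunks(source_chunks, target_chunks, itemsize, max_mem):
    """Grow source chunks toward target chunks without exceeding max_mem bytes."""
    read = list(source_chunks)
    for axis, (s, t) in enumerate(zip(source_chunks, target_chunks)):
        if t <= s:
            continue  # nothing to gain by reading more along this axis
        others = prod(read) // read[axis]
        # Longest extent along this axis that still fits in memory.
        max_len = max_mem // (others * itemsize)
        # Round down to a whole number of source chunks, at least one.
        k = min(t, max_len) // s
        read[axis] = max(k, 1) * s
    return tuple(read)


shape = (1000, 180, 360)   # hypothetical (time, lat, lon) array
source = (1, 180, 360)
target = (1000, 18, 36)
itemsize = 4               # assuming float32
max_mem = 100_000_000      # hypothetical per-worker limit in bytes

read = consolidate_read_chunks(source, target, itemsize, max_mem)
read_bytes = prod(read) * itemsize

# Intermediate chunks taken as the element-wise minimum of read and target.
inter = tuple(min(r, t) for r, t in zip(read, target))
n_intermediate = prod(ceil(n / c) for n, c in zip(shape, inter))
```

With these assumptions a worker reads 385 time steps at once, and the intermediate store holds 300 chunks instead of the 100,000 that an unconsolidated one-step-at-a-time read would produce.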

Usage

Dask Graph
• Read Source + Write Intermediate
• Read Intermediate + Write Target

Roadmap
https://github.com/pangeo-data/rechunker
https://rechunker.readthedocs.io/
• Support other chunked array formats (e.g. TileDB)
• Support other execution frameworks (e.g. lambda)
• Command line interface
• Please get involved!
