Slide 1

Slide 1 text

Rechunking

Ryan Abernathey

Slide 2

Slide 2 text

The Problem: Rechunking

[Figure: FROM and TO chunk layouts]

• Source data is often collected contiguously in one or more dimensions (e.g. a satellite image) and chunked in another (e.g. time)
• But then we want to analyze it along the time axis
• This requires rechunking
• dask.array.rechunk often fails
  • Huge graphs
  • Workers running out of memory

Slide 4

Slide 4 text

Can we do better?

Slide 5

Slide 5 text

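2D Case

[Figure: Source and Target arrays, each of shape $N_x \times N_t$. Source chunks have shape $C^0_x \times C^0_t$; target chunks have shape $C^1_x \times C^1_t$.]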

Slide 6

Slide 6 text

2D Case

[Figure labels: $C^1_x$, $C^1_t$, $C^0_x$, $C^0_t$]

Array Shape Conserved
  $N_t = \text{const}, \quad N_x = \text{const}$

Number of Chunks
  $N_t = n_t C_t, \quad N_x = n_x C_x$

Chunk Size
  $S_c = C_t C_x$

Chunk-Size-Preserving Operations
  $C^0_t C^0_x = C^1_t C^1_x$

Full-Shuffle Rechunk
  $C^0_x = N_x, \quad C^1_t = N_t$
  $C^1_x = C^0_t N_x / N_t$
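As a worked instance of the full-shuffle relation, the sketch below computes the target chunk shape from the array shape and the source chunk length in time. The function name and the sizes are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def full_shuffle_target_chunks(n_x, n_t, c0_t):
    """Target chunk shape (C1_x, C1_t) for a chunk-size-preserving full shuffle.

    Source chunks are contiguous in x: (C0_x, C0_t) = (n_x, c0_t).
    Target chunks are contiguous in t: C1_t = n_t.
    Preserving the chunk size requires C0_x * C0_t == C1_x * C1_t,
    which gives C1_x = C0_t * N_x / N_t.
    """
    c1_x, remainder = divmod(c0_t * n_x, n_t)
    if remainder or c1_x == 0:
        raise ValueError("no integer C1_x preserves the chunk size exactly")
    return c1_x, n_t


# Hypothetical sizes: 1000 points in x, 400 time steps, 4 time steps per source chunk.
n_x, n_t, c0_t = 1000, 400, 4
c1_x, c1_t = full_shuffle_target_chunks(n_x, n_t, c0_t)
print(c1_x, c1_t)  # 10 400
```

Both chunk shapes hold 4000 elements here: 1000 × 4 for the source and 10 × 400 for the target.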

Slide 7

Slide 7 text

Algorithms: Push

[Figure: Source → Target]

• Read each source chunk only once
• Write source data to many target chunks
• Problem: need to synchronize writes
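To make the push trade-off concrete, here is a minimal sketch that only counts reads and partial writes on a regular 2D chunk grid; it moves no data. The helper names and the tiny array shape are assumptions for illustration, not part of any library.

```python
from itertools import product


def chunk_ranges(shape, chunks):
    """Return one tuple of (start, stop) ranges per chunk of a regular grid."""
    per_dim = [
        [(start, min(start + c, n)) for start in range(0, n, c)]
        for n, c in zip(shape, chunks)
    ]
    return list(product(*per_dim))


def overlaps(a, b):
    """True if two chunks, given as (start, stop) ranges per dimension, intersect."""
    return all(a0 < b1 and b0 < a1 for (a0, a1), (b0, b1) in zip(a, b))


def push_plan(shape, source_chunks, target_chunks):
    """For each source chunk, list the target chunks it has to write into."""
    targets = chunk_ranges(shape, target_chunks)
    return {
        src: [tgt for tgt in targets if overlaps(src, tgt)]
        for src in chunk_ranges(shape, source_chunks)
    }


# Hypothetical (Nx, Nt) = (8, 6): source contiguous in x, target contiguous in t.
plan = push_plan((8, 6), source_chunks=(8, 2), target_chunks=(2, 6))
reads = len(plan)                                       # one read per source chunk
writes = sum(len(tgts) for tgts in plan.values())       # partial writes to targets
print(reads, writes)  # 3 12
```

Each of the 4 target chunks receives a partial write from all 3 source chunks, which is why concurrent writers would need to be synchronized.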

Slide 8

Slide 8 text

Algorithms: Pull

[Figure: Source → Target]

• Write each target only once
• Read each source file many times
• Problem: can blow out memory; the serial loop over source chunks may be very slow
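The pull strategy can be sketched the same way by counting how often each source chunk is read when every target chunk is assembled independently. The function names and sizes below are hypothetical, and the sketch counts reads only rather than copying data.

```python
from collections import Counter
from itertools import product


def chunk_ranges(shape, chunks):
    """Return one tuple of (start, stop) ranges per chunk of a regular grid."""
    per_dim = [
        [(start, min(start + c, n)) for start in range(0, n, c)]
        for n, c in zip(shape, chunks)
    ]
    return list(product(*per_dim))


def pull_read_counts(shape, source_chunks, target_chunks):
    """Count reads of each source chunk when targets are built one at a time."""
    reads = Counter()
    sources = chunk_ranges(shape, source_chunks)
    for tgt in chunk_ranges(shape, target_chunks):  # each target is written once
        for src in sources:                         # serial loop over source chunks
            if all(s0 < t1 and t0 < s1 for (s0, s1), (t0, t1) in zip(src, tgt)):
                reads[src] += 1
    return reads


# Hypothetical (Nx, Nt) = (8, 6): source contiguous in x, target contiguous in t.
counts = pull_read_counts((8, 6), source_chunks=(8, 2), target_chunks=(2, 6))
print(len(counts), sorted(set(counts.values())))  # 3 [4]
```

In this toy case every source chunk is read 4 times, once per target chunk, and each read loads a 16-element chunk to use only 4 of its elements.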

Slide 9

Slide 9 text

Algorithms: Push / Pull

[Figure: Source → Intermediate → Target]

Slide 10

Slide 10 text

Consolidating Reads

[Figure: Source → Intermediate → Target; label: single read]

• Fewer Read Tasks
• Fewer Intermediate Chunks

Can consolidate source chunks up to the size of the target chunks.
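One way to picture read consolidation is a routine that grows each source chunk by whole multiples until it hits a memory budget or a per-axis limit. The sketch below is a simplified stand-in written for this purpose, not the routine used in the talk: the name `consolidate_chunks_sketch`, the element-count budget, and the choice of limit (the larger of source and target chunk per axis) are all assumptions.

```python
from math import prod


def consolidate_chunks_sketch(shape, chunks, max_elements, limits=None):
    """Grow chunks by whole multiples, last axis first.

    A chunk never exceeds max_elements in total, nor limits[axis] along any axis.
    """
    limits = shape if limits is None else limits
    new = list(chunks)
    for axis in reversed(range(len(shape))):
        others = prod(new) // new[axis]  # elements per unit length along this axis
        cap = min(limits[axis], shape[axis], max_elements // others)
        multiple = max(cap // chunks[axis], 1)  # never shrink below the original
        new[axis] = multiple * chunks[axis]
    return tuple(new)


def n_chunks(shape, chunks):
    """Number of chunks (and hence read tasks) on a regular grid."""
    return prod(-(-n // c) for n, c in zip(shape, chunks))


# Hypothetical sizes: (Nx, Nt) = (8, 6), one time step per source chunk.
shape, source_chunks, target_chunks = (8, 6), (8, 1), (2, 6)
limits = tuple(max(s, t) for s, t in zip(source_chunks, target_chunks))
read_chunks = consolidate_chunks_sketch(shape, source_chunks, 24, limits)
print(read_chunks)                                                  # (8, 3)
print(n_chunks(shape, source_chunks), n_chunks(shape, read_chunks))  # 6 2
```

With a budget of 24 elements, three source chunks are merged into each read, so the number of read tasks drops from 6 to 2.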

Slide 11

Slide 11 text

Consolidating Writes

[Figure: Source → Intermediate → Target; labels: single write, single read]

• Fewer Write Tasks

Slide 12

Slide 12 text

Push-Pull-Consolidated Algorithm

[Figure: Source → Intermediate → Target; labels: single write, single read, single read]

Slide 13

Slide 13 text

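Push-Pull-Consolidated

# Read side: merge source chunks, but never beyond the target chunk size.
if consolidate_reads:
    read_chunks = consolidate_chunks(
        source_chunks, mem_limit, constraint=target_chunks)
else:
    read_chunks = source_chunks

# Write side: merge target chunks up to the memory limit.
if consolidate_writes:
    write_chunks = consolidate_chunks(target_chunks, mem_limit)
else:
    write_chunks = target_chunks

# Intermediate chunks: the minimum of read and write chunks in each dimension.
# Computed last because it needs both read_chunks and write_chunks.
intermediate_chunks = tuple(
    min(r, w) for r, w in zip(read_chunks, write_chunks))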
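The minimum used for the intermediate chunks is meant per dimension. A short sketch, with hypothetical chunk shapes, shows why this matters in Python: the built-in `min` applied to two tuples compares them lexicographically and returns one of them whole.

```python
def intermediate_chunks(read_chunks, write_chunks):
    """Per-dimension minimum of the read and write chunk shapes."""
    return tuple(min(r, w) for r, w in zip(read_chunks, write_chunks))


read_chunks = (8, 3)   # hypothetical: consolidated source chunks
write_chunks = (2, 6)  # hypothetical: target chunks

int_chunks = intermediate_chunks(read_chunks, write_chunks)
print(int_chunks)                      # (2, 3)
print(min(read_chunks, write_chunks))  # (2, 6): lexicographic, not per dimension
```

When the sizes nest evenly, as they do here, each intermediate chunk lies inside exactly one read chunk and one write chunk, so no intermediate chunk is written by more than one task.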

Slide 17

Slide 17 text

[Figure: Source → Intermediate → Target; labels: single write, single read, single read]

for each dim:
    assert read_chunk >= source_chunk
    assert write_chunk >= target_chunk
    assert int_chunk == min(read_chunk, write_chunk)
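These per-dimension invariants translate directly into a small validation helper. The function name `check_plan` and the example chunk shapes are hypothetical; the three assertions are the ones listed above.

```python
def check_plan(source_chunks, target_chunks, read_chunks, int_chunks, write_chunks):
    """Assert the per-dimension invariants of a push-pull-consolidated plan."""
    dims = zip(source_chunks, target_chunks, read_chunks, int_chunks, write_chunks)
    for dim, (sc, tc, rc, ic, wc) in enumerate(dims):
        assert rc >= sc, f"dim {dim}: read chunk smaller than source chunk"
        assert wc >= tc, f"dim {dim}: write chunk smaller than target chunk"
        assert ic == min(rc, wc), f"dim {dim}: intermediate chunk is not the minimum"
    return True


# Hypothetical plan for a (8, 6) array.
ok = check_plan(
    source_chunks=(8, 1),
    target_chunks=(2, 6),
    read_chunks=(8, 3),
    int_chunks=(2, 3),
    write_chunks=(2, 6),
)
print(ok)  # True
```

A plan whose read chunks are smaller than the source chunks, for example `read_chunks=(4, 1)` above, fails the first assertion.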
