
Rechunking

Ryan Abernathey

June 04, 2020

Transcript

  1. Rechunking Ryan Abernathey

  2. The Problem: Rechunking (FROM one chunk layout TO another)

    • Source data is often collected contiguously in 1+ dimensions (e.g. a satellite image) and chunked in another (e.g. time)
    • But then we want to analyze it along the time axis
    • This requires rechunking
    • dask.array.rechunk often fails
    • Huge graphs
    • Workers running out of memory
  4. Can we do better?

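  5. 2D Case

    Figure: Source and Target arrays, each of shape $N_x \times N_t$. Source chunks are $C^0_x \times C^0_t$; target chunks are $C^1_x \times C^1_t$.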
  6. 2D Case

    • Array shape conserved: $N_t = \text{const}$, $N_x = \text{const}$
    • Number of chunks: $N_t = n_t C_t$, $N_x = n_x C_x$
    • Chunk size: $S_c = C_t C_x$
    • Chunk-size-preserving operations: $C^0_t C^0_x = C^1_t C^1_x$
    • Full-shuffle rechunk: $C^0_x = N_x$, $C^1_t = N_t$, $C^1_x = C^0_t N_x / N_t$
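The full-shuffle relation on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names and the example sizes (1000 × 400, 4 time steps per source chunk) are assumptions, not values from the talk.

```python
from math import ceil


def full_shuffle_target_chunks(n_x, n_t, c0_t):
    """Return (C1_x, C1_t) for a chunk-size-preserving full shuffle.

    Source chunks span all of x (C0_x = n_x) and c0_t steps in t;
    target chunks span all of t (C1_t = n_t).
    """
    c0_x = n_x
    chunk_size = c0_x * c0_t      # S_c = C_t * C_x
    c1_t = n_t
    if chunk_size % c1_t:
        raise ValueError("chunk size is not divisible by n_t")
    c1_x = chunk_size // c1_t     # C1_x = C0_t * N_x / N_t
    return c1_x, c1_t


def chunk_counts(n_x, n_t, c_x, c_t):
    """Number of chunks along x and t (n_x and n_t on the slide)."""
    return ceil(n_x / c_x), ceil(n_t / c_t)


# Hypothetical sizes, chosen so that everything divides evenly.
c1_x, c1_t = full_shuffle_target_chunks(n_x=1000, n_t=400, c0_t=4)
source_counts = chunk_counts(1000, 400, 1000, 4)
target_counts = chunk_counts(1000, 400, c1_x, c1_t)
print((c1_x, c1_t), source_counts, target_counts)
```

With these sizes the chunk size stays at 4000 elements while the layout flips from 100 chunks along t to 100 chunks along x.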
  7. Algorithms: Push (Source → Target)

    • Read each source chunk only once
    • Write source data to many target chunks
    • Problem: need to synchronize writes
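A minimal serial sketch of the push strategy, assuming a 2D array held as plain Python lists in dict-based chunk stores. The store layout and the names `make_source` and `push_rechunk` are hypothetical and only serve to make the read and write counts visible.

```python
def make_source(n_x, n_t, c0_t):
    """Source store: chunk j holds all n_x rows for columns [j*c0_t, (j+1)*c0_t)."""
    return {
        j: [[x * n_t + t for t in range(j * c0_t, (j + 1) * c0_t)]
            for x in range(n_x)]
        for j in range(n_t // c0_t)
    }


def push_rechunk(source, n_x, n_t, c0_t, c1_x):
    """Read every source chunk once and scatter its rows into the target chunks."""
    # Target store: chunk i holds rows [i*c1_x, (i+1)*c1_x) for all n_t columns.
    target = {i: [[None] * n_t for _ in range(c1_x)] for i in range(n_x // c1_x)}
    reads = partial_writes = 0
    for j, block in source.items():
        reads += 1                      # each source chunk is read exactly once
        for i, tgt in target.items():
            for r in range(c1_x):
                tgt[r][j * c0_t:(j + 1) * c0_t] = block[i * c1_x + r]
            partial_writes += 1         # one partial write per (source, target) pair
    return target, reads, partial_writes


n_x, n_t, c0_t, c1_x = 4, 6, 2, 2
source = make_source(n_x, n_t, c0_t)
target, reads, partial_writes = push_rechunk(source, n_x, n_t, c0_t, c1_x)
print(reads, partial_writes)
```

Every target chunk receives a piece from every source chunk. In a parallel setting those partial writes would come from different tasks touching the same target chunk, which is where the synchronization problem comes from.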
  8. Algorithms: Pull (Source → Target)

    • Write each target only once
    • Read each source file many times
    • Problem: can blow out memory, serial loop over source chunks may be very slow
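The pull strategy can be sketched the same way, again assuming a toy 2D array in dict-based chunk stores. `make_source` and `pull_rechunk` are hypothetical names, not functions from the talk or from any library.

```python
def make_source(n_x, n_t, c0_t):
    """Source store: chunk j holds all n_x rows for columns [j*c0_t, (j+1)*c0_t)."""
    return {
        j: [[x * n_t + t for t in range(j * c0_t, (j + 1) * c0_t)]
            for x in range(n_x)]
        for j in range(n_t // c0_t)
    }


def pull_rechunk(source, n_x, n_t, c0_t, c1_x):
    """Build each target chunk in full, then write it exactly once."""
    target = {}
    reads = writes = 0
    for i in range(n_x // c1_x):
        rows = [[] for _ in range(c1_x)]
        for j in range(n_t // c0_t):    # serial loop over source chunks
            block = source[j]           # the whole chunk is loaded to use a slice of it
            reads += 1
            for r in range(c1_x):
                rows[r].extend(block[i * c1_x + r])
        target[i] = rows                # single write per target chunk
        writes += 1
    return target, reads, writes


n_x, n_t, c0_t, c1_x = 4, 6, 2, 2
source = make_source(n_x, n_t, c0_t)
target, reads, writes = pull_rechunk(source, n_x, n_t, c0_t, c1_x)
print(reads, writes)
```

Here the writes need no coordination, but each source chunk is read once per target chunk, so the read count grows as the product of the two chunk counts.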
  9. Algorithms: Push / Pull (Source → Intermediate → Target)

  10. Consolidating Reads (Source → Intermediate → Target; single read)

    • Fewer Read Tasks
    • Fewer Intermediate Chunks
    • Can consolidate source chunks up to size of target chunks.
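One way consolidation could work is to grow each chunk by whole multiples until a memory budget or a per-axis limit is reached. The sketch below is an assumption-laden illustration: the signature (array shape, a budget counted in elements, optional `chunk_limits`) and the choice to grow the last axis first are mine, and it is not the implementation behind the `consolidate_chunks` call shown later in the talk.

```python
from math import prod


def consolidate_chunks(shape, chunks, max_elements, chunk_limits=None):
    """Grow `chunks` by integer factors, last axis first, within an element budget.

    chunk_limits caps the consolidated size per axis (e.g. the target chunks);
    an axis whose chunk already exceeds its limit is left unchanged.
    """
    if chunk_limits is None:
        chunk_limits = shape
    new = list(chunks)
    for axis in reversed(range(len(shape))):
        limit = min(shape[axis], max(chunk_limits[axis], chunks[axis]))
        others = prod(new) // new[axis]   # elements contributed by the other axes
        # Largest multiple of the original chunk that fits the budget and the limit.
        factor = min(max_elements // (others * chunks[axis]),
                     limit // chunks[axis])
        new[axis] = chunks[axis] * max(factor, 1)
    return tuple(new)


# Hypothetical example: 1000 x 400 array, budget of 40,000 elements per chunk.
read_chunks = consolidate_chunks((1000, 400), (1000, 4), 40_000,
                                 chunk_limits=(10, 400))
write_chunks = consolidate_chunks((1000, 400), (10, 400), 40_000)
print(read_chunks, write_chunks)
```

In this example ten source chunks are merged into one read, and ten target chunks into one write, without any chunk exceeding the budget.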
  11. Consolidating Writes (Source → Intermediate → Target; single read, single write)

    • Fewer Write Tasks
  12. Push-Pull-Consolidated Algorithm

    Figure: Source → Intermediate → Target, labeled single read, single read, single write.
  13. Push-Pull-Consolidated

    if consolidate_reads:
        read_chunks = consolidate_chunks(
            source_chunks, mem_limit, constraint=target_chunks)
    else:
        read_chunks = source_chunks

    if consolidate_writes:
        write_chunks = consolidate_chunks(target_chunks, mem_limit)
    else:
        write_chunks = target_chunks

    # minimum taken separately for each dimension
    intermediate_chunks = min(read_chunks, write_chunks)
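A small sketch of the intermediate-chunk step, with the minimum applied per dimension. Python's built-in `min` on two tuples compares them lexicographically, so the sketch spells out the element-wise version. The array shape and the consolidated chunk shapes are hypothetical values picked by hand.

```python
from math import ceil, prod


def intermediate_chunks(read_chunks, write_chunks):
    """Per-dimension minimum of the read and write chunk shapes."""
    return tuple(min(r, w) for r, w in zip(read_chunks, write_chunks))


def n_chunks(shape, chunks):
    """Number of chunks needed to tile `shape`."""
    return prod(ceil(n / c) for n, c in zip(shape, chunks))


shape = (1000, 400)
source_chunks, target_chunks = (1000, 4), (10, 400)

# No consolidation: read/write chunks are just the source/target chunks.
plain = intermediate_chunks(source_chunks, target_chunks)

# Hypothetical consolidated shapes, chosen by hand for illustration.
read_chunks, write_chunks = (1000, 40), (100, 400)
consolidated = intermediate_chunks(read_chunks, write_chunks)

print(plain, n_chunks(shape, plain))
print(consolidated, n_chunks(shape, consolidated))
```

For these numbers the intermediate store shrinks from 10,000 small chunks to 100 larger ones once reads and writes are consolidated.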
  17. Source → Intermediate → Target (single read, single read, single write)

    for each dim:
        assert read_chunk >= source_chunk
        assert write_chunk >= target_chunk
        assert int_chunk == min(read_chunk, write_chunk)
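The per-dimension checks on this slide translate directly into a small validator. The function name `check_plan`, its argument order, and the example chunk shapes are assumptions made for this sketch.

```python
def check_plan(source_chunks, read_chunks, int_chunks, write_chunks, target_chunks):
    """Raise ValueError if a per-dimension invariant of the plan is violated."""
    dims = zip(source_chunks, read_chunks, int_chunks, write_chunks, target_chunks)
    for dim, (src, read, inter, write, tgt) in enumerate(dims):
        if read < src:
            raise ValueError(f"dim {dim}: read chunk {read} < source chunk {src}")
        if write < tgt:
            raise ValueError(f"dim {dim}: write chunk {write} < target chunk {tgt}")
        if inter != min(read, write):
            raise ValueError(
                f"dim {dim}: intermediate chunk {inter} != min({read}, {write})")


# Hypothetical plan for a 1000 x 400 array; passes silently.
check_plan(
    source_chunks=(1000, 4),
    read_chunks=(1000, 40),
    int_chunks=(100, 40),
    write_chunks=(100, 400),
    target_chunks=(10, 400),
)
```

Raising an exception with the offending dimension, rather than a bare `assert`, is a design choice of the sketch so that a bad plan is reported even when Python runs with assertions disabled.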