Algorithm — Rechunker 0.5.4.dev6+g2c197f4 documentation

The algorithm used by rechunker tries to satisfy several constraints simultaneously:

The algorithm we chose emerged from a lively discussion on the Pangeo Discourse Forum. We call it Push / Pull Consolidated.

Algorithm Schematic

Visualization of the Push / Pull Consolidated algorithm for a hypothetical 2D array. Each rectangle represents a single chunk. The dashed boxes indicate consolidated reads / writes.

A rough sketch of the algorithm is as follows:

  1. The user provides a source array with a specific shape, chunk structure, and data type. The user also specifies target_chunks, the desired chunk structure of the output array, and max_mem, the maximum amount of memory each worker is allowed to use.
  2. Determine the largest batch of data that one worker can write given max_mem. These are the write_chunks.
  3. Determine the largest batch of data that one worker can read given max_mem, with the additional constraint of trying to fit within write_chunks if possible. These are the read_chunks.
  4. If write_chunks == read_chunks, we can avoid creating an intermediate dataset and copy the data directly from source to target.
  5. Otherwise, the intermediate chunks are defined as the minimum of write_chunks and read_chunks along each axis. The source is copied first to the intermediate array and then from the intermediate array to the target.
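
The steps above can be illustrated with a small, self-contained Python sketch. It is not rechunker's actual implementation: the helper names consolidate_chunks and sketch_plan are hypothetical, and the consolidation rule used here (grow each axis, last axis first, by whole multiples of the original chunk until max_mem or a cap is reached) is a simplifying assumption. It also assumes that every chunk length is no larger than the array's shape along that axis.

```python
from math import prod


def consolidate_chunks(shape, chunks, itemsize, max_mem, limits=None):
    """Grow `chunks` toward `shape` without exceeding `max_mem` bytes.

    Hypothetical helper: a simplified stand-in for the "largest batch
    one worker can handle" in steps 2 and 3. `limits` optionally caps
    growth along each axis (used to keep reads within write_chunks).
    """
    if limits is None:
        limits = shape
    new = list(chunks)
    if prod(new) * itemsize > max_mem:
        raise ValueError("a single chunk already exceeds max_mem")
    # Visit the last axis first (assumed here for simplicity).
    for axis in reversed(range(len(shape))):
        # Never shrink below the original chunk length.
        cap = max(min(shape[axis], limits[axis]), chunks[axis])
        other = prod(new) // new[axis]           # elements in all other axes
        max_len = max_mem // (other * itemsize)  # longest length that fits
        best = min(max_len, cap)
        if best >= shape[axis]:
            new[axis] = shape[axis]              # the whole axis fits
        else:
            # Largest whole multiple of the original chunk length.
            new[axis] = (best // chunks[axis]) * chunks[axis]
    return tuple(new)


def sketch_plan(shape, source_chunks, target_chunks, itemsize, max_mem):
    """Return (read_chunks, int_chunks, write_chunks) following steps 2-5."""
    # Step 2: largest write batch that fits in memory.
    write_chunks = consolidate_chunks(shape, target_chunks, itemsize, max_mem)
    # Step 3: largest read batch, trying to stay within write_chunks.
    read_chunks = consolidate_chunks(
        shape, source_chunks, itemsize, max_mem, limits=write_chunks
    )
    if read_chunks == write_chunks:
        # Step 4: direct copy, no intermediate array needed.
        int_chunks = None
    else:
        # Step 5: per-axis minimum of read and write chunks.
        int_chunks = tuple(min(r, w) for r, w in zip(read_chunks, write_chunks))
    return read_chunks, int_chunks, write_chunks


# Column-chunked source, row-chunked target, 8-byte items, 8000-byte budget.
plan = sketch_plan((100, 100), (100, 1), (1, 100), itemsize=8, max_mem=8000)
print(plan)
```

In this example the read and write batches differ, so an intermediate array is needed, and its chunks are the per-axis minimum of the two. With a memory budget large enough to hold the whole array, both batches grow to the full shape and the sketch reports no intermediate step.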