AMD FidelityFX™ Parallel Sort (original) (raw)

Algorithm overview

AMD FidelityFX Parallel Sort is an AMD RDNA™-optimized version of the Radix Sort algorithm.

At a high level, the algorithm works by recursing over a data set to be sorted (key or key/value pairs), and re-arranging it in place by 4-bit increments. Each pass guarantees that the data set is fully sorted up to the number of bits processed. For example, after 4 iterations, we are guaranteed that the first 16 bits of the key is properly sorted.

For each iteration that is executed, 5 actions are taken on the data set:

  1. The 4-bit value range we are currently sorting is summed up into buckets from 0-15, so that we know how many of each value occurs throughout the data set.
  2. The number of occurrences go through a reduction phase in order to pre-increment offsets on a thread group basis later on.
  3. The reduced occurrences go through a scan-prefix to calculate offset values for each value group (0-15) on a thread group basis.
  4. The full occurrences buffer then also goes through a scan-prefix, and adds the reduced scan-prefix values to properly index the data across all thread groups.
  5. The data set is read in one more time, and written to its new sorted offset location. If there is also a payload, it is also copied over at this time.

Once all iterations have run (in the case of 32-bit keys, it runs 8 times), the entire data set is sorted.