First 5 minutes of hell

The packing problem (a notorious usage of the PPS algorithm)

  • why?
    • it solves an universal problem of computing dense addresses within a sparse, distributed subset
    • we have a lot of locations where we have some array values or processes and we want to relocate them to some denser, contiguous space while preserving their order, we use the PPS to calculate their offsets over their characteristic vector (a vector of zeros and ones indicating if the value/process is there or not)
  • so each process knows that he belongs to or not (), but it has no idea, what rank it is (globally)
  • treat the flags as binary integers and perform the PPS with operation to the array
    • now every process knows its rank (the denotes the rank of the last member at or before )
  • the packing operation/routine now physically moves the “package” (the process/array value) from the processor to the output port (the index in the output array)
    • the result is a dense array of members, where each member knows it’s rank (where to send packages) and how many processes are before it (in the linear order)
    • the final output port addresses are for four processes (there is minus one, because the port addresses are zero-based)
  • it completes in time (the PPS complexity)

Parallel/distributed RadixSort

  • it is a canonical sorting algorithm on hypercubic networks (it fully utilizes the PPS operation, which is effective on various networks)
    • splits are going from the LSB to the MSB
    • there are two PPS running at the same time
      • one calculates the offset of all numbers with bit 0 in -th position (preserving order)
        • packing upward
      • the second calculates the offset of all numbers with bit 1 in -th position
        • packing downward
    • positions are calculated and a packing routing actually moves the data to their new positions
  • the time complexity: (each PPS has a complexity of and there are Split iterations)

The packing problem

Setup. Consider a distributed system where each processor holds a single bit of a characteristic vector of some subset of the processor set: The bits are arbitrary; some processors belong to , others do not. The problem is that every processor knows only its own local membership flag and has no idea about its position within (how many members precede it).

Goal: each processor should learn its rank within , i.e. its index among the members of in the linear order of processor indices. This is called ranking within the distributed subset .

Solution via PPS. Treat the boolean flags as binary integers and apply a PPS with operation to the array . After the PPS, each holds the value which is exactly the number of members of in the prefix . For each this number is its 1-based rank within ; for non-members it is the rank of the last member at or before .

Packing. Once each member knows its rank , the packing operation routes the packet from the -th processor in to output port . The members of thus end up densely packed into the leading output positions, in their original order, with non-members skipped. After this routing each member knows both how many members are in its prefix and where its data should be sent.

Topological example (slide 27). The slide shows packing on an indirect ordinary butterfly :

  • Initial flags (top to bottom): .
  • After PPS, the array of prefix sums is: .
  • Final output port addresses for members of are the prefix-sum values minus one (zero-based), so the four members are routed to output ports .

This illustration uses an indirect butterfly because PPS on indirect trees and butterflies runs in steps (Lemma 4), so the packing itself completes in logarithmic time.

Why packing matters

Packing has many applications and is one of the canonical building blocks of distributed parallel algorithms. The lecturer described it as the notorious example of PPS usage. The intuition is that packing solves the universal problem of computing dense addresses within a sparse, distributed subset: any time a subset of processes (or array entries) has to be relocated into a contiguous region while preserving their order, the offsets are computed by a PPS over their characteristic vector. Parallel RadixSort, treated next, is the most important sorting algorithm built entirely on packing.

Parallel / distributed RadixSort

Input: an array of numbers, each at most bits wide (binary case; the algorithm generalizes to arbitrary radix bases).

Algorithm RadixSort(A[0..2^n - 1]):
  for i := 0..n-1 do_sequentially
    Split(A, i)

Definition of : a permutation of such that

  • all numbers whose bit is are packed upward, and
  • all numbers whose bit is are packed downward, with stable ordering inside each of the two groups.

Implementation of one Split. Inside , the two sub-packings are completely independent and may be run simultaneously if enough ports are available:

  • One PPS computes, for every position whose bit is , its destination offset in the upper half of the output array.
  • A second PPS does the same for positions whose bit is , computing offsets in the lower half.
  • Two packing routings then permute the data into the new positions.

Thus: Iteration order. The iteration index runs from (least significant bit) to (most significant bit). The number of iterations equals the number of digits in the representation; it is assumed to be a fixed, known constant determined by the bit-width of the numbers.

The lecturer summarized the iteration as a split operator that combines a PPS-based offset computation with a permutation of the numbers. After such splits, the input array is fully sorted.

Parallel time. For numbers on the ordinary butterfly : The reasoning: there are Split iterations, each of which executes PPSs on a butterfly, and each PPS takes steps - hence .

Example (slide 28). Starting from the eight 3-bit numbers the slide traces three Splits, one per bit position (least significant first). After , numbers ending in are at the top and those ending in at the bottom, with stability preserved. After and , the array is sorted:

Why RadixSort fits PPS so well

The lecturer’s framing: RadixSort is based solely on packing. Every Split is just two PPSs followed by two permutations - no comparisons, no merges, no recursion. Because PPS has efficient implementations on every topology of interest (PRAM, hypercube, butterfly, mesh, WH networks), RadixSort inherits all of these implementations. This makes RadixSort one of the cleanest illustrations of how a single primitive (PPS) can power an entire sorting algorithm, and explains why it is the canonical sorting algorithm for hypercubic networks.

Potential exam questions

  1. Define the packing problem precisely. What does each processor hold initially, and what does each processor need to learn?
  2. Show how a single PPS over the characteristic vector of a subset solves the packing problem. What is the binary operation used, and what does ‘s post-PPS value represent?
  3. On the indirect ordinary butterfly with initial flag vector , compute the array of PPS results and the final output port addresses of the members of .
  4. Why is the indirect butterfly (or any indirect binary tree) a convenient topology for packing? What is the parallel time complexity?
  5. State the RadixSort algorithm in pseudocode. What is the input size assumption, and what does one iteration of the main loop do?
  6. Define . Why is it a permutation of , and what stability property does it preserve?
  7. Decompose into its PPS and routing components. Why are two PPSs needed rather than one?
  8. Derive the parallel time complexity of RadixSort on the ordinary butterfly . Why does the bound come out as ?
  9. In which order are the bit positions processed in RadixSort - least-significant first or most-significant first - and why does the chosen order ensure correctness when each Split is stable?
  10. Explain why RadixSort is described as being based “solely on packing”. What primitive does not appear in the algorithm, and what consequence does that have for its implementability on parallel topologies?
  11. Trace one full Split of the input for bit position . Show the resulting array order.
  12. How would RadixSort generalize from binary digits to an arbitrary radix base ? What would change in the per-iteration PPS structure?
  13. Compare RadixSort’s reliance on PPS with QuickSort’s reliance on segmented PPS (covered in Lecture 10’s QuickSort discussion). In what sense are both algorithms “scan-based sorts”?
  14. Why is the packing problem regarded as the canonical exemplary application of PPS in distributed computing? Name two other algorithms that build directly on packing.