First 5 minutes of hell
The parallel prefix sum is an operation, which cumulatively sums the input array using any associative and commutative binary operation. The output then contains all the prefixes of .
The sequential approach is simple, just iterate through the array, keep the local sum and in each loop, write the actual sum to the current -position. Since this is data dependent (the -th operation depends on the th operation), it is not easily parallelizable.
Parallel approach (the EREW part):
- there are three sections:
- initialization (1 thread initializes “it’s own” -th value in the shared array)
- in the loop (looping through variable):
- first reads only: each thread looks on the -th field back (exclusive read)
- all threads look back 1 element, then 2 elements, then 4 elements etc.
- only threads with (their rank) bigger than are active (so no looking out of bounds of the array)
- and adds the value from that element with it’s own element in their local register (no writes)
- second writes only: just write the register value into the value (exclusive writes)
- the magic is that each element to which the thread looks at, already contains the sum of the chunk that is big
- after , the chunk contains = 1 elements
- after , the chunk contains = 2 consecutive elements
- etc.
- it all takes logarithmic time , same as the parallel reduction (despite producing outputs instead of one)
- if there are only processors, the
Parallel approach (indirect tree):
- there are leaves, that contain the input data at the start and then the output data in the end
- the height of the tree is , the whole computation is solved in parallel steps ( steps for a complete tree or butterfly)
- sweep-up phase and sweep-down phase (copying the input from up to both children)
- the values are only in the leaves, because it is a indirect tree (it has only routers/switches on the inner nodes, only leaves are the CPU+memory+router nodes)
Parallel approach (direct tree):
- direct tree has nodes (all compuatation nodes), so the input values are in all nodes
- every nodes contains the initial input value and then the output value at the end
- the tree has to be linearized first (so we exactly know, how are the nodes ordered (to match the array order structure)) - POSTORDER traversal linearization is chosen
- the tree is traversed in the POSTORDER, using the same algorithm as in the indirect tree with these modifications
![]()
- sweep-down operation is the same (just the value is added to the node value)
- sweep-up node sends the sum up and then distributes the 3 = 3 (from the left) and 4 = 1 + 3 (from the left), so the final prefix order is retained
- the array is also computed in operations
Parallel approach (any topology):
- if the topology/graph is connected, we will just run the BFS spanning tree and run the direct tree approach on it
- PPS of input values can be solved on an arbitrary -node network with a constant degree in O(diam(G)) parallel steps (the diameter is the depth of the BFS spanning tree)
More elegant parallel solution for orthogonal topologies:
- hypercube
- the has to be (due to hypercube limited scalability)
- each processor (each vertex) has two registers
- accumulates everything it sees
- accumulates only the prefix (values only from neighbors with a lower index)
- in round each processor exchanges data with it’s neighbor on the -dimension (the index differs only by 1 bit in the -th position)
- both processors exchange green register values and add them up
- the yellow value flows only from the node with a lower index to the node with the higher index
- since we are going from (the right-most bit), the first element only sends the value up and doesn’t get anything from other nodes (the correct way, since it is a prefix sum and this node is the first one)
- this is a normal hypercube algorithm
- store-and-forward mesh/tori
- it needs to be linearized (a mapping from the array to the nodes): different linearizations imply different PPS algorithms
- simple example: map array onto a 2D mesh (in the lexicographically row-wise sense)
- wormhole meshes
- distances are not respected, 1D-mesh can simulate any multidimensional scheme (e.g. the indirect tree PPS algorithm), completing in the time
Scalability
- it would not make sense to have one processor per array value, in reality there are less processors, so each processor has to take more array values
- on APRAM model:
- the same as on the PRAM model, but there is
- operations every loop
- each loop is synchronized using barrier (binary tree reduction barrier implementation)
- = threads take the elements which they do sequentially
MPI function
- PPS is implemented in MPI, two versions:
MPI_Scan(standard, inexclusive) process ‘srecvbufcontains the prefix over data in .MPI_Exscan(exclusive) process ‘srecvbufcontains the prefix over data in- look below for the signature and the properties:
- there are also
MPI_Iscanas for ‘immediate/nonblocking’
MPI function: MPI_Scan (and MPI_Exscan)
PPS in MPI exists in two versions with the same signature:
- Standard
MPI_Scan— inclusive: process ‘srecvbufcontains the prefix over data in . - Exclusive
MPI_Exscan— process ‘srecvbufcontains the prefix over data in .
Signature:
MPI_Scan(const void* sendbuf, void* recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)Properties:
- Input determined by
sendbuf,count,datatype; output byrecvbuf,count,datatype. - The predefined binary operations are the same as in
MPI_Reduce(MPI_SUM,MPI_MAX,MPI_LXOR, …). MPI_IN_PLACEcan be used in place ofsendbufexactly as inMPI_Reduce.- User-defined binary operations are supported via
MPI_Op_create. - Apart from blocking scans, nonblocking scans exist (e.g.
MPI_Iscan). - SPPS (segmented PPS) is implementable using a user-defined reduction operation that acts on
(value, segment_number)pairs.
Definition (PPS / scan)
Given:
- An input array over a domain .
- An associative and commutative binary operation in .
Output: an array of all prefixes of : Example: produces .
Exemplary application: the CountingSort algorithm (PPS turns a count array into the offsets at which each value writes itself in the output). More generally, PPS solves any problem that can be expressed via a recurrence relation, because any recursively defined sequence with an associative combiner can be unrolled by scan.
Note on operation requirements: the slides require to be both associative and commutative for PPS. (For pure reduction, when the input array maps to processors in index order, associativity alone suffices; PPS is stated more strictly in Definition 2.)
Sequential algorithm (and why it cannot be data-parallelized naively)
Algorithm PrefixSum(in: X[0..n-1]; out: Y[0..n-1]):
i := 0; sum := X[i]; Y[i] := sum
while i < n-1 do
i := i + 1
sum := sum (+) X[i]
Y[i] := sum
This differs from sequential reduction only by the continuous write on every iteration. The algorithm is inherently sequential: iteration depends on iteration . If were not associative, there would be no chance to parallelize it at all. Because of this, the parallel PPS algorithm has to abandon the sequential viewpoint completely and adopt a different perspective on the problem.
All the topology-specific algorithms below initially assume (one input number per processor); scaling to is treated separately in the Scalability section.
PPS on EREW PRAM
threads compute PPS on the input shared array by in-place rewriting with . Each thread has a private variable .
Alg. PRAM_PPS(in,out: M[0..n-1]):
for all i := 0..n-1 do_in_par
y_i := X[i]; M[i] := y_i
for j := 0..ceil(log n)-1 do_seq
for all i := 2^j..n-1 do_in_par
y_i := M[i - 2^j] (+) M[i]
for all i := 2^j..n-1 do_in_par
M[i] := y_i
Invariant after step : .
In iteration , the operation is applied to pairs whose distance grows exponentially: first adjacent pairs (distance ), then distance , , and so on. Tracing back the components each output is built from confirms that at the end covers exactly the prefix .
The remarkable property: parallel PPS achieves the same complexity as pure reduction, despite computing prefix sums instead of just one. This is possible because the partial reductions superimpose disjointly across positions; no number is touched twice in the same step.
Complexity on EREW PRAM: , identical to parallel reduction.
PPS on APRAM
APRAM is derived from the EREW PRAM algorithm by inserting explicit barrier synchronizations between the strictly separated parallel rounds, because each round rewrites the shared array.
Alg. APRAM_PPS(in,out: M[0..n-1]):
for all i := 0..n-1 do_in_par
y_i := X[i]; M[i] := y_i
Barrier synchronization
for j := 0..ceil(log n)-1 do_seq
for all i := 2^j..n-1 do_in_par
y_i := y_i + M[i - 2^j]
for all i := 2^j..n-1 do_in_par
M[i] := y_i
Barrier synchronization
There are parallel steps, each followed by a barrier synchronization for processes that takes time (in the best implementations). Therefore: The extra factor of compared with the synchronous PRAM bound comes entirely from the barriers.
PPS on indirect trees and bidirectional butterflies
Indirect tree: leaves contain the initial input data and at the end the output data; internal nodes only perform the computation.
Lemma (Lemma 4 in slides): PPS of input values can be solved on any indirect tree with leaves in parallel steps, where is the height of .
The proof is by induction. The algorithm has two waves with three local rules:
- Sweep-up rule (a): an internal node receiving from its left child and from its right child sends upward and passes the left child’s value down into the right subtree.
- Sweep-up rule (b): the three-arity / ternary case (when relevant), receiving and sending upward while sending rightward (with already passed to the right subtree).
- Sweep-down rule (c): a node that has received some value from above duplicates it into both children.
One sweep-up wave initiates up to sweep-down waves.
Corollary 5: For = complete binary tree or butterfly, , so PPS completes in steps.
PPS on direct trees
Direct tree: every node of an -node tree contains an initial input value and an output value at the end.
PPS on a direct tree requires linearization first. With POSTORDER indexing, the previous indirect-tree algorithm applies with a single modification at internal nodes: the internal node’s own value participates in the sums (both when sending its accumulated value upward and when receiving from above). The reason this works is that in POSTORDER an internal node’s own value lies “between” its left and right children in the linear order.
Result: PPS on the array stored POSTORDER in a direct tree takes parallel steps. The proof is again by induction on the tree’s recursive structure.
PPS on an arbitrary topology
The direct-tree algorithm extends to any connected graph by constructing a breadth-first spanning tree and linearizing the graph via POSTORDER indexing of that spanning tree.
Corollary 7: PPS of input values can be solved on any -node bounded-degree network in parallel steps.
For orthogonal topologies, more elegant specialized solutions exist (see the next sections).
PPS on hypercubes (normal hypercube algorithm)
Consider inputs and PPS in lexicographic order on , . Every has two registers: and .
Algorithm Hypercube_PPS(X[0..2^r - 1]):
for all P_i, i := 0..2^r - 1, do_in_parallel
green_i := yellow_i := X[i] // initialization
for j := 0..r-1 do_sequentially
send green_i to P_{i XOR 2^j} // Bitwise XOR
receive newgreen from P_{i XOR 2^j}
green_i := green_i + newgreen
if (i XOR 2^j < i) then
yellow_i := yellow_i + newgreen
Algorithm proceeds across the hypercube dimensions from to . In each step, adjacent processors exchange data; accumulates everything arriving from the neighbor; is updated only when the incoming value comes from a processor with smaller index, tested by .
At the end:
- — each has accumulated into its yellow register exactly the elements belonging to its lexicographic prefix.
- — all green registers hold the global sum, so the algorithm simultaneously performs AAR (All-to-All Reduction) as a by-product.
This is an excellent example of a normal hypercube algorithm and is therefore optimal on all hypercubic networks.
PPS on SF meshes / tori
Multidimensional meshes again require linearization first. The simplest case is the row-wise lexicographic mapping of the input array to a 2-D mesh. PPS proceeds in three phases:
- Horizontal PPS in every row, executed independently and in parallel.
- Vertical PPS in the last (rightmost) column only.
- Horizontal OAB in every row except the first: each rightmost processor broadcasts the value it received in the vertical phase into its entire row, so the local row-prefix sums become global prefix sums.
For a square mesh of side , both the horizontal and vertical PPS phases run in time proportional to , matching the structure of a 1-D mesh PPS of processors.
Different mesh linearizations imply completely different PPS algorithms (e.g., diagonal or Morton-curve linearizations).
PPS on WH meshes in logarithmic time
On a wormhole (WH) mesh, distances need not be respected, so far-apart nodes can communicate efficiently. This allows a 1-D mesh to simulate any multidimensional scheme, in particular the indirect-binary-tree PPS algorithm. The result is a logarithmic-time PPS on a WH 1-D mesh, requiring parallel steps, with the data movements representing left-to-right transfers of left-subtree partial sums and the corresponding broadcasts down into right subtrees.
Scalability of PPS (general case )
Input array is split into subarrays of elements each, one per processor .
Algorithm Scaled_PPS(X[0..n-1]):
for all i := 0..p-1 do_in_parallel // O(q) steps
P_i computes sequential prefix sums S_i = [s_{i,0},...,s_{i,q-1}] of X_i
Define z_i := s_{i,q-1}
All P's perform a PPS on Z = [z_0,...,z_{p-1}] // O(log p) steps
producing [sigma_0,...,sigma_{p-1}]
// sigma_i = total sum of all numbers in processors P_0..P_i
for all i := 0..p-2 do_in_parallel P_i sends sigma_i to P_{i+1}
for all i := 1..p-1 do_in_parallel P_i receives sigma_{i-1} from P_{i-1}
for all i := 1..p-1 do_in_parallel // O(q) steps
P_i adds sigma_{i-1} to every element of S_i
The structure of this scaled algorithm is identical to the row-wise SF 2-D mesh PPS, except that the rows of the mesh correspond to the local memories of each process instead of separate nodes: the per-row horizontal PPS becomes a local sequential PPS, the rightmost-column PPS becomes the global PPS on , and the per-row broadcast becomes a local addition of to all elements of .
Asymptotic time complexity on PRAM, hypercubes, WH meshes/tori, and networks with logarithmic diameter: Or, with constants: Key observations:
- PPS has exactly the same scalability as parallel reduction.
- The scaled PPS is a normal hypercube algorithm, hence optimal on all hypercubic topologies.
MPI function: MPI_Scan (and MPI_Exscan)
PPS in MPI exists in two versions with the same signature:
- Standard
MPI_Scan— inclusive: process ‘srecvbufcontains the prefix over data in . - Exclusive
MPI_Exscan— process ‘srecvbufcontains the prefix over data in .
Signature:
MPI_Scan(const void* sendbuf, void* recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)Properties:
- Input determined by
sendbuf,count,datatype; output byrecvbuf,count,datatype. - The predefined binary operations are the same as in
MPI_Reduce(MPI_SUM,MPI_MAX,MPI_LXOR, …). MPI_IN_PLACEcan be used in place ofsendbufexactly as inMPI_Reduce.- User-defined binary operations are supported via
MPI_Op_create. - Apart from blocking scans, nonblocking scans exist (e.g.
MPI_Iscan). - SPPS (segmented PPS) is implementable using a user-defined reduction operation that acts on
(value, segment_number)pairs.
Summary table of complexities
| Model / topology | Time complexity of PPS |
|---|---|
| EREW PRAM () | |
| APRAM () | |
| Indirect tree / bidirectional butterfly | steps; for CBT/butterfly |
| Direct tree (POSTORDER) | steps |
| Arbitrary bounded-degree -node | |
| Hypercube (lexicographic) | , normal hypercube algorithm |
| SF 2-D mesh (row-wise) | three phases, for square meshes |
| WH 1-D mesh (simulates indirect tree) | |
| Scaled, on PRAM / hypercube / WH mesh/torus |
Potential exam questions
- Define PPS formally. What are the algebraic requirements on , and how do they differ from the requirements for parallel reduction?
- Why can the sequential PPS algorithm not be parallelized by straightforward loop parallelism? What property of is the precondition for any parallel PPS algorithm to exist?
- State and explain the EREW PRAM PPS algorithm. State and prove (or motivate) the invariant after step . What is its parallel time complexity, and how does it compare with parallel reduction?
- Describe the APRAM PPS algorithm. Where do barrier synchronizations have to be inserted, and why? Derive the time complexity .
- State Lemma 4 on PPS in indirect trees. Sketch the sweep-up and sweep-down rules. Why is the height of multiplied by 2?
- Explain how PPS on a direct tree is obtained from PPS on an indirect tree. What role does POSTORDER indexing play, and what is the single modification to the indirect-tree algorithm?
- State Corollary 7 (PPS on an arbitrary topology) and explain the role of the breadth-first spanning tree and POSTORDER linearization.
- Present the hypercube PPS algorithm in lexicographic order. Explain the purpose of the green and yellow registers, the role of the test , and why the algorithm is a “normal hypercube algorithm”. What are the final contents of the two registers in each ?
- Describe the three-phase PPS on a SF 2-D mesh with row-wise lexicographic mapping. Why do different mesh linearizations lead to different algorithms?
- How can a WH 1-D mesh perform PPS in logarithmic time? Which abstract algorithm is being simulated?
- State the scaled PPS algorithm for . Derive the time complexity . Why does it have exactly the same asymptotic scalability as parallel reduction?
- Why is the scaled PPS a normal hypercube algorithm and what does that imply about its optimality on hypercubic networks?
- Give the signature of
MPI_Scan. Explain the difference betweenMPI_ScanandMPI_Exscan. What is the role ofMPI_IN_PLACE? How can user-defined operations be used (e.g. for SPPS)? - Compare the time complexities of PPS on the EREW PRAM and on the APRAM. Where does the extra logarithmic factor come from, and how could it be reduced in practice?
- Argue that PPS has the same scalability functions and as parallel reduction. What does this tell you about the cost-efficiency of PPS?



