First 5 minutes of hell
The input array is arbitrarily segmented into multiple segments and the goal is to calculate the PPS for each segment independently.
Sequential solution is simple, just go one by one and write the running sum, each time you go into a new section, reset the running sum.
Parallel approach:
- we cannot reliably map segments to processors, as the counts and sizes are independent, so one processor could end up with a big segment and others with small ones
- the solution is to bring up a new binary operation with the segmentaion-awareness baked into it (and then do the global PPS as normally)
- see the operation below
- the values are lefthand-protected, so if they begin a new segment, they are not influenced by the running sum from the left
- we know that if the original operation is associative, the special derived operation is also associative
- so this SPPS has all properties of the PPS (it is segment-agnostic, it just uses a different operation) and it can run on all networks (PRAM, APRAM, hypercube, SF mesh/tori, WH mesh, wBF, oBF…)
- scalability is also the same as in the PPS
Implementation in the MPI
- via standard
MPI_ScanandMPI_Exscanfunctions, but we have to register a new operation:
- commit a new MPI-aware datatype: a pair (value, segment_number)
- so we transfer the ”|” into segment numbers, so when those numbers differ, we know that the next number is from a different segment
- register the operation using
MPI_Op_create
- call the
MPI_Scanor some other equivalent with this operation
Definition (segmented PPS)
The input array is arbitrarily separated into disjoint segments the aim is to calculate all prefix sums within these segments in isolation. Crucially:
- The number of segments and the sizes of segments are arbitrary, part of the input.
- There is no relationship whatsoever between the segmentation and the number of processes or threads. These two structures are completely independent.
Example (slide 32). A 4-segment input array under operation becomes, by applying 4 isolated PPSs: Each segment is treated entirely independently of the others - the scan accumulator is conceptually reset at every segment boundary.
Sequential case. Sequentially the algorithm is trivial: walk left to right in linear time and reset the accumulator whenever a segment boundary is crossed. The sequential SPPS is identical to non-segmented PPS except for this small reset rule.
Why parallelization is non-trivial. The naive approach of mapping segments to processors is hopeless: there could be too many or too few segments relative to the processor count, and the size distribution is data-dependent, so load balance is unpredictable - one thread might have nothing to do while another is overloaded. For efficient parallelization, the algorithm must abstract away from the segmentation entirely the parallel framework must be segmentation-agnostic. The only way to do that is to push the awareness of segmentation into the binary operation itself, so that the standard PPS parallel framework can be reused verbatim.
The main idea: a modified binary operation
Given an associative binary operation , define a new operation that operates on operands of two types: plain operands and protected (segment-starting) operands . The bar notation marks the operand as the beginning of a new segment, hence as a value that must be shielded from any influence coming from outside the segment.
The operation is defined by the following table (rows = left operand, columns = right operand):
Reading the table:
- — both plain: just do the usual operation.
- — the right operand starts a new segment, so the left operand is irrelevant the bar is preserved.
- — the left operand started a segment combine with the right, but keep the bar so the result remains protected against any further left-coming influence.
- — the right starts a new segment the left is ignored.
Two intuitive rules captured by the table:
- Protected on the right cannot be changed by anything to its left. Any value with a bar absorbs whatever comes from the left without modification: a segment boundary on the right blocks all left-coming influence.
- Protection is preserved leftward. Once a value has been protected (its left edge marked with a bar), all subsequent left-extensions inherit the bar the protection persists until explicitly broken by another segment start.
Associativity is preserved.
Lemma 9: If is associative, then so is .
Proof sketch (slide 33): just check the individual cases. As one example, The other cases are equally mechanical. Since associativity is the only algebraic property the standard parallel PPS algorithms require, this proves that SPPS reduces to a single standard PPS with as the operation.
From SPPS to a single global PPS
This is the central insight. The parallel array of processors or threads runs the same PPS algorithm as in the non-segmented case - hypercube PPS, indirect-tree PPS, EREW PRAM PPS, scaled PPS, anything - and the only modification is that the binary operation knows about segment boundaries. The segmentation is handled inside the operation, so the parallel structure is segmentation-agnostic and inherits all the efficiency, load balance, and scalability of the standard PPS framework.
The lecturer’s framing: the operation itself takes care of the segmentation, so the array of processors can completely forget about it. Computation that would otherwise have to dispatch differently per segment instead runs a single, uniform parallel pattern, with all the segmentation logic absorbed into pairwise combinations.
Implementation on indirect trees / butterflies
The non-segmented indirect-tree PPS algorithm (Lemma 4: steps on any indirect tree with leaves) applies verbatim to the SPPS, with in place of . The complexity is unchanged because the operation itself does the bookkeeping. For complete binary trees and butterflies this gives steps.
Worked example (slide 34). Input (three segments). Running the indirect-tree PPS over 6 steps with produces the leaf values (re-stated from the lecturer’s expected output). Each leaf hosts the prefix sum within its own segment no information ever crosses a segment boundary, because blocks it.
Implementations on PRAM and other topologies
Because SPPS reduces to a standard PPS with a different binary operation, all the PPS implementations of the parent question apply, with the same time complexities:
- EREW PRAM (): the algorithm
PRAM_PPS(with replaced by ) runs in parallel steps. The invariant becomes: after step , holds the -reduction of the suffix , automatically respecting segment boundaries. - Indirect trees / bidirectional butterflies: steps, for CBTs and butterflies.
- Direct trees with POSTORDER linearization: steps.
- Arbitrary -node bounded-degree network: steps.
- Hypercube in lexicographic order: steps, normal hypercube algorithm. The green and yellow registers carry -typed values, and the dimension-by-dimension exchanges use instead of .
- SF 2-D mesh, WH meshes, scaled PPS for : the same three-phase or scaled structure applies.
Scalability. Scaled SPPS on processors with elements per processor follows the same template as scaled PPS: on PRAM, hypercubes, WH meshes/tori, and networks with logarithmic diameter. It is a normal hypercube algorithm, hence optimal on hypercubic topologies. The only difference from non-segmented PPS is the use of inside the local sequential prefix sums, the global PPS over the per-processor totals, and the broadcast / addition phase - in every phase the operation is segmentation-aware.
The lecturer was explicit: SPPS has exactly the same scalability properties and the same set of optimal topologies as plain PPS.
MPI implementation of SPPS via user-defined reduction
MPI does not provide a built-in segmented scan. However, SPPS is implementable using MPI_Op_create and the standard MPI_Scan / MPI_Exscan calls.
Slight semantic shift. The MPI implementation uses a richer encoding than the bar notation: instead of inserting bars between operands, each value is paired with a segment number. Two values from the same segment combine under two values from different segments combine by simply keeping the right operand. The segmentation is captured by the segment IDs directly: This is semantically slightly richer than the bar formulation, because every segment has a unique identifiable ID. The operation is associative (analogously to Lemma 9) and non-commutative.
Code structure (from slide 39):
typedef struct { double val int sn } SegScanPair
void segScan(SegScanPair *in, SegScanPair *inout,
int *len, MPI_Datatype *dptr) {
SegScanPair c
for (int i = 0 i < *len ++i) {
if (in->sn == inout->sn) c.val = in->val + inout->val
else c.val = inout->val
c.sn = inout->sn
*inout = c
in++ inout++
}
}The integration into MPI follows the same recipe as the user-defined complex-number multiplication from Lecture 10 (slide 7). MPI must be told what the new data type looks like, and the new operator must be registered:
MPI_Datatype type[2] = { MPI_DOUBLE, MPI_INT }
MPI_Aint disp[2]
int blocklen[2] = { 1, 1 }
MPI_Datatype sspair
MPI_Get_address(&a, disp)
MPI_Get_address(&a.sn, disp + 1)
int base = disp[0]
for (int i = 0 i < 2 ++i) disp[i] -= base
MPI_Type_create_struct(2, blocklen, disp, type, &sspair)
MPI_Type_commit(&sspair)
MPI_Op myOp
MPI_Op_create(segScan, 0, &myOp)
MPI_Scan(&a, &answer, 1, sspair, myOp, comm)Two registration steps:
- Commit the data type.
MPI_Type_create_structplusMPI_Type_commitmake the(value, segment_number)pair an MPI-aware datatype. This is required because MPI must know how to communicate and buffer the composite values. - Create the operator.
MPI_Op_createregisters the C functionsegScanas a user-defined MPI operator, which can then be passed toMPI_Scanlike any built-in operator. The second argument (0) declares the operator as non-commutative (the segmentation-aware operator is not commutative).
Once both are registered, MPI_Scan runs the standard MPI scan with the new datatype and operator, and produces a segmented scan as output. Nonblocking variants and MPI_Exscan are equally available.
The lecturer noted that this exactly mirrors the procedure used for the user-defined complex-multiplication reduction (lecture 10, slide 7): commit data type, create operator, call collective. The recipe is identical only the semantics of the operator differs.
Summary of properties
| Property | Value |
|---|---|
| Reduces to | one standard PPS with operation |
| Associativity requirement | associative associative (Lemma 9) |
| Commutativity of | non-commutative (left/right operands play different roles) |
| Time on EREW PRAM () | |
| Time on indirect CBT / butterfly | |
| Time on hypercube () | , normal hypercube algorithm |
| Time on arbitrary bounded-degree | |
| Scaled time, | on PRAM / hypercube / WH mesh-torus |
| MPI support | no built-in implementable via MPI_Op_create over (value, segment_id) pairs and MPI_Scan / MPI_Exscan |
| Optimality | normal hypercube algorithm optimal on hypercubic topologies |
The take-home message: SPPS inherits everything from plain PPS - implementations, complexities, scalability, optimality - by absorbing segmentation into the operation rather than into the algorithmic structure.
Potential exam questions
- Define the segmented PPS problem. What is the relationship between the segmentation of the input array and the number of processors?
- Explain why the naive parallelization approach - assigning whole segments to processors - is unworkable. What property must a good SPPS implementation have, and why?
- State the main idea behind SPPS: how is segmentation pushed into the binary operation, and what does it gain us at the algorithmic level?
- Define the modified binary operation via its table on versus . Explain the semantics of each of the four cases.
- State Lemma 9 and prove that is associative whenever is, by checking representative cases.
- Is commutative? Justify your answer using the table.
- List four implementations of SPPS on different topologies and state their time complexities. Why are they exactly the same as for plain PPS?
- Give the algorithm and time complexity of SPPS on the EREW PRAM. What invariant does satisfy after step ?
- Describe how SPPS runs on a hypercube. What do the green and yellow registers carry, and how does the dimension-by-dimension exchange use ?
- Derive the scaled SPPS complexity on hypercubic networks. On which topologies is this scaled algorithm optimal?
- Explain how SPPS is implemented in MPI. What are the two registration steps required before
MPI_Scancan be called with a user-defined segmented operator? - Write out the encoding of used by the MPI implementation as a function of pairs. Why is this slightly richer than the bar notation, and what does the unique segment ID give you?
- Explain why
MPI_Op_createis called with the commutativity flag set to . What would go wrong if it were set to ? - Trace an SPPS by hand on the input using the indirect-tree algorithm. Verify that the result is and that no information crosses segment boundaries.
- The lecturer claimed SPPS has “exactly the same scalability as PPS”. Justify this claim by comparing the structure of the scaled algorithms in both cases.