First 5 minutes in hell
Problem of the arbitrary and row-wise mapping is that that both approaches still require the whole vector to be loaded in each process. This is solved by cutting the row-wise mapping (which already contributes to parts of vector) into columns ⇒ producing a 2D mesh of processes, each owning elements of and elements of .
- each process touches only a chunk of columns (therefore it only needs a (corresponding) chunk of )
- so the memory shrinks as I add more processes
One iteration:
- diagonal processes broadcast their column chunk along the whole column (“here is the chunk of , what you need”)
- convergence test
- all processes do a local multiplication of their tile with the chunk they got from their same-column-on-diagonal process
- each row reduces (sums) it’s results back to its diagonal process
- diagonal processes compute the norm (since all together, they have all pieces of vector), then they normalize the vector and the cycle restarts
MPI_Allreduce- each diagonal process gets the value- and they will divide their chunk with this to normalize it
- the the normalized block of is again broadcasted along the columns (this is actually the first step)
Three subcommunicators (to perform multiple and disjoint collective operations (like broadcasts) on different sets of processes without interfering)
- vcomm = columns, for the broadcasting
- hcomm = rows, for the sum
- dcomm = the diagonal only, for computing the norm
They are created by the
MPI_Comm_splitConvergence speed is data-dependent, the number of iterations varies per matrix. We want to understand the structure of nonzeros to ensure the uniform load.
Checkerboard mapping wins on every metric: lowest per-process memory, lowest communication latency per iteration. Its only cost is implementation complexity: three sub-communicators must be created and managed, and the diagonal-vs-non-diagonal asymmetry in the inner loop must be handled carefully.
Complexity: Both arrays
xandyare of size . Compared to the other mappings:
- Arbitrary: size , size
- Row-wise: size , size
- Checkerboard: size , size . For large , checkerboard has by far the smallest memory footprint - the only one whose per-process memory shrinks as grows.

Recap: the Power Method and its distributed structure
The Power Method finds the largest eigenvalue (in absolute value) and the corresponding eigenvector of a matrix , satisfying . The eigenvector is typically normalized. The algorithm has six phases:
- Ph.1: take any nonzero initial vector .
- Ph.2: compute (the MVM).
- Ph.3: compute the norm .
- Ph.4: replace with the normalized , i.e., .
- Ph.5: evaluate the convergence criterion.
- Ph.6: if satisfied, terminate; otherwise go to Ph.2. In a distributed implementation, is partitioned among MPI processes as so Ph.2 becomes local MVMs followed by a parallel reduction. The specific structure depends on the matrix mapping. The arbitrary and block row-wise mappings are covered in Exam Question 51. This note covers the third option: checkerboard mapping. The motivation for checkerboard mapping is to reduce both the per-process memory and the per-iteration communication volume. Where arbitrary mapping needs memory for both and , and row-wise needs for and for , checkerboard mapping needs only for both.
Checkerboard mapping: the setup
Processes form a virtual 2-D mesh . Process (with linear rank in MPI_COMM_WORLD) has:
- row index ,
- column index . Process holds a -submatrix of . For the input/output vectors, the natural convention used here is: and are mapped to the diagonal processes (those with ). Each diagonal process holds a block of consecutive elements of the vector. Why this is more efficient:
- Each process in column needs access only to the part of indexed by . The local nonzeros of only reference column indices in this range. Memory complexity: per process.
- After local MVMs, processes in the same row have local contributions to the same block of (corresponding to row indices ). These must be reduced within row .
- The most natural mapping of is back to the diagonal processes (): each row has its reduction root at the diagonal process .
Visual structure of one iteration
The four conceptual stages of one Power Method iteration under checkerboard mapping:
- Initial lives on the diagonal processes only.
- Column OABs: each diagonal broadcasts its block of to all other processes in column .
- Local MVMs: every multiplies its local submatrix by its received subvector.
- Row reductions: each row performs a parallel sum-reduction into the diagonal process , producing the new block there.
- Normalization: diagonal processes cooperate (via a separate reduction over their squared norms) to compute the global norm , then scale their local blocks of .
Sub-communicators: hcomm, dcomm, vcomm
The key MPI technique enabling the checkerboard implementation is the use of sub-communicators - groups of processes that participate in collective operations together, disjoint from other groups. MPI provides MPI_Comm_split to create these. MPI_Comm_split(comm, color, key, &newcomm) decomposes a base communicator into disjoint sub-communicators: every process with the same color value ends up in the same new communicator, and within each new communicator the processes are ranked according to their key values. The checkerboard Power Method needs three sub-communicators per process:
hcomm(horizontal communicator) - groups all processes in the same row . Used for the row-wise reductions of MVM partial results into the diagonal process. Color: .dcomm(diagonal communicator) - groups all diagonal processes () into one communicator, and all non-diagonal processes into another. Used so the diagonal processes can cooperatively compute the global norm of . Color: the boolean expressionI == J(so alltrue-valued processes form one communicator and allfalse-valued ones form another, but only thetrueone is used).vcomm(vertical communicator) - groups all processes in the same column . Used to broadcast the new normalized block from each diagonal process to the rest of column . Color: .
Sub-communicators let MPI perform multiple, disjoint, simultaneous collective operations - all rows reduce in parallel without interfering with each other.
Creating the sub-communicators
int p, r;
MPI_Comm_size(MPI_COMM_WORLD, &p); // total # of processes
MPI_Comm_rank(MPI_COMM_WORLD, &r); // current process rank
int q = 1;
while (q < p) { if (p == q * q) break; q++; } // q = integer sqrt(p)
int I = r / q; // process row index
int J = r % q; // process column index
MPI_Comm hcomm, dcomm, vcomm;
MPI_Comm_split(MPI_COMM_WORLD, I, r, &hcomm); // row reductions
MPI_Comm_split(MPI_COMM_WORLD, I == J, r, &dcomm); // diagonal-only reductions
MPI_Comm_split(MPI_COMM_WORLD, J, r, &vcomm); // column broadcastsWithin each sub-communicator the processes have new ranks, numbered from to :
- In
hcomm, the diagonal process of row has rank (because it has column index in the original 2-D mesh, and within the row the ranking inherits column indices). - In
vcomm, the diagonal process of column has rank . - In
dcomm, the ordering among diagonal processes is by their original rank.
The main iteration loop: structure
For each iteration:
- Local MVM: each process computes its local contribution .
- Row reduction: in each row , all processes reduce their local into the diagonal process via
MPI_Reduceonhcomm. After this, the diagonal processes (and only they) hold the new blocks. - Compute norm on diagonals: each diagonal process computes the local sum of squares of its block. These local partial norms are then reduced (with
MPI_Allreduce) acrossdcommso that every diagonal process ends up with the same global . - Local normalization on diagonals: each diagonal process divides its local block by .
- Column broadcast: each diagonal process broadcasts its normalized block to the rest of column via
MPI_Bcastonvcomm, with the diagonal process at rank withinvcomm. - Convergence test.
Step 2: row reduction (MPI_Reduce on hcomm)
static const long m = n / q; // local block size = n / sqrt(p)
std::vector<double> x(m, 1.0); // Ph.1: initial nonzero vector x
std::vector<double> y(m);
double alpha;
do {
... // Ph.2: local MVM, leaves local contribution in y[]
// Ph.2 (continued): row-wise reduction of local contributions into x[]
// on the diagonal process of row I (which has rank I in hcomm).
MPI_Reduce(&y[0], &x[0], m, MPI_DOUBLE, MPI_SUM, I, hcomm);
// x[] on diagonal processes now contains the new distributed yNote the root parameter is I - the rank of the diagonal process within hcomm. After this call, only the diagonal processes have meaningful data in x[]; the off-diagonal processes participate as contributors but do not receive the result (this is MPI_Reduce, not MPI_Allreduce). The reduction directly writes into x rather than into a separate y buffer - the same trick used in the row-wise mapping (Exam 51) that eliminates the copy step at the end of the iteration.
Step 3: computing the norm (MPI_Allreduce on dcomm)
// Ph.3: norm of vector y - only diagonal processes participate.
alpha = 0.0;
if (I == J) { // only diagonal processes
for (long i = 0; i < m; i++) alpha += x[i] * x[i];
MPI_Allreduce(MPI_IN_PLACE, &alpha, 1, MPI_DOUBLE, MPI_SUM, dcomm);
alpha = sqrt(alpha);
// Ph.4: normalize the local block
for (long i = 0; i < m; i++) x[i] /= alpha;
}The norm is computed in three substeps: local sum of squares, all-reduce of the partial sums across dcomm, then square root. Only diagonal processes participate. After this, every diagonal process holds:
- the same global
alpha(the eigenvalue estimate), - its local block of the normalized eigenvector in
x[]. The non-diagonal processes do not participate in this block - they wait at the next collective.
Step 5: column broadcast (MPI_Bcast on vcomm)
// Ph.4 (continued): broadcast normalized block from diagonal to column.
MPI_Bcast(&x[0], m, MPI_DOUBLE, J, vcomm);
// Now every process in column J has the same x[] - ready for next iter.
} while (...); // Ph.5: convergence testThe root is J - the rank of the diagonal process within vcomm. After this, every process in column holds the same block of , ready for the next iteration’s local MVM. The non-diagonal processes’ previously-stale x[] has been refreshed; the diagonal processes’ already-normalized x[] is unchanged (broadcasting from self).
Complete code
int p, r;
MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_rank(MPI_COMM_WORLD, &r);
int q = 1;
while (q < p) { if (p == q * q) break; q++; }
int I = r / q, J = r % q;
MPI_Comm hcomm, dcomm, vcomm;
MPI_Comm_split(MPI_COMM_WORLD, I, r, &hcomm);
MPI_Comm_split(MPI_COMM_WORLD, I == J, r, &dcomm);
MPI_Comm_split(MPI_COMM_WORLD, J, r, &vcomm);
static const long m = n / q;
std::vector<double> x(m, 1.0), y(m);
double alpha;
do {
... // Ph.2: local MVM
MPI_Reduce(&y[0], &x[0], m, MPI_DOUBLE, MPI_SUM, I, hcomm);
alpha = 0;
if (I == J) {
for (long i = 0; i < m; i++) alpha += x[i] * x[i];
MPI_Allreduce(MPI_IN_PLACE, &alpha, 1, MPI_DOUBLE, MPI_SUM, dcomm);
alpha = sqrt(alpha);
for (long i = 0; i < m; i++) x[i] /= alpha;
}
MPI_Bcast(&x[0], m, MPI_DOUBLE, J, vcomm);
} while (...); // Ph.5: convergence test
// The normalized eigenvector is stored distributively in x[] of diagonal processes.
// Created communicators must be released.
MPI_Comm_free(&hcomm);
MPI_Comm_free(&dcomm);
MPI_Comm_free(&vcomm);Memory complexity per process
Both arrays x and y are of size . Compared to the other mappings:
- Arbitrary: size , size .
- Row-wise: size , size .
- Checkerboard: size , size . For large , checkerboard has by far the smallest memory footprint - the only one whose per-process memory shrinks as grows.
Experimental evaluation (Blue Waters)
Same experimental setup as the previous two mappings: , , 1 MPI process per computing node, MPI+OpenMP hybrid:
- Measured communication latency per iteration: seconds.
- This is faster than arbitrary mapping ( s) and faster than row-wise mapping ( s). For larger (e.g., with ), the gap widens significantly: arbitrary s/iter, row-wise s/iter, checkerboard s/iter.
Summary remarks
- The experiments were done with very large matrices (). If these matrices were dense they would not fit anywhere; the practical relevance comes from sparse matrices, for which SpMVM is a fundamental kernel used by many HPC applications beyond the Power Method.
- Even though the per-iteration differences look modest in absolute terms (seconds), the number of iterations can be hundreds or thousands, so substantial total time is saved.
- The convergence speed of the Power Method is strongly data-dependent. The number of iterations varies with the matrix.
- For real-world sparse matrices, understanding the structure of nonzeros is key to achieving uniform load. Real matrices typically have very irregular structures, and good mapping is a trade-off between balancing local computation and minimizing communication and memory latency.
Checkerboard mapping wins on every metric: lowest per-process memory, lowest communication latency per iteration. Its only cost is implementation complexity: three sub-communicators must be created and managed, and the diagonal-vs-non-diagonal asymmetry in the inner loop must be handled carefully.
Potential exam questions
- Describe the checkerboard mapping of a sparse matrix on MPI processes. How are the row and column indices and computed from the linear rank in
MPI_COMM_WORLD? - What convention is used for mapping the input and output vectors , in the checkerboard Power Method? Why are the diagonal processes () chosen, and how is the memory footprint reduced compared to arbitrary or row-wise mapping?
- Why does each process in column only need access to the subvector ? What about the matrix mapping makes this sufficient?
- Describe the four conceptual stages of one Power Method iteration under checkerboard mapping: column broadcasts, local MVMs, row reductions, normalization.
- What is
MPI_Comm_split, and what are its four parameters? Explain the role of thecolorandkeyparameters in particular. - The checkerboard implementation creates three sub-communicators
hcomm,dcomm,vcomm. State the color used for each and explain what operation each enables. - Write the three
MPI_Comm_splitcalls that createhcomm,dcomm, andvcomm. In which sub-communicator does the diagonal process of row have rank ? In which does it have rank ? - Write the MPI call that performs the row reduction of local MVM contributions into the diagonal process. What is the root parameter, and in which communicator is the call made? Why is the result written into
xrather than into a separate buffer? - Why does only the diagonal processes’
x[]hold meaningful data after theMPI_Reducein row reductions? What does the off-diagonalx[]contain at that point? - Explain how the global norm is computed across the diagonal processes. Which sub-communicator is used and which MPI operation?
- Why is the test
if (I == J)wrapped around the norm computation block? What would go wrong if it were omitted? - Write the
MPI_Bcastcall that propagates the normalized block from each diagonal process to its column. What is the root parameter, and why is itJ? - State the per-process memory complexity of arrays
xandyin the checkerboard implementation. Compare it to the arbitrary and row-wise mappings, and explain why checkerboard is the only one whose memory scales down with . - State the measured latency per iteration of the checkerboard variant on Blue Waters at , . Express the speedup over arbitrary and row-wise mappings.
- Why is it important to call
MPI_Comm_freeonhcomm,dcomm, andvcommat the end of the program? - Compare the three Power Method MPI implementations (arbitrary, row-wise, checkerboard) along three axes: (a) per-process memory, (b) communication volume per iteration, (c) implementation complexity. Which mapping is preferred in practice and under what conditions?