First 5 minutes in hell

I have a grid of processes and each process holds one block of A, one block of B and one block of C. The goal is to compute C by multiplying A and B, block-by-block without needing more memory then it initially started with.

  • it could be scalable, the naive approach needs to load the full A row and full B column to perform the multiplication, which is infeasible for large matrices (to fit the data onto one computing node)
    • I am usually going distributed because I don’t have that much memory in a single node, and the naive approach then defeats it’s purpose
    • here the memory is exactly 3 blocks per process, nothing more

The flow:

  • blocks of A are flowing across rows, blocks of B are flowing across columns and at each iteration, each process multiplies it’s current block A with B and saves the result to C
  • each submatrix of A and submatrix of B visits each process at exactly the right time, when the process needs them
  • no process stores more then initially allocated
  • this flow is called “systolic” (as the submatrices flow through the matrix of processes like a blood in the heart)

Algorithm:

  • starts with checkerboard mapping of A, B, C matrices on a virtual 2D torus - there are cyclic shifts (torus is ideal)
  • two phases:
    • initial skew:
      • rotate all submatrices in row by positions
        • row 0 will not rotate
        • row 1: each matrix will rotate by 1 position
        • row 2: by two positions etc.
      • rotate all submatrices in a column by positions
      • this way, at the start of the second phase, each process will hold a correct pair of - they can be multiplied together
    • iterations:
      • each processor multiplies it’s submatrices A and B and add the result to C
      • rotate all matrices by one position (A’s along rows, B’s along columns)

Correctness:

  • this algorithm guarantees that at every tick, the A and B blocks held by share a middle index (the column index of A matches the row index of B)
    • every value of from to shows up exactly once

Properties:

  • memory-optimal
  • time complexity - - startup latency, communication (shifts) + local multiplications
  • scaling is (which is the same as the naive approach, but the naive approach is memory-inefficient)

Comparison to the Fox’s algorithm:

  • works in the similar sense, needs 4 matrices per process, but the matrix A never moves (only the matrix B), simplifying the cleanup
  • asymptotic time complexity is similar to the Cannon’s

Problem definition

Given two dense -matrices and , compute the product We use the standard “three-loop” definition: an output element is computed as the dot product of row of and column of : Strassen-type recursive divide-and-conquer algorithms are out of scope - only the cubic standard algorithm is considered. The matrices , , are checkerboard-block-mapped on a virtual 2-D mesh , i.e., each process initially holds one -submatrix , one , and produces one .

Motivation: why a naive standard algorithm is not enough

The naive standard MMM algorithm (StandardMMM, the prior algorithm in the same lecture):

  • Ph.1: Each row of processes performs an AAG (all-to-all gather) of its submatrices - so every process in row ends up with the full row of submatrices.
  • Ph.2: Each column of processes performs an AAG of its submatrices - every process in column ends up with the full column of submatrices.
  • Ph.3: Each locally computes . Assuming SF 2-D mesh, its complexity is with optimal scalability . But: it is memory-inefficient. It requires globally times more memory than the sequential algorithm, because each process accumulates a full stripe of and a full stripe of before doing the multiplication. For very large matrices - which is the only reason to use distributed memory in the first place - this is fatal. As grows, the per-process memory exceeds any fixed capacity.

Two reasons exist to go distributed: either the sequential time complexity is unacceptable, or the matrices do not fit in a single memory. The standard algorithm helps with the first but worsens the second. This motivates an algorithm that achieves the same asymptotic time complexity while remaining memory optimal - i.e., requiring per process only the capacity to store one submatrix, one submatrix, and one submatrix. Such algorithms are called systolic.

The systolic principle

A systolic algorithm reshuffles the submatrices of and through the processes in such a way that:

  • At every step, every process holds exactly one submatrix and one submatrix that it can immediately multiply and accumulate into its submatrix.
  • Each submatrix of and each submatrix of visits each process exactly when that process needs it.
  • No process ever needs to store more than its initially allocated capacity (one , one , one submatrix). The data thus “flow” through the system in rhythm, much like blood circulates in a body - hence the name systolic.

Cannon’s algorithm: structure

Starting state: checkerboard mapping of on a virtual 2-D torus (the algorithm naturally requires a torus because of the cyclic shifts; a mesh can simulate this but the torus is the ideal topology). The algorithm has two prologue phases (initial skew) followed by a main loop of iterations. Prologue:

  • Ph.1: For all in parallel, rotate the submatrices of in row by positions leftward (cyclic shift).
  • Ph.2: For all in parallel, rotate the submatrices of in column by positions upward (cyclic shift). After these two rotations, every process holds a pair . These have matching middle index , so they can be multiplied together correctly. Main loop (Ph.3):
  • Repeat times:
    • For all processors in in parallel: multiply the currently held and submatrices and add the result to .
    • For all in parallel: rotate the submatrices of in row by one position leftward.
    • For all in parallel: rotate the submatrices of in column by one position upward. After iterations of the main loop, each process has accumulated all partial products needed for its assigned , and the algorithm is complete.

Cannon’s algorithm is very regular: after the asymmetric prologue, every iteration looks the same - multiply, shift A leftward by 1, shift B upward by 1.

Correctness

The key invariant: after Phase 1 and Phase 2 of the prologue, process holds and - a matching pair with the same middle index. Reason: before Phase 1, held . Rotating row leftward by positions moves to the process . So receives . Symmetrically, before Phase 2, held . Rotating column upward by positions moves to , so receives . The two middle indices match. After each iteration of the main loop, both and submatrices held by have their middle index incremented by - because moves leftward by 1 (column index of submatrix decreases by 1 in the row stream, which means now sees the submatrix that previously had middle index ), and moves upward by 1 (row index of submatrix decreases by 1 in the column stream, which means now sees the submatrix whose middle index is ). So across iterations, the middle index runs through all values exactly once. The local accumulation therefore visits every exactly once, giving the correct mathematical result This is exactly the block formulation of matrix multiplication.

Properties

Time complexity. On an all-port WH hypercube : The terms are:

  • - startup latency: communication steps, each with constant setup.
  • - data transfer: shifts of an -sized block each (the per-step submatrix has elements, summed over steps gives ).
  • - the local arithmetic: submatrix multiplications, each costing , summed gives . Scalability: which is optimal scalability (same as the naive standard MMM). Memory optimality. Within the course of the algorithm, there is no replication or redundancy of data. Every process holds exactly one submatrix, one submatrix, and one submatrix at all times. The total distributed memory used is therefore the size of the input (the three matrices) - the minimum possible. This contrasts sharply with the naive standard algorithm, which would require times more. Regularity. Apart from the asymmetric prologue, the main loop consists of identical iterations: multiply + shift left + shift up. The shifts are unit cyclic shifts in row and column dimensions of the torus, which are the most basic and most efficiently supported communication patterns on 2-D tori. MPI provides direct support via MPI_Sendrecv_replace combined with MPI_Cart_shift on a Cartesian (toroidal) communicator. Ideal topology. Cannon’s algorithm is naturally a 2-D torus algorithm, because the cyclic shifts wrap around. On a mesh (without wraparound) the shifts can be simulated, but the torus is the ideal topology.

Comparison with Fox’s algorithm (Broadcast-Multiply-Roll)

Fox’s algorithm (sometimes called BMR) is an alternative systolic-style algorithm. Its key differences:

  • It does not perform the initial skewing of and . The initial checkerboard mapping is kept as is.
  • In iteration , the diagonal submatrices are broadcast within each row.
  • Each process then multiplies the received with its local , adds to , and then rotates upward by 1 in its column. This relaxes memory optimality: each process needs space for 4 submatrices (its local , an incoming broadcast buffer, its rotating , and its accumulating ) instead of 3. The advantage is simpler restoration of the original mapping at the end, because the submatrices never move - they are only copied via broadcast. The asymptotic time complexity and scalability are similar to Cannon’s.

Cannon vs Fox: Cannon is memory-optimal (3 submatrices per process) but more involved to restore to the original mapping. Fox needs 4 submatrices per process but the matrix never moves, simplifying cleanup.

Potential exam questions

  • Define the dense MMM problem and state which definition of matrix multiplication is used (the standard cubic one, not Strassen).
  • Why is the naive StandardMMM (AAG of rows of and columns of , then local product) unsuitable for very large matrices? State the memory overhead it imposes in terms of .
  • Define what a systolic algorithm is. What invariant about per-process memory must such an algorithm preserve?
  • Describe the initial mapping for Cannon’s algorithm. Why is the natural topology a 2-D torus rather than a mesh ?
  • Write out the four phases of Cannon’s algorithm: the two prologue rotations and the main loop. State exactly how many times each iteration is repeated, and what its three steps are.
  • After the prologue (Ph.1 + Ph.2) of Cannon’s algorithm, what pair of submatrices does process hold? Justify why this pair has the correct matching middle index for matrix multiplication.
  • Prove the correctness of Cannon’s algorithm: show that after iterations of the main loop, process has accumulated the correct value .
  • State the time complexity of Cannon’s algorithm on an all-port WH hypercube . Identify and explain the three terms (startup latency, data transfer, local arithmetic).
  • What is the asymptotic scalability function of Cannon’s algorithm? How does it compare to the naive StandardMMM?
  • State and justify the memory optimality property of Cannon’s algorithm. How much per-process memory does the algorithm require, and how does this compare to the naive StandardMMM and to Fox’s algorithm?
  • Why is Cannon’s algorithm called regular? Identify the structural difference between the prologue and the main loop, and explain why this matters for implementation.
  • Compare Cannon’s algorithm to Fox’s Broadcast-Multiply-Roll algorithm. State at least three differences: per-process memory requirement, structure of the initial mapping, and the cost of restoring the original mapping at the end.
  • Explain the relationship between Cannon’s algorithm and the cyclic shift permutation on a 2-D torus. Which MPI functions naturally support the communication primitives the algorithm needs?