PDP - Cannon algorithm of distributed multiplication of dense matrices, correctness, properties

First 5 minutes in hell

I have a $p \times p$ grid of processes and each process holds one block of A, one block of B and one block of C. The goal is to compute C by multiplying A and B block by block, without any process needing more memory than it started with.

the naive approach scales fine in time, but each process needs to load a full block row of A and a full block column of B to perform the multiplication, which is infeasible for large matrices (the data does not fit onto one computing node)

I usually go distributed because a single node does not have enough memory, and the naive approach then defeats its purpose

here the memory is exactly 3 blocks per process, nothing more

The flow:

blocks of A flow along rows, blocks of B flow along columns, and at each iteration each process multiplies its current A block with its current B block and adds the result to C

each submatrix of A and submatrix of B visits each process at exactly the right time, when the process needs them

no process stores more than it was initially allocated

this flow is called “systolic” (the submatrices flow through the grid of processes like blood pumped through the heart)

Algorithm:

starts with a checkerboard mapping of the A, B, C matrices on a virtual 2D torus $K (p, p)$ - the shifts are cyclic, so a torus is the ideal topology

two phases:

initial skew:

rotate all submatrices $A$ in row $i$ by $i$ positions to the left

row 0 will not rotate

row 1: each matrix will rotate by 1 position

row 2: by two positions etc.

rotate all submatrices $B$ in column $j$ by $j$ positions upward

this way, at the start of the second phase, each process $P_{i, j}$ holds a matching pair of blocks (same middle index) - they can be multiplied together

iterations:

each process multiplies its submatrices A and B and adds the result to C

rotate all matrices by one position (A’s along rows, B’s along columns)

Correctness:

this algorithm guarantees that at every tick, the A and B blocks held by $P_{i, j}$ share a middle index $m$ (the column index of A matches the row index of B)

every value of $m$ from $0$ to $p - 1$ shows up exactly once

Properties:

memory-optimal

time complexity - $p$ steps, each paying startup latency + communication (one shift of A and of B) + one local block multiplication

scalability is $ψ_{2} (N) = N$ (the same as the naive approach, but the naive approach is memory-inefficient)

Comparison to the Fox’s algorithm:

works on a similar principle but needs 4 submatrices per process; the blocks of A never move (they are only broadcast within rows, only the blocks of B rotate), which simplifies the cleanup

asymptotic time complexity is similar to Cannon’s

Problem definition

Given two dense $(N \times N)$-matrices $A$ and $B$, compute the product $C = A \times B$. We use the standard “three-loop” definition: an output element $c_{i, k}$ is computed as the dot product of row $i$ of $A$ and column $k$ of $B$: $c_{i, k} = \sum_{j = 0}^{N - 1} a_{i, j} \, b_{j, k}$. Strassen-type recursive divide-and-conquer algorithms are out of scope - only the cubic standard algorithm is considered. The matrices $A$, $B$, $C$ are checkerboard-block-mapped on a virtual 2-D mesh $M (p, p)$, i.e., each process $P_{i, j}$ initially holds one $(N / p \times N / p)$-submatrix $A_{i, j}$, one $B_{i, j}$, and produces one $C_{i, j}$.
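The three-loop definition above can be written down directly. This is a minimal sequential sketch in plain Python; the function name `standard_mmm` and the list-of-rows representation are assumptions made for illustration only.

```python
def standard_mmm(a, b):
    """Cubic three-loop product of two square matrices given as lists of rows."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):          # row of A
        for k in range(n):      # column of B
            for j in range(n):  # middle (summation) index
                c[i][k] += a[i][j] * b[j][k]
    return c
```

The innermost loop over `j` is the middle index that the block algorithms later distribute over the processes.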

Motivation: why a naive standard algorithm is not enough

The naive standard MMM algorithm (StandardMMM, the prior algorithm in the same lecture):

Ph.1: Each row of processes performs an AAG (all-to-all gather) of its $A$ submatrices - so every process in row $i$ ends up with the full row of $A$ submatrices.
Ph.2: Each column of processes performs an AAG of its $B$ submatrices - every process in column $k$ ends up with the full column of $B$ submatrices.
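Ph.3: Each $P_{i, k}$ locally computes $C_{i, k} = \sum_{j = 0}^{p - 1} A_{i, j} B_{j, k}$.

Here $p$ is the side of the process grid, so there are $p^{2}$ processes and each submatrix has $N^{2} / p^{2}$ elements. Assuming an SF 2-D mesh, the complexity is $T (N, p) = O\!\left(\frac{N^{2}}{p^{2}} \, p + p \left(\frac{N}{p}\right)^{3}\right) = O\!\left(\frac{N^{2}}{p} + \frac{N^{3}}{p^{2}}\right)$, which gives $E (N, p) = Θ\!\left(\frac{N}{N + p}\right)$, with optimal scalability $ψ_{2} (N) = N$ (the efficiency stays constant as long as $p = O (N)$).

But: it is memory-inefficient. It requires globally $Θ (p)$ times more memory than the sequential algorithm, because each process accumulates a full stripe of $A$ and a full stripe of $B$ ($p$ submatrices each) before doing the multiplication. For very large matrices - which is the only reason to use distributed memory in the first place - this is fatal. If $N$ grows in proportion to $p$ to keep the efficiency constant, the per-process memory $Θ (N^{2} / p)$ grows linearly with $p$ and eventually exceeds any fixed capacity, whereas one submatrix of $N^{2} / p^{2}$ elements stays constant in size.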

Two reasons exist to go distributed: either the sequential time complexity is unacceptable, or the matrices do not fit in a single memory. The standard algorithm helps with the first but worsens the second. This motivates an algorithm that achieves the same asymptotic time complexity while remaining memory optimal - i.e., requiring per process only the capacity to store one $A$ submatrix, one $B$ submatrix, and one $C$ submatrix. Such algorithms are called systolic.
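To make the memory argument concrete, here is a small sketch that counts the matrix elements one process has to store in both schemes. The function name `per_process_elements` is hypothetical, and it assumes a $p \times p$ process grid with $N$ divisible by $p$.

```python
def per_process_elements(n, p):
    """Matrix elements stored by one process on a p x p grid (n divisible by p).

    Returns (naive, systolic): the naive StandardMMM keeps a full stripe of A,
    a full stripe of B and one C block; a systolic algorithm keeps one block each.
    """
    block = (n // p) ** 2
    naive = p * block + p * block + block
    systolic = 3 * block
    return naive, systolic
```

For example, `per_process_elements(1024, 8)` returns `(278528, 49152)`, a ratio of $(2p + 1) / 3 = 17 / 3$, and this ratio grows linearly with $p$.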

The systolic principle

A systolic algorithm reshuffles the submatrices of $A$ and $B$ through the processes in such a way that:

At every step, every process holds exactly one $A$ submatrix and one $B$ submatrix that it can immediately multiply and accumulate into its $C$ submatrix.
Each submatrix of $A$ and each submatrix of $B$ visits each process exactly when that process needs it.
No process ever needs to store more than its initially allocated capacity (one $A$, one $B$, one $C$ submatrix).

The data thus “flow” through the system in rhythm, much like blood circulates in a body - hence the name systolic.

Cannon’s algorithm: structure

Starting state: checkerboard mapping of $A, B, C$ on a virtual 2-D torus $K (p, p)$ (the algorithm naturally requires a torus because of the cyclic shifts; a mesh can simulate this but the torus is the ideal topology). The algorithm has two prologue phases (initial skew) followed by a main loop of $p$ iterations.

Prologue:

Ph.1: For all $i = 0, \dots, p - 1$ in parallel, rotate the submatrices of $A$ in row $i$ by $i$ positions leftward (cyclic shift).
Ph.2: For all $j = 0, \dots, p - 1$ in parallel, rotate the submatrices of $B$ in column $j$ by $j$ positions upward (cyclic shift).

After these two rotations, every process $P_{i, j}$ holds the pair $(A_{i, (i + j) \bmod p}, B_{(i + j) \bmod p, j})$. These have the matching middle index $(i + j) \bmod p$, so they can be multiplied together correctly.

Main loop (Ph.3):

Repeat $p$ times:
- For all processes in $K (p, p)$ in parallel: multiply the currently held $A$ and $B$ submatrices and add the result to $C$.
- For all $i = 0, \dots, p - 1$ in parallel: rotate the submatrices of $A$ in row $i$ by one position leftward.
- For all $j = 0, \dots, p - 1$ in parallel: rotate the submatrices of $B$ in column $j$ by one position upward.

After $p$ iterations of the main loop, each process has accumulated all $p$ partial products needed for its assigned $C_{i, j}$, and the algorithm is complete.

Cannon’s algorithm is very regular: after the asymmetric prologue, every iteration looks the same - multiply, shift A leftward by 1, shift B upward by 1.
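The prologue and the main loop can be simulated sequentially to see the data movement. The sketch below is not a distributed implementation: it keeps the whole $p \times p$ grid of blocks in nested Python lists and performs each "rotation" by re-indexing. The names `cannon`, `split_blocks` and `matmul_add` are hypothetical, and $N$ is assumed to be divisible by $p$.

```python
def matmul_add(c, a, b):
    """c += a * b for square blocks stored as lists of rows."""
    n = len(a)
    for i in range(n):
        for k in range(n):
            c[i][k] += sum(a[i][j] * b[j][k] for j in range(n))


def split_blocks(m, p):
    """Checkerboard-split an N x N matrix into a p x p grid of blocks."""
    s = len(m) // p
    return [[[row[j * s:(j + 1) * s] for row in m[i * s:(i + 1) * s]]
             for j in range(p)] for i in range(p)]


def cannon(a, b, p):
    """Sequential simulation of Cannon's algorithm on a p x p torus."""
    n = len(a)
    s = n // p
    A, B = split_blocks(a, p), split_blocks(b, p)
    C = [[[[0] * s for _ in range(s)] for _ in range(p)] for _ in range(p)]

    # Prologue: row i of A rotates left by i, column j of B rotates up by j,
    # so P(i, j) receives A(i, (i + j) mod p) and B((i + j) mod p, j).
    A = [[A[i][(j + i) % p] for j in range(p)] for i in range(p)]
    B = [[B[(i + j) % p][j] for j in range(p)] for i in range(p)]

    for _ in range(p):
        # Every process multiplies its current pair and accumulates into C.
        for i in range(p):
            for j in range(p):
                matmul_add(C[i][j], A[i][j], B[i][j])
        # Unit shifts: P(i, j) takes A from its right and B from its lower neighbour.
        A = [[A[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        B = [[B[(i + 1) % p][j] for j in range(p)] for i in range(p)]

    # Reassemble the full N x N result from the C blocks.
    return [[C[i // s][k // s][i % s][k % s] for k in range(n)] for i in range(n)]
```

At any moment each grid position holds exactly one $A$ block, one $B$ block and one $C$ block, which is the memory property the algorithm is designed for.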

Correctness

The key invariant: after Phase 1 and Phase 2 of the prologue, process $P_{i, j}$ holds $A_{i, (i + j) \bmod p}$ and $B_{(i + j) \bmod p, j}$ - a matching pair with the same middle index.

Reason: before Phase 1, $P_{i, j}$ held $A_{i, j}$. Rotating row $i$ leftward by $i$ positions moves $A_{i, m}$ to the process $P_{i, (m - i) \bmod p}$, so $P_{i, j}$ receives $A_{i, (j + i) \bmod p}$. Symmetrically, before Phase 2, $P_{i, j}$ held $B_{i, j}$. Rotating column $j$ upward by $j$ positions moves $B_{m, j}$ to $P_{(m - j) \bmod p, j}$, so $P_{i, j}$ receives $B_{(i + j) \bmod p, j}$. The two middle indices match.

After each iteration of the main loop, the middle index of both submatrices held by $P_{i, j}$ is incremented by $1 \pmod p$. $A$ moves leftward by 1, so $P_{i, j}$ receives the $A$ submatrix previously held by its right neighbour $P_{i, (j + 1) \bmod p}$, whose column index is one higher. $B$ moves upward by 1, so $P_{i, j}$ receives the $B$ submatrix previously held by its lower neighbour $P_{(i + 1) \bmod p, j}$, whose row index is one higher. Hence in iteration $t = 0, \dots, p - 1$, process $P_{i, j}$ multiplies $A_{i, (i + j + t) \bmod p}$ by $B_{(i + j + t) \bmod p, j}$.

So across $p$ iterations, the middle index runs through all values $0, 1, \dots, p - 1$ exactly once. The local accumulation $C_{i, j} \leftarrow C_{i, j} + A_{i, mid} \cdot B_{mid, j}$ therefore visits every $mid$ exactly once, giving the correct mathematical result $C_{i, j} = \sum_{mid = 0}^{p - 1} A_{i, mid} \cdot B_{mid, j}$. This is exactly the block formulation of matrix multiplication.
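The invariant can also be checked mechanically by tracking only the block indices, not the data. The sketch below is an illustration under stated assumptions: the name `middle_indices_seen` is hypothetical, and the torus is modelled with plain lists. It starts from the unskewed checkerboard mapping, applies the prologue and the unit shifts, and records which middle index each process multiplies in each iteration.

```python
def middle_indices_seen(p):
    """For each process P(i, j), list the middle indices it multiplies, in order."""
    # Initially P(i, j) holds A(i, j) and B(i, j): track A's column and B's row.
    a_col = [[j for j in range(p)] for i in range(p)]
    b_row = [[i for j in range(p)] for i in range(p)]

    # Prologue: row i of A rotates left by i, column j of B rotates up by j.
    a_col = [[a_col[i][(j + i) % p] for j in range(p)] for i in range(p)]
    b_row = [[b_row[(i + j) % p][j] for j in range(p)] for i in range(p)]

    seen = [[[] for _ in range(p)] for _ in range(p)]
    for _ in range(p):
        for i in range(p):
            for j in range(p):
                # The pair must be multipliable: A's column equals B's row.
                assert a_col[i][j] == b_row[i][j]
                seen[i][j].append(a_col[i][j])
        # Main-loop shifts: A left by one, B up by one.
        a_col = [[a_col[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        b_row = [[b_row[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return seen
```

For $p = 4$, process $P_{1, 2}$ sees the middle indices `[3, 0, 1, 2]`: it starts at $(1 + 2) \bmod 4 = 3$ and advances by one per iteration, covering every value exactly once.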

Properties

Time complexity. With $p$ denoting the side of the process grid ($p^{2}$ processes, one per node of an all-port WH hypercube $Q_{2 \log p}$): $T_{Cannon} (N, p) ≐ t_{s} \, p + \frac{N^{2}}{p} \, t_{m} + O\!\left(\frac{N^{3}}{p^{2}}\right)$. The terms are:

$t_{s} p$ - startup latency: $p$ communication steps, each with constant setup.
$\frac{N^{2}}{p} t_{m}$ - data transfer: $p$ shifts of one submatrix each; a submatrix has $(N / p)^{2}$ elements, which summed over $p$ steps gives $N^{2} / p$.
$O\!\left(\frac{N^{3}}{p^{2}}\right)$ - the local arithmetic: $p$ submatrix multiplications, each costing $(N / p)^{3} = N^{3} / p^{3}$, which summed gives $N^{3} / p^{2}$, i.e. the sequential $N^{3}$ divided by the number of processes.

Scalability: $ψ_{2} (N) = N$ in terms of the grid side $p$ (the efficiency stays constant as long as $p = O (N)$), which is optimal scalability (same as the naive standard MMM).

Memory optimality. Within the course of the algorithm, there is no replication or redundancy of data. Every process holds exactly one $A$ submatrix, one $B$ submatrix, and one $C$ submatrix at all times. The total distributed memory used is therefore exactly the size of the three matrices, $3 N^{2}$ elements - the minimum possible. This contrasts sharply with the naive standard algorithm, which requires $Θ (p)$ times more.

Regularity. Apart from the asymmetric prologue, the main loop consists of identical iterations: multiply + shift left + shift up. The shifts are unit cyclic shifts in row and column dimensions of the torus, which are the most basic and most efficiently supported communication patterns on 2-D tori. MPI provides direct support via MPI_Sendrecv_replace combined with MPI_Cart_shift on a Cartesian (toroidal) communicator.

Ideal topology. Cannon’s algorithm is naturally a 2-D torus algorithm, because the cyclic shifts wrap around. On a mesh (without wraparound) the shifts can be simulated, but the torus is the ideal topology.
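The unit cyclic shift needs, for every process, the rank it sends to and the rank it receives from. The sketch below computes this pair in plain Python for a periodic $p \times p$ grid. It is a stand-in that imitates the source/destination pair a Cartesian shift query returns, not the MPI API itself; the name `torus_shift` and the row-major rank layout (`rank = i * p + j`) are assumptions.

```python
def torus_shift(rank, p, dim, disp):
    """Return (source, dest) ranks for a cyclic shift on a p x p torus.

    dim 0 moves along the row index i (vertical), dim 1 along the column
    index j (horizontal). A block is sent to `dest` and received from `source`.
    """
    i, j = divmod(rank, p)
    if dim == 0:
        dest = ((i + disp) % p) * p + j
        source = ((i - disp) % p) * p + j
    else:
        dest = i * p + (j + disp) % p
        source = i * p + (j - disp) % p
    return source, dest
```

Shifting $A$ leftward by one corresponds to `dim=1, disp=-1`, and shifting $B$ upward by one to `dim=0, disp=-1`. On a $3 \times 3$ torus, `torus_shift(0, 3, 1, -1)` returns `(1, 2)`: process 0 receives from its right neighbour 1 and sends to rank 2 through the wraparound link.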

Comparison with Fox’s algorithm (Broadcast-Multiply-Roll)

Fox’s algorithm (sometimes called BMR) is an alternative systolic-style algorithm. Its key differences:

It does not perform the initial skewing of $A$ and $B$ . The initial checkerboard mapping is kept as is.
In iteration $k$, the $A$ submatrices on the $k$-th shifted diagonal, $A_{i, (i + k) \bmod p}$, are broadcast within each row.
Each process then multiplies the received $A$ with its local $B$, adds to $C$, and then rotates $B$ upward by 1 in its column.

This relaxes memory optimality: each process needs space for 4 submatrices (its local $A$, an incoming broadcast $A$ buffer, its rotating $B$, and its accumulating $C$) instead of 3. The advantage is simpler restoration of the original mapping at the end, because the $A$ submatrices never move - they are only copied via broadcast. The asymptotic time complexity and scalability are similar to Cannon’s.

Cannon vs Fox: Cannon is memory-optimal (3 submatrices per process) but more involved to restore to the original mapping. Fox needs 4 submatrices per process but the $A$ matrix never moves, simplifying cleanup.
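For comparison, the Broadcast-Multiply-Roll steps can be simulated in the same sequential style. To keep the sketch short it assumes $1 \times 1$ blocks, so the grid side $p$ equals the matrix size; the function name `fox` is hypothetical.

```python
def fox(a, b):
    """Sequential simulation of Fox's algorithm with 1 x 1 blocks (p = len(a))."""
    p = len(a)
    B = [row[:] for row in b]            # the rotating B blocks
    C = [[0] * p for _ in range(p)]
    for k in range(p):
        for i in range(p):
            # Broadcast: row i receives A(i, (i + k) mod p) into the extra buffer.
            a_bcast = a[i][(i + k) % p]
            # Multiply: after k rolls, P(i, j) holds B((i + k) mod p, j).
            for j in range(p):
                C[i][j] += a_bcast * B[i][j]
        # Roll: every column of B rotates upward by one.
        B = [[B[(i + 1) % p][j] for j in range(p)] for i in range(p)]
    return C
```

The matrix `a` is never modified, which mirrors the point that the $A$ submatrices stay in place, while `a_bcast` plays the role of the fourth per-process buffer.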

Potential exam questions

Define the dense MMM problem and state which definition of matrix multiplication is used (the standard cubic one, not Strassen).
Why is the naive StandardMMM (AAG of rows of $A$ and columns of $B$ , then local product) unsuitable for very large matrices? State the memory overhead it imposes in terms of $p$ .
Define what a systolic algorithm is. What invariant about per-process memory must such an algorithm preserve?
Describe the initial mapping for Cannon’s algorithm. Why is the natural topology a 2-D torus $K (p, p)$ rather than a mesh $M (p, p)$ ?
Write out the three phases of Cannon’s algorithm: the two prologue rotations (Ph.1, Ph.2) and the main loop (Ph.3). State exactly how many times the main loop is repeated, and what the three steps of each iteration are.
After the prologue (Ph.1 + Ph.2) of Cannon’s algorithm, what pair of submatrices does process $P_{i, j}$ hold? Justify why this pair has the correct matching middle index for matrix multiplication.
Prove the correctness of Cannon’s algorithm: show that after $p$ iterations of the main loop, process $P_{i, j}$ has accumulated the correct value $C_{i, j} = \sum_{m = 0}^{p - 1} A_{i, m} B_{m, j}$ .
State the time complexity of Cannon’s algorithm on an all-port WH hypercube (one node per process). Identify and explain the three terms (startup latency, data transfer, local arithmetic).
What is the asymptotic scalability function $ψ_{2} (N)$ of Cannon’s algorithm? How does it compare to the naive StandardMMM?
State and justify the memory optimality property of Cannon’s algorithm. How much per-process memory does the algorithm require, and how does this compare to the naive StandardMMM and to Fox’s algorithm?
Why is Cannon’s algorithm called regular? Identify the structural difference between the prologue and the main loop, and explain why this matters for implementation.
Compare Cannon’s algorithm to Fox’s Broadcast-Multiply-Roll algorithm. State at least three differences: per-process memory requirement, structure of the initial mapping, and the cost of restoring the original mapping at the end.
Explain the relationship between Cannon’s algorithm and the cyclic shift permutation on a 2-D torus. Which MPI functions naturally support the communication primitives the algorithm needs?
