PDP - Multiplication of a dense matrix with a vector - row-wise, column-wise, checkerboard mapping, scalability

The big picture

You need to compute $y = A x$ with a $(\sqrt{N} \times \sqrt{N})$ matrix on $p$ processors. Each processor holds only a slice of $A$, so it has to communicate with the others to obtain the pieces of $x$ or $y$ it is missing. The three approaches differ only in how $A$ is sliced (row strips, column strips, or square tiles), which changes the communication, not the local work.

All three end up with the same total cost $T (N, p) ≐ k_{1}^{'} \sqrt{N} + k_{2} N / p$ and the same scalability $p \leq c \sqrt{N}$.

$k_{2} N / p$ is the computation term: each process performs $N / p$ multiply-adds

$k_{2}$ - how fast the CPU computes (HW constant)

$k_{1}^{'} \sqrt{N}$ is the communication term (the cost of the AAB, OAB, or reduction)

$k_{1}$ - how fast the underlying network moves a number

there are $N$ elements in the matrix, therefore its shape is $(\sqrt{N} \times \sqrt{N})$

Problem definition

Given a dense square $(\sqrt{N} \times \sqrt{N})$-matrix $A$ and a $(\sqrt{N} \times 1)$-vector $x$, compute $y = A x$. The matrix is stored as a 2-D array with $N = n^{2}$ elements in total, where $n = \sqrt{N}$ is the number of rows. We use $N$ for the total number of matrix elements and write the matrix shape as $(\sqrt{N} \times \sqrt{N})$, so that “matrix size” and “number of elements” are unambiguous. Three basic mappings of the dense matrix $A$ to processes are considered:

row-wise,
column-wise,
checkerboard.
The choice of mapping changes the structure of the algorithm: which data each process initially has, what it needs that it does not have, and what communication is required to obtain the missing data. The local computation count is the same in all three mappings ( $Θ (N / p)$ ); only the communication pattern differs.

Row-wise mapping

Row-wise

Slice $A$ into horizontal strips: each process gets $\sqrt{N} / p$ full rows. Every process owns only a chunk of $x$ but needs the whole vector to compute its dot products, so first every process shares its chunk with all the others (all-to-all broadcast, AAB), then each process locally computes its rows of $y$.

After the broadcast each process holds the whole vector $x$, so $P_{0}$ can compute $y_{0}$ locally, $P_{1}$ can compute $y_{1}$, and so on.

Clean two-step flow: communicate, then compute. Output $y$ lands in the same layout as input $x$ , which is ideal for iterative methods like the Power Method.

Scalability: constant efficiency as long as each processor keeps a constant number of rows.

$\sqrt{N} / p$ (the number of matrix rows per processor) has to be kept constant

Processes $P_{i}$ form a 1-D virtual mesh $M (p)$. Let $r = \frac{\sqrt{N}}{p}$. Each process is assigned a block of $r$ consecutive rows of $A$. The vectors $x$ and $y$ are also mapped block-wise across the $p$ processes, each process holding a block of $r$ consecutive elements.

Algorithm RowWiseMVM $(A, x, y)$:

Ph.1 (AAB): Each process sends its part of $x$ to all other processes (all-to-all broadcast). After this, every process holds the full $x$ .
Ph.2 (local compute): For all $P_{i}$ in parallel, $P_{i}$ computes $y_{j}$ for $j = i r, \dots, (i + 1) r - 1$, that is, $r$ dot products, each of length $\sqrt{N}$. The output $y$ ends up mapped block-wise across processes in exactly the same way as the input $x$. This is important for iterative algorithms like the Power Method that repeatedly apply MVM: no remapping of the vector is needed between iterations.
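The two phases can be traced in a small sequential simulation. This is only a sketch in plain Python, not a real message-passing program: the function name `row_wise_mvm`, the list-of-lists matrix layout, and the assumption that $p$ divides the number of rows are choices made here for illustration.

```python
def row_wise_mvm(A, x, p):
    """Simulate RowWiseMVM for p processes (assumes p divides the number of rows)."""
    n = len(A)                      # n = sqrt(N), the number of rows
    assert n % p == 0, "sketch assumes p divides the number of rows"
    r = n // p                      # rows and vector elements per process

    # Initial block-wise layout: P_i owns rows i*r .. (i+1)*r - 1 and the matching block of x.
    local_rows = [A[i * r:(i + 1) * r] for i in range(p)]
    local_x = [x[i * r:(i + 1) * r] for i in range(p)]

    # Ph.1 (AAB): afterwards every process holds the concatenation of all blocks of x.
    full_x = [[v for block in local_x for v in block] for _ in range(p)]

    # Ph.2 (local compute): P_i computes r dot products, each of length n.
    local_y = [
        [sum(a * b for a, b in zip(row, full_x[i])) for row in local_rows[i]]
        for i in range(p)
    ]
    return local_y                  # local_y[i] is the block of y owned by P_i


A = [[1, 2, 3, 4],
     [0, 1, 0, 1],
     [2, 0, 2, 0],
     [1, 1, 1, 1]]
x = [1, 2, 3, 4]
print(row_wise_mvm(A, x, p=2))      # [[30, 6], [8, 10]]
```

The returned blocks have the same block-wise layout as the input `x`, which is the property that makes repeated application cheap.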

Row-wise MVM is the simplest case: one AAB of the vector, then purely local dot products. Communication and computation are cleanly separated.

Time complexity and scalability of `RowWiseMVM`

Local computation in Phase 2: $T_{2} (N, p) = Θ (N / p),$ since each process performs $\sqrt{N} / p$ dot products of vectors of length $\sqrt{N}$, totaling $N / p$ multiply-adds. Communication in Phase 1 is the latency of an AAB of $\sqrt{N} / p$ numbers, which depends on the topology and switching technology. Two examples:

Full-duplex 4-port SF $M (\sqrt{p}, \sqrt{p})$ with noncombining TADT-based AAB: $T_{1} (N, p) = \frac{p}{4} (t_{s} + k_{1} \sqrt{N} / p) ≐ \frac{k_{1} \sqrt{N}}{4}.$
Full-duplex 1-port SF $M (\sqrt{p}, \sqrt{p})$ with combining AAB (AAB by dimensions): $T_{1} (N, p) = \sqrt{p}\, t_{s} + k_{1} \sqrt{p}\, r + \sqrt{p}\, t_{s} + k_{1} \sqrt{p}\, \sqrt{p}\, r ≐ k_{1} \sqrt{N},$ using $p\, r = \sqrt{N}$.

In all such cases the total time has the form $T (N, p) = T_{1} (N, p) + T_{2} (N, p) ≐ k_{1}^{'} \sqrt{N} + k_{2} \frac{N}{p}.$

Efficiency, taking the sequential time as $k_{2} N$: $E (N, p) = \frac{k_{2} N}{k_{2} N + k_{1}^{'} p \sqrt{N}} \geq E_{0}$ holds when $\sqrt{N} \geq \frac{E_{0} k_{1}^{'}}{(1 - E_{0}) k_{2}} \cdot p ⟺ p \leq \frac{(1 - E_{0}) k_{2}}{E_{0} k_{1}^{'}} \sqrt{N}.$

Constant efficiency $E_{0}$ is preserved as long as $p$ grows no faster than $\sqrt{N}$, i.e., $\sqrt{N} / p$ can be kept constant, which means a constant number of matrix rows per processor.
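A quick numerical check of the efficiency model may help. The sketch below evaluates $T (N, p) ≐ k_{1}^{'} \sqrt{N} + k_{2} N / p$ with the start-up latency neglected. The function names and the default values $k_{1}^{'} = k_{2} = 1$ are assumptions made only for illustration, not measured constants.

```python
import math


def total_time(N, p, k1=1.0, k2=1.0):
    """Model T(N, p) = k1' * sqrt(N) + k2 * N / p (start-up latency neglected)."""
    return k1 * math.sqrt(N) + k2 * N / p


def efficiency(N, p, k1=1.0, k2=1.0):
    """E = sequential time / (p * parallel time), with sequential time k2 * N."""
    return (k2 * N) / (p * total_time(N, p, k1, k2))


def max_p_bound(N, e0, k1=1.0, k2=1.0):
    """Upper bound on p from p <= (1 - E0) * k2 / (E0 * k1') * sqrt(N)."""
    return (1 - e0) * k2 / (e0 * k1) * math.sqrt(N)


# Keep sqrt(N) / p fixed at 4 matrix rows per process and let p grow.
for p in (4, 16, 64):
    n = 4 * p                       # n = sqrt(N)
    print(p, n * n, round(efficiency(n * n, p), 3))
```

With these illustrative constants every line reports an efficiency of 0.8: the efficiency depends only on the number of rows per process, not on $p$ itself.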

Row-wise MVM has outstanding scalability: constant efficiency can be obtained even with a constant number of matrix rows per processor.

Column-wise mapping

Column-wise

Slice $A$ into vertical strips: each process gets $\sqrt{N} / p$ full columns and the matching chunk of $x$. Now each process has everything it needs to compute partial contributions to every element of $y$, but no element is complete. So the order flips: compute first, then communicate. Phase 2 runs $p$ simultaneous reductions (one per output chunk, each rooted at a different process) that sum the partials into the final $y$.

compute part: each process multiplies all elements in the column by the assigned $x_{i}$ value

each process produces the partial output of the $y_{i}$ (but no output is finished)

to get the complete $y_{i}$, we have to sum the partial results along each row

parallel reductions

all reductions are happening in parallel (each is rooted in a different process, and therefore $y_{2}$ ends up in the $P_{2}$ ) - the pink diagonal

Same asymptotic cost and scalability as row-wise; the choice between them is usually driven by which mapping better fits the surrounding workflow.

Processes $P_{i}$ form a 1-D virtual mesh $M (p)$. Let $r = \sqrt{N} / p$. Initial distribution:

$P_{i}$ owns subvector $x_{i r}, \dots, x_{(i + 1) r - 1}$ of $x$ .
$P_{i}$ owns columns $i r, \dots, (i + 1) r - 1$ of $A$.

Final distribution: $P_{i}$ owns subvector $y_{i r}, \dots, y_{(i + 1) r - 1}$. The situation is reciprocal to row-wise: the order of computation and communication is reversed.

Algorithm ColumnWiseMVM $(A, x, y)$:

Ph.1 (local compute): For all $i = 0, \dots, p - 1$ in parallel, $P_{i}$ computes its local contributions to all elements of $y$ . Each $P_{i}$ has all rows but only a column-slice of $A$ , so it can produce partial values for every $y_{j}$ , but they are incomplete.
Ph.2 (parallel reductions): For all $i = 0, \dots, p - 1$ in parallel, $P_{i}$ becomes the root of a row-wise reduction with operation $+$, accumulating the partial dot products from all $P_{j}$ to produce the final values of $y_{i r}, \dots, y_{(i + 1) r - 1}$.

Phase 2 consists of $p$ simultaneous reductions, each rooted at a different process, each carrying out the same operation $+$ on different data. On an underlying topology with sufficient bandwidth (e.g., a 1-D SF mesh with pipelining, or any standard hypercube), all $p$ reductions can be executed concurrently. The asymptotic complexity is the same as for row-wise mapping: time complexity and scalability are essentially identical.
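The reversed order (compute first, then reduce) can be seen in a sequential simulation. As before, this is only an illustrative sketch: the function name `column_wise_mvm` and the assumption that $p$ divides the number of columns are introduced here, and the reductions are modeled as plain sums rather than real collective operations.

```python
def column_wise_mvm(A, x, p):
    """Simulate ColumnWiseMVM for p processes (assumes p divides the number of columns)."""
    n = len(A)                      # n = sqrt(N)
    assert n % p == 0, "sketch assumes p divides the number of columns"
    r = n // p                      # columns and vector elements per process

    # Ph.1 (local compute): P_i uses only its own columns and its own block of x,
    # producing an incomplete contribution to every element of y.
    partial = []
    for i in range(p):
        cols = range(i * r, (i + 1) * r)
        partial.append([sum(A[row][c] * x[c] for c in cols) for row in range(n)])

    # Ph.2 (p reductions with +): reduction i is rooted at P_i and sums
    # elements i*r .. (i+1)*r - 1 of all partial vectors.
    local_y = []
    for i in range(p):
        block = [sum(partial[j][k] for j in range(p))
                 for k in range(i * r, (i + 1) * r)]
        local_y.append(block)
    return local_y                  # local_y[i] is the block of y owned by P_i


A = [[1, 2, 3, 4],
     [0, 1, 0, 1],
     [2, 0, 2, 0],
     [1, 1, 1, 1]]
x = [1, 2, 3, 4]
print(column_wise_mvm(A, x, p=2))   # [[30, 6], [8, 10]]
```

For `p=2` the two partial vectors are `[5, 2, 2, 3]` and `[25, 4, 6, 7]`; neither is a finished result until the reductions add them.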

Row-wise and column-wise mappings have the same asymptotic behavior. The choice between them is driven by other factors (e.g., which mapping fits a larger workflow).

Checkerboard mapping

Checkerboard

Arrange processes in a $\sqrt{p} \times \sqrt{p}$ grid and give each one a square tile of $A$. The vectors live in the rightmost column of processes (the vector $x$ is drawn to the right of $A$, so the processes holding the rightmost tiles also hold $x$).

each process owns a $(\sqrt{N} / \sqrt{p} \times \sqrt{N} / \sqrt{p})$-submatrix

Four phases: (1) boundary column ships its $x$ chunks to the diagonal (2) each diagonal process broadcasts its chunk down its column

so each row of processes holds all chunks $x_{0}$, $x_{1}$, $x_{2}$, etc., one per column, ready to be multiplied with the local tiles (phase 3)

we need the same value $x_{0}$ along the whole column

(3) every process multiplies its tile by the subvector it received

it receives a chunk of the vector $x$; this is a small local matrix-vector multiplication whose result is again a vector

(4) each row of processes reduces its partial sums back to the rightmost column to form $y$ .

More communication phases, but each one is smaller and runs in parallel across rows or columns, so the total cost and scalability match the striped mappings: constant efficiency requires $\sqrt{N} / p \geq \text{const}$.

Processes form a virtual 2-D mesh $M (\sqrt{p}, \sqrt{p})$. Each process $P_{i, j}$ owns a $(\sqrt{N} / \sqrt{p} \times \sqrt{N} / \sqrt{p})$-submatrix of $A$. For the input/output vectors, the standard assumption used here is: vectors $x$ and $y$ are mapped to the last column of the virtual 2-D mesh (the rightmost column of processes). Other choices are possible; this is just the convention used in the analysis.

Algorithm CheckerBoardMVM $(A, x, y)$:

Ph.1: For all $i = 0, \dots, \sqrt{p} - 1$ in parallel, boundary processor $P_{i, \sqrt{p} - 1}$ sends its part of $x$ to the diagonal processor $P_{i, i}$.
Ph.2: For all $i = 0, \dots, \sqrt{p} - 1$ in parallel, $P_{i, i}$ broadcasts the received part of $x$ within its column.
Ph.3: For all $i, j = 0, \dots, \sqrt{p} - 1$ in parallel, $P_{i, j}$ multiplies its local submatrix of $A$ with its received subvector of $x$.
Ph.4: For all $i = 0, \dots, \sqrt{p} - 1$ in parallel, the processors $P_{i, *}$ in row $i$ perform a parallel reduction with root $P_{i, \sqrt{p} - 1}$, summing the partial results into the final $y$.

Note that Phase 2 is not a single global broadcast: it is $\sqrt{p}$ independent column-wise broadcasts running in $\sqrt{p}$ disjoint column-sub-topologies in parallel. Similarly, Phase 4 consists of $\sqrt{p}$ independent row-wise reductions running in $\sqrt{p}$ disjoint row-sub-topologies in parallel.
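The four phases can be followed step by step in a sequential simulation. This is a sketch under stated assumptions: the function name `checkerboard_mvm` is made up here, the parameter `q` stands for the grid side $\sqrt{p}$, `q` is assumed to divide the matrix side, and sends, broadcasts, and reductions are modeled as plain Python assignments and sums.

```python
def checkerboard_mvm(A, x, q):
    """Simulate CheckerBoardMVM on a q x q process grid (q = sqrt(p) divides the matrix side)."""
    n = len(A)                      # n = sqrt(N)
    assert n % q == 0, "sketch assumes q divides the matrix side"
    b = n // q                      # tile side, sqrt(N) / sqrt(p)

    # Initial layout: P[i][q-1] in the last column holds block i of x.
    x_block = [x[i * b:(i + 1) * b] for i in range(q)]

    # Ph.1: P[i][q-1] sends block i to the diagonal process P[i][i].
    on_diagonal = {i: x_block[i] for i in range(q)}

    # Ph.2: P[j][j] broadcasts within column j, so every P[i][j] holds block j of x.
    received = [[on_diagonal[j] for j in range(q)] for _ in range(q)]

    # Ph.3: every P[i][j] multiplies its b x b tile by the received subvector.
    partial = [[None] * q for _ in range(q)]
    for i in range(q):
        for j in range(q):
            tile = [A[i * b + u][j * b:(j + 1) * b] for u in range(b)]
            partial[i][j] = [sum(a * v for a, v in zip(row, received[i][j]))
                             for row in tile]

    # Ph.4: row i reduces its partial vectors with + to the root P[i][q-1].
    return [[sum(partial[i][j][u] for j in range(q)) for u in range(b)]
            for i in range(q)]


A = [[1, 2, 3, 4],
     [0, 1, 0, 1],
     [2, 0, 2, 0],
     [1, 1, 1, 1]]
x = [1, 2, 3, 4]
print(checkerboard_mvm(A, x, q=2))  # [[30, 6], [8, 10]]
```

Block `i` of the result ends up at the last-column process of row `i`, matching the initial layout of `x`.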

Time complexity and scalability of `CheckerBoardMVM`

Each process owns a $(\sqrt{N} / \sqrt{p} \times \sqrt{N} / \sqrt{p})$-submatrix.

Phase 3 (local arithmetic): $T_{3} = k_{3} \frac{N}{p}$ parallel arithmetic operations.
Phase 1 complexity is of the same order as Phase 2 (SF) or of lower order (WH), so it can be ignored.
Phase 4 has the same asymptotic complexity as Phase 2 (broadcast and reduction run on the same data sizes on the same topology; the small constant from the addition operations does not change the asymptotic).
Phase 2 is an OAB of $\sqrt{N} / \sqrt{p}$ numbers; its latency depends on the underlying topology and switching. Example: SF mesh $M (\sqrt{p}, \sqrt{p})$: $T_{2} = k_{2} \sqrt{p} \cdot \frac{\sqrt{N}}{\sqrt{p}} = k_{2} \sqrt{N},$ which is the same asymptotic order as in RowWiseMVM. (In this example $k_{2}$ is the per-number communication constant of the broadcast, while $k_{3}$ from Phase 3 is the arithmetic constant.)

Hence $E (N, p) \geq E_{0}$ if $\frac{\sqrt{N}}{p} \geq \text{const}.$ The scalability is essentially the same as for the striped mappings on the same hardware: assuming identical local node performance and identical communication links, all three mappings give asymptotically equivalent total time complexity.
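The step from the phase costs to the efficiency condition can be spelled out in a few lines. This is a sketch for the SF mesh example only; the constant $c$ is introduced here (it does not appear in the notes) to collect the communication constants of Phases 1, 2, and 4, and the sequential time is taken to be $k_{3} N$.

$$
\begin{aligned}
T (N, p) &≐ c \sqrt{N} + k_{3} \frac{N}{p}, \\
E (N, p) &= \frac{k_{3} N}{p \, T (N, p)} = \frac{k_{3} N}{k_{3} N + c \, p \sqrt{N}} \geq E_{0} \\
&⟺ (1 - E_{0}) \, k_{3} \sqrt{N} \geq E_{0} \, c \, p \\
&⟺ \frac{\sqrt{N}}{p} \geq \frac{E_{0} \, c}{(1 - E_{0}) \, k_{3}}.
\end{aligned}
$$

The right-hand side does not depend on $N$ or $p$, which is exactly the "$\geq \text{const}$" form of the condition.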

Comparative summary

Row-wise: AAB of $x$ first, then local dot products. Output $y$ mapped same as input $x$ .
Column-wise: local partial products first, then $p$ simultaneous reductions to produce the output.
Checkerboard: send $x$ from boundary to diagonal, broadcast in columns, local submatrix-subvector multiply, reduce in rows.

On comparable underlying hardware (same topology of given size, same node performance, same link performance), all three have the same asymptotic time complexity: $T (N, p) ≐ k_{1}^{'} \sqrt{N} + k_{2} \frac{N}{p},$ and the same scalability: constant efficiency requires $p \leq c \sqrt{N}$ for some constant $c$, equivalently a constant number of matrix rows per processor (for checkerboard, written as $\sqrt{N} / p \geq$ const).

Potential exam questions

State the dense MVM problem and define the matrix and vector dimensions used in the analysis. What does $N$ denote and why is the matrix written as $(\sqrt{N} \times \sqrt{N})$?
Describe in detail the RowWiseMVM algorithm. What is the size of each vector block? Which collective communication operation is used in Phase 1, and what is the data layout of $y$ at the end?
Derive the time complexity $T_{2} (N, p)$ of Phase 2 of RowWiseMVM. Show that $T_{2} = Θ (N / p)$ corresponds to $\sqrt{N} / p$ dot products of length $\sqrt{N}$.
For RowWiseMVM on a full-duplex 4-port SF $M (\sqrt{p}, \sqrt{p})$ with noncombining TADT-based AAB, derive $T_{1} (N, p)$ and explain why it simplifies to $≐ k_{1} \sqrt{N} / 4$ asymptotically.
For RowWiseMVM on a full-duplex 1-port SF $M (\sqrt{p}, \sqrt{p})$ with combining AAB by dimensions, derive $T_{1} (N, p)$ and show that it is $≐ k_{1} \sqrt{N}$ asymptotically.
Starting from $T (N, p) = k_{1}^{'} \sqrt{N} + k_{2} N / p$, derive the isoefficiency condition $E (N, p) \geq E_{0}$ for RowWiseMVM and conclude what constraint this places on the relationship between $p$ and $N$. Explain why this implies constant efficiency with a constant number of matrix rows per processor.
Describe ColumnWiseMVM. Why is the order of computation and communication reversed compared to row-wise? What is the structure of Phase 2 (how many simultaneous reductions, and on what data)?
Explain why RowWiseMVM and ColumnWiseMVM have asymptotically the same complexity and scalability, despite the different communication patterns.
Describe all four phases of CheckerBoardMVM. In particular, explain why Phase 2 is $\sqrt{p}$ independent column broadcasts rather than a single global broadcast.
Why must the input vector $x$ first be sent from the last column of processes to the diagonal in Phase 1 of CheckerBoardMVM?
Derive the time complexity of Phase 3 of CheckerBoardMVM. Why does the complexity of Phase 1 not contribute asymptotically?
For CheckerBoardMVM on an SF $M (\sqrt{p}, \sqrt{p})$ mesh, show that $T_{2} = k_{2} \sqrt{N}$ (where $T_{2}$ is the cost of Phase 2 OAB), and conclude that the total time complexity is of the same order as RowWiseMVM.
State and justify the scalability condition for CheckerBoardMVM: under what relationship between $N$ and $p$ is constant efficiency preserved?
Compare the three mappings (row-wise, column-wise, checkerboard) in terms of: (a) total time complexity, (b) scalability, (c) the structure of communication, (d) the mapping of the output vector. Are there practical reasons to prefer one over the others despite their asymptotic equivalence?
