The big picture

You need to compute with a matrix on processors. Each processor only gets a slice of , so it has to talk to others to get the pieces of or it is missing. The three approaches differ only in how is sliced - row strips, column strips, or square tiles - which changes the communication, not the local work.

All three end up with the same total cost and the same scalability .

  • there are multiply adds (computation)
    • - how fast the CPU computes (HW constant)
  • is the sum of the communication cost (AAB, OAB or reduction)
    • - how fast the underlying network moves a number
  • there are elements in the matrix, therefore

Problem definition

Given a square -matrix (dense) and a -vector , compute The matrix is dense, stored as a 2-D array, with elements total. We use as the total number of matrix elements and write the matrix shape as so that “matrix size” and “number of elements” are unambiguous. Three basic mappings of the dense matrix to processes are considered:

  • row-wise,
  • column-wise,
  • checkerboard.
  • The choice of mapping changes the structure of the algorithm: which data each process initially has, what it needs that it does not have, and what communication is required to obtain the missing data. The local computation count is the same in all three mappings (); only the communication pattern differs.

Row-wise mapping

Row-wise

Slice into horizontal strips - each process gets full rows. Every process owns only a chunk of but needs the whole vector to do dot products, so first everyone shares their chunk with everyone else (all-to-all broadcast), then each process locally computes its rows of .

  • since after the broadcast, each process has the whole vector, so can calculate the locally etc.

Clean two-step flow: communicate, then compute. Output lands in the same layout as input , which is ideal for iterative methods like the Power Method.

Scalability: constant efficiency as long as each processor keeps a constant number of rows.

  • has to be kept constant

Processes form a 1-D virtual mesh . Let Each process is assigned a block of consecutive rows of . The vectors and are also mapped block-wise across the processes, each process holding a block of consecutive elements. Algorithm RowWiseMVM:

  • Ph.1 (AAB): Each process sends its part of to all other processes (all-to-all broadcast). After this, every process holds the full .
  • Ph.2 (local compute): For all in parallel, computes for - that is dot products, each of length . The output ends up mapped block-wise across processes in exactly the same way as the input . This is important for iterative algorithms like the Power Method that repeatedly apply MVM: no remapping of the vector is needed between iterations.

Row-wise MVM is the simplest case: one AAB of the vector, then purely local dot products. Communication and computation are cleanly separated.

Time complexity and scalability of RowWiseMVM

Local computation in Phase 2: since each process performs dot products of vectors of length , totaling multiply-adds. Communication in Phase 1 is the latency of an AAB of numbers, which depends on the topology and switching technology. Two examples:

  • Full-duplex 4-port SF with noncombining TADT-based AAB:
  • Full-duplex 1-port SF with combining AAB (AAB by dimensions): In all such cases the total time has the form Efficiency: holds when Constant efficiency is preserved as long as grows no faster than , i.e., can be kept constant - which means a constant number of matrix rows per processor.

Row-wise MVM has outstanding scalability: constant efficiency can be obtained even with a constant number of matrix rows per processor.

Column-wise mapping

Column-wise

Slice into vertical strips - each process gets full columns and the matching chunk of . Now each process has everything it needs to compute partial contributions to every element of , but no element is complete. So the order flips: compute first, then communicate. Phase 2 runs simultaneous reductions (one per output chunk, each rooted at a different process) that sum the partials into the final .

  • compute part: each process multiplies all elements in the column by the assigned value
    • each process produces the partial output of the (but no output is finished)
    • to get the complete , we have to sum the rows (the partial results)
  • parallel reductions
    • all reductions are happening in parallel (each is rooted in a different process, and therefore ends up in the ) - the pink diagonal

Same asymptotic cost and scalability as row-wise - the choice between them is usually driven by what fits the surrounding workflow (which mapping fits better a larger workflow).

Processes form a 1-D virtual mesh . Let . Initial distribution:

  • owns subvector of .
  • owns columns of . Final distribution: owns subvector . The situation is reciprocal to row-wise: the order of computation and communication is reversed. Algorithm ColumnWiseMVM:
  • Ph.1 (local compute): For all in parallel, computes its local contributions to all elements of . Each has all rows but only a column-slice of , so it can produce partial values for every , but they are incomplete.
  • Ph.2 (parallel reductions): For all in parallel, becomes the root of a row-wise reduction with operation , accumulating the partial dot products from all to produce the final values of . Phase 2 consists of simultaneous reductions, each rooted at a different process, each carrying out the same operation on different data. On an underlying topology with sufficient bandwidth (e.g., a 1-D SF mesh with pipelining, or any standard hypercube), all reductions can be executed concurrently. The asymptotic complexity is the same as for row-wise mapping: time complexity and scalability are essentially identical.

Row-wise and column-wise mappings have the same asymptotic behavior. The choice between them is driven by other factors (e.g., which mapping fits a larger workflow).

Checkerboard mapping

Checkerboard

Arrange processes in a grid and give each one a square tile of . The vectors live in the rightmost column of processes (the vector is on the right, so naturally the processes that span the square tiles on the right also contain the vector ).

  • each process owns a -submatrix

Four phases: (1) boundary column ships its chunks to the diagonal (2) each diagonal process broadcasts its chunk down its column

  • so each row has all , , etc. to multiply it with the values it has (phase 3)
  • we need the same value along the whole column

(3) every process multiplies its tile by the subvector it received

  • it receives a vector chunk of the vector (it is a small local matrix to vector multiplication and the result is also a vector)

(4) each row of processes reduces its partial sums back to the rightmost column to form .

More communication phases but each one is smaller and runs in parallel across rows or columns, so the total cost and scalability match the striped mappings: constant efficiency requires .

Processes form a virtual 2-D mesh . Each process owns a -submatrix of . For the input/output vectors, the standard assumption used here is: vectors and are mapped to the last column of the virtual 2-D mesh (the rightmost column of processes). Other choices are possible; this is just the convention used in the analysis. Algorithm CheckerBoardMVM:

  • Ph.1: For all in parallel, boundary processor sends its part of to the diagonal processor .
  • Ph.2: For all in parallel, broadcasts the received part of within its column.
  • Ph.3: For all in parallel, multiplies its local submatrix of with its received subvector of .
  • Ph.4: For all in parallel, the processors in row perform a parallel reduction with root , summing the partial results into the final . Note that Phase 2 is not an inversion of a single broadcast - it is independent column-wise broadcasts running in disjoint column-sub-topologies in parallel. Similarly, Phase 4 consists of independent row-wise reductions running in disjoint row-sub-topologies in parallel.

Time complexity and scalability of CheckerBoardMVM

Each process owns a -submatrix.

  • Phase 3 (local arithmetic): parallel arithmetic operations.
  • Phase 1 complexity is of the same order (SF) or lower order (WH) than Phase 2 and can be ignored.
  • Phase 4 has the same asymptotic complexity as Phase 2 (broadcast and reduction run on the same data sizes on the same topology; the small constant from the addition operations does not change the asymptotic).
  • Phase 2 is an OAB of numbers; its latency depends on the underlying topology and switching. Example: SF mesh : which is the same asymptotic order as in RowWiseMVM. Hence The scalability is essentially the same as for the striped mappings on the same hardware: assuming identical local node performance and identical communication links, all three mappings give asymptotically equivalent total time complexity.

Comparative summary

  • Row-wise: AAB of first, then local dot products. Output mapped same as input .
  • Column-wise: local partial products first, then simultaneous reductions to produce the output.
  • Checkerboard: send from boundary to diagonal, broadcast in columns, local submatrix-subvector multiply, reduce in rows.

On comparable underlying hardware (same topology of given size, same node performance, same link performance), all three have the same asymptotic time complexity: and the same scalability: constant efficiency requires for some constant , equivalently a constant number of matrix rows per processor (or, for checkerboard, const).

Potential exam questions

  • State the dense MVM problem and define the matrix and vector dimensions used in the analysis. What does denote and why is the matrix written as ?
  • Describe in detail the RowWiseMVM algorithm. What is the size of each vector block? Which collective communication operation is used in Phase 1, and what is the data layout of at the end?
  • Derive the time complexity of Phase 2 of RowWiseMVM. Show that corresponds to dot products of length .
  • For RowWiseMVM on a full-duplex 4-port SF with noncombining TADT-based AAB, derive and explain why it simplifies to asymptotically.
  • For RowWiseMVM on a full-duplex 1-port SF with combining AAB by dimensions, derive and show that it is also asymptotically.
  • Starting from , derive the isoefficiency condition for RowWiseMVM and conclude what constraint this places on the relationship between and . Explain why this implies constant efficiency with a constant number of matrix rows per processor.
  • Describe ColumnWiseMVM. Why is the order of computation and communication reversed compared to row-wise? What is the structure of Phase 2 (how many simultaneous reductions, and on what data)?
  • Explain why RowWiseMVM and ColumnWiseMVM have asymptotically the same complexity and scalability, despite the different communication patterns.
  • Describe all four phases of CheckerBoardMVM. In particular, explain why Phase 2 is independent column broadcasts rather than a single global broadcast.
  • Why must the input vector first be sent from the last column of processes to the diagonal in Phase 1 of CheckerBoardMVM?
  • Derive the time complexity of Phase 3 of CheckerBoardMVM. Why does the complexity of Phase 1 not contribute asymptotically?
  • For CheckerBoardMVM on an SF mesh, show that (where is the cost of Phase 2 OAB), and conclude that the total time complexity is of the same order as RowWiseMVM.
  • State and justify the scalability condition for CheckerBoardMVM: under what relationship between and is constant efficiency preserved?
  • Compare the three mappings (row-wise, column-wise, checkerboard) in terms of: (a) total time complexity, (b) scalability, (c) the structure of communication, (d) the mapping of the output vector. Are there practical reasons to prefer one over the others despite their asymptotic equivalence?