PDP - AAS - communication latency lower bounds

Scope of the question

All-to-all scatter (AAS), also known as complete exchange or all-to-all personalized communication, is the most demanding of all collective communication operations. Every node sends $N - 1$ distinct (personalized) packets, one for each other node, and receives $N - 1$ distinct packets, one from each other node. The total number of exchanged packets is $N (N - 1)$. The canonical example is the transposition of a row-mapped matrix $A_{N, N}$: when row $i$ of $A$ is stored at process $P_{i}$ with $a_{i, j}$ at local position $j$, performing the transpose requires every process $P_{i}$ to send element $a_{i, j}$ to process $P_{j}$, which is exactly an AAS pattern.

This question focuses on the lower bounds for the communication latency of AAS. Because the data volume is so large, individual distances and per-round costs matter less than the network's capacity to carry traffic, both globally (network bandwidth) and across critical cuts (bisection bandwidth). Both bounds derive from purely combinatorial counting arguments and apply to any switching technology.

The two governing constraints

Two structural properties of the network govern the AAS lower bound, and both must be considered: in many graphs the two bounds coincide, but in graphs with asymmetric bottlenecks, one or the other dominates. The two bounds are:

the network bandwidth bound, based on the total number of edges in the network
the bisection bandwidth bound, based on the capacity of the cut separating the network into two halves

The corresponding MPI call for AAS is MPI_Alltoall.

AAS is so traffic-intensive that individual path lengths, and even the number of forwarding steps, are secondary. What limits performance is how much traffic the network can carry in total and how much it can carry across its narrowest bisecting cut.

Lemma 13: AAS lower bound from network bandwidth

Let $G$ be an $N$-node full-duplex network with $m = |E(G)|$ edges. The lower bound on the communication latency of AAS in which the nodes exchange packets of size $μ$ is:

$τ_{AAS}^{N}(G, μ) = \frac{1}{2m} \left( \sum_{u \neq v} dist_{G}(u, v) \right) μ t_{m}$

Proof:

AAS requires the network to transmit a packet between each ordered pair of distinct nodes $u$ and $v$. In the best case (using a shortest path), this packet uses $dist_{G}(u, v)$ edge crossings.
Summing over all ordered pairs, the total number of edge crossings required is at least $\sum_{u \neq v} dist_{G}(u, v)$.
Each edge is full-duplex, so it can be used in both directions simultaneously. The per-round capacity of the network is therefore $2m$ packet crossings.
The total number of rounds needed is at least the ratio $\frac{\sum_{u \neq v} dist_{G}(u, v)}{2m}$, and each round moves packets of size $μ$ and therefore contributes $μ t_{m}$ to the latency.

The bound formalizes a simple intuition: sum all the shortest-path distances across all source-destination pairs and divide by twice the edge count, then multiply by the per-packet transmission cost. This gives the best the graph can theoretically provide for AAS, irrespective of algorithm.
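The counting argument of Lemma 13 can be checked numerically on small graphs. The following is a minimal sketch, assuming the network is given as an adjacency dictionary of an undirected connected graph; the names `bfs_distances`, `network_bandwidth_bound`, `hypercube` and `path` are illustrative helpers, not part of the course material. The result is expressed in units of $μ t_{m}$.

```python
from collections import deque
from fractions import Fraction


def bfs_distances(adj, source):
    """Shortest-path (hop) distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def network_bandwidth_bound(adj):
    """Return sum_{u != v} dist(u, v) / (2m), i.e. the bound in units of mu * t_m."""
    total_crossings = 0
    for u in adj:
        dist = bfs_distances(adj, u)
        if len(dist) != len(adj):
            raise ValueError("graph must be connected")
        total_crossings += sum(dist.values())  # dist[u] == 0 adds nothing
    m = sum(len(neighbours) for neighbours in adj.values()) // 2
    return Fraction(total_crossings, 2 * m)


def hypercube(n):
    """Adjacency of Q_n: nodes are integers, neighbours differ in one bit."""
    return {u: [u ^ (1 << k) for k in range(n)] for u in range(2 ** n)}


def path(z):
    """Adjacency of the 1D mesh M(z) on nodes 0..z-1."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < z] for i in range(z)}


if __name__ == "__main__":
    print(network_bandwidth_bound(hypercube(3)))  # bound for Q_3
    print(network_bandwidth_bound(path(8)))       # bound for M(8)
```

For `hypercube(3)` the function returns 4 and for `path(8)` it returns 12, i.e. $4 μ t_{m}$ and $12 μ t_{m}$ respectively.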

Lemma 14: AAS lower bound from bisection bandwidth

Let $G$ be an $N$-node full-duplex network with edge bisection width $bw_{e}(G)$. The lower bound on the communication latency of AAS in which the nodes exchange packets of size $μ$ is:

$τ_{AAS}(G, μ) = \frac{⌈N/2⌉ ⌊N/2⌋ μ t_{m}}{bw_{e}(G)}$

Proof:

Consider a partition of $V(G)$ into two halves $V_{1}$ and $V_{2}$ of sizes $⌈N/2⌉$ and $⌊N/2⌋$ that realizes the edge bisection width.
During AAS, each node in $V_{1}$ sends one packet to each node in $V_{2}$, and vice versa. So $⌈N/2⌉ \cdot ⌊N/2⌋$ packets must cross from $V_{1}$ to $V_{2}$ (and the same number in the reverse direction, but these can share the full-duplex edges).
By the choice of the partition, exactly $bw_{e}(G)$ edges join $V_{1}$ and $V_{2}$: the edge bisection width is by definition the minimum number of cross-edges over all bisecting cuts, so this cut is the narrowest one and gives the strongest bound.
All cross-traffic must pass through these $bw_{e}(G)$ edges, each carrying at most one packet per direction per round, so the time needed is at least $\frac{⌈N/2⌉ ⌊N/2⌋ μ t_{m}}{bw_{e}(G)}$.

The bound has a clean divide-and-conquer interpretation: split the network into two halves; however an AAS algorithm is organized, the nodes on the left must collectively exchange everything with the nodes on the right, and the only capacity for that crossing is the bisection width.
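The bisection argument can likewise be tried out on small graphs. The sketch below is an illustration only: `edge_bisection_width` is a hypothetical brute-force helper that enumerates every bisection (exponential in $N$, so usable only for a handful of nodes), and `bisection_bound` evaluates the formula of Lemma 14 in units of $μ t_{m}$. The graph is again assumed to be an undirected adjacency dictionary.

```python
from fractions import Fraction
from itertools import combinations


def edge_bisection_width(adj):
    """Brute force: minimum number of edges cut over all bisections of the graph."""
    nodes = sorted(adj)
    best = None
    # Picking the floor(N/2)-sized side covers every bisection;
    # the complement has ceil(N/2) nodes.
    for half in combinations(nodes, len(nodes) // 2):
        side = set(half)
        cut = sum(1 for u in side for v in adj[u] if v not in side)
        if best is None or cut < best:
            best = cut
    return best


def bisection_bound(n_nodes, bw_e):
    """Return ceil(N/2) * floor(N/2) / bw_e, i.e. the bound in units of mu * t_m."""
    return Fraction(((n_nodes + 1) // 2) * (n_nodes // 2), bw_e)


def hypercube(n):
    """Adjacency of Q_n: nodes are integers, neighbours differ in one bit."""
    return {u: [u ^ (1 << k) for k in range(n)] for u in range(2 ** n)}


def path(z):
    """Adjacency of the 1D mesh M(z) on nodes 0..z-1."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < z] for i in range(z)}


if __name__ == "__main__":
    for name, graph in (("Q_3", hypercube(3)), ("M(8)", path(8))):
        bw = edge_bisection_width(graph)
        print(name, "bw_e =", bw, "bound =", bisection_bound(len(graph), bw))
```

For $Q_{3}$ the search finds $bw_{e} = 4$ and the bound is $4 μ t_{m}$; for $M(8)$ it finds $bw_{e} = 1$ and the bound is $16 μ t_{m}$.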

Comparison and dominance

The two bounds are derived from different structural properties of $G$ and capture different bottlenecks:

the network bandwidth bound captures the total work-to-capacity ratio averaged across all edges
the bisection bandwidth bound captures the worst single cut that all cross-traffic must traverse

In many regular topologies the two bounds coincide (or differ only by small constants), because the per-edge load and the bisection load are balanced by symmetry. For example, in a hypercube $Q_{n}$ both bounds give the same answer. In graphs with structurally asymmetric bottlenecks, e.g. a long thin mesh $M(1, z)$ or a tree topology with a narrow bisection, the bisection bound is the tighter one because a single edge or a small set of edges must carry all the cross-traffic.

The right way to think about it: both bounds must hold, so the true lower bound is the maximum of the two. An algorithm achieving the maximum on a given graph is communication-optimal in the strongest sense.
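As a worked check of the two cases, the bounds can be evaluated in closed form. The derivation below is a sketch that relies on the standard facts $|E(Q_{n})| = n 2^{n-1}$, $bw_{e}(Q_{n}) = 2^{n-1}$, $|E(M(z))| = z - 1$ and $bw_{e}(M(z)) = 1$, which are assumed here rather than taken from the text above.

For the hypercube $Q_{n}$ with $N = 2^{n}$:

$$
\sum_{u \neq v} dist_{Q_{n}}(u, v) = 2^{n} \sum_{k=1}^{n} k \binom{n}{k} = 2^{n} \cdot n 2^{n-1} = n 2^{2n-1}
$$

$$
\frac{n 2^{2n-1}}{2 \cdot n 2^{n-1}} μ t_{m} = 2^{n-1} μ t_{m} = \frac{N}{2} μ t_{m},
\qquad
\frac{(N/2)(N/2)}{2^{n-1}} μ t_{m} = \frac{N}{2} μ t_{m}
$$

so the network bandwidth bound and the bisection bandwidth bound coincide exactly.

For the 1D mesh $M(z)$:

$$
\sum_{u \neq v} dist_{M(z)}(u, v) = 2 \sum_{d=1}^{z-1} d (z - d) = \frac{(z-1) z (z+1)}{3}
$$

$$
\frac{(z-1) z (z+1)}{3 \cdot 2 (z-1)} μ t_{m} = \frac{z (z+1)}{6} μ t_{m},
\qquad
\frac{⌈z/2⌉ ⌊z/2⌋}{1} μ t_{m} \approx \frac{z^{2}}{4} μ t_{m}
$$

so the bisection bound is the larger of the two, by a factor approaching $3/2$ as $z$ grows. For $z = 8$ the values are $12 μ t_{m}$ and $16 μ t_{m}$.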

Why distances don’t enter the latency in the usual way

Unlike OAB or OAS, where distance terms ($γ t_{d}$ in the WH case, $δ μ t_{m}$ in the SF case) play a leading role, the AAS bounds above contain only the $μ t_{m}$ term and no separate $t_{d}$ or startup term. This is because:

the data volume per node is $(N - 1) μ$, which dominates over any per-message startup for reasonable $N$
the number of distinct rounds is large (potentially $N - 1$ on a 1-port network), so the startup cost is folded into the round count rather than appearing separately
distances do affect the bound, but through the $\sum_{u \neq v} dist_{G}(u, v)$ aggregate, not as a separate $δ$ multiplier

For a more refined analysis on a specific switching technology, the bounds would be augmented with the appropriate $t_{s}$ and $t_{d}$ terms following the standard WH or SF latency models, but the AAS-specific lower bound structure is dominated by the bandwidth arguments above.

Order of magnitude: AAS vs the other CCOs

Putting the AAS bandwidth bound in context:

OAB: one source sends $μ$ bytes, total traffic $O(μ \cdot (N - 1))$ across the network
OAS / AOG: one source sends or receives $N - 1$ personalized packets, total traffic $O(μ \cdot (N - 1))$
AAB / AAG: every node broadcasts $μ$ bytes to all others, total traffic $O(N \cdot μ \cdot (N - 1))$ (each of the $N$ packets is replicated to all $N - 1$ other nodes)
AAS: every node sends $N - 1$ distinct packets to $N - 1$ destinations, total traffic $O(N \cdot μ \cdot (N - 1)) = O(N^{2} μ)$ but with $N (N - 1)$ distinct packets (none of them duplicates)

The AAS volume of distinct content is what makes it the most demanding CCO and what justifies the dominance of the bandwidth-based bounds over latency-based bounds.

Application example: matrix transposition

The MPI call MPI_Alltoall is exactly the operation needed to transpose a row-distributed matrix. If matrix $A$ is mapped row-wise on $N$ processors (process $P_{i}$ holds row $i$ , with $a_{i, j}$ at local position $j$ ), then transposing $A$ requires:

$P_{i}$ must send element $a_{i, j}$ to $P_{j}$ for every $j \neq i$
after the exchange, $P_{j}$ holds what used to be column $j$ of $A$, which is now row $j$ of $A^{T}$

This is an instance of AAS with packet size $μ =$ (element size) and $N$ processes, and the lower bounds above are exactly the lower bounds on the time to transpose a distributed matrix. The matrix transposition example will reappear later in the course in the context of dense matrix multiplication and the Cannon algorithm.
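The data movement can be illustrated without an MPI installation. The sketch below simulates the exchange pattern in plain Python: `all_to_all` is a hypothetical stand-in that mimics the semantics of MPI_Alltoall with one element per destination (slot $j$ of sender $i$ ends up in slot $i$ of receiver $j$); it models only where the data goes, not the communication cost.

```python
def all_to_all(send_buffers):
    """Simulate an all-to-all exchange: recv[j][i] = send[i][j].

    send_buffers[i] is the buffer of process P_i, with one packet per
    destination process (slot i is the packet P_i keeps for itself).
    """
    n = len(send_buffers)
    if any(len(buf) != n for buf in send_buffers):
        raise ValueError("every process needs exactly one packet per process")
    recv_buffers = [[None] * n for _ in range(n)]
    for i, buf in enumerate(send_buffers):
        for j, packet in enumerate(buf):
            recv_buffers[j][i] = packet  # P_i -> P_j, stored at local position i
    return recv_buffers


if __name__ == "__main__":
    # Row-mapped 4 x 4 matrix: process P_i holds row i, a[i][j] = 10 * i + j.
    A = [[10 * i + j for j in range(4)] for i in range(4)]
    A_T = all_to_all(A)
    for row in A_T:
        print(row)
```

After the call, the buffer of $P_{j}$ contains column $j$ of the original matrix, so the list of buffers is the row-mapped transpose $A^{T}$.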

Potential exam questions

Define the all-to-all scatter (AAS) operation. How many distinct packets are exchanged in total? What is the corresponding MPI call?
Give the canonical application example of AAS (matrix transposition) and explain how the row-wise mapping maps to the AAS pattern.
State Lemma 13 (the AAS lower bound from network bandwidth) and prove it.
State Lemma 14 (the AAS lower bound from bisection bandwidth) and prove it.
Compare the two AAS lower bounds. For which topologies do they coincide and for which does one dominate the other? Give an example of each case.
Why is the AAS lower bound dominated by data-volume terms rather than by distance or startup terms, unlike the bounds for OAB and OAS?
Compute the network-bandwidth and bisection-bandwidth lower bounds for AAS on a hypercube $Q_{n}$ , on a 2D torus $K (z, z)$ , and on a 1D mesh $M (z)$ . Identify the dominant bound in each case.
Why is the AAS volume of distinct content larger than the AAB volume, even though both involve $N$ sources and $N - 1$ destinations per source?
Define the edge bisection width $bw_{e} (G)$ . Why does it appear in the AAS lower bound and not in the OAB lower bound?
Show that for a 1D mesh $M(z)$ the bisection-bandwidth bound is tighter than the network-bandwidth bound and explain why structurally.
Derive the bandwidth lower bound for AAS on $Q_{n}$ explicitly: compute $\sum_{u \neq v} dist(u, v)$ and $|E(Q_{n})|$, then evaluate the ratio.
