First 5 minutes of hell

Dragonfly is a topology that takes into account the different prices and properties of cables (short, metallic, are cheap, long, optic fibres, are expensive).

The structure has 3 levels:

  • a router (switch with many ports)
  • a group of routers (connected with metallic) - in one rack or a row of racks
  • a system of many groups (connected by optical cables) - across the data center

Three parameters of :

  • = how many routers in each group
  • = how many computers are attached to each router
  • = how many ports each router has to other groups (optical cables)
  • each router then has ports for all connected computers and ports for other routers in the same group and ports for connection to other groups
    • router radix =

Properties:

  • the total number of groups in the is at most (each group has links to other groups +1 (the one group itself))
  • the total number of routers is the total number of groups multiplied by ( routers in each group)
  • the total number of compute nodes is the total number of routers mutliplied by
  • each group behaves as a virtual router with a very high port count ()

Routing:

  • at most 3 router hops (+ 2 hops from the computer/node to the router):
    • hop1: firstly, send the message from the source router to the other router, which owns the optical cable to the destination group (exactly one hop, because it is a complete graph)
    • hop2: cross the optical cable to the other group
    • hop3: jump from the entry router to the destination router
  • if the direct path is busy (congestion with a lot of messages), use an indirect path
    • randomly generate one intermediate router (from a different group) and send the message by a direct path to that router and the by the direct path from that intermediate router to the destination
      • cost: double latency
      • gain: predictable load balancing (randomization spreads the traffic)
    • this is called Valiant routing

The decision, which router in one group will be connected by the global (optical) link to which router is an important design question:

  • it affects e.g. bisection width

Context and motivation

The Dragonfly topology was proposed in 2008 and has many subvariants; the lecture covered the canonical (basic) one. It addresses a fundamental hardware reality: two link technologies are available, with very different cost / distance characteristics. Metallic (coax / electric) cables are cheap but only practical over short distances (order of meters). Optical fiber is required for long-distance links between groups, racks, or rooms, but is more expensive. The Dragonfly is designed hierarchically to exploit both: dense cheap connectivity locally via metallic links, sparse but direct optical links between groups.

It is also the most relevant modern HPC topology - the design is now the dominant choice for high-end machines (e.g., HPE Slingshot is the most recent contribution in this family).

Definition

The canonical Dragonfly is parameterised by three integers and is defined as a hierarchical graph with three levels: routers, groups, system.

  • = number of routers per group (connected via the intra-group network)
  • = number of compute nodes (leaves) attached to each router
  • = number of global optical ports per router (links to other groups)

The construction:

  1. Each router has ports to its compute nodes (leaves of the Dragonfly).
  2. Each router has up to ports for local connections to other routers within its group. The intra-group network can be a full graph (complete graph ) or another dense topology like a hypercube.
  3. Each router has ports for global optical links to other groups.
  4. The groups are interconnected so that each group has at least one direct optical link to every other group (i.e., the groups form a complete graph at the inter-group level).

A group acts as a virtual router with external (global) ports.

Counting formulas (canonical Dragonfly)

Total number of groups in the system:

The “+1” comes from the requirement that each group has at least one direct optical link to every other group: each group exposes global ports outward and needs to use each to reach a distinct other group, so there are at most groups in the whole system.

Total number of routers:

Total number of compute nodes (system size):

Ports per router:

  • ports to compute nodes (metallic, short)
  • ports for intra-group links (metallic)
  • ports for global links (optical)
  • Total metallic ports per router:
  • Total optical ports per router:

Total number of links in the system:

  • Bidirectional metallic local links (intra-group, assuming complete intra-group graph):
  • Bidirectional global optical links: (counted from one side; halve if counting unordered pairs of groups)

Virtual-router perspective: from outside, each group behaves as a single high-radix virtual router with external ports.

Example:

One group consists of routers connected as a complete graph (each router has local links to its peers). Each router has optical ports and compute nodes attached. The group exposes global ports - it behaves as a virtual router with 8 external ports.

For the full system :

  • groups
  • routers
  • compute nodes
  • metallic local links
  • global optical link-endpoints (i.e., 36 bidirectional optical cables)

Routing in

Assume the intra-group network is a complete graph (canonical case).

Minimal direct path: between any two routers in there exists a minimal direct path of length at most 3:

  1. At most 1 local link within the source group (to reach the router that has the global link to the destination group).
  2. At most 1 global optical link to the destination group.
  3. At most 1 local link within the destination group (to reach the router connected to the target compute node).

Hence the diameter (between routers) is at most 3, plus the one hop from compute node to its router on each end if measured node-to-node.

Valiant routing (indirect): if the direct path is busy or would cause a collision, an indirect path is used. It is implemented as a concatenation of two direct paths via a randomly chosen intermediate router:

  1. From source, route directly to a random intermediate router (possibly in any group).
  2. From the intermediate router, route directly to the destination.

This randomisation avoids deterministic conflicts at the cost of doubling the worst-case hop count (path length up to 6). Valiant routing works for any topology, with provable bounds on expected complexity. It is one of the key reasons Dragonfly is the mainstream modern HPC interconnect: it gracefully handles adversarial communication patterns.

An important design parameter beyond is which specific router in one group is connected by a global link to which specific router in another group. The choice affects properties like the bisection width but does not change , , , or the basic routing argument. The slides show with one possible global-link arrangement (the 9 groups arranged in a circle with global links forming a complex graph among them).

Properties summary

Hierarchical: 3 levels (router / group / system).

Hybrid link technology: cheap metallic local + expensive optical global, matching physical-distance constraints.

Virtual router abstraction: each group behaves like a single high-radix router with external ports, simplifying reasoning about inter-group routing.

Very low diameter (between routers): at most 3 hops on a direct path, with the canonical intra-group complete-graph design.

Scalability: the number of groups is bounded by , the total size by . To grow further one increases , , or (i.e., uses higher-radix routers).

Adaptive routing via Valiant: deterministic minimal-direct + randomized indirect Valiant fallback handles congestion.

Relationship to other modern topologies: Dragonfly is one of three families (Clos topologies, fat trees, Dragonfly) that dominate parallel HPC interconnects today. All three are based on high-radix routers and are often confused in the literature. HPE’s Slingshot is the most recent contribution in the Dragonfly family.

Potential exam questions

The questions below match the lecturer’s proof-heavy, definition-precise style.

  1. Define the canonical Dragonfly topology . Give the meaning of each of the three parameters, the three levels of the hierarchy, and describe the structure of a single group.
  2. Derive the formula for the maximum number of groups . Justify the “+1”.
  3. Derive the formulas for the total number of routers and total number of compute nodes in .
  4. How many ports of each type (metallic for compute nodes, metallic intra-group, optical global) does a single router in have? How many total metallic local links and global optical links does the system have?
  5. What does it mean that each group acts as a “virtual router”? How many external ports does this virtual router have?
  6. For : compute , , , the number of metallic local links, and the number of global optical links. How many ports does each group expose as a virtual router?
  7. Describe the minimal direct routing in when the intra-group network is a complete graph. What is the maximum number of hops between any two routers, and how does the path decompose?
  8. What is Valiant routing? Why is it needed in addition to direct routing in ? What is its worst-case hop count compared to direct routing?
  9. The slides note that the assignment of global links between specific routers in different groups is “an important design parameter”. Which derived property does this assignment affect, and which derived properties does it not affect (i.e., , , )?
  10. Why is Dragonfly designed as a hierarchy with two link types (metallic local + optical global)? Relate the answer to physical/technological constraints on cable distances.
  11. Why are Dragonfly topologies (together with Clos and fat trees) called “high-radix” interconnects? What hardware enabled their rise in modern HPC systems?
  12. Compare Dragonfly to a flat fat tree of the same node count . Discuss qualitatively the differences in cabling (optical vs metallic) and in routing.