Data center networks are experiencing a significant surge in resource demand due to the rapid growth of AI/ML applications. ChatGPT has rapidly emerged as one of the fastest-growing applications, and its core technology, Large Language Models (LLMs), has garnered significant attention in various research areas, including networking.
Training LLMs with trillions of parameters demands substantial AI acceleration resources, such as GPUs. With the continuous growth in LLM size and the slowdown of Moore's Law, GPUs must be interconnected to form a high-capacity GPU cluster, such as NVIDIA DGX SuperPOD™ (see 2 in References), to meet these requirements. During LLM training, we must execute stochastic gradient descent (SGD) iterations and make efficient use of both data and model parallelism by seamlessly transferring data between GPUs, so that the computationally intensive training completes within a reasonable timeframe. This presents an intriguing, yet challenging, task for data center networking technology.
One central issue we aim to address is the following:
How can we efficiently transport LLM workload data from one GPU to another in a distributed LLM training setup?
This question can be further divided into several important sub-problems, which we address in the discussion below.
Here, we present a general network model with generalized routing metrics designed for data centers tailored to LLM training. This model is built upon high-bandwidth GPU domains interconnected by optimized rails. We utilize a global distributed registry to collect and distribute normalized traffic health scores for each domain and a sorted list of health scores for all rails. These scores are then employed to efficiently calculate network path traffic scores.
Additionally, we introduce a health-ratio metric for each GPU, facilitating efficient LLM traffic routing decisions based solely on the source and destination health ratios. We find that the optimal path from the source to the destination depends only on whether the source's ratio metric is larger than the destination's, regardless of the intermediate nodes in between, which enables efficient AI traffic routing between GPUs.
Our approach can be uniformly applied to different types of LLM traffic, including intra-domain traffic, same-rail traffic, and across-rail traffic. Furthermore, we extend this mechanism to determine the optimal utilization of remote rails for routing traffic when local rails become congested.
In the following discussion, we will use a state-of-the-art underlying network connection architecture known as "rail-only connections" (see 2 in References for details and benefits) as our reference network topology. The rail-only connection topology was proposed by researchers from MIT and Meta and is optimized for LLM training workloads in a data center. Our results can also be extended to a traditional fat-tree Clos data center network architecture, which is supported in legacy data center networking solutions.
In this topology (Figure 1), we have M high-bandwidth domains, each of which contains K GPUs with high-speed any-to-any interconnections (such as NVLink/NVSwitch) and K rail switches. Each rail switch, or simply rail, connects the M GPUs with the same order number (referred to as ranking) within each domain.
Let G(d, g) represent the g-th GPU in domain d. To support distributed DNN (Deep Neural Network) training and inference with SGD iterations and parallelization strategies, each GPU has two types of network connections: one, referred to as the "d-interface," facilitates any-to-any interconnect within a high-bandwidth domain, while the other, referred to as the "r-interface," is an RDMA-capable NIC connected to a rail switch. Our goal is to efficiently move LLM workloads between GPUs. In essence, we aim to address the following question:
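To make the notation concrete, here is a minimal sketch, in Python, of how a GPU G(d, g) and its two interfaces could be modeled. The class name, the component naming scheme, and the cluster dimensions are illustrative assumptions, not part of any particular implementation.

```python
from dataclasses import dataclass

# Hypothetical cluster dimensions: M high-bandwidth domains, K GPUs (and rails) per domain.
M_DOMAINS = 8
K_GPUS_PER_DOMAIN = 8

@dataclass(frozen=True)
class GPU:
    """G(d, g): the g-th GPU in high-bandwidth domain d."""
    d: int  # domain index, 0 <= d < M_DOMAINS
    g: int  # rank within the domain; also the index of the rail it attaches to

    def d_interface(self) -> str:
        # Any-to-any interconnect (e.g., NVLink/NVSwitch) within domain d.
        return f"domain-{self.d}"

    def r_interface(self) -> str:
        # RDMA-capable NIC attached to rail switch g.
        return f"rail-{self.g}"
```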
How can we efficiently route LLM traffic from the source GPU G1 (G(d1, g1)) to the target GPU G2 (G(d2, g2))?
Let us tackle this, starting with the simple cases and moving to the more involved ones:
Case 1: G1 and G2 are in the same domain, that is, d1 = d2. In that case, we should just use G1’s d-interface to utilize the high-bandwidth interconnect (yellow traffic in Figure 2).
Case 2: G1 and G2 are in different domains, but they are on the same rail, i.e., g1 = g2. In that case, we can typically use G1's r-interface to route the traffic to G2 (green traffic in Figure 2). We will discuss some intricacies when the corresponding rail switch is busy.
Case 3: G1 and G2 are in different domains and on different rails.
This is a bit more involved, as at least two hops are needed to reach G2 from G1. We can take the d-interface first, route the traffic to G(d1, g2) within the same domain d1 as the next hop, and from there take the r-interface to reach G2. Let us call this the dr-path (Figure 3).
Alternatively, we can take the r-interface to route the traffic to G(d2, g1) in domain d2, and then use the d-interface to reach G2 within domain d2. We will call this the rd-path (Figure 4).
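As a sketch, the three cases and the two Case 3 candidates can be enumerated as sequences of domain/rail components, reusing the hypothetical GPU class from the earlier snippet; the component naming is an illustrative assumption.

```python
def candidate_paths(src: GPU, dst: GPU) -> list[list[str]]:
    """Enumerate candidate paths as sequences of domain/rail components."""
    if src.d == dst.d:
        return [[f"domain-{src.d}"]]                    # Case 1: same domain
    if src.g == dst.g:
        return [[f"rail-{src.g}"]]                      # Case 2: same rail
    dr_path = [f"domain-{src.d}", f"rail-{dst.g}"]      # Case 3: d-interface first
    rd_path = [f"rail-{src.g}", f"domain-{dst.d}"]      # Case 3: r-interface first
    return [dr_path, rd_path]
```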
Which route to take depends on the dynamic traffic conditions within the LLM cluster. Next, we will develop an LLM cluster traffic condition model to capture and update the traffic conditions in a cohesive manner, enabling the creation of an efficient LLM workload routing mechanism.
We will begin by introducing several metrics used to represent the traffic condition of components within the LLM cluster. These metrics are referred to as "traffic health scores" or simply "health scores," denoted as "h-scores," for LLM domains or rails. H-scores are integer values ranging from 0 to 100, where 100 represents 100% healthy, indicating no congestion in the domain or on the rail, while 0 signifies 0% healthy, indicating complete blockage of the concerned component.
There should be a distributed observability/monitoring architecture in place within the LLM cluster. This architecture includes a cluster controller or master node responsible for running a global traffic monitoring service, which we'll refer to as "healthd." Each rail switch or interconnection domain should monitor and collect its traffic load, periodically sending reports to healthd. healthd then consolidates and normalizes the data from all rails and domains, assigns each a value between 0% and 100% as its traffic health score, and stores this information in a global registry (such as etcd in a Kubernetes cluster, Apache ZooKeeper, or Consul). Note that our mechanism is independent of any specific health score calculation mechanism, as congestion detection can be platform-specific (e.g., ECN/PFC or DCQCN). Also note that h-score collection and distribution occur in the control plane and do not affect real-time traffic routing or forwarding speed in the data plane. Additionally, traffic trends develop over time, so the distributed calculation of h-scores does not have to be real-time.
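The following is a minimal control-plane sketch of what healthd could do, assuming raw utilization reports in [0, 1] and using a plain in-memory dictionary in place of a real registry such as etcd, ZooKeeper, or Consul; the report format and the normalization rule are illustrative assumptions.

```python
# In-memory stand-in for the global registry (etcd / ZooKeeper / Consul in practice).
registry: dict[str, float] = {}   # component name -> normalized h-score in [0.0, 1.0]

def publish_h_scores(load_reports: dict[str, float]) -> None:
    """Consolidate raw per-component utilization reports into normalized h-scores.

    load_reports maps a component name (e.g., 'rail-3', 'domain-1') to the
    utilization fraction it reported; healthier means less loaded."""
    for component, utilization in load_reports.items():
        h_score = max(0.0, min(1.0, 1.0 - utilization))   # clamp into [0, 1]
        registry[component] = h_score

# Example: rail-3 reports 70% utilization, domain-1 reports 20%.
publish_h_scores({"rail-3": 0.70, "domain-1": 0.20})
# registry is now approximately {'rail-3': 0.3, 'domain-1': 0.8}
```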
Additionally, each GPU syncs up with the global registry and maintains a local registry cache, allowing each GPU to query the health score of a specific rail “r” as H(r) or the health score of a domain "d" as H(d) at any given time. After normalization, we have 0 <= H(r) <= 1 and 0 <= H(d) <= 1. A lower health score value represents a more congested condition.
A network path p in an LLM cluster is a sequence of concatenated domain/rail components, denoted as p = c1 ° c2 ° … ° cm. We will denote each component c in the path as c ∈ p.
Now we can define the h-score of a network path p as the product of the h-scores of the components it traverses: H(p) = H(c1) * H(c2) * … * H(cm), i.e., the product of H(c) over all c ∈ p.
With this formula, when one component’s health-score decreases, the path’s health-score decreases proportionally. Since all the components are chained in sequence, when one component’s health score is nearly zero (blocked), the whole path’s health-score would be zero (blocked) as well. When all components’ health scores are 1 (100% healthy), the whole path’s health-score would be 1 (100% healthy). Therefore, this simple formula reasonably simulates the traffic condition of the routing path.
For instance, in Case 1, since the path between G1 and G2 only travels through domain d, its health score is H(d). In Case 2, the path between G1 and G2 only travels through rail r, so its health score is H(r). If a path travels through domain d1 first, then goes through rail r1 to reach domain d2, and finally reaches the destination GPU via domain d2, then this 3-hop path's health score is H(d1) * H(r1) * H(d2). Note that if the destination is connected directly to rail r1, then the destination domain's traffic condition should not be involved in the path health-score calculation, so H(p) = H(d1) * H(r1).
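Here is a short sketch of the path h-score calculation, reusing the component naming from the snippets above; the local cache contents are made-up values for illustration.

```python
import math

def path_h_score(path: list[str], h: dict[str, float]) -> float:
    """H(p): the product of H(c) over every component c in path p.
    `h` is the GPU's local registry cache of normalized h-scores."""
    return math.prod(h[c] for c in path)

# Illustrative cache contents (made-up values), then the two Case 3 candidates
# for src = G(1, 1) and dst = G(2, 2):
h = {"domain-1": 0.9, "rail-1": 0.7, "domain-2": 0.8, "rail-2": 0.5}
print(path_h_score(["domain-1", "rail-2"], h))   # dr-path: 0.9 * 0.5 ≈ 0.45
print(path_h_score(["rail-1", "domain-2"], h))   # rd-path: 0.7 * 0.8 ≈ 0.56, the healthier path
```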
Besides the h-score, we will introduce another crucial metric, the health-ratio, or h-ratio, 𝛾, for each GPU G(d, g) in the cluster:
𝛾(G(d, g)) = H(g) / H(d).
We assume an LLM domain's health score never reaches 0; otherwise, the corresponding GPU's 𝛾 value would be infinite, meaning the domain is completely blocked and no LLM operations can be performed there.
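Continuing the sketch, a GPU's h-ratio is just a lookup-and-divide on the local cache (again using the hypothetical naming from the earlier snippets):

```python
def h_ratio(gpu: GPU, h: dict[str, float]) -> float:
    """gamma(G(d, g)) = H(rail g) / H(domain d); assumes H(domain d) > 0."""
    return h[f"rail-{gpu.g}"] / h[f"domain-{gpu.d}"]
```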
Now we can easily determine the route at G1 in Case 3 with the h-ratio metric introduced above (see Figure 5): if 𝛾(G1) > 𝛾(G2), G1 routes the traffic through its r-interface (the rd-path); otherwise, G1 routes the traffic through its d-interface (the dr-path).
To verify the correctness of this approach, note that when we choose G1's r-interface as the next hop, the resulting path p1 can be represented as p1 = g1 ° d2; the other choice, p2, can be represented as p2 = d1 ° g2. So we have H(p1) = H(g1) * H(d2) and H(p2) = H(d1) * H(g2). If 𝛾(G1) > 𝛾(G2), then by the definition of 𝛾 we have H(g1) / H(d1) > H(g2) / H(d2), that is, H(g1) * H(d2) > H(g2) * H(d1).
Therefore H(p1) = H(g1) * H(d2) > H(d1) * H(g2) = H(p2), and choosing the r-interface means choosing the path with the higher h-score.
In other words, when the destination GPU's h-ratio is lower than the source GPU's h-ratio, we choose the r-interface: we take the r-interface when moving an LLM workload downstream in terms of h-ratio, and the d-interface when moving it upstream in terms of h-ratio.
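Putting the pieces together, a sketch of the per-GPU routing decision might look like the following; note that for Case 3 only the two h-ratios are compared, with no per-path enumeration required. The function names and cache layout are assumptions carried over from the earlier snippets.

```python
def next_interface(src: GPU, dst: GPU, h: dict[str, float]) -> str:
    """Pick the egress interface on the source GPU for an LLM workload transfer."""
    if src.d == dst.d:
        return src.d_interface()   # Case 1: stay on the high-bandwidth domain interconnect
    if src.g == dst.g:
        return src.r_interface()   # Case 2: same rail
    # Case 3: downstream in h-ratio -> r-interface (rd-path); upstream -> d-interface (dr-path).
    if h_ratio(src, h) > h_ratio(dst, h):
        return src.r_interface()
    return src.d_interface()

# With the illustrative cache above: gamma(G(1,1)) ≈ 0.78 > gamma(G(2,2)) ≈ 0.63,
# so the source takes its r-interface (the rd-path).
print(next_interface(GPU(d=1, g=1), GPU(d=2, g=2), h))   # -> 'rail-1'
```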
We propose a novel AI/ML workload routing architecture with generalized traffic condition metrics on a state-of-the-art data center network topology for generative AI training workloads such as LLMs.
Navigating AI in your enterprise? Learn more about how Outshift is empowering enterprises through AI.
References: