Data center networks are experiencing a significant surge in resource demand due to the rapid growth of AI/ML applications. ChatGPT has rapidly emerged as one of the fastest-growing applications, and its core technology, Large Language Models (LLMs), has garnered significant attention in various research areas, including networking.
Training LLMs with trillions of parameters demands substantial AI acceleration resources, such as GPUs. With the continuous growth in LLM size and the slowdown of Moore's Law, GPUs must be interconnected to form a high-capacity GPU cluster, such as NVIDIA DGX SuperPOD™ (see 2 in References), to meet these requirements. During LLM training, we must execute stochastic gradient descent (SGD) iterations and make efficient use of both data and model parallelism by seamlessly transferring data between GPUs, so that the computationally intensive training completes within a reasonable timeframe. This presents an intriguing, yet challenging, task for data center networking technology.
One central issue we aim to address is the following:
How can we efficiently transport LLM workload data from one GPU to another in a distributed LLM training setup?
This question can be further divided into several important sub-problems, which we address in the discussion below.
Here, we present a general network model with generalized routing metrics designed for data centers tailored to LLM training. This model is built upon high-bandwidth GPU domains interconnected by optimized rails. We utilize a global distributed registry to collect and distribute normalized traffic health scores for each domain and a sorted list of health scores for all rails. These scores are then employed to efficiently calculate network path traffic scores.
Additionally, we introduce a health-ratio metric for each GPU, facilitating efficient LLM traffic routing decisions based solely on the source and destination health ratios. We find that the optimal path from the source to the destination depends only on whether the source's ratio metric is larger than the destination's, regardless of the intermediate nodes in between, which enables efficient AI traffic routing between GPUs.
Our approach can be uniformly applied to different types of LLM traffic, including intra-domain traffic, same-rail traffic, and across-rail traffic. Furthermore, we extend this mechanism to determine the optimal utilization of remote rails for routing traffic when local rails become congested.
In the following discussion, we will use a state-of-the-art underlying network connection architecture known as "rail-only connections" (see 2 in References for details and benefits) as our reference network topology. The rail-only connection topology was proposed by researchers from MIT and Meta and is optimized for LLM training workloads in a data center. Our results can also be extended to a traditional fat-tree Clos data center network architecture, which is supported in legacy data center networking solutions.
In this topology (Figure 1), we have M high-bandwidth domains, each of which contains K GPUs with high-speed any-to-any interconnections (such as NVLink/NVSwitch) and K rail switches. Each rail switch, or simply rail, connects the M GPUs with the same order number (referred to as ranking) within each domain.
Let G(d, g) represent the g-th GPU in domain d. To support distributed DNN (Deep Neural Network) training and inference with SGD iterations and parallelization strategies, each GPU has two types of network connections: one, referred to as the "d-interface," facilitates any-to-any interconnect within a high-bandwidth domain, while the other, referred to as the "r-interface," is an RDMA-capable NIC connected to a rail switch. Our goal is to efficiently move LLM workloads between GPUs. In essence, we aim to address the following question:
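To make the notation concrete, here is a minimal sketch, in Python, of how a GPU G(d, g) and its two interfaces could be modeled. The class name, the component naming scheme, and the cluster dimensions are illustrative assumptions, not part of any particular implementation.

```python
from dataclasses import dataclass

# Hypothetical cluster dimensions: M high-bandwidth domains, K GPUs (and rails) per domain.
M_DOMAINS = 8
K_GPUS_PER_DOMAIN = 8

@dataclass(frozen=True)
class GPU:
    """G(d, g): the g-th GPU in high-bandwidth domain d."""
    d: int  # domain index, 0 <= d < M_DOMAINS
    g: int  # rank within the domain; also the index of the rail it attaches to

    def d_interface(self) -> str:
        # Any-to-any interconnect (e.g., NVLink/NVSwitch) within domain d.
        return f"domain-{self.d}"

    def r_interface(self) -> str:
        # RDMA-capable NIC attached to rail switch g.
        return f"rail-{self.g}"
```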
How can we efficiently route LLM traffic from the source GPU G1 (G(d1, g1)) to the target GPU G2 (G(d2, g2))?
Let us tackle this, starting with the simple cases and moving to the more involved ones:
Case 1: G1 and G2 are in the same domain, that is, d1 = d2. In that case, we should just use G1’s d-interface to utilize the high-bandwidth interconnect (yellow traffic in Figure 2).
Case 2: G1 and G2 are in different domains, but they are on the same rail, i.e., g1 = g2. In that case, we can typically use G1's r-interface to route the traffic to G2 (green traffic in Figure 2). We will discuss some intricacies when the corresponding rail switch is busy.
Case 3: G1 and G2 are in different domains and on different rails.
This is a bit more involved, as at least two hops are needed to reach G2 from G1. We can take the d-interface first, route the traffic to G(d1, g2) within the same domain d1 as the next hop, and from there take the r-interface to reach G2. Let us call this the dr-path (Figure 3).
Alternatively, we can take the r-interface to route the traffic to G(d2, g1) in domain d2, and then use the d-interface to reach G2 within domain d2. We will call this the rd-path (Figure 4).
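As a sketch, the three cases and the two Case 3 candidates can be enumerated as sequences of domain/rail components, reusing the hypothetical GPU class from the earlier snippet; the component naming is an illustrative assumption.

```python
def candidate_paths(src: GPU, dst: GPU) -> list[list[str]]:
    """Enumerate candidate paths as sequences of domain/rail components."""
    if src.d == dst.d:
        return [[f"domain-{src.d}"]]                    # Case 1: same domain
    if src.g == dst.g:
        return [[f"rail-{src.g}"]]                      # Case 2: same rail
    dr_path = [f"domain-{src.d}", f"rail-{dst.g}"]      # Case 3: d-interface first
    rd_path = [f"rail-{src.g}", f"domain-{dst.d}"]      # Case 3: r-interface first
    return [dr_path, rd_path]
```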
Which route to take depends on the dynamic traffic conditions within the LLM cluster. Next, we will develop an LLM cluster traffic condition model to capture and update the traffic conditions in a cohesive manner, enabling the creation of an efficient LLM workload routing mechanism.
We will begin by introducing several metrics used to represent the traffic condition of components within the LLM cluster. These metrics are referred to as "traffic health scores" or simply "health scores," denoted as "h-scores," for LLM domains or rails. H-scores are integer values ranging from 0 to 100, where 100 represents 100% healthy, indicating no congestion in the domain or on the rail, while 0 signifies 0% healthy, indicating complete blockage of the concerned component.
There should be a distributed observability/monitoring architecture in place within the LLM cluster. This architecture includes a cluster controller or master node responsible for running a global traffic monitoring service, which we'll refer to as "healthd." Each rail switch or interconnection domain should monitor and collect its traffic load, periodically sending reports to healthd. healthd then consolidates and normalizes the data from all rails and domains, assigns each a value between 0% and 100% as its traffic health score, and stores this information in a global registry (such as etcd in a Kubernetes cluster, Apache ZooKeeper, or Consul). Note that our mechanism is independent of any specific health score calculation mechanism, as congestion detection can be platform-specific (e.g., ECN/PFC or DCQCN). Also note that h-score collection and distribution occur in the control plane and do not affect real-time traffic routing or forwarding speed in the data plane. Additionally, traffic trends develop over time, so the distributed calculation of h-scores does not have to be real-time.
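The following is a minimal control-plane sketch of what healthd could do, assuming raw utilization reports in [0, 1] and using a plain in-memory dictionary in place of a real registry such as etcd, ZooKeeper, or Consul; the report format and the normalization rule are illustrative assumptions.

```python
# In-memory stand-in for the global registry (etcd / ZooKeeper / Consul in practice).
registry: dict[str, float] = {}   # component name -> normalized h-score in [0.0, 1.0]

def publish_h_scores(load_reports: dict[str, float]) -> None:
    """Consolidate raw per-component utilization reports into normalized h-scores.

    load_reports maps a component name (e.g., 'rail-3', 'domain-1') to the
    utilization fraction it reported; healthier means less loaded."""
    for component, utilization in load_reports.items():
        h_score = max(0.0, min(1.0, 1.0 - utilization))   # clamp into [0, 1]
        registry[component] = h_score

# Example: rail-3 reports 70% utilization, domain-1 reports 20%.
publish_h_scores({"rail-3": 0.70, "domain-1": 0.20})
# registry is now approximately {'rail-3': 0.3, 'domain-1': 0.8}
```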
Additionally, each GPU syncs up with the global registry and maintains a local registry cache, allowing each GPU to query the health score of a specific rail “r” as H(r) or the health score of a domain "d" as H(d) at any given time. After normalization, we have 0 <= H(r) <= 1 and 0 <= H(d) <= 1. A lower health score value represents a more congested condition.
A network path p in an LLM cluster is a sequence of concatenated domain/rail components, denoted as p = c1 ° c2 ° … ° cm. We will denote each component c in the path as c ∈ p.
Now we can define the h-score of a network path p as the product of the h-scores of the components it traverses: H(p) = H(c1) * H(c2) * … * H(cm), i.e., the product of H(c) over all c ∈ p.
With this formula, when one component’s health-score decreases, the path’s health-score decreases proportionally. Since all the components are chained in sequence, when one component’s health score is nearly zero (blocked), the whole path’s health-score would be zero (blocked) as well. When all components’ health scores are 1 (100% healthy), the whole path’s health-score would be 1 (100% healthy). Therefore, this simple formula reasonably simulates the traffic condition of the routing path.
For instance, in Case 1, since the path between G1 and G2 only travels through domain d, its health score is H(d). In Case 2, the path between G1 and G2 only travels through rail r, so its health score is H(r). If a path travels through domain d1 first, then goes through rail r1 to reach domain d2, and finally reaches the destination GPU via domain d2, then this 3-hop path's health score is H(d1) * H(r1) * H(d2). Note that if the destination is connected directly to rail r1, then the destination domain's traffic condition should not be involved in the path health-score calculation, so H(p) = H(d1) * H(r1).
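Here is a short sketch of the path h-score calculation, reusing the component naming from the snippets above; the local cache contents are made-up values for illustration.

```python
import math

def path_h_score(path: list[str], h: dict[str, float]) -> float:
    """H(p): the product of H(c) over every component c in path p.
    `h` is the GPU's local registry cache of normalized h-scores."""
    return math.prod(h[c] for c in path)

# Illustrative cache contents (made-up values), then the two Case 3 candidates
# for src = G(1, 1) and dst = G(2, 2):
h = {"domain-1": 0.9, "rail-1": 0.7, "domain-2": 0.8, "rail-2": 0.5}
print(path_h_score(["domain-1", "rail-2"], h))   # dr-path: 0.9 * 0.5 ≈ 0.45
print(path_h_score(["rail-1", "domain-2"], h))   # rd-path: 0.7 * 0.8 ≈ 0.56, the healthier path
```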
Besides the h-score, we will introduce another crucial metric, the health-ratio, or h-ratio, 𝛾, for each GPU G(d, g) in the cluster:
𝛾(G(d, g)) = H(g) / H(d).
We assume an LLM domain's health score never reaches 0; otherwise, the corresponding GPU's 𝛾 value would be infinite, meaning the domain is completely blocked and no LLM operations can be performed there.
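Continuing the sketch, a GPU's h-ratio is just a lookup-and-divide on the local cache (again using the hypothetical naming from the earlier snippets):

```python
def h_ratio(gpu: GPU, h: dict[str, float]) -> float:
    """gamma(G(d, g)) = H(rail g) / H(domain d); assumes H(domain d) > 0."""
    return h[f"rail-{gpu.g}"] / h[f"domain-{gpu.d}"]
```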
Now we can easily determine the route at G1 in Case 3 with the h-ratio metric introduced above (see Figure 5): if 𝛾(G1) > 𝛾(G2), G1 routes the traffic through its r-interface (the rd-path); otherwise, G1 routes the traffic through its d-interface (the dr-path).
To verify the correctness of this approach, note that when we choose G1's r-interface as the next hop, the resulting path p1 can be represented as p1 = g1 ° d2; the other choice, p2, can be represented as p2 = d1 ° g2. So we have H(p1) = H(g1) * H(d2) and H(p2) = H(d1) * H(g2). If 𝛾(G1) > 𝛾(G2), then by the definition of 𝛾 we have H(g1) / H(d1) > H(g2) / H(d2), that is, H(g1) * H(d2) > H(g2) * H(d1).
Therefore H(p1) = H(g1) * H(d2) > H(d1) * H(g2) = H(p2), and choosing the r-interface means choosing the path with the higher h-score.
In other words, when the destination GPU's h-ratio is lower than the source GPU's h-ratio, we choose the r-interface: we take the r-interface when moving an LLM workload downstream in terms of h-ratio, and the d-interface when moving it upstream in terms of h-ratio.
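Putting the pieces together, a sketch of the per-GPU routing decision might look like the following; note that for Case 3 only the two h-ratios are compared, with no per-path enumeration required. The function names and cache layout are assumptions carried over from the earlier snippets.

```python
def next_interface(src: GPU, dst: GPU, h: dict[str, float]) -> str:
    """Pick the egress interface on the source GPU for an LLM workload transfer."""
    if src.d == dst.d:
        return src.d_interface()   # Case 1: stay on the high-bandwidth domain interconnect
    if src.g == dst.g:
        return src.r_interface()   # Case 2: same rail
    # Case 3: downstream in h-ratio -> r-interface (rd-path); upstream -> d-interface (dr-path).
    if h_ratio(src, h) > h_ratio(dst, h):
        return src.r_interface()
    return src.d_interface()

# With the illustrative cache above: gamma(G(1,1)) ≈ 0.78 > gamma(G(2,2)) ≈ 0.63,
# so the source takes its r-interface (the rd-path).
print(next_interface(GPU(d=1, g=1), GPU(d=2, g=2), h))   # -> 'rail-1'
```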
We propose a novel AI/ML workload routing architecture with generalized traffic condition metrics on a state-of-the-art data center network topology for generative AI training workloads such as LLMs.
Navigating AI in your enterprise? Learn more about how Outshift is empowering enterprises through AI.
References: