Optical Module Requirements for A100 and H100 GPUs in HPC Networks
Updated on Dec 22, 2023
In the HPC network industry, various methods are used to estimate how many optical modules are needed per GPU, and they often produce inconsistent results. These discrepancies arise mainly from the different optical modules used in different network architectures. In this article, we examine these factors and explore how they influence the exact number of optical modules required, focusing on configurations built around A100 and H100 GPUs.
What Affects the Number of Optical Modules?
In data centers, the number of optical modules is influenced by factors such as the network cards, the switches, and the number of scalable units, which together determine the network's performance, scalability, and overall cost.
Network Cards
The data rate of the network cards determines the type of optical modules used. High-performance network cards with multiple ports may require more optical modules to support higher data transmission capacity. For example, the ConnectX-7 NIC supports NDR, NDR200, HDR, HDR100, EDR, FDR, and SDR InfiniBand speeds, which affects both the number and the types of modules selected.
ConnectX-6 NIC Adapter: 200Gb/s, commonly paired with A100 GPUs.
ConnectX-7 NIC Adapter: 400Gb/s, typically used with H100 GPUs.
Switches
The switch's traffic-handling capacity and architecture affect the number and type of optical modules required. High-capacity switches with many ports may need more optical modules to ensure efficient data routing and network performance. For example, both the MQM9700-NS2F and MQM9790-NS2F switches provide 32 ports at 400Gb/s, and using these switches during network deployment impacts both the number and the types of modules selected.
MQM9700-NS2F and MQM9790-NS2F Switches: 32-port OSFP 2x400Gb/s, a total of 64 channels of 400Gb/s transmission rate and 51.2Tb/s throughput rate.
MQM8700-HS2F and MQM8790-HS2F Switches: 40-port QSFP56, a total of 40 channels of 200Gb/s transmission rate and 16Tb/s throughput rate.
Number of Units (Scalable Unit)
The number of units dictates the architecture of the switch network: a two-layer structure is adopted for smaller clusters, whereas a three-layer architecture is required to accommodate larger ones.
H100 SuperPOD: Each unit consists of 32 nodes (DGX H100 servers) and supports a maximum of 4 units to form a cluster, using a two-layer switching architecture.
A100 SuperPOD: Each unit consists of 20 nodes (DGX A100 servers) and supports a maximum of 7 units to form a cluster. A three-layer switching architecture is required if the number of units exceeds 5.
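As a rough illustration of how unit count drives the architecture, the Python sketch below (our own, assuming 8 GPUs per node as in the DGX systems above, not an official NVIDIA sizing tool) converts the unit sizes into node and GPU counts and the corresponding number of switching layers.

```python
# Rough illustration only: turn the scalable-unit (SU) figures above into
# node/GPU counts and the number of switching layers needed.

def cluster_summary(nodes_per_unit, units, gpus_per_node=8, three_layer_above=None):
    """Return node count, GPU count, and switching-layer count for a SuperPOD-style cluster."""
    nodes = nodes_per_unit * units
    layers = 3 if (three_layer_above is not None and units > three_layer_above) else 2
    return {"nodes": nodes, "gpus": nodes * gpus_per_node, "switch_layers": layers}

# A100 SuperPOD: 20 DGX A100 nodes per unit; a third layer is needed above 5 units.
print(cluster_summary(20, 7, three_layer_above=5))  # {'nodes': 140, 'gpus': 1120, 'switch_layers': 3}

# H100 SuperPOD: 32 DGX H100 nodes per unit, up to 4 units on a two-layer fabric.
print(cluster_summary(32, 4))                       # {'nodes': 128, 'gpus': 1024, 'switch_layers': 2}
```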
Optical Module Demand under Four Network Configurations
Case 1: A100+ConnectX6+MQM8700-HS2F Three-Layer Network
Each DGX A100 node is designed with eight compute interfaces, evenly split with four on the left and four on the right. Currently, most A100 GPU shipments are paired with ConnectX-6 NICs for external communication, providing connection speeds of up to 200Gb/s.
The First Layer: Each node has 8 interfaces (ports) and connects to 8 leaf switches. Every 20 nodes form a single scalable unit (SU). Therefore, the first layer requires a total of 8xSU leaf switches, 8xSUx20 cables, and 2x8xSUx20 units of 200G optical modules.
The Second Layer: Because the design is non-blocking, the second layer runs at the same speed as the first (200Gb/s), so it requires the same number of cables as the first layer. The number of spine switches is obtained by dividing the number of cables by the number of leaf switches, i.e., (8xSUx20) / (8xSU) spine switches at full scale. However, when there are fewer leaf switches, a leaf switch can run multiple links to the same spine switch (within the 40-port limit), so fewer spine switches suffice. Therefore, when the unit quantity is 1/2/4/5, the required number of spine switches is 4/10/20/20, and the required number of optical modules is 320/640/1280/1600. The number of spine switches does not grow proportionally, but the number of optical modules does.
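To make this arithmetic concrete, here is a minimal Python sketch of the first- and second-layer counts. It simply restates the formulas above; the spine-switch counts are the values quoted in the text, not derived from port math.

```python
# Case 1 sketch (A100 + ConnectX-6 + 200G switches): layer-1 and layer-2 counts.

NODES_PER_SU = 20                 # DGX A100 nodes per scalable unit
PORTS_PER_NODE = 8                # 200G compute interfaces per node
SPINES_BY_SU = {1: 4, 2: 10, 4: 20, 5: 20}   # spine-switch counts quoted above

def case1_layer_counts(su):
    leaf_switches = PORTS_PER_NODE * su                    # 8 x SU
    layer1_cables = PORTS_PER_NODE * NODES_PER_SU * su     # 8 x SU x 20
    layer1_modules = 2 * layer1_cables                     # one 200G module per cable end
    layer2_cables = layer1_cables                          # non-blocking: same as layer 1
    layer2_modules = 2 * layer2_cables
    return leaf_switches, SPINES_BY_SU.get(su), layer1_modules, layer2_modules

for su in (1, 2, 4, 5):
    print(su, case1_layer_counts(su))
# Layer-2 module counts come out to 320 / 640 / 1280 / 1600 for 1 / 2 / 4 / 5 units.
```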
The Third Layer: When the system expands to 7 units, a third switching layer becomes necessary. Because of the non-blocking configuration, the number of cables in the third layer is the same as in the second.
NVIDIA's recommended SuperPOD blueprint networks seven units together, adds this third-layer architecture, and introduces core switches. The accompanying graphic illustrates the number of switches in each layer and the cabling required for different unit counts.
For a setup of 140 servers, the total number of A100 GPUs is 1,120 (140 servers x 8 GPUs each). To support this configuration, 140 MQM8790-HS2F switches are deployed, along with 3,360 cables and 6,720 200G InfiniBand transceivers. The ratio of A100 GPUs to 200G optical modules is therefore 1:6, i.e., 1,120 GPUs to 6,720 optical modules.
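These totals can be verified with a short calculation; the sketch below assumes, as stated above, that each of the three non-blocking layers carries the same number of cables.

```python
# Case 1 totals for 7 units (140 DGX A100 servers) on a three-layer non-blocking fabric.

SERVERS = 140
GPUS = SERVERS * 8                      # 1,120 A100 GPUs
CABLES_PER_LAYER = SERVERS * 8          # 1,120 cables in each layer
LAYERS = 3

total_cables = CABLES_PER_LAYER * LAYERS   # 3,360 cables
total_200g = 2 * total_cables              # 6,720 modules (one per cable end)

print(total_cables, total_200g, total_200g / GPUS)  # 3360 6720 6.0 -> a 1:6 GPU-to-module ratio
```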
Case 2: A100+ConnectX6+MQM9700-NS2F Two-Layer Network
The configuration above is not optimal, and a growing number of A100 GPUs may gradually be connected via MQM9700-NS2F switches instead. Such a shift reduces the number of 200G optical transceivers needed while increasing the demand for 800G optical modules. The main difference lies in the first-layer connections: instead of eight separate 200G cables per server, QSFP-to-OSFP breakout cables are used, where each 800G twin-port OSFP on the switch side provides two 400G ports that each split into two 200G links, giving 1-to-4 connectivity.
The First Layer: For a cluster with 7 units and 140 servers, there are 140x8 = 1,120 interfaces in total. This corresponds to 280 one-to-four breakout cables, creating demand for 280 units of 800G and 1,120 units of 200G optical transceivers, and it requires 12 MQM9700-NS2F switches.
The Second Layer: Using only 800G connections, 280x2 = 560 units of 800G optical transceivers are needed, along with 9 MQM9700-NS2F switches.
Therefore, for 140 servers and 1,120 A100 GPUs, a total of 21 switches (12 + 9) are required, along with 840 units of 800G optical transceivers and 1,120 units of 200G optical transceivers.
The ratio of A100 GPUs to 800G InfiniBand transceivers is 1,120:840, which simplifies to 1:0.75. The ratio of A100 GPUs to 200G optical modules is 1:1.
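A compact way to check these figures is shown below; the leaf and spine switch counts (12 and 9) are taken from the text, not derived here.

```python
# Case 2 sketch (A100 + ConnectX-6 + MQM9700-NS2F, two layers): optical-module counts.

SERVERS = 140
GPUS = SERVERS * 8                       # 1,120 A100 GPUs
NIC_PORTS_200G = SERVERS * 8             # 1,120 x 200G NIC interfaces

breakout_cables = NIC_PORTS_200G // 4    # 280 one-to-four breakout cables
layer1_800g = breakout_cables            # 280 x 800G transceivers on the switch side
layer1_200g = NIC_PORTS_200G             # 1,120 x 200G transceivers on the server side
layer2_800g = breakout_cables * 2        # 560 x 800G transceivers between leaf and spine

total_800g = layer1_800g + layer2_800g   # 840
print(total_800g / GPUS, layer1_200g / GPUS)  # 0.75 and 1.0 -> ratios of 1:0.75 and 1:1
```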
Case 3: H100+ConnectX7+MQM9700-NS2F Two-Layer Network
A distinctive feature of the H100 architecture is that each server, despite housing 8 GPUs, is outfitted with 8 400G network cards that are combined into 4 800G interfaces. This fusion creates a considerable need for 800G optical modules.
According to NVIDIA's recommended configuration, each 800G server interface is connected using a twin-port transceiver with two optical cables (MPO), with each cable inserted into a separate switch.
The First Layer: Each unit consists of 32 servers, and each server connects to 2x4 = 8 switches, so a SuperPOD with 4 units requires a total of 4x8 = 32 leaf switches. NVIDIA recommends reserving one node for management purposes (UFM). Since its impact on optical module usage is limited, we approximate the calculation based on 4 full units with 128 servers. Hence, a total of 4x128 = 512 units of 800G optical modules and 2x4x128 = 1,024 units of 400G InfiniBand transceivers are needed.
The Second Layer: The switches are directly interconnected using 800G optical modules. Each leaf switch connects downward with a unidirectional bandwidth of 32x400G; to keep upstream and downstream bandwidth consistent, the upward connection requires a unidirectional bandwidth of 16x800G. This calls for 16 spine switches, resulting in a total of 4x8x16x2 = 1,024 units of 800G optical modules.
Under this architecture, the two layers require a total of 512 + 1024 = 1,536 800G OSFP optical modules and 1,024 400G OSFP optical transceivers for a total of 1,024 H100 GPUs. The ratio of GPUs to 800G OSFP optical modules is 1024:1536, or 1:1.5; the ratio of GPUs to 400G OSFP optical modules is 1024:1024, or 1:1.
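The Case 3 figures can be reproduced the same way; the sketch below follows the text in ignoring the UFM management node and taking 16 uplinks per leaf switch.

```python
# Case 3 sketch (H100 + ConnectX-7 + MQM9700-NS2F, two layers): optical-module counts.

SERVERS = 128                               # 4 units x 32 DGX H100 nodes (UFM node ignored)
GPUS = SERVERS * 8                          # 1,024 H100 GPUs
OSFP_PORTS_PER_SERVER = 4                   # 8 x 400G NICs combined into 4 x 800G interfaces

layer1_800g = SERVERS * OSFP_PORTS_PER_SERVER       # 512 twin-port 800G on the server side
layer1_400g = SERVERS * OSFP_PORTS_PER_SERVER * 2   # 1,024 x 400G on the leaf-switch side

LEAF_SWITCHES = 32
UPLINKS_PER_LEAF = 16                               # 16 x 800G up to mirror 32 x 400G down
layer2_800g = LEAF_SWITCHES * UPLINKS_PER_LEAF * 2  # 1,024 (one module per cable end)

total_800g = layer1_800g + layer2_800g              # 1,536
print(total_800g / GPUS, layer1_400g / GPUS)        # 1.5 and 1.0 -> ratios of 1:1.5 and 1:1
```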
Case 4: H100+ConnectX8 (Not Yet Released)+MQM9700-NS2F Three-Layer Network
In this hypothetical scenario, if H100 GPUs were upgraded with 800G network cards, the external interfaces would need to expand from four to eight OSFP interfaces, and the inter-layer connections would also use 800G optical transceivers. The basic network design would remain consistent with the original scenario, with the only change being the replacement of 200G optical transceivers with 800G counterparts. Within this framework, the ratio of GPUs to required 800G optical modules would therefore come to roughly 1:6.5, comparable to the initial scenario.
We organize the above four cases into the following table:
GPU | NIC Rate | Switch Rate | Network Layers | GPU : 200G Modules | GPU : 400G Modules | GPU : 800G Modules
A100 | 200G | 200G | 3 | 1:6 | 0 | 0
A100 | 200G | 400G | 2 | 1:1 | 0 | 1:0.75
H100 | 400G | 400G | 2 | 0 | 1:1 | 1:1.5
H100 | 800G | 800G | 3 | 0 | 0 | 1:6.5
Conclusion
This article discusses how different architectures and components in high-performance computing (HPC) networks affect the number of optical modules required for GPUs. It focuses on analyzing the optical module demands of A100 and H100 GPUs under various NIC and switch configurations.
As a leading professional provider of networking solutions, FS offers a full range of high-quality optical modules from 1G to 800G, with the 400G and 800G transceivers being particularly noteworthy. These modules are 100% verified for full compatibility with NVIDIA InfiniBand NDR devices, ensuring optimal performance and reliability while meeting industry standards. We invite you to explore and purchase our products to achieve excellence in high-performance computing and cutting-edge technology applications.