
Exploring the Significance of InfiniBand Networking and HDR in Supercomputing

Howard | Updated at Jun 25th, 2024

Supercomputing, or high-performance computing (HPC), plays a pivotal role in solving complex computational problems across various fields, including climate modeling, genomics, and artificial intelligence. The continuous evolution of supercomputing necessitates advancements in networking technologies to handle massive data volumes and intricate computations. This article delves into the significance of InfiniBand networking and HDR in supercomputing, highlighting their contributions to enhanced performance, efficiency, and scalability.
The Popularity of InfiniBand in Supercomputers and HPC Data Centers
In the June 2015 TOP500 list, InfiniBand was the dominant interconnect among the world's supercomputers, holding a 51.8% share of systems, a year-over-year increase of 15.8%.
In the November 2023 TOP500 list, InfiniBand maintained its leading position, highlighting its ongoing growth trend. Key trends included:
InfiniBand-based supercomputers led with 189 systems.
InfiniBand-based supercomputers dominated the top 100 systems with 59 installations.
NVIDIA GPUs and networking products, notably Mellanox HDR Quantum QM87xx switches and BlueField DPUs, served as the primary compute and interconnect components in over two-thirds of the listed supercomputers.
Beyond traditional HPC applications, InfiniBand networks are widely used in enterprise data centers and public clouds. For instance, NVIDIA's Selene, a leading enterprise supercomputer, and Microsoft's Azure public cloud both rely on InfiniBand to deliver superior performance.
Advantages of InfiniBand Networks
InfiniBand is recognized as a future-proof standard for high-performance computing (HPC), renowned for its role in supercomputers, storage systems, and local area networks. Key advantages include simplified management, high bandwidth, complete CPU offloading, ultra-low latency, cluster scalability and flexibility, quality of service (QoS), and SHARP support.
Effortless Network Management
InfiniBand features a pioneering network architecture designed for software-defined networking (SDN), managed by a subnet manager. This manager configures local subnets to ensure seamless network operation. All channel adapters and switches implement subnet management agents (SMAs) to cooperate with the subnet manager. Each subnet requires at least one subnet manager for initial setup and reconfiguration, with a failover mechanism to maintain uninterrupted subnet management.
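As a small, hypothetical illustration of the subnet manager's role, the C sketch below uses the standard libibverbs API to read back the port attributes the subnet manager has already configured, such as the port's locally assigned LID and the LID of the subnet manager itself. It assumes a host with an InfiniBand adapter and the rdma-core (libibverbs) library installed, and it is a minimal sketch rather than a management tool; compile with -libverbs.

    /* query_port.c: print port attributes configured by the subnet manager (sketch) */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);   /* enumerate local HCAs */
        if (!devs || num == 0) {
            fprintf(stderr, "no InfiniBand devices found\n");
            return 1;
        }

        struct ibv_context *ctx = ibv_open_device(devs[0]);      /* open the first adapter */
        struct ibv_port_attr pa;
        if (ctx && ibv_query_port(ctx, 1, &pa) == 0) {            /* query physical port 1 */
            printf("device: %s\n", ibv_get_device_name(devs[0]));
            printf("port state: %d (4 = ACTIVE)\n", pa.state);
            printf("LID assigned by the subnet manager: 0x%x\n", pa.lid);
            printf("subnet manager LID: 0x%x, SM service level: %u\n", pa.sm_lid, pa.sm_sl);
        }
        if (ctx)
            ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }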
Superior Bandwidth
InfiniBand consistently outperforms Ethernet in network data rates, which is crucial for server interconnects in HPC. Around 2014, the prevailing rates were 40Gb/s QDR and 56Gb/s FDR; these have since given way to 100Gb/s EDR and 200Gb/s HDR in many supercomputers, and advanced InfiniBand products with 400Gb/s NDR rates are now being considered for high-performance computing systems.
Efficient CPU Offloading
InfiniBand enhances computing performance by minimizing CPU resource use through:
Hardware offloading the entire transport layer protocol stack
Kernel bypass with zero copy
RDMA (Remote Direct Memory Access), which moves data directly between server memories without CPU involvement
GPU direct technology further improves performance by enabling direct GPU memory access, which is essential for HPC applications like deep learning and machine learning.
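To make the RDMA path concrete, the hypothetical C fragment below uses the libibverbs verbs API to post a one-sided RDMA write: the local adapter moves a buffer directly into the remote server's registered memory, with no CPU involvement on the remote side. It assumes an already-connected reliable-connection queue pair, a completion queue, a locally registered memory region, and a remote address and rkey exchanged beforehand; it is a sketch of the mechanism, not a complete program.

    /* Post a one-sided RDMA write on an already-connected RC queue pair (sketch). */
    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int rdma_write(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                   void *local_buf, uint32_t len,
                   uint64_t remote_addr, uint32_t remote_rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,     /* local source buffer */
            .length = len,
            .lkey   = mr->lkey,                 /* key of the registered local region */
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 1;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write, remote CPU untouched */
        wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion entry */
        wr.wr.rdma.remote_addr = remote_addr;         /* target address on the peer */
        wr.wr.rdma.rkey        = remote_rkey;         /* peer's memory registration key */

        if (ibv_post_send(qp, &wr, &bad_wr))          /* hand the transfer to the HCA */
            return -1;

        struct ibv_wc wc;                             /* busy-poll for hardware completion */
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;
        return wc.status == IBV_WC_SUCCESS ? 0 : -1;
    }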
Low Latency
InfiniBand achieves significantly lower latency than Ethernet. InfiniBand switches streamline layer 2 processing and employ cut-through forwarding, reducing switch forwarding latency to below 100ns, whereas Ethernet switches typically incur higher latency due to more complex layer 2 processing. On the adapter side, RDMA-capable InfiniBand NICs (Network Interface Cards) bring end-to-end application latency down to around 600ns, while typical Ethernet-based applications hover around 10µs, more than an order of magnitude higher.
Scalability and Flexibility
InfiniBand supports up to 48,000 nodes in a single subnet, avoiding broadcast storms and unnecessary bandwidth usage. It supports various network topologies for scalability, from 2-layer fat-tree for smaller networks to 3-layer fat-tree and Dragonfly for larger deployments.
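As a rough, back-of-envelope illustration of how topology choice determines scale, the C sketch below computes the maximum number of end nodes in non-blocking 2-layer and 3-layer fat-trees built from switches of a given radix (k²/2 and k³/4 end nodes for k-port switches). The formulas and the 40-port example are textbook approximations used here for illustration, not vendor sizing guidance.

    /* fat_tree.c: back-of-envelope node counts for non-blocking fat-trees (sketch) */
    #include <stdio.h>

    int main(void)
    {
        int k = 40;                                /* switch radix, e.g. a 40-port HDR switch */
        long two_layer   = (long)k * k / 2;        /* 2-layer fat-tree: k^2/2 end nodes */
        long three_layer = (long)k * k * k / 4;    /* 3-layer fat-tree: k^3/4 end nodes */
        printf("2-layer fat-tree, %d-port switches: up to %ld nodes\n", k, two_layer);
        printf("3-layer fat-tree, %d-port switches: up to %ld nodes\n", k, three_layer);
        return 0;
    }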
Quality of Service (QoS) Support
InfiniBand provides QoS by prioritizing traffic so that high-priority applications are served first. This is achieved through virtual lanes (VLs), which isolate traffic classes by priority, ensuring efficient traffic management and consistent performance for high-priority applications.
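In practice, an application or communication library chooses a service level (SL) when it connects a queue pair, and the fabric's SL-to-VL mapping tables place that traffic on the corresponding virtual lane. The hypothetical libibverbs fragment below shows where the SL is specified while transitioning a reliable-connection queue pair to the ready-to-receive state; the surrounding connection setup is omitted and the parameter values are illustrative only.

    /* Set the InfiniBand service level while moving an RC QP to RTR (fragment). */
    #include <string.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    int connect_with_sl(struct ibv_qp *qp, uint32_t remote_qpn, uint32_t remote_psn,
                        uint16_t remote_lid, uint8_t service_level)
    {
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.qp_state           = IBV_QPS_RTR;        /* ready-to-receive */
        attr.path_mtu           = IBV_MTU_4096;
        attr.dest_qp_num        = remote_qpn;
        attr.rq_psn             = remote_psn;
        attr.max_dest_rd_atomic = 1;
        attr.min_rnr_timer      = 12;
        attr.ah_attr.dlid       = remote_lid;
        attr.ah_attr.sl         = service_level;      /* SL chosen here; SL2VL maps it to a VL */
        attr.ah_attr.port_num   = 1;

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
    }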
Stability and Resilience
InfiniBand features a self-healing network mechanism built into its switches, enabling recovery from link failures in about 1ms, far faster than conventional software-based recovery mechanisms.
Optimized Load Balancing
InfiniBand uses adaptive routing for load balancing, dynamically distributing traffic across switch ports to prevent congestion and optimize bandwidth utilization.
In-Network Computing Technology: SHARP
InfiniBand's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) technology offloads collective communication operations from CPUs and GPUs to the switches, reducing the amount of data that must traverse the network and significantly boosting performance, especially in HPC applications.
Diverse Network Topologies
InfiniBand supports multiple topologies like fat-tree, Torus, Dragonfly+, Hypercube, and HyperX, catering to different requirements for scalability, cost efficiency, latency minimization, and transmission distance.
Overview of InfiniBand HDR Products
With growing client demands, 100Gb/s EDR is gradually being phased out, and HDR is now widely adopted for its flexibility, offering both HDR100 (100Gb/s) and HDR (200Gb/s) options.
InfiniBand HDR Switches
NVIDIA offers two types of InfiniBand HDR switches. The first is the HDR CS8500 modular chassis switch, a 29U switch with up to 800 HDR 200Gb/s ports. Each 200G port can split into 2x100G, supporting up to 1600 HDR100 (100Gb/s) ports. The second type is the QM87xx series fixed switch, with a 1U form factor integrating 40 200G QSFP56 ports. These ports can split into up to 80 HDR100 ports and also support EDR rates for 100G EDR NIC connections. Note that a single 200G HDR port can only downgrade to 100G for EDR NICs and cannot split into 2x100G for two EDR NICs.
The 200G HDR QM87xx switches come in two models: MQM8700-HS2F and MQM8790-HS2F. The only difference between them is the management method: the QM8700 supports out-of-band management through a dedicated management port, while the QM8790 is externally managed and requires the NVIDIA UFM (Unified Fabric Manager) platform for management.
Both QM8700 and QM8790 switches offer two airflow options. Details of the QM87xx series switches are as follows:
Product          Ports   Link Speed   Interface Type   Rack Units   Management
MQM8790-HS2F     40      200G         QSFP56           1 RU         In-band
MQM8700-HS2F     40      200G         QSFP56           1 RU         In-band/out-of-band
InfiniBand HDR NICs
Compared with HDR switches, HDR NICs come in a wider variety of models. There are two data rate options: HDR100 and HDR.
HDR100 NICs support 100Gb/s transmission rates. Two HDR100 ports can connect to an HDR switch using a 200G HDR-to-2x100G HDR100 cable. Unlike 100G EDR NICs, HDR100 NICs' 100G ports support both 4x25G NRZ and 2x50G PAM4 transmissions.
200G HDR NICs support 200Gb/s transmission rates and can connect directly to switches using 200G direct cables.
Each data rate is available in single-port, dual-port, and different PCIe configurations, allowing businesses to choose based on their needs. Common InfiniBand HDR NIC models include:
Product                   Ports    Supported InfiniBand Speeds             Supported Ethernet Speeds        Interface Type   Host Interface
HDR100 NIC, single-port   Single   HDR100, EDR, FDR, QDR, DDR, SDR         100, 50, 40, 25, 10Gb/s          QSFP56           PCIe 4.0 x16
HDR100 NIC, dual-port     Dual     HDR, HDR100, EDR, FDR, QDR, DDR, SDR    100, 50, 40, 25, 10Gb/s          QSFP56           PCIe 4.0 x16
HDR NIC, single-port      Single   HDR, HDR100, EDR, FDR, QDR, DDR, SDR    200, 100, 50, 40, 25, 10Gb/s     QSFP56           PCIe 4.0 x16
HDR NIC, dual-port        Dual     HDR, HDR100, EDR, FDR, QDR, DDR, SDR    200, 100, 50, 40, 25, 10Gb/s     QSFP56           PCIe 4.0 x16
Conclusion
InfiniBand networks, particularly with HDR technology, are pivotal in supercomputing. Their high throughput, low latency, scalability, reliability, and energy efficiency make them indispensable for modern HPC environments. As supercomputing evolves, InfiniBand will remain a cornerstone technology, driving advancements and enabling groundbreaking research and innovation.