
RoCE Explained: How It Boosts HPC and AI Cluster Performance

Howard | Dec 09, 2025

As HPC and AI clusters scale rapidly, network performance has become a critical bottleneck affecting GPU utilization and overall job efficiency. Traditional Ethernet often struggles with latency, congestion, and packet loss under heavy east-west traffic. RoCE (RDMA over Converged Ethernet) addresses these challenges by enabling ultra-low-latency, high-throughput data movement over Ethernet, making it a key technology for accelerating modern HPC workloads and large-scale AI training.
Network Bottlenecks in HPC and AI Clusters
As HPC simulations and AI training jobs scale to hundreds or thousands of GPUs, several fundamental network bottlenecks begin to limit performance:
High Latency Slows Down GPU-to-GPU Communication:
In distributed deep learning, collective operations such as AllReduce and AllGather are highly sensitive to latency. The higher the latency, the longer GPUs spend waiting for synchronization, which leads to significant drops in overall throughput (a simple cost model after this list illustrates the effect).
Insufficient Bandwidth for Massive Data Exchange:
HPC workloads (e.g., CFD, molecular dynamics) and AI workloads require continuous transfer of large data chunks across nodes. When the network cannot sustain these demands, computation stalls while GPUs and CPUs wait for data.
Packet Loss Causes Exponential Performance Degradation:
AI traffic patterns tend to be bursty and imbalanced. Even minimal packet loss (as little as 0.01%) can trigger retransmissions, dramatically increasing latency and causing GPU idle time to skyrocket.
High-Cost Specialized Fabrics Limit Scalability:
While InfiniBand provides excellent performance, its specialized ecosystem increases cost and operational complexity. Ethernet, in contrast, offers a familiar environment with broad compatibility and lower total cost of ownership (TCO).
These challenges drive enterprises toward RDMA-enhanced Ethernet architectures, with RoCE becoming the preferred solution.
Introduction to RoCE Protocol
The RoCE protocol is a cluster network communication protocol that enables Remote Direct Memory Access (RDMA) over Ethernet. It offloads packet send/receive processing to the network card, so data transfers no longer pass through the kernel; this removes the overhead of copying, encapsulation, and decapsulation and substantially reduces Ethernet communication latency. It also minimizes CPU utilization during communication, eases network congestion, and makes more efficient use of available bandwidth.
The RoCE protocol comes in two versions: RoCE v1 and RoCE v2. RoCE v1 operates as a link-layer protocol, requiring both communicating parties to be in the same Layer 2 network. In contrast, RoCE v2 functions as a network-layer protocol, encapsulating RDMA traffic in UDP/IP so that its packets can be routed at Layer 3, which provides far better scalability. Today, RoCE v2 is the dominant and most widely deployed variant, offering ultra-low latency and high throughput on familiar Ethernet infrastructure.
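On the host side, this kernel-bypass path is exposed through the RDMA verbs API (libibverbs), which works the same way over RoCE as over InfiniBand. The fragment below is a minimal, illustrative sketch of the setup steps only (open a device, allocate a protection domain, register a buffer); queue-pair creation, connection exchange, and error handling are omitted, and it assumes a RoCE-capable NIC is installed.

```c
/*
 * Minimal libibverbs setup sketch: open an RDMA-capable device, allocate a
 * protection domain, and register a buffer so the NIC can DMA to/from it
 * without kernel involvement on the data path. Link with -libverbs.
 * Illustrative only; no error handling or queue-pair setup.
 */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "No RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Register a buffer: the NIC gets direct access, so transfers bypass
     * the kernel and avoid extra copies. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);

    printf("Registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, (unsigned)mr->lkey, (unsigned)mr->rkey);

    /* Next steps (not shown): create completion queues and a queue pair,
     * exchange addressing info out of band, move the QP to RTS, then post
     * RDMA WRITE/READ work requests against the registered memory. */

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```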
RoCE provides several performance and scalability benefits that make it ideal for modern HPC and AI clusters:
Ultra-low latency for faster GPU synchronization in distributed deep learning and tightly coupled HPC workloads.
Higher bandwidth utilization that accelerates collective communication (e.g., AllReduce, AllGather).
Reduced CPU overhead thanks to zero-copy transfers and direct memory access, freeing CPU cycles for computation.
Near–InfiniBand performance on Ethernet, allowing organizations to benefit from RDMA without adopting a separate networking ecosystem.
Excellent scalability for large CLOS fabrics with hundreds or thousands of nodes.
Why Lossless Ethernet Is Essential for RoCE
Because RDMA was originally designed for the lossless InfiniBand fabric (which achieves losslessness in hardware through credit-based flow control), the RoCEv2 protocol lacks a complete packet-loss protection mechanism of its own. The loss of even a single packet can trigger a large number of retransmissions, seriously degrading data transmission performance. At the same time, high-performance computing and distributed storage applications are characterized by many-to-one (incast) traffic patterns, which easily cause instantaneous burst congestion, and even packet loss, in the queue buffers of Ethernet switches, increasing application latency, reducing throughput, and lowering the performance of distributed applications.
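To see why even tiny loss rates matter, note that RoCEv2 NICs have traditionally recovered losses with go-back-N retransmission, so a single drop forces the sender to resend a large part of its outstanding window. The sketch below is a deliberately rough model of that effect; the window size and loss rates are assumptions chosen only to illustrate the trend.

```c
/*
 * Rough go-back-N goodput model: each loss forces retransmission of, on
 * average, half the outstanding window. Illustrative assumptions only.
 */
#include <stdio.h>

static double goback_n_goodput(double loss_rate, double window_pkts)
{
    /* Expected extra packets sent per useful packet delivered. */
    double retransmit_overhead = loss_rate * (window_pkts / 2.0);
    return 1.0 / (1.0 + retransmit_overhead);
}

int main(void)
{
    double window   = 1000.0;  /* assumed packets in flight */
    double losses[] = { 0.0, 0.0001, 0.001, 0.01 };

    for (int i = 0; i < 4; i++) {
        printf("loss %.4f%% -> goodput ~%.1f%% of line rate\n",
               losses[i] * 100.0,
               goback_n_goodput(losses[i], window) * 100.0);
    }
    return 0;
}
```

Even under these simplified assumptions, goodput drops sharply well before loss reaches 1%, which is why RoCE fabrics aim for zero packet loss rather than merely low loss.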
Key Features for Building a Lossless Ethernet Network
To maintain stable RDMA performance, Ethernet switches must provide lossless networking, ensuring zero packet loss, low latency, and high throughput across the fabric. To build a lossless network, several key features are required:
| Capability | Role | Advantages |
|---|---|---|
| PFC | Prevents buffer overflow and packet loss by regulating sending rates. | More accurate training results and a more stable platform. |
| ECN | Detects and mitigates congestion to maintain network efficiency. | Higher overall training efficiency when multiple jobs run in parallel. |
| DLB | Distributes traffic to avoid hotspots and improve utilization. | Maximizes bandwidth usage and improves ROI through better link efficiency. |
| GLB | Performs cluster-wide traffic distribution based on global congestion awareness. | Ensures stable, high-throughput operation for ultra-large GPU clusters. |
1. Priority Flow Control (PFC): PFC acts as the first line of defense, providing link-level losslessness by pausing traffic on specific priority queues before buffers overflow. It effectively prevents packet drops for RDMA flows but must be carefully configured to avoid issues such as head-of-line blocking and deadlocks.
2. Explicit Congestion Notification (ECN): ECN provides a proactive, hop-by-hop method for signaling impending congestion. When the switch queue depth crosses the ECN threshold, packets are marked instead of dropped. The receiver then signals the sender to reduce its sending rate, achieving intelligent traffic moderation without compromising performance (a simplified sketch of this marking-and-reaction loop follows this list).
3. Dynamic Load Balancing (DLB): DLB provides real-time, in-cluster traffic optimization by continuously redistributing flows based on instantaneous congestion and resource utilization. It quickly reacts to microbursts and uneven traffic patterns, helping maintain low latency and high throughput within a single data center or fabric.
4. Global Load Balancing (GLB): GLB offers cross-site, multi-region traffic distribution by routing workload to the optimal data center based on geography, latency, health status, or policy. It enhances availability and resiliency by dynamically redirecting traffic during site-level failures or maintenance events.
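To make the ECN reaction loop above concrete, the sketch below models its two software-visible pieces: a WRED-style probabilistic marking decision at the switch queue, and a DCQCN-like sender rate cut when a congestion notification packet (CNP) arrives. The thresholds, rate-cut factor, initial alpha, and gain g are illustrative assumptions, not recommended settings.

```c
/*
 * Illustrative WRED-style ECN marking curve plus a DCQCN-like sender
 * reaction to CNPs. All thresholds and constants are assumptions.
 */
#include <stdio.h>

/* Switch side: marking probability rises linearly between Kmin and Kmax. */
static double ecn_mark_probability(double queue_kb, double kmin_kb,
                                   double kmax_kb, double pmax)
{
    if (queue_kb <= kmin_kb) return 0.0;
    if (queue_kb >= kmax_kb) return 1.0;
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb);
}

/* Sender side: DCQCN-like state and reaction when a CNP arrives. */
struct dcqcn_state {
    double rate_gbps;   /* current sending rate         */
    double target_gbps; /* rate to recover toward       */
    double alpha;       /* congestion estimate in [0,1] */
    double g;           /* alpha update gain            */
};

static void on_cnp(struct dcqcn_state *s)
{
    s->target_gbps = s->rate_gbps;             /* remember the pre-cut rate */
    s->rate_gbps  *= 1.0 - s->alpha / 2.0;     /* multiplicative rate cut   */
    s->alpha       = (1.0 - s->g) * s->alpha + s->g;
}

int main(void)
{
    /* Marking curve: assumed Kmin = 150 KB, Kmax = 1500 KB, Pmax = 20%. */
    for (double q = 0.0; q <= 1800.0; q += 600.0)
        printf("queue %4.0f KB -> mark probability %.2f\n",
               q, ecn_mark_probability(q, 150.0, 1500.0, 0.2));

    /* Sender backs off multiplicatively as CNPs arrive. Assumed initial
     * alpha of 0.5; real DCQCN also decays alpha between CNPs (not modeled). */
    struct dcqcn_state s = { 400.0, 400.0, 0.5, 1.0 / 16.0 };
    for (int i = 1; i <= 3; i++) {
        on_cnp(&s);
        printf("after CNP %d: rate %.1f Gbps, alpha %.3f\n",
               i, s.rate_gbps, s.alpha);
    }
    return 0;
}
```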
RoCE Applications and Deployment Challenges
From large-model training to real-time inference and distributed storage, RoCE-powered lossless Ethernet fabrics play a critical role in keeping GPUs, CPUs, and storage devices fully utilized. The following are typical use cases for RoCE:
| Application Scenario | Sub-Scenarios | Advantages |
|---|---|---|
| High-Performance Computing (HPC) | Scientific research, climate modeling, simulation workloads, military computing, bio-pharma computing, genomics, image processing | Avoids the packet loss that causes long-tail latency and slows computation. |
| AI Training | Large-model training (e.g., GPT), CV model training, recommendation models, large-scale distributed training | Ensures stable synchronization and prevents iteration delays from retransmissions. |
| AI Inference | Autonomous driving, speech assistants, real-time recommendation systems | Guarantees predictable low latency to meet strict SLA requirements. |
| Distributed Storage | Big data platforms, cloud storage, IoT storage, AI storage, CDN, 5G networks | Maintains high throughput and data reliability under heavy I/O loads. |
Despite its advantages, deploying RoCE at scale involves several practical challenges:
Complex PFC Configuration Across the Entire Network:
End-to-end consistency is required across switches, NICs, drivers, and operating systems. A single misconfiguration can significantly degrade performance.
ECN Tuning Varies by Workload:
Different models (LLMs, CV workloads, recommendation systems) have different latency sensitivities, so ECN threshold tuning often requires workload-specific optimization (a simple threshold-sizing sketch appears at the end of this section).
Multi-Tenant Environments Require Stronger Isolation:
Cloud providers or shared HPC facilities must prevent one tenant’s traffic from affecting the RDMA flows of others.
Large-Scale CLOS Fabrics Increase Operational Complexity:
As clusters grow beyond hundreds of nodes, switch buffer size, queue management, route optimization, and flow distribution become critical.
To address these challenges, modern Ethernet switches are increasingly integrating advanced hardware, intelligent software, and telemetry-driven management platforms to simplify RoCE tuning and deliver predictable, lossless performance in large-scale HPC and AI deployments.
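As a starting point for the ECN tuning challenge above, one common back-of-the-envelope approach is to size the marking threshold from the link rate and the queueing delay a workload can tolerate (queue bytes roughly equal rate times delay), then refine it against measurements. The sketch below only performs that arithmetic; the example speeds and delay budgets are assumptions, not recommended values.

```c
/*
 * Back-of-the-envelope ECN threshold sizing: queue bytes ~ link rate x
 * tolerable queueing delay. Example values are assumptions only.
 */
#include <stdio.h>

static double ecn_threshold_kbytes(double link_gbps, double delay_us)
{
    double bytes = (link_gbps * 1e9 / 8.0) * (delay_us * 1e-6);
    return bytes / 1024.0;
}

int main(void)
{
    /* Latency-sensitive inference vs. bandwidth-heavy training (assumed budgets). */
    printf("400G link, 10 us budget: ~%.0f KB\n", ecn_threshold_kbytes(400.0, 10.0));
    printf("400G link, 50 us budget: ~%.0f KB\n", ecn_threshold_kbytes(400.0, 50.0));
    printf("100G link, 20 us budget: ~%.0f KB\n", ecn_threshold_kbytes(100.0, 20.0));
    return 0;
}
```

In practice, the resulting value is only a seed for testing: switch buffer architecture, PFC headroom, and traffic burstiness all shift the usable threshold, which is where telemetry-driven tooling helps.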
FS 400G RoCE Lossless Network Solution
The rapid advancement of AI has pushed AI and ML to the forefront of enterprise innovation, and data centers sit at the core of that transformation. The FS 400G RoCE lossless network solution offers a full-stack, integrated approach, spanning network hardware to management software. Powered by 400G PicOS® Ethernet switches and the AmpCon-DC management platform, it delivers high performance for AI, machine learning, and HPC applications.
FS 400G PicOS® Ethernet Switch Portfolio
Powered by industry-leading Broadcom Tomahawk 3/4 chips, the switches deliver throughput of up to 25.6 Tbps and feature deterministic, low-latency, line-rate switching with full Layer 2/3 capabilities, providing a scalable network foundation for AI workloads. Hot-swappable redundant power supplies and fans ensure high reliability and intelligent, efficient heat dissipation, supporting the development of low-carbon data centers.
Equipped with built-in RoCEv2 capabilities, the switches offer advanced features such as PFC, ECN, and dynamic load balancing, enabling low-latency, lossless communication for RDMA-based AI workloads without requiring additional investment in network infrastructure.
The AmpCon-DC management platform offers Day 0 to Day 2+ capabilities for managing PicOS® data center switches, enabling provisioning, monitoring, troubleshooting, and maintenance for higher resource utilization and lower opex. RoCE EasyDeploy simplifies lossless network deployment by automating PFC and ECN configuration, reducing manual effort and minimizing errors to accelerate high-performance fabric rollout for AI and HPC workloads.
The following table summarizes the 400G PicOS® Ethernet switch portfolio provided by FS:
| Models | 32x 400G, 1U (Tomahawk 3) | 32x 400G, 1U (Tomahawk 4) | 64x 400G, 2U (Tomahawk 4) | 64x 400G, 4U (Tomahawk 4) |
|---|---|---|---|---|
| Form Factor | 1U | 1U | 2U | 4U |
| Switch Chip | BCM56980 Tomahawk 3 | BCM56993 Tomahawk 4 | BCM56990 Tomahawk 4 | BCM56990 Tomahawk 4 |
| Ports | 32x 400G QSFP-DD | 32x 400G QSFP-DD | 64x 400G QSFP-DD | 64x 400G QSFP-DD |
| Max. Ports with Breakout | 64x 200GbE or 128x 100GbE | 64x 200GbE or 128x 100GbE | 128x 200GbE or 256x 100GbE | 128x 200GbE or 256x 100/50GbE |
| Switching Capacity | 12.8 Tbps | 12.8 Tbps | 25.6 Tbps | 25.6 Tbps |
| Power Supplies | 2 (1+1 Redundancy), AC | 2 (1+1 Redundancy), AC | 2 (1+1 Redundancy), AC | 4 (3+1 Redundancy), AC |
| Fans | 6 (5+1 Redundancy) | 6 (5+1 Redundancy) | 4 (3+1 Redundancy) | 6 (5+1 Redundancy) |
| Airflow | Front-to-Back | Front-to-Back | Front-to-Back | Front-to-Back |
| Operating System | PicOS® | PicOS® | PicOS® | PicOS® |
| Management Platform | AmpCon-DC | AmpCon-DC | AmpCon-DC | AmpCon-DC |

The portfolio also includes RoCEv2, PFC/ECN, DLB, RoCE EasyDeploy, and MLAG support, while EVPN-VXLAN availability varies by model.
Conclusion
As modern data centers continue to grow in size and complexity, RoCE—together with advanced congestion control and intelligent network management—has become a foundational technology for building efficient, scalable, and lossless AI/HPC networking.
The FS AI data center solution is a fast way to deploy high-performing AI training and inference networks that are flexible to design and easy to manage, even with limited IT resources. We integrate a complete, industry-leading hardware portfolio with the AmpCon-DC management platform to help customers build high-capacity, easy-to-operate network fabrics that deliver the fastest JCTs and maximize GPU utilization while making the most of limited IT resources. Ready to power your next-generation AI data center? Contact us today to explore tailored solutions for your network.