Unlocking large scale AI training networks with MRC (Multipath Reliable Connection)

OpenAI introduces MRC (Multipath Reliable Connection), a new supercomputer networking protocol released via OCP to improve resilience and performance.

Supercomputer networking to accelerate large scale AI training

Frontier model training depends on reliable supercomputer networks that can quickly move data between GPUs.

To make this faster and more efficient, OpenAI has partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to develop MRC (Multipath Reliable Connection): a novel protocol that improves GPU networking performance and resilience in large training clusters.

We released MRC today(opens in a new window) through the Open Compute Project (OCP) to enable the broader industry to use it.

With more than 900M people using ChatGPT every week, our systems are becoming core infrastructure for AI, helping people and businesses around the world build with increasingly capable models.

Prior to the inception of Stargate, we co-developed, brought up, and maintained our first three generations of supercomputers with great care and close collaboration with our partners over the span of a few years.

This invaluable experience informed our strong belief that, to efficiently use compute at the scale of Stargate and succeed in our mission, we need to rethink and drastically reduce complexity in every layer of the stack – including network design.

Publishing the MRC specification is part of OpenAI’s overall compute strategy: shared standards in key infrastructure layers can help scale AI systems more efficiently, reliably, and across a broader partner ecosystem.

In this post, we’ll cover the design of MRC, including: i) how it enables us to build multi-plane high-speed networks that create redundancy to ride out network failures while using fewer components and less power; ii) how MRC’s adaptive packet spraying virtually eliminates core congestion; and iii) how our deployments use static source routing to bypass failures and eliminate whole classes of routing failure.
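To give a feel for the packet-spraying idea, here is a minimal, illustrative sketch: packets of a message are sprayed across several paths, with per-path weights lowered when congestion feedback arrives so congested paths carry fewer packets. The class, method names, and the weighting rule are assumptions for illustration only, not MRC’s actual specification.

```python
import random

random.seed(0)  # deterministic for the example

class Sprayer:
    """Illustrative packet sprayer (not MRC's real algorithm):
    spreads packets over paths, weighted by observed congestion."""

    def __init__(self, num_paths):
        # Start with equal weight per path; adapt from feedback.
        self.weights = [1.0] * num_paths

    def pick_path(self):
        # Weighted random choice: congested paths get fewer packets.
        total = sum(self.weights)
        r = random.uniform(0, total)
        acc = 0.0
        for path, w in enumerate(self.weights):
            acc += w
            if r <= acc:
                return path
        return len(self.weights) - 1

    def on_congestion_feedback(self, path, congested):
        # Hypothetical rule: multiplicative decrease, additive increase.
        if congested:
            self.weights[path] = max(0.1, self.weights[path] * 0.5)
        else:
            self.weights[path] = min(1.0, self.weights[path] + 0.05)

sprayer = Sprayer(num_paths=4)
sprayer.on_congestion_feedback(2, congested=True)  # path 2 reported congested
counts = [0, 0, 0, 0]
for _ in range(10_000):
    counts[sprayer.pick_path()] += 1
# After the feedback, path 2 receives roughly half the traffic of the others,
# while load stays spread across all paths instead of piling onto one.
```

The key property this sketch illustrates is that no single flow is pinned to one path, so a hot spot on any one link only sees a fraction of the traffic.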

In concert, these benefits allow us to deliver better models to everyone faster.

When training large AI models, a single step can involve many millions of data transfers.

One transfer arriving late can ripple through the entire job, potentially causing GPUs to sit idle.
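The straggler effect described above can be made concrete with a small simulation (an illustrative sketch, not OpenAI code; the latencies and probabilities are made-up assumptions): a step finishes only when its slowest transfer finishes, so step time is the maximum over all transfer latencies, and with millions of transfers even very rare slow outliers become near-certain.

```python
import random

random.seed(0)  # deterministic for the example

def step_time(num_transfers, base_ms=1.0, slow_ms=50.0, p_slow=1e-4):
    """Step completes when the slowest transfer completes.
    Each transfer usually takes ~base_ms, but with small probability
    p_slow it is delayed (e.g. by congestion or a flaky link)."""
    return max(
        slow_ms if random.random() < p_slow else base_ms
        for _ in range(num_transfers)
    )

small = step_time(100)        # few transfers: a slow outlier is unlikely
large = step_time(1_000_000)  # millions of transfers: an outlier is
                              # near-certain, so the step takes slow_ms
```

With `p_slow = 1e-4`, the chance that a million-transfer step avoids every slow outlier is about `(1 - 1e-4)**1_000_000 ≈ e**-100`, i.e. effectively zero, which is why tail behavior dominates at cluster scale.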

Network congestion, link failures, and device failures are the most common sources of delay and jitter in transfers.

These problems get more frequent, and harder to solve, as the size of the cluster increases.

This makes networking technology a key part of the design of Stargate.

To enable the current scale of Stargate supercomputers, we faced two key networking challenges.

First, we needed to minimize network congestion wherever possible.
