Using Muon Optimizer with DeepSpeed

DeepSpeed quietly integrating Muon Optimizer feels less like a feature checkbox and more like a tacit admission that AdamW, the long-reigning king of neural network training, might finally be showing its age. This isn’t just another optimizer plugin; it’s a validation wave from the frontier labs that actually matter—Moonshot AI, Zhipu AI, DeepSeek—who are voting with their compute budgets and paper titles. The real story isn’t in the DeepSpeed commit log; it’s in the glaring efficiency gap Muon

Analysis

Let’s be blunt: Adam and its variants have been a magnificent, brute-force crutch. They work almost everywhere, but they’re wasteful. Keeping two state buffers per parameter, one for the first moment and one for the second, feels increasingly like a tax on scale. Muon, by contrast, is a leaner, more geometrically aware creature. Its core trick isn’t an incremental tweak to learning rate schedules; it’s the approximate orthogonalization of the momentum matrix via Newton-Schulz iterations. This is where the real elegance lies. In standard training, gradient updates for the crucial 2D weight matrices (think attention and MLP layers) are notoriously ill-conditioned and nearly low-rank: a few massive singular values dominate, drowning out the smaller directions that may matter most for generalization. Muon essentially performs a surgical equalization, pushing the singular values of the update toward a common scale so that those rare-but-vital directions are amplified. It’s not just faster; it’s fundamentally smarter about where to spend its gradient budget.
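To make the orthogonalization step concrete, here is a minimal sketch in dependency-free Python. It is an illustration, not DeepSpeed's code: the quintic coefficients and the five-step default are assumptions taken from the publicly available Muon reference implementation, and the helper names (`matmul`, `transpose`, `newton_schulz`) are hypothetical. A real implementation would run on GPU tensors rather than nested lists.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=5, coeffs=(3.4445, -4.7750, 2.0315), eps=1e-7):
    """Approximately orthogonalize matrix g with a quintic Newton-Schulz iteration.

    Each step applies s -> a*s + b*s**3 + c*s**5 to every singular value of X
    while leaving the singular vectors untouched. The coefficients are an
    assumption borrowed from the public Muon reference code.
    """
    a, b, c = coeffs
    tall = len(g) > len(g[0])
    # Work on the wide orientation so the Gram matrix X X^T is as small as possible.
    x = transpose(g) if tall else [row[:] for row in g]
    # Scale by the Frobenius norm so every singular value starts at or below 1.
    norm = sum(v * v for row in x for v in row) ** 0.5 + eps
    x = [[v / norm for v in row] for row in x]
    for _ in range(steps):
        gram = matmul(x, transpose(x))        # A = X X^T
        gram_sq = matmul(gram, gram)          # A^2
        poly = [[b * p + c * q for p, q in zip(rp, rq)]
                for rp, rq in zip(gram, gram_sq)]  # b*A + c*A^2
        correction = matmul(poly, x)
        x = [[a * u + w for u, w in zip(ru, rw)]
             for ru, rw in zip(x, correction)]     # X <- a*X + (b*A + c*A^2) X
    return transpose(x) if tall else x


# A diagonal "momentum" matrix whose singular values span a 50x range.
momentum = [
    [10.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.2],
]
update = newton_schulz(momentum)
print([round(update[i][i], 2) for i in range(4)])
```

Because the input is diagonal, the diagonal of the output can be read directly as the new singular values: they end up in a narrow band near 1 instead of spanning a 50x range, which is the equalization the paragraph above describes. The result is only approximately orthogonal, since the iteration stops after a handful of steps.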

The benchmarks are compelling, but almost beside the point. Training 35% faster than AdamW on the NanoGPT speedrun, or reaching GPT-2 XL performance 25% faster at the 1.5B-parameter scale, is the kind of number that makes researchers’ eyes light up. The real headline is the memory savings: Muon keeps one momentum buffer where Adam keeps two. For a 1.6 trillion parameter model like DeepSeek-V4, halving the optimizer state memory for its 2D weight matrices isn’t a minor optimization; it’s a potential unlock for the next scale of model that would otherwise require a prohibitive number of accelerators. This isn’t about training a model slightly faster; it’s about whether a given architecture can feasibly be trained at all within a given cluster’s memory constraints. Muon changes the math of the entire scaling equation.
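The memory argument is simple arithmetic, and a back-of-the-envelope sketch makes it tangible. Everything numeric below is an assumption for illustration: optimizer state is taken to be stored in 4-byte floats, and the split of the 1.6 trillion parameters into matrix and non-matrix weights is hypothetical, not a published figure. The function name is also made up.

```python
def optimizer_state_bytes(n_matrix, n_other, matrix_buffers, bytes_per_value=4):
    """Bytes of optimizer state for a hybrid setup.

    Matrix parameters keep `matrix_buffers` state buffers each (2 for Adam,
    1 for Muon's single momentum buffer); all other parameters stay on Adam
    and keep 2 buffers each.
    """
    adam_buffers = 2
    return (n_matrix * matrix_buffers + n_other * adam_buffers) * bytes_per_value


# Hypothetical split of a 1.6T-parameter model (assumption, for illustration only).
n_matrix = 1_500_000_000_000  # 2D hidden weights eligible for Muon
n_other = 100_000_000_000     # embeddings, norms, and other Adam-only parameters

all_adam = optimizer_state_bytes(n_matrix, n_other, matrix_buffers=2)
hybrid = optimizer_state_bytes(n_matrix, n_other, matrix_buffers=1)
print(f"all-Adam: {all_adam / 1e12:.1f} TB, Muon hybrid: {hybrid / 1e12:.1f} TB")
```

Under these assumptions the savings scale with the share of parameters that live in eligible 2D matrices, which is why the halving applies to the matrix weights rather than to the whole model.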

What’s fascinating is the social proof. We’re past the phase of arXiv papers and demo projects. When Zhipu AI’s GLM-5 (a 744B parameter beast) not only uses Muon but invents “Muon Split” to handle MLA architectures, orthogonalizing per attention head to close a performance gap with GQA, that’s not experimentation; that’s production engineering. It signals that Muon isn’t just a pretraining curiosity but a robust component that can be adapted to novel, complex architectural choices. Moonshot’s use of a variant (MuonClip) in Kimi-K2 further cements this. These labs are investing engineering cycles to make Muon work with their flagship, money-making models. That tells you more about its perceived value than any benchmark could.

The DeepSpeed integration itself, while technical, is the democratizing step. The challenge was real: DeepSpeed’s ZeRO stages handle gradients as flattened buffers, a format completely hostile to Muon’s need for 2D matrix structure. The solution is pragmatic and clever: hook into the get_flat_partition function of DeepSpeedZeroOptimizer (ZeRO Stage 1 and 2) and apply the Muon update before the gradients are flattened, while each matrix still has its 2D shape. It essentially sneaks Muon into the existing, battle-tested pipeline without requiring a full rewrite. This is how powerful ideas actually propagate: not by replacing infrastructure wholesale, but by being woven into it with surgical precision. Now, the legions of teams already locked into DeepSpeed can experiment with Muon without throwing away their orchestration.
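Why the ordering matters can be shown with a toy example. The sketch below is not DeepSpeed's code and does not reproduce get_flat_partition; the function names are hypothetical, and a Frobenius-norm rescaling stands in for the Muon update as an assumption, chosen only because it also needs to see a whole matrix at once.

```python
import math


def frobenius_normalize(matrix):
    """Stand-in for a matrix-aware update: rescale a whole matrix to unit Frobenius norm."""
    norm = math.sqrt(sum(v * v for row in matrix for v in row))
    return [[v / norm for v in row] for row in matrix]


def flatten_with_matrix_update(grads, matrix_update):
    """Flatten a list of gradients into one 1D buffer.

    2D gradients (lists of rows) get `matrix_update` applied while their shape
    is still known; 1D gradients are copied through unchanged.
    """
    flat = []
    for grad in grads:
        if grad and isinstance(grad[0], list):
            for row in matrix_update(grad):  # matrix structure still available here
                flat.extend(row)
        else:
            flat.extend(grad)
    return flat


grads = [
    [[3.0, 0.0], [0.0, 4.0]],  # a 2x2 weight-matrix gradient
    [1.0, 2.0],                # a bias gradient, left untouched
]
print(flatten_with_matrix_update(grads, frobenius_normalize))
```

Once the buffer is flat, the boundaries and shapes of the individual matrices are no longer visible to the optimizer step, so any update that depends on matrix structure has to run before this point.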

However, let’s pump the brakes on the triumphalism. Muon’s current strength is explicitly tied to the hidden 2D weights. The notes mention that embeddings, layer norms, and other parameters outside the hidden weight matrices are not its target. This isn’t a universal Adam killer; it’s a specialist, a high-performance engine for a specific, critical part of the transformer anatomy. The art will be in the hybrid systems: knowing which parameters to hand to Muon and which to leave in the care of Adam or SGD. The early adopters are figuring this out, but for the broader community, it introduces a new layer of hyperparameter and architectural decision-making. Will we see a rise of “Muon-aware” architecture designs that maximize the proportion of eligible 2D weights? Almost certainly.
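The routing decision in a hybrid setup can be sketched as a simple filter over parameter names and shapes. This is an assumption-laden illustration rather than any framework's API: the function name, the parameter names, and the keyword list used to exclude embedding-like matrices are all hypothetical, and real projects may draw the line differently.

```python
def split_params(named_shapes, exclude_keywords=("embed", "lm_head")):
    """Split parameters into a Muon group and a fallback (Adam/SGD) group.

    A parameter goes to Muon only if it is a 2D matrix and its name does not
    contain an excluded keyword. The keyword list is a hypothetical heuristic:
    embedding tables are 2D tensors too, so shape alone cannot exclude them.
    """
    muon, fallback = [], []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2
        excluded = any(keyword in name for keyword in exclude_keywords)
        if is_matrix and not excluded:
            muon.append(name)
        else:
            fallback.append(name)
    return muon, fallback


# Hypothetical parameter names and shapes for a small transformer block.
shapes = {
    "tok_embed.weight": (50000, 768),
    "blocks.0.attn.qkv.weight": (2304, 768),
    "blocks.0.mlp.fc.weight": (3072, 768),
    "blocks.0.mlp.fc.bias": (3072,),
    "blocks.0.ln.weight": (768,),
    "lm_head.weight": (50000, 768),
}
muon_names, fallback_names = split_params(shapes)
print("Muon:", muon_names)
print("Fallback:", fallback_names)
```

The two name lists would then feed two separate optimizers or parameter groups, each with its own learning rate, which is exactly the extra layer of decision-making the paragraph above warns about.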

This development also subtly shifts the power dynamics of open-source AI. Tools like NanoGPT and nanochat, where Muon gained early traction, are proving grounds. When a method born in a speedrunning challenge gets adopted into the training pipeline of trillion-parameter flagship models, it demonstrates a remarkable pipeline from grassroots optimization to industrial scale. It suggests that the next great training algorithm might not come from a corporate research lab’s secret project, but from a clever tweak on a public leaderboard, validated by the community and then weaponized by the giants.

So, is the Muon Optimizer the new Adam? Not yet. Adam’s generality and simplicity are still powerful. But Muon represents a more disciplined, mathematically principled path forward for the specific, massive bottleneck of training foundation models. It’s a signal that the era of throwing ever-more memory and compute at the problem with blunt instruments is giving way to an era of precision. DeepSpeed’s support is the bridge that makes this precision accessible. The labs that have already crossed that bridge are training the next generation of AI. The rest of us are now scrambling to catch up, and the performance delta is starting to look like a canyon. The optimizer wars are no longer theoretical; they are happening in production, and Muon just fired the most convincing shot yet.

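Official Muon support in DeepSpeed looks like a routine technical update, but behind it lies a quiet and fierce arms race in AI training. With Moonshot AI’s Kimi-K2, Zhipu’s GLM-5, and even the rumored DeepSeek-V4 all betting on this newcomer, we have to ask: is Muon a genuinely disruptive advance, or just another “optimizer fashion” over-packaged by the halo of the big labs?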

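Muon’s core selling point is clear: it is designed specifically for the 2D weights of neural networks, and it orthogonalizes the momentum matrix through Newton-Schulz iterations. This mathematically elegant operation targets a pain point in Transformer training: gradient updates tend to have a very high condition number and are nearly low-rank, with a handful of singular directions dominating everything. By orthogonalizing, Muon forces all singular values toward balance, letting update directions that were drowned out but are crucial for generalization “be heard.” The results are tempting: 35% faster than AdamW on the NanoGPT speedrun benchmark, and 25% faster to reach GPT-2 XL performance at the 1.5B-parameter scale. It also needs only one momentum buffer, compared with Adam’s two, which cuts the memory footprint in half.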

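This sounds like the “free efficiency” every training team dreams of. But in the world of technology, the free lunch is often the most expensive. Muon’s orthogonalization step is itself compute-intensive. When training at extreme scale (such as the 1.6T-parameter DeepSeek-V4 mentioned in the article), how large is the extra compute cost of this step? Can the memory it saves and the convergence speed it gains reliably cover that added cost? And in distributed training in particular, given the complexity of overlapping communication with computation, could the orthogonalization step become a new bottleneck? The article gives no detailed analysis at extreme scale, and that is precisely the most critical information gap.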

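Even more interesting is Muon’s adoption path in the industry. It was proposed experimentally by Keller Jordan in the NanoGPT community, used by Andrej Karpathy in his “nanochat” project, and then a variant, MuonClip, was used by Moonshot AI for its flagship model Kimi-K2. Zhipu AI even developed a dedicated “Muon Split” technique to close the performance gap that appears when Muon is used with the MLA architecture. This path clearly shows a pattern: frontier labs are rapidly industrializing the results of community hacks and in turn endorsing them, which legitimizes the technique. That is healthy in itself, but the risk is that when one or a few leading institutions heavily promote a technical component, a strong halo effect forms, which may lead other teams to follow blindly without fully evaluating how well it fits their own architecture. Muon’s fixation on 2D weights means it naturally favors attention and MLP layers and can do nothing for parameters such as Embedding and LayerNorm. Could this “selective optimization” cause some kind of imbalance in the training dynamics of complex models?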

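The engineering effort the DeepSpeed team put into integrating Muon deserves credit. Rather than crudely operating on the flattened gradient buffers, they pushed the update logic down into the get_flat_partition stage of the ZeRO optimizer (Stage 1 & 2), where the parameter gradients have not yet been flattened, so the 2D structure of each matrix is preserved. This kind of “surgery” reflects a deep understanding of the underlying architecture. But it also raises a more fundamental question: is the current optimizer ecosystem too fragmented? Every new optimizer (Adam, AdamW, LAMB, Lion, Muon…) requires framework-specific adaptation, adding engineering complexity and maintenance burden. DeepSpeed has to keep providing first-class support for these “passing” optimizers or risk losing developers. Muon’s arrival is a sign of a thriving ecosystem, and it may also deepen the framework’s “optimizer fatigue.”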

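Ultimately, Muon’s rise reflects a trend in large-model training: as algorithmic and data innovation plateaus, engineering and mathematical optimization of the training process itself becomes the new place to look for breakthroughs. It is not magic but an exquisitely crafted mathematical tool that works under a specific assumption (gradients with a high condition number). Its value lies not in crushing everything on absolute performance but in giving top labs a new, possibly better “control knob.” For most small and mid-sized teams, the mature AdamW is probably still the safer choice; but for frontier players at the hundred-billion-parameter threshold who chase every bit of efficiency, Muon undoubtedly adds another sharp tool to the toolbox. This quiet evolution of optimizers will ultimately be judged by real data from large-scale training. Until then, cautious optimism is wiser than fervent hype.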

Disclaimer: The above content is generated by AI and is for reference only.

Training LLM Open Source
