Spectral Scaling Laws of Muon

Hot: 65, Quality: 75, Impact: 65

Analysis

The dirty secret of scaling large language models isn’t just data or compute—it’s the silent, creeping inefficiency of the tools we use to train them. The latest research on the Muon optimizer, a darling of the open-source community, throws a wrench into the comfortable assumption that what works at 8 billion parameters will seamlessly work at 800 billion. It’s not just about making the model bigger; it’s about the optimizer’s guts expanding unevenly, creating a hidden computational tax at the frontier.

Muon’s trick is elegant: it uses Newton-Schulz iterations to orthonormalize update matrices, preventing training from going off the rails. It’s fast, it’s effective, and it’s become the go-to for teams training the latest open models. But this new arXiv paper (2606.04058) provides the first rigorous stress test of its core assumption. By tracking the singular value spectrum of the momentum buffer across model sizes from 77 million to 2.8 billion parameters, the researchers reveal a predictable, yet troubling, law of scaling. After an initial burn-in, the singular values stabilize into a power-law relationship with model size—but the exponent depends critically on the layer’s depth.
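To make the orthonormalization step concrete, here is a minimal NumPy sketch. It assumes the classical cubic Newton-Schulz map X ← 1.5·X − 0.5·X·Xᵀ·X, which may differ from the tuned polynomial that Muon implementations use, and the helper names `newton_schulz` and `demo_singular_values` are illustrative rather than taken from the paper. The point is only to show how a fixed iteration count treats large and tiny singular values differently.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthonormalize G with the classical cubic Newton-Schulz map.

    Assumption: this is the textbook cubic iteration, not necessarily the
    polynomial used by any particular Muon implementation.
    """
    # Frobenius normalization puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic map.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def demo_singular_values(steps=5):
    """Build a 4x4 matrix with one tiny singular value and report the
    singular values after `steps` Newton-Schulz iterations."""
    rng = np.random.default_rng(0)
    U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    s = np.array([0.8, 0.5, 0.3, 1e-3])  # illustrative spectrum, not measured data
    G = U @ np.diag(s) @ V.T
    return np.linalg.svd(newton_schulz(G, steps), compute_uv=False)

if __name__ == "__main__":
    print("5 steps: ", demo_singular_values(5))
    print("25 steps:", demo_singular_values(25))
```

In this toy setup, five steps push the three larger singular values close to 1 while the direction that started near 1e-3 stays far below 1. Many more steps are needed before it catches up, which is the failure mode the article describes.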

Here’s the sharp takeaway: most layers, the early and middle workhorses of the network, scale mildly. Their singular values drop slowly as model size M increases (roughly as M^-0.25), meaning the standard, computationally cheap 5-step Newton-Schulz configuration remains adequate. You can keep scaling these layers without touching your optimizer settings. That’s the good news. The bad news lives at the tail end of the network.

Some of the final layers behave very differently, with singular values falling according to a much steeper power law (up to M^-0.96). At frontier scale (models with hundreds of billions or trillions of parameters), these layers will have tiny singular values that fall squarely into the regime where Newton-Schulz iteration fails to orthonormalize properly. The result? The optimizer’s update directions become muddled in these final layers, which are often assumed to handle nuance, composition, and the high-level reasoning we’re chasing. We’d be training our biggest, most expensive models with a broken tool exactly where we need precision most.
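The gap between the two exponents is easier to feel with numbers. The sketch below fits a power law in log-log space and extrapolates it. It uses the exponents 0.25 and 0.96 quoted above, but the measurement arrays are synthetic placeholders and the function names `fit_power_law` and `extrapolate` are hypothetical; none of this is data or code from the paper.

```python
import numpy as np

def fit_power_law(sizes, sigmas):
    """Fit sigma ~ c * M**(-alpha) by least squares in log-log space.

    Returns (alpha, c).
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(sigmas), 1)
    return -slope, float(np.exp(intercept))

def extrapolate(sigma_ref, size_ref, size_new, alpha):
    """Scale a reference singular value from size_ref to size_new."""
    return sigma_ref * (size_new / size_ref) ** (-alpha)

# Synthetic placeholder measurements (NOT data from the paper): an exact
# power law with exponent 0.96 sampled between 77M and 2.8B parameters.
sizes = np.array([77e6, 350e6, 1.3e9, 2.8e9])
sigmas_late = 3.0e7 * sizes ** (-0.96)
alpha_late, _ = fit_power_law(sizes, sigmas_late)

# Relative shrinkage when growing 100x, from 2.8B to 280B parameters.
mild = extrapolate(1.0, 2.8e9, 2.8e11, 0.25)
steep = extrapolate(1.0, 2.8e9, 2.8e11, 0.96)

if __name__ == "__main__":
    print(f"recovered exponent: {alpha_late:.3f}")
    print(f"mild layers keep  {mild:.3f} of their singular value")
    print(f"steep layers keep {steep:.4f} of their singular value")
```

Under these assumptions, a 100x larger model leaves mild layers with roughly a third of their singular value, while the steepest layers keep only about one percent.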

This is where the analysis gets opinionated. The paper’s proposed solution—a “principled, layer-aware recipe”—isn’t just a nice optimization. It’s an inevitable and mandatory shift in ML engineering practice. The one-size-fits-all optimizer configuration is dead for frontier training. The future is dynamic, heterogeneous compute allocation based on depth. You will need to run more Newton-Schulz iterations on your late layers, and fewer (or the same) on your early ones. The paper frames this as avoiding “unnecessary computation,” but let’s be blunt: it’s about preventing your massive training run from slowly poisoning its own most sensitive parameters.
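One way to picture a layer-aware configuration is to pick, per layer, the smallest iteration count that lifts its weakest relevant singular value above a target. The sketch below is not the paper’s recipe, which is not reproduced here. It assumes the classical cubic Newton-Schulz map (s ← 1.5·s − 0.5·s³ on each normalized singular value), a hypothetical target of 0.9, and made-up layer names and singular values.

```python
def ns_steps_needed(sigma, target=0.9, max_steps=200):
    """Smallest number of cubic Newton-Schulz steps that lifts a normalized
    singular value `sigma` in (0, 1] to at least `target`.

    Assumption: the scalar map s -> 1.5*s - 0.5*s**3 stands in for the
    optimizer's real polynomial. Returns None if max_steps is not enough.
    """
    if not 0.0 < sigma <= 1.0:
        raise ValueError("sigma must lie in (0, 1]")
    s = float(sigma)
    for k in range(max_steps + 1):
        if s >= target:
            return k
        s = 1.5 * s - 0.5 * s ** 3
    return None

def plan_layers(predicted_sigma, target=0.9):
    """Map each layer name to its own iteration count."""
    return {name: ns_steps_needed(s, target) for name, s in predicted_sigma.items()}

# Hypothetical predicted singular values for three layer groups.
predicted = {"early_block": 0.5, "middle_block": 0.3, "late_block": 1e-3}
plan = plan_layers(predicted)

if __name__ == "__main__":
    for name, steps in plan.items():
        print(f"{name}: {steps} Newton-Schulz steps")
```

The design choice worth noting is that the cost scales with the weakest layer only where it has to: early and middle blocks stay on a handful of steps, and only the late block pays for the extra iterations.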

This reveals a broader, often overlooked truth about scaling laws. We obsess over the scaling of loss with model size and data, but we pay too little attention to the scaling of the training dynamics themselves. The tools, the hyperparameters, the very fabric of the optimization process are not static constants in the face of size; they are dynamic systems with their own emergent behaviors. What looks like a stable hyperparameter at 7B is a latent bug at 70B.

The researchers deserve credit for mapping this terrain. Their finding that the problematic layers are late in the network aligns with the intuition that earlier layers learn general features while later ones become more specialized and sensitive. Their power-law quantification moves the discussion from anecdote to engineering blueprint. It’s no longer “we might need to tweak things”; it’s “for a 100B model, layers X through Z require Y% more NS iterations to maintain update fidelity.”

For practitioners, this means the next generation of optimizers and training frameworks must bake in this layer-wise adaptivity. Hardcoding iteration counts will be a recipe for inefficiency or, worse, silent model degradation. The computational overhead of running more iterations on a subset of layers is trivial compared to the cost of a wasted, multi-million dollar training run that converges to a suboptimal model because its final layers were poorly optimized.

So, while Muon and its orthonormalized kin are brilliant leaps forward, this paper is a crucial reality check. It tells us that scaling is a game of compounding subtleties. The optimizer that democratized state-of-the-art training for the open-source world now faces its own scaling limit—not in theory, but in the concrete silicon of the next-generation datacenter. The race to the frontier just got more nuanced. It’s not enough to throw more GPUs at the problem; you have to make sure the engine tuning keeps pace with the size of the beast.

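When the optimizer becomes the gearbox of a large model, are we neglecting its most delicate tuning? Thanks to its orthogonalized update rule, the Muon optimizer has become “standard equipment” for training frontier large models, and that in itself is no surprise. What is genuinely unsettling is that this paper says out loud what few have acknowledged: the key assumption behind a core algorithmic component we apply to these behemoths is quietly failing as model scale grows, and the industry as a whole is almost blind to it.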

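In plain language, the paper’s core finding is that the Newton-Schulz (NS) iteration Muon uses to keep update directions orthogonal is “nearsighted.” It only handles the “strong” singular values, while the “weak” directions are crudely let through. More importantly, as models jump from tens of millions of parameters to billions, the distribution of these weak directions follows entirely different patterns in different layers. The early layers stay “calm” and change little, while the deep layers shrink sharply, and NS iteration quickly falls short. It is like a car designed for the plains: driven up onto a high plateau, the engine starts to run short of oxygen and the power curve changes completely, yet we keep using the same throttle calibration.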

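This exposes a deep contradiction in current large-model training: we marvel at the emergent capabilities that scaling laws bring, while cutting corners in engineering by reusing “rule-of-thumb values” tailored to small-scale experiments. Five NS steps are flawless on a 77M-parameter toy model and begin to look stretched on a 2.8B-parameter model; will they still work when the trillion-parameter giants arrive? The paper answers with a cold power law: no. For some deep layers, the scaling exponent of the singular value spectrum is close to -1, meaning that every 10x increase in scale shrinks their singular values to about one tenth of their previous size. The existing NS configuration simply fails, either distorting the optimization direction or forcing more iterations to be piled on, adding to an already steep compute bill.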

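This is where the real value surfaces. The paper does more than describe a phenomenon; it writes a clever “anti-waste” prescription for an era in which compute is king. It offers a layer-by-layer “minimal effective configuration” grounded in mathematical regularities. That means we can finally move from alchemy-style blind piling-on of resources toward finer-grained algorithmic tuning. Rather than burning expensive compute on NS iterations uniformly across all layers, the budget can be allocated precisely to the layers that truly need it (the deep ones, for example) and saved elsewhere. It is a shift in mindset from “brute-force aesthetics” to “computational rationality.”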

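Sadly, this kind of rational voice is often drowned out by the industry’s clamor for faster, bigger, and more brute-force. While everyone celebrates press releases about trillion-parameter models, few ask whether the meshing efficiency of some gear inside the “engine” that drives training has already dropped by 30%. The value of this paper is that it forces us to sober up for a moment from the blind worship of model scale and to examine whether the basic tools we depend on remain reliable under the magnifying glass of scale. It reminds us that the road to AGI requires not only cruder stacking of compute but also this kind of deep insight into, and elegant tuning of, algorithmic limits. The next team to train a next-generation foundation model while ignoring these issues would be no different from someone using an outdated map to drive a car with the wrong tire pressure while claiming to be speeding toward the future.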

Disclaimer: The above content is generated by AI and is for reference only.
