
Research Papers 2d ago • Updated 2d ago 46

Muon$^p$: Muon with Fractional Spectral Powers

Analysis

TL;DR

New optimizer Muon^p interpolates between Muon and gradient descent.
Uses fractional spectral power updates instead of full singular value flattening.
Computed efficiently via new low-degree bivariate polynomial recurrence.
Shown to improve finetuning performance on billion-scale language models.
Provides theoretical framework via Schatten q-norm loss improvement.

Key Data

Entity	Key Info	Data/Metrics
Optimizer	Muon^p	Fractional spectral-power update USᵖV⊤ for p ∈ (0,1)
Theoretical Relationship	Maximized linear loss improvement	Under Schatten q-norm for q = 1 + 1/p
Computational Method	Bivariate polynomial recurrence	Low-degree, odd, uses only matrix multiplications
Empirical Validation	Task	Billion-scale model finetuning
Empirical Validation	Improvement	Validation perplexity and downstream task performance
Limitation	Analyzed via	Spectral geometry

Deep Analysis

The introduction of Muon^p isn't just another incremental optimizer tweak; it's a direct challenge to an assumption baked into Muon. The original Muon optimizer operates on a powerful but blunt principle: replace the gradient G = USV⊤ with its polar factor UV⊤, which flattens every nonzero singular value to one. The implicit assumption is that the magnitudes of the singular values (the spectrum) carry noise or harmful information, and that flattening is always stabilizing. Muon^p says: not so fast. By raising the singular values to a fractional power p, where p = 0 recovers full Muon and p = 1 recovers standard gradient descent, it introduces a dial, not a switch. This is a shift from binary, ideology-driven design ("spectra are bad!") to a principled, interpolative one.
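
The dial described above can be made concrete with a small reference computation. The sketch below is not the paper's algorithm: it uses an explicit SVD from NumPy purely to show what USᵖV⊤ means at the two endpoints. The function name `spectral_power_update` and the random test matrix are illustrative assumptions.

```python
import numpy as np

def spectral_power_update(grad: np.ndarray, p: float) -> np.ndarray:
    """Reference U S^p V^T via an explicit SVD (for checking, not for speed).

    Assumes grad has full rank, so no singular value is exactly zero.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    # Scale each left singular vector (column of u) by s_i ** p.
    return (u * s**p) @ vt

rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 4))           # hypothetical gradient matrix

polar = spectral_power_update(grad, 0.0)     # p = 0: U V^T, the Muon direction
half = spectral_power_update(grad, 0.5)      # p = 0.5: partially flattened
same = spectral_power_update(grad, 1.0)      # p = 1: the gradient itself

print(np.linalg.svd(grad, compute_uv=False))   # original spectrum
print(np.linalg.svd(half, compute_uv=False))   # square roots of the above
print(np.linalg.svd(polar, compute_uv=False))  # flattened spectrum
```

An explicit SVD like this is exactly what a practical optimizer would want to avoid at scale; it is useful here only as a ground truth for small matrices.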

The brilliance lies in the implementation. A skeptic's immediate objection would be that computing fractional powers of the singular values through an explicit SVD is prohibitively expensive, destroying the very efficiency that makes Muon attractive. The authors first prove that fixed univariate polynomial iterations, the kind Muon itself relies on, cannot compute this operation, which is a key theoretical negative result. Their solution, a set of low-degree bivariate recurrences, is a neat piece of numerical analysis. It preserves the matrix-multiplication-only workflow, meaning this isn't just a theoretical curiosity; it's a drop-in, scalable alternative. This clever engineering bridges the gap between a nuanced mathematical insight and practical utility, a gap in which most academic optimizer proposals die.
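
To give a feel for what a matrix-multiplication-only workflow looks like, here is the classic cubic Newton-Schulz iteration for the p = 0 case (the polar factor UV⊤). This is a textbook univariate iteration swapped in for illustration. It is not the paper's bivariate recurrence, whose coefficients are not given in this summary, and the function name and step count are assumptions.

```python
import numpy as np

def polar_newton_schulz(grad: np.ndarray, steps: int = 40) -> np.ndarray:
    """Approximate the polar factor U V^T using only matrix multiplications."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the iteration.
    x = grad / np.linalg.norm(grad)
    for _ in range(steps):
        # Odd cubic polynomial in x: each singular value s maps to
        # 1.5 * s - 0.5 * s**3, whose attracting fixed point is s = 1.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 4))           # hypothetical gradient matrix

# Compare against the SVD-based polar factor on a small example.
u, _, vt = np.linalg.svd(grad, full_matrices=False)
error = np.abs(polar_newton_schulz(grad) - u @ vt).max()
print(error)
```

The iteration acts on each singular value independently and never forms U, S, or V. The same property is what the summary attributes to the bivariate recurrence for fractional p.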

Where it gets truly interesting is the why. The paper's theoretical core shows that the Muon^p update maximizes the linear loss improvement under a Schatten q-norm constraint with q = 1 + 1/p, which gives p a geometric interpretation. It suggests that the "optimal" amount of spectral flattening depends on the geometry of the loss landscape. This reframes the optimizer selection problem: instead of searching for a universally superior algorithm, we should match the optimizer's implicit spectral bias to the geometry of the task at hand. The strong empirical results on finetuning billion-scale models make intuitive sense here. Pre-trained weights already encode a complex, learned representation. Aggressively flattening their gradient updates (full Muon) might violently disrupt this learned structure. Muon^p, by allowing some singular value information to flow through, can make smaller, more respectful adjustments to the existing representation, akin to careful retuning versus a complete overhaul.
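
The relation q = 1 + 1/p can be recovered with a standard argument (von Neumann's trace inequality followed by Hölder's inequality). The derivation below is a sketch under that assumption, not necessarily the paper's own proof. Here G = USV⊤ has singular values s_i, and q* denotes the conjugate exponent of q.

```latex
\begin{aligned}
&\max_{\|X\|_{S_q} \le 1} \langle G, X \rangle
  \;\le\; \max_{\|X\|_{S_q} \le 1} \sum_i s_i\,\sigma_i(X)
  \;\le\; \|G\|_{S_{q^*}}, \qquad \tfrac{1}{q} + \tfrac{1}{q^*} = 1, \\
&\text{with equality when } X = U\,\mathrm{diag}(x)\,V^\top,\quad
  x_i \propto s_i^{\,q^*-1} = s_i^{\,1/(q-1)}, \\
&p = \frac{1}{q-1} \;\Longleftrightarrow\; q = 1 + \frac{1}{p}, \qquad
  p \to 0 \Rightarrow q \to \infty \ (\text{spectral norm, } UV^\top), \quad
  p = 1 \Rightarrow q = 2 \ (\text{Frobenius norm, } X \propto G).
\end{aligned}
```

The two limits match the endpoints of the interpolation: the spectral-norm case gives Muon and the Frobenius-norm case gives gradient descent.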

This work exposes a blind spot in much of optimization research: the fixation on the gradient's direction while the information latent in the gradient's singular spectrum is often ignored. Muon^p makes that spectrum a first-class citizen again, but in a controlled, tunable manner. The limitation the paper notes, that there are settings where Muon^p is less suitable, is equally telling. It acknowledges there's no free lunch. Tasks where the optimal path lies in a direction orthogonal to the principal singular vectors might suffer, for example. This nuanced take, analyzing failures through spectral geometry, is more valuable than blanket claims of superiority. Ultimately, Muon^p is a step toward a more mature, problem-specific theory of optimization, moving past one-size-fits-all heuristics and toward algorithms that can adapt their update philosophy to the data they encounter.

Industry Insights

Expect a new wave of "spectral-aware" optimizers tailored for specific stages of the model lifecycle (e.g., more aggressive flattening for pre-training, fractional updates for fine-tuning).
The efficiency of the bivariate recurrence method could enable real-time spectral adaptation of optimizers within a single training run, based on monitored loss landscape geometry.
This work will accelerate research into the "spectral bias" of different optimizers, making SVD-based diagnostics a standard tool for analyzing training instability.
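
As a rough illustration of what an SVD-based diagnostic might look like, the sketch below summarizes a gradient matrix's singular spectrum. The choice of statistics (top singular value, condition number, stable rank) and the function name `spectrum_summary` are assumptions made here, not something the paper prescribes.

```python
import numpy as np

def spectrum_summary(grad: np.ndarray) -> dict:
    """Summarize the singular spectrum of a (full-rank) gradient matrix."""
    s = np.linalg.svd(grad, compute_uv=False)  # sorted in descending order
    return {
        "top": float(s[0]),                                # largest singular value
        "condition": float(s[0] / s[-1]),                  # spread of the spectrum
        "stable_rank": float((s**2).sum() / s[0] ** 2),    # ||G||_F^2 / ||G||_2^2
    }

grad = np.diag([3.0, 4.0])                     # toy matrix with singular values 4 and 3
summary = spectrum_summary(grad)
print(summary)
```

A stable rank near 1 means a single direction dominates the gradient. A value near the matrix rank means the spectrum is already close to flat, so flattening it changes little.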

FAQ

Q: Why is preserving some singular value information in gradients beneficial?
A: Aggressive flattening (as in Muon) discards the relative importance of different update directions. In fine-tuning, this can lead to overly large, destructive updates to a carefully pre-trained model's weight structure. Muon^p allows the optimizer to respect the original gradient's magnitude-based priorities to a tunable degree.
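
A small numeric sketch makes the "relative importance" point tangible. It works directly on a hypothetical list of singular values (no matrices involved) and reports how much of the update's Frobenius energy each singular direction receives for a few values of p. The function name and the example spectrum are assumptions.

```python
def direction_weights(singular_values, p):
    """Share of the update's squared Frobenius norm per direction for U S^p V^T."""
    energies = [s ** (2 * p) for s in singular_values]
    total = sum(energies)
    return [e / total for e in energies]

spectrum = [10.0, 1.0, 0.1]   # hypothetical gradient singular values
for p in (0.0, 0.5, 1.0):
    print(p, [round(w, 4) for w in direction_weights(spectrum, p)])
```

With p = 0 every direction gets an equal share regardless of the gradient, with p = 1 the top direction takes about 99% of the energy, and intermediate p sits in between.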

Q: How does the computational complexity compare to the original Muon and Adam?
A: Muon^p's core operation (the bivariate recurrence) uses only matrix multiplications, so it matches Muon's computational complexity. It avoids the explicit SVD or eigendecomposition that a direct evaluation of USᵖV⊤ would require, which keeps it practical for large-scale training.

Q: When would I choose Muon^p over the original Muon optimizer?
A: Use Muon^p for tasks where preserving aspects of the learned representation is critical, most notably when fine-tuning large language models (the setting evaluated in the paper) and, plausibly, other pre-trained models such as computer vision models. The original Muon (p=0) might still be preferable in early pre-training stages where aggressive spectrum flattening helps establish broad, stable features quickly.

TL;DR

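The new optimizer Muon^p replaces Muon's full spectral flattening UV⊤ with the fractional power USᵖV⊤ (p ∈ (0,1)), striking a balance between gradient descent and spectral flattening.
The paper proves that this update cannot be computed by fixed univariate polynomial iterations, and derives a bivariate recurrence that approximates it at matrix-multiplication-only cost.
Mathematically, the update maximizes linear loss improvement under the Schatten q-norm with q = 1 + 1/p.
Empirically, Muon^p improves validation perplexity and downstream task performance when finetuning billion-scale models.
The paper also analyzes the settings where it is less suitable, showing when keeping part of the singular spectrum brings a real benefit.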

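Key Data

Entity	Key Info	Data/Metrics
Muon^p optimizer	Core update formula	$US^pV^\top$, where $p \in (0,1)$
Optimization objective	Norm under which linear loss improvement is maximized	Schatten $q$-norm, $q=1+\frac{1}{p}$
Empirical validation	Model scale	Billion-scale models
Main use case	Where it helps most	Finetuning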

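Deep Analysis

This paper hits a very subtle pain point in current optimizer design: by replacing the gradient USV⊤ with the polar factor UV⊤, Muon bluntly "flattens" the singular spectrum. As a way of adjusting step sizes this looks theoretically consistent and elegant, but at bottom it discards information. The authors recognize that the singular values of the gradient matrix themselves carry valuable information about how the loss surface behaves along each direction, and that throwing all of it away amounts to cutting off one's own arm.

The essence of Muon^p is its sense of proportion. Its fractional-power update USᵖV⊤ builds an adjustable bridge between the fully flattened Muon (p → 0) and gradient descent (p → 1). The parameter p is no longer an either-or switch but a fine-grained dial. The authors do not stop at the concept; they confront the engineering problem head-on. They prove that fixed univariate polynomial iterations cannot compute fractional spectral powers, which seems to close off one route but forces out the core technique: approximating the update with a low-degree bivariate recurrence built from matrix multiplications. This is a typical research path of pursuing practicality under theoretical constraints.

The theoretical connection goes deeper. Tying the update directly to loss improvement under a specific Schatten q-norm gives p a clear geometric meaning. The relation q = 1 + 1/p lifts the choice of this hyperparameter from empirical tuning to an explicit statement about the geometry of the loss landscape. The question is no longer "which value happens to work" but "which p matches the structure you believe the loss landscape has".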
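
The relation q = 1 + 1/p can be checked numerically at the level of singular values. The sketch below assumes the update shares the gradient's singular vectors, so only singular values matter, and it uses hypothetical values for the spectrum and for p. It compares the linear gain of the sᵖ profile with random alternatives of the same Schatten q-norm.

```python
import random

def lp_norm(values, q):
    """l_q norm of nonnegative singular values, i.e. the Schatten-q norm."""
    return sum(v ** q for v in values) ** (1.0 / q)

def linear_gain(grad_sv, update_sv, q):
    """<G, X> for aligned singular vectors, with X rescaled to unit Schatten-q norm."""
    inner = sum(g * x for g, x in zip(grad_sv, update_sv))
    return inner / lp_norm(update_sv, q)

p = 0.5
q = 1.0 + 1.0 / p                       # q = 3 for p = 0.5
grad_sv = [4.0, 2.0, 1.0, 0.5]          # hypothetical singular values of G

# Gain of the Muon^p profile s_i ** p, and the Hoelder upper bound,
# which is the Schatten-(1 + p) norm of G.
best = linear_gain(grad_sv, [s ** p for s in grad_sv], q)
dual = lp_norm(grad_sv, 1.0 + p)

# Random competing profiles never beat the s_i ** p profile.
random.seed(0)
trials = [
    linear_gain(grad_sv, [random.uniform(0.01, 1.0) for _ in grad_sv], q)
    for _ in range(1000)
]
print(best, dual, max(trials))
```

The first two printed numbers agree up to rounding and the third is no larger, which is the sense in which the fractional-power profile is optimal for this norm.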

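The authors stress that the optimizer is especially effective for finetuning, which is an insightful point. A pre-trained model has already learned stable feature representations and a stable gradient landscape. During finetuning we want neither steps so large that they destroy learned knowledge (fully flattened Muon may be too aggressive) nor steps that are too small and noisy in direction. Keeping part of the singular spectrum, rather than all of it, may be exactly a "gentle recalibration" of pre-trained knowledge. When training from scratch, the model needs to explore and build features more thoroughly, and more aggressive spectral flattening or standard SGD may be the better fit. By analyzing its limitations through spectral geometry, the paper faces the reality that there is no silver bullet.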

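Industry Insights

Optimizer design is shifting from the pursuit of "globally optimal" theoretical properties toward context-specific adaptation to the training stage (pre-training, finetuning, RLHF) and to task characteristics.
"Spectral sculpting" of the gradient matrix will become an important direction for fine-grained training: a weight update is no longer a plain rescaling of the gradient but a targeted reshaping of the structure of gradient space.
Theoretical breakthroughs often come from finding the "continuous interpolation space" between old and new methods and then using engineering ingenuity to realize it under computational constraints; Muon^p is a model example.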

FAQ

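Q: Is Muon^p better than standard Muon? Why?
A: Not necessarily. In finetuning tasks that need fine adjustments on top of a solid pre-trained base, keeping part of the singular-value information is usually more favorable. When training from scratch or when broad exploration is needed, the strong flattening of standard Muon may still have the edge.

Q: How should p be chosen?
A: The paper ties p to the optimization objective (the Schatten q-norm). In practice, p is a hyperparameter to tune for the specific task and model: the smaller p is, the closer the update is to Muon's strong spectral flattening, and the larger p is, the closer it is to gradient descent.

Q: Does Muon^p add significant computational overhead?
A: One of the paper's core contributions is an approximation algorithm that keeps the same computational complexity as Muon (relying only on matrix multiplications), so the extra overhead is small and the cost is still dominated by the matrix multiplications themselves.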

Disclaimer: The above content is generated by AI and is for reference only.
