Feature Lottery? A Bifurcation Theory of Concept Emergence

This paper introduces a real-time, label-free method to detect structural representation transitions in neural networks by analyzing a passive Gaussian mixture model probe's dynamics. It reveals that representational shifts correspond to a **supercritical pitchfork bifurcation** in the loss Hessian, producing a universal phase coordinate β(t)/β_c(t) derived from hidden states alone. This coordinate empirically validates four distinct transition regimes across domains and provides a rigorous expl



In-Depth Analysis

Background

Understanding when and how neural networks develop structured representations during training is crucial for interpretability and training diagnostics. Traditional methods for identifying these critical moments are retrospective and rely on labeled data or downstream metrics. This work moves the field towards a real-time, predictive framework by connecting representation dynamics to bifurcation theory in dynamical systems.

Key Points

1. Theoretical Framework: Bifurcation Theory of Representations

The core innovation is modeling the evolution of representations as a dynamical system. A passive GMM probe attached to the encoder's hidden states acts as a sensor.
The onset of structure (where hidden states become linearly separable) is shown to coincide with a supercritical pitchfork bifurcation.
The bifurcation is driven by the loss Hessian: it occurs when the relevant Hessian component (an eigenvalue) crosses zero. The critical value β_c is the theoretically predicted point of that zero-crossing.
The dynamic ratio β(t)/β_c(t) serves as a universal, label-free phase coordinate. It is computable purely from the network's hidden states at any moment t, enabling real-time monitoring.
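
For readers unfamiliar with the term, the textbook normal form of a supercritical pitchfork bifurcation is sketched below. Here x is a generic order parameter and μ a control parameter. Identifying μ with β/β_c - 1 is an assumption made purely for illustration, since this summary does not reproduce the paper's actual equations.

```latex
% Textbook normal form; the mapping of \mu onto \beta/\beta_c is assumed, not taken from the paper.
\dot{x} = -\frac{\partial V}{\partial x} = \mu x - x^{3},
\qquad V(x) = -\tfrac{\mu}{2}\,x^{2} + \tfrac{1}{4}\,x^{4},
\qquad \mu \;\sim\; \frac{\beta}{\beta_c} - 1 \quad \text{(assumed)}

x^{*} = 0 \quad (\text{stable for } \mu < 0,\ \text{unstable for } \mu > 0),
\qquad x^{*} = \pm\sqrt{\mu} \quad (\text{stable, exist only for } \mu > 0)

V''(0) = -\mu \quad \Longrightarrow \quad \text{the curvature at } x = 0 \text{ changes sign exactly at } \mu = 0
```

In this picture the symmetric state x = 0 loses stability when the curvature changes sign, and two mirror-image states branch off continuously, which is the sense in which a Hessian component crossing zero marks the onset of structure.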

2. Empirical Validation Across Diverse Settings

The β/β_c coordinate successfully identifies and characterizes four distinct transition regimes:

Sparse Autoencoders (SAEs) on Language Models (Pythia): Tracks the formation of interpretable features.
Self-Supervised Learning (SSL) on CIFAR: Monitors the emergence of semantic structure.
Grokking in Modular Arithmetic: Explains the delayed transition from memorization to generalization.
Feature Lottery in SAE Training: Demonstrates that early-stage atom purity (at ~5% of training) robustly predicts final convergence purity, with top early atoms achieving >12x baseline purity at convergence. This makes a feature's eventual interpretability predictable very early in training.
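
One way to quantify a claim like "early purity predicts final purity" is a rank correlation between the two measurements, alongside a purity-over-baseline ratio. The sketch below is a minimal standard-library illustration, not the paper's evaluation code. The function names and the toy numbers are hypothetical, and the summary does not say which statistic the authors actually used.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def purity_ratio(purity, baseline):
    """How many times above the baseline purity an atom sits."""
    return purity / baseline


# Made-up values used only to exercise the functions; not results from the paper.
early_purity = [0.10, 0.40, 0.25, 0.05]
final_purity = [0.20, 0.90, 0.50, 0.08]
rho = spearman(early_purity, final_purity)
```

A rank correlation is used here because it only asks whether the ordering of atoms is preserved between the early checkpoint and convergence, which is the weakest reading of "predicts".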

3. Explanation for Grokking: Delayed Symmetry-Breaking

A major theoretical contribution is explaining the characteristic delayed escape in grokking (where generalization happens long after memorization).
Under finite dissipation in the training dynamics (e.g., from weight decay or noise), macroscopic symmetry-breaking can lag the initial bifurcation point (where β reaches β_c, i.e. β/β_c = 1) by orders of magnitude.
Microscopically, the bifurcation creates a shared unstable subspace in parameter space. The system cannot "choose" a symmetry-breaking direction immediately; it undergoes collective, slow dynamics before committing. This provides a rigorous dynamical account of grokking's delays.
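
To see why visible symmetry-breaking can lag the bifurcation point, consider a toy one-dimensional model. The sketch below integrates the textbook pitchfork normal form dx/dt = μx - x³, where μ > 0 stands in for the distance past the critical point. This is an illustrative stand-in and not the paper's model: it is deterministic and ignores dissipation, noise, and the shared unstable subspace. The names `escape_time` and `linear_estimate` are hypothetical.

```python
import math


def escape_time(mu, x0, threshold=0.5, dt=1e-2, max_steps=5_000_000):
    """Integrate dx/dt = mu*x - x**3 with forward Euler, starting from x0 > 0.

    Returns the time at which x first reaches threshold * sqrt(mu),
    i.e. a fixed fraction of the symmetry-broken fixed point sqrt(mu).
    """
    target = threshold * math.sqrt(mu)
    x = x0
    for step in range(max_steps):
        if x >= target:
            return step * dt
        x += dt * (mu * x - x ** 3)
    raise RuntimeError("did not escape within max_steps")


def linear_estimate(mu, x0, threshold=0.5):
    """Escape time from the linearised dynamics x(t) = x0 * exp(mu * t)."""
    return math.log(threshold * math.sqrt(mu) / x0) / mu
```

In this toy model the delay scales like ln(1/x0)/μ: it grows as the initial perturbation x0 shrinks and diverges as μ approaches zero, so a system that has only just crossed the critical point can sit near the symmetric state for a long time before it visibly commits to a direction.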

4. Practical Implications for Training Diagnostics

The β/β_c ratio acts as an early-warning indicator for training health. It can detect the onset of useful structure, the crystallization of specific features, and episodes of representational collapse before they manifest in downstream task metrics.
This enables proactive intervention during training rather than post-hoc analysis.
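
As a rough illustration of how such an indicator might be wired into a training loop, the sketch below scans a logged series of β/β_c values and reports where the ratio first stays above the critical value. It assumes the ratio has already been computed elsewhere, since this summary does not describe that computation. The function name and the `patience` debouncing parameter are hypothetical additions, not part of the paper.

```python
def first_sustained_crossing(ratios, critical=1.0, patience=3):
    """Return the index at which `ratios` first stays above `critical`
    for `patience` consecutive samples, or None if that never happens."""
    run = 0
    for i, r in enumerate(ratios):
        # Count consecutive samples above the critical value; reset on any dip.
        run = run + 1 if r > critical else 0
        if run == patience:
            return i - patience + 1
    return None


# Hypothetical logged ratios, one per checkpoint.
logged = [0.5, 0.9, 1.1, 0.95, 1.05, 1.2, 1.4]
onset = first_sustained_crossing(logged)
```

Requiring several consecutive samples above the threshold is a simple guard against a noisy ratio briefly flickering across 1. Whether such smoothing is needed in practice depends on how noisy the measured coordinate is.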

Significance

This work bridges dynamical systems theory and deep learning practice. It provides a first-principles, real-time diagnostic tool (β/β_c) that is both theoretically grounded and empirically robust. Its key insight, that representational emergence is a phase transition governed by bifurcation dynamics, offers a unified lens for phenomena as diverse as grokking, feature formation in SAEs, and SSL. The practical predictive power for feature convergence and training health marks a substantial step towards more controllable and interpretable neural network training.

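Background and Problem

During training, neural networks pass through abrupt moments at which their representations become structured; grokking is one example. Traditionally, identifying these moments has relied on retrospective, label-dependent evaluation metrics, which cannot be monitored in real time and reveal little about the underlying dynamics. The paper sets out to answer one central question: is there a universal, label-free physical quantity that can describe and predict these critical phase transitions in real time?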

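Core Content

The theoretical core of the paper is an analogy between neural network training dynamics and a supercritical pitchfork bifurcation in a physical system.

Model and findings: The authors attach a passive Gaussian mixture model (GMM) probe to the encoder being trained. Their analysis finds that the emergence of representational structure corresponds to a pitchfork bifurcation driven by the Hessian of the loss.
Core coordinate β/β_c: The system has a theoretically predictable zero-crossing (the critical value β_c). Comparing the network's current state β with β_c gives the dynamic ratio β(t)/β_c(t). This is a label-free "phase coordinate" that can be computed from hidden states alone, and it gives a unified description of several kinds of training dynamics.
Experimental validation: The coordinate successfully predicts and distinguishes four different transition regimes, across settings that include sparse autoencoders on language models (Pythia), self-supervised learning (CIFAR), and grokking in modular arithmetic.
Explanations of key phenomena:
- Delayed transition (a dynamical explanation of grokking): Under finite dissipation, macroscopic symmetry-breaking (the point at which structure finally becomes visible) can lag the initial critical point (β = β_c) by several orders of magnitude. This gives a rigorous dynamical explanation for the observed "delayed grokking".
- Microscopic mechanism: The bifurcation creates a shared unstable subspace, forcing features to break symmetry collectively.
- "Feature lottery": In sparse autoencoder (SAE) training, a feature's eventual interpretability is already predictable very early on (for example, at 5% of training). Early-stage atom purity robustly predicts purity at convergence, and the top early atoms reach more than 12 times the baseline purity at convergence.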

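Significance and Impact

Theoretical contribution: The paper provides a unified dynamical framework for explaining the emergence of neural network representations, connecting the abstract training process to the classical theory of non-equilibrium phase transitions.
Practical value: The β/β_c coordinate can serve as a powerful early-warning indicator:
- Detecting the emergence of usable structure.
- Monitoring the crystallization of feature identity.
- Warning of representational collapse several epochs in advance, well before downstream metrics react.
Implications for practice: The "feature lottery" phenomenon suggests that small differences early in training are amplified and go on to determine final feature quality, which offers key insight for optimizing training strategies such as initialization and early regularization.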

Disclaimer: The above content is generated by AI and is for reference only.
