Feature Lottery? A Bifurcation Theory of Concept Emergence

Background

Understanding when and how neural networks develop structured representations during training is crucial for interpretability and training diagnostics. Traditional methods for identifying these critical moments are retrospective and rely on labeled data or downstream metrics. This work moves the field towards a real-time, predictive framework by connecting representation dynamics to bifurcation theory in dynamical systems.

Key Points

1. Theoretical Framework: Bifurcation Theory of Representations

The core innovation is modeling the evolution of representations as a dynamical system. A passive Gaussian mixture model (GMM) probe attached to the encoder's hidden states acts as a sensor.
The onset of structure (where hidden states become linearly separable) is shown to coincide with a supercritical pitchfork bifurcation.
This bifurcation is driven by the loss Hessian, specifically by an eigenvalue crossing zero. The critical point β_c is the theoretically predicted value at which the relevant Hessian component crosses zero.
The dynamic ratio β(t)/β_c(t) serves as a universal, label-free phase coordinate. It is computable purely from the network's hidden states at any moment t, enabling real-time monitoring.
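
For reference, the textbook normal form of a supercritical pitchfork bifurcation is given below. Identifying the control parameter μ with β/β_c − 1, and x with an order parameter for hidden-state structure, is an assumption made here for illustration; the summary does not state this mapping.

```latex
\dot{x} = \mu x - x^{3}, \qquad \mu \propto \frac{\beta}{\beta_c} - 1 \quad \text{(assumed mapping)}
% \mu < 0: the symmetric state x^* = 0 is the only fixed point, and it is stable
% \mu > 0: x^* = 0 becomes unstable; two stable branches appear at x^* = \pm\sqrt{\mu}
```

Near x = 0 the linearized growth rate is μ, which vanishes at the critical point, so dynamics close to the bifurcation are slow.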

2. Empirical Validation Across Diverse Settings

The β/β_c coordinate successfully identifies and characterizes four distinct transition regimes:

Sparse Autoencoders (SAEs) on Language Models (Pythia): Tracks the formation of interpretable features.
Self-Supervised Learning (SSL) on CIFAR: Monitors the emergence of semantic structure.
Grokking in Modular Arithmetic: Explains the delayed transition from memorization to generalization.
Feature Lottery in SAE Training: Demonstrates that early-stage atom purity (at ~5% of training) robustly predicts final convergence purity, with top early atoms achieving >12x baseline purity at convergence. This makes a feature's eventual interpretability predictable very early in training.
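
The early-versus-final purity comparison could be sketched as follows. The summary does not define atom purity, so the definition used here (the share of an atom's top-activating inputs that carry the most common label) is an assumption, as are the helper names `atom_purity`, `ranks`, and `spearman`. The sketch also assumes category labels are available for evaluation.

```python
from collections import Counter

def atom_purity(labels_of_top_activations):
    """Fraction of an atom's top-activating inputs sharing the most common label.
    Hypothetical definition; the source does not specify how purity is computed."""
    counts = Counter(labels_of_top_activations)
    return counts.most_common(1)[0][1] / len(labels_of_top_activations)

def ranks(values):
    """Rank values from 0 (smallest) upward; ties are not handled in this sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

if __name__ == "__main__":
    # Toy purities for four atoms (made-up numbers, not results from the paper).
    early = [0.20, 0.50, 0.90, 0.35]   # measured early in training
    final = [0.25, 0.60, 0.95, 0.40]   # measured at convergence
    print(spearman(early, final))
```

A rank correlation close to 1 between the early and final purities would mean the early ranking of atoms is preserved at convergence, which is the sense of "predictable" used above.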

3. Explanation for Grokking: Delayed Symmetry-Breaking

A major theoretical contribution is explaining the characteristic delayed escape in grokking (where generalization happens long after memorization).
Under finite dissipation in the training dynamics (e.g., from weight decay or noise), macroscopic symmetry-breaking can lag the initial bifurcation point (where β reaches β_c, i.e. β/β_c = 1) by orders of magnitude in training time.
Microscopically, the bifurcation creates a shared unstable subspace in parameter space. The system cannot "choose" a symmetry-breaking direction immediately; it undergoes collective, slow dynamics before committing. This provides a rigorous dynamical account of grokking's delays.
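
A toy illustration of how escape from a symmetric state can lag a bifurcation is sketched below. It integrates the one-dimensional pitchfork normal form dx/dt = μx − x³ from a tiny initial amplitude. This is not the paper's model: the function name `escape_time`, the escape threshold, and the parameter values are all assumptions chosen for illustration.

```python
import math

def escape_time(mu, x0, threshold=0.5, dt=0.01, max_steps=1_000_000):
    """Integrate dx/dt = mu*x - x**3 with forward Euler and return the time at
    which |x| first reaches threshold * sqrt(mu), i.e. a fixed fraction of the
    stable branch amplitude. Returns math.inf if that never happens."""
    x = x0
    target = threshold * math.sqrt(mu)
    for step in range(max_steps):
        if abs(x) >= target:
            return step * dt
        x += dt * (mu * x - x ** 3)
    return math.inf

if __name__ == "__main__":
    # Same tiny initial amplitude, two distances past the critical point.
    for mu in (0.1, 0.01):
        print(mu, escape_time(mu, x0=1e-6))
```

In this toy system the escape time scales roughly as ln(1/x0)/μ, so a system that sits just past the critical point with a very small symmetry-breaking amplitude can remain near the symmetric state for a long time before committing to a branch.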

4. Practical Implications for Training Diagnostics

The β/β_c ratio acts as an early-warning indicator for training health. It can detect the onset of useful structure, the crystallization of specific features, and episodes of representational collapse before they manifest in downstream task metrics.
This enables proactive intervention during training rather than post-hoc analysis.
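
A minimal sketch of such a monitor is given below, assuming the ratio β/β_c has already been computed at each checkpoint. The function name `detect_transitions`, the critical value of 1.0, the hysteresis margin, and the reading of a downward crossing as possible collapse are all assumptions for illustration, not details from the source.

```python
def detect_transitions(ratios, critical=1.0, margin=0.05):
    """Return (step, event) pairs for a sequence of beta/beta_c values.

    'onset' fires when the ratio rises above critical + margin;
    'possible_collapse' fires when it later falls below critical - margin.
    The margin adds hysteresis so that noise near the critical value does
    not trigger repeated alerts.
    """
    events = []
    above = False
    for step, r in enumerate(ratios):
        if not above and r > critical + margin:
            above = True
            events.append((step, "onset"))
        elif above and r < critical - margin:
            above = False
            events.append((step, "possible_collapse"))
    return events

if __name__ == "__main__":
    # Hypothetical ratio trace, one value per checkpoint.
    trace = [0.5, 0.9, 1.02, 1.1, 1.3, 0.98, 0.9, 1.2]
    for step, event in detect_transitions(trace):
        print(step, event)
```

The hysteresis band is a design choice: a single threshold would raise an alert every time a noisy ratio wobbles across the critical value.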

Significance

This work bridges dynamical systems theory and deep learning practice. It provides a first-principles, real-time diagnostic tool (β/β_c) that is both theoretically grounded and empirically robust. Its key insight, that representational emergence is a phase transition governed by bifurcation dynamics, offers a unified lens for phenomena as diverse as grokking, feature formation in SAEs, and SSL. The practical predictive power for feature convergence and training health marks a substantial step towards more controllable and interpretable neural network training.