Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers
A three-step method for identifying attention-head circuits in pretrained transformers is introduced: spectral signal ranking (using time-integrated participation ratio), task-pattern screening, and causal verification via group ablation. The approach requires no labels or attribution gradients and generalizes across model scales (51M–7B parameters), architectures (dense and MoE), and training pipelines. A 2–6 head induction circuit is found causally necessary in every model tested. The fraction
Deep Analysis
Background
The paper addresses a fundamental challenge in mechanistic interpretability: how to systematically identify which attention heads in a transformer perform specific computational roles without relying on supervised labels, gradient-based attribution, or manual inspection. Existing approaches often require task-specific knowledge or expensive gradient computation, limiting scalability and generalizability. This work positions itself as the first installment of a three-paper program, with companion papers extending the method to developmental trajectories during pretraining and to composed-task circuits.
Key Points
1. The Three-Step Recipe
The core contribution is a three-stage pipeline for circuit identification:
Step 1 — Spectral Signal Ranking: Each head's attention output is scored by its time-integrated participation ratio, a per-head spectral measure that quantifies sustained, content-dependent computation. The metric is unsupervised: it needs no task labels and no attribution gradients. A high participation ratio marks a head doing structured, data-dependent work rather than attending uniformly.
Step 2 — Task-Pattern Screen: The general spectral indicator is filtered into a task-specific candidate circuit. This narrows the ranked heads down to those actually relevant to a target task (e.g., induction).
Step 3 — Causal Verification via Group Ablation: Candidates are ablated as a group against a matched-random control to establish a causal (not merely correlational) claim about the circuit's necessity.
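Step 1's ranking can be sketched as follows. This summary does not give the paper's exact definition of the time-integrated participation ratio, so the sketch assumes the standard participation ratio, PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues), taken over the covariance spectrum of a head's outputs, and it approximates time integration by averaging PR across sequences. The function names and the (n_heads, n_seqs, seq_len, d_head) array layout are hypothetical.

```python
import numpy as np


def participation_ratio(x: np.ndarray) -> float:
    """Participation ratio of the covariance spectrum of x (rows are samples).

    PR = (sum(lam))**2 / sum(lam**2), where lam are covariance eigenvalues.
    PR is near 1 when one direction dominates and near d when variance is
    spread evenly over d directions.
    """
    centered = x - x.mean(axis=0, keepdims=True)
    # Squared singular values of the centered data are proportional to the
    # covariance eigenvalues; the constant factor cancels in the ratio.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    lam = singular_values ** 2
    total = lam.sum()
    if total == 0.0:
        return 0.0  # constant output: no variance to distribute
    return float(total ** 2 / (lam ** 2).sum())


def rank_heads(head_outputs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Rank heads by mean PR over sequences (ASSUMED stand-in for time integration).

    head_outputs has the hypothetical shape (n_heads, n_seqs, seq_len, d_head).
    Returns (head indices sorted by descending score, per-head scores).
    """
    n_heads, n_seqs = head_outputs.shape[:2]
    scores = np.array([
        np.mean([participation_ratio(head_outputs[h, s]) for s in range(n_seqs)])
        for h in range(n_heads)
    ])
    return np.argsort(-scores), scores
```

The ranking alone is only the general spectral indicator; under the recipe, the top-ranked heads would still pass through the task-pattern screen and group ablation before any circuit claim is made.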
2. Cross-Scale and Cross-Architecture Generalization
The recipe is validated across an 8× parameter range (51M to 1B active parameters / 7B total), two architecture families (dense and mixture-of-experts), and four pretraining pipelines, a breadth that is unusual for interpretability work. The key finding: a 2–6 head induction circuit is causally necessary in every model tested, with a 94–100% drop in synthetic-induction top-1 accuracy after ablation.
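The ablation comparison behind a figure like this can be sketched as below. This is a minimal illustration, not the paper's implementation: it assumes the control is matched only in group size (the paper may match on other properties), and `evaluate`, `group_ablation_test`, and `toy_evaluate` are hypothetical names. `evaluate` stands for any function that returns task accuracy with a given set of heads ablated.

```python
import numpy as np
from typing import Callable, Iterable


def relative_drop(baseline: float, ablated: float) -> float:
    """Fractional accuracy loss relative to the unablated baseline."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline - ablated) / baseline


def group_ablation_test(
    evaluate: Callable[[frozenset], float],
    candidates: Iterable[int],
    n_heads: int,
    n_controls: int = 20,
    seed: int = 0,
) -> dict:
    """Compare ablating the candidate group against same-size random groups."""
    rng = np.random.default_rng(seed)
    cand = frozenset(candidates)
    pool = np.array([h for h in range(n_heads) if h not in cand])
    if len(pool) < len(cand):
        raise ValueError("not enough non-candidate heads for a matched control")

    baseline = evaluate(frozenset())
    candidate_drop = relative_drop(baseline, evaluate(cand))

    # ASSUMPTION: "matched" means same group size, drawn from non-candidates.
    control_drops = []
    for _ in range(n_controls):
        ctrl = rng.choice(pool, size=len(cand), replace=False)
        ctrl_set = frozenset(int(h) for h in ctrl)
        control_drops.append(relative_drop(baseline, evaluate(ctrl_set)))

    return {
        "candidate_drop": candidate_drop,
        "control_mean_drop": float(np.mean(control_drops)),
        "control_max_drop": float(np.max(control_drops)),
    }


def toy_evaluate(ablated: frozenset) -> float:
    """Toy stand-in for a model: accuracy collapses only if heads 3 and 7 are both ablated."""
    return 0.0 if {3, 7} <= ablated else 0.9
```

Calling `group_ablation_test(toy_evaluate, candidates=[3, 7], n_heads=16)` exercises the sketch on the toy function. A necessity claim of the kind reported would correspond to a candidate drop near 1.0 alongside control drops near 0.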
3. Unsupervised Predictive Power
A particularly striking result is that the spectral signal identifies seed-specific circuits without supervision. Across six independent seeds of a 51M-parameter probe model, the same spectral computation recovers the distinct induction circuit present in each seed. This suggests the measure captures genuine computational structure rather than artifacts of a particular training run.
4. Scaling Laws for Specialized Computation
Two conserved scaling relationships emerge:
- The fraction of heads doing identifiable specialized computation is 17–19% across the Pythia family (124M to 410M). This constancy suggests a fundamental architectural or optimization constraint on how transformers allocate heads to specialized vs. general roles.
- Specific induction circuits remain small, at 3–11 heads, and scale sublinearly with total head count. As models grow, the induction circuit does not expand proportionally; it stays roughly fixed in size, implying a form of computational modularity.
Significance
This work offers a general-purpose, unsupervised, causally validated method for circuit discovery in transformers. Several aspects stand out:
- Methodological advance: By eliminating the need for labels, gradients, or manual heuristics, the recipe dramatically lowers the barrier to circuit identification and makes large-scale systematic studies feasible.
- Empirical regularities: The conserved 17–19% fraction and the sublinear growth of induction circuits are more than validation checks; they point toward emergent organizational principles in how pretrained transformers distribute computation across heads.
- Foundation for further work: As the methodology anchor, this paper enables the companion studies on how circuits form during pretraining (developmental trajectories) and how they compose under multi-task settings, where pattern selectivity and causal structure may diverge.
- Architectural implications: The fact that the recipe ports across dense and MoE architectures suggests the organizational principles it reveals are not idiosyncratic to one architecture but reflect deeper properties of the transformer training process.