
Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

Sparse autoencoders (SAEs) decompose large language models like GPT-2 XL and Llama-3.1-8B into interpretable features, revealing that semantic feature


Deep Analysis

Background

Intermediate layers of large language models (LLMs) predict human brain responses to language better than other layers. This is a robust finding in computational neurolinguistics, but the mechanism behind it remains unexplained. To address this gap, the researchers decompose LLMs into interpretable features using sparse autoencoders (SAEs).

Key Points

The study decomposes GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. Human raters validated the taxonomy used to categorize these features, with a kappa score of at least 0.74. Semantic features alone account for 94% of peak encoding performance and significantly outperform variance-matched baselines.
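To make the kappa figure concrete, here is a minimal sketch of how an agreement score of this kind can be computed. The summary does not say which kappa statistic was used; this assumes Cohen's kappa for two raters, and the function name and label strings are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical category labels from two raters for four features.
rater_1 = ["semantic", "semantic", "syntactic", "semantic"]
rater_2 = ["semantic", "syntactic", "syntactic", "semantic"]
kappa = cohen_kappa(rater_1, rater_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is a stricter reliability measure than percent agreement.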

Decomposition and Taxonomy

  1. Decomposition: LLMs are decomposed into fine-grained interpretable features.
  2. Taxonomy Validation: Human experts validated the feature taxonomy with a Kappa score of at least 0.74, supporting the reliability of the feature categorization.
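The decomposition step can be sketched as follows. The summary does not describe the SAE architecture or training setup, so this assumes a standard ReLU autoencoder with an L1 sparsity penalty; the toy dimensions, weight initialization, and all names are illustrative, not the study's configuration.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One forward pass of a ReLU sparse autoencoder (assumed architecture).

    x: (n_tokens, d_model) layer activations.
    Returns (features, reconstruction).
    """
    # Encode: subtract the decoder bias, project up, keep positive part only.
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    # Decode: rebuild the activation as a weighted sum of feature directions.
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * sparsity


# Toy sizes; the study reports 16K-32K features per layer.
rng = np.random.default_rng(0)
d_model, n_features, n_tokens = 16, 64, 8
W_enc = rng.standard_normal((d_model, n_features)) * 0.1
W_dec = W_enc.T.copy()
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

x = rng.standard_normal((n_tokens, d_model))
features, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, reconstruction)
```

Each column of `features` is one candidate interpretable feature; after training, these are the units a taxonomy would assign to categories such as "semantic".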

Semantic Feature Dominance

  • Peak Encoding Performance: Semantic features alone recover 94% of peak encoding performance (r = 0.285).
  • Comparison with Baselines: Variance-matched baselines perform significantly worse (p < 0.001, effect size d = 1.31).
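A "fraction of peak encoding performance" comparison can be illustrated with a small encoding model on synthetic data. The summary does not name the regression method or define the peak, so this assumes ridge regression scored by held-out Pearson r, with the full-feature model standing in for the peak. The data, the choice of which columns count as "semantic", and all names are hypothetical.

```python
import numpy as np


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge weights mapping features X (n, p) to voxels Y (n, v)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ Y)


def pearson_per_column(Y_true, Y_pred):
    """Pearson r between matching columns of two (n, v) arrays."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom


def encoding_score(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Mean held-out Pearson r across voxels."""
    W = ridge_fit(X_train, Y_train, alpha)
    return float(pearson_per_column(Y_test, X_test @ W).mean())


# Synthetic stand-in for SAE features and brain responses.
rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 400, 100, 20, 5
X = rng.standard_normal((n_train + n_test, n_feat))
semantic_cols = np.arange(10)  # hypothetical: first 10 features are "semantic"
W_true = np.zeros((n_feat, n_vox))
W_true[semantic_cols] = rng.standard_normal((10, n_vox))
Y = X @ W_true + rng.standard_normal((n_train + n_test, n_vox))

X_tr, X_te = X[:n_train], X[n_train:]
Y_tr, Y_te = Y[:n_train], Y[n_train:]

r_all = encoding_score(X_tr, Y_tr, X_te, Y_te)
r_semantic = encoding_score(X_tr[:, semantic_cols], Y_tr, X_te[:, semantic_cols], Y_te)
fraction_of_peak = r_semantic / r_all
```

In this toy setup the signal lives entirely in the "semantic" columns, so the subset model recovers nearly all of the full model's score. A variance-matched baseline would repeat the subset fit with non-semantic features of comparable variance.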

Significance

The study uses SAE features to give a feature-level account of the mechanistic link between language models and brain responses.

Novel Cortical Topography Prediction

  • A Priori Categorization: Five semantic subcategories, derived a priori from three independent neuroscience programs, were tested.
  • Convergence Test: A formal test confirmed that SAE-discovered features map onto distinct brain regions (Spearman ρ = 0.72, p < 0.001; hypergeometric p = 0.007), showing a granularity not achieved by previous methods.
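The two statistics reported in the convergence test can be sketched with the standard library alone. The summary does not say which quantities were ranked or counted, so the example inputs are left to the reader. The sketch only shows how a Spearman rank correlation and a hypergeometric tail probability are computed, and the function names are hypothetical.

```python
from math import comb


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement."""
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return hits / total
```

A hypergeometric test of this form asks how likely it is to see at least `k` matches by chance, for example when checking whether predicted feature-to-region assignments overlap with observed ones more than random assignment would.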

Prediction of Reading Times

  • Beyond Lexical Controls: SAE features predict human reading times beyond simple lexical controls (ΔlogLik = 38.4, p < 0.001).
  • Exploratory Analyses: Preliminary evidence suggests the brain encodes unexpected semantic content, providing a new dimension to understanding language processing.
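A log-likelihood improvement such as ΔlogLik can be turned into a p-value with a likelihood-ratio test. The summary does not state the model family or how many predictors were added, so this sketch assumes nested Gaussian linear models that differ by a single parameter (1 degree of freedom); the function names are hypothetical.

```python
import math


def gaussian_loglik(rss, n):
    """Maximised log-likelihood of a Gaussian linear model.

    rss: residual sum of squares; n: number of observations.
    """
    return -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)


def delta_loglik(rss_base, rss_full, n):
    """Log-likelihood gain of the fuller model over the baseline."""
    return gaussian_loglik(rss_full, n) - gaussian_loglik(rss_base, n)


def lrt_pvalue_df1(delta):
    """Chi-square tail probability (1 df, an assumption) of the statistic 2 * delta."""
    # For 1 df, P(chi2 >= x) = erfc(sqrt(x / 2)); here x = 2 * delta.
    return math.erfc(math.sqrt(max(delta, 0.0)))


# Hypothetical residuals: adding a predictor lowers RSS from 110 to 100.
gain = delta_loglik(rss_base=110.0, rss_full=100.0, n=1000)
p_value = lrt_pvalue_df1(gain)
```

Under the 1-degree-of-freedom assumption, a gain as large as the reported 38.4 falls far below the p < 0.001 threshold. With more added predictors the reference distribution would have more degrees of freedom and the p-value would be larger.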

Generalization

The findings generalize across English, Chinese, and French, suggesting cross-linguistic applicability of SAEs in understanding neural responses.

Conclusion

By combining sparse autoencoders with neural encoding models, this study offers a mechanistic account of the relationship between LLM layers and brain activity. The work advances computational neurolinguistics and provides a methodological framework for future research.

Disclaimer: The above content is generated by AI and is for reference only.
