
Research Papers • 1mo ago • Updated 1mo ago

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

Sparse autoencoders (SAEs) decompose the internal representations of large language models such as GPT-2 XL and Llama-3.1-8B into interpretable features, revealing that semantic features alone account for 94% of peak brain-encoding performance.


Analysis

Background

The article addresses an open question: why do the intermediate layers of large language models (LLMs) best predict human brain responses to language? Although this is a robust finding in computational neurolinguistics, the mechanism behind it remains unexplained. To close this gap, the researchers use sparse autoencoders (SAEs) to decompose LLM representations into interpretable features.
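For readers new to SAEs, the sketch below shows the basic idea in NumPy: an overcomplete dictionary with a ReLU encoder and a linear decoder, scored by reconstruction error plus an L1 sparsity penalty. It covers only the forward pass and loss (training is omitted), and the sizes, the `l1_coeff` value, and the function names `init_sae`, `sae_forward`, and `sae_loss` are illustrative assumptions; the article does not specify the SAE architecture or objective the authors used.

```python
import numpy as np

def init_sae(d_model: int, n_features: int, seed: int = 0) -> dict:
    """Randomly initialise a toy SAE with an overcomplete dictionary (n_features > d_model)."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(d_model, n_features)),
        "b_enc": np.zeros(n_features),
        "W_dec": rng.normal(0.0, 0.1, size=(n_features, d_model)),
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params: dict, x: np.ndarray):
    """Map activations x of shape (n_tokens, d_model) to features and a reconstruction."""
    # ReLU encoder: features are non-negative; training with an L1 penalty drives most to zero.
    f = np.maximum(0.0, (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"])
    # Linear decoder: each reconstruction is a weighted sum of dictionary directions.
    x_hat = f @ params["W_dec"] + params["b_dec"]
    return f, x_hat

def sae_loss(params: dict, x: np.ndarray, l1_coeff: float = 1e-3) -> float:
    """Squared reconstruction error (summed over dims, averaged over tokens) plus L1 on features."""
    f, x_hat = sae_forward(params, x)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(f), axis=1))
    return float(recon + l1_coeff * sparsity)

# Toy sizes; the article reports 16K-32K features per layer for the real models.
params = init_sae(d_model=64, n_features=512)
acts = np.random.default_rng(1).normal(size=(10, 64))  # stand-in for one layer's activations
features, recon = sae_forward(params, acts)
print(features.shape, recon.shape, round(sae_loss(params, acts), 3))
```

Each column of `features` is a candidate interpretable unit; after training, it is these sparse feature activations, rather than the raw hidden states, that can be labeled and grouped into a taxonomy.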

Key Points

The study decomposes the layer representations of GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer and categorizes them with a human-validated taxonomy (agreement κ ≥ 0.74). Semantic features alone account for 94% of peak encoding performance, significantly outperforming variance-matched baselines.

Decomposition and Taxonomy

Decomposition: LLM representations are decomposed into fine-grained interpretable features.
Taxonomy Validation: Human experts validated the feature taxonomy, with agreement of κ ≥ 0.74, which supports the reliability of the feature categorization.
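A kappa statistic corrects raw agreement for the agreement expected by chance. The article does not say which kappa variant was used; the sketch below assumes Cohen's kappa for two annotators who each assign every feature to one category, and the labels are made-up toy data.

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance that two raters with these label frequencies agree at random.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical label for everything
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two annotators for eight SAE features.
rater_a = ["semantic", "semantic", "syntactic", "semantic", "other", "syntactic", "semantic", "other"]
rater_b = ["semantic", "semantic", "syntactic", "syntactic", "other", "syntactic", "semantic", "semantic"]
print(round(cohens_kappa(rater_a, rater_b), 3))  # prints 0.6
```

Here raw agreement is 75%, but κ is only 0.6 because 37.5% agreement is already expected by chance from the raters' label frequencies.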

Semantic Feature Dominance

Peak Encoding Performance: Semantic features alone recover 94% of peak encoding performance (r = 0.285).
Comparison with Baselines: Variance-matched baselines fall significantly short of the semantic features (p < 0.001, d = 1.31).
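An encoding model regresses brain responses on stimulus features and scores its held-out predictions with Pearson's r. The sketch below uses ridge regression on synthetic data to compare a "semantic" feature subset with the full feature set and with a randomly drawn, variance-matched baseline. The ridge penalty, the train/test split, the variance-matching rule (same number of features, closest summed variance), and all data are assumptions for illustration, not the authors' pipeline; real fMRI encoding models also handle hemodynamic delays, which this omits.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge solution W = (X^T X + alpha * I)^-1 X^T Y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

def encoding_r(X, Y, n_train, alpha=10.0):
    """Mean held-out Pearson r across voxels for a ridge encoding model."""
    Xtr, Xte, Ytr, Yte = X[:n_train], X[n_train:], Y[:n_train], Y[n_train:]
    # Centre with training-set means so the model needs no intercept term.
    x_mu, y_mu = Xtr.mean(axis=0), Ytr.mean(axis=0)
    W = ridge_fit(Xtr - x_mu, Ytr - y_mu, alpha)
    pred = (Xte - x_mu) @ W
    # Pearson r per voxel between predicted and observed held-out responses.
    p, t = pred - pred.mean(axis=0), Yte - Yte.mean(axis=0)
    r = (p * t).sum(axis=0) / np.sqrt((p ** 2).sum(axis=0) * (t ** 2).sum(axis=0))
    return float(r.mean())

def variance_matched_subset(X, subset, rng, n_draws=200):
    """Same-size random subset of the other features whose summed variance best matches `subset`."""
    target = X[:, subset].var(axis=0).sum()
    others = np.setdiff1d(np.arange(X.shape[1]), subset)
    best, best_gap = None, np.inf
    for _ in range(n_draws):
        cand = rng.choice(others, size=len(subset), replace=False)
        gap = abs(X[:, cand].var(axis=0).sum() - target)
        if gap < best_gap:
            best, best_gap = cand, gap
    return best

# Synthetic data: 400 time points, 60 model features, 20 voxels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 60))
semantic = np.arange(10)  # hypothetical: features 0-9 labelled "semantic"
W_true = np.zeros((60, 20))
W_true[semantic] = rng.normal(size=(10, 20))  # toy brain driven by semantic features only
Y = X @ W_true + rng.normal(scale=3.0, size=(400, 20))

r_full = encoding_r(X, Y, n_train=300)
r_sem = encoding_r(X[:, semantic], Y, n_train=300)
baseline = variance_matched_subset(X, semantic, rng)
r_base = encoding_r(X[:, baseline], Y, n_train=300)
print(f"full r={r_full:.3f}  semantic r={r_sem:.3f} ({r_sem / r_full:.0%} of full)  baseline r={r_base:.3f}")
```

Because the toy responses depend only on the chosen subset, the semantic-only ratio here can come out near or above 100%; the study's 94% figure means the real semantic subset recovered slightly less than the peak. The variance-matched baseline guards against the objection that any subset with comparable variance would do as well.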

Significance

The study introduces a new approach to understanding the mechanistic link between language models and brain responses.

Novel Cortical Topography Prediction

A Priori Categorization: Five semantic subcategories, derived in advance from three independent neuroscience research programs, were tested.
Convergence Test: A formal test confirmed that SAE-discovered features map onto distinct brain regions (Spearman ρ = 0.72, p < 0.001; hypergeometric p = 0.007), at a granularity that previous methods had not achieved.
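The two statistics in the convergence test can be illustrated in plain Python. The sketch assumes a setup the article does not spell out: Spearman's ρ correlates a per-region score derived from SAE features with a per-region score from the a priori categories, and the hypergeometric test asks whether the overlap between the regions flagged by each method exceeds chance. All inputs and the function names `spearman_rho` and `hypergeom_sf` are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def hypergeom_sf(k, population, n_a, n_b):
    """P(overlap >= k) when n_b of `population` items are drawn at random and n_a are marked."""
    upper = min(n_a, n_b)
    hits = sum(math.comb(n_a, i) * math.comb(population - n_a, n_b - i)
               for i in range(max(k, 0), upper + 1))
    return hits / math.comb(population, n_b)

sae_region_scores = [0.9, 0.7, 0.4, 0.8, 0.1, 0.3]    # hypothetical per-region SAE scores
prior_region_scores = [0.8, 0.6, 0.5, 0.9, 0.2, 0.1]  # hypothetical a priori scores
rho = spearman_rho(sae_region_scores, prior_region_scores)
# 20 regions; SAE features flag 5, the a priori categories flag 6, and 4 are shared.
p_overlap = hypergeom_sf(4, population=20, n_a=5, n_b=6)
print(f"rho = {rho:.3f}, overlap p = {p_overlap:.4f}")  # rho = 0.886, overlap p = 0.0139
```

The rank correlation tests whether the two orderings of regions agree; the hypergeometric tail tests whether the two sets of flagged regions share more members than random selection would produce.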

Prediction of Reading Times

Beyond Lexical Controls: SAE features predict human reading times beyond what lexical controls alone explain (ΔlogLik = 38.4, p < 0.001).
Exploratory Analyses: Preliminary evidence suggests that the brain encodes unexpected semantic content, which could add a new dimension to our understanding of language processing.
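ΔlogLik is the gain in log-likelihood when SAE features are added to a regression that already contains lexical controls. The sketch below assumes ordinary least squares with Gaussian errors and a likelihood-ratio test on synthetic data; the control variables (stand-ins for word length and log frequency), the number of SAE features, and the helper names are illustrative assumptions, since the article does not describe the authors' model or test.

```python
import math
import numpy as np

def gaussian_loglik(X, y):
    """Maximised Gaussian log-likelihood of an OLS regression of y on X plus an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    sigma2 = resid @ resid / len(y)  # maximum-likelihood residual variance
    return -0.5 * len(y) * (math.log(2 * math.pi * sigma2) + 1)

def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution for integer df (closed form)."""
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k  # (x/2)^k / k!
            total += term
        return math.exp(-x / 2) * total
    total = math.erfc(math.sqrt(x / 2))
    term = math.sqrt(2 * x / math.pi)  # sqrt(2/pi) * x^(k-1/2) / (1*3*...*(2k-1)) at k = 1
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-x / 2) * term
        term *= x / (2 * k + 1)
    return total

rng = np.random.default_rng(0)
n = 500
lexical = rng.normal(size=(n, 2))  # hypothetical controls, e.g. word length and log frequency
sae = rng.normal(size=(n, 3))      # hypothetical SAE feature activations per word
rt = (300 + lexical @ np.array([20.0, -15.0]) + sae @ np.array([10.0, 0.0, 5.0])
      + rng.normal(scale=30.0, size=n))

delta_loglik = gaussian_loglik(np.column_stack([lexical, sae]), rt) - gaussian_loglik(lexical, rt)
p_value = chi2_sf(2 * delta_loglik, df=sae.shape[1])  # LR statistic, df = added predictors
print(f"Delta logLik = {delta_loglik:.1f}, p = {p_value:.2g}")
```

Because the two models are nested, 2·ΔlogLik is compared with a chi-square distribution whose degrees of freedom equal the number of added predictors; a large ΔlogLik means the SAE features carry information about reading times that the lexical controls do not.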

Generalization

The findings generalize across English, Chinese, and French, suggesting that the SAE-based approach to explaining neural responses applies across languages.

Conclusion

By combining sparse autoencoders with neural encoding models, this study sheds light on the mechanistic relationship between LLM layers and brain activity. Beyond advancing computational neurolinguistics, it provides a methodological framework for future research.

Disclaimer: The above content is generated by AI and is for reference only.

Tags: LLM, GPT, LLaMA, Embedding Model, Neural Encoding
