Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

Deep Analysis

Background

Recent advancements in generative vision-language models (VLMs) have significantly improved their performance on multimodal reasoning tasks. However, understanding how visual inputs are transformed into text remains a challenge due to the opaque nature of these models' internal processes. Existing interpretability methods, such as Sparse Autoencoders (SAEs), decompose static residual representations but fail to capture the functional updates that drive cross-modal interactions.

Key Points

The study proposes using Transcoders, sparse approximations of MLP sublayers, to provide a function-centric framework for understanding VLMs. Specifically, Transcoders act as causal proxies for layer-wise computation and are applied to Gemma 3-4B-IT to decompose the model into interpretable pathways linking image patches to token generation directions.

Transcoder Framework

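A Transcoder approximates what an MLP sublayer computes: it maps the sublayer's input to its output through a sparse set of features, so each feature describes an update the layer writes rather than a pattern already present in the residual stream. This contrasts with methods like SAEs, which decompose static representations and therefore miss the dynamic aspects of cross-modal interactions.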
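
To make the contrast concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes the commonly used transcoder form (a ReLU encoder over the MLP input, a linear decoder, and a squared-error plus L1 objective against the MLP output); all weights, sizes, and function names are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def transcoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode the MLP *input* into sparse features, then decode a predicted MLP *output*."""
    z = relu([a + b for a, b in zip(matvec(W_enc, x), b_enc)])
    y_hat = [a + b for a, b in zip(matvec(W_dec, z), b_dec)]
    return z, y_hat

def transcoder_loss(x, mlp_out, params, l1_coeff=1e-3):
    """Squared error against the MLP output plus an L1 sparsity penalty on the features.

    An SAE would use the same architecture but reconstruct its own input,
    i.e. the target would be `x` instead of `mlp_out`.
    """
    z, y_hat = transcoder_forward(x, *params)
    mse = sum((a - b) ** 2 for a, b in zip(y_hat, mlp_out))
    return mse + l1_coeff * sum(abs(a) for a in z)

# Tiny hand-picked example: 2-dimensional input, 3 sparse features (made-up numbers).
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0]
x = [2.0, -3.0]
z, y_hat = transcoder_forward(x, W_enc, b_enc, W_dec, b_dec)
```

In this toy case only the first feature fires, so the predicted update `y_hat` can be read off as that single feature's decoder column scaled by its activation. That per-feature readability is what makes the decomposition function-centric.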

Comparison with Sparse Autoencoders (SAEs)

The research compares Transcoder attributions with those from SAEs. Under patch ablation, Transcoder attributions produce stronger and more stable effects on visually grounded tokens, which suggests that they track the computations behind the model's visual grounding more faithfully. These attributions also align better with semantically relevant image regions.
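
The patch-ablation comparison can be illustrated with a small sketch. Everything here is an assumption for illustration: patches are reduced to single numbers, ablation means replacing a patch with a baseline value, and `score_fn` is a toy stand-in for the model's score on a visually grounded token. The paper's actual protocol may differ.

```python
def ablate(patches, indices, baseline=0.0):
    """Return a copy of `patches` with the chosen patches replaced by `baseline`."""
    drop = set(indices)
    return [baseline if i in drop else p for i, p in enumerate(patches)]

def top_k(attributions, k):
    """Indices of the k patches with the largest attribution scores."""
    order = sorted(range(len(attributions)), key=lambda i: attributions[i], reverse=True)
    return order[:k]

def ablation_effect(score_fn, patches, attributions, k):
    """Drop in the target-token score after ablating the k most-attributed patches."""
    return score_fn(patches) - score_fn(ablate(patches, top_k(attributions, k)))

# Toy scorer: the token depends mostly on patch 0 and a little on patches 1 and 3.
def score_fn(p):
    return 3.0 * p[0] + 1.0 * p[1] + 0.0 * p[2] + 0.5 * p[3]

patches = [1.0, 1.0, 1.0, 1.0]
attr_a = [0.9, 0.1, 0.0, 0.2]  # hypothetical attribution method A
attr_b = [0.0, 0.1, 0.9, 0.2]  # hypothetical attribution method B

effect_a = ablation_effect(score_fn, patches, attr_a, k=2)
effect_b = ablation_effect(score_fn, patches, attr_b, k=2)
```

An attribution method that points at the patches the score really depends on yields a larger drop when those patches are removed, which is the sense in which a "stronger" ablation effect indicates more faithful attributions.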

Visual Grounding Analysis

A False Visual Grounding counterfactual analysis confirms that the pathways recovered by Transcoders are specific to vision-language interaction, further validating their interpretability and relevance.

Significance

The function-centric circuit decomposition using Transcoders offers several significant contributions:

Enhanced Interpretability: The framework provides a clearer understanding of how visual inputs influence text generation in VLMs.
Stronger and More Stable Effects: Under patch ablation, Transcoder attributions affect visually grounded tokens more strongly and more consistently than SAE attributions, and they align better with semantically relevant image regions.
Predictive Power: A logistic classifier trained on graph features extracted from Transcoder circuit traces predicts hallucinations in model outputs with an AUC of 0.68, above the 0.5 chance level, which points to the practical utility of this approach.
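
For readers unfamiliar with the evaluation behind the AUC figure, the following is a minimal, stdlib-only sketch of fitting a logistic classifier and scoring it with ROC AUC. The summary does not say which graph features were used, so the two feature columns and all numbers below are made up for illustration and say nothing about the reported result.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Fit weights and bias by plain batch gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(X, w, b):
    """Predicted probability of the positive class for each row of X."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical graph features per trace, e.g. [share of image-derived edges, path depth],
# with label 1 = hallucinated output. These values are invented for the example.
X = [[0.2, 0.1], [0.4, 0.3], [0.7, 0.9], [0.9, 0.6]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
auc = roc_auc(y, predict_proba(X, w, b))
```

An AUC of 0.5 corresponds to ranking hallucinated and non-hallucinated outputs at random, and 1.0 to ranking them perfectly. In practice the score should be computed on held-out traces rather than on the training data as in this toy example.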

In summary, the function-centric framework based on Transcoders not only improves interpretability but also enhances our ability to understand and predict the behavior of VLMs, paving the way for more transparent AI systems.
