Tensor Cache: Eviction-conditioned Associative Memory for Transformers

Deep Analysis

Background

Autoregressive transformers keep a key-value (KV) cache that grows linearly with context length, leading to high memory consumption. Sliding-window caching bounds memory but discards evicted tokens entirely, so relevant evidence outside the window becomes inaccessible. Tensor Cache addresses both limitations by pairing a sliding-window cache as the first-level cache (L1) with an outer-product fast-weight memory as the second-level cache (L2), so that evicted tokens remain accessible in compressed form.
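The memory argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: the model shape (`LAYERS`, `HEADS`, `HEAD_DIM`, `WINDOW`), the 2-byte value size, and the assumption that L2 holds one `head_dim x head_dim` matrix per head per layer are all hypothetical choices, not figures from the source.

```python
def kv_cache_bytes(tokens, layers, heads, head_dim, bytes_per_value=2):
    """Bytes needed to store keys and values for `tokens` cached positions."""
    return 2 * tokens * layers * heads * head_dim * bytes_per_value


def fast_weight_bytes(layers, heads, head_dim, bytes_per_value=2):
    """Bytes for one head_dim x head_dim matrix per head per layer (assumed layout)."""
    return layers * heads * head_dim * head_dim * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
LAYERS, HEADS, HEAD_DIM, WINDOW = 24, 16, 64, 1024


def two_level_bytes(context_len):
    """L1 holds at most WINDOW tokens exactly; L2 has a fixed size."""
    l1 = kv_cache_bytes(min(context_len, WINDOW), LAYERS, HEADS, HEAD_DIM)
    l2 = fast_weight_bytes(LAYERS, HEADS, HEAD_DIM)
    return l1 + l2


if __name__ == "__main__":
    for n in (1_024, 16_384, 131_072):
        full = kv_cache_bytes(n, LAYERS, HEADS, HEAD_DIM)
        print(f"context={n:>7}  full_kv={full:>14}  two_level={two_level_bytes(n):>12}")
```

Under these assumptions the full cache scales with the context length, while the two-level total stops growing once the context exceeds the window.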

Key Points

Two-Level Cache Mechanism: Tensor Cache uses sliding-window softmax attention as L1, which stores recent tokens exactly. L2 is a fixed-size outer-product fast-weight memory fed by the KV pairs evicted from the window.
Linear Attention Identity: Tokens inside the window are attended to exactly. Evicted pairs are compressed into per-layer matrices and read with a single matrix multiplication, using the linear-attention identity \( q_t (k_i \otimes v_i) = \langle q_t, k_i \rangle v_i \).
Learned Fusion Gate: A learned scalar gate fuses the L1 and L2 outputs, weighting exact recent context against compressed evicted context.
Training Parameters: Per-head decay and write-rate parameters are trained end-to-end. The outer-product memory and the read identity are well known; the novelty lies in how Tensor Cache applies them, as a second-level cache written only with pairs evicted from the window.
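The key points above can be sketched for a single attention head as follows. This is a minimal NumPy illustration, not the authors' implementation: the class name `TwoLevelCacheHead`, the update rule `decay * M + write_rate * outer(k, v)`, and the convex fusion `gate * L1 + (1 - gate) * L2` are assumptions, and the decay, write rate, and gate are fixed constants here rather than learned parameters.

```python
from collections import deque

import numpy as np


class TwoLevelCacheHead:
    """One head: exact sliding window (L1) plus outer-product memory (L2)."""

    def __init__(self, d_k, d_v, window, decay=0.99, write_rate=1.0, gate=0.5):
        self.window = window
        self.l1 = deque()               # (key, value) pairs, oldest first
        self.l2 = np.zeros((d_k, d_v))  # fixed-size fast-weight matrix
        self.decay = decay              # learned per head in the source; constant here
        self.write_rate = write_rate    # learned per head in the source; constant here
        self.gate = gate                # scalar fusion gate (assumed to lie in [0, 1])

    def append(self, k, v):
        """Store a new pair in L1; write the evicted pair, if any, into L2."""
        self.l1.append((k, v))
        if len(self.l1) > self.window:
            k_old, v_old = self.l1.popleft()
            self.l2 = self.decay * self.l2 + self.write_rate * np.outer(k_old, v_old)

    def read(self, q):
        """Fuse exact softmax attention over L1 with a linear read from L2.

        Assumes at least one pair has been appended.
        """
        keys = np.stack([k for k, _ in self.l1])
        vals = np.stack([v for _, v in self.l1])
        scores = keys @ q / np.sqrt(q.shape[0])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out_l1 = weights @ vals   # exact attention over the window
        out_l2 = q @ self.l2      # equals sum_i <q, k_i> v_i over evicted pairs
        return self.gate * out_l1 + (1.0 - self.gate) * out_l2


# Usage: with decay = 1 and write_rate = 1, the L2 read matches the identity exactly.
rng = np.random.default_rng(0)
head = TwoLevelCacheHead(d_k=4, d_v=3, window=2, decay=1.0)
pairs = [(rng.normal(size=4), rng.normal(size=3)) for _ in range(5)]
for k, v in pairs:
    head.append(k, v)
q = rng.normal(size=4)
expected = sum(np.dot(q, k) * v for k, v in pairs[:3])  # the three evicted pairs
assert np.allclose(q @ head.l2, expected)
```

The final assertion checks the linear-attention identity numerically: reading the matrix with `q` gives the same vector as summing the dot-product-weighted values of every evicted pair, without storing those pairs individually.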

Significance

Tensor Cache targets the memory constraints of long-context language modeling while aiming to preserve model quality. By combining an exact sliding window with a compressed outer-product memory, it improves the memory-quality frontier over traditional bounded-state baselines.

Key Insights:

Memory Efficiency: Exact storage is limited to the window (L1), and everything evicted is compressed into a fixed-size memory (L2), so the total cache size stays bounded instead of growing with context length.
Training Optimization: The learned fusion gate allows for flexible interaction between L1 and L2, optimizing the trade-off between recent and historical information.
Spurious Cross-Tokens: Identifying and mitigating spurious cross-token outer products through a parallel weighted-sum scan improves model performance without increasing computational overhead.
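The cross-token point can be illustrated under one plausible reading, which the summary itself does not spell out: each pair should contribute only its own outer product k_i v_i^T, whereas pooling keys and values before taking the product also introduces k_i v_j^T terms for i != j. The sketch below uses hypothetical function names, computes only the final memory state rather than every prefix of a scan, and assumes the same decayed update rule as a simple recurrence.

```python
import numpy as np


def sequential_memory(keys, values, decay, write_rate):
    """Reference recurrence: M_t = decay * M_{t-1} + write_rate * k_t v_t^T."""
    mem = np.zeros((keys.shape[1], values.shape[1]))
    for k, v in zip(keys, values):
        mem = decay * mem + write_rate * np.outer(k, v)
    return mem


def weighted_sum_memory(keys, values, decay, write_rate):
    """Same final state as one weighted sum of per-token outer products."""
    n = keys.shape[0]
    # Token i is decayed once for every later write: weight = decay^(n-1-i).
    w = write_rate * decay ** np.arange(n - 1, -1, -1)
    return np.einsum("i,ik,iv->kv", w, keys, values)


def pooled_product(keys, values):
    """Outer product of summed keys and summed values: includes k_i v_j^T, i != j."""
    return np.outer(keys.sum(axis=0), values.sum(axis=0))


rng = np.random.default_rng(0)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))

# The parallel form reproduces the sequential recurrence.
assert np.allclose(sequential_memory(K, V, 0.9, 0.5), weighted_sum_memory(K, V, 0.9, 0.5))

# What the pooled shortcut adds on top of the clean sum is exactly the cross terms.
cross = pooled_product(K, V) - weighted_sum_memory(K, V, 1.0, 1.0)
explicit = sum(np.outer(K[i], V[j]) for i in range(6) for j in range(6) if i != j)
assert np.allclose(cross, explicit)
```

Because the weighted sum is a single contraction over the token axis, it can be evaluated in parallel across tokens while staying equal to the step-by-step recurrence.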

Overall, Tensor Cache demonstrates a practical approach to addressing memory limitations in transformer models, making it suitable for applications requiring long-context understanding.

Disclaimer: The above content is generated by AI and is for reference only.
