Up to 580 tps! New Speed Record of Qwen3.5-397B-A17B on GPU for Agentic Workloads with TokenSpeed

Deep Analysis

Article Type: This is a technical product launch and deep-dive, detailing the engineering behind a new open-source inference engine and its specific optimization for a flagship AI model.

Architectural Recognition of a Hybrid Model

Qwen3.5 is not a pure Transformer: it interleaves full attention layers with linear attention layers (Gated DeltaNet, GDN). TokenSpeed supports this hybrid stack natively, treating the KV cache of the attention layers and the recurrent state of the GDN layers as first-class citizens throughout its runtime. This design choice matters because an engine that treats the model as a standard Transformer cannot exploit its distinct memory and computation patterns. Prefix caching and scheduling are built from the ground up to handle the recurrent states of the GDN layers (conv_state, ssm_state) alongside traditional KV caches.
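To make the difference between the two cache types concrete, here is a minimal sketch of how per-sequence cache memory behaves in a hybrid stack. It is not TokenSpeed code: the class name `HybridCacheSpec`, the interleave ratio, and all byte sizes are hypothetical values chosen for illustration. The only property it relies on is that a KV cache grows with sequence length while a recurrent state has a fixed size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridCacheSpec:
    """Hypothetical description of a hybrid attention/GDN layer stack."""
    num_layers: int
    full_attn_every: int     # assumption: one full attention layer per this many layers
    kv_bytes_per_token: int  # assumption: KV footprint of one attention layer per token
    gdn_state_bytes: int     # assumption: conv_state + ssm_state footprint of one GDN layer

    def layer_kinds(self) -> list[str]:
        # Every `full_attn_every`-th layer is full attention; the rest are GDN.
        return [
            "full_attention" if (i + 1) % self.full_attn_every == 0 else "gdn"
            for i in range(self.num_layers)
        ]

    def cache_bytes(self, seq_len: int) -> dict[str, int]:
        kinds = self.layer_kinds()
        n_full = kinds.count("full_attention")
        n_gdn = kinds.count("gdn")
        return {
            # KV cache: one entry per token per attention layer, so it scales with seq_len.
            "kv": n_full * self.kv_bytes_per_token * seq_len,
            # Recurrent state: one fixed-size state per GDN layer, independent of seq_len.
            "gdn_state": n_gdn * self.gdn_state_bytes,
        }


spec = HybridCacheSpec(
    num_layers=8, full_attn_every=4, kv_bytes_per_token=1024, gdn_state_bytes=65536
)
short_ctx = spec.cache_bytes(1_000)
long_ctx = spec.cache_bytes(100_000)
print(short_ctx)
print(long_ctx)
```

With these made-up numbers the KV portion grows 100x between the two contexts while the GDN portion stays constant, which is why the two kinds of state call for different allocation and caching strategies.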

The Dual-Layer Cache: Solving the State Problem for Agents

For agentic workloads, which involve long, multi-turn sequences with tool calls, a robust prefix cache is vital to avoid reprocessing the same context on every turn. TokenSpeed splits this cache across two layers:

C++ Layer (Logic): Manages the radix-tree matching, page IDs, eviction policies, and the lifecycle of "MambaSlots" for GDN state.
Python Layer (Physics): Manages the actual GPU tensors for KV pages and Mamba states, including stream ordering and copy-on-write operations.

The key technical insight is the handling of the GDN state boundary. Reusing cached page IDs, as with a KV cache, is insufficient for linear attention layers: the recurrent state at the exact prefix boundary must also be preserved. TokenSpeed attaches a MambaSlot to the same radix-tree node as the cached KV prefix, using a two-slot system (a working slot and a checkpoint slot) to snapshot and publish the recurrent state. A prefix hit can then restore both the KV pages and the matching recurrent state, so prefix reuse in agentic flows works for hybrid models as well.
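The boundary problem can be illustrated with a small sketch. This is not TokenSpeed's implementation: it swaps the radix tree for a simpler page-granular trie, represents the recurrent state as a plain list of floats, and every name except MambaSlot (`Node`, `HybridPrefixCache`, `publish`, `match`, `PAGE_SIZE`) is hypothetical. It shows one plausible reading of the idea: a prefix is reusable only up to the deepest node that carries a state checkpoint, even if KV pages exist for it.

```python
from dataclasses import dataclass, field
from typing import Optional

PAGE_SIZE = 4  # assumption: tokens per KV page


@dataclass
class MambaSlot:
    """Stand-in for a GDN recurrent-state slot (conv_state, ssm_state)."""
    state: list[float]


@dataclass
class Node:
    page_id: Optional[int] = None
    checkpoint: Optional[MambaSlot] = None  # state at this exact prefix boundary
    children: dict[tuple[int, ...], "Node"] = field(default_factory=dict)


class HybridPrefixCache:
    def __init__(self) -> None:
        self.root = Node()
        self._next_page = 0

    def publish(self, tokens: list[int], working: MambaSlot) -> list[int]:
        """Insert the full pages of `tokens` and snapshot the working slot
        into a checkpoint slot on the final node. The caller must pass the
        state as it was right after the last full page."""
        node, page_ids = self.root, []
        usable = len(tokens) - len(tokens) % PAGE_SIZE
        for start in range(0, usable, PAGE_SIZE):
            key = tuple(tokens[start:start + PAGE_SIZE])
            child = node.children.get(key)
            if child is None:
                child = Node(page_id=self._next_page)
                self._next_page += 1
                node.children[key] = child
            page_ids.append(child.page_id)
            node = child
        if node is not self.root:
            # Copy, so later updates to the working slot do not alter the snapshot.
            node.checkpoint = MambaSlot(list(working.state))
        return page_ids

    def match(self, tokens: list[int]) -> tuple[list[int], Optional[MambaSlot]]:
        """Return page IDs and a state copy for the longest cached prefix
        that ends on a node holding a checkpoint."""
        node, walked = self.root, []
        best_pages: list[int] = []
        best_slot: Optional[MambaSlot] = None
        for start in range(0, len(tokens) - PAGE_SIZE + 1, PAGE_SIZE):
            child = node.children.get(tuple(tokens[start:start + PAGE_SIZE]))
            if child is None:
                break
            walked.append(child.page_id)
            node = child
            if child.checkpoint is not None:
                best_pages = list(walked)
                best_slot = MambaSlot(list(child.checkpoint.state))
        return best_pages, best_slot


cache = HybridPrefixCache()
turn1 = [1, 2, 3, 4, 5, 6, 7, 8]
published = cache.publish(turn1, MambaSlot([0.5, -1.0]))

# Next turn extends the same context: both pages and the state are reusable.
pages, slot = cache.match(turn1 + [9, 10])

# A request sharing only the first page finds a KV page but no state
# checkpoint at that boundary, so nothing can be reused.
pages_partial, slot_partial = cache.match([1, 2, 3, 4, 9, 9, 9, 9])
print(pages, slot, pages_partial, slot_partial)
```

The second lookup is the interesting case: page 0 is cached, yet the match is empty because no recurrent state was snapshotted after the fourth token. A pure KV cache would have reused that page.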

Performance as a Systemic Achievement

The headline 580 tps benchmark is presented not as the result of a single trick, but of a holistic, "speed-of-light" system design. The three pillars (eliminating memory copies, advanced kernel fusions, and fully overlapped CPU-GPU execution) point to a framework built for minimal latency and maximum hardware utilization. The explicit goal of matching TensorRT-LLM performance while keeping vLLM's usability signals a focus on practical, high-efficiency deployment rather than raw benchmarks alone. The result is validated specifically against the demanding, repetitive nature of agentic tasks, where higher throughput translates directly into faster agent reasoning loops.

Disclaimer: The above content is generated by AI and is for reference only.
