Up to 580tps! New Speed Record of Qwen3.5-397B-A17B on GPU for Agentic Workloads with TokenSpeed
TokenSpeed is an inference engine that achieved a record-breaking 580 tokens per second running the Qwen3.5-397B-A17B model by systematically eliminating memory copies, fusing GPU kernels, and fully overlapping CPU and GPU execution to keep the GPU saturated. This extreme performance is specifically tailored for agentic workloads. On the functionality side, it supports complex serving scenarios through hybrid prefix caching for the model's mixed attention layers and unified state transfers betwe
Deep Analysis
Article Type: This is a technical product launch and deep-dive, detailing the engineering behind a new open-source inference engine and its specific optimization for a flagship AI model.
Architectural Recognition of a Hybrid Model
The Qwen3.5 model is not a standard pure-Transformer; it uses a hybrid architecture interleaving full attention layers with linear attention layers (GDN). TokenSpeed's core innovation is its native, full-spectrum support for this hybrid stack, treating both cache types as first-class citizens throughout its runtime. This is a critical design choice, as an inference engine treating such a model as a standard Transformer would fail to optimize its unique memory and computation patterns. The engine's prefix caching and scheduling are built from the ground up to handle the distinct recurrent states (conv_state, ssm_state) of GDN layers alongside traditional KV caches.
The Dual-Layer Cache: Solving the State Problem for Agents
For agentic workloads, which involve long, multi-turn sequences with tool calls, a robust prefix cache is vital to avoid reprocessing the same context. TokenSpeed implements a sophisticated split architecture for this cache:
- C++ Layer (Logic): Manages the radix-tree matching, page IDs, eviction policies, and the lifecycle of "MambaSlots" for GDN state.
- Python Layer (Physics): Manages the actual GPU tensors for KV pages and Mamba states, including stream ordering and copy-on-write operations.
The key technical insight here is the handling of the GDN state boundary. Simply reusing cached page IDs (as with KV cache) is insufficient for linear attention layers; the recurrent state at the exact prefix boundary must also be preserved. TokenSpeed attaches a MambaSlot to the same radix-tree node as the cached KV prefix, using a two-slot system (working slot and checkpoint slot) to snapshot and publish the recurrent state. This ensures that the complex memory dynamics of hybrid models don't become a bottleneck for prefix reuse in agentic flows.
Performance as a Systemic Achievement
The headline 580 tps benchmark is presented not as the result of a single trick, but of a holistic, "speed-of-light" system design. The three pillars—eliminating memory copies, advanced kernel fusions, and fully overlapped CPU-GPU execution—point to a framework built for minimal latency and maximum hardware utilization. The explicit goal to match TensorRT-LLM performance while keeping vLLM's usability signals a focus on practical, high-efficiency deployment rather than just raw benchmarks. This performance envelope is specifically validated against the demanding, repetitive nature of agentic tasks, where high throughput directly translates to faster agent reasoning loops.
Disclaimer: The above content is generated by AI and is for reference only.