Step 3.7 Flash Tops the Artificial Analysis Output Speed Leaderboard for Mainstream Models

409 tokens/s: a new record, highlighted in red on the speed leaderboard. StepFun's Step 3.7 Flash is like a race car pushed to its top speed on the digital highway, leaving every mainstream competitor behind. But does this gleaming medal really deserve the whole industry's applause?


Analysis

Speed is undoubtedly one of the most alluring metrics of our time. Users lost their patience for waiting long ago, and a model that runs 10 times faster is perceived as the difference between "usable" and "pleasant to use." Technically, reaching this speed means excelling across the whole stack of fundamentals: architecture optimization, engineering deployment, operator fusion, and hardware adaptation. It is a genuine show of technical strength and a fine win for engineering culture on the efficiency front. It also solves a real pain point: when models are deployed at scale for conversation, search, real-time translation, and similar scenarios, speed is the lifeline. Every 0.1 seconds shaved off latency removes one more point at which the user experience could break.
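To put the headline number in perspective, the minimal sketch below converts output speed into the time a user waits for a full response. It assumes a steady decode rate and ignores time to first token; the 1,000-token response length and the 10-times-slower baseline are illustrative assumptions, not figures from the leaderboard.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream `output_tokens` at a steady rate (time to first token ignored)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


FLASH_SPEED = 409.0                 # tokens/s, the figure cited in the article
BASELINE_SPEED = FLASH_SPEED / 10   # hypothetical model that is 10 times slower
RESPONSE_TOKENS = 1000              # assumed response length, for illustration only

fast = generation_seconds(RESPONSE_TOKENS, FLASH_SPEED)
slow = generation_seconds(RESPONSE_TOKENS, BASELINE_SPEED)

print(f"{fast:.2f} s vs {slow:.2f} s")  # prints "2.44 s vs 24.45 s"
```

Under these assumptions the gap is roughly two and a half seconds against nearly twenty-five, which is the kind of difference a user notices immediately.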

However, behind the leaderboard's glow lies a glaring shadow. What does this runaway speed cost? The "intelligent efficiency" metric gets a mention, but it is precisely the heart of the problem. We have seen it too many times: models that post sky-high scores on specific benchmarks are exposed as soon as they meet real, complex scenarios that demand long-horizon reasoning or common-sense judgment. As the name suggests, Flash models aim for "flash" responses. Such a model may be highly specialized, trading versatility and depth of thinking for speed. For quick summaries, simple Q&A, or code completion, it is a godsend. But would you dare let it write an industry analysis report on its own, or handle a complex task that requires multi-step trade-offs? I have my doubts.

Another eye-catching item on the leaderboard is the "speed-to-price ratio." It is a dangerous signal: the price war has escalated from a "feature war" into a "speed war." When every player is dragged into an arms race to be "both fast and cheap," what happens? True innovation gets stifled. Developing a "smarter," more well-rounded model that can read between the lines takes massive compute and long training cycles; it is slow and expensive. Polishing a speed-specialized model, by contrast, follows a clear path with immediate results, so capital and attention will flood toward the latter. Over time, we will get not a more intelligent world but a digital fast-food world full of "fast but shallow" models. They can give you an answer quickly, but its quality may not be worth the time you spent waiting.

This brings me back to the original aspiration behind large language models. We are after intelligence closer to that of humans, partners that can understand, reason, and create, not merely "faster typewriters." Speed is an important foundational attribute, but placing it above intelligence itself puts the cart before the horse. A 1000-point model that runs slowly still has 1000 points of intelligence; a 100-point model, however fast, cannot solve truly hard problems. Does the current evaluation system weight speed so heavily that it risks misdirecting the entire industry's R&D focus?

StepFun's run up the leaderboard showcases its technical prowess, and there is nothing wrong with that. But if the industry celebrates en masse and treats "fastest" as the supreme honor, that would be lamentably short-sighted. We must guard against the inertia of "speed for speed's sake." Users need "good and fast," not "only fast." True competitiveness lies in raising speed while keeping the model sufficiently "smart," not in sacrificing "smartness" for an empty reputation for speed.

So applaud this speed record, but only once. Then we should ask, coolly: how "smart" is your model, really? On the "marathon" track that demands deep thinking, can you keep running? Speed is a means; intelligence is the end. Don't let the numbers on a leaderboard blur our pursuit of the essence of intelligence.


Disclaimer: The above content is generated by AI and is for reference only.

