New benchmark exposes how badly AI struggles with real knowledge work

The number is 3 percent. Not 30, not 70. Just 3. That’s the share of realistic, complex knowledge-work tasks that even the best-performing AI models can fully solve, according to a new and frankly brutal benchmark. Let that sink in. For all the breathless announcements and billion-dollar valuations, we’ve built a generation of tools that, when faced with the messy, ambiguous, multi-step reality of actual professional work, drop the ball 97% of the time.

Analysis

This isn’t about solving math problems or writing poetry. This is about the core of the modern knowledge economy: synthesizing disparate reports, navigating a company’s internal politics to draft a memo, debugging a codebase while respecting legacy constraints, or planning a marketing campaign across shifting channels. The benchmark exposes the chasm between AI as a flashy demo and AI as a reliable colleague. And it’s a chasm we’ve been politely ignoring.

The hype machine runs on anecdotes and cherry-picked victories. We see a model write a beautiful sonnet or pass a bar exam and extrapolate that it’s on the cusp of replacing analysts, coders, and consultants. This 3% figure is a cold bucket of water. It suggests these models are not “almost there” with a few tweaks. They are fundamentally different cognitive machines. They are brilliant pattern-matchers and text predictors, but they lack the situated understanding, the persistent memory, the common-sense reasoning, and the true problem-framing that define real knowledge work.

Think of it this way: we’ve built the world’s most incredible parrot. It can mimic the sound of a complex conversation with terrifying accuracy. But when you ask it to actually lead the meeting, resolve the conflict, and produce an actionable plan? It squawks and offers a statistically plausible, but ultimately hollow, response.

This has massive implications for the enterprise AI gold rush. Companies are spending fortunes to “integrate AI” into workflows, promising massive productivity gains. But what if the foundational tech can only handle the trivial 3%? You’re not augmenting a human’s core work; you’re just automating the absolute simplest edges, like reformatting a document or summarizing a meeting where everyone already agreed. That’s fine, but it’s not the revolution being sold. It’s a better search bar.

The most damning indictment is what this says about our understanding of intelligence. We’ve conflated fluency with competence. The model sounds like it knows what it’s doing because it’s trained on the output of people who did. It’s a mirror, not an engine. When the task requires navigating a novel, unstructured problem with incomplete information and real-world consequences—the very essence of knowledge work—it shatters. The benchmark isn’t just testing the model; it’s testing our own gullibility.

Let’s be brutally honest: for the foreseeable future, AI is a tool for the 3 percent, not a replacement for the 97 percent. It’s a phenomenal accelerator for drafting, brainstorming, and rote extraction. It is not a junior associate, a reliable coder, or a strategic thinker. Calling it any of those is a dangerous fantasy that sets up enterprises for expensive disappointment and workers for unnecessary anxiety.

The decoder ring for all future AI news should now be this: what percentage of the actual, complex task did it solve? If the answer is anything close to this miserable 3%, we should treat the announcement not as a step toward superintelligence, but as a reminder of the profound, perhaps permanent, gap between statistical prediction and true understanding. The hype train has left the station, but the tracks end abruptly at a cliff. This benchmark just gave us the map.

Disclaimer: The above content is generated by AI and is for reference only.

Benchmark Evaluation LLM
