UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

No model currently exceeds 40% accuracy on a new benchmark designed to test a single, critical skill: can an AI think in probabilities, not just in sentences? That’s the punchline of UnpredictaBench, a study that exposes a fundamental weakness in today’s large language models, even as they are increasingly sold as universal simulators of human thought and complex systems. The finding is less a technical footnote and more a philosophical indictment. We’ve built machines that are extraordinary at

Analysis

This isn’t about creativity or diversity—topics Silicon Valley loves to market. It’s about calibration, a far more sobering requirement. When researchers want to use an LLM to stand in for thousands of humans in an economic model, or to simulate market behavior, they don’t need the AI to brainstorm. They need it to faithfully reproduce a specific distribution of possible outcomes, complete with its inherent noise and uncertainty. UnpredictaBench’s 448 problems show that current models, when asked to sample from even basic statistical distributions or simple stochastic scenarios, tend to collapse their outputs into a narrow, predictable corridor. They find the "most plausible" path and then, ironically, become deterministic in their pursuit of it. They’re autocomplete on steroids, and autocomplete doesn’t understand variance.
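The collapse described above can be made concrete with a toy sketch. It is not the benchmark's method: the loaded-die probabilities, the function names, and the use of total variation distance as the score are all assumptions chosen for illustration. The sketch compares a "most plausible answer every time" strategy with sampling in proportion to the stated probabilities.

```python
import random
from collections import Counter

# Hypothetical loaded die (face -> probability); illustrative numbers only,
# not taken from UnpredictaBench.
LOADED_DIE = {1: 0.10, 2: 0.10, 3: 0.10, 4: 0.15, 5: 0.20, 6: 0.35}

def greedy_rolls(dist, n):
    """Return the single most probable face n times (the 'collapsed' behavior)."""
    mode = max(dist, key=dist.get)
    return [mode] * n

def sampled_rolls(dist, n, rng):
    """Draw n faces in proportion to their probabilities."""
    faces = list(dist)
    weights = [dist[f] for f in faces]
    return rng.choices(faces, weights=weights, k=n)

def total_variation(dist, rolls):
    """Total variation distance between the target and the empirical frequencies."""
    counts = Counter(rolls)
    n = len(rolls)
    return 0.5 * sum(abs(p - counts.get(f, 0) / n) for f, p in dist.items())

rng = random.Random(0)  # fixed seed so the run is repeatable
greedy = greedy_rolls(LOADED_DIE, 1000)
sampled = sampled_rolls(LOADED_DIE, 1000, rng)

print("greedy  TV distance:", round(total_variation(LOADED_DIE, greedy), 3))
print("sampled TV distance:", round(total_variation(LOADED_DIE, sampled), 3))
```

The greedy strategy picks a perfectly reasonable face on every single roll, yet its distance from the target stays large no matter how many rolls are requested. The sampled rolls move toward the target as the count grows. Plausibility of each individual output says nothing about calibration of the whole set.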

The benchmark’s genius is in its simplicity. It separates the wheat from the chaff by ignoring the flashy, complex tasks and drilling down into this one, irreducible capability: generating calibrated randomness. The fact that scores spread from near-zero to just over 20% on the standard metric tells a story of a new kind of AI haves and have-nots, one that has little to do with model size or brand prestige. It suggests that some architectures or training regimes might have stumbled upon a better grasp of statistical reality, while others are just parroting distributions from their training data without any underlying model of chance.

Here’s the core problem: we are anthropomorphizing randomness. When a human imagines a “likely” parking spot or a “plausible” stock price movement, we’re engaging a messy, intuitive model built on lived experience, fear, and heuristic. An LLM has none of that. It has only text. Its “understanding” of probability is a syntactic reflection of how probability is discussed in the corpus it was trained on. It can write a beautiful treatise on the Monte Carlo method but fail to produce a single valid Monte Carlo sample from a Gaussian. It has learned the language of uncertainty, not the substance of it. We’re asking a poet to do the job of a statistician.

This disconnect has massive implications for the "digital twin" fever gripping every industry from finance to urban planning. The promise is that we can create an AI clone of a city, an economy, or a supply chain and run simulations on it. UnpredictaBench reveals that, before we even tackle the staggering complexity of those systems, our foundational tool can’t even faithfully mimic the basic noise within them. We’re trying to simulate a hurricane when the AI can’t accurately model a simple gust of wind. The benchmark is a necessary, humbling first step: a reality check that says, “before you scale, check the basics.”

The researchers note that adding "reasoning" (like chain-of-thought prompting) provides only marginal gains. This is the most telling detail. It confirms that this isn’t a failure of brute-force computation that more thinking can fix. It’s a failure of representation. The model’s internal world model, to the extent it has one, is built for narrative coherence, not for stochastic fidelity. Coherence demands a single, compelling story. Reality demands a distribution of messy, contradictory possibilities.

So where does this leave us? It suggests a potential fork in the road for AI development. One path doubles down on the language-game, optimizing models for ever-more convincing conversation and content. The other path takes this benchmark as a mission statement: to build a new class of models that genuinely internalize the concepts of distribution, variance, and uncertainty at a representational level. These would be less like eloquent parrots and more like silicon-based statisticians, entities that don’t just talk about chance but think in chance.

Until then, using today’s LLMs as direct substitutes for unpredictable real-world agents is a form of magical thinking. They are mirrors, not engines. They reflect the patterns and biases of their training data with stunning fidelity, but they cannot generate true, calibrated novelty. UnpredictaBench doesn’t just measure a technical shortcoming; it measures the chasm between the illusion of intelligence we’ve built and the messy, probabilistic reality it purports to simulate. The emperor’s new clothes, it turns out, have no variance.

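Have you ever asked an AI to play a whole crowd, simulating a neighborhood vote or the trading mood of a stock market? The results tend to disappoint: every "virtual resident" behaves as if under mass hypnosis, all voting yes or all turning bullish. The vivid, noisy, self-contradictory flavor of reality is gone. This is a deep-seated ailment of today's large language models: they are too "correct," so correct that they lose the absurdity and uncertainty the real world should have. The UnpredictaBench paper on arXiv works like a sharp scalpel, cutting cleanly into a tumor we have been pretending not to see.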

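The researchers built a test whose very name drips with irony: UnpredictaBench. The method is direct. Hand the model an explicit description of a random distribution, such as "generate 100 sample points from the normal distribution N(0,1)," or something more everyday, such as "simulate rolling an unfair die 1,000 times." Then use a rigorous statistical test (the Kolmogorov-Smirnov test) to judge whether the numbers the AI spits out really look as if they were shaken out of that black box at random. The result? Bleak. Even the best model manages a "pass rate" of only a little over 20% under the strict criterion. It is like asking a scholar who claims to know everything to flip a coin, only to watch him absent-mindedly land the same side every time and insist, "I think this side is more reasonable."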
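A minimal sketch of what a one-sample Kolmogorov-Smirnov check against N(0,1) looks like, using only the Python standard library. The function names are hypothetical, the 5% significance level and the large-sample critical value of about 1.358/sqrt(n) are assumptions made for this example, and the benchmark's actual thresholds and scoring may differ. The sketch covers only the continuous case; a discrete target such as a loaded die is usually checked with a different goodness-of-fit test.

```python
import math
import random

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(samples, cdf):
    """One-sample KS statistic: largest gap between the empirical and target CDFs."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def passes_ks(samples, cdf, c_alpha=1.358):
    """Large-sample approximation: accept at roughly the 5% level
    when D <= c_alpha / sqrt(n). An assumed rule for this sketch."""
    return ks_statistic(samples, cdf) <= c_alpha / math.sqrt(len(samples))

rng = random.Random(42)  # fixed seed so the run is repeatable
honest = [rng.gauss(0.0, 1.0) for _ in range(100)]     # drawn from the target
collapsed = [rng.gauss(0.0, 0.1) for _ in range(100)]  # right center, far too narrow

print("honest    D =", round(ks_statistic(honest, normal_cdf), 3))
print("collapsed D =", round(ks_statistic(collapsed, normal_cdf), 3))
print("collapsed passes:", passes_ks(collapsed, normal_cdf))
```

The collapsed set is centered exactly where it should be, and each value on its own is a perfectly plausible draw from N(0,1). The test still rejects it, because the statistic looks at the shape of the whole sample rather than at any single number.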

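Where does the problem lie? The paper points at a word we trust too much: "diversity." The industry keeps preaching more diverse model output, as if saying things in varied ways were itself a mark of intelligence. UnpredictaBench answers with a slap: where precise simulation is required, what we need is not a showy "many plausible answers" but a tightly fitting "one reality." Real-world randomness has a mathematical skeleton. Consumer decisions in an economic simulation, fluctuations in the spread of an epidemic, even the behavior patterns of game characters all follow specific statistical laws underneath, laws that do not always match human intuition. Current LLM training is, at bottom, next-word prediction over massive amounts of text, which naturally teaches the model to seek the single most likely path rather than to follow paths in proportion to their probability. It has become an excellent "summarizer of consensus" but a poor "random generator."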

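This exposes an unsettling drift in the current course of AI development. We are too obsessed with getting models to produce answers that sound right, read coherently, and match what most people expect. RLHF (reinforcement learning from human feedback) has added fuel to the fire: we keep fine-tuning models on "human preference," so they get ever better at pleasing us, producing text that human evaluators judge "right," "good," and "safe." But that logic breaks down when simulating real complex systems. The real world is not always "correct" or "pleasant"; it is full of counterintuitive spikes and long tails. How could an AI that only outputs "politically correct" answers simulate the surging, hard-to-articulate popular mood before Trump came to power? How could a model that always leans toward the most common answer anticipate a black swan event?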

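More pointed still, the paper notes that even Chain-of-Thought, the currently fashionable "thinking" technique, does little to improve distribution sampling. This hints that the problem may be rooted more deeply in the model's architecture and training objective. We may have gone wrong at the very first step: with the next-word prediction paradigm, we raised a natural "determinism obsessive." What it understands as "random" may be nothing more than the probability that tokens like "random" and "coincidence" co-occur with their context, not a probability distribution in the mathematical sense.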

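So what does the arrival of UnpredictaBench mean? It is more than a new leaderboard. It is a bucket of cold water over the whole AI industry, a reminder that while we cheer LLMs for writing poetry, coding, and chatting, they may not be able to simulate something as basic as a die roll. It calls into question the foundation of the grand narrative of "replacing humans with LLMs in socioeconomic simulation": if a model cannot get simple distributions right, how can it simulate complex human nature and society? It is also a warning to those who chase model scale blindly: parameter count is no cure-all, and we may need new training paradigms and new objective functions to teach models to understand and reproduce the world's "unreasonable" side.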

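Technology's arrogance often lies in trying to file a disorderly reality into neat, uniform frameworks. UnpredictaBench stands right on that fault line. It tells us that the road to artificial general intelligence (AGI) requires not only the wisdom to answer questions but also the "chaos" to generate them. After all, an AI that cannot truly embrace uncertainty or simulate the world's "unseemly" side, however fluent its conversation, is only a high-end yes-man, a deterministic puppet in a probabilistic sugar coating. Real intelligence may have to start by learning to "make the right mistakes."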

Disclaimer: The above content is generated by AI and is for reference only.

