What Do People Actually Want From AI? Mapping Preference Plurality

In-Depth Analysis

They’re trying to flatten the world into a single reward signal, and it’s not working. That’s the screaming takeaway from a new dissection of Reinforcement Learning from Human Feedback, the so-called alignment technique powering the most polished AI chatbots. Researchers digging into a massive dataset of what people across 75 countries actually ask for from AI have exposed RLHF not as a nuanced tuning tool, but as a crude epistemological blender. It purports to make models “aligned with human values,” but the paper starkly reveals it mostly averages out dissent, erases context, and imposes a dangerously monolithic version of truth.

The fantasy sold by the AI labs is elegant: collect human preferences, train a model to predict which outputs people prefer, then optimize the chatbot against that predictor’s score. The reality, as this research brutally quantifies, is a system of profound distortion. The most requested value, “truthfulness,” is the perfect case study. Nearly half of respondents wanted it, which should be a slam dunk for a reward model. It stops looking like one once you listen to what people mean by truthfulness. Some want sourced, verifiable facts. Some want the consensus of established experts. Some actively want contrarian, unpopular views presented as a counterweight to mainstream narratives. These aren’t minor variations on a theme; they are fundamentally different, often incompatible, epistemologies. Trying to capture “truthfulness” with a single scalar reward score derived from pairwise comparisons is like trying to describe the color blue using only a ruler. The tool is categorically mismatched for the task.
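To make the mismatch concrete, here is a minimal sketch of what happens when pairwise comparisons from people with different notions of truthfulness are squeezed into one scalar score per response. Everything in it is an assumption made for illustration: the three group names, their population shares, their rankings, and the choice of a Bradley-Terry model (a standard way to turn pairwise comparisons into scores, not something the paper is confirmed to use). None of the numbers come from the study.

```python
# Hypothetical groups: (population share, ranking of response styles, best first).
# Shares and rankings are invented for illustration only.
GROUPS = {
    "sourced_facts": (0.40, ["cited", "expert", "contrarian"]),
    "expert_consensus": (0.35, ["expert", "cited", "contrarian"]),
    "contrarian_views": (0.25, ["contrarian", "cited", "expert"]),
}
RESPONSES = ["cited", "expert", "contrarian"]


def win_rate(a, b):
    """Share of the population that prefers response a over response b."""
    return sum(
        share
        for share, ranking in GROUPS.values()
        if ranking.index(a) < ranking.index(b)
    )


def fit_bradley_terry(items, iters=200):
    """Fit one scalar strength per item with the standard MM update.

    Each pair is treated as one unit of comparisons, split by win_rate.
    """
    strength = {i: 1.0 for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            wins = sum(win_rate(i, j) for j in items if j != i)
            denom = sum(1.0 / (strength[i] + strength[j]) for j in items if j != i)
            updated[i] = wins / denom
        total = sum(updated.values())
        strength = {i: s / total for i, s in updated.items()}
    return strength


scores = fit_bradley_terry(RESPONSES)
ranking = sorted(RESPONSES, key=scores.get, reverse=True)

# Share of users whose own first choice matches the top-scored response.
served = sum(share for share, r in GROUPS.values() if r[0] == ranking[0])

print("scalar ranking:", ranking)
print("users whose first choice wins:", served)
```

Under these made-up numbers, the single score ranks the "cited" style first even though only 40% of users put it first, and the style that a quarter of users most want ends up at the bottom. The scalar has no way to record that the disagreement ever existed.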

And that’s for the most widely requested value, one that still falls short of a majority. For everything else, the method fares even worse. The paper finds that most specific values (honesty, politeness, conservatism, progressivism, and so on) are requested by fewer than one in four users. This isn’t a minoritarian inconvenience; it reveals that the entire premise of seeking a “general” preference is flawed. There is no “general” human. There are billions of situated people with contextual, competing desires. When the system is optimized to satisfy whichever side wins each comparison, it doesn’t create a wise mediator. It creates a bland, lowest-common-denominator pleaser that alienates everyone except the statistical ghost in the machine.

This flattening isn’t just mediocre; it’s actively violent, as the paper argues by invoking the term “epistemic violence.” This isn’t hyperbole. When an AI trained on aggregated preferences presents a single, averaged-out “truth” or a single mode of being “helpful,” it actively marginalizes and erases the minority perspectives that were sandblasted away during training. The user from a culture that values indirection isn’t “wrong” for preferring it; the system simply has no room for them. The one who wants the AI to challenge premises rather than comply isn’t being “difficult”; they’re expressing a valid intellectual stance the reward model couldn’t fit on its curve. The process doesn’t align the AI with “human values”; it aligns it with the values of the median, anonymized labeler in a feedback loop, a population whose demographics and biases are themselves opaque.

The consequences are tangible and current. Look at the raging debate over “guardrails” and “safety” features. This paper shows these aren’t universally desired safety nets but deeply contested features—some users see them as essential protections, others as paternalistic censorship. The RLHF regime doesn’t resolve this tension; it just silently picks a side and encodes it as a model “behavior,” then sells it as an objective safety standard. Similarly, the call for AI to be more “human-like” in its tone is shown to be controversial, not a given. The labs, however, double down on “conversational” and “relatable” models because that’s what their flattened feedback signal often favors, ignoring the substantial cohort that wants a dispassionate tool.

Most damning is the indictment of the industry’s favorite cop-out for AI’s persistent hallucinations and inaccuracies. Companies insist they’re working on it, pouring billions into scaling models. This research suggests they’re solving the wrong problem. On this reading, hallucination rates aren’t stubbornly high for lack of data or compute; they’re high because the alignment method itself cannot properly learn what accuracy means to different users. The model isn’t trying to be wrong; it’s trying to maximize a reward score that was built on a hopelessly vague, averaged concept of “rightness.” It’s optimizing for a chimera, and the result is confident nonsense.

So where does that leave us? Stuck in a paradigm that treats human preference as a data problem to be solved with statistics, rather than a philosophical and social challenge to be navigated with humility. The paper’s call for methods that can handle plural, contextual values isn’t just an academic suggestion—it’s a diagnosis. Continuing down this path means building systems that will inevitably become better at gaslighting than at understanding, better at producing consensus-shaped outputs than at engaging with the messy, contradictory richness of human desire. The alternative isn’t a technical tweak to RLHF. It’s a fundamental rethink: moving away from monolithic reward models toward architectures that can preserve and reason about dissent, that can ask clarifying questions instead of assuming averages, and that can recognize that a model’s most important job might not be to flatten our disagreements, but to faithfully reflect that they exist. The first AI lab that truly understands this won’t just build a better model; it’ll have built a more honest mirror.
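One way to picture “asking clarifying questions instead of assuming averages” is a policy that keeps separate scores per value orientation and only answers directly when it knows which orientation applies or when all orientations agree. The sketch below is purely hypothetical: the group names, the scores, and the `decide` function are invented for illustration and are not drawn from the paper or from any lab’s system.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str  # "respond" or "ask"
    detail: str  # chosen response style, or the clarifying question


# Hypothetical per-group reward scores for three response styles.
GROUP_SCORES = {
    "sourced_facts": {"cited": 0.9, "expert": 0.6, "contrarian": 0.2},
    "expert_consensus": {"cited": 0.6, "expert": 0.9, "contrarian": 0.1},
    "contrarian_views": {"cited": 0.5, "expert": 0.2, "contrarian": 0.9},
}


def decide(group_scores, known_group=None):
    """Answer directly only when the user's orientation is known or undisputed."""
    if known_group is not None:
        scores = group_scores[known_group]
        return Decision("respond", max(scores, key=scores.get))

    # Collect each group's favorite instead of averaging the scores away.
    favorites = {max(s, key=s.get) for s in group_scores.values()}
    if len(favorites) == 1:
        return Decision("respond", favorites.pop())

    options = ", ".join(sorted(favorites))
    return Decision("ask", f"Which kind of answer do you want: {options}?")


print(decide(GROUP_SCORES))
print(decide(GROUP_SCORES, known_group="contrarian_views"))
```

With no information about the user, this toy policy asks rather than silently picking the majority’s favorite; once the orientation is known, it serves that orientation’s top choice. Whether such a design scales is an open question; the point is only that disagreement is kept visible rather than averaged out.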

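49%. That is the most glaring number in the paper: in the survey, nearly half of respondents asked AI for “truthfulness.” At first glance this looks like a rare consensus, a clear objective that could easily be encoded into a reward function. But what the paper says next lands like a bucket of ice water on that technological optimism: among the 49% who asked for “truth,” some wanted “traceable citations,” some wanted “the views of authoritative experts,” and some even wanted “minority perspectives rejected by the mainstream.” They all used the same word, “truth,” but underneath lie sharply different, even mutually opposed, epistemologies. When RLHF’s binary comparison (is A better, or B?) tries to flatten that gulf, it is not “aligning” anything; it is manufacturing a violent illusion.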

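This exposes the arrogance at the core of today’s dominant alignment paradigm. It assumes there is a “general human preference,” a smooth surface waiting to be fitted by an algorithm. In reality, human preferences form a rugged landscape full of ravines, sharp oppositions, and even self-contradictions. The paper puts it bluntly: most values are mentioned by fewer than a quarter of people. That means a reward model built to “please the majority” is, in essence, systematically indifferent to the needs of the remaining three quarters. AI alignment has quietly become an ongoing “tyranny of values” decided by the votes of data annotators (who often come from particular regions and educational backgrounds). The answers labeled “good” have had their cultural edges filtered off and their philosophical differences sanded down, until they become a stream of smooth, standardized, lukewarm “correct answers.” This is hardly technology; it is an assembly-line approach to a complex cultural question that demands anthropological sensitivity.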

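The sharper irony lies in two examples: “human-like behavior” and “AI guardrails.” The paper finds that both are themselves hugely contested. Some users badly want AI to be more human, with emotions and personality; others find that frightening and insist that AI must identify itself as a machine. Likewise with guardrails: some see them as a safety baseline and a responsibility; others see them as irritating censorship, shackles on intelligence. The “human values” we clamor for AI to learn were never unified in the first place. Current alignment methods are like holding a key labeled “human consensus” and trying to open a door that does not exist. They cannot handle this “want both” dilemma, let alone understand that in some scenarios what the user needs is “follow the rules by default, but break them when I explicitly ask.” That kind of subtle, context-dependent distinction is left with no voice at all in an either-or binary comparison.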

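And so we arrive at the most painful result the paper points to: although users have such a clear, strong, cross-cultural demand for “accuracy” and “fewer hallucinations,” enormously expensive models still hallucinate at frustratingly high rates. Why? Because in shaping the model, the RLHF pipeline distorts the direction of optimization with the wrong signal. What it rewards may not be real “truth” but whatever “looks most like the phrasing labeled as true in the training data.” What it optimizes may not be “fewer errors” but “a lower probability of making errors in front of the evaluator.” Between the two lies a vast ethical and cognitive gulf. The paper calls this “epistemic violence,” and the term is no exaggeration. It is a gentle, systematic erasure dressed up as “user preference”: an erasure of difference and of the legitimacy of argument, one that replaces every specific, living, possibly conflicting individual voice with the statistical mode.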

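So when we talk about “alignment,” what are we actually talking about? If the methodology itself rests on oversimplified assumptions, if the signals it collects are acknowledged from the outset to be “unrepresentative” and “conflicting,” and if its way of handling conflict is simply to suppress the minority, then what we train is not an AI that understands and serves people but a timid, perpetually politically correct parrot, skilled at performing the “greatest common denominator.” It may never utter a truth deep enough to offend, and never offer an unconventional, disruptive answer, because it is locked onto the narrow track of “mainstream values” paved with biased and incomplete feedback.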

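Real alignment should perhaps stop chasing the illusory “general model.” It may need to embrace plurality, acknowledge disagreement, and even, at some levels, design different interaction modes for different value orientations. It first has to learn to say honestly: “I understand your definition of ‘truth,’ and another person’s understanding is completely different. Here is a source-based statement, a compilation of expert opinion, and a record of minority views. Please choose.” That hands choice and contestability back to specific people in specific situations, rather than hiding them behind a black-box “reward score” trained on millions of binary comparisons. Otherwise, all we will have built is the most refined tool of cultural hegemony yet, a self-satisfied king of mediocrity written in code.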

Disclaimer: The above content is generated by AI and is for reference only.
