What Do People Actually Want From AI? Mapping Preference Plurality
Analysis
They’re trying to flatten the world into a single reward signal, and it’s not working. That’s the screaming takeaway from a new dissection of Reinforcement Learning from Human Feedback, the so-called alignment technique powering the most polished AI chatbots. Researchers digging into a massive dataset of what people across 75 countries actually ask for from AI have exposed RLHF not as a nuanced tuning tool, but as a crude epistemological blender. It purports to make models “aligned with human values,” but the paper starkly reveals it mostly averages out dissent, erases context, and imposes a dangerously monolithic version of truth.
The fantasy sold by the AI labs is elegant: collect human preferences, train a model to predict which outputs people prefer, then use that as a guide. The reality, as this research brutally quantifies, is a system of profound distortion. The most requested value, “truthfulness,” is the perfect case study. Nearly half of respondents wanted it, which should be a slam dunk for a reward model. Except when you listen to what people mean by truthfulness. Some want sourced, verifiable facts. Some want the consensus of established experts. Some actively want contrarian, unpopular views presented as a counterweight to mainstream narratives. These aren’t minor variations on a theme; they are fundamentally different, often incompatible, epistemologies. Trying to capture “truthfulness” with a single scalar reward score derived from pairwise comparisons is like trying to describe the color blue using only a ruler. The tool is categorically mismatched for the task.
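To make the mismatch concrete, here is a minimal sketch of the pairwise-comparison step, assuming the reward model is a plain Bradley-Terry fit with one scalar per response style. The two style names, the 60/40 split of annotators, and the function `fit_bradley_terry` are all hypothetical and chosen for illustration; none of these numbers come from the paper, and production reward models are neural networks rather than lookup tables.

```python
import math


def sigmoid(x):
    """Logistic function used by the Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_bradley_terry(comparisons, items, steps=500, lr=0.5):
    """Fit one scalar reward per item by gradient ascent on the
    Bradley-Terry log-likelihood, where
    P(winner beats loser) = sigmoid(reward[winner] - reward[loser]).

    comparisons: list of (winner, loser) pairs pooled from all annotators.
    """
    reward = {item: 0.0 for item in items}
    n = len(comparisons)
    for _ in range(steps):
        grad = {item: 0.0 for item in items}
        for winner, loser in comparisons:
            p_win = sigmoid(reward[winner] - reward[loser])
            # Gradient of log(p_win) with respect to each reward.
            grad[winner] += (1.0 - p_win) / n
            grad[loser] -= (1.0 - p_win) / n
        for item in items:
            reward[item] += lr * grad[item]
    return reward


# Hypothetical annotator pool (illustrative numbers, NOT from the paper):
# 60 people who always prefer the expert-consensus answer and
# 40 people who always prefer the contrarian counterweight.
comparisons = (
    [("consensus", "contrarian")] * 60
    + [("contrarian", "consensus")] * 40
)

rewards = fit_bradley_terry(comparisons, ["consensus", "contrarian"])
gap = rewards["consensus"] - rewards["contrarian"]
p_consensus = sigmoid(gap)
print(f"reward gap: {gap:.3f}, implied P(consensus preferred): {p_consensus:.2f}")
```

In this toy setup the fitted gap settles at log(0.6 / 0.4), roughly 0.405. The scalar records only that one style wins 60 percent of the time. It cannot record that those wins come from two groups who each hold their view consistently, which is exactly the information a policy would need to serve both.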
And that’s for the single most requested value. For everything else, the method is even more bankrupt. The paper finds that most specific values, whether honesty, politeness, conservatism, or progressivism, are each demanded by fewer than one in four users. This isn’t a minor inconvenience affecting a few outliers; it reveals that the entire premise of seeking a “general” preference is flawed. There is no “general” human. There are billions of situated people with contextual, competing desires. When the system is optimized to satisfy whichever side wins each comparison, it doesn’t create a wise mediator. It creates a bland, lowest-common-denominator pleaser that fully satisfies no one except the statistical ghost in the machine.
This flattening isn’t just mediocre; it’s actively violent, as the paper argues by invoking the term “epistemic violence.” This isn’t hyperbole. When an AI trained on aggregated preferences presents a single, averaged-out “truth” or a single mode of being “helpful,” it actively marginalizes and erases the minority perspectives that were sandblasted away during training. The user from a culture that values indirection isn’t “wrong” for preferring it; the system simply has no room for them. The one who wants the AI to challenge premises rather than comply isn’t being “difficult”; they’re expressing a valid intellectual stance the reward model couldn’t fit on its curve. The process doesn’t align the AI with “human values”; it aligns it with the values of the median, anonymized labeler in a feedback loop, a population whose demographics and biases are themselves opaque.
The consequences are tangible and current. Look at the raging debate over “guardrails” and “safety” features. This paper shows these aren’t universally desired safety nets but deeply contested features—some users see them as essential protections, others as paternalistic censorship. The RLHF regime doesn’t resolve this tension; it just silently picks a side and encodes it as a model “behavior,” then sells it as an objective safety standard. Similarly, the call for AI to be more “human-like” in its tone is shown to be controversial, not a given. The labs, however, double down on “conversational” and “relatable” models because that’s what their flattened feedback signal often favors, ignoring the substantial cohort that wants a dispassionate tool.
Most damning is the indictment of the industry’s favorite excuse for AI’s persistent hallucinations and inaccuracies. Companies insist they’re working on it, pouring billions into scaling models. This research suggests they’re scaling their way past the wrong problem. On this reading, hallucination rates aren’t stubbornly high because of a lack of data or compute; they’re high because the alignment method itself is incapable of properly learning what accuracy means to different users. The model isn’t trying to be wrong; it’s trying to maximize a reward score that was built on a hopelessly vague, averaged concept of “rightness.” It’s optimizing for a chimera, and the result is confident nonsense.
So where does that leave us? Stuck in a paradigm that treats human preference as a data problem to be solved with statistics, rather than a philosophical and social challenge to be navigated with humility. The paper’s call for methods that can handle plural, contextual values isn’t just an academic suggestion—it’s a diagnosis. Continuing down this path means building systems that will inevitably become better at gaslighting than at understanding, better at producing consensus-shaped outputs than at engaging with the messy, contradictory richness of human desire. The alternative isn’t a technical tweak to RLHF. It’s a fundamental rethink: moving away from monolithic reward models toward architectures that can preserve and reason about dissent, that can ask clarifying questions instead of assuming averages, and that can recognize that a model’s most important job might not be to flatten our disagreements, but to faithfully reflect that they exist. The first AI lab that truly understands this won’t just build a better model; it’ll have built a more honest mirror.
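What “asking instead of averaging” might look like can be sketched in a few lines. This is only an illustration under stated assumptions: the function `choose_action`, the group labels, the preference rates, and the 0.25 disagreement threshold are all hypothetical, and the paper does not prescribe this or any other specific mechanism.

```python
def choose_action(group_pref_rates, threshold=0.25):
    """Decide whether to answer directly or ask a clarifying question.

    group_pref_rates: hypothetical mapping from a user group to the fraction
    of that group preferring response A over response B.
    threshold: assumed cutoff on how far groups may diverge before the
    system stops trusting the pooled average.
    """
    rates = list(group_pref_rates.values())
    spread = max(rates) - min(rates)
    if spread > threshold:
        # Groups disagree strongly: surface the disagreement
        # instead of silently averaging it away.
        return "ask_clarifying_question"
    mean_rate = sum(rates) / len(rates)
    return "respond_A" if mean_rate >= 0.5 else "respond_B"


# Strong disagreement between two hypothetical groups: ask first.
print(choose_action({"prefers_directness": 0.9, "prefers_indirection": 0.1}))

# Broad agreement: answer directly.
print(choose_action({"prefers_directness": 0.8, "prefers_indirection": 0.7}))
```

The point of the sketch is the branch itself: a pooled score of 0.5 looks the same whether everyone is indifferent or two groups are diametrically opposed, and only the per-group view tells those cases apart.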