Protocol for evaluating ChatGPT in biomedical association generation and verification using a RAG-enabled, cross-model majority voting workflow

The real story isn’t that we need new protocols to check if ChatGPT knows the difference between a symptom and a side effect; it’s that the researchers behind this paper had to build a Rube Goldberg machine just to get the AI to admit when it’s lying. This new evaluation framework is less a diagnostic tool and more a damning portrait of our current state, where we’re so intoxicated by the fluency of these models that we’ve forgotten how to trust them, especially when lives are on the line.

Analysis

The core protocol is straightforward: force ChatGPT to generate biomedical links, then cross-reference them against known ontologies and published literature. It’s the standard approach, like teaching a toddler geography by pointing at a globe. But the paper’s real insight, and its most telling critique, comes in the acknowledgment that exact-match verification fails. You can’t simply check whether the model said “aspirin” and the database says “aspirin.” The model might say “acetylsalicylic acid” or describe its function in a way that’s semantically correct but textually novel. This is where the researchers’ ingenuity, and our collective desperation, comes into play.
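To make the exact-match problem concrete, here is a minimal sketch in Python. The synonym table and both function names are hypothetical, invented for illustration; nothing here is taken from the paper's code.

```python
# Hypothetical synonym table for illustration only; a real pipeline would
# draw on a curated ontology rather than a hand-written dict.
SYNONYMS = {
    "acetylsalicylic acid": "aspirin",
    "asa": "aspirin",
}


def normalize(term: str) -> str:
    """Lowercase, collapse whitespace, and map known synonyms to one name."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key, key)


def exact_match(generated: str, reference: str) -> bool:
    """Naive check: the two strings must be identical."""
    return generated == reference


def normalized_match(generated: str, reference: str) -> bool:
    """Compare the two terms after synonym normalization."""
    return normalize(generated) == normalize(reference)


print(exact_match("Acetylsalicylic acid", "aspirin"))       # False
print(normalized_match("Acetylsalicylic acid", "aspirin"))  # True
```

Even the normalized version only rescues synonyms someone has already listed. A free-text description of what the drug does would still slip past it, which is the gap semantic verification is meant to close.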

Their solution is to build a system of LLMs policing each other. They use a Retrieval-Augmented Generation (RAG) pipeline, powered by open-source models, to perform semantic verification. In plain English, they’ve created a hall monitor to check the homework of the star pupil. One AI generates the medical fact, another fetches relevant scientific papers, and a third acts as the judge, synthesizing everything to declare truth or hallucination. The stated goal is “exposing hallucination,” but the subtext is far more fascinating. We’re so profoundly uncertain about the truthfulness of our most powerful AI tools that we now require other, likely weaker AI tools to serve as their auditors. It’s a circular, self-referential ecosystem of doubt.
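The three roles can be sketched as plain functions wired together. This is a toy under loud assumptions: the corpus, the keyword retriever, and the substring "judge" are all hypothetical stand-ins for the LLM and retrieval components, and none of the names come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Claim = Tuple[str, str]  # (entity, associated entity), e.g. (drug, effect)


@dataclass
class Verdict:
    claim: Claim
    evidence: List[str]
    supported: bool


def verify_claim(
    claim: Claim,
    retrieve: Callable[[Claim], List[str]],
    judge: Callable[[Claim, List[str]], bool],
) -> Verdict:
    """Fetch evidence for a claim, then let the judge rule on it."""
    evidence = retrieve(claim)
    return Verdict(claim, evidence, judge(claim, evidence))


# Toy corpus standing in for retrieved literature (illustrative only).
CORPUS = [
    "Aspirin use is associated with an increased risk of gastrointestinal bleeding.",
    "Metformin is a first-line treatment for type 2 diabetes.",
]


def toy_retrieve(claim: Claim) -> List[str]:
    """Stand-in retriever: return passages that mention the first entity."""
    entity, _ = claim
    return [p for p in CORPUS if entity.lower() in p.lower()]


def toy_judge(claim: Claim, evidence: List[str]) -> bool:
    """Stand-in judge: supported if any passage mentions the second entity."""
    _, associated = claim
    return any(associated.lower() in p.lower() for p in evidence)


# Stand-in for the generator's output: one grounded claim, one unsupported.
generated_claims: List[Claim] = [
    ("aspirin", "gastrointestinal bleeding"),
    ("aspirin", "type 2 diabetes"),
]

results = [verify_claim(c, toy_retrieve, toy_judge) for c in generated_claims]
for r in results:
    print(r.claim, "supported" if r.supported else "unsupported")
```

In a real system each stand-in would be a model call, which is exactly the point: the judge's ruling is itself a model output, not ground truth.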

This approach brilliantly, if inadvertently, highlights the core instability of the entire edifice. We’ve built these magnificent, stochastic parrots so good at sounding authoritative that we now need entire secondary industries built around doubting them. The paper frames this as a “use case,” but it’s really a proof-of-concept for a necessary new layer of computational skepticism. The RAG-powered verifier isn’t just a tool; it’s a monument to the fact that we cannot take the primary output at face value. We are, in essence, using one AI’s probabilistic guess to calibrate the trustworthiness of another’s.

And this is for biomedical associations—areas where a hallucinated drug interaction isn’t just a factual error, it’s a potential tragedy. The self-consistency strategy across different ChatGPT models is a clever band-aid. It’s the equivalent of asking the same question five times and seeing if the liar gets their story straight. It might filter out the most random, flaky fabrications, but it will utterly fail to catch the confidently wrong, logically consistent errors that are the hallmark of a capable language model. If the underlying training data has a subtle bias or a gap, all five versions of the model will likely hallucinate in the same, convincing way.
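A rough sketch of majority voting shows both what it filters and what it misses. The function name, the strict-majority threshold, and the yes/no answer format are assumptions made for illustration, not the paper's actual voting rule.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[str], threshold: float = 0.5) -> Optional[str]:
    """Return the answer held by a strict majority of models, else None."""
    if not answers:
        return None
    counts = Counter(a.strip().lower() for a in answers)
    label, votes = counts.most_common(1)[0]
    return label if votes / len(answers) > threshold else None


# One model disagrees: the vote filters out the flaky answer.
flaky = majority_vote(["yes", "no", "Yes"])

# An even split: no strict majority, so the claim is left undecided.
split = majority_vote(["yes", "no"])

# Every model repeats the same answer. If that answer is wrong,
# the vote is unanimous and the error passes straight through.
shared_error = majority_vote(["yes", "yes", "yes"])

print(flaky, split, shared_error)  # yes None yes
```

The last case is the uncomfortable one: agreement measures consistency, not correctness, so a shared blind spot looks identical to a well-supported fact.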

What this paper truly exposes is the end of the era where we can treat AI as a monolithic oracle. The future of reliable AI, especially in high-stakes fields, isn’t a single, smarter model. It’s going to be a messy, adversarial ecosystem of models checking each other—a digital immune system. We’re moving from “Does the AI know?” to “How can we design a system of AIs to figure out what it knows, and where it’s bluffing?” This research is a foundational brick in that new architecture. It’s a sobering admission: the age of passive consumption is over. We’ve entered the age of active, automated fact-checking, where the only thing that can keep an AI honest is another, skeptical AI. And we’re just getting started.

Disclaimer: The above content is generated by AI and is for reference only.

GPT RAG Medical AI
