Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA

Someone just asked ChatGPT if they can take a third dose of Tylenol because their headache is back. The model cheerfully said yes. It was wrong. A new benchmark called DoseBench, tucked away on arXiv, just handed us the most concrete, terrifying proof yet that the AI models we're increasingly trusting with our health are, at their core, glorified pattern-matchers failing at middle-school math. This isn't a philosophical problem about AGI alignment; it's a catastrophic, present-day failure of bas

Analysis

The scenario is painfully mundane. You have a bottle of over-the-counter ibuprofen. The label says a dose is 200mg, don't exceed 800mg in 24 hours, and wait 4-6 hours between doses. You took some at 8 AM, then at 1 PM. It's now 5 PM. Your back hurts again. Can you take more? For a human, this is a quick mental calendar check. For an LLM, it's a nightmare of "rolling-window reasoning." The model has to track disparate timestamps, perform subtraction across midnight boundaries, and hold multiple constraints in mind simultaneously—all while parsing the inherent ambiguity of human speech ("I took 'a couple' earlier").
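The check described above can be sketched in a few lines of ordinary code. This is a minimal illustration, not DoseBench's method and not medical advice: the label values are the ones from the scenario, the function name `can_take_dose` is hypothetical, and since the scenario only says "took some," each earlier dose is assumed to be 200 mg.

```python
from datetime import datetime, timedelta

# Label values from the scenario in the text (illustrative only).
DOSE_MG = 200
MAX_MG_PER_24H = 800
MIN_INTERVAL = timedelta(hours=4)  # lower end of the "4-6 hours" spacing
WINDOW = timedelta(hours=24)


def can_take_dose(history, now, dose_mg=DOSE_MG):
    """Return (allowed, reason) for taking `dose_mg` at `now`.

    `history` is a list of (timestamp, mg) tuples for past doses.
    """
    # Only doses inside the trailing 24-hour window count toward the limit.
    recent = [(t, mg) for t, mg in history if now - WINDOW < t <= now]
    total = sum(mg for _, mg in recent)

    if recent:
        last = max(t for t, _ in recent)
        if now - last < MIN_INTERVAL:
            return False, "too soon since last dose"

    if total + dose_mg > MAX_MG_PER_24H:
        return False, "would exceed 24-hour maximum"

    return True, "within label limits"


# Assumption: each earlier dose was 200 mg (the text does not say).
day = datetime(2024, 1, 1)
history = [(day.replace(hour=8), 200), (day.replace(hour=13), 200)]
result = can_take_dose(history, day.replace(hour=17))
print(result)
```

Using full `datetime` values rather than clock times is what makes the window "roll": a dose taken at 8 PM yesterday still counts at 2 AM today without any special handling for midnight.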

DoseBench’s results are a scandal presented as a dataset. The models don't just fail occasionally; they fail systematically and, most damningly, confidently. They will assert a safe dosage when a 24-hour limit has been clearly breached, or counsel waiting when it's perfectly fine. The metrics on "consistency" are particularly galling. Run the exact same dosing scenario twice, and the model might give opposite advice. This isn't a bug in a novel feature; it's a flaw in the fundamental architecture. These systems have no persistent, internal sense of time. They are processing text strings that mention time, not building a coherent model of a patient's 24-hour intake history. The "thinking" is a probabilistic hallucination of a solution that looks right linguistically.
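One simple way to put a number on "run it twice, get opposite advice" is the share of repeated runs that agree with the most common verdict. The sketch below is an assumption about how such a score could be computed, not DoseBench's actual consistency metric; the function name and the sample verdicts are made up for illustration.

```python
from collections import Counter


def consistency(answers):
    """Fraction of repeated runs that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


# Hypothetical verdicts from five runs of one scenario (made-up data).
runs = ["safe", "unsafe", "safe", "safe", "unsafe"]
score = consistency(runs)
print(score)
```

A score of 1.0 means every run gave the same verdict. Note that a model can be perfectly consistent and still wrong, so a score like this only makes sense alongside a correctness check.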

Why does this matter so much? Because the tech industry's favorite narrative is that LLMs are rapidly becoming competent general-purpose reasoning engines. DoseBench is a brutal counter-narrative. It isolates a narrow, well-defined, high-stakes task—temporal logic with constraint following—and shows the emperor has no clothes. The models are, in essence, doing very sophisticated autocomplete for medical advice. They're retrieving patterns from forums and textbooks but lack the mechanistic understanding to calculate or verify the safety of their own outputs. The finding that high confidence scores often correlate with incorrect answers is the killer. It means the models' own uncertainty metrics are unreliable exactly when they need to be most trustworthy. You can't build safe systems on a foundation where the model doesn't know what it doesn't know.

This exposes a deeper arrogance in the "move fast and integrate LLMs everywhere" ethos. We are deploying these tools as consultants in a domain governed by strict, non-negotiable physical and chemical rules. The dose-response curve for acetaminophen isn't a vibe; it's a steep drop-off into liver failure. A system that can't reliably track rolling windows is fundamentally unsuited for this task, regardless of how well it can explain the mechanism of action for ibuprofen in fluent prose. DoseBench proves the "last mile" problem isn't about polish—it's about core competency. We're trying to build self-driving cars with engineers who can't reliably pass a driver's test.

The real-world implication is a minefield of liability and harm. Imagine this baked into a pharmacy kiosk, a hospital triage app, or a feature on a wearable. The company behind it would be deploying a known, demonstrably faulty safety mechanism. DoseBench provides the smoking gun—the evidence that the failure mode is predictable and inherent. It’s not about needing more training data on medicine; it’s about the model's inability to perform the specific type of sequential, mathematical reasoning the task demands. Fine-tuning might paper over the cracks on some scenarios, but it won't install a clock inside the transformer.

What this study really does is reorient the AI safety conversation away from distant existential risk and toward present, measurable, and mundane peril. The most dangerous AI isn't a superintelligence plotting world domination; it's a helpful-sounding chatbot that confidently tells you it's safe to double your dose of a drug that can kill you if misused. DoseBench is a gift to regulators and a necessary kick in the pants for developers. It says, unequivocally, that fluency is not competency, and a confident tone is not a safety feature. Until these models can reliably pass a test this basic, their role in health advice should be restricted to "consult a doctor"—a response they might even get right, if they're consistent enough.

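The moment comes when you need to work out exactly how many ibuprofen tablets you have taken in the past 12 hours, and all you have on hand are a few large language models that claim to know everything. A recent study (arXiv:2606.04262) lands like a bucket of cold water: they cannot handle even this most basic "don't take medicine carelessly" question.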

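The paper does not build a grand leaderboard with tens of thousands of items covering rare diseases. It is practical, focusing on the most common and most accident-prone everyday scenario: an adult wants to take an over-the-counter painkiller (acetaminophen or ibuprofen) but is not sure whether the dose is safe. The model has to do a few simple but critical things: remember when the user last took the drug, calculate how much was actually taken within a rolling 24-hour window, hold firmly to the maximum daily dose on the product label, and handle statements from the user that may be vague or incomplete. The researchers built a dosing benchmark (DOSEBENCH) of 81 such scenarios, then had four mainstream large models answer repeatedly, collecting more than 1,600 responses.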

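The result? Disastrously blunt. The mistake the models make most often is also the most dangerous one: they cannot get the rolling window right. As if caught in a time-travel paradox, they fail to fold "three hours ago" and "twelve hours ago" into the calculation of the current dose. More unsettling, the models often show very high confidence while getting it wrong. They give fluent, calm answers, sometimes with detailed explanations, that are wrong at the core. This kind of stable error is far more frightening than gibberish, because it is the most deceptive: it can lead a user who already had doubts to drop their guard entirely and take unnecessary risks with their liver or kidneys.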

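This study works like a sharp scalpel, cutting into the vainest patch of skin in today's medical AI field: we keep chasing models that can "read" ever more complex medical images and "understand" ever more abstruse pathology papers, yet turn a blind eye to whether they can handle a kindergarten-level safety constraint such as "don't take more than 8 tablets in 24 hours." The value of DOSEBENCH is not that it discovers a new principle, but that it uses an extremely simplified yet life-and-death micro-scenario to expose fundamental flaws in large models' basic constraint following and temporal reasoning. The problem is not that the models lack knowledge. It is that their underlying logic may never have treated safety rules as hard, uncrossable boundaries, but rather as text open to flexible interpretation.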

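This leads to a sharper criticism: has the whole medical AI race, especially consumer-facing health Q&A, gone off course? We are intoxicated by the grand narrative of AI playing "general practitioner," rushing to deploy and commercialize, while skipping the most critical step: verifying that it is reliably qualified to handle the most basic, most frequent safety scenarios. The "incomplete medication history" case the paper points out is especially dangerous, because in the real world people's recollection of how much they took is always incomplete. A truly safe system, faced with ambiguous information, should by default decline to answer and strongly recommend consulting a doctor or pharmacist, rather than confidently give a specific number. But existing models seem to have been trained to be so "helpful" that they have lost the ability to judge when to say "I don't know, and this is dangerous."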
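The "decline by default" behavior argued for here can be written as a guard that runs before any arithmetic. The sketch below is a hypothetical illustration, not something taken from the paper: the function name `advise`, the return strings, and the example limit are assumptions, and for brevity it checks only the 24-hour total, not the spacing between doses.

```python
from datetime import datetime, timedelta

REFER = "refer to a doctor or pharmacist"


def advise(history, now, dose_mg, max_mg_per_24h):
    """Abstain unless every dose in `history` has a known time and amount.

    `history` is a list of (timestamp_or_None, mg_or_None) tuples.
    """
    # Any missing field means the 24-hour total cannot be verified.
    if any(t is None or mg is None for t, mg in history):
        return REFER

    window_start = now - timedelta(hours=24)
    total = sum(mg for t, mg in history if window_start < t <= now)
    if total + dose_mg > max_mg_per_24h:
        return "do not take another dose"
    return "within the stated 24-hour limit"


# "I took a couple earlier": the time is known, the amount is not.
now = datetime(2024, 1, 1, 17)
vague = [(datetime(2024, 1, 1, 8), None)]
print(advise(vague, now, dose_mg=200, max_mg_per_24h=800))
```

The design point is the ordering: the completeness check comes first, so a vague history can never reach the branch that returns a reassuring answer.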

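So this paper is less an evaluation of models than an alarm bell for the industry. Before packaging large models as pocket health advisors, first pass a test like DOSEBENCH, which is simple to the point of crude yet hits the critical point. If a model cannot even work out the timing of a few ibuprofen tablets, why should we trust it to manage chronic-disease medication or interpret complex test reports? The holy grail of medical AI may lie not in how smart it is but in how "timid" it is: knowing the limits of its ability and keeping a basic reverence for life. For now, we are still far from that goal.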

Disclaimer: The above content is generated by AI and is for reference only.

Medical AI, LLM Evaluation
