Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA
Analysis
Someone just asked ChatGPT if they can take a third dose of Tylenol because their headache is back. The model cheerfully said yes. It was wrong. A new benchmark called DoseBench, tucked away on arXiv, just handed us the most concrete, terrifying proof yet that the AI models we're increasingly trusting with our health are, at their core, glorified pattern-matchers failing at middle-school math. This isn't a philosophical problem about AGI alignment; it's a catastrophic, present-day failure of basic numerical reasoning wrapped in a confident, helpful tone.
The scenario is painfully mundane. You have a bottle of over-the-counter ibuprofen. The label says a dose is 200 mg, don't exceed 800 mg in 24 hours, and wait 4-6 hours between doses. You took one dose at 8 AM and another at 1 PM. It's now 5 PM. Your back hurts again. Can you take more? For a human, this is a quick mental clock check: 400 mg so far, four hours since the last dose, so one more is within the label's limits. For an LLM, it's a nightmare of "rolling-window reasoning." The model has to track disparate timestamps, perform subtraction across midnight boundaries, and hold multiple constraints in mind simultaneously, all while parsing the inherent ambiguity of human speech ("I took 'a couple' earlier").
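To make the rolling-window check concrete, here is a minimal Python sketch of the arithmetic a correct answer has to get right. It is an illustration, not DoseBench's code: the function name `can_take_dose` is hypothetical, the limits are the ones from the example label above, and it assumes each earlier dose was a single 200 mg tablet.

```python
from datetime import datetime, timedelta

# Limits copied from the article's example label (illustrative only).
DOSE_MG = 200
MAX_MG_PER_24H = 800
MIN_GAP = timedelta(hours=4)
WINDOW = timedelta(hours=24)


def can_take_dose(history, now, dose_mg=DOSE_MG):
    """Return (allowed, reason) for taking dose_mg at time `now`.

    history is a list of (timestamp, mg) tuples for earlier doses.
    """
    # Only doses inside the trailing 24 hours count toward the cap.
    recent = [(t, mg) for t, mg in history if now - WINDOW < t <= now]
    total = sum(mg for _, mg in recent)
    if total + dose_mg > MAX_MG_PER_24H:
        return False, f"24-hour total would reach {total + dose_mg} mg"
    if recent:
        last = max(t for t, _ in recent)
        if now - last < MIN_GAP:
            return False, "less than 4 hours since the last dose"
    return True, f"{total + dose_mg} mg in the trailing 24 hours after this dose"


# The scenario from the text: doses at 8 AM and 1 PM, question asked at 5 PM.
history = [
    (datetime(2024, 1, 1, 8, 0), 200),
    (datetime(2024, 1, 1, 13, 0), 200),
]
print(can_take_dose(history, datetime(2024, 1, 1, 17, 0)))
```

Using full `datetime` values rather than clock times is the design choice that matters: the window slides with `now`, so a question asked at 1:30 AM still counts the doses from the previous afternoon.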
DoseBench’s results are a scandal presented as a dataset. The models don't just fail occasionally; they fail systematically and, most damningly, confidently. They will assert a safe dosage when a 24-hour limit has been clearly breached, or counsel waiting when it's perfectly fine. The metrics on "consistency" are particularly galling. Run the exact same dosing scenario twice, and the model might give opposite advice. This isn't a bug in a novel feature; it's a flaw in the fundamental architecture. These systems have no persistent, internal sense of time. They are processing text strings that mention time, not building a coherent model of a patient's 24-hour intake history. The "thinking" is a probabilistic hallucination of a solution that looks right linguistically.
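One simple way to put a number on that kind of flip-flopping is to ask the same question several times and count how often the answers agree. The sketch below is an assumption about how such a check could look, not the benchmark's actual metric; `consistency_report`, `fully_consistent_rate`, and the sample answers are all hypothetical.

```python
from collections import Counter


def consistency_report(runs):
    """runs maps a scenario id to the answers from repeated identical prompts."""
    report = {}
    for scenario, answers in runs.items():
        counts = Counter(answers)
        top_answer, top_count = counts.most_common(1)[0]
        report[scenario] = {
            "majority": top_answer,
            # Share of runs that gave the most common answer.
            "agreement": top_count / len(answers),
        }
    return report


def fully_consistent_rate(runs):
    """Fraction of scenarios where every repeated run gave the same answer."""
    stable = sum(1 for answers in runs.values() if len(set(answers)) == 1)
    return stable / len(runs)


# Made-up answers, for illustration only.
runs = {
    "ibuprofen_5pm": ["yes", "yes", "yes", "yes"],
    "acetaminophen_third_dose": ["yes", "no", "no", "yes"],
}
print(consistency_report(runs))
print(fully_consistent_rate(runs))
```

Note that agreement says nothing about correctness. A model can be perfectly consistent and consistently wrong, so a check like this only makes sense alongside a ground-truth comparison.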
Why does this matter so much? Because the tech industry's favorite narrative is that LLMs are rapidly becoming competent general-purpose reasoning engines. DoseBench is a brutal counter-narrative. It isolates a narrow, well-defined, high-stakes task—temporal logic with constraint following—and shows the emperor has no clothes. The models are, in essence, doing very sophisticated autocomplete for medical advice. They're retrieving patterns from forums and textbooks but lack the mechanistic understanding to calculate or verify the safety of their own outputs. The finding that high confidence scores often correlate with incorrect answers is the killer. It means the models' own uncertainty metrics are unreliable exactly when they need to be most trustworthy. You can't build safe systems on a foundation where the model doesn't know what it doesn't know.
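The confidence claim is easy to check in principle: group answers by how confident the model said it was, then compare accuracy across the groups. The following is a rough sketch under stated assumptions; the function name, the bucket edges, and the sample records are hypothetical and do not come from the paper.

```python
def accuracy_by_confidence(records, edges=(0.5, 0.8)):
    """Group (confidence, is_correct) pairs into low/medium/high buckets.

    confidence is a float in [0, 1]; edges are the assumed bucket boundaries.
    Returns the accuracy per bucket, or None for an empty bucket.
    """
    labels = ["low", "medium", "high"]
    buckets = {label: [] for label in labels}
    for confidence, is_correct in records:
        # Count how many boundaries the confidence clears to pick a bucket.
        index = sum(confidence >= edge for edge in edges)
        buckets[labels[index]].append(is_correct)
    return {
        label: (sum(values) / len(values) if values else None)
        for label, values in buckets.items()
    }


# Made-up records, for illustration only.
records = [(0.95, False), (0.90, False), (0.85, True), (0.30, True)]
print(accuracy_by_confidence(records))
```

In a well-calibrated system the "high" bucket should have the best accuracy. If it does not, the confidence score cannot be used as a safety gate.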
This exposes a deeper arrogance in the "move fast and integrate LLMs everywhere" ethos. We are deploying these tools as consultants in a domain governed by strict, non-negotiable physical and chemical rules. The dose-response curve for acetaminophen isn't a vibe; it's a steep drop-off into liver failure. A system that can't reliably track rolling windows is fundamentally unsuited for this task, regardless of how well it can explain the mechanism of action for ibuprofen in fluent prose. DoseBench proves the "last mile" problem isn't about polish—it's about core competency. We're trying to build self-driving cars with engineers who can't reliably pass a driver's test.
The real-world implication is a minefield of liability and harm. Imagine this baked into a pharmacy kiosk, a hospital triage app, or a feature on a wearable. The company behind it would be deploying a known, demonstrably faulty safety mechanism. DoseBench provides the smoking gun—the evidence that the failure mode is predictable and inherent. It’s not about needing more training data on medicine; it’s about the model's inability to perform the specific type of sequential, mathematical reasoning the task demands. Fine-tuning might paper over the cracks on some scenarios, but it won't install a clock inside the transformer.
What this study really does is reorient the AI safety conversation away from distant existential risk and toward present, measurable, and mundane peril. The most dangerous AI isn't a superintelligence plotting world domination; it's a helpful-sounding chatbot that confidently tells you it's safe to double your dose of a drug that can kill you if misused. DoseBench is a gift to regulators and a necessary kick in the pants for developers. It says, unequivocally, that fluency is not competency, and a confident tone is not a safety feature. Until these models can reliably pass a test this basic, their role in health advice should be restricted to "consult a doctor"—a response they might even get right, if they're consistent enough.