How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

This research empirically demonstrates that LLM agents with structured tool-calling interfaces exhibit significant behavioral inconsistency when executing identical tasks, challenging a fundamental assumption of deterministic reliability in production AI systems.


Analysis

The core tension this paper exposes is one of the most quietly consequential in applied AI: the clash between the engineering expectation of deterministic systems and the probabilistic reality of large language models. In production environments, we build intricate workflows in which one agent's action triggers another, data is fetched, APIs are called, and side effects occur, all predicated on the assumption that "if X, then Y" holds true. This study challenges that assumption for a specific, and increasingly common, class of agents: those that do not just reason and search, but call discrete, typed tools with real-world consequences. It is a vital contribution because it moves the consistency debate beyond open-ended reasoning and search-based "ReAct" agents and into the structured, consequential interfaces that are actually being deployed to manage workflows, process data, and interact with enterprise software.

What makes the finding particularly alarming is the context: structured tool-calling with typed parameters was supposed to be the solution to the chaos of free-text actions. The argument was that constraining the LLM's output to a well-defined schema (for example, "call function get_customer_data(customer_id: string) with argument C123") adds a layer of predictability and debuggability. This research suggests that layer may be thinner than we hoped. The inconsistency is not just about the LLM occasionally generating malformed JSON; it is about the model arriving at a different strategic decision in its tool selection or sequencing under identical conditions. This points to a deep-seated issue: the "reasoning" that precedes the tool call, which is shaped by the model's weights and influenced by its sampling temperature and internal stochastic processes, remains fundamentally non-deterministic. The structured output is merely a veneer over that fluid, probabilistic process.
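To make the distinction between schema validity and behavioral consistency concrete, here is a minimal sketch. It assumes a hypothetical tool registry and two made-up runs of the same task; only `get_customer_data` comes from the example above, while `search_customers`, `ToolCall`, and `make_call` are illustrative names and not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical tool registry: tool name -> {parameter name: expected type}.
TOOL_SCHEMAS = {
    "get_customer_data": {"customer_id": str},
    "search_customers": {"query": str},  # assumed second tool, for illustration
}


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # sorted (key, value) pairs, so calls compare by value


def make_call(name, **kwargs):
    return ToolCall(name, tuple(sorted(kwargs.items())))


def is_schema_valid(call):
    """True if the tool exists and every argument has the declared type."""
    schema = TOOL_SCHEMAS.get(call.name)
    if schema is None:
        return False
    args = dict(call.args)
    if set(args) != set(schema):
        return False
    return all(isinstance(args[key], typ) for key, typ in schema.items())


# Two made-up runs of the same task: both are well-formed, yet they differ.
run_a = [make_call("get_customer_data", customer_id="C123")]
run_b = [
    make_call("search_customers", query="C123"),
    make_call("get_customer_data", customer_id="C123"),
]

both_valid = all(is_schema_valid(call) for call in run_a + run_b)
same_trajectory = run_a == run_b
print(both_valid, same_trajectory)  # True False
```

Schema validation catches a malformed call, such as an integer `customer_id`, but it says nothing about whether two runs chose the same tools in the same order.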

The implications cascade through the entire stack of AI system design. For developers and MLOps, this inconsistency is a debugging nightmare. How do you replicate an error, audit a decision path, or ensure compliance if the agent that caused a failure might not reproduce it? It undermines logging, testing, and monitoring frameworks built on the premise of causal reproducibility. For businesses, it introduces a novel category of operational risk. An agent might correctly process an insurance claim 99 times, then on the 100th identical input, choose a different tool or different sequence leading to an erroneous denial, all without any external system change. This isn't a traditional software bug; it's a failure of probabilistic reliability, and it erodes the very trust needed to delegate consequential tasks to AI.

Furthermore, this work challenges the notion of "testing" AI agents in a meaningful way. Traditional software testing relies on deterministic pass/fail criteria. How do you write a test suite for a system that might legitimately choose three different valid tool sequences to achieve the same goal? It forces a reevaluation of evaluation itself. Perhaps success metrics must shift from bit-for-bit consistency to goal-oriented robustness, measuring whether the final outcome is correct regardless of the path taken, which is a much harder property to specify and guarantee.
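The difference between path-level and outcome-level evaluation can be sketched as two simple metrics over repeated runs of one task. The metric definitions, tool names, and run data below are assumptions made for illustration; they are not the paper's metrics or results.

```python
from collections import Counter


def path_consistency(trajectories):
    """Fraction of runs that follow the single most common tool sequence."""
    counts = Counter(tuple(t) for t in trajectories)
    return counts.most_common(1)[0][1] / len(trajectories)


def outcome_consistency(outcomes, expected):
    """Fraction of runs whose final outcome equals the expected one."""
    return sum(o == expected for o in outcomes) / len(outcomes)


# Made-up tool sequences and outcomes from four runs on an identical input.
trajectories = [
    ["lookup", "validate", "approve"],
    ["lookup", "validate", "approve"],
    ["validate", "lookup", "approve"],  # different order, same outcome
    ["lookup", "deny"],                 # different path and outcome
]
outcomes = ["approved", "approved", "approved", "denied"]

print(path_consistency(trajectories))             # 0.5
print(outcome_consistency(outcomes, "approved"))  # 0.75
```

In this toy data, a test that demands identical paths would fail half the runs, while an outcome-based test would fail only the run that ends in an erroneous denial.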

Ultimately, the paper is a sobering call for a more nuanced paradigm. The path forward isn't abandoning tool-calling agents, but designing for their inherent inconsistency. This might involve building ensemble-like systems where multiple agent passes are run and consensus is required for critical actions, or developing sophisticated "consistency wrappers" that monitor and correct for drift. It also underscores the need for a new class of development tools focused not on deterministic debugging, but on probabilistic profiling and reliability engineering. We are building systems on a probabilistic substrate, and this research is a stark reminder that our engineering frameworks must evolve to explicitly account for that foundation, rather than pretend it doesn't exist. The reliability of our future AI systems may depend on it.
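One way to read the ensemble idea is as a consensus gate in front of critical actions. The following is a minimal sketch under stated assumptions: `run_agent` is a hypothetical callable that returns a hashable proposed action, and the run count and agreement threshold are arbitrary illustrative values, not recommendations from the paper.

```python
from collections import Counter


def consensus_action(run_agent, task, n_runs=5, min_agreement=0.8):
    """Run the agent several times and return the majority action only if
    enough runs agree; otherwise return None so a human or fallback decides."""
    proposals = [run_agent(task) for _ in range(n_runs)]
    action, votes = Counter(proposals).most_common(1)[0]
    if votes / n_runs >= min_agreement:
        return action
    return None


# Stub agents that replay fixed, made-up proposals in place of real LLM calls.
def make_stub(proposals):
    remaining = iter(proposals)
    return lambda task: next(remaining)


mostly_agree = make_stub(["approve", "approve", "approve", "approve", "deny"])
split = make_stub(["approve", "deny", "approve", "deny", "approve"])

print(consensus_action(mostly_agree, "claim-001"))  # approve
print(consensus_action(split, "claim-001"))         # None
```

The trade-off is cost and latency: each gated action needs several agent runs, so a gate like this is probably only worth it for actions with real side effects.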


Disclaimer: The above content is generated by AI and is for reference only.

