
VeriGate: Verifier-Gated Step-Level Supervision for GRPO


Hot: 60 | Quality: 80 | Impact: 65

Analysis

Group Relative Policy Optimization, or GRPO, is the engine under the hood of many of today's most capable reasoning models. It's a clever recipe: sample a group of candidate answers to a problem, let a verifier (a checker that scores each answer for correctness) grade them, then use each sample's reward relative to the rest of the group to steer the policy's update. But this approach has a critical, well-known flaw. When the verifier is binary (right or wrong) and every sampled answer to a tough problem gets the same score (say, all wrong), the "relative" part of the optimization vanishes: every sample's advantage is zero. The learning signal flatlines. The model learns nothing from that problem.
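To make the flatline concrete, here is a minimal sketch of the group-relative advantage as it is commonly written (reward minus group mean, divided by group standard deviation). The function name and the `eps` stabilizer are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each sample's reward against its own group.

    `eps` is an assumed small constant that avoids division by zero
    when all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed group: correct samples get positive advantage, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Degenerate group: every reward equals the mean, so every advantage is 0.0
# and the policy gradient contribution from this problem disappears.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the all-wrong group the numerator is zero for every sample, which is exactly the "learning signal flatlines" case described above.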

This isn't just an academic curiosity; it's a fundamental bottleneck in building models that can truly reason through complex, multi-step problems. Outcome-only rewards are blunt instruments. They tell you the final destination was wrong, but offer no clues about which step in the journey first sent you off the cliff. This limitation forces models into a costly, high-variance exploration mode, often leading to reward hacking, where the model finds degenerate shortcuts that satisfy the verifier's narrow logic without actually solving the problem in a robust, human-like way.

Enter VeriGate, a new framework that addresses this with elegant, if somewhat intricate, pragmatism. The core insight isn't to discard the verifier, but to make it a conditional gatekeeper. The system keeps the verifier firmly in command whenever its outcome-based scores provide a meaningful gradient, i.e., when some answers in the group are right and some are wrong. Only when the verifier's signal is degenerate (all right or all wrong) does VeriGate cautiously introduce process-level supervision from a Process Reward Model (PRM), a model that scores individual reasoning steps rather than only the final answer.
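The gating rule itself can be sketched in a few lines. This is only an illustration of the condition described above; the function names and the string labels are hypothetical, and the paper's actual gate may be more involved.

```python
def verifier_is_degenerate(outcome_rewards):
    """True when every sample in the group received the same verifier score
    (all right or all wrong), so group-relative advantages would all be zero."""
    return len(set(outcome_rewards)) <= 1

def choose_signal(outcome_rewards):
    """Pick the supervision source for one group of sampled answers.

    Returns "outcome" when the verifier separates the samples, and
    "process" (PRM step scores) only when the verifier signal is degenerate.
    """
    return "process" if verifier_is_degenerate(outcome_rewards) else "outcome"

source_mixed = choose_signal([1, 0, 1, 0])      # verifier stays in command
source_all_wrong = choose_signal([0, 0, 0, 0])  # fall back to the PRM
source_all_right = choose_signal([1, 1, 1, 1])  # also degenerate
```

The point of the design is that the PRM never overrides a verifier that is already informative; it only fills in where the verifier has nothing to say.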

This is where the real engineering nuance lies. Most attempts to graft step-level feedback onto GRPO have been clumsy, often by collapsing all the step scores from a PRM into a single number for the whole trajectory. This discards crucial sequence information. VeriGate does something smarter. It converts PRM step scores into "future-cumulated" rewards, essentially asking: "If I were at this specific step, what would the expected final outcome be?" This creates a credit assignment map across the reasoning chain, pointing not just to the final error, but to the last correctable step before things went sideways.
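The article does not give the exact formula for the "future-cumulated" rewards. One plausible reading, shown here purely as an assumption, is a reward-to-go: the value at each step is the (optionally discounted) sum of PRM scores from that step to the end. The function name, the `gamma` parameter, and the toy scores are all illustrative.

```python
def future_cumulated(step_scores, gamma=1.0):
    """Reward-to-go over PRM step scores (an assumed interpretation).

    out[t] = step_scores[t] + gamma * out[t + 1], computed back to front.
    """
    out = [0.0] * len(step_scores)
    running = 0.0
    for t in range(len(step_scores) - 1, -1, -1):
        running = step_scores[t] + gamma * running
        out[t] = running
    return out

# Toy PRM scores: two good steps followed by two bad ones.
scores = [1.0, 1.0, -1.0, -1.0]

collapsed = sum(scores)              # single trajectory-level number: 0.0
per_step = future_cumulated(scores)  # [0.0, -1.0, -2.0, -1.0]
```

Collapsing the scores yields one number that cannot distinguish this trajectory from any other ordering with the same sum, whereas the per-step values keep the position of the bad steps visible.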

Finally, it normalizes these detailed, token-level advantages across the group of samples. This restores a learning gradient even in the degenerate cases and, crucially, is more resistant to reward hacking than simply optimizing aggregated PRM scores. The model isn't just chasing the PRM's approval at each step; it's learning a more nuanced policy about which steps are most indicative of eventual success.
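A hedged sketch of the normalization step: the article says only that the fine-grained advantages are normalized across the group, so the pooling scheme below (one mean and standard deviation over every step of every sample) is an assumption, as are the names and the `eps` constant. Broadcasting each step's value to the tokens of that step is left out for brevity.

```python
from statistics import mean, pstdev

def normalize_across_group(group_step_values, eps=1e-6):
    """Standardize step-level values using statistics pooled over the group.

    `group_step_values` is a list of trajectories, each a list of per-step
    values; trajectories may have different lengths.
    """
    pooled = [v for traj in group_step_values for v in traj]
    mu, sigma = mean(pooled), pstdev(pooled)
    return [[(v - mu) / (sigma + eps) for v in traj]
            for traj in group_step_values]

# Three sampled trajectories that a binary verifier scored identically,
# but whose (toy) step-level values still differ.
group = [[0.0, -1.0, -2.0, -1.0], [2.0, 1.0], [-1.0, -1.0, -1.0]]
advantages = normalize_across_group(group)
```

Because the step-level values differ even when the outcome rewards do not, the normalized advantages are non-zero and the group contributes a gradient again.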

The results on the MATH benchmark are striking and should make the field take notice. For a 1.5B parameter model, average accuracy jumps by roughly 20%. For a larger 7B model, it's about 12%. This isn't a marginal improvement; it's a leap. The paper also shows substantial reductions in zero-gradient stalls and reward hacking. VeriGate doesn't just make the model more accurate; it makes the training process more stable and the resulting reasoning more robust. It's a direct answer to the "all-or-nothing" critique of outcome-based RL.

This feels less like a flashy new paradigm and more like a critical, long-overdue systems upgrade. The research community has been acutely aware of the sparse reward problem for years. VeriGate’s contribution is a concrete, working patch that acknowledges the complementary strengths of outcome and process supervision. It suggests the future isn’t about replacing outcome verifiers, but about intelligently layering more granular feedback on top of them when the coarse signal fails.

One has to wonder, however, about the hidden tax. The framework requires maintaining and querying both a verifier and a process reward model during training. This significantly increases computational overhead and complexity. The 20% accuracy boost must be weighed against that cost. Furthermore, the efficacy is tied to the quality of the PRM. A flawed PRM could introduce its own systematic biases, poisoning the step-level signal it's meant to provide.

Ultimately, VeriGate is a potent reminder that progress in AI often comes not from a single brilliant idea, but from the meticulous plumbing of existing ones. It treats reasoning not as a monolithic black box, but as a chain of decisions where the pivotal moment might be several steps back from the final, obvious failure. By teaching models to locate that moment, we might finally be building systems that don't just guess their way to an answer, but actually understand the path.


Disclaimer: The above content is generated by AI and is for reference only.
