Deep Analysis · 11 min read · 1h ago

GPT-5.6 vs Claude Opus 4.8 vs MiniMax M3: In the Three-Way Battle for AI Supremacy, Who Is Leading?

June 2026 feels like the opening act of a three-kingdom saga.

Anthropic dropped Claude Opus 4.8 on May 28, keeping the same price but adding an "honesty mode." MiniMax open-sourced M3 on June 1, claiming coding performance that beats GPT-5.5. OpenAI's GPT-5.6 is still lurking in Codex backend logs under the codename iris-alpha, its 1.5-million-token context window already keeping plenty of people up at night.

Three flagship models landing within two weeks is no coincidence. The AI model race has shifted from "who ships first" to "who ships right." This article cuts through the noise and breaks down all three across architecture, benchmarks, pricing, and real-world use.


Claude Opus 4.8: Honesty as a Feature

Anthropic's Opus 4.8, released May 28, looks like a modest upgrade on the surface. Anthropic itself called it a "modest but tangible improvement" — surprisingly humble for a company valued at $965 billion.

But the data tells a different story.

By the Numbers

Opus 4.8 scores 88.6% on SWE-Bench Verified, up from 87.6% on 4.7. That single-point gain might seem small, but SWE-Bench Verified is approaching saturation, so every percentage point gets harder to earn. The real gap shows on SWE-Bench Pro: 69.2% for Opus 4.8 versus 58.6% for GPT-5.5 and 54.2% for Gemini 3.1 Pro. That's a lead of 10.6 points over GPT-5.5 and 15 points over Gemini 3.1 Pro, far beyond "marginal."

On Terminal-Bench 2.1, Opus 4.8 hits 74.6%. GPT-5.5 with Codex CLI reaches 83.4%, but that's a scaffold advantage — on the standardized Terminus-2 harness, Opus 4.8's 74.6% sits close to GPT-5.5.

GPQA Diamond: 93.6%. GDPval-AA: 1890 Elo, well ahead of GPT-5.5's 1769.

What Actually Changed

Opus 4.8's biggest win isn't on any leaderboard. It's an internal Anthropic metric called "honesty" — the model is 4x less likely than 4.7 to let a code flaw pass without flagging it.

In engineering terms, that's enormous. A model that says "I'm not sure about this" is safer than one that confidently delivers a ticking time bomb.

The Fast mode is another practical improvement: 2.5x faster than standard at 2x the price ($10/$50 per 1M tokens), but 3x cheaper than the previous Fast mode. For daily development, Fast mode covers 80% of use cases.

Claude Code also gets a "dynamic workflows" research preview — the model decomposes a hard problem into hundreds of parallel sub-agents, each approaching from a different angle, cross-validating results before reporting back. This is the right direction for agentic AI: the bottleneck isn't individual model capability, it's multi-agent coordination.
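To make the fan-out-then-cross-validate idea concrete, here is a minimal sketch in Python. It is not Anthropic's implementation: `run_sub_agent` is a hypothetical stand-in that returns canned answers instead of calling a model, and majority voting is just one assumed way to "cross-validate" results.

```python
import asyncio
from collections import Counter


async def run_sub_agent(agent_id: int, task: str) -> str:
    """Hypothetical sub-agent; a real one would call a model with its own prompt."""
    await asyncio.sleep(0)  # yield to the event loop, standing in for network I/O
    # Toy behaviour: every 7th agent disagrees with the rest.
    return "answer-B" if agent_id % 7 == 0 else "answer-A"


async def solve_with_sub_agents(task: str, n_agents: int = 100) -> tuple[str, float]:
    """Fan the task out to n_agents concurrently, then keep the majority answer."""
    results = await asyncio.gather(
        *(run_sub_agent(i, task) for i in range(n_agents))
    )
    answer, votes = Counter(results).most_common(1)[0]
    return answer, votes / n_agents  # winning answer and its agreement rate


answer, agreement = asyncio.run(solve_with_sub_agents("find the bug", n_agents=100))
print(answer, agreement)
```

The agreement rate is the useful part: a low value is a cheap signal that the sub-agents diverged and the result deserves a second look before it is reported back.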

Bottom Line

Opus 4.8 isn't a parameter-stacking victory — it's an engineering refinement win. It keeps the 1M context window and $5/$25 pricing while pushing forward on every dimension. Its main weakness is cost — $25 per million output tokens is not cheap for individual developers or small teams.


MiniMax M3: The Open-Source Surprise

If Opus 4.8 is a steady march forward, MiniMax M3 is a flanking maneuver.

Released June 1, M3's headline feature is that it's open-weight. But not the kind of open that trails behind closed models — M3 beats GPT-5.5 and Gemini head-to-head on several hard benchmarks.

Real Architectural Innovation

M3 ditches the mainstream Dense Transformer or MoE route for a custom sparse attention architecture called MSA (MiniMax Sparse Attention). This isn't a tweak; it's a fundamental rethinking of the attention mechanism.

Standard Full Attention scales quadratically with input length: at 1M tokens, every token must compute attention against 1M others, which is brutally expensive. MSA's approach is to pre-filter which token blocks are relevant and only compute full attention on those. Combined with GPU-level memory optimization — switching from per-query loading to per-block batch processing — I/O overhead drops dramatically.
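The pre-filtering idea is easier to see in code. The NumPy sketch below is a generic block-sparse attention toy, not MiniMax's MSA: scoring each block by its mean key, the `block_size` and `top_blocks` parameters, and the single-head non-causal setup are all assumptions made for illustration, and the GPU-level batching is not modeled at all.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def full_attention(q, k, v):
    """Dense attention: every query scores every key."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def block_sparse_attention(q, k, v, block_size, top_blocks):
    """Pick the top_blocks most relevant key blocks per query, attend only there."""
    n_keys, d = k.shape
    assert n_keys % block_size == 0, "sketch assumes keys split evenly into blocks"
    n_blocks = n_keys // block_size
    k_blocks = k.reshape(n_blocks, block_size, d)
    v_blocks = v.reshape(n_blocks, block_size, v.shape[-1])

    # Coarse pass: one summary vector per block (mean key is an assumption).
    summaries = k_blocks.mean(axis=1)
    coarse_scores = q @ summaries.T                      # (n_queries, n_blocks)
    chosen = np.argsort(-coarse_scores, axis=1)[:, :top_blocks]

    # Fine pass: full attention restricted to the chosen blocks.
    out = np.empty((q.shape[0], v.shape[-1]))
    for i, block_ids in enumerate(chosen):
        keys = k_blocks[block_ids].reshape(-1, d)
        values = v_blocks[block_ids].reshape(-1, v.shape[-1])
        out[i] = full_attention(q[i : i + 1], keys, values)[0]
    return out


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 16))
k = rng.normal(size=(64, 16))
v = rng.normal(size=(64, 16))
sparse_out = block_sparse_attention(q, k, v, block_size=8, top_blocks=2)
dense_out = full_attention(q, k, v)
```

In this toy setup each query scores 8 block summaries plus 16 selected keys instead of all 64 keys. The saving grows with context length because the fine pass stays fixed at `top_blocks * block_size` keys no matter how long the input gets.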

The results are striking: at 1M context, M3's per-token compute drops to 1/20th of the previous generation, prefill speeds up 9x+, and decoding speeds up 15x+. These are production numbers, not lab experiments.

Benchmark Performance

M3 scores 59.0% on SWE-Bench Pro, edging past GPT-5.5's 58.6% (the margin is tiny, but the symbolism is huge — the first time an open-weight model beats a closed flagship on a hard coding benchmark). Terminal-Bench 2.1: 66.0%. BrowseComp: 83.5%, surpassing Opus 4.7's 79.3%.

The most impressive demo: MiniMax asked M3 to optimize an FP8 matrix multiplication kernel on NVIDIA Hopper GPUs. Given only a task description, a benchmark script, and a non-functional code skeleton with no reference solution, M3 pushed hardware utilization from 7.6% to 71.3% over 24 hours and 147 iterations. Most competing models gave up after a few dozen tries. M3 didn't find its best solution until attempt 145. That kind of persistence is exactly what matters in agentic scenarios.
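For readers wondering what a "hardware utilization" figure measures, one common definition is achieved FLOP/s divided by the chip's peak FLOP/s. The sketch below assumes that definition; MiniMax's exact methodology isn't described here, and the peak throughput is left as a parameter rather than a quoted Hopper spec.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def utilization(m: int, n: int, k: int, seconds: float, peak_flops_per_s: float) -> float:
    """Fraction of peak throughput achieved by one timed matmul."""
    achieved = matmul_flops(m, n, k) / seconds
    return achieved / peak_flops_per_s


# Hypothetical numbers: a 1000^3 matmul in 2 ms on a device with 1e13 FLOP/s peak.
example = utilization(1000, 1000, 1000, seconds=0.002, peak_flops_per_s=1e13)
print(f"{example:.1%}")
```

A benchmark script for this kind of task would time each candidate kernel and feed the measurement into a function like `utilization`, which is what lets an agent compare attempt 145 against attempt 12 on equal footing.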

The Price Advantage

This is where M3 changes the game.

| Metric | M3 | Opus 4.8 |
| --- | --- | --- |
| Input price (per 1M tokens) | $0.30 (promo) | $5.00 |
| Output price (per 1M tokens) | $1.20 (promo) | $25.00 |
| Cache reads | $0.12 | $0.50 |
| Open-weight | Yes (within 10 days) | No |

Output cost difference: roughly 21x ($25.00 vs $1.20). Input cost: roughly 17x ($5.00 vs $0.30). If you're processing hundreds of millions of tokens daily, switching from Opus 4.8 to M3 can save millions of dollars a year, depending on your input/output mix.
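The savings claim is easy to sanity-check. The sketch below uses only the list prices from the table above; the workload (200M input and 100M output tokens per day) is a hypothetical example, and cache reads and promo expiry are ignored.

```python
# USD per 1M tokens, taken from the pricing table above.
PRICES = {
    "Opus 4.8": {"input": 5.00, "output": 25.00},
    "M3 (promo)": {"input": 0.30, "output": 1.20},
}


def daily_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Daily spend in USD for a workload measured in millions of tokens."""
    price = PRICES[model]
    return input_mtok * price["input"] + output_mtok * price["output"]


def annual_savings(input_mtok: float, output_mtok: float, days: int = 365) -> float:
    """Yearly difference between running the same workload on Opus 4.8 and on M3."""
    opus = daily_cost("Opus 4.8", input_mtok, output_mtok)
    m3 = daily_cost("M3 (promo)", input_mtok, output_mtok)
    return days * (opus - m3)


# Hypothetical workload: 200M input + 100M output tokens per day.
savings = annual_savings(input_mtok=200, output_mtok=100)
output_ratio = PRICES["Opus 4.8"]["output"] / PRICES["M3 (promo)"]["output"]
input_ratio = PRICES["Opus 4.8"]["input"] / PRICES["M3 (promo)"]["input"]
print(f"${savings:,.0f} per year; output {output_ratio:.1f}x, input {input_ratio:.1f}x")
```

At that workload the gap comes to about $1.2 million a year. Output-heavy workloads widen it, since output tokens carry the larger price difference.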

Where It Falls Short

M3 isn't perfect. On BenchLM's aggregate score, Opus 4.8 scores 95 versus M3's 76. Multimodal is the weakest area — OfficeQA Pro shows 66.2% for Opus 4.8 versus 45.1% for M3. All scores are vendor-reported (third-party verification is pending), and M3's SWE-Bench numbers were run on MiniMax's own Agent scaffolding — results may vary with different frameworks.

Bottom Line

M3's significance isn't that it beats closed models across the board. It's that the open-source curve has converged with the closed frontier. When an open-weight model delivers 80-90% of closed-model performance at 1/20th the cost, the closed-source business model starts to crack.


GPT-5.6: The Ghost in the Machine

This is the most mysterious of the three. GPT-5.6 hasn't been officially announced, let alone released. But the traces it left in Codex backend logs sketch a picture that's hard to ignore.

What We Know

GPT-5.6's internal codename is iris-alpha, alongside ember-alpha and beacon-alpha (unclear variants). The headline leak: a 1.5-million-token context window — 43% larger than GPT-5.5's 1.05M.

What does that mean in practice? GPT-5.5 can already ingest the entire Three-Body Problem trilogy. GPT-5.6's 1.5M context can swallow large code repositories, ultra-long legal contracts, and extended multi-turn agent conversations. Developer tests reportedly show the model responding fluidly at 900K tokens and handling requests exceeding 1.05M without breaking, though none of this is confirmed by OpenAI.
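Whether a given repository or contract actually fits is a back-of-the-envelope calculation. The sketch below assumes a rough heuristic of about 4 characters per token for English text; real tokenizers vary by language and content, so treat the output as an estimate. The window sizes are the figures quoted above, with the GPT-5.6 number still a leak.

```python
CONTEXT_WINDOWS = {
    "GPT-5.5": 1_050_000,
    "GPT-5.6 (leaked)": 1_500_000,
}


def relative_increase(old: int, new: int) -> float:
    """Fractional growth from old to new."""
    return (new - old) / old


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4 chars/token default is an assumed heuristic."""
    return int(len(text) / chars_per_token)


def fits(text: str, window: int, reserve_for_output: int = 0) -> bool:
    """True if the estimated prompt plus a reserved output budget fits the window."""
    return estimate_tokens(text) + reserve_for_output <= window


growth = relative_increase(CONTEXT_WINDOWS["GPT-5.5"], CONTEXT_WINDOWS["GPT-5.6 (leaked)"])
big_document = "x" * 5_000_000  # stand-in for roughly 1.25M tokens of text
print(f"{growth:.0%}", fits(big_document, CONTEXT_WINDOWS["GPT-5.5"]),
      fits(big_document, CONTEXT_WINDOWS["GPT-5.6 (leaked)"]))
```

Reserving part of the window for the model's output matters in agent loops, where the response and tool results share the same budget as the prompt.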

Another leaked capability: front-end generation. Screenshots show GPT-5.6 generating a minimal note-taking app called Lumen Notes with almost no prompt — mature grid layout, restrained color palette, clear typographic hierarchy. AI is moving from "generating code snippets" to "generating commercially viable UIs."

Expected Positioning

While official benchmarks aren't out, early signals point to targeted improvements in advanced reasoning and agentic workflows, plus better token efficiency. The same token budget should accomplish more.

Polymarket puts GPT-5.6's probability of releasing before June 30 at 80-89%. If it ships this month, alongside Claude Sonnet 4.8, Google Gemini 3.5 Pro, and xAI Grok 5 (all rumored for the June window), June 2026 will be the most competitive month in AI history.

Uncertainty

GPT-5.6's biggest challenge is OpenAI's own cadence. GPT-5.5 only launched April 23 — shipping 5.6 within two months would be unprecedented. Behind it is capital markets pressure: Anthropic filed its IPO first, and OpenAI needs to show investors and the SEC that its iteration velocity hasn't slowed.

There's also the GPT-5.5 overhang. Despite 82.6% on SWE-Bench Verified, GPT-5.5 manages only 58.6% on the harder SWE-Bench Pro, 10.6 points behind Opus 4.8. GPT-5.6 needs real improvements in coding and reasoning, or "1.5M context" becomes a flashy number attached to a disappointing experience.

Bottom Line

GPT-5.6 is AI's Schrödinger's cat — simultaneously existing and not, destined for glory or disappointment. But if that 1.5M context window delivers, it will reset the standard for "long context."


Head-to-Head: Who Wins in Which Scenario?

| Dimension | Claude Opus 4.8 | MiniMax M3 | GPT-5.6 (leaked) |
| --- | --- | --- | --- |
| SWE-Bench Pro | 69.2% | 59.0% | TBD (GPT-5.5: 58.6%) |
| Context window | 1M tokens | 1M tokens | 1.5M tokens |
| Aggregate score (BenchLM) | 95 | 76 | TBD |
| Input price (per 1M tokens) | $5.00 | $0.30 | TBD |
| Output price (per 1M tokens) | $25.00 | $1.20 | TBD |
| Open-weight | No | Yes | No |
| Multimodal | Text+image | Text+image+video | Text+image |
| Agent orchestration | Dynamic sub-agents | Agent Team (Mavis) | Agent SDK |
| Honesty | 4x improvement | Not disclosed | Not disclosed |
| Release status | Live | Live | Leaked, expected June |

Recommendations by Use Case

Enterprise coding (finance, healthcare, compliance) → Claude Opus 4.8. Honesty and reliability are non-negotiable when bugs cost more than API calls.

Individual developers / startups → MiniMax M3. 85% of the coding capability at 5% of the price. Plus, open-weight means data privacy and self-hosting.

Ultra-long-context tasks → GPT-5.6 (if 1.5M materializes). Large codebase analysis, marathon contract review, extended agent loops — context length is productivity.

Cost-sensitive high-volume production → MiniMax M3. The price gap is too wide to argue with.
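The "85% of the capability at 5% of the price" shorthand can be checked against the comparison table. This sketch simply divides M3's vendor-reported figures by Opus 4.8's; treating a score ratio as "share of capability" is a simplification, not a rigorous measure.

```python
# Figures from the head-to-head table (vendor-reported; prices in USD per 1M tokens).
opus = {"swe_bench_pro": 69.2, "benchlm": 95, "input_price": 5.00, "output_price": 25.00}
m3 = {"swe_bench_pro": 59.0, "benchlm": 76, "input_price": 0.30, "output_price": 1.20}

# M3 expressed as a fraction of Opus 4.8 on each dimension.
ratios = {key: m3[key] / opus[key] for key in opus}

for key, value in ratios.items():
    print(f"{key}: {value:.1%}")
```

The score ratios land at about 85% (SWE-Bench Pro) and 80% (BenchLM), and the price ratios at about 6% (input) and 4.8% (output), so the shorthand holds up as a rounded summary of the table.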


The Real Battlefield: Capital Markets

There's a thicker thread running underneath this model race.

On June 1, Anthropic confidentially filed its S-1 with the SEC, kicking off the IPO process at a $965 billion valuation, less than two weeks after closing a $65 billion funding round. Within a week, OpenAI announced its own confidential S-1 filing at $852 billion, targeting $1 trillion. Meanwhile, SpaceX-xAI is planning to price at a $1.75 trillion valuation. This fall, the three companies could bring a combined valuation of roughly $3.6 trillion to the public markets, or about $3.7 trillion if OpenAI reaches its $1 trillion target.

Every model iteration from Anthropic and OpenAI serves a dual purpose: yes, improve the technology, but more importantly, prove to capital markets that their iteration velocity hasn't slowed and their moat is deepening.

Why leak GPT-5.6 info less than two months after GPT-5.5? Why emphasize "honesty" as a feature on Opus 4.8 while keeping pricing flat? Because investors and the SEC don't read benchmarks — they read strategic narratives. And narratives need fresh material every quarter.

Anthropic's finances look healthier: ~$0.23 in annualized recurring revenue per dollar raised, roughly double OpenAI's ratio. Anthropic projects positive cash flow by 2028; OpenAI by 2030. But OpenAI has scale on its side: 900 million weekly active users and $20 billion in annualized revenue, which public markets reward heavily.

MiniMax introduces a wild card. While Anthropic and OpenAI compete over who IPOs first and at what valuation, M3 proves at 1/20th the cost that the "closed-source premium" is shrinking. If open-source keeps closing the gap, the trillion-dollar narratives need rewriting.

The same day M3 launched, Tencent Cloud announced massive price cuts on DeepSeek-V4 (cache-hit prices down 97.5%). The price war isn't coming — it's here. When open-source pushes inference costs toward zero, the entire "sell tokens" business model faces an existential question.


Conclusion

So, back to the opening question: who's leading?

Short-term: Claude Opus 4.8. It's the only fully launched and battle-tested flagship among the three. Honesty and reliability are unmatched, and it has a clear lead in the coding agent category.

Medium-term: GPT-5.6. A 1.5M context window is a qualitative leap. If it also fixes GPT-5.5's SWE-Bench Pro weakness, it will redefine the flagship bar.

Long-term: MiniMax M3. Not because of M3 itself, but because of what it represents: when open-source delivers 80-90% of frontier performance at 20x lower cost, the entire industry's value chain gets restructured. This isn't one model beating another. It's one paradigm beating another.

Honestly, though, it's too early for definitive answers. GPT-5.6 hasn't shipped. M3's third-party evaluations aren't out yet. On Polymarket, the "best AI model by end of June" bet has Anthropic at 83% — the market trusts Claude for now.

But if I had to pick a daily coding assistant today, I'd choose MiniMax M3. Not because it's the strongest of the three, but because it draws the best line between "good enough" and "affordable" that we've ever seen.

As for GPT-5.6 — the day it actually ships, I might change my answer.


In the end, the real winner of this three-way race isn't Anthropic, OpenAI, or MiniMax. It's developers. Whether you're using Opus 4.8's honest coding, M3's ridiculous price-performance, or GPT-5.6's 1.5M context, we're entering a golden age where good models are cheap and great models keep getting better. Ten years from now, June 2026 might be remembered as the inflection point.


Data sources: Anthropic official system card, MiniMax official technical report, OpenAI Codex log leaks, SWE-Bench, Terminal-Bench, BenchLM, Polymarket prediction markets. Third-party verification of MiniMax M3 results is pending.

