How we contain Claude across products

The persistent, quiet problem with most AI sandboxing is not that it’s ineffective, but that it’s a black box. We, as users and observers, are asked to trust that robust boundaries exist without any clear understanding of what those boundaries are, what trade-offs were made, or how they’ve been tested. Anthropic has just published a detailed technical overview of how it contains Claude across its ecosystem, and in doing so, has set a new, necessary standard for transparency in an industry where…

In-depth Analysis

Anthropic just did something deceptively simple that feels revolutionary in the AI space: they published a clear, comprehensive map of their own security boundaries. In a field obsessed with scaling laws and benchmark scores, meticulously documenting how and where you contain your own creation is a quiet act of profound responsibility and a scathing indictment of the industry's default opacity.

The core of it is this: Anthropic has laid out, in plain language, how it sandboxes Claude across its consumer web product (Claude.ai), its developer-focused local tool (Claude Code), and its new collaborative agent environment (Claude Cowork). They use process sandboxes like gVisor, OS-level frameworks like macOS Seatbelt and Linux Bubblewrap, and full-blown virtual machines. The goal is a "hard boundary," a concept that is beautifully simple yet terrifyingly hard to implement. The example they give is perfect: if an API key never enters the sandbox, it can't be stolen, no matter how cleverly the model or an attacker probes. This isn't just a feature; it's a design philosophy. Rather than trusting the sandbox to catch every misuse, you keep the most valuable assets out of its reach entirely.
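The "hard boundary" idea can be sketched as a credential-injecting proxy that runs outside the sandbox. This is a minimal illustration under stated assumptions, not Anthropic's implementation. The `Request` and `CredentialInjectingProxy` names, the host-to-key mapping, and the bearer-token header format are all hypothetical. Code inside the sandbox builds requests that carry no key. The host-side proxy checks the destination and attaches the secret only on the way out.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """An outbound HTTP request as built by code inside the sandbox (hypothetical)."""
    host: str
    path: str
    headers: dict[str, str] = field(default_factory=dict)


class CredentialInjectingProxy:
    """Runs OUTSIDE the sandbox and is the only component that ever holds the key."""

    def __init__(self, secrets_by_host: dict[str, str]) -> None:
        # Upstream host -> API key, loaded on the host side (e.g. from its own env).
        self._secrets = dict(secrets_by_host)

    def forward(self, req: Request) -> Request:
        if req.host not in self._secrets:
            raise PermissionError(f"egress to {req.host!r} is not allowed")
        # Discard any Authorization header the sandbox supplied, so it cannot
        # substitute its own (e.g. an attacker's) credential, then attach the real key.
        headers = {k: v for k, v in req.headers.items() if k.lower() != "authorization"}
        headers["Authorization"] = f"Bearer {self._secrets[req.host]}"
        return Request(req.host, req.path, headers)


if __name__ == "__main__":
    proxy = CredentialInjectingProxy({"api.example.com": "sk-host-side-only"})
    inside = Request("api.example.com", "/v1/items")  # built in the sandbox, no key
    outbound = proxy.forward(inside)
    print("Authorization" in inside.headers)   # False
    print(outbound.headers["Authorization"])   # Bearer sk-host-side-only
```

The key exists only in the proxy's memory. As a result, nothing the sandboxed code reads, prints, or uploads can contain it. The proxy still decides where the key gets used, so its allowlist becomes part of the boundary that needs protecting.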

This matters because, as the original author rightly notes, sandboxing is the silent, unglamorous guardian of the entire AI agent promise. We are hurtling toward a future where these models don't just chat; they write and execute code, browse the web, and interact with our files and systems. The sandbox is the container for that godlike power. Without robust, verified, and documented containment, you're essentially handing a live grenade to a toddler and hoping for the best. Most companies treat this layer as a trade secret or, worse, an afterthought. Anthropic, by contrast, is treating it as a foundational pillar and inviting scrutiny.

And let's be honest, the scrutiny is warranted. Anthropic openly admits to past misses, like the API file exfiltration vector that was discovered externally. This isn't a weakness in their argument; it's its greatest strength. It acknowledges the adversarial, iterative nature of security. A system is only as secure as its last unpatched flaw. By documenting their approach and past errors, they are creating a living textbook, not a static boast. It turns their security posture from a black box into a shared, evolving engineering challenge. It builds trust not through claims of perfection, but through evidence of rigor and humility.

Contrast this with the rest of the field. How many AI labs publish clear schematics of their containment strategies? We hear vague assurances of "safety" and "security" baked in, but we see little of the underlying architecture. This opacity is dangerous. It forces developers and enterprises to build on faith, not engineering. It makes it impossible for the broader security community to audit, challenge, and improve these critical systems. Anthropic’s move shatters this norm. It implicitly says: "This is hard, here's how we're tackling it, here's where we've failed, and we believe this transparency makes everyone safer." It's a direct play for the trust of developers and corporations who need to know exactly what walls the AI is bouncing off before they let it near their data.

The specific technical choices are also telling. A full VM for Cowork makes sense for a product designed to handle complex, potentially untrusted collaborative tasks, because it is the heaviest-duty container of the three. gVisor, a user-space application kernel that intercepts the sandboxed program's system calls, is a smart, modern choice for the web interface: it provides strong isolation with less overhead than a full VM. OS-native tools like Seatbelt and Bubblewrap are a pragmatic fit for the local CLI and match the trust model of code running on your own machine. This isn't a one-size-fits-all solution; it's a thoughtful portfolio of containment strategies matched to the risk profile of each product. It shows that security isn't a checkbox but a continuous, context-aware engineering discipline.
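To make the OS-native option concrete, here is a sketch of how a local tool might assemble a Bubblewrap (`bwrap`) command line on Linux. The flags are standard Bubblewrap options. The specific policy is a hypothetical illustration: a read-only root, one writable project directory, and no network. The same goes for the `bwrap_argv` helper. Neither reflects Claude Code's actual configuration, and the macOS Seatbelt side is not shown.

```python
import shlex


def bwrap_argv(project_dir: str, command: list[str], allow_network: bool = False) -> list[str]:
    """Build a hypothetical Bubblewrap command line for running `command`."""
    argv = [
        "bwrap",
        "--ro-bind", "/", "/",               # host filesystem visible but read-only
        "--dev", "/dev",                     # minimal device nodes
        "--proc", "/proc",                   # fresh procfs for the new PID namespace
        "--tmpfs", "/tmp",                   # private, throwaway /tmp
        "--bind", project_dir, project_dir,  # the only writable host path
        "--unshare-pid",                     # cannot see or signal host processes
        "--die-with-parent",                 # sandbox is killed if the launcher exits
        "--new-session",                     # detach from the controlling terminal
        "--chdir", project_dir,
    ]
    if not allow_network:
        argv.append("--unshare-net")         # empty network namespace: no egress
    return argv + ["--", *command]


if __name__ == "__main__":
    print(shlex.join(bwrap_argv("/home/dev/project", ["python3", "-c", "print('hi')"])))
```

Pass the returned list to `subprocess.run` to execute the command inside the sandbox. A read-only root still lets the sandbox read secrets such as SSH keys. A stricter policy would therefore also mask sensitive directories, for example by mounting a `--tmpfs` over them. Toggling `--unshare-net` is all-or-nothing. Finer-grained egress control typically means routing traffic through a filtering proxy outside the namespace.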

This publication also serves a clever strategic purpose. By open-sourcing their Anthropic Sandbox Runtime (srt) and documenting it so clearly, they are setting a de facto standard. They are making it easier for other developers to build securely on their platform, thereby growing their ecosystem. But more than that, they are raising the bar for everyone. The silent question to other AI labs now is: "Where is your map? What are your boundaries?" In an era of increasing regulatory scrutiny, being the company that can point to a detailed, publicly-discussed security architecture is a monumental advantage.

Ultimately, this isn't just about a blog post. It's about a fundamental shift in how we should evaluate AI companies. The race to AGI is not just about who has the smartest model, but who has the most trustworthy one. Trust is built on verifiable evidence, not marketing. By laying their security foundations bare, Anthropic is arguing that trustworthiness is as critical as raw capability. It's a refreshingly mature approach in an industry still often behaving like a rowdy teenager. It doesn't guarantee safety (nothing ever does), but it creates the necessary condition for it: an open, evidence-based conversation about where the guardrails are, how strong they are, and how we can collectively make them stronger. This is how you build an industry that deserves to be trusted.

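The security white paper Anthropic published last week reads like a long-overdue act of redemption. In an era when AI companies routinely keep security details under wraps and placate users with vague promises, Anthropic has pulled the sandbox's guts out for everyone to see. gVisor, Seatbelt, Bubblewrap, full virtual machines: this combination shows at least one thing. They are genuinely wrestling with the question "what if Claude goes rogue?" rather than just slipping "security is our top priority" into their marketing copy.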

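Yet the white paper's value only throws the industry's norm into sharper relief. We use all kinds of AI tools and hand them our code, our data, even a company's core business logic, yet almost no one can say what that trust actually rests on. Anthropic's candor has, by contrast, become a scarce commodity. Picture a food street where no restaurant ever lets you see its kitchen, and suddenly one of them puts up a live camera feed. You ought to take it for granted, yet you can't help looking twice.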

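Still, don't mistake this document for a security guarantee. It records how they built their high walls, and it also admits the walls have been breached. The data-exfiltration path through the api.anthropic.com/v1/files endpoint deserves particular attention. The vulnerability itself (an access-control oversight around uploaded files) is nothing new, but its existence exposes a core tension: the attack surface of AI systems is growing explosively. An AI agent may be authorized to operate on the file system, call APIs, or even control a browser. Each of those authorized capabilities can become the small crack that breaches the dam. However tight the sandbox, the "boundary" itself becomes a new asset that needs rigorous protection. Traditional security models look somewhat out of their depth against AI agents that can reason autonomously and try "creative" ways around their restrictions. The familiar principle of least privilege needs to be thoroughly reinvented here.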
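Endpoint-based exfiltration paths like the `/v1/files` route suggest a general lesson: a host-level allowlist can be too coarse. Once a host that accepts uploads is reachable, every endpoint on it is reachable too. The sketch below contrasts a host-only check with a default-deny rule that also constrains method and path. This is a hypothetical policy, not Anthropic's fix. The `Rule` type and the example rules are assumptions, and path-level filtering presumes the proxy can see the decrypted request line.

```python
import posixpath
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Rule:
    """Hypothetical egress rule: host + allowed methods + allowed path prefixes."""
    host: str
    methods: frozenset[str]
    path_prefixes: tuple[str, ...]


def host_only_allows(allowed_hosts: set[str], method: str, url: str) -> bool:
    # Coarse policy: any method and any path pass, as long as the host matches.
    return urlsplit(url).hostname in allowed_hosts


def _path_matches(path: str, prefix: str) -> bool:
    # Match the prefix as a whole path segment, so "/simple" does not match "/simplex".
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def rule_allows(rules: list[Rule], method: str, url: str) -> bool:
    parts = urlsplit(url)
    # Collapse "." and ".." segments so "/simple/../upload" is judged as "/upload".
    path = posixpath.normpath(parts.path or "/")
    for rule in rules:
        if (
            parts.hostname == rule.host
            and method.upper() in rule.methods
            and any(_path_matches(path, p) for p in rule.path_prefixes)
        ):
            return True
    return False  # default deny: anything not explicitly allowed is blocked


if __name__ == "__main__":
    hosts = {"api.anthropic.com", "pypi.org"}
    rules = [Rule("pypi.org", frozenset({"GET"}), ("/simple",))]
    upload = "https://api.anthropic.com/v1/files"
    print(host_only_allows(hosts, "POST", upload))  # True: host matches
    print(rule_allows(rules, "POST", upload))       # False: no rule covers it
```

Path rules narrow the attack surface but do not eliminate it. An allowed endpoint can still carry data out in a request body. For that reason, keeping secrets out of the sandbox matters as much as filtering its traffic.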

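Anthropic notes that Claude.ai uses gVisor, Claude Code uses lightweight sandboxes (Seatbelt on macOS, Bubblewrap on Linux), and Cowork goes all the way to a full virtual machine. This differentiated strategy is telling: it concedes that there is no silver bullet. Locally run tools are more sensitive to performance, while cloud services can absorb the overhead of virtual machines. But it also raises a more fundamental question: who ultimately pays for security? Will users accept restricted features and added latency, or will vendors invest heavily in maintaining several complex architectures? Commercially, the sustainability of the latter is a challenge in its own right.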

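Most intriguing is their decision to open-source the sandbox runtime (srt). Open-sourcing core security infrastructure is clearly a show of confidence and a smart way to earn the industry's trust. But it is also a double-edged sword: attackers get the same map of your defenses and can hunt it for blind spots. Even so, in a field as full of black boxes as AI security, code that can be openly audited is far more reliable than opaque promises. At the very least, the community's white hats now have a clear target to attack and improve.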

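Reading it end to end, my strongest impression was not "how good Anthropic's security is" but "how complex and urgent AI security is." The document reads like engineering notes from the front line: it tells you how the trenches were dug and where the enemy once slipped through. It offers no ultimate solution. Instead, it shows the systematic thinking and layered defenses a serious engineering team can bring to bear against a rapidly evolving threat.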

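The lesson for the wider industry may be this: as AI agents grow more capable, security can no longer be a patch applied after launch; it has to be part of the architecture's DNA from the start. We may be entering a new phase. In it, the first criterion for judging an AI tool would no longer be its benchmark scores, but how reliable its sandbox is and how contained its failure modes are. Seen this way, Anthropic's white paper is less a technical showcase than a new bar for the industry. Now we'll see whether other vendors follow suit or keep pretending the mess in their own kitchens doesn't exist.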

Disclaimer: The above content is generated by AI and is for reference only.