The Agent Infrastructure Bet

Five companies shipped production agent tooling in March. That's worth paying attention to.

Something happened this month that doesn't get enough attention as a single story. OpenAI released GPT-5.4 on March 5. Anthropic dropped Claude Sonnet 4.6 shortly after. ElevenLabs quietly shipped a batch of agent platform updates. The Model Context Protocol got donated to the Linux Foundation. LangChain rebranded its agent tooling as Fleet.

None of these companies coordinated. They all just looked at the same problem and shipped at the same time.

The problem: agents have been stuck in demo purgatory for two years. The tools for actually running them in production — at scale, reliably, on real business workflows — have been lagging behind the models. This month feels like that started to change.

What Actually Shipped

GPT-5.4 is OpenAI's clearest statement yet that agents are a first-class priority. The headline feature is native computer use — the model can look at a screenshot, figure out what's on screen, and return structured actions like clicks and keystrokes. OpenAI tested it against desktop navigation benchmarks and it beat average human performance. That's not a cherry-picked eval. It's a real capability shift.

The 1M-token context window also shipped, though it's not on by default — without explicit opt-in you get a fraction of that. Worth knowing before you architect around it. GPT-5.4 also ships Tool Search, which lets agents look up tool definitions on demand instead of loading everything into the prompt upfront. For anyone running agents with large tool libraries, the cost reduction is significant.
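The Tool Search idea is easy to sketch locally: keep the full registry out of the prompt and serialize only the tools relevant to the current task. The naive keyword scoring below is a stand-in assumption; the real feature presumably retrieves more intelligently:

```python
# Sketch of the tool-search pattern with a toy registry and naive keyword
# matching. Tool names and the scoring method are illustrative assumptions.
import json

TOOL_REGISTRY = {
    "create_invoice": {
        "description": "Create a new invoice for a customer account",
        "parameters": {"customer_id": "string", "amount": "number"},
    },
    "refund_payment": {
        "description": "Refund a completed payment transaction",
        "parameters": {"payment_id": "string"},
    },
    "search_orders": {
        "description": "Search orders by customer, date range, or status",
        "parameters": {"query": "string"},
    },
}


def find_tools(task: str, registry: dict, limit: int = 2) -> dict:
    """Rank tools by keyword overlap with the task; keep the top hits."""
    words = set(task.lower().split())

    def score(item):
        name, spec = item
        text = (name + " " + spec["description"]).lower().replace("_", " ")
        return sum(w in text for w in words)

    ranked = sorted(registry.items(), key=score, reverse=True)
    return dict(ranked[:limit])


task = "refund the payment for order 1234"
selected = find_tools(task, TOOL_REGISTRY)

# Only the selected subset gets serialized into the prompt.
full_size = len(json.dumps(TOOL_REGISTRY))
sent_size = len(json.dumps(selected))
```

With three toy tools the savings are trivial; with hundreds of real tool definitions, each carrying full JSON schemas, the same pattern is where the reported cost reductions come from.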

Claude Sonnet 4.6 covers similar ground from Anthropic's side — improved computer use, stronger agent planning, and a 1M-token context window in beta. Anthropic also shipped their own version of Tool Search independently, around the same time as OpenAI. The token savings they're reporting internally are in the same ballpark. Two companies solving the same problem without coordinating is usually a signal the pain is real.

ElevenLabs is less flashy but the releases are pointed. A unified Users page that gives you conversation history across all your voice agents. Custom SIP header support. Conversation filtering by metadata. Individually these are table-stakes features. Together they suggest ElevenLabs is going after contact center and telephony deployments seriously — these are the things enterprise buyers ask for, not indie developers.
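Metadata filtering is the kind of feature that is easier to appreciate with a concrete shape in mind. The sketch below runs against a local list of records as a stand-in; the field names ("agent_id", "metadata") are assumptions, not ElevenLabs' actual schema:

```python
# Illustrative only: filtering conversation records by metadata, mirroring
# what a platform-side filter might do. Record shape is an assumption.
conversations = [
    {"id": "c1", "agent_id": "support-bot", "metadata": {"region": "eu", "tier": "pro"}},
    {"id": "c2", "agent_id": "sales-bot", "metadata": {"region": "us", "tier": "free"}},
    {"id": "c3", "agent_id": "support-bot", "metadata": {"region": "us", "tier": "pro"}},
]


def filter_conversations(records, **wanted):
    """Keep records whose metadata matches every requested key/value pair."""
    return [
        r for r in records
        if all(r["metadata"].get(k) == v for k, v in wanted.items())
    ]


pro_us = filter_conversations(conversations, region="us", tier="pro")
```

Doing this server-side, across every agent in an account, is exactly the kind of plumbing contact center buyers expect and indie demos never need.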

MCP got donated to the Linux Foundation, which formed the Agentic AI Foundation alongside Block, OpenAI, Google, Microsoft, AWS, Cloudflare, and Bloomberg. There are now over 10,000 active public MCP servers. ChatGPT, Cursor, Gemini, Microsoft Copilot, and VS Code have all adopted it. The practical effect: MCP is no longer "Anthropic's standard." It's neutral infrastructure, which makes it a much easier sell internally at enterprises that were nervous about building on top of something one company controlled.

LangChain Fleet is Agent Builder with a new name and a clearer pitch. The rebrand signals who they're selling to — enterprise teams that need a shared control plane for agent development, not individual developers spinning up one-off workflows. Whether Fleet actually delivers on that is something we'll watch.

Why Everyone Is Building This Now

Agents have a few hard problems that have been blocking production deployments for a while. They forget things. They choke when given too many tools. They can't touch existing software without custom integrations. This month's releases are direct attacks on all three.

The 1M-token context windows from both OpenAI and Anthropic address the memory problem — though both are opt-in or in beta, not defaults. Tool Search addresses the token overhead problem. Computer use addresses the integration problem. Three blockers, all getting meaningful attention in the same month.

The MCP story is worth slowing down on because most coverage treated it as a footnote. A year ago, MCP was an open-source project that Anthropic released and controlled. Enterprises had legitimate questions about whether to build on a standard one company could change at will. The Agentic AI Foundation changes that. When Microsoft, Google, AWS, and OpenAI all co-sign a standard, it tends to become permanent. The incentives to defect are low when everyone is already building on top of it.

The most telling detail from this month isn't any individual release, though. It's that OpenAI and Anthropic both shipped Tool Search within weeks of each other without coordinating. When one lab ships something, it could be a bet that doesn't pan out. When two labs independently build the same thing after watching real production usage, it's pointing at a real constraint. The overhead from large tool surfaces was clearly showing up in production data at both companies. Same thing with the 1M-token windows: neither was responding to the other, both were responding to the same customer feedback about state management in longer-running workflows.

When both major labs move in the same direction at the same time, that direction is probably load-bearing for where agent development is heading.

What This Unlocks for Builders

A few things are meaningfully different now versus six months ago.

Persistent agents are actually viable. The context window improvements mean you're not constantly fighting memory constraints on longer tasks. Pair that with ElevenLabs' conversation history tracking and you can build agents that accumulate context across sessions without architecting a custom state management layer from scratch.
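"Accumulate context across sessions" still needs a budget, even with 1M-token windows. A rough sketch of the pattern, with a crude word count standing in for a real tokenizer (an assumption, as is the whole class design):

```python
# Sketch of cross-session memory: each session appends a summary, and the
# next session loads as many recent entries as fit a token budget.
from collections import deque


class SessionMemory:
    def __init__(self, budget_tokens: int = 200):
        self.budget = budget_tokens
        self.entries: deque[str] = deque()

    @staticmethod
    def _tokens(text: str) -> int:
        return len(text.split())  # crude stand-in for a real tokenizer

    def remember(self, summary: str) -> None:
        self.entries.append(summary)

    def context(self) -> str:
        """Walk backward from the newest entry, keeping what fits the budget."""
        kept, used = [], 0
        for entry in reversed(self.entries):
            cost = self._tokens(entry)
            if used + cost > self.budget:
                break
            kept.append(entry)
            used += cost
        return "\n".join(reversed(kept))


memory = SessionMemory(budget_tokens=10)
memory.remember("User prefers weekly summaries")
memory.remember("Open ticket 4812 about billing")
memory.remember("User renewed the pro plan yesterday, asked about seat limits")
context = memory.context()  # only the newest entry fits the tiny budget
```

The larger windows don't remove this layer; they raise the budget high enough that simple recency-based trimming like this stays viable far longer before you need summarization or retrieval.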

Tool-heavy agents got cheaper to run. Both OpenAI and Anthropic now let agents discover tools on demand rather than front-loading everything into context. If you've been hitting cost or accuracy walls with large tool surfaces, this is worth testing immediately.

Computer use is production-ready at the major labs. You don't need a separate pipeline for UI automation anymore. Both GPT-5.4 and Sonnet 4.6 handle it natively. That opens up automated testing, cross-platform automation, anything that lives in software without an API.

Team-scale agent ops have a clearer pattern. LangChain Fleet and Anthropic's Cowork Projects are different products solving overlapping problems: how teams govern, manage, and run agents together. Neither is fully mature yet, but the pattern is emerging.

The Catches

GPT-5.4's best agent features run through the Responses API, not Chat Completions. That's a real integration difference, not a minor naming distinction, and worth knowing before you build. Anthropic's 1M context window is in beta and requires opt-in. ElevenLabs caps conversation retention at 365 days, with shorter defaults, which matters if you're building agents meant to maintain relationship context over time. MCP's new tool risk annotations require coordinated updates on both the client and server side — ship one without the other and you introduce safety gaps during the transition.
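The client side of that transition gap has a sensible defensive shape: until a server is known to emit annotations, treat every tool as needing confirmation. The sketch below is illustrative; the "riskLevel" field name and "read_only" value are assumptions about the annotation shape, not MCP's actual spec:

```python
# Hypothetical client-side gating during the annotation rollout. Field and
# value names are assumptions for illustration, not the MCP schema.
SAFE_BY_DEFAULT = {"read_only"}


def needs_confirmation(tool: dict, server_emits_annotations: bool) -> bool:
    """Decide whether a tool call should pause for human approval."""
    annotations = tool.get("annotations", {})
    risk = annotations.get("riskLevel")
    if risk is None:
        # Absence of an annotation only means "safe" if the server is
        # known to annotate everything; otherwise assume the worst.
        return not server_emits_annotations
    return risk not in SAFE_BY_DEFAULT


# A legacy server that predates annotations: every call gets gated.
legacy_gated = needs_confirmation({"name": "delete_records"}, server_emits_annotations=False)

# An upgraded server marking a read-only tool: the call can proceed.
new_tool = {"name": "list_records", "annotations": {"riskLevel": "read_only"}}
new_gated = needs_confirmation(new_tool, server_emits_annotations=True)
```

The point of the pattern is that the dangerous state isn't "no annotations anywhere," it's a client that silently interprets a missing annotation as permission.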

The Linux Foundation governance model for MCP also means iteration will slow down compared to when Anthropic was running it directly. Probably the right long-term tradeoff for an open standard. Short-term, don't expect the same pace.

Bottom Line

Pick one layer and go deep rather than trying to integrate everything at once.

If computer use is relevant to what you're building, GPT-5.4 and Sonnet 4.6 are both worth proper testing now — find out where they break for your specific use case rather than assuming they will. If you're dealing with large tool libraries, Tool Search from either lab is worth implementing. The efficiency gains are significant enough to justify the migration cost. If you're building voice agents for enterprise, ElevenLabs' ops features are more mature than they look — the SIP integration alone opens up telephony workflows that required a lot of custom work before. And if you're evaluating infrastructure for a longer-term agent platform, MCP's neutral governance makes it a safer bet than it was six months ago.

This many infrastructure releases in a single month doesn't happen by accident. Figure out which layer matters most for what you're building and start there.