GPT-5.5 ships with the agent stack baked in

GPT-5.5 shipped Thursday with a Pro variant for problems the base model gives up on. The release is less interesting as a benchmark story than as a packaging one. Skills, MCP, computer use, a hosted shell, apply-patch, and tool search now sit inside the model surface rather than next to it.

What Shipped

There are two SKUs. The base GPT-5.5 is available through both Chat Completions and the Responses API. GPT-5.5 Pro is Responses-only. Both offer a 1M-token context window and default to medium reasoning effort, scaling it up or down with the task's apparent difficulty. ChatGPT users on Plus, Pro, Business, and Enterprise plans get the model in the consumer product. API pricing is $5 per million input tokens and $30 per million output tokens for the base model. Pro lands at $30 input and $180 output.
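
A minimal sketch of what calling the two SKUs might look like, assuming the current openai Python SDK's Responses interface and its reasoning-effort parameter carry over unchanged to this release. The gpt-5.5-pro model identifier is a guess from the announcement's naming, not a confirmed string.

```python
# Sketch only: SDK shapes per current openai-python conventions; model
# identifiers inferred from the announcement, not verified.
from openai import OpenAI

client = OpenAI()

# Base model: also reachable via Chat Completions, shown here on Responses.
base = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "medium"},  # the announced default; parameter shape assumed
    input="Summarize the failure modes in this stack trace: ...",
)

# Pro variant: Responses-only per the announcement.
pro = client.responses.create(
    model="gpt-5.5-pro",  # hypothetical SKU identifier
    input="Prove or refute: ...",
)

print(base.output_text)
print(pro.output_text)
```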

The Bundled Stack

The bundled agent stack matters more than the benchmarks. Function calling, prompt caching, Batch API, tool search, computer use, hosted shell, apply-patch, Skills, MCP, and web search are all part of what you call when you call GPT-5.5. For builders who have been wiring these together separately, that eliminates the routing, sandboxing, and patch-application code most agents currently carry themselves. The combination of hosted shell and apply-patch in particular collapses the read-edit-run loop coding agents have to assemble into a single model-side primitive. Tool search means the model picks from a registered tool set at runtime rather than having every tool crammed into the prompt up front, a practical win once tool counts climb past ten or so. OpenAI is positioning the model as an agent harness, not a chat endpoint, and the SKU now reflects that.
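
A sketch of what runtime tool selection might look like. The built-in tool type strings ("web_search", "shell", "apply_patch") are assumptions drawn from the announcement's feature list, not confirmed identifiers; the function tool follows the current Responses API schema.

```python
# Sketch: register a larger tool set and let the model pick at runtime,
# rather than packing every schema into the prompt. Built-in tool type
# names below are assumed from the announcement, not verified.
from openai import OpenAI

client = OpenAI()

tools = [
    {"type": "web_search"},    # hosted web search
    {"type": "shell"},         # hosted shell (type name assumed)
    {"type": "apply_patch"},   # model-side patch application (type name assumed)
    {
        "type": "function",
        "name": "query_orders_db",  # hypothetical in-house tool
        "description": "Run a read-only SQL query against the orders database.",
        "parameters": {
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        },
    },
    # ...more function tools; with tool search the model retrieves relevant
    # schemas on demand instead of reading all of them up front.
]

resp = client.responses.create(
    model="gpt-5.5",
    tools=tools,
    input="Find why order 4127 failed to ship and draft a fix.",
)
print(resp.output_text)
```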

What The Benchmarks Say

The benchmark claims are stronger than the usual day-one numbers. OpenAI reports 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, 84.9% on GDPval, and 78.7% on OSWorld-Verified, and GPT-5.5 Pro reportedly hits 39.6% on FrontierMath Tier 4 against Claude Opus 4.7's 22.9% on the same test. These are OpenAI's numbers. Independent verification comes later, and the gap between announcement-day claims and reproduced benchmarks has been wide before. The test selection itself is informative: almost all of these are agentic or code-execution benchmarks, consistent with the bundled-tooling framing. The 1M context window in the same announcement is a quieter catch-up. Gemini has been there since early 2024, Claude has long-context variants, and this closes a gap rather than opening one. Useful for maintaining state across long tool sequences, but not the headline OpenAI's announcement implies.

What's Missing

What's missing matters. GPT-5.5 Pro is Responses API only, which splits OpenAI's developer surface. Teams that have run production deployments on Chat Completions for two years now have to migrate to a different endpoint to reach the harder-problem variant. Different request schema, different streaming behavior, different tooling assumptions. Not insurmountable, but not free either, and the announcement doesn't make that cost clear. This is OpenAI quietly nudging the ecosystem toward the Responses API as the default surface, consistent with where the agentic features have been landing, but worth naming as migration pressure. The 1M context window needs the same calibration: the announced number is one thing, usable context another. Every frontier model's long-range performance degrades before the stated ceiling, and until somebody runs needle-in-a-haystack and recall tests on GPT-5.5, practical context should be assumed shorter than marketing context.
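
To make the migration concrete, here is the same request against both surfaces, using current SDK shapes; nothing in the sketch is specific to GPT-5.5 beyond the model names, which follow the announcement.

```python
# The migration in miniature: same question, two request shapes.
from openai import OpenAI

client = OpenAI()

# Chat Completions: messages in, choices out. Base model only.
cc = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "Explain tail-call optimization."}],
)
print(cc.choices[0].message.content)

# Responses: input in, output items out, and a different streaming event
# model. Pro is only reachable through this shape per the announcement.
r = client.responses.create(
    model="gpt-5.5-pro",  # hypothetical SKU identifier
    input=[{"role": "user", "content": "Explain tail-call optimization."}],
)
print(r.output_text)
```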

The pricing is steep enough to make orchestrating cheaper models more attractive than calling GPT-5.5 directly. $30 per million output tokens on the base model puts GPT-5.5 in the same neighborhood as the prior generation's reasoning models, not the workhorse tier. Pro at $180 output is six times that. For high-volume agentic workloads that fan out into many tool calls, those numbers compound quickly. The default-to-medium reasoning effort assumes the model scales effort up only when needed. Whether it does that well or burns compute on tasks that didn't require it is the actual question, and production usage will answer it faster than benchmarks will.
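A back-of-envelope worked example of how the fan-out compounds, using the announced rates. The workload shape, forty tool-call turns with these token counts, is hypothetical, and prompt caching discounts are ignored.

```python
# Announced rates, $ per million tokens.
IN_BASE, OUT_BASE = 5.00, 30.00     # GPT-5.5
IN_PRO,  OUT_PRO  = 30.00, 180.00   # GPT-5.5 Pro

def cost(tokens_in: int, tokens_out: int, rate_in: float, rate_out: float) -> float:
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

# Hypothetical agentic task: 40 tool-call turns, ~6k input and ~1k output
# tokens each. Tool results re-enter the context, so input grows per turn.
turns, t_in, t_out = 40, 6_000, 1_000
print(f"base: ${cost(turns * t_in, turns * t_out, IN_BASE, OUT_BASE):.2f}")
print(f"pro:  ${cost(turns * t_in, turns * t_out, IN_PRO, OUT_PRO):.2f}")
# base: $2.40   pro: $14.40  -- per task, before any caching discounts.
```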

Who This Is Actually For

The intended audience is the builder running agentic workflows that already needed Skills, MCP, computer use, and a hosted shell. That builder gets a single model surface instead of stitched-together infrastructure, and the pricing, while steep, is plausible against orchestrating multiple cheaper calls plus their own tool runtime. It's a worse fit for chat applications, where base pricing exceeds what most chat workloads justify and the bundled tooling goes unused; the cheaper, faster models in OpenAI's lineup are still the right call there. The Pro variant narrows further: research, complex reasoning, problems where one accurate answer beats five cheap attempts and where $180 per million output tokens is justified by correctness dominating cost. Teams chaining cheaper models for hard problems will compare reliability and total cost, not per-token rates, as the sketch below illustrates.
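
A sketch of that comparison: expected spend to a first correct answer under independent retries. The solve rates and per-attempt costs are illustrative placeholders, not measurements of any model.

```python
# Expected cost until first success with independent retries:
# E[cost] = cost_per_attempt / p_success (geometric distribution).
def expected_cost(cost_per_attempt: float, p_success: float) -> float:
    return cost_per_attempt / p_success

# Placeholder numbers, not measured solve rates.
print(f"cheap model:  ${expected_cost(0.50, 0.35):.2f}")  # ~$1.43 expected
print(f"pro-priced:   ${expected_cost(3.00, 0.80):.2f}")  # $3.75 expected
# Per-token rates say 6x; expected-cost math can narrow or widen that gap
# depending on where the solve rates actually land.
```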

The real test is whether the bundled tool stack actually reduces builder friction, or whether teams keep their existing tool infrastructure and route a chat completion through the model anyway. The pricing makes the second outcome more likely than the agent-harness framing suggests.