Claude Opus 4.7 Ships Coding Gains and a Million-Token Window, Then Draws a Backlash
Claude Opus 4.7 shipped on April 16 as Anthropic's new top-of-line model — a generation update to the Opus tier that had been sitting on 4.6 since late 2025. The release pushes the context window to one million tokens, adds the first high-resolution image support in the Claude lineup, ships a new tokenizer, and targets agentic coding as its primary design surface. It's generally available now across Anthropic's API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. No waitlist.
Three weeks later, the benchmarks tell one story. The developer community tells a different one.
What's in the Box
The spec changes from Opus 4.6 are concentrated in three areas. The context window goes from 200K tokens to 1M, with no long-context pricing surcharge — you pay the same $5 per million input tokens and $25 per million output tokens regardless of prompt length. Maximum output jumps to 128K tokens. And the model now includes adaptive thinking, where it can allocate more internal reasoning steps to harder problems within a single request.
Image handling gets the biggest percentage leap. Opus 4.7 accepts images up to 2,576 pixels on the long edge, roughly 3.75 megapixels. That's more than three times the resolution ceiling of prior Claude models. For workflows that process screenshots, diagrams, or scanned documents, the jump from blurry downsampled inputs to near-native resolution is a functional capability change, not an incremental one.
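For pipelines that resize images before upload, the long-edge ceiling translates into a simple scaling rule. The sketch below is illustrative only: `fit_long_edge` is a hypothetical helper, not part of any Anthropic SDK, and the article does not say whether the API downsamples oversized images on its own, so treat client-side resizing as an assumption.

```python
# Hypothetical helper: compute target dimensions so the long edge stays within
# the 2,576-pixel ceiling quoted above. Pure arithmetic, no imaging library.
MAX_LONG_EDGE = 2576  # pixels, figure taken from the article


def fit_long_edge(width: int, height: int, max_long_edge: int = MAX_LONG_EDGE) -> tuple[int, int]:
    """Return (width, height) scaled down, preserving aspect ratio, if needed."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height  # already within the ceiling, leave untouched
    scale = max_long_edge / long_edge
    # Round to whole pixels and never let a dimension collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

A 5,152 by 2,000 screenshot would come out at 2,576 by 1,000, while anything already under the ceiling passes through unchanged.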
The model also ships with a new tokenizer. This matters more than it sounds — covered below.
The Coding Benchmarks
The headline numbers are strong and not in dispute. SWE-bench Verified jumped from 80.8% on Opus 4.6 to 87.6%, a gain of nearly seven points on the industry's most-watched coding eval. SWE-bench Pro climbed from 53.4% to 64.3%, nearly eleven points. CursorBench went from 58% to 70%. On multi-file editing specifically, independent evaluations show a six-to-eight-point improvement: the model holds naming conventions and code style more consistently across files than 4.6 did.
On SWE-bench Pro, Opus 4.7's 64.3% sits ahead of GPT-5.5 at 58.6% and DeepSeek V4 Pro at 55.4%. For the specific task of diagnosing and fixing bugs in production codebases, this is the best generally available model right now. Finance Agent also climbed, from 60.7% to 64.4%.
The profile isn't uniformly up. BrowseComp, which measures web search and multi-page synthesis, dropped from 83.7% on Opus 4.6 to 79.3%. GPT-5.5 scores 89.3% on that benchmark; Gemini 3.1 Pro scores 85.9%. If your agents lean on browsing and web-sourced synthesis, that ten-point gap to GPT-5.5 is a routing decision, not a rounding error.
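One way to act on that split is a benchmark-keyed router. The following is a minimal sketch, assuming a team is willing to route on the published scores alone. The scores are the ones quoted in this article, but the task categories, the `pick_model` function, and the mapping from task to benchmark are all hypothetical, and the model names are display labels rather than API model IDs.

```python
# Scores quoted in this article, in percent. Keys are display labels,
# not real API model identifiers.
SCORES = {
    "swe_bench_pro": {"Claude Opus 4.7": 64.3, "GPT-5.5": 58.6, "DeepSeek V4 Pro": 55.4},
    "browsecomp": {"Claude Opus 4.7": 79.3, "GPT-5.5": 89.3, "Gemini 3.1 Pro": 85.9},
}

# Hypothetical mapping from a team's task categories to the closest benchmark.
TASK_TO_BENCHMARK = {
    "bug_fix": "swe_bench_pro",
    "web_research": "browsecomp",
}


def pick_model(task_kind: str) -> str:
    """Return the label of the top-scoring model on the benchmark tied to task_kind."""
    scores = SCORES[TASK_TO_BENCHMARK[task_kind]]
    return max(scores, key=scores.get)


def gap_to_leader(benchmark: str, model: str) -> float:
    """Percentage points by which `model` trails the benchmark leader (0.0 if it leads)."""
    scores = SCORES[benchmark]
    return round(max(scores.values()) - scores[model], 1)
```

Under these numbers, `pick_model("bug_fix")` selects Opus 4.7 and `pick_model("web_research")` selects GPT-5.5. A production router would also weigh cost, latency, and the team's own evals, none of which this sketch models.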
The Tokenizer and What It Costs
The rate card didn't change. The tokenizer did. Opus 4.7's new tokenizer produces more tokens from the same text, and Anthropic's documentation acknowledges up to 35% inflation. Third-party analysis of production workloads found 32–34% more tokens on prompts over 10,000 tokens, with smaller prompts seeing even higher inflation at 42–45%, above the documented ceiling.
The effect is a cost increase that doesn't appear on the pricing page. Your per-token rate is identical. Your per-task cost goes up because each task now consumes more tokens. An OpenRouter analysis broke down the math across prompt types and confirmed the gap is consistent enough that any team migrating from 4.6 should run a tokenizer comparison on their workloads before switching production traffic. Prompt caching (up to 90% savings) and batch processing (50% savings) still apply and can offset the increase for eligible workloads, but "same price" is not the same cost at scale.
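The arithmetic is easy to check on paper. Below is a rough sketch, assuming the inflation factor applies equally to input and output text (the article does not break it down that way) and using a made-up task size of 20,000 input and 2,000 output tokens. The function name and parameters are hypothetical. Prompt caching is left out because its savings depend on cache hit patterns the article does not describe.

```python
# Rates from the article: USD per million tokens, unchanged between 4.6 and 4.7.
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00
BATCH_DISCOUNT = 0.50  # "50% savings" for batch processing, per the article


def task_cost_usd(input_tokens: int, output_tokens: int,
                  inflation: float = 0.0, batch: bool = False) -> float:
    """Estimate the cost of one task.

    `inflation` is the fractional token increase from the new tokenizer
    (0.33 means 33% more tokens for the same text). Assumption: it applies
    uniformly to input and output.
    """
    scale = 1.0 + inflation
    cost = (input_tokens * scale * INPUT_USD_PER_MTOK
            + output_tokens * scale * OUTPUT_USD_PER_MTOK) / 1_000_000
    return cost * (1.0 - BATCH_DISCOUNT) if batch else cost


# Illustrative task: token counts as measured with the old tokenizer.
baseline = task_cost_usd(20_000, 2_000)                  # about $0.15
inflated = task_cost_usd(20_000, 2_000, inflation=0.33)  # about $0.1995
```

At 33% inflation the same task costs roughly a third more at an identical rate card. Replacing the made-up token counts with measured counts from a real tokenizer comparison is the point of the exercise.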
The Developer Backlash
Within 48 hours of launch, a Reddit post titled "Opus 4.7 is not an upgrade but a serious regression" had 2,300 upvotes. A post on X calling it no improvement over 4.6 collected 14,000 likes. The complaints center on three things: degraded instruction-following on multi-file edits, increased refusal and hedging in coding contexts, and reduced agentic reliability in Claude Code compared to 4.6.
Some of this is a mismatch between what changed and what developers expected. Opus 4.7 follows instructions more literally than 4.6 — apparently a deliberate design choice toward stricter prompt adherence rather than inferring intent. For developers whose prompts relied on 4.6's interpretive flexibility, the behavior change feels like a regression. For developers whose prompts are already precise, it's an improvement. The refusal and hedging complaints are harder to separate from noise. Anthropic hasn't published a behavioral changelog between versions, which means developers are comparing subjective experience against a model card that only describes capabilities. An independent performance tracker now monitors Opus 4.7 on SWE-bench Pro daily, created specifically because the official documentation doesn't cover this ground.
Who Should Move and Who Should Wait
One clarification on the context window before the verdict. On single-needle retrieval at one million tokens (NIAH-2), Opus 4.7 hits 89%. GPT-5.5 scores 96%. Gemini 3 Deep Think hits 99%. The number that matters more for production is multi-needle retrieval, where the model synthesizes information scattered across a long input. That degrades for every frontier model between 200K and 1M tokens. Independent testing puts Opus 4.7's effective band at 200K to 400K tokens. That is two to four times what Opus 4.6 handled reliably, but not the full million the spec sheet implies. Design for that band, not the headline number.
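Designing for the band can be as simple as a guard in front of the request. This is a minimal sketch under one reading of the figures above: inputs up to 200K tokens are treated as reliable, 200K to 400K as borderline, and anything larger as likely to degrade on multi-needle synthesis. The thresholds come from the article, while the function and its labels are hypothetical.

```python
# Thresholds taken from the article's figures. The three-way split is an
# interpretation of the "200K to 400K" effective band, not a published spec.
BAND_LOW_TOKENS = 200_000
BAND_HIGH_TOKENS = 400_000
CONTEXT_WINDOW_TOKENS = 1_000_000


def classify_prompt(prompt_tokens: int) -> str:
    """Label a prompt by where it falls relative to the effective band."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens > CONTEXT_WINDOW_TOKENS:
        return "exceeds_window"        # will not fit at all
    if prompt_tokens > BAND_HIGH_TOKENS:
        return "expect_degradation"    # fits, but multi-needle recall is at risk
    if prompt_tokens > BAND_LOW_TOKENS:
        return "borderline"            # inside the 200K to 400K band
    return "reliable"
```

A caller might split or summarize the input whenever the label is `expect_degradation`, rather than relying on the full window.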
Three weeks out, Opus 4.7 leads coding benchmarks, trails on browsing and web-synthesis tasks, and costs more per task than 4.6 despite an identical rate card. No single model dominates the frontier right now: GPT-5.5 leads agentic terminal workflows and BrowseComp, while DeepSeek V4 Pro matches both on competitive programming at a seventh of the output cost.
For teams already on Opus 4.6, the upgrade depends on two things: whether the coding gains justify the tokenizer-driven cost increase on your specific workloads, and whether your prompts are precise enough to benefit from stricter instruction-following rather than being broken by it. Run the tokenizer comparison first. The model that leads SWE-bench Pro and the model that trails GPT-5.5 by ten points on BrowseComp are the same model. Route accordingly.