
Claude Opus 4.7: what changed and what to watch
Anthropic shipped Opus 4.7 amid a regression controversy. Strong benchmarks, frank admissions, and the questions the field still has to answer.
Claude Opus 4.7: what changed and what to watch
The release lands during a regression controversy, with strong benchmarks and one rare admission: there's a more capable model Anthropic chose not to ship yet.
Context
Anthropic shipped Claude Opus 4.7 today, April 16, 2026. The release lands in an unusual moment. Two weeks ago, Stella Laurenzo, senior director of AMD's AI group, opened a GitHub issue with an analysis of 6,852 Claude Code sessions, 234,760 tool calls, and 17,871 thinking blocks. Her conclusion: Claude Code had started reading far less before editing (reads-per-edit dropped from 6.6 to 2.0), was stopping early, and getting stuck in reasoning loops. Summary: it "cannot be trusted" for complex engineering.
Boris Cherny, who leads Claude Code at Anthropic, thanked her for the rigor and rejected the central conclusion. The company denies touching the model.
That's the mood 4.7 ships into.
What arrives with Opus 4.7
Anthropic calls 4.7 its most capable generally available model, focused on long-horizon agentic work, coding, and vision. The self-reported numbers:
| Benchmark | Opus 4.6 | Opus 4.7 |
|---|---|---|
| CursorBench | 58% | 70% |
| XBOW Visual-Acuity | 54.5% | 98.5% |
| Rakuten-SWE-Bench | baseline | 3× more tasks solved |
| Terminal-Bench 2.0 | no baseline | 3 tasks no prior Claude could solve |
On vision, the max image resolution jumped from 1,568 px / 1.15 MP to 2,576 px / 3.75 MP. For screenshot, document, and "computer use" workflows, the difference is tangible.
Third-party evaluations that surfaced in the first hours:
- Box (Yashodha Bhavnani, Head of AI): 56% fewer model calls and 50% fewer tool calls vs 4.6. Efficiency gain, not just quality.
- UK AI Security Institute: tested Mythos Preview (Anthropic's unreleased internal model) on "The Last Ones", a 32-step network attack simulation. Mythos completed 3 of 10 attempts, averaging 22 steps. Opus 4.6 averaged 16.
Benchmarks are still benchmarks. A vendor reporting its own tests is always a signal to cross-check with independent evidence.
What's new inside Claude Code
The part that hits daily CLI use:
/ultrareview: a new command that opens a dedicated code review session. Three uses included for Pro and Max subscribers. Past that, normal quotas apply.- Default effort raised to
xhigh. 4.7 thinks more out of the box. That raises quality and token usage. For cost-sensitive workloads, calibrate explicitly. - Auto Mode extended to the Max plan. Before, only Team and Enterprise had full access to the long-running automatic mode.
- Pricing unchanged: $5 per million input tokens, $25 output. No long-context premium up to 1M tokens of context.
Cherny on Threads: "more agentic, more precise, and a lot better at long-running work. Carries context across sessions and handles ambiguity much better." In another post, he added that 4.7 "feels more intelligent" and that it took a few days for him personally to learn how to work with the new capabilities.
Breaking changes in the Messages API
If you call 4.7 directly through the API (not Claude Code or Managed Agents), watch out. Anthropic removed three things that worked in 4.6:
- Explicit thinking budgets are gone. Passing
thinking: {"type": "enabled", "budget_tokens": N}returns 400. The only supported thinking-on mode is nowadaptive, and it's off by default: you have to explicitly setthinking: {type: "adaptive"}to turn it on. - Sampling parameters removed. Setting
temperature,top_p, ortop_kto any non-default value returns 400. The recommendation is to omit them and use prompting to shape behavior. - Thinking content omitted by default. Thinking blocks still stream, but the
thinkingfield is empty unless you opt in withdisplay: "summarized". If your product shows reasoning to users, there'll be a long pause before output begins.
And the one almost everyone will feel in the wallet: new tokenizer. 4.7 uses up to 35% more tokens than 4.6 for the same text (varies by content). count_tokens returns different numbers. Updating max_tokens and compaction triggers is basically mandatory.
Another useful budgeting knob: task budgets (beta). Instead of a hard per-request cap (max_tokens), you hand the model a token target for the entire agentic loop (thinking + tool calls + results + output). The model sees a running countdown and scopes its work. Enable with the beta header task-budgets-2026-03-13. Minimum 20k tokens per budget.
Behavior changes (no breakage)
Not breaking changes, but they'll force a prompt review:
- More literal instruction following, especially at lower effort. The model won't silently generalize from one item to another, and won't infer requests you didn't make.
- Response length calibrates to perceived task complexity instead of defaulting to fixed verbosity.
- Fewer tool calls by default, using reasoning instead. Raising effort brings tool calls back.
- More direct, less validation-forward tone, with fewer emoji than 4.6.
- Fewer subagents spawned by default. Steerable via prompting.
- Real-time cyber safeguards. High-risk security requests may be refused. Legitimate security work goes through the Cyber Verification Program.
If you had scaffolding like "double-check the slide layout before returning" or forced interim status messages, try removing it. Anthropic's advice: re-baseline.
Mythos behind the curtain
The most interesting admission came from the official post itself: Opus 4.7 isn't the most capable model the company has. Claude Mythos exists, focused on cybersecurity, restricted to a hand-picked group of tech and security firms under an internal program called Project Glasswing.
Anthropic argues the 4.7 guardrails were calibrated to intentionally reduce cyber capabilities, while it learns in real production how to detect and block malicious requests. Mythos leaves the cocoon when that detection is solid enough for a broad release.
Translation: Anthropic is running a controlled drift. It ships a more "domesticated" version for the general public and watches the world try to break it, while the frontier model stays in a closed circle. That's an honest way to stage the capability-vs-safety trade-off, which competitors tend to sweep under the rug.
Community Response
On Hacker News, the official thread mixed measured enthusiasm with chronic skepticism. The most-upvoted points:
- "Progress doesn't seem to be plateauing like some predicted" (grandinquistor): common among those reading the CursorBench and Rakuten numbers.
- "Ah, here we go again" (hansmayer): fatigue with the incremental release cycle.
- "Is 11% more on SWE-bench Pro more problems solved or 11% fewer hallucinations?" (jameson): the question every benchmark skips.
On Threads, Cherny and the Claude Code team pushed editorial content hard: rapid-fire posts, an official "best practices for using Claude Opus 4.7" blog post, and an explicit plea to give it a few days before concluding "it didn't improve." Several users reported the same arc: first prompts felt less proactive; after tweaking scaffolding and raising effort, the perception flipped.
On Reddit and X, a good chunk of the conversation piggybacked on the 4.6 controversy: "4.7 is basically 4.6 without the artificial throttle." Anthropic denies throttling. The sentiment is structural distrust, not episodic anger, which is a harder problem to solve than a bug.
One data point worth flagging: 4.7 consumes more tokens than 4.6 for the same work (more thinking plus the new tokenizer). Per-token customers will feel it. Subscription users, less so.
In Practice
If you already have code calling 4.6 via the Messages API, the short migration path:
1. Swap the model.
In the request, change model to claude-opus-4-7.
2. Remove sampling parameters.
Drop temperature, top_p, top_k. If you relied on temperature = 0 for determinism, it never guaranteed identical output even on 4.6.
3. Migrate thinking.
Replace the old pattern:
# Before (Opus 4.6)
thinking = {"type": "enabled", "budget_tokens": 32000}
# After (Opus 4.7)
thinking = {"type": "adaptive"}
output_config = {"effort": "high"} # or "xhigh" for coding
If your product shows reasoning to the user, add display: "summarized" so the UI doesn't freeze waiting.
4. Give max_tokens headroom.
The new tokenizer can use up to 35% more tokens for the same text. Bump max_tokens and review compaction triggers.
5. Simplify legacy prompts.
Strip scaffolding like "check before returning" or forcing interim status. 4.7 does that by default. Re-baseline your evals.
6. (Optional) Adopt task budgets.
For agents running long loops, flip on the beta header task-budgets-2026-03-13 and pass a task_budget for the whole loop, instead of leaning only on max_tokens.
Tip: if you pay per token, run an A/B before migrating in bulk. 4.7 thinks more and the tokenizer is pricier per text. Quality gains can come with a real cost bump, especially for output-heavy workloads.
For Claude Code users, the update is automatic via claude update. Adjusting effort and using /ultrareview needs no extra config.
What to watch in the coming weeks
Benchmarks are a starting point. The real question: whether the jumps reported on CursorBench, Rakuten-SWE-Bench, and SWE-bench Pro show up in production work. Three signals worth tracking:
- Whether the 4.6 controversy reappears under another name. If low reads-per-edit, early stopping, and loops resurface, the root probably isn't the model: it's the harness.
- Whether companies running agents behind Claude Code (Box, AMD, Rakuten) publish post-4.7 telemetry. Those are the numbers that carry most weight in the debate.
- Whether Mythos leaves the cocoon. Anthropic tied the broad release of the frontier model to how fast its protections mature. How long that takes is a proxy for how aggressive the malicious-use landscape looks.
End Result
In a single release:
- Opus 4.7 live with CursorBench 70%, Rakuten-SWE-Bench 3× better, XBOW 98.5% on vision
/ultrareview,xhighas default, Auto Mode on the Max plan- Pricing unchanged at $5/$25 per million tokens, 1M context with no premium
- Serious Messages API breaking changes: thinking budgets, sampling params, thinking content by default
- New tokenizer using up to 35% more tokens per text
- Public admission: Mythos exists, it's more capable, and it ships when the guardrails are ready
Homework for the people shipping: migrate, measure, and report. That's the loop that pressures the next release to be honest about what actually improved.
References
- Introducing Claude Opus 4.7: official announcement with benchmarks and changes.
- What's new in Claude Opus 4.7 (docs): API technical details, breaking changes, task budgets.
- Claude Opus 4.7 is generally available (GitHub Changelog): rollout on Copilot Pro+, Business, and Enterprise.
- Anthropic releases Claude Opus 4.7, concedes it trails unreleased Mythos (Axios): editorial analysis of positioning and Mythos.
- Anthropic Preps Opus 4.7 and Full-Stack AI Studio (Decrypt): AI Security Institute context on Mythos and "The Last Ones".
- Claude Code has become dumber, lazier: AMD director (The Register): coverage of Stella Laurenzo's issue.
- Proving Claude Code's Quality Regression (lilting channel): original technical analysis with 17,871 thinking blocks.
- Claude Opus 4.7 (Hacker News thread): dev community reactions.
- Boris Cherny on Threads: Claude Code lead's statement.