Anthropic just dropped its latest SOTA model, Opus 4.6, a mere couple of months after its last flagship, Opus 4.5, raised the bar for LLM-written code. We tried it out to see how it changes the day-to-day for engineering teams.
The biggest improvement: better task focus
Opus 4.6 can work longer and more continuously without dropping context. This is a big win for Claude Code and other coding agents, which can now work more independently with less supervision and correction.
Much of this can be attributed to 4.6’s 1-million-token context window (API users only; subscribers still get the 200k-token window). Longer tasks no longer hit that limit, so Opus 4.6 can maintain higher-quality output on more complex problems.
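To make that concrete, here’s a minimal sketch of sending a very large prompt through the Anthropic Python SDK. The model id `claude-opus-4-6` and the file name are placeholders, and long-context access may be gated behind a beta flag as it was for earlier models, so check the current API docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Example: hand the model an entire module in one request.
# "claude-opus-4-6" and "big_module.py" are placeholders.
with open("big_module.py") as f:
    module_source = f.read()

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=4096,
    messages=[
        {
            "role": "user",
            "content": "Refactor this module without changing its behavior:\n\n" + module_source,
        },
    ],
)

print(response.content[0].text)
```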
Token usage + efficiency
Opus models are notorious for their token consumption, and many users have to budget carefully what they use them for. Users on the Pro plan (the entry-level paid tier) might get only a few prompts before hitting their usage limits.
Claude already lets you toggle “effort”, i.e. how hard the model will work on a given prompt. Opus 4.6 automatically decides when to opt into “extended thinking” based on the context of a prompt. It also takes a smarter approach to context compaction, summarizing the conversation history so it occupies fewer tokens in the context window.
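If you’d rather control that tradeoff yourself, the Messages API still exposes an explicit thinking budget. A minimal sketch, again assuming the placeholder model id `claude-opus-4-6`:

```python
import anthropic

client = anthropic.Anthropic()

# Explicitly turn on extended thinking and cap how many tokens it may spend;
# per Anthropic, Opus 4.6 can also opt into it on its own if you leave this unset.
response = client.messages.create(
    model="claude-opus-4-6",  # placeholder model id
    max_tokens=8192,
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=[{"role": "user", "content": "Work out why this test suite is flaky."}],
)

# The response interleaves thinking blocks with the final text block(s).
for block in response.content:
    if block.type == "text":
        print(block.text)
```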
These improvements help offset the already-high token demand of a model like Opus 4.6. It’s still hungry, though, even more so than Opus 4.5 (likely due to longer task follow-through). To really take advantage of its coding capacity, you may need to upgrade to one of the higher tiers. Here’s how to keep track of your Claude Code token usage.
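On the API side, you can at least estimate spend before committing to a request; one rough approach is the SDK’s token-counting endpoint (placeholder model id again):

```python
import anthropic

client = anthropic.Anthropic()

# Count input tokens before sending, so a long refactoring prompt doesn't
# silently eat the budget. "claude-opus-4-6" is a placeholder model id.
count = client.messages.count_tokens(
    model="claude-opus-4-6",
    messages=[{"role": "user", "content": "Refactor the payments module."}],
)
print(f"This prompt would consume {count.input_tokens} input tokens")
```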
Software engineering task performance
With the optimizations above, Opus 4.6 is now better suited for long-running engineering tasks, especially refactoring and test writing, provided you have the token budget. Previously, Gemini 3 Pro was the model of choice for massive-scale tasks in large codebases thanks to its 1M-token context window, but Opus may now overtake it (for API users, at least).
It benchmarks well (highest marks on Terminal-Bench, BrowseComp, and OSWorld), and engineering leads have spoken very highly of the beta. However, Anthropic may still need to rebalance token spend and token allotments across its plans so subscribers can use these models to their full capacity.
Environments for Claude Code
Claude Code works best when you give it access to on-demand preview environments so it can validate the code it writes. With Opus 4.6’s ability to get into a “flow state”, you can let it work longer and more continuously: pushing code to environments, viewing the changes live, pulling logs, and running tests.
Shipyard makes these workflows easy, for you and for Claude. Try it free today for 30 days, and watch the quality of your agent-written code improve.