Anthropic has released Claude 3.7 Sonnet, and its most significant new feature is extended thinking: a mode where the model reasons through a problem internally before generating a response. This isn't just a UX tweak. It fundamentally changes how Claude approaches complex, multi-step problems, and the results are impressive.
What Extended Thinking Actually Does
In extended thinking mode, Claude 3.7 works through a problem using a scratchpad of intermediate reasoning before producing its final answer. Users can see this reasoning process, which makes the model's logic transparent and auditable. For tasks like complex code debugging, mathematical proofs, and nuanced analysis, this approach significantly reduces errors compared to single-pass generation.
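To make the mechanics concrete, here is a minimal sketch of how a request enabling extended thinking might be assembled for the Anthropic Messages API, and how the visible reasoning could be separated from the final answer in the response. The model name, budget value, and exact block shapes are assumptions to be checked against the current API docs, not a definitive implementation.

```python
def build_thinking_request(prompt: str, budget_tokens: int = 16000) -> dict:
    """Assemble a Messages API payload with extended thinking enabled.

    The `thinking` parameter asks the model to reason in a scratchpad
    before answering; `budget_tokens` caps how long it may think.
    Model name and budget here are illustrative placeholders.
    """
    return {
        "model": "claude-3-7-sonnet-20250219",
        "max_tokens": 4096,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }


def split_response_blocks(content: list[dict]) -> tuple[str, str]:
    """Separate visible reasoning ("thinking" blocks) from the final
    answer ("text" blocks) in a Messages API response's content list."""
    thinking = "".join(
        b.get("thinking", "") for b in content if b["type"] == "thinking"
    )
    answer = "".join(b.get("text", "") for b in content if b["type"] == "text")
    return thinking, answer
```

In practice the payload would be passed to an SDK call such as `client.messages.create(**build_thinking_request(...))`, and the returned content list fed to `split_response_blocks` so the audit trail and the answer can be logged separately.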
Benchmark Performance
On the GPQA Diamond benchmark, which tests graduate-level scientific reasoning, Claude 3.7 with extended thinking scores above 70%, placing it among the top models globally. On competitive coding benchmarks like SWE-bench Verified, it achieves state-of-the-art results, correctly resolving a majority of real-world GitHub issues presented as tests. These aren't cherry-picked numbers; they represent genuine capability improvements.
Coding: The Real Differentiator
Where Claude 3.7 stands out most is in software development. The combination of extended thinking and Claude's strong instruction-following makes it particularly effective at writing, reviewing, and debugging code. Developers using it via Cursor, Claude.ai, or the API report meaningfully fewer hallucinated functions and better handling of complex multi-file edits.
The Trade-off: Speed vs. Quality
Extended thinking comes with a latency cost. Enabling it can add several seconds to response time as the model works through its reasoning. For interactive use cases where speed matters, standard mode remains the better choice. But for asynchronous tasks (batch processing, complex analysis, agent workflows), the quality gains outweigh the wait. Anthropic has made this configurable, letting developers tune the thinking budget per request.