Claude 3.7 Sonnet: Anthropic's Biggest Leap in Extended Thinking

Anthropic has always positioned itself as the safety-focused alternative to OpenAI, but with Claude 3.7 Sonnet, the company is also staking a claim as the performance leader. Released in early 2025, Claude 3.7 introduces a feature Anthropic calls extended thinking — a configurable mode that allows the model to reason through problems before producing a final response. The results are striking across virtually every benchmark that matters.

What Is Extended Thinking?

Extended thinking is Anthropic's implementation of chain-of-thought reasoning at scale. When enabled, Claude 3.7 generates an internal scratchpad — a stream of reasoning that the model uses to work through complex problems before committing to an answer. Users can see this reasoning process if they choose, gaining insight into how the model arrived at its conclusion.

This is not just a cosmetic feature. The extended thinking mode gives the model additional computational time to explore multiple solution paths, catch its own errors, and produce more carefully considered outputs. For tasks like complex coding, mathematical proofs, and nuanced analysis, the improvement is substantial and measurable.

Benchmark Performance

On the SWE-bench Verified benchmark — which tests an AI's ability to solve real GitHub issues in software projects — Claude 3.7 Sonnet with extended thinking achieved 70.3%, compared to 49% for Claude 3.5 Sonnet. This represents a massive leap in software engineering capability. For context, the best human programmers typically score around 80-85% on this benchmark.

On AIME 2024 math competition problems, Claude 3.7 with extended thinking scored 80%, up from 16% for standard Claude 3.5 Sonnet. The math improvement is particularly dramatic and illustrates how extended thinking unlocks capabilities that seem inaccessible to standard language model inference.

Coding Performance: The Headline Story

For developers, Claude 3.7's coding capabilities are the most compelling feature. The model has become the default recommendation among professional software engineers for complex coding tasks. Its ability to understand large codebases, refactor across multiple files, implement complex algorithms, and debug subtle issues has made it the go-to model for many development workflows.

Tools like Cursor IDE have integrated Claude 3.7 as their primary model. Anecdotal reports from developers suggest that Claude 3.7 with extended thinking can handle tasks that previously required senior engineer intervention — like designing new system architectures or debugging race conditions in concurrent code.

The Extended Thinking API

Anthropic exposed extended thinking through their API with configurable budget tokens — developers can set how much thinking the model is allowed to do before responding. A budget of 10,000 tokens allows for quick reasoning on moderately complex problems, while a budget of 100,000+ tokens enables deep exploration of very complex challenges.

This creates an interesting cost-quality tradeoff that developers can tune for their specific use case. A customer service chatbot might use standard mode (fast, cheap). A code review assistant might use extended thinking with a moderate budget. A research analysis tool might use maximum thinking for thorough exploration of complex questions.

Safety and Constitutional AI

True to Anthropic's mission, Claude 3.7 continues to demonstrate strong safety characteristics. The extended thinking mode actually improves safety in some respects — by thinking through potential harms before responding, the model is better at refusing genuinely harmful requests while being less likely to falsely refuse legitimate ones.

Competitive Positioning

Claude 3.7 Sonnet arrives at a time when the AI landscape is intensely competitive. OpenAI's o3, Google's Gemini 2.0, and DeepSeek's R1 are all competing for developer mindshare. Claude 3.7's advantage lies in its combination of strong reasoning, excellent coding, nuanced instruction following, and Anthropic's reputation for safety. Whether that combination is enough to maintain a competitive lead remains to be seen — but for now, Claude 3.7 represents one of the most capable AI systems available anywhere.