In December 2024, OpenAI unveiled preliminary results for o3 — and the AI community went quiet for a moment before erupting. On ARC-AGI, the benchmark specifically designed to resist AI pattern-matching and to require genuine reasoning, o3 scored 87.5% at its highest compute setting. Humans average around 85%. For the first time, an AI system had surpassed average human performance on a benchmark explicitly designed to resist that outcome.

What Is ARC-AGI?

The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) was created by François Chollet, the creator of Keras and, at the time of the benchmark's creation, a researcher at Google. The benchmark presents visual pattern puzzles that require inferring an abstract rule from just a few example input/output grids and applying it to a new input. It is designed to test genuine reasoning rather than memorization. Previous AI systems struggled dramatically: GPT-4o scored in the single digits, and even o1 topped out around 32%. The benchmark was widely considered a test of reasoning that no language model could crack through sheer scale.
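The flavor of an ARC-style task can be sketched in a few lines of Python. This is a deliberately simplified toy, not the actual benchmark harness: a solver searches a small hand-written library of candidate transformations for one consistent with every training pair, then applies it to a new input. (Real ARC-AGI tasks involve far richer rules than these grid flips.)

```python
# Toy illustration of an ARC-style task: infer a transformation rule from
# a few input/output example grids, then apply it to an unseen test input.
# The candidate-rule library below is a made-up simplification for this sketch.

CANDIDATE_RULES = {
    "identity": lambda g: g,
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "flip_vertical": lambda g: g[::-1],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def infer_rule(train_pairs):
    """Return the first candidate rule consistent with every training pair."""
    for name, fn in CANDIDATE_RULES.items():
        if all(fn(inp) == out for inp, out in train_pairs):
            return name, fn
    return None, None

# A miniature task: the hidden rule is "flip each row left-to-right".
train = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]],      [[6, 5, 4]]),
]
name, rule = infer_rule(train)
print(name)                     # flip_horizontal
print(rule([[7, 8], [9, 0]]))   # [[8, 7], [0, 9]]
```

The point of the benchmark is precisely that a fixed rule library like this cannot work: each task demands a rule the solver has never seen, which is why brute memorization fails and scores stayed low for so long.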

How o3 Works

OpenAI has not fully disclosed o3's architecture, but it is clearly built on the foundations of the o1 series — models that use reinforcement learning to learn reasoning strategies rather than simply predicting the next token. o3 can be given different amounts of compute at inference — essentially, more time to think. The effect is dramatic: on ARC-AGI's semi-private evaluation, o3's score rose from 75.7% at the low compute setting to 87.5% at the high setting.
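OpenAI has not said how o3 spends its extra thinking time. One published technique in the same spirit is self-consistency sampling: draw many independent reasoning chains and take a majority vote over their answers, so that a larger inference budget buys higher accuracy. The sketch below simulates that idea with a deliberately noisy stand-in solver; it illustrates the scaling principle only, not o3's actual mechanism.

```python
import random
from collections import Counter

def noisy_solver(true_answer, rng, p_correct=0.6):
    """Stand-in for one sampled reasoning chain: right only 60% of the time."""
    if rng.random() < p_correct:
        return true_answer
    return true_answer + rng.choice([-1, 1])  # a plausible wrong answer

def solve_with_budget(true_answer, n_samples, rng):
    """Spend more inference compute by sampling more chains, then majority-vote."""
    votes = Counter(noisy_solver(true_answer, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
for budget in (1, 11, 101):
    trials = 200
    correct = sum(solve_with_budget(42, budget, rng) == 42 for _ in range(trials))
    print(f"{budget:>3} samples per question -> {correct / trials:.0%} accuracy")
```

Because wrong chains scatter across different wrong answers while correct chains agree, accuracy climbs steadily with the sampling budget — a simple model of why "more time to think" helps, and why it costs proportionally more.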

PhD-Level Performance

Beyond ARC-AGI, o3's performance across scientific benchmarks is remarkable. On GPQA Diamond — which contains PhD-level questions in biology, chemistry, and physics — o3 scored 87.7%. Human PhD-level experts in the relevant domain typically score around 69.7%. The model is outperforming domain experts on their own specialist knowledge, a result that would have seemed impossible three years ago.

On the FrontierMath benchmark, which contains research-level mathematical problems that take professional mathematicians hours or days to solve, o3 solved 25% of problems. No previous AI had solved more than 2%. While 25% falls well short of solving the benchmark, it is an unprecedented demonstration of AI mathematical capability on genuinely hard problems.

Implications for AGI

These results inevitably prompt the question: is this AGI? The answer depends on the definition. If AGI means AI that performs better than most humans on most cognitive tasks, then o3 is getting very close. If AGI means AI that can autonomously set goals and pursue them in the real world like a human, then we are still far away. François Chollet himself acknowledged that o3's performance requires re-evaluating what the benchmark measures, noting that o3 likely relies on searching over and synthesizing programs at test time rather than the generalized reasoning the benchmark was designed to probe.

The Cost Question

o3's capabilities come at a price. At high compute settings, a single query can cost orders of magnitude more than a standard GPT-4 call — the high-compute ARC-AGI runs reportedly cost thousands of dollars per task. OpenAI released o3-mini as a more affordable alternative that captures much of the reasoning improvement at a fraction of the cost. For enterprise use cases where the value of a correct answer justifies the spend — medical diagnosis, legal analysis, financial modeling — o3's pricing is easily justified. The emergence of efficient reasoning models like DeepSeek-R1 will likely drive pricing down over time.
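The "value justifies the cost" argument reduces to simple break-even arithmetic. The sketch below uses invented placeholder figures, not OpenAI's actual prices: if paying extra per query lifts accuracy by some number of percentage points, the extra spend pays off whenever a correct answer is worth at least the cost divided by the accuracy gain.

```python
# Back-of-the-envelope break-even sketch for expensive reasoning queries.
# All dollar figures below are hypothetical placeholders, not real prices.

def break_even_value(extra_cost_per_query, accuracy_gain):
    """Minimum dollar value of a correct answer for the extra spend to pay off.

    extra_cost_per_query: added cost of the high-compute query, in dollars.
    accuracy_gain: added probability of a correct answer (e.g. 0.15 = 15 points).
    """
    return extra_cost_per_query / accuracy_gain

# If a high-compute query costs $50 more and lifts accuracy by 15 points,
# it pays for itself whenever a correct answer is worth at least:
print(f"${break_even_value(50.0, 0.15):,.2f}")  # $333.33
```

By this logic the economics flip quickly in high-stakes domains: when a correct legal or diagnostic answer is worth thousands of dollars, even very expensive queries clear the bar.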

What Comes After o3?

OpenAI has already teased o4 and beyond. If the scaling law for reasoning continues to hold — more thinking time yields proportionally better results — the trajectory points toward AI systems that can match the best human experts on virtually any intellectual task. What is undeniable is that o3 represents a genuine inflection point in AI capability, one that will be referenced as a landmark for years to come.