OpenAI o1 vs Claude 3.5 Sonnet vs Gemini Ultra: Reasoning Models Compared

TL;DR: OpenAI o1, Claude 3.5 Sonnet, and Gemini Ultra represent the current frontier of reasoning AI models. o1 leads on complex math and scientific reasoning; Claude 3.5 Sonnet excels at nuanced writing, coding, and instruction-following; Gemini Ultra shines on multimodal tasks and Google ecosystem integration. This deep comparison helps you choose the right reasoning model for your use case.

Why Reasoning Models Matter in 2025

The AI landscape shifted dramatically when OpenAI released o1 (then called “Strawberry”) in late 2024. Unlike traditional LLMs that respond immediately, reasoning models take additional computation time to “think through” problems — producing dramatically better results on complex tasks like mathematical proofs, multi-step logic puzzles, and advanced coding challenges.

Now we have three elite reasoning models to compare: OpenAI’s o1, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini Ultra. Each takes a different architectural approach, and each has distinct strengths.

Key Takeaways

  • OpenAI o1 uses chain-of-thought reasoning internally — best for math, science, and logical deduction
  • Claude 3.5 Sonnet balances reasoning with exceptional writing quality and instruction-following
  • Gemini Ultra leads on multimodal reasoning (images, video, audio) and Google integration
  • For coding tasks, all three perform at expert level — differences are in edge cases
  • Cost and latency differ significantly: o1 is the slowest and most expensive; Claude 3.5 Sonnet is the fastest

Model Overview

| Model | Developer | Context Window | Input Price | Output Price |
|---|---|---|---|---|
| o1 | OpenAI | 128K tokens | $15/M tokens | $60/M tokens |
| Claude 3.5 Sonnet | Anthropic | 200K tokens | $3/M tokens | $15/M tokens |
| Gemini Ultra | Google | 1M tokens | $7/M tokens | $21/M tokens |
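To put these rates in perspective, here is a small Python sketch that estimates the cost of a single request from the per-million-token prices listed above. The dictionary keys are informal labels for this article, not official API model IDs:

```python
# Per-million-token prices from the table above: (input $/M, output $/M).
PRICES = {
    "o1": (15.0, 60.0),
    "claude-3.5-sonnet": (3.0, 15.0),
    "gemini-ultra": (7.0, 21.0),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed rates."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 2,000-token prompt with a 1,000-token answer.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 1_000):.4f}")
```

At that request size, the same job costs about 4x more on o1 than on Claude 3.5 Sonnet, which is worth modeling before committing to a model for high-volume workloads.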

1. Mathematical Reasoning

Math is where reasoning models most clearly separate themselves from standard LLMs. Here’s how the three models perform on established benchmarks.

AIME (American Invitational Mathematics Examination)

AIME problems require multi-step mathematical reasoning that stumps most AI models. Results from published benchmarks:

  • OpenAI o1: 83.3% on AIME 2024 — a significant leap over previous models
  • Claude 3.5 Sonnet: ~71% — strong but notably below o1 on competition math
  • Gemini Ultra: ~67% — solid performance, improved with extended thinking

MATH Benchmark (Hendrycks)

  • o1: 94.8% — near-perfect on undergraduate-level math problems
  • Claude 3.5 Sonnet: 88.7% — excellent across most categories
  • Gemini Ultra: 90.0% — particularly strong in calculus and statistics

2. Logical Reasoning and Deduction

Beyond pure math, logical reasoning covers puzzles, constraint satisfaction, syllogisms, and multi-step deduction problems.

ARC-Challenge

The AI2 Reasoning Challenge tests science questions requiring inference beyond pattern matching:

  • o1: 96.7%
  • Claude 3.5 Sonnet: 95.2%
  • Gemini Ultra: 94.4%

LogiQA

Logical question answering from standardized tests:

  • Claude 3.5 Sonnet: 89.1% — best in class for reading comprehension + logic
  • o1: 87.4%
  • Gemini Ultra: 85.9%

Interestingly, Claude 3.5 Sonnet edges out o1 on language-heavy logical reasoning — suggesting o1’s reasoning advantage is most pronounced in formal mathematical domains.

3. Coding Performance

All three models are exceptional coders. The question is which performs better on the hardest problems.

HumanEval and MBPP

| Benchmark | o1 | Claude 3.5 Sonnet | Gemini Ultra |
|---|---|---|---|
| HumanEval (pass@1) | 92.4% | 92.0% | 87.8% |
| MBPP | 88.9% | 91.7% | 85.9% |
| SWE-bench Verified | 48.9% | 49.0% | 44.2% |
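For readers unfamiliar with the pass@1 metric: HumanEval scores use the unbiased pass@k estimator introduced alongside the benchmark. For each problem you generate n samples, count the c that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = samples generated per problem, c = samples that passed."""
    if n - c < k:
        return 1.0  # fewer failures than draws, so a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per problem, of which 5 pass, pass@1 is 0.5.
print(pass_at_k(10, 5, 1))
```

The benchmark-wide score is this quantity averaged over all problems, so a 92.4% pass@1 means the model's first attempt passes the tests on roughly 92% of problems.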

For real-world software engineering tasks (SWE-bench), o1 and Claude 3.5 Sonnet are essentially tied. Claude 3.5 Sonnet often wins on code quality and instruction adherence, while o1 wins on algorithmic correctness for complex problems.

4. Creative Tasks and Writing Quality

Reasoning ability doesn’t just matter for STEM — it also influences creative quality, narrative coherence, and writing polish.

Writing Quality Assessment

In blind human evaluations of long-form writing tasks:

  • Claude 3.5 Sonnet: Consistently rated highest for prose quality, nuance, and tone control
  • Gemini Ultra: Strong on structured writing (reports, summaries) and factual accuracy
  • OpenAI o1: Solid but sometimes overly formal; the reasoning focus can make creative outputs feel mechanical

For marketing copy, blog posts, storytelling, and anything requiring human voice, Claude 3.5 Sonnet is the clear winner among reasoning models.

5. Multimodal Reasoning

All three models can process images, but their capabilities and depth differ significantly.

| Capability | o1 | Claude 3.5 Sonnet | Gemini Ultra |
|---|---|---|---|
| Image understanding | Strong | Strong | Best-in-class |
| Video reasoning | Limited | Limited | Native support |
| Audio understanding | No | No | Yes |
| Chart/graph analysis | Good | Very Good | Excellent |

If multimodal reasoning is your priority, Gemini Ultra is the clear choice — it natively handles video and audio in ways the other two models simply cannot.

6. Speed and Cost Analysis

Reasoning models trade speed for accuracy. Here’s what to expect in practice:

| Model | Avg Response Time | Cost per 1M Output Tokens | Best For |
|---|---|---|---|
| o1 | 30–60 seconds | $60 | Hard math, science, logic |
| Claude 3.5 Sonnet | 5–15 seconds | $15 | Coding, writing, general use |
| Gemini Ultra | 10–25 seconds | $21 | Multimodal, Google integration |

Which Model Should You Choose?

  • Choose o1 if you’re solving hard mathematical, scientific, or logical problems where accuracy trumps speed and cost
  • Choose Claude 3.5 Sonnet if you need fast, high-quality responses for coding, writing, analysis, or instruction-following at reasonable cost
  • Choose Gemini Ultra if you work heavily with images, video, audio, or need deep Google Workspace integration

Frequently Asked Questions

Is OpenAI o1 better than Claude 3.5 Sonnet?

o1 outperforms Claude 3.5 Sonnet on formal mathematical and scientific reasoning tasks. Claude 3.5 Sonnet is faster, cheaper, and better at writing and instruction-following. The “better” model depends entirely on your use case.

What is a reasoning model?

Reasoning models use extended internal computation (often called “thinking” or “chain-of-thought”) before generating a response, allowing them to solve complex multi-step problems more accurately than standard language models.

Can I use these models via API?

Yes — all three are available via API. OpenAI API for o1, Anthropic API for Claude 3.5 Sonnet, and Google AI Studio/Vertex AI for Gemini Ultra.
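As a starting point, here is a minimal sketch of calling two of the providers with their official Python SDKs (`openai` and `anthropic`). The model IDs shown are assumptions; verify the current IDs in each provider's documentation. Gemini Ultra access goes through Google AI Studio or Vertex AI and is omitted here for brevity:

```python
def build_messages(prompt: str) -> list[dict]:
    """Both chat APIs accept this role/content message shape."""
    return [{"role": "user", "content": prompt}]

def ask_o1(prompt: str) -> str:
    # pip install openai; reads OPENAI_API_KEY from the environment.
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="o1",  # assumed model ID; check OpenAI's model list
        messages=build_messages(prompt),
    )
    return resp.choices[0].message.content

def ask_claude(prompt: str) -> str:
    # pip install anthropic; reads ANTHROPIC_API_KEY from the environment.
    import anthropic
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed alias; check Anthropic's docs
        max_tokens=1024,
        messages=build_messages(prompt),
    )
    return resp.content[0].text
```

Note that the Anthropic Messages API requires an explicit `max_tokens`, while the OpenAI API does not; otherwise the request shapes are nearly identical.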

How does Gemini Ultra compare to GPT-4?

Gemini Ultra 1.0 roughly matches GPT-4 Turbo on most benchmarks, with advantages in multimodal tasks. Gemini Ultra 1.5 with its 1M context window is substantially ahead on long-context tasks.

Which reasoning model is most cost-effective?

Claude 3.5 Sonnet offers the best performance-to-price ratio for most use cases. At $3/M input and $15/M output tokens, it delivers near-o1 performance at roughly 1/4 the cost.

