OpenAI o1 vs Claude 3.5 Sonnet vs Gemini Ultra: Reasoning Models Compared
Why Reasoning Models Matter in 2025
The AI landscape shifted when OpenAI released o1 (code-named “Strawberry”) in late 2024. Unlike traditional LLMs that respond immediately, reasoning models spend extra computation time “thinking through” a problem — producing dramatically better results on complex tasks like mathematical proofs, multi-step logic puzzles, and advanced coding challenges.
Now we have three elite reasoning models to compare: OpenAI’s o1, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini Ultra. Each takes a different architectural approach, and each has distinct strengths.
Key Takeaways
- OpenAI o1 uses chain-of-thought reasoning internally — best for math, science, and logical deduction
- Claude 3.5 Sonnet balances reasoning with exceptional writing quality and instruction-following
- Gemini Ultra leads on multimodal reasoning (images, video, audio) and Google integration
- For coding tasks, all three perform at expert level — differences are in edge cases
- Cost and latency differ significantly: o1 is slowest and most expensive; Claude 3.5 Sonnet is fastest
Model Overview
| Model | Developer | Context Window | Input Price | Output Price |
|---|---|---|---|---|
| o1 | OpenAI | 128K tokens | $15/M tokens | $60/M tokens |
| Claude 3.5 Sonnet | Anthropic | 200K tokens | $3/M tokens | $15/M tokens |
| Gemini Ultra | Google | 1M tokens | $7/M tokens | $21/M tokens |
1. Mathematical Reasoning
Math is where reasoning models most clearly separate themselves from standard LLMs. Here’s how the three models perform on established benchmarks.
AIME (American Invitational Mathematics Examination)
AIME problems require multi-step mathematical reasoning that stumps most AI models. Results from published benchmarks:
- OpenAI o1: 83.3% on AIME 2024 — a significant leap over previous models
- Claude 3.5 Sonnet: ~71% — strong but notably below o1 on competition math
- Gemini Ultra: ~67% — solid performance, improved with extended thinking
MATH Benchmark (Hendrycks)
- o1: 94.8% — near-perfect on undergraduate-level math problems
- Claude 3.5 Sonnet: 88.7% — excellent across most categories
- Gemini Ultra: 90.0% — particularly strong in calculus and statistics
2. Logical Reasoning and Deduction
Beyond pure math, logical reasoning covers puzzles, constraint satisfaction, syllogisms, and multi-step deduction problems.
ARC-Challenge
The AI2 Reasoning Challenge tests science questions requiring inference beyond pattern matching:
- o1: 96.7%
- Claude 3.5 Sonnet: 95.2%
- Gemini Ultra: 94.4%
LogiQA
Logical question answering from standardized tests:
- Claude 3.5 Sonnet: 89.1% — best in class for reading comprehension + logic
- o1: 87.4%
- Gemini Ultra: 85.9%
Interestingly, Claude 3.5 Sonnet edges out o1 on language-heavy logical reasoning — suggesting o1’s reasoning advantage is most pronounced in formal mathematical domains.
3. Coding Performance
All three models are exceptional coders. The question is which performs better on the hardest problems.
HumanEval and MBPP
| Benchmark | o1 | Claude 3.5 Sonnet | Gemini Ultra |
|---|---|---|---|
| HumanEval (pass@1) | 92.4% | 92.0% | 87.8% |
| MBPP | 88.9% | 91.7% | 85.9% |
| SWE-bench Verified | 48.9% | 49.0% | 44.2% |
For real-world software engineering tasks (SWE-bench), o1 and Claude 3.5 Sonnet are essentially tied. Claude 3.5 Sonnet often wins on code quality and instruction adherence, while o1 wins on algorithmic correctness for complex problems.
4. Creative Tasks and Writing Quality
Reasoning ability doesn’t just matter for STEM — it also influences creative quality, narrative coherence, and writing polish.
Writing Quality Assessment
In blind human evaluations of long-form writing tasks:
- Claude 3.5 Sonnet: Consistently rated highest for prose quality, nuance, and tone control
- Gemini Ultra: Strong on structured writing (reports, summaries) and factual accuracy
- OpenAI o1: Solid but sometimes overly formal; the reasoning focus can make creative outputs feel mechanical
For marketing copy, blog posts, storytelling, and anything requiring a human voice, Claude 3.5 Sonnet is the clear winner among reasoning models.
5. Multimodal Reasoning
All three models can process images, but their capabilities and depth differ significantly.
| Capability | o1 | Claude 3.5 Sonnet | Gemini Ultra |
|---|---|---|---|
| Image understanding | Strong | Strong | Best-in-class |
| Video reasoning | Limited | Limited | Native support |
| Audio understanding | No | No | Yes |
| Chart/graph analysis | Good | Very Good | Excellent |
If multimodal reasoning is your priority, Gemini Ultra is the clear choice — it natively handles video and audio in ways the other two models simply cannot.
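As a concrete starting point, here is a minimal sketch of image-based reasoning using Google’s google-generativeai Python SDK. The model ID string is an assumption, not a confirmed identifier; run genai.list_models() to see which Ultra-class models your account actually exposes.

```python
# Minimal sketch: chart analysis with the google-generativeai SDK.
# The model ID below is an assumption; run genai.list_models() to see
# which Ultra-class models your account exposes.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_GOOGLE_API_KEY")

model = genai.GenerativeModel("gemini-ultra")  # hypothetical model ID
chart = Image.open("quarterly_revenue.png")    # any local image file

# Text and images can be mixed freely in the same content list.
response = model.generate_content(
    [chart, "Summarize the trend in this chart and flag any anomalies."]
)
print(response.text)
```

The same content-list pattern extends to video and audio files uploaded through the SDK, which is where Gemini’s multimodal lead over the other two models shows up in practice.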
6. Speed and Cost Analysis
Reasoning models trade speed for accuracy. Here’s what to expect in practice:
| Model | Avg Response Time | Cost per 1M Output Tokens | Best For |
|---|---|---|---|
| o1 | 30–60 seconds | $60 | Hard math, science, logic |
| Claude 3.5 Sonnet | 5–15 seconds | $15 | Coding, writing, general use |
| Gemini Ultra | 10–25 seconds | $21 | Multimodal, Google integration |
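To translate those per-million-token prices into per-request dollars, here is a quick back-of-envelope sketch, with prices hard-coded from the tables above. One caveat: o1 in particular bills its hidden reasoning tokens as output tokens, so real o1 costs often run higher than this estimate suggests.

```python
# Back-of-envelope per-request cost estimate using the per-million-token
# prices from the tables above. o1 also bills hidden reasoning tokens as
# output, so its real-world costs tend to exceed this estimate.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "o1": (15.00, 60.00),
    "claude-3.5-sonnet": (3.00, 15.00),
    "gemini-ultra": (7.00, 21.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in dollars for a single request."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example: a 2,000-token prompt with a 1,500-token answer.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 1_500):.4f}")
```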
Which Model Should You Choose?
- Choose o1 if you’re solving hard mathematical, scientific, or logical problems where accuracy trumps speed and cost
- Choose Claude 3.5 Sonnet if you need fast, high-quality responses for coding, writing, analysis, or instruction-following at reasonable cost
- Choose Gemini Ultra if you work heavily with images, video, audio, or need deep Google Workspace integration
Frequently Asked Questions
Is OpenAI o1 better than Claude 3.5 Sonnet?
o1 outperforms Claude 3.5 Sonnet on formal mathematical and scientific reasoning tasks. Claude 3.5 Sonnet is faster, cheaper, and better at writing and instruction-following. The “better” model depends entirely on your use case.
What is a reasoning model?
Reasoning models use extended internal computation (often called “thinking” or “chain-of-thought”) before generating a response, allowing them to solve complex multi-step problems more accurately than standard language models.
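To make the distinction concrete, here is a minimal sketch of prompted chain-of-thought with a standard chat model, using the OpenAI Python SDK. A reasoning model such as o1 runs this kind of deliberation internally before answering, with no prompt scaffolding needed, and bills the hidden thinking as output tokens. The model ID here is illustrative.

```python
# Prompted chain-of-thought with a standard chat model: the step-by-step
# reasoning must be requested in the prompt and comes back in the reply.
# A reasoning model (e.g. o1) performs this deliberation internally and
# does not return its hidden reasoning tokens.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # a standard (non-reasoning) model
    messages=[{
        "role": "user",
        "content": (
            "Think step by step, showing your work: a train leaves at 3pm "
            "going 60 mph; a second leaves at 4pm going 80 mph on the same "
            "track. When does the second train catch up?"
        ),
    }],
)
print(response.choices[0].message.content)
```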
Can I use these models via API?
Yes — all three are available via API: OpenAI’s API for o1, Anthropic’s API for Claude 3.5 Sonnet, and Google AI Studio or Vertex AI for Gemini Ultra.
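For orientation, here is a minimal sketch of all three entry points in Python. The model ID strings are assumptions and change often; verify them against each provider’s current model list before relying on them.

```python
# Minimal "hello" calls against all three providers' Python SDKs.
# Model ID strings are assumptions; verify against each provider's docs.
import anthropic
import google.generativeai as genai
from openai import OpenAI

prompt = "In one sentence, what makes a reasoning model different?"

# OpenAI o1 (OPENAI_API_KEY read from the environment)
r1 = OpenAI().chat.completions.create(
    model="o1", messages=[{"role": "user", "content": prompt}]
)
print(r1.choices[0].message.content)

# Anthropic Claude 3.5 Sonnet (ANTHROPIC_API_KEY read from the environment)
r2 = anthropic.Anthropic().messages.create(
    model="claude-3-5-sonnet-latest",  # assumed alias
    max_tokens=256,
    messages=[{"role": "user", "content": prompt}],
)
print(r2.content[0].text)

# Gemini via Google AI Studio
genai.configure(api_key="YOUR_GOOGLE_API_KEY")
r3 = genai.GenerativeModel("gemini-ultra").generate_content(prompt)  # assumed ID
print(r3.text)
```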
How does Gemini Ultra compare to GPT-4?
Gemini Ultra 1.0 roughly matches GPT-4 Turbo on most benchmarks, with advantages in multimodal tasks. Gemini Ultra 1.5 with its 1M context window is substantially ahead on long-context tasks.
Which reasoning model is most cost-effective?
Claude 3.5 Sonnet offers the best performance-to-price ratio for most use cases. At $3/M input and $15/M output tokens, it delivers near-o1 performance at a fifth of o1’s input price and a quarter of its output price.