OpenAI o1 vs Claude 3.5 Sonnet vs Gemini Ultra: Reasoning Models Compared

TL;DR: OpenAI o1, Claude 3.5 Sonnet, and Gemini Ultra represent the current frontier of reasoning AI models. o1 leads on complex math and scientific reasoning; Claude 3.5 Sonnet excels at nuanced writing, coding, and instruction-following; Gemini Ultra shines on multimodal tasks and Google ecosystem integration. This deep comparison helps you choose the right reasoning model for your use case.

Why Reasoning Models Matter in 2025

The AI landscape shifted dramatically when OpenAI released o1 (then called “Strawberry”) in late 2024. Unlike traditional LLMs that respond immediately, reasoning models take additional computation time to “think through” problems — producing dramatically better results on complex tasks like mathematical proofs, multi-step logic puzzles, and advanced coding challenges.

Now we have three elite reasoning models to compare: OpenAI’s o1, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini Ultra. Each takes a different architectural approach, and each has distinct strengths.

Key Takeaways

  • OpenAI o1 uses chain-of-thought reasoning internally — best for math, science, and logical deduction
  • Claude 3.5 Sonnet balances reasoning with exceptional writing quality and instruction-following
  • Gemini Ultra leads on multimodal reasoning (images, video, audio) and Google integration
  • For coding tasks, all three perform at expert level — differences are in edge cases
  • Cost and latency differ significantly: o1 is the slowest and most expensive; Claude 3.5 Sonnet is the fastest

Model Overview

| Model | Developer | Context Window | Input Price | Output Price |
|---|---|---|---|---|
| o1 | OpenAI | 128K tokens | $15/M tokens | $60/M tokens |
| Claude 3.5 Sonnet | Anthropic | 200K tokens | $3/M tokens | $15/M tokens |
| Gemini Ultra | Google | 1M tokens | $7/M tokens | $21/M tokens |
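To put these rates in perspective, here is a small Python sketch that estimates the cost of a single request from the per-million-token prices listed above. The dictionary keys are informal labels for this article, not official API model IDs:

```python
# Per-million-token prices from the table above: (input $/M, output $/M).
PRICES = {
    "o1": (15.0, 60.0),
    "claude-3.5-sonnet": (3.0, 15.0),
    "gemini-ultra": (7.0, 21.0),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at the listed rates."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a 2,000-token prompt with a 1,000-token answer.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 2_000, 1_000):.4f}")
```

At that request size, the same job costs about 4x more on o1 than on Claude 3.5 Sonnet, which is worth modeling before committing to a model for high-volume workloads.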

1. Mathematical Reasoning

Math is where reasoning models most clearly separate themselves from standard LLMs. Here’s how the three models perform on established benchmarks.

AIME (American Invitational Mathematics Examination)

AIME problems require multi-step mathematical reasoning that stumps most AI models. Results from published benchmarks:

  • OpenAI o1: 83.3% on AIME 2024 — a significant leap over previous models
  • Claude 3.5 Sonnet: ~71% — strong but notably below o1 on competition math
  • Gemini Ultra: ~67% — solid performance, improved with extended thinking

MATH Benchmark (Hendrycks)

  • o1: 94.8% — near-perfect on undergraduate-level math problems
  • Claude 3.5 Sonnet: 88.7% — excellent across most categories
  • Gemini Ultra: 90.0% — particularly strong in calculus and statistics

2. Logical Reasoning and Deduction

Beyond pure math, logical reasoning covers puzzles, constraint satisfaction, syllogisms, and multi-step deduction problems.

ARC-Challenge

The AI2 Reasoning Challenge tests science questions requiring inference beyond pattern matching:

  • o1: 96.7%
  • Claude 3.5 Sonnet: 95.2%
  • Gemini Ultra: 94.4%

LogiQA

Logical question answering from standardized tests:

  • Claude 3.5 Sonnet: 89.1% — best in class for reading comprehension + logic
  • o1: 87.4%
  • Gemini Ultra: 85.9%

Interestingly, Claude 3.5 Sonnet edges out o1 on language-heavy logical reasoning — suggesting o1’s reasoning advantage is most pronounced in formal mathematical domains.

3. Coding Performance

All three models are exceptional coders. The question is which performs better on the hardest problems.

HumanEval and MBPP

| Benchmark | o1 | Claude 3.5 Sonnet | Gemini Ultra |
|---|---|---|---|
| HumanEval (pass@1) | 92.4% | 92.0% | 87.8% |
| MBPP | 88.9% | 91.7% | 85.9% |
| SWE-bench Verified | 48.9% | 49.0% | 44.2% |
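For readers unfamiliar with the pass@1 metric: HumanEval scores use the unbiased pass@k estimator introduced alongside the benchmark. For each problem you generate n samples, count the c that pass the unit tests, and estimate the probability that at least one of k randomly drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = samples generated per problem, c = samples that passed."""
    if n - c < k:
        return 1.0  # fewer failures than draws, so a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per problem, of which 5 pass, pass@1 is 0.5.
print(pass_at_k(10, 5, 1))
```

The benchmark-wide score is this quantity averaged over all problems, so a 92.4% pass@1 means the model's first attempt passes the tests on roughly 92% of problems.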

For real-world software engineering tasks (SWE-bench), o1 and Claude 3.5 Sonnet are essentially tied. Claude 3.5 Sonnet often wins on code quality and instruction adherence, while o1 wins on algorithmic correctness for complex problems.

4. Creative Tasks and Writing Quality

Reasoning ability doesn’t just matter for STEM — it also influences creative quality, narrative coherence, and writing polish.

Writing Quality Assessment

In blind human evaluations of long-form writing tasks:

  • Claude 3.5 Sonnet: Consistently rated highest for prose quality, nuance, and tone control
  • Gemini Ultra: Strong on structured writing (reports, summaries) and factual accuracy
  • OpenAI o1: Solid but sometimes overly formal; the reasoning focus can make creative outputs feel mechanical

For marketing copy, blog posts, storytelling, and anything requiring human voice, Claude 3.5 Sonnet is the clear winner among reasoning models.

5. Multimodal Reasoning

All three models can process images, but their capabilities and depth differ significantly.

| Capability | o1 | Claude 3.5 Sonnet | Gemini Ultra |
|---|---|---|---|
| Image understanding | Strong | Strong | Best-in-class |
| Video reasoning | Limited | Limited | Native support |
| Audio understanding | No | No | Yes |
| Chart/graph analysis | Good | Very Good | Excellent |

If multimodal reasoning is your priority, Gemini Ultra is the clear choice — it natively handles video and audio in ways the other two models simply cannot.

6. Speed and Cost Analysis

Reasoning models trade speed for accuracy. Here’s what to expect in practice:

| Model | Avg Response Time | Cost per 1M Output Tokens | Best For |
|---|---|---|---|
| o1 | 30–60 seconds | $60 | Hard math, science, logic |
| Claude 3.5 Sonnet | 5–15 seconds | $15 | Coding, writing, general use |
| Gemini Ultra | 10–25 seconds | $21 | Multimodal, Google integration |

Which Model Should You Choose?

  • Choose o1 if you’re solving hard mathematical, scientific, or logical problems where accuracy trumps speed and cost
  • Choose Claude 3.5 Sonnet if you need fast, high-quality responses for coding, writing, analysis, or instruction-following at reasonable cost
  • Choose Gemini Ultra if you work heavily with images, video, audio, or need deep Google Workspace integration

Frequently Asked Questions

Is OpenAI o1 better than Claude 3.5 Sonnet?

o1 outperforms Claude 3.5 Sonnet on formal mathematical and scientific reasoning tasks. Claude 3.5 Sonnet is faster, cheaper, and better at writing and instruction-following. The “better” model depends entirely on your use case.

What is a reasoning model?

Reasoning models use extended internal computation (often called “thinking” or “chain-of-thought”) before generating a response, allowing them to solve complex multi-step problems more accurately than standard language models.

Can I use these models via API?

Yes — all three are available via API. OpenAI API for o1, Anthropic API for Claude 3.5 Sonnet, and Google AI Studio/Vertex AI for Gemini Ultra.
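As a starting point, here is a minimal sketch of calling two of the providers with their official Python SDKs (`openai` and `anthropic`). The model IDs shown are assumptions; verify the current IDs in each provider's documentation. Gemini Ultra access goes through Google AI Studio or Vertex AI and is omitted here for brevity:

```python
def build_messages(prompt: str) -> list[dict]:
    """Both chat APIs accept this role/content message shape."""
    return [{"role": "user", "content": prompt}]

def ask_o1(prompt: str) -> str:
    # pip install openai; reads OPENAI_API_KEY from the environment.
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model="o1",  # assumed model ID; check OpenAI's model list
        messages=build_messages(prompt),
    )
    return resp.choices[0].message.content

def ask_claude(prompt: str) -> str:
    # pip install anthropic; reads ANTHROPIC_API_KEY from the environment.
    import anthropic
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed alias; check Anthropic's docs
        max_tokens=1024,
        messages=build_messages(prompt),
    )
    return resp.content[0].text
```

Note that the Anthropic Messages API requires an explicit `max_tokens`, while the OpenAI API does not; otherwise the request shapes are nearly identical.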

How does Gemini Ultra compare to GPT-4?

Gemini Ultra 1.0 roughly matches GPT-4 Turbo on most benchmarks, with advantages in multimodal tasks. Gemini Ultra 1.5 with its 1M context window is substantially ahead on long-context tasks.

Which reasoning model is most cost-effective?

Claude 3.5 Sonnet offers the best performance-to-price ratio for most use cases. At $3/M input and $15/M output tokens, it delivers near-o1 performance at roughly 1/4 the cost.

