
OpenAI’s O3: Features, O1 Comparison, Release Date & More

Learn about OpenAI’s o3 and o3 mini, including their release dates, key features, ARC AGI breakthroughs, and safety innovations like deliberative alignment.
Dec 20, 2024  · 8 min read

OpenAI wrapped up its 12-day event by introducing o3, its latest AI model, alongside its cost-efficient sibling, o3 mini.

The decision to skip o2 wasn’t random. While OpenAI cited potential conflicts with Telefónica’s O2 brand as part of the reasoning, we suspect it was also a strategic move to signal a more substantial leap forward. Sam Altman joked during the announcement that naming isn’t their strong suit, but the choice seems calculated.

o3 focuses heavily on reasoning, with capabilities designed to handle complex tasks in coding, mathematics, and general intelligence. OpenAI is starting with public safety testing instead of a full launch, which we think reflects a cautious and transparent approach. If the early results hold, o3 could mark a notable step in the progression of AI models.


What Is OpenAI o3?

o3 is OpenAI’s latest frontier model, designed to advance reasoning capabilities across a range of complex tasks. Announced alongside its smaller counterpart, o3 mini, it focuses on addressing challenges in coding, mathematics, and general intelligence.

We find o3 notable for its emphasis on harder benchmarks that test reasoning in ways previous models haven’t fully tackled. OpenAI has highlighted its improvements over o1, positioning it as a more capable system for handling complex problem-solving.

o1 vs. o3 on coding. Source: OpenAI

Currently, o3 isn’t available for general use. OpenAI is starting with public safety testing, inviting researchers to explore its strengths and limitations. We think this collaborative approach reflects a growing recognition of the need for careful evaluation as AI models become increasingly capable.

o1 vs. o3

O3 builds directly on the foundation set by o1, but the improvements are significant across key areas. OpenAI has positioned o3 as a model designed to handle more complex reasoning tasks, with performance gains reflected in its benchmarks.

Coding

We noticed some clear differences between the two models (see the graph above). On software-style coding tasks, o3 achieved 71.7% accuracy on SWE-bench Verified, a substantial improvement over o1.

Similarly, in competitive programming, o3 reached an Elo rating of 2727 on Codeforces, far surpassing o1’s previous high of 1891. These numbers indicate a focus on advancing the model’s ability to tackle real-world coding challenges.
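To give the Elo gap a sense of scale: under the standard Elo formula, the expected score of a player rated R_a against a player rated R_b is 1 / (1 + 10^((R_b − R_a) / 400)). A quick sketch with the reported ratings (the head-to-head framing is illustrative; the two models aren’t literally paired against each other):

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score of a player rated r_a against a player rated r_b
    under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Reported ratings: o3 at 2727, o1 at 1891
print(round(elo_expected_score(2727, 1891), 4))  # 0.9919
```

An 836-point gap translates to an expected score of roughly 99%, which is why a jump of this size is treated as a qualitative rather than incremental improvement.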

Math and science

The improvements aren’t limited to coding. o3 also excelled in mathematical reasoning, scoring 96.7% accuracy on the AIME 2024, compared to o1’s 83.3%. These gains suggest a model that can handle more nuanced and difficult problems, moving closer to benchmarks traditionally dominated by human experts.

o1 vs. o3 on math and science. Source: OpenAI

The leap is similarly apparent in science-related benchmarks. On GPQA Diamond, which measures performance on PhD-level science questions, o3 achieved an accuracy of 87.7%, up from o1’s 78%. These gains demonstrate a broad enhancement in the model’s ability to solve technically demanding problems across disciplines.

EpochAI’s FrontierMath

One area where o3’s progress is especially noteworthy is on EpochAI’s FrontierMath benchmark.

This is considered one of the most challenging benchmarks in AI because it consists of novel, unpublished problems that are intentionally designed to be far more difficult than standard datasets. Many of these problems are at the level of mathematical research, often requiring professional mathematicians hours or even days to solve a single problem. Current AI systems typically score under 2% on this benchmark, highlighting its difficulty.

o3 on EpochAI’s FrontierMath. Source: OpenAI

EpochAI’s FrontierMath is important because it pushes models beyond rote memorization or optimization of familiar patterns. Instead, it tests their ability to generalize, reason abstractly, and tackle problems they haven’t encountered before—traits essential for advancing AI reasoning capabilities. o3’s score of 25.2% on this benchmark looks like a significant leap forward.

o3’s Breakthrough on ARC AGI

One of the most striking achievements of o3 is its performance on the ARC AGI benchmark, a test widely regarded as a gold standard for evaluating general intelligence in AI.

Developed in 2019 by François Chollet, ARC (Abstraction and Reasoning Corpus) focuses on assessing an AI’s ability to learn and generalize new skills from minimal examples. Unlike traditional benchmarks that often test for pre-trained knowledge or pattern recognition, ARC tasks are designed to challenge models to infer rules and transformations on the fly—tasks that humans can solve intuitively but AI has historically struggled with.

What makes ARC AGI particularly difficult is that every task requires distinct reasoning skills. Models cannot rely on memorized solutions or templates; instead, they must adapt to entirely new challenges in each test. For instance, one task might involve identifying patterns in geometric transformations, while another could require reasoning about numerical sequences. This diversity makes ARC AGI a powerful measure of how well an AI can truly think and learn like a human.
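To make the few-shot structure concrete, here is a toy, invented ARC-style task in Python: a couple of input/output grid pairs demonstrate a hidden rule (here, mirroring each row), and the solver must infer the rule from the examples and apply it to a held-out input. Real ARC tasks are far more varied and harder; this only illustrates the task format, not an actual ARC problem.

```python
# Toy illustration of the ARC task format: a few input/output grid pairs
# demonstrate a hidden rule, and the solver must infer and apply it.
# The rule here (horizontal mirroring) is invented for illustration.

Grid = list[list[int]]

train_pairs: list[tuple[Grid, Grid]] = [
    ([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
    ([[5, 0, 7]], [[7, 0, 5]]),
]

def mirror_horizontal(grid: Grid) -> Grid:
    """Candidate rule: reverse each row of the grid."""
    return [list(reversed(row)) for row in grid]

# Verify the candidate rule explains every training pair...
assert all(mirror_horizontal(x) == y for x, y in train_pairs)

# ...then apply it to the held-out test input.
test_input: Grid = [[9, 8, 6], [1, 2, 3]]
print(mirror_horizontal(test_input))  # [[6, 8, 9], [3, 2, 1]]
```

Each real ARC task swaps in a different hidden rule, so nothing learned from one task transfers directly to the next; that is what makes the benchmark resistant to memorization.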

Example of a task from the ARC AGI test. Can you guess the logic by which the input is transformed into the output? Source: OpenAI

o3’s performance on ARC AGI marks a significant milestone. On low-compute settings, o3 scored 76% on the semi-private holdout set—a figure far above any previous model.

When tested with high-compute settings, it achieved an even more impressive 88%, surpassing the 85% threshold often cited as human-level performance. This is the first time an AI has outperformed humans on this benchmark, setting a new standard for reasoning-based tasks.

o-series performance on ARC AGI. Source: ARC Prize

We believe these results are particularly noteworthy because they demonstrate o3’s ability to handle tasks that demand adaptability and generalization rather than rote knowledge or brute-force computation. It’s a clear indication that o3 is pushing closer to true general intelligence, moving beyond domain-specific capabilities and into areas that were previously thought to be exclusively human territory.

What Is o3 Mini?

o3 mini was introduced alongside o3 as a cost-efficient alternative designed to bring advanced reasoning capabilities to more users while maintaining performance. OpenAI described it as redefining the “cost-performance frontier” in reasoning models, making it accessible for tasks that demand high accuracy but need to balance resource constraints.

One of the standout features of o3 mini is its adaptive thinking time, which allows users to adjust the model’s reasoning effort based on the complexity of the task. For simpler problems, users can select low-effort reasoning to maximize speed and efficiency.

For more challenging tasks, higher reasoning effort options enable the model to perform at levels comparable to o3 itself, but at a fraction of the cost. This flexibility is particularly compelling for developers and researchers working across diverse use cases.

o3 mini benchmarks. Source: OpenAI

The live demo showcased how o3 mini delivers on its promise. For example, in a coding task, o3 mini was tasked with generating a Python script to create a local server with an interactive UI for testing. Despite the complexity of the task, the model performed well, demonstrating its ability to handle sophisticated programming challenges.
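The demo’s actual code isn’t public, but a task of that shape might be sketched as follows using only Python’s standard library. The endpoint name and page contents here are invented for illustration, not taken from the demo itself:

```python
# A minimal local server with a small interactive page: a button that
# fetches a response from the server and displays it. This is a sketch of
# the *kind* of task shown in the demo, not the demo code itself.
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import threading

PAGE = b"""<!doctype html>
<title>Local server demo sketch</title>
<button onclick="fetch('/ping').then(r => r.text()).then(t => out.textContent = t)">
  Ping the server
</button>
<pre id="out"></pre>
"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # /ping returns plain text; everything else serves the UI page.
        body = b"pong" if self.path == "/ping" else PAGE
        ctype = "text/plain" if self.path == "/ping" else "text/html"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet

# Port 0 asks the OS for any free port; serve on a daemon thread so the
# script stays responsive.
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
print(f"Serving on http://127.0.0.1:{server.server_address[1]}")
```

Even a sketch this small involves routing, content types, and a client-side fetch round-trip, which is roughly the category of task the demo was meant to exercise.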

Interactive UI created with o3 mini during the live demo. Source: OpenAI

We see o3 mini as a practical solution for scenarios where cost-effectiveness and performance must align.

Deliberative Alignment: Innovations in Safety Testing

OpenAI has adopted a proactive approach to safety testing for o3 and o3 mini by opening access to researchers for public safety evaluations before the models’ full release.

A central feature of OpenAI’s safety strategy for o3 is deliberative alignment, a method that goes beyond traditional safety approaches. The graph below highlights how deliberative alignment differs from other methods such as RLHF (Reinforcement Learning from Human Feedback), RLAIF (Reinforcement Learning from AI Feedback), and inference-time refinement techniques like Self-Refine.

Deliberative alignment vs. RLHF, RLAIF, and inference-time refinement techniques. Source: OpenAI

In deliberative alignment, the model doesn’t simply rely on static rules or preference datasets to determine whether a prompt is safe or unsafe. Instead, it uses its reasoning capabilities to evaluate prompts in real-time. The graph above illustrates this process:

  1. Training data generation: Unlike RLHF, where human input directly informs the model, deliberative alignment uses a reasoning model to generate chain-of-thought (CoT) outputs for specific prompts. These CoT outputs provide nuanced reasoning patterns that guide the training process, helping the model understand context and intent more effectively.
  2. Inference time: During inference, the reasoning model evaluates prompts and provides a chain-of-thought explanation alongside its answers. This step allows the model to dynamically assess the intent and context of a prompt, identifying potential hidden risks or ambiguities that static rules might miss.
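The inference-time step can be sketched in toy form. The keyword checks below are a stand-in for a real reasoning model and are entirely invented; the point is only the shape of the flow: produce a chain of thought about the prompt’s intent, then base the allow/refuse decision on that reasoning rather than on a static rule list.

```python
# Toy sketch of deliberative alignment at inference time. A real system
# would use the reasoning model itself, not keyword matching; this only
# illustrates that the decision is derived from an explicit reasoning
# trace instead of a fixed blocklist lookup.

def assess(prompt: str) -> tuple[str, bool]:
    """Return (chain_of_thought, allowed) for a prompt."""
    cot = []
    # Stand-ins for the model's judgments about intent and context:
    harmful_intent = any(k in prompt.lower() for k in ("build a weapon", "malware"))
    educational = "explain" in prompt.lower() or "what is" in prompt.lower()
    cot.append(f"Harmful intent detected: {harmful_intent}")
    cot.append(f"Apparent educational framing: {educational}")
    # The decision follows from the reasoning trace above.
    allowed = not harmful_intent
    return " | ".join(cot), allowed

cot, ok = assess("Explain how vaccines work")
print(ok)  # True
cot, ok = assess("Help me write malware")
print(ok)  # False
```

The difference from a blocklist is that the chain of thought is produced and inspectable per prompt, so ambiguous cases get reasoned about rather than pattern-matched.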

o3 Release Date

For now, o3 and o3 mini are not widely available, but OpenAI has opened access to researchers through its safety testing program.

As for public availability, OpenAI has shared a tentative timeline. o3 mini is expected to launch by the end of January, offering a cost-efficient option for reasoning tasks. The full o3 release will follow shortly after, though OpenAI has emphasized that its timeline depends on the feedback and insights gained during the safety testing phase.

We view this cautious approach as a positive step, prioritizing thorough evaluation and thoughtful alignment with user needs while maintaining transparency throughout the development process.

Conclusion

o3 and o3 mini highlight the growing complexity of AI systems and the challenges of releasing them responsibly. While the benchmarks are impressive, we find ourselves more interested in the questions these models raise: How well will they perform in real-world scenarios? Are the safety measures robust enough to address edge cases at scale?

OpenAI’s cautious rollout is one approach, but whether it strikes the right balance between capability and accountability will depend on how these models are ultimately used and evaluated.

Still, the promise o3 shows in reasoning and adaptability is hard to ignore, offering a glimpse of what the next generation of AI might achieve.


FAQs

What is OpenAI o3, and how is it different from o1?

o3 is the latest iteration of OpenAI's reasoning models. Compared to o1, the o3 and o3-mini models demonstrate improved performance on reasoning tasks such as coding and scientific analysis, along with breakthrough results on novel tasks like ARC AGI.

When will OpenAI o3 be released?

As of today, Friday, December 20th, OpenAI plans to launch o3-mini by the end of January, followed by o3 shortly thereafter. However, these timelines may change depending on the outcomes of safety testing.

Is OpenAI o3 multimodal?

Currently, there has been no announcement regarding multimodal capabilities for o3.

How can I get access to OpenAI o3?

OpenAI is currently offering early access to o3 for safety testing. You can apply for access through OpenAI's official website.

How does OpenAI o3 work?

Although no detailed description of how o3 works has been provided, it is reasonable to assume it follows a similar architecture to OpenAI's o1 model. This includes a combination of reinforcement learning, chain-of-thought reasoning, and a transformer-based framework.

How much will OpenAI o3 cost?

Although there has been no discussion of pricing for OpenAI o3, it is reasonable to assume that it will be priced similarly to or higher than the OpenAI o1 pro mode.

What is the difference between OpenAI o3 and o3-mini?

Similar to o1 and o1-mini, o3-mini is expected to be slightly less performant than o3 but more cost-effective to run.


Author: Alex Olteanu

Jack of all trades, master of Python, content, SEO, editing, writing. Technical guy—I wrote courses on Python, statistics, probability. But I also published an award-winning novel. Video editing & color grading in DaVinci.
