GPT-4o Guide: How it Works, Use Cases, Pricing, Benchmarks
OpenAI announced its latest large language model, GPT-4o, the successor to GPT-4 Turbo. Read on to discover its capabilities, performance, and how you might want to use it.
What Is OpenAI’s GPT-4o?
GPT-4o is OpenAI’s latest LLM. The 'o' in GPT-4o stands for "omni"—Latin for "every"—referring to the fact that this new model can accept prompts that are a mixture of text, audio, images, and video. Previously, the ChatGPT interface used separate models for different content types.
For example, when speaking to ChatGPT via Voice Mode, your speech would be converted to text using Whisper, a text response would be generated using GPT-4 Turbo, and that text response would be converted to speech with TTS.
A comparison of how GPT-4 Turbo and GPT-4o process speech input
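To make the contrast concrete, here is a minimal sketch of that three-model pipeline wired together by hand with the OpenAI Python SDK. The file names are placeholders, and this only illustrates the idea of chaining Whisper, GPT-4 Turbo, and TTS; it is not how OpenAI implemented Voice Mode internally.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Step 1: speech-to-text with Whisper
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: text reasoning with GPT-4 Turbo
reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)

# Step 3: text-to-speech with TTS
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("answer.mp3")
Each step is a separate model call and a separate network round trip, which is where much of the latency in the old Voice Mode came from. With GPT-4o, the whole exchange is, in principle, a single model call.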
Similarly, working with images in ChatGPT involved a mix of GPT-4 Turbo and DALL-E 3.
Having a single model for different content media promises increased speed and quality of results, a simpler interface, and some new use cases.
What is GPT-4o mini?
GPT-4o mini is a leaner, faster version of GPT-4o, designed for tasks where speed and efficiency matter most. It's derived from the larger GPT-4o model through a process called distillation.
While it retains much of the original model’s ability to process multimodal inputs—text, audio, and images—GPT-4o mini is optimized for lightweight applications where faster response times are crucial.
It’s particularly useful for developers needing a cost-effective solution for coding, debugging, and real-time interactions that don’t require the full computational power of GPT-4o.
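As a rough sketch of where the mini model fits, a lightweight code-review helper might call it through the OpenAI Python SDK like this. The prompt, the buggy snippet, and the setup are illustrative assumptions rather than an official example.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# A deliberately buggy snippet for the model to review (made up for illustration)
buggy_snippet = """
def mean(values):
    return sum(values) / len(values) + 1
"""

# gpt-4o-mini trades some reasoning depth for lower cost and latency,
# which is usually fine for quick checks like this
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise code reviewer."},
        {"role": "user", "content": f"Spot the bug in this function:\n{buggy_snippet}"},
    ],
)
print(response.choices[0].message.content)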
You can read more details in this article about GPT-4o mini.
What Makes GPT-4o Different to GPT-4 Turbo?
The all-in-one model approach means that GPT-4o overcomes several limitations of the previous voice interaction capabilities.
1. Tone of voice is now considered, facilitating emotional responses
With the previous OpenAI system of combining Whisper, GPT-4 Turbo, and TTS in a pipeline, the reasoning engine, GPT-4, only had access to the spoken words. This method meant that things like tone of voice, background noises, and knowledge of voices from multiple speakers were simply discarded. As such, GPT-4 Turbo couldn’t really express responses with different emotions or styles of speech.
By having a single model that can reason about text and audio, this rich audio information can be used to provide higher-quality responses with a greater variety of speaking styles.
In the following example provided by OpenAI, GPT-4o provides sarcastic output.
2. Lower latency enables real-time conversations
The existing three-model pipeline meant that there was a small delay ("latency") between speaking to ChatGPT and getting a response.
OpenAI shared that the average latency of Voice Mode is 2.8 seconds for GPT-3.5 and 5.4 seconds for GPT-4. By contrast, the average latency for GPT-4o is 0.32 seconds, nine times faster than GPT-3.5 and 17 times faster than GPT-4.
This decreased latency is close to the average human response times (0.21 seconds) and is important for conversational use cases, where there is a lot of back and forth between the human and AI, and the gaps between responses add up.
This feature feels reminiscent of Google launching Instant, its auto-complete for search queries, in 2010. While searching doesn't take a long time, being able to save a couple of seconds every time you use it makes the product experience better.
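You can get a rough feel for response latency yourself by timing how long the API takes to return its first streamed token. This is a sketch using the OpenAI Python SDK, and it measures text-completion latency rather than the end-to-end Voice Mode latency quoted above.
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

start = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
)

# Time until the first piece of content arrives: a rough proxy for perceived latency
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"First token after {time.perf_counter() - start:.2f} seconds")
        break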
One use case that becomes more viable with GPT-4o’s decreased latency is real-time translation of speech. OpenAI presented a use case of two colleagues, one an English speaker and the other a Spanish speaker, communicating by having GPT-4o translate their conversation.
3. Integrated vision enables descriptions of a camera feed
In addition to the voice and text integration, GPT-4o has image and video features included. This means that if you give it access to a computer screen, it can describe what is shown onscreen, answer questions about the onscreen image, or act as a co-pilot for your work.
In a video from OpenAI featuring Sal Khan from Khan Academy, GPT-4o assists with Sal's son's math homework.
Beyond working with a screen, if you give GPT-4o access to a camera, perhaps your smartphone, it can describe what it sees.
A longer demo presented by OpenAI combines all these features. Two smartphones running GPT-4o hold a conversation. One GPT has access to the smartphone cameras and describes what it can see to another GPT that cannot see.
The result is a three-way conversation between a human and two AIs. The video also includes a section with the AIs singing, something that was not possible with previous models.
4. Better tokenization for non-Roman alphabets provides greater speed and value for money
One step in the LLM workflow is when the prompt text is converted into tokens. These are units of text that the model can understand.
In English, a token is typically one word or piece of punctuation, although some words can be broken down into multiple tokens. On average, three English words take up about four tokens.
If language can be represented in the model with fewer tokens, fewer calculations need to be made, and the speed of generating text is increased.
Further, since OpenAI charges for its API per token input or output, fewer tokens mean a lower price to the API users.
GPT-4o has an improved tokenizer that needs fewer tokens to represent the same text. The improvement is most noticeable in languages that don't use the Roman alphabet.
For example, Indian languages, in particular, have benefitted, with Hindi, Marathi, Tamil, Telugu, and Gujarati all showing reductions in tokens by 2.9 to 4.4 times. Arabic showed a 2x token reduction, and East Asian languages like Chinese, Japanese, Korean, and Vietnamese showed token reductions between 1.4x and 1.7x.
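You can inspect this difference yourself with OpenAI's tiktoken library, assuming a recent version that includes the o200k_base encoding used by GPT-4o (GPT-4 and GPT-4 Turbo use cl100k_base). The sample sentences are just illustrations; exact counts depend on the text.
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o

samples = {
    "English": "How are you today?",
    "Hindi": "आज आप कैसे हैं?",
}

for language, text in samples.items():
    print(f"{language}: {len(old_enc.encode(text))} tokens with cl100k_base, "
          f"{len(new_enc.encode(text))} with o200k_base")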
5. Rollout to the free plan
With OpenAI's existing pricing strategy for ChatGPT, users have to pay to access the best model: GPT-4 Turbo has only been available on the Plus and Enterprise paid plans.
This is changing, with OpenAI promising to make GPT-4o available on the free plan as well. Plus users will get five times as many messages as users on the free plan.
The rollout will be gradual, with red team (testers who try to break the model to find problems) access beginning immediately and further users gaining access over time.
6. Launch of the ChatGPT desktop app
While this isn’t necessarily an update exclusive to GPT-4o, OpenAI also announced the release of the ChatGPT desktop app. The updates in latency and multimodality mentioned above, alongside the release of the app, mean that the way we work with ChatGPT is likely going to change. For example, OpenAI showed a demo of an augmented coding workflow using voice and the ChatGPT desktop app. Scroll down in the use-cases section to see that example in action!
How Does GPT-4o Work?
Many content types, one neural network
Details of how GPT-4o works are still scant. The only detail that OpenAI provided in its announcement is that GPT-4o is a single neural network that was trained on text, vision, and audio input.
This new approach differs from the previous technique of having separate models trained on different data types.
However, GPT-4o isn't the first model to take a multimodal approach. In 2022, Tencent created SkillNet, a model that combined LLM transformer features with computer vision techniques to improve the ability to recognize Chinese characters.
In 2023, a team from ETH Zurich, MIT, and Stanford University created WhisBERT, a variation on the BERT series of large language models. While not the first, GPT-4o is considerably more ambitious and powerful than either of these earlier attempts.
Is GPT-4o a radical change from GPT-4 Turbo?
How radical the changes are to GPT-4o's architecture compared to GPT-4 Turbo depends on whether you ask OpenAI's engineering or marketing teams. In April, a bot named "im-also-a-good-gpt2-chatbot" appeared on LMSYS's Chatbot Arena, a leaderboard for the best generative AIs. That mysterious AI has now been revealed to be GPT-4o.
The "gpt2" part of the name is important. Not to be confused with GPT-2, a predecessor of GPT-3.5 and GPT-4, the "2" suffix was widely regarded to mean a completely new architecture for the GPT series of models.
Evidently, someone in OpenAI's research or engineering team thinks that combining text, vision, and audio content types into a single model is a big enough change to warrant the first version number bump in six years.
On the other hand, the marketing team has opted for a relatively modest naming change, continuing the "GPT-4" convention.
GPT-4o Performance vs Other Models
OpenAI released benchmark figures of GPT-4o compared to several other high-end models.
- GPT-4 Turbo
- GPT-4 (initial release)
- Claude 3 Opus
- Gemini Pro 1.5
- Gemini Ultra 1.0
- Llama 3 400B
Of these, only three models really matter for comparison: GPT-4 Turbo, Claude 3 Opus, and Gemini Pro 1.5 have spent the last few months vying for the top spot on the LMSYS Chatbot Arena leaderboard.
Llama 3 400B may be a contender in the future, but it isn't finished yet, so here we only present the results for these three models and GPT-4o.
The results of six benchmarks were used.
- Massive Multitask Language Understanding (MMLU). Tasks on elementary mathematics, US history, computer science, law, and more. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability.
- Graduate-Level Google-Proof Q&A (GPQA). Multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 74% accuracy.
- MATH. Middle school and high school mathematics problems.
- HumanEval. A test of the functional correctness of computer code, used for checking code generation.
- Multilingual Grade School Math (MGSM). Grade school mathematics problems, translated into ten languages, including underrepresented languages like Bengali and Swahili.
- Discrete Reasoning Over Paragraphs (DROP). Questions that require understanding complete paragraphs. For example, by adding, counting, or sorting values spread across multiple sentences.
Performance of GPT-4o, GPT-4 Turbo, Gemini Pro 1.5, and Claude 3 Opus against six LLM benchmarks. Scores for each benchmark range from 0 to 100. Recreated from data provided by OpenAI. No data was provided for Gemini Pro 1.5 for the GPQA benchmark.
GPT-4o gets the top score in four of the benchmarks, though it is beaten by Claude 3 Opus on the MGSM benchmark and by GPT-4 Turbo on the DROP benchmark.
If you look closely at the GPT-4o numbers compared to GPT-4 Turbo, you'll see that the performance increases are only a few percentage points.
It's an impressive boost for one year later, but it's far from the dramatic jumps in performance from GPT-1 to GPT-2 or GPT-2 to GPT-3.
Being 10% better at reasoning about text year-on-year is likely to be the new normal. The low-hanging fruit has been picked, and it's just difficult to continue with big leaps in text reasoning.
On the other hand, these LLM benchmarks don't capture performance on multimodal problems. The approach is so new that we don't yet have good ways of measuring how well a model performs across text, audio, and vision.
Overall, GPT-4o's performance is impressive, and it shows promise for the new approach of multimodal training.
What Are GPT-4o Use-Cases?
1. GPT-4o for data analysis & coding tasks
Recent GPT models and their derivatives, like GitHub Copilot, are already capable of providing code assistance, including writing code and explaining and fixing errors. The multi-modal capabilities of GPT-4o allow for some interesting opportunities.
In a promotional video hosted by OpenAI CTO Mira Murati, two OpenAI researchers, Mark Chen and Barret Zoph, demonstrated using GPT-4o to work with some Python code.
The code is shared with GPT as text, and the voice interaction feature is used to get GPT to explain the code. Later, after running the code, GPT-4o's vision capability is used to explain the plot.
Overall, showing ChatGPT your screen and speaking a question is a potentially simpler workflow than saving a plot as an image file, uploading it to ChatGPT, then typing a question.
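If you want something similar through the API rather than by screen sharing, GPT-4o accepts images alongside text in a chat request. A minimal sketch, assuming the OpenAI Python SDK and a plot you have saved locally as plot.png:
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Encode a saved plot so it can be sent inline as a data URL
with open("plot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this plot show, and is anything unusual?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)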
2. GPT-4o for real-time translation
Get ready to take GPT-4o on vacation. The low latency speech capabilities of GPT-4o mean that real-time translation is now possible (if you have roaming data on your cellphone plan!). This means that traveling in countries where you don't speak the language just got a lot easier.
3. Roleplay with GPT-4o
ChatGPT has already been a useful tool for roleplaying scenarios, whether you are preparing for a job interview for that dream career in data or training your sales team to sell your product better.
Until now, it has worked best for text-only roleplays, which isn't ideal for those use cases. The improved speech capabilities mean that spoken roleplay is now a viable option.
4. GPT-4o for assisting visually impaired users
GPT-4o's ability to understand video input from a camera and verbally describe the scene could be a must-have feature for visually impaired people. It's essentially the audio description feature that TVs have, but for real life.
Hands-On With GPT-4o
I’ve had access to some of GPT-4o’s new features since just after the announcement (sadly, no voice chat yet), and I’ve been impressed with many of its outputs. Responses seem faster and more consistent, and it seems to understand my requests better than it did previously. That’s not to say it’s been perfect, though.
Here are some examples of the interactions I had with ChatGPT-4o:
Data analysis task
First, using voice chat, I asked if it had any ideas about how to analyze the performance of the soccer team I support, the mighty Leeds United. As well as providing several options, it gave me some example Python code:
import pandas as pd
# Sample data for Leeds United's match results
data = {
    'Match': ['Match1', 'Match2', 'Match3', 'Match4', 'Match5'],
    'Result': ['Win', 'Draw', 'Loss', 'Win', 'Win'],
    'Goals Scored': [2, 1, 0, 3, 2]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Summarize match results
summary = df['Result'].value_counts()
total_goals = df['Goals Scored'].sum()
average_goals = df['Goals Scored'].mean()
# Display summary
print("Match Results Summary:")
print(summary)
print("\nTotal Goals Scored:", total_goals)
print("Average Goals Per Match:", round(average_goals, 2))
However, when I probed this train of thought in more detail, things got a little off. I first asked for some real data to use: it searched the web and found two good sources, but it reported the stats incorrectly. Leeds played 46 games in the regular season, scoring 81 goals with a goal difference of plus 38, not the 40 games it claimed in its response.
I then asked ChatGPT to visualize the goals scored against each team:
Again, it’s half-completed the task here. It has created a visualization as requested, which looks fine on the surface. But in reality, a lot of the data is made up and inaccurate (teams appearing twice, goals not counted, and teams not in the same division as Leeds).
In fairness, I imagine performance would have been better had I provided a full dataset myself, but I wish it would say that rather than confidently make up answers.
Image analysis
Next, I asked GPT-4o to analyze a picture of one of my plants. I still don’t have access to the integrated vision feature, so I had to take a picture and ask ChatGPT what plant it was:
This isn’t a bad effort, although it’s not quite accurate. While it is a bonsai tree, it’s an Ilex crenata rather than a Carmona retusa. Still, the two look pretty similar, so it’s an easy mistake to make, and I appreciated the extra context on how to take care of the plant.
Image generation
Finally, I wanted to test out the image abilities of the new model. I first showed it a picture of my tortoise, Darwin, and asked it to tell me about my friend:
Again, this is close but not perfect. Darwin is actually a Horsfield’s tortoise rather than a Hermann’s, but they do look very similar. I then asked ChatGPT-4o to take the original image and recreate it in the style of Hokusai. Here’s the result:
A pretty good effort, although there isn’t much actual resemblance to the original image; I guess that’s fair enough. It also took a little while to generate this one.
Overall, though, I was impressed with the responsiveness of the new model and how well it understood my requests. It’s far from flawless, and it still confidently hallucinates at times, but I can’t wait to get hands-on with the improved speech and integrated vision.
GPT-4o Limitations & Risks
Regulation for generative AI is still in its early stages; the EU AI Act is the only notable legal framework in place so far. That means that companies creating AI need to make some of their own decisions about what constitutes safe AI.
OpenAI has a preparedness framework that it uses to determine whether or not a new model is fit to release to the public.
The framework tests four areas of concern.
- Cybersecurity. Can AI increase the productivity of cybercriminals and help create exploits?
- CBRN. Can the AI assist experts in creating chemical, biological, radiological, or nuclear threats?
- Persuasion. Can the AI create (potentially interactive) content that persuades people to change their beliefs?
- Model autonomy. Can the AI act as an agent, performing actions with other software?
Each area of concern is graded Low, Medium, High, or Critical, and the model's score is the highest of the grades across the four categories.
OpenAI promises not to release a model that is of critical concern, though this is a relatively low safety bar: under its definitions, a critical concern corresponds to something that would upend human civilization. GPT-4o comfortably avoids this, scoring Medium concern.
Imperfect output
As with all generative AIs, the model doesn't always behave as intended. Computer vision is not perfect, and so interpretations of an image or video are not guaranteed to work.
Likewise, transcriptions of speech are rarely 100% correct, particularly if the speaker has a strong accent or technical words are used.
OpenAI provided a video of some outtakes where GPT-4o did not work as intended.
Notably, translation between two non-English languages was one of the cases where it failed. Other problems included unsuitable tone of voice (being condescending) and speaking the wrong language.
Accelerated risk of audio deepfakes
The OpenAI announcement notes that "We recognize that GPT-4o’s audio modalities present a variety of novel risks." In a lot of ways, GPT-4o can accelerate the rise of deepfake scam calls, where AI impersonates celebrities, politicians, and people's friends and family. This is a problem that will only get worse before it is fixed, and GPT-4o has the power to make deepfake scam calls even more convincing.
To mitigate this risk, audio output is only available in a selection of preset voices.
Presumably, technically minded scammers can use GPT-4o to generate text output and then use their own text-to-speech model, though it's unclear if that would still gain the latency and tone-of-voice benefits that GPT-4o provides.
GPT-4o Release Date
As of July 19, 2024, many features of GPT-4o have gradually rolled out. The text and image capabilities are available to many users on both the free and Plus plans, including ChatGPT accessed in mobile browsers, and the text and vision features of GPT-4o are already available via the API.
These features are also broadly available in the iOS and Android mobile apps. However, we're still awaiting the new Voice Mode built on GPT-4o, as well as audio and video support in the API. The ChatGPT desktop app for macOS is gradually being rolled out to Plus users, and a Windows desktop application is planned for later this year.
Below is a summary of the GPT-4o release dates:
- Announcement of GPT-4o: May 13, 2024
- GPT-4o text and image capabilities rollout: Starting May 13, 2024
- GPT-4o availability in free tier and Plus users: Starting May 13, 2024
- API access for GPT-4o (text and vision): Starting May 13, 2024
- GPT-4o availability on Mac desktop for Plus users: Coming weeks (starting May 13, 2024)
- New version of Voice Mode with GPT-4o in alpha: Coming weeks/months (after May 13, 2024)
- API support for audio and video capabilities: Coming weeks/months (after May 13, 2024)
- GPT-4o mini: July 18, 2024
However, after the controversy caused by the demo of the new voice capabilities, it seems OpenAI is being cautious about the release. According to its updated blog post: "Over the upcoming weeks and months, we’ll be working on the technical infrastructure, usability via post-training, and safety necessary to release the other modalities. For example, at launch, audio outputs will be limited to a selection of preset voices and will abide by our existing safety policies."
How Much Does GPT-4o Cost?
Despite being faster than GPT-4 Turbo with better vision capabilities, GPT-4o will be around 50% cheaper than its predecessor. According to the OpenAI website, using the model will cost $5 per million tokens for input and $15 per million tokens for output.
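As a quick back-of-the-envelope check, here is what those rates mean for a single request. The token counts below are illustrative:
# GPT-4o API pricing at launch, in USD per million tokens
INPUT_PRICE = 5.00
OUTPUT_PRICE = 15.00

# Illustrative request: a long prompt and a medium-length answer
input_tokens = 2_000
output_tokens = 500

cost = (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000
print(f"Estimated cost: ${cost:.4f}")  # roughly $0.0175 for this request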
How Can I Access GPT-4o in the Web Version of ChatGPT?
The user interface for ChatGPT has changed. All messages in ChatGPT default to using GPT-4o, and the model can be changed to GPT-3.5 using a toggle underneath the response.
What Does GPT-4o Mean for the Future?
There are two schools of thought about where AI should head towards. One is that AI should get ever more powerful and be able to accomplish a wider range of tasks. The other is that AI should get better at solving specific tasks as cheaply as possible.
OpenAI's mission to create artificial general intelligence (AGI), as well as its business model, put it firmly in the former camp. GPT-4o is another step towards that goal of ever more powerful AI.
This is the first generation of a completely new model architecture for OpenAI. That means that there is a lot for the company to learn and optimize over the coming months.
In the short term, expect new types of quirks and hallucinations, and in the long term, expect performance improvements, both in terms of speed and quality of output.
The timing of GPT-4o is interesting. Just as the tech giants have realized that Siri, Alexa, and Google Assistant aren't quite the money-making tools they once hoped for, OpenAI is hoping to make AI talkative again. In the best case, this will bring a raft of new use cases for generative AI. At the very least, you can now set a timer in whatever language you like.
Conclusion
GPT-4o represents further progress in generative AI, combining text, audio, and visual processing into one efficient model. This innovation promises faster responses, richer interactions, and a wider range of applications, from real-time translation to enhanced data analysis and improved accessibility for the visually impaired.
While there are initial limitations and risks, such as potential misuse in deepfake scams and the need for further optimization, GPT-4o is another step towards OpenAI's goal of artificial general intelligence. As it becomes more accessible, GPT-4o could change how we interact with AI, integrating into daily and professional tasks.
With its lower cost and enhanced capabilities, GPT-4o is poised to set a new standard in the AI industry, expanding the possibilities for users across various fields.
The future of AI is exciting, and now is as good a time as any to start learning how this technology works. If you’re new to the field, get started with our AI Fundamentals skill track, which covers actionable knowledge on topics like ChatGPT, large language models, generative AI, and more. You can also learn more about working with the OpenAI API in our hands-on course, or check out our full catalog of AI courses.
Richie helps individuals and organizations get better at using data and AI. He's been a data scientist since before it was called data science, and has written two books and created many DataCamp courses on the subject. He is a host of the DataFramed podcast, and runs DataCamp's webinar program.
FAQs
Can GPT-4o handle multilingual conversations?
Yes, GPT-4o can manage multilingual conversations, providing real-time translation between languages with its low-latency speech capabilities. However, in the demonstration, there were some errors when processing translations.
Does GPT-4o support all languages equally well?
While GPT-4o has improved tokenization for non-Roman alphabets, performance may vary across different languages, especially for those with less representation in training data.
How does GPT-4o handle background noise in audio input?
GPT-4o can consider background noise when processing audio input, potentially leading to more contextually aware responses.
Is GPT-4o capable of generating video content?
No, GPT-4o can analyze and describe video content but cannot generate new video content. OpenAI's Sora is the model that can generate video content.
Can GPT-4o mimic specific voices?
No, GPT-4o uses a selection of preset voices for audio output to mitigate risks such as deepfake scams.
How secure is the data input into GPT-4o?
OpenAI follows stringent security measures, but users should always be cautious and avoid sharing sensitive information.
Can GPT-4o be integrated into existing applications?
Yes, GPT-4o can be integrated into various applications via the OpenAI API, enabling enhanced functionalities across different platforms.
Check out our Working with the OpenAI API course to start your journey developing AI-powered applications.
When should I use GPT-4o Mini over GPT-4o?
GPT-4o mini is ideal for tasks that prioritize speed and efficiency over complex reasoning or multimodal interactions. It is well-suited for simple coding tasks, debugging, or quick responses in lightweight applications, making it a cost-effective alternative for projects that don’t require the full power of GPT-4o.
What are the differences between GPT-4o and the o1 model?
GPT-4o is designed for multimodal tasks, handling text, audio, and visual inputs efficiently. The o1 model, however, focuses on advanced reasoning and complex problem-solving, excelling in fields like coding and science. While GPT-4o offers versatility and speed, the o1 model prioritizes deep, logical processing for intricate reasoning tasks.