
OpenAI introduces GPT-4o: its new flagship model that's fast, multimodal, and feels more human

May 14, 2024

Is this our decade's 'Hey Siri' moment?

OpenAI has unveiled GPT-4o, an "omnimodel" that's a significant step forward in human-computer interaction.

The AI can now hear the emotion in your voice and respond with emotion of its own. And it can analyze video in real time.

Suddenly, other AI chatbots will seem quaint.

With the ability to analyze video in real time, GPT-4o can understand and respond to visual cues. Whether it's identifying objects, reading facial expressions, or providing real-time feedback during video calls, this model will take visual intelligence to a whole new level.
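
For a rough sense of what that looks like under the hood, here is a minimal sketch of sending a single captured video frame to GPT-4o through OpenAI's chat completions API. The frame path and prompt are illustrative assumptions, and the real-time demos effectively stream many frames like this alongside live audio:

```python
import base64
from openai import OpenAI

client = OpenAI()

# Illustrative: encode one captured video frame (hypothetical file) as base64
with open("frame.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What do you see in this frame, and what mood does the person appear to be in?",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```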

Gone are OpenAI's days of clunky AI voice processing. Previously, a voice conversation with ChatGPT involved a multi-step pipeline behind the scenes: record the audio, transcribe it to text, feed the text to a language model, generate a text response, and then convert that text back to speech. Latency was typically 2 to 5 seconds. GPT-4o is multimodal at its core: audio, text, and visual inputs are all processed by a single neural network trained end to end. That drastically cuts response times, down to roughly 200-300 ms, which is close to the pace of natural human conversation.
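
For contrast, here is a minimal sketch of that older three-hop pipeline using OpenAI's Python SDK. The model names, voice, and file paths are illustrative assumptions rather than what ChatGPT actually runs, but the structure shows why latency stacked up: every hop is a separate model call and network round trip.

```python
from openai import OpenAI

client = OpenAI()

# Hop 1: transcribe the user's recorded audio to text (hypothetical file path)
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Hop 2: send the transcript to a text-only chat model
chat = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = chat.choices[0].message.content

# Hop 3: synthesize speech from the text reply
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
with open("reply.mp3", "wb") as out:
    out.write(speech.read())
```

GPT-4o collapses those three hops into a single model that takes audio in and produces audio out, which is where most of the latency savings, and the preserved tone and emotion, come from.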

GPT-3.5 was fast. GPT-4 was smart. GPT-4o is both fast and smart.

And it feels more human than ever before.

The evaluation metrics for GPT-4o range from solid to impressive, showcasing its advanced capabilities across various benchmarks when compared to GPT-4 Turbo, Google's Gemini, Meta's Llama 3, and Anthropic's Claude 3 models.

Vision Understanding Evals (source: OpenAI)

Text Evaluation Performance (source: OpenAI)

Did I mention that GPT-4o will be free for all users? You don't need the monthly paid plan to get access. Custom GPTs will also now be accessible to everyone regardless of subscription status. The Plus plan still has some benefits, like higher usage caps and earlier access to some features.

Let's dive into some of the fascinating demos that show off the new model's capabilities.

Talking with GPT-4o while it observes its surroundings:

AI assisting a blind man with navigation in London:

Harmonizing with two GPT-4o's. As a musician, I'd like to have a word: the demo is cool, but do we really want to call this harmony?

Showing off some vocal emotive variation:

Sarcasm with GPT-4o. Our ChatGPT conversations will never be the same:

So, we didn't get a new search engine, as had been rumored in recent days. And we didn't get GPT-5 (not yet). But we did get a sleek, powerful new flagship model that feels like a turning point in human-AI interaction. Assistants that can see their surroundings and pick up on emotional nuance will make older models seem outdated. We're stepping into a world where AI is more engaging, sees more clearly, and understands us better.

Hello 2001. Let's hope HAL doesn't turn on us.

Read OpenAI's full Spring Update here: https://openai.com/index/spring-update/