Gemini Omni: One Model for Text, Image, Audio, and Video

A first look at Google DeepMind's any-to-any model — what it does, what's new about it, and how to access it.

May 19, 2026 · 7 min read

The much-hyped Google I/O event did not disappoint. After a bit of preamble, and alongside Gemini 3.5 Flash and Gemini Spark, came the real highlight: Gemini Omni, a new family of natively multimodal, "any-to-any" AI models that process text, image, audio, and video together. In other words, Nano Banana, but for video. (This is how Google itself talked about it.)

The initial model is Omni Flash. It introduces conversational video editing for creating cohesive media from individual assets, so you can edit and remix videos through conversation rather than in editing software.

What Is Gemini Omni?

Let's understand what Omni is about in more detail.

Gemini Omni is Google DeepMind's first natively multimodal generative media model. The first variant in the family is Gemini Omni Flash, which is now live in:

The Gemini app,
Google Flow, and
YouTube Shorts

Omni accepts any combination of text, images, audio, and video as input and produces a video. The key here is that there's no relay happening across different systems; this is all one model.

Until now, Google ran a split stack: Veo for video, Imagen for images, and separate systems for audio. Omni collapses that into one model (hence the name!) that can reason across modalities. In practice, that translates to more coherent edits and fewer pipeline artifacts.

Before I get carried away, let's frame this in terms of key features.

Key Features of Gemini Omni

What stands out to me is the process. Both conversational editing and using sketches as blueprints are features we know from image generation tools like Nano Banana 2 or OpenAI's ChatGPT Images 2.0, but Omni brings them to video generation as well.

Conversational video editing

The headline capability is conversational video editing. You give Omni a clip (one you generated or one you shot on your phone), and you change it by talking to it. "Make the lights dim." "Change the camera angle to over her shoulder." "Turn the violin invisible." Each instruction persists across turns. The scene evolves; it doesn't reset.
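
To make the "persists across turns" idea concrete, here is a minimal Python sketch of a stateful editing session. It is purely illustrative: `EditSession`, `apply`, `effective_instructions`, and the file name are hypothetical, and nothing here calls Omni or reflects how Google implements it. The only point is that each turn builds on the accumulated history instead of starting again from the source clip.

```python
from dataclasses import dataclass, field


@dataclass
class EditSession:
    """Hypothetical model of a multi-turn editing conversation (not a real API)."""

    source_clip: str  # placeholder file name, assumed for illustration
    history: list[str] = field(default_factory=list)

    def apply(self, instruction: str) -> None:
        # Each new instruction is appended; earlier ones are never discarded.
        self.history.append(instruction)

    def effective_instructions(self) -> list[str]:
        # The scene at turn N reflects every instruction from turns 1..N.
        return list(self.history)


session = EditSession(source_clip="violinist.mp4")
session.apply("Make the lights dim.")
session.apply("Change the camera angle to over her shoulder.")
session.apply("Turn the violin invisible.")

for turn, instruction in enumerate(session.effective_instructions(), start=1):
    print(f"turn {turn}: {instruction}")
```

A stateless tool would only ever see the latest instruction; here the third turn still carries the dimmed lights and the new camera angle.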

Real-world physics and world knowledge

Google made a point of this at launch: it says Omni has an intuitive grasp of gravity, kinetic energy, and fluid dynamics. In my initial test, the generated video didn't quite obey real-world physics. Then again, this is the Flash version, and we expect an Omni Pro model to follow soon and be more competitive with tools like Seedance 2.0.

Sketches and drawings to video

You can turn doodles into realistic footage, using the sketch only as a movement guide rather than as a final visual reference. This will be useful in pre-production. Kids will love it for their drawings, too.

SynthID watermarking and C2PA content credentials

Every Omni output ships with two layers of provenance.

SynthID is an invisible watermark embedded directly into the pixels at generation time. It's imperceptible to viewers. And it's designed to survive cropping, filters, and re-encoding.
C2PA content credentials sit alongside it as a signed cryptographic manifest attached to the file. Strip the metadata, and the pixel-level signal still holds.

Worth flagging: SynthID is Google-specific. Only Google's own models embed it, so "not watermarked" doesn't mean "human-made" — it just means "not from a Google model." Given how easily Omni can remix real footage, durable provenance on every output starts to look less like a nice-to-have and more like table stakes.
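
The reasoning above can be written down as a small decision function. This is a sketch under stated assumptions: the two boolean inputs stand in for the results of a SynthID detector and a C2PA manifest reader that are not shown, and `interpret_provenance` is a hypothetical name, not part of any Google or C2PA tooling.

```python
def interpret_provenance(has_synthid: bool, has_c2pa_manifest: bool) -> str:
    """Map two assumed detection results to a cautious conclusion."""
    if has_synthid:
        # The pixel-level watermark is designed to survive metadata stripping,
        # so it is checked first.
        return "likely generated or edited by a Google model"
    if has_c2pa_manifest:
        # A manifest is a signed record; its contents still need inspection.
        return "has content credentials; inspect the manifest for origin"
    # Absence of both signals proves nothing about human authorship.
    return "unknown origin: not necessarily human-made"


print(interpret_provenance(has_synthid=False, has_c2pa_manifest=False))
```

Note the last branch: a clip with neither signal is reported as unknown, never as human-made.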

Testing Gemini Omni Flash

I tested Omni Flash's video generation capabilities in two different categories: real-world physics and style transfer.

Testing physics and world knowledge

Here was my prompt:

A medieval trebuchet launching a fired clay pot at a stone castle wall, shot in slow motion. The counterweight falls, the sling whips around, the pot arcs through the air and shatters against the stone, shards and embers scattering across the courtyard. Continuous handheld camera move, golden hour light, period-accurate construction and dress. Realistic sound design — wood creaking under tension, rope strain, the whoosh of the sling, the sharp crack of impact. No music.

The note told me it would take a few minutes, but it really only took about ten seconds.

Maybe I'm picky, but I was a bit disappointed. The clay pot's orientation changed between an earlier frame and a later one, and in flight it looked less like a pot flung from a sling and more like an airplane with some kind of propulsion behind it.

Testing motion and style transfer

Now, let me take the result of that last test and see if I can make it totally different. What I did: I took a screenshot of that last video (so, I uploaded an image) and then asked for a video in a new style.

Take the trebuchet clip and re-render the entire scene in the visual style of the Bayeux Tapestry from this reference image — flat embroidered figures, period-accurate threading colors (faded reds, ochres, blues, greens against undyed linen), narrow decorative borders top and bottom with stitched Latin captions, characteristic medieval proportions where soldiers and the trebuchet are roughly the same scale. Keep the original motion choreography intact: the counterweight falls, the sling whips around, the pot arcs through the air and shatters against the castle wall, all in the same timing and trajectory. The shatter moment should read as the embroidery itself unraveling — threads splaying outward on impact. Replace the original sound design with a single dulcimer or hurdy-gurdy underscore. No live-action foley.

This video took much longer to generate, but it was worth the wait. Suddenly, the physics glitch didn't matter at all, because the tapestry style forgives imperfections, and the result was funny. The trebuchet did launch backwards this time, though.

Where Gemini Omni Stands on Benchmarks

Worth flagging upfront: Google didn't publish numeric benchmarks alongside the Omni launch. The DeepMind announcement led with capability claims, such as physics understanding, world knowledge, and conversational editing, but provided no head-to-head scores on standardized evals.

Independent third-party benchmarks won't land for at least a few weeks.

What that means for the competitive picture: as of now, the public leader on the Artificial Analysis Video Arena (the closest thing video generation has to an industry-standard leaderboard) is ByteDance's Seedance 2.0, with an Elo of 1,269 on text-to-video and 1,351 on image-to-video.

All that said, here are the benchmarks worth watching once they run:

VBench 2.0: evaluates physics, commonsense reasoning, human fidelity, and controllability across 18 dimensions.
Artificial Analysis Video Arena: Elo-style head-to-head human preference rankings.
VABench: joint audio-video evaluation. Matters specifically because Omni generates synchronized audio natively.
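
For a sense of what an Elo gap means in practice, here is a short sketch using the textbook Elo expected-score formula. It assumes the standard 400-point logistic scale; the arena's exact rating method may differ, and the 100-point-lower challenger is a hypothetical model, not a published score.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


seedance_t2v = 1269  # Elo quoted above for text-to-video
challenger = 1169    # hypothetical model rated 100 points lower

print(f"{expected_win_rate(seedance_t2v, challenger):.1%}")
```

Under these assumptions, a 100-point lead corresponds to winning roughly 64% of head-to-head comparisons, so even a modest Elo gap reflects a clear human preference.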

How to Access Gemini Omni: Pricing and Plans

Access right now sits inside Google's consumer AI tiers in the US:

AI Plus at $7.99/month
AI Pro at $19.99/month, and
AI Ultra at $249.99/month

Credit allocations scale with the tier: Plus gets 200 monthly AI credits, Pro gets 1,000.
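
A quick bit of arithmetic on the figures above shows how the price per credit compares across the two tiers with stated allocations. This is only a sketch: it uses the prices and credit counts quoted in this article, leaves out Ultra because its allocation isn't given here, and says nothing about how many credits one Omni video costs.

```python
# Prices and monthly credits as quoted in this article.
plans = {
    "AI Plus": {"price_usd": 7.99, "credits": 200},
    "AI Pro": {"price_usd": 19.99, "credits": 1000},
}


def cost_per_credit(plan: dict) -> float:
    """Monthly price divided by monthly credit allocation."""
    return plan["price_usd"] / plan["credits"]


for name, plan in plans.items():
    print(f"{name}: about {cost_per_credit(plan) * 100:.0f} cents per credit")
```

On these numbers, a credit costs roughly 4 cents on Plus and roughly 2 cents on Pro, so Pro is about half the price per credit if you actually use the allocation.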

API access is "coming in the next few weeks." We will be writing about this in more detail once we get our hands on it.

Conclusion

Until now, the generative media stack has been a relay race. Text goes to one model, the output gets handed to an image model, which hands a still to a video model, which hands frames to an audio model. Every handoff is a place where coherence or quality could leak away. Omni's big claim is that it reasons across text, image, video, and audio in the same forward pass.

A full-capability Omni model, which we suspect is aimed at enterprise, is certainly in the works.

Author

Josef Waples
