Gemini Omni: One Model for Text, Image, Audio, and Video

A first look at Google DeepMind's any-to-any model — what it does, what's new about it, and how to access it.
May 19, 2026 · 7 min read

The much-hyped Google I/O event did not disappoint. After a bit of preamble came the real highlight: Gemini Omni, a new family of natively multimodal, "any-to-any" AI models that process text, image, audio, and video together. In other words, Nano Banana, but for video. (This is how Google itself talked about it.)

The initial model is Omni Flash. It introduces conversational video editing for creating cohesive media from individual assets, so you can edit and remix videos through conversation rather than in editing software.

What Is Gemini Omni?

Let's understand what Omni is about in more detail.

Gemini Omni is Google DeepMind's first natively multimodal generative media model. The first variant in the family is Gemini Omni Flash, which is now live in:

  • the Gemini app,
  • Google Flow, and
  • YouTube Shorts.

Omni accepts any combination of text, images, audio, and video as input and produces a video. Key here: There's no relay happening across different systems; this is all one model.

Until now, Google ran a split stack: Veo for video, Imagen for images, separate systems for audio. Omni collapses that into one model (hence the name!) that can reason across modalities. In practice, that translates to more coherent edits and fewer pipeline artifacts.

Before I get carried away, let's frame this as key features: 

Key Features of Gemini Omni

Here's what stands out in terms of features:

Conversational video editing

The headline capability is conversational video editing. You give Omni a clip — one you generated or one you shot on your phone — and you change it by talking to it. "Make the lights dim." "Change the camera angle to over her shoulder." "Turn the violin invisible." Each instruction persists across turns. The scene evolves; it doesn't reset.

Real-world physics and world knowledge

Google made a point of this at launch: by its account, Omni has an intuitive grasp of gravity, kinetic energy, and fluid dynamics.

Sketches and drawings to video

You can turn doodles into realistic footage, with the sketch used only as a movement guide rather than as a final visual reference. This will be useful in pre-production. Kids will also love it for bringing their drawings to life.

SynthID watermarking and C2PA content credentials

Every Omni output ships with two layers of provenance.

  • SynthID is an invisible watermark embedded directly into the pixels at generation time. It's imperceptible to viewers. And it's designed to survive cropping, filters, and re-encoding.
  • C2PA content credentials sit alongside it as a signed cryptographic manifest attached to the file. Strip the metadata, and the pixel-level signal still holds.

Worth flagging: SynthID is Google-specific. Only Google's own models embed it, so "not watermarked" doesn't mean "human-made" — it just means "not from a Google model." Given how easily Omni can remix real footage, durable provenance on every output starts to look less like a nice-to-have and more like table stakes.
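The logic of reading the two provenance signals together can be sketched in a few lines. This is only an illustration of the reasoning above, not a real detector: the function name and its two boolean inputs are hypothetical, and the sketch assumes you already have the results of a SynthID check and a C2PA manifest check from whatever tools you use.

```python
def interpret_provenance(synthid_detected: bool, c2pa_manifest_valid: bool) -> str:
    """Combine two provenance checks into a cautious verdict.

    Both inputs are hypothetical: they stand for the results of a SynthID
    detector and a C2PA manifest validator, neither of which is called here.
    """
    if synthid_detected and c2pa_manifest_valid:
        return "Google model output, metadata intact"
    if synthid_detected:
        # The pixel-level watermark survives even when metadata is stripped.
        return "Google model output, metadata stripped or altered"
    if c2pa_manifest_valid:
        return "signed manifest present, no SynthID: check the manifest for the origin"
    # No SynthID does NOT imply human-made, only "not from a Google model".
    return "inconclusive: no Google watermark and no valid manifest"


print(interpret_provenance(synthid_detected=True, c2pa_manifest_valid=False))
print(interpret_provenance(synthid_detected=False, c2pa_manifest_valid=False))
```

The last branch is the important one: the absence of both signals is reported as inconclusive rather than as evidence of human authorship.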

Testing Gemini Omni

Let's give it a test:

Testing physics and world knowledge

Here was my prompt: 

A medieval trebuchet launching a fired clay pot at a stone castle wall, shot in slow motion. The counterweight falls, the sling whips around, the pot arcs through the air and shatters against the stone, shards and embers scattering across the courtyard. Continuous handheld camera move, golden hour light, period-accurate construction and dress. Realistic sound design — wood creaking under tension, rope strain, the whoosh of the sling, the sharp crack of impact. No music.

The note told me it would take a few minutes, but it really only took about ten seconds. 

Maybe I'm picky, but I was a bit disappointed. The clay pot's angle in the later frames didn't match the earlier ones: it changed orientation mid-flight and looked like an airplane with a different kind of propulsion behind it.

Testing motion and style transfer

Now, let me take the result of that last test and see if I can make it totally different. What I did: I took a screenshot of that last video (so, I uploaded an image) and then asked for a video in a new style. 

Take the trebuchet clip and re-render the entire scene in the visual style of the Bayeux Tapestry from this reference image — flat embroidered figures, period-accurate threading colors (faded reds, ochres, blues, greens against undyed linen), narrow decorative borders top and bottom with stitched Latin captions, characteristic medieval proportions where soldiers and the trebuchet are roughly the same scale. Keep the original motion choreography intact: the counterweight falls, the sling whips around, the pot arcs through the air and shatters against the castle wall, all in the same timing and trajectory. The shatter moment should read as the embroidery itself unraveling — threads splaying outward on impact. Replace the original sound design with a single dulcimer or hurdy-gurdy underscore. No live-action foley.

This video took much longer to generate, but it was worth the wait. Suddenly, the physics glitch didn't matter at all, because the tapestry style allowed for imperfections, and the result was funny. The trebuchet did launch backwards this time, though.

Where Gemini Omni Stands on Benchmarks

Worth flagging upfront: Google didn't publish numeric benchmarks alongside the Omni launch. The DeepMind announcement led with capability claims (physics understanding, world knowledge, conversational editing) but offered no head-to-head scores on standardized evals.

Independent third-party benchmarks won't land for at least a few weeks.

What that means for the competitive picture: as of now, the public leader on the Artificial Analysis Video Arena — the closest thing video generation has to an industry-standard leaderboard — is ByteDance's Seedance 2.0, with an Elo of 1,269 on text-to-video and 1,351 on image-to-video. 

All that said, here are the benchmarks worth watching once they run:

  • VBench 2.0: evaluates physics, commonsense reasoning, human fidelity, and controllability across 18 dimensions.
  • Artificial Analysis Video Arena: Elo-style head-to-head human preference rankings. 
  • VABench: joint audio-video evaluation. Matters specifically because Omni generates synchronized audio natively.
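The Elo scores quoted above are easier to interpret as win probabilities. Here is a minimal sketch, assuming the arena follows the standard logistic Elo formula with a 400-point scale (this article doesn't confirm the leaderboard's exact method). The rival rating is hypothetical and chosen only for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head votes for A over B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


seedance_text_to_video = 1269   # figure quoted in this article
hypothetical_rival = 1219       # made-up rating, 50 points lower

p = elo_expected_score(seedance_text_to_video, hypothetical_rival)
print(f"Expected preference rate for the leader: {p:.1%}")
```

Under that assumption, a 50-point lead corresponds to being preferred in roughly 57% of pairwise comparisons, so even a clear leaderboard gap leaves plenty of individual matchups going the other way.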

How to Access Gemini Omni: Pricing and Plans

Access right now sits inside Google's consumer AI tiers in the US:

  • AI Plus at $7.99/month
  • AI Pro at $19.99/month, and
  • AI Ultra at $249.99/month

Credit allocations scale with the tier — Plus gets 200 monthly AI credits, Pro gets 1,000.
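To compare the tiers, divide the monthly price by the monthly credits. A quick sketch using only the figures listed above; AI Ultra is left out because its credit allocation isn't given here, and the helper name is just for illustration.

```python
def cost_per_credit(monthly_price_usd: float, monthly_credits: int) -> float:
    """Effective price of one AI credit, in US dollars."""
    return monthly_price_usd / monthly_credits


# (monthly price in USD, monthly AI credits), as listed in this article
plans = {
    "AI Plus": (7.99, 200),
    "AI Pro": (19.99, 1000),
}

for name, (price, credits) in plans.items():
    cents = cost_per_credit(price, credits) * 100
    print(f"{name}: about {cents:.1f} cents per credit")
```

On these numbers, Pro works out to about half the per-credit cost of Plus (roughly 2 cents versus 4 cents).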

API access is "coming in the next few weeks." We will be writing about this.

Conclusion

Until now, the generative media stack has been a relay race. Text goes to one model, the output gets handed to an image model, which hands a still to a video model, which hands frames to an audio model. Every handoff is a place where coherence or quality could leak away. Omni's big claim is that it reasons across text, image, video, and audio in the same forward pass.

A full-capability Omni model is almost certainly in the works, and we suspect it's aimed at enterprise.


Author
Josef Waples

I'm a data science writer and editor with contributions to research articles in scientific journals. I'm especially interested in linear algebra, statistics, R, and the like. I also play a fair amount of chess! 

Gemini Omni FAQs

What is Gemini Omni and how does it work?

Gemini Omni is Google DeepMind's AI video model that creates and edits video from text, image, audio, or video inputs through natural conversation.

What can you do with Gemini Omni?

Edit video styles, swap characters with reference images, change camera angles, sync text and sound to action, turn sketches into footage, and combine multiple inputs into one scene.

How do you access Gemini Omni?

It's available in the Gemini app, Google Flow, and YouTube Shorts. A Google AI subscription is required, with features varying by tier and region.

Can you edit Gemini Omni videos with follow-up prompts?

Yes. Omni supports multi-turn editing, so you can refine details, environments, and camera angles step by step while keeping the scene consistent.

How can you tell if a video was made with Gemini Omni?

Every output includes an invisible SynthID watermark and C2PA Content Credentials, verifiable in the Gemini app (with Chrome and Search support coming soon).
