Gemini Omni Flash Is Now Live: Google's Multimodal Video Model with "Nano-Banana" Style Editing
- David Borish

- May 19
- 6 min read

Gemini Omni Flash Multimodal Video Model
The headline distinction Google drew at I/O is straightforward: Veo 3, its previous-generation video model, started with text. Gemini Omni starts with whatever you have. You can feed it a photo taken on a phone, a reference video, an audio clip, or a written description, and the model combines those inputs to generate output video. Google describes this as being "grounded in Gemini's real-world knowledge," meaning the model draws on Gemini's training in physics, biology, history, and visual context rather than treating generation as a purely pattern-matching exercise against a video corpus.
The practical result, based on the demos shown at I/O, is that users can do things like photograph a physical object and instruct the model to animate it with realistic motion, reference an existing clip and change specific characters or environmental details, or describe a scientific concept and receive a stylized explainer video. A demo on the DeepMind product page illustrates protein folding rendered as a claymation explainer, with stop-motion aesthetics and factually grounded content generated from a single descriptive prompt.
Google DeepMind CEO Demis Hassabis described the long-term goal as generating any type of output from any kind of input. Omni Flash is the first step in that direction, focused specifically on video output.
Conversational Editing as the Core Differentiator
The part of the Omni announcement that drew the most attention at I/O is not the generation quality itself but the editing model. Unlike traditional AI video tools that force users into a cycle of generating, evaluating, scrapping, and re-prompting from scratch, Omni allows in-conversation modification. You can ask the model to change the camera angle on a generated clip, swap an object in a scene, shift the lighting, or relocate a character to a different environment, and each change builds on the previous one rather than starting over.
Google describes this as editing "over multiple turns, with consistency." The product page demonstrates a sequence where a violinist is first transported to an outdoor environment, then has the violin made invisible, then has the camera angle shifted to over-the-shoulder, with each successive edit maintaining the character and scene coherence established in prior steps.
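The difference between re-prompting from scratch and editing "over multiple turns" can be pictured as a running scene state that each instruction only partially overwrites. The sketch below is a conceptual illustration of the violinist sequence, not Google's API: the `EditSession` class, its field names, and the starting values (such as the "indoors" placeholder) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EditSession:
    """Hypothetical stand-in for a multi-turn editing session (not a Google API)."""
    scene: dict
    history: list = field(default_factory=list)

    def edit(self, instruction: str, **changes) -> dict:
        # Overwrite only the attributes named in this turn;
        # every other attribute carries over from the previous turn.
        self.scene = {**self.scene, **changes}
        self.history.append(instruction)
        return self.scene


# Starting values are placeholders; the product page does not specify them.
session = EditSession(scene={
    "character": "violinist",
    "environment": "indoors",
    "violin_visible": True,
    "camera": "front",
})

session.edit("move the violinist outdoors", environment="outdoors")
session.edit("make the violin invisible", violin_visible=False)
final_scene = session.edit("shift to an over-the-shoulder shot",
                           camera="over-the-shoulder")
```

After the third turn, `final_scene` still holds the same character and keeps the outdoor setting and invisible violin from the earlier turns, which is the consistency property the demo is meant to show.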
This is the capability that most distinguishes Omni from the current field. Seedance 2.0 and Kling 3.0, the two models that have been leading public video generation benchmarks, are both strong at generation from scratch. Neither offers a comparable conversational editing layer. Google's bet is that iteration speed matters more to most creators than raw generation quality in isolation, a logic that mirrors how Nano Banana initially competed in the image generation space by leading on editing before catching up on generation scores.
Physics Modeling and World Knowledge
Google made specific claims about Omni's physical simulation accuracy, stating that the model has an intuitive understanding of forces like gravity, kinetic energy, and fluid dynamics. A demo on the product page shows a marble rolling through a chain-reaction track, with motion that tracks physically plausible momentum and surface interaction.
The physics grounding connects to a broader architectural point. Because Omni is built on Gemini's reasoning foundation rather than being a standalone video model, it has access to the world model Google has built into Gemini through training. A prompt for a stop-motion claymation explanation of protein folding works not just because the model knows what protein folding looks like visually, but because it understands the underlying biology well enough to sequence the explanation accurately.
This is where Omni's positioning diverges most from specialized competitors. ByteDance built Seedance 2.0 to generate photorealistic, high-fidelity video clips. Google built Omni to generate video that is also factually coherent. Whether users care about that distinction depends heavily on use case: commercial content creators may prioritize visual quality, while educators, researchers, or corporate content teams may place higher value on accuracy.
Access, Availability, and the Subscription Structure
Gemini Omni Flash is available today on three surfaces. In the Gemini app and Google Flow, it is restricted to AI Plus, Pro, and Ultra subscribers. In YouTube Shorts, it is available free as part of the Shorts Remix and Create app integration. The Pro tier of Omni, which Google has confirmed is in development, is not yet available.
The subscription gate reflects the compute requirements. Prior to the official launch, testers who accessed early builds reported that two video generation prompts consumed 86 percent of their daily quota on the AI Pro plan. That consumption rate indicates that Omni Flash, the lighter of the two planned variants, is substantially more resource-intensive than standard Gemini text or image requests. Google has built new usage limit infrastructure into the Gemini account settings to manage this.
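To put the tester report in concrete terms, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that each prompt costs the same share of quota and that usage scales linearly; Google has not published how Omni requests are metered, so treat the result as a rough reading of one anecdote rather than a documented limit.

```python
import math

# Figures reported by early testers on the AI Pro plan.
quota_used_pct = 86.0
prompts_sent = 2

# Assumption: every prompt costs the same share of the daily quota.
pct_per_prompt = quota_used_pct / prompts_sent

# How many whole prompts fit into 100 percent at that rate.
full_prompts_per_day = math.floor(100.0 / pct_per_prompt)

# Quota left over after those prompts, too little for another one.
remaining_pct = 100.0 - full_prompts_per_day * pct_per_prompt

print(f"{pct_per_prompt:.0f}% per prompt, "
      f"{full_prompts_per_day} prompts per day, "
      f"{remaining_pct:.0f}% left over")
```

Under those assumptions, each prompt costs about 43 percent of the daily allowance, so a third generation would not fit in the same day.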
Google is also taking a staged approach to some of Omni's more sensitive capabilities. The ability to create videos using a digital avatar of yourself with your own voice is available at launch. The ability to edit existing videos to change audio and speech is being held back while Google tests what it describes as responsible deployment. Each video generated with Omni includes a SynthID digital watermark that can be verified through the Gemini app, and Google is expanding SynthID verification to Search and Chrome in the coming months. C2PA Content Credentials support allows users to check whether content has been modified from its original camera source.
The concern Google is visibly navigating is that a model capable of seamlessly swapping characters, altering dialogue, and changing what happens in a video is a deepfake production tool at consumer scale. The staged rollout for audio-speech editing is an acknowledgment of that risk.
The Competitive Context
The video generation market Google is entering with Omni looks meaningfully different than it did a year ago. OpenAI shut down the Sora 2 consumer app on April 29, 2026, with the model remaining available only through API access. That exit removed the most prominent branded competitor from consumer video generation at exactly the moment Google is leaning in.
The remaining competition is primarily specialized video generators from Asian AI labs. Seedance 2.0, built by ByteDance, has been leading most public benchmarks on commercial usability scores. Kling 3.0 from Kuaishou is the dominant player in the Chinese market. Alibaba's HappyHorse-1.0 briefly topped the Artificial Analysis leaderboard before being overtaken. What all these models have in common is that they are purpose-built for video generation and do not handle image creation or text reasoning within the same system.
Omni enters the market not as a claim to superior clip quality but as a different kind of product. If a user can generate a storyboard image, animate it into video, and refine both outputs through conversation in a single Gemini session, the competitive frame shifts from "which model produces the best 10-second clip" to "which platform handles the full creative workflow." That is a harder comparison to benchmark against.
What Omni Does Not Yet Do
The current release has real constraints worth noting. Generation length caps remain in place; Google has not published specific limits for Omni Flash, but early testing suggested outputs around 10 seconds per clip. The Pro tier of Omni, which may offer longer and higher-resolution generation, is described as in testing with no announced release date. The only timeline Google has given, "next month," applies to Gemini 3.5 Pro, which is a different product.
Raw generation quality at the Flash tier, based on early descriptions, is competitive but not at the ceiling of what Seedance 2.0 produces for photorealistic footage. Google's own framing of Omni emphasizes the editing and knowledge-grounding capabilities more than visual quality scores, which suggests the company is not yet claiming generation leadership in head-to-head clip comparisons.
The audio editing restriction, while likely temporary, means the model's conversational editing capabilities are currently more constrained for video content involving dialogue or narration. Users who need to change what a person says in a video, rather than what they're doing or wearing, will encounter limits that Google has not set a timeline for removing.
What This Release Signals About Google's Direction
Omni Flash is one of several announcements from I/O 2026 that collectively show Google consolidating its previously fragmented AI stack. Veo handled video, Nano Banana handled images, and Gemini handled text and reasoning. Omni begins collapsing that separation into a unified surface where generation modalities share a common reasoning foundation.
Gemini 3.5 Flash, also announced today, runs four times faster than the previous fastest frontier model by output tokens per second, and surpasses Gemini 3.1 Pro on coding, agentic, and multimodal benchmarks. Gemini Spark, a persistent personal agent integrating with Gmail, Docs, and third-party tools via MCP, extends the platform into task automation.
Together, the I/O announcements suggest Google is building toward a system where a user interacts with one Gemini surface that handles text reasoning, image generation, video generation, and multi-step task execution without switching tools.
Whether Omni Flash's first version lives up to the full ambition of that vision will become clearer as more users work with it at scale. The conversational editing capability, if it performs reliably across varied inputs, is the piece most likely to change how content creators approach AI video tools. The generation quality question will be answered over the next several model iterations, following roughly the same trajectory Nano Banana took with images.