NVIDIA's New 2.6B World Model Generates Minute-Scale 720p Video on a Single GPU

David Borish
May 19

SANA-WM: One Image, One Minute, One GPU

The standard assumption in world-model research has been that minute-scale, high-resolution video generation requires large model ensembles, proprietary training data, and multi-GPU inference. NVIDIA's SANA-WM challenges all three at once.

Released in May 2026, SANA-WM synthesizes 720p video at 16 frames per second for a full 60 seconds, conditioned on a single input image, a text description, and a six-degree-of-freedom camera trajectory. The base inference pipeline fits within 51.1 GB of GPU memory. A distilled four-step variant runs on a single RTX 5090 with NVFP4 quantization, producing a 60-second clip in 34 seconds.

LingBot-World, the strongest visual-quality baseline on the benchmark, uses a 14B+14B parameter cascade and requires eight H100 GPUs per clip, achieving 0.6 videos per hour. SANA-WM with its second-stage refiner reaches 22 videos per hour, roughly 36x LingBot-World's rate, while staying within 1.3 points of LingBot-World's VBench Overall score on both trajectory splits. Without the refiner, throughput rises to 24.1 videos per hour.

The Architecture: Replacing Softmax Attention at Scale

A 60-second 720p video at 16 fps produces 960 frames. After encoding through a high-compression tokenizer, the latent sequence is still far too long for full quadratic attention. SANA-WM uses a hybrid backbone that alternates between two attention mechanisms across 20 transformer blocks.
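The sequence lengths involved can be sketched with some back-of-the-envelope arithmetic. The temporal stride of eight raw frames per latent frame is the figure this article gives for the camera branch; `tokens_per_latent_frame` is a hypothetical placeholder, since the per-frame token count is not stated here.

```python
# Rough sequence-length arithmetic for a 60 s clip at 16 fps.
SECONDS = 60
FPS = 16
TEMPORAL_STRIDE = 8             # raw frames per latent frame, as described for the camera branch
tokens_per_latent_frame = 1000  # HYPOTHETICAL placeholder, not a published figure

raw_frames = SECONDS * FPS
latent_frames = raw_frames // TEMPORAL_STRIDE
total_tokens = latent_frames * tokens_per_latent_frame

# Full softmax attention scores every token against every other token,
# so the work grows with the square of the sequence length.
pairwise_scores = total_tokens ** 2

# A frame-wise recurrent layer instead takes one step per latent frame.
recurrent_steps = latent_frames

print(raw_frames, latent_frames, total_tokens, pairwise_scores, recurrent_steps)
```

Whatever the real per-frame token count is, the quadratic term is what makes full attention impractical at this length, while the recurrent step count grows only linearly with clip duration.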

The primary mechanism is a frame-wise Gated DeltaNet (GDN), a recurrent layer that processes one entire latent frame per step. GDN maintains a fixed-size state matrix updated by a decay gate and a delta-rule correction, allowing the model to forget stale content without memory cost growing with sequence length. By contrast, standard cumulative linear attention accumulates key-value products without any decay, which causes drift over long sequences. Compared to the SANA-Video baseline, the GDN backbone combined with the LTX2 VAE tokenizer reduces peak memory from 8.9 GB to 5.7 GB and cuts inference latency from 1,267 ms to 433 ms per step.
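A minimal NumPy sketch of a gated delta-rule update may make the "decay gate plus delta-rule correction" idea concrete. It follows the generic token-level rule from the Gated DeltaNet literature, not SANA-WM's frame-wise layer, whose exact formulation is not given here; the function name, the scalar gates `alpha` and `beta`, and the toy dimensions are all assumptions for illustration.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of a fixed-size state (illustrative only).

    S     : (d_v, d_k) state matrix mapping keys to values
    k, v  : key (d_k,) and value (d_v,) for the current step
    alpha : decay gate in [0, 1]; values below 1 fade old content
    beta  : write strength in [0, 1]
    """
    k = k / np.linalg.norm(k)
    # Decay the state and erase whatever was previously stored under this key ...
    S = alpha * (S - beta * np.outer(S @ k, k))
    # ... then write the new value under the same key.
    return S + beta * np.outer(v, k)

d_k, d_v = 3, 4
S = np.zeros((d_v, d_k))
key = np.array([1.0, 2.0, 2.0])
unit_key = key / np.linalg.norm(key)

S = gated_delta_step(S, key, np.arange(4.0), alpha=1.0, beta=1.0)
S = gated_delta_step(S, key, np.full(4, 7.0), alpha=1.0, beta=1.0)  # overwrite the same key
print(S.shape, S @ unit_key)
```

The state keeps its `(d_v, d_k)` shape no matter how many steps are applied, which is the property the article highlights: memory does not grow with sequence length. Reading back with the same key returns the most recent value rather than a sum of both writes, which is what separates the delta rule from plain cumulative linear attention.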

Every fourth block uses conventional softmax attention. Those five layers, placed at positions 3, 7, 11, 15, and 19, allow exact long-range recall at periodic intervals to anchor spatial consistency across the full minute. Switching to all-softmax causes the model to run out of memory entirely at 60-second sequence lengths on an H100.
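The interleaving can be written down in a few lines. This sketch assumes the positions are counted from zero, which is the reading under which "every fourth block" of 20 lands on 3, 7, 11, 15, and 19; the names `block_layout`, `"gdn"`, and `"softmax"` are illustrative labels, not identifiers from the SANA-WM code.

```python
NUM_BLOCKS = 20
SOFTMAX_EVERY = 4

def block_layout(num_blocks=NUM_BLOCKS, every=SOFTMAX_EVERY):
    """Return the attention type for each block, assuming zero-based indices."""
    return [
        "softmax" if i % every == every - 1 else "gdn"
        for i in range(num_blocks)
    ]

layout = block_layout()
softmax_positions = [i for i, kind in enumerate(layout) if kind == "softmax"]
print(softmax_positions)  # [3, 7, 11, 15, 19]
```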

The LTX2 VAE tokenizer encodes video into a representation 2x smaller than ST-DC-AE and 8x smaller than the Wan2.1 VAE. The tokenizer swap alone, without any other change to the backbone, cut peak memory by 3.5 GB and inference latency by 3.4x.

Camera Control: Two Branches at Two Rates

Precise camera control over a full minute requires solving a mismatch between the rate at which the model operates and the rate at which camera motion occurs. SANA-WM uses a dual-branch conditioning design to handle both.

The coarse branch encodes camera pose at the latent frame rate using Unified Camera Positional Encoding (UCPE), providing global 6-DoF trajectory structure across the rollout. The fine branch addresses sub-stride motion: each latent frame summarizes eight raw frames with distinct camera poses, so the coarse encoding misses intra-stride variation. A Plücker ray mixing layer, computed at the raw-frame rate and applied after each self-attention output, fills that gap.
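For readers unfamiliar with Plücker rays, here is a minimal sketch of the standard per-pixel encoding: each pixel's viewing ray is stored as a unit direction plus a moment (camera center crossed with direction), giving six channels per pixel. The function name, the pinhole intrinsics, and the toy camera are assumptions for illustration; how SANA-WM mixes these rays into the network is not shown.

```python
import numpy as np

def plucker_rays(K, R_c2w, cam_center, height, width):
    """Per-pixel Plücker ray map of shape (height, width, 6).

    K          : (3, 3) pinhole intrinsics
    R_c2w      : (3, 3) camera-to-world rotation
    cam_center : (3,) camera position in world coordinates
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)       # (H, W, 3)
    # Back-project to world-space directions: d = R * K^-1 * p.
    dirs = pix @ np.linalg.inv(K).T @ R_c2w.T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Moment m = o x d encodes where the ray sits in space.
    origins = np.broadcast_to(np.asarray(cam_center, dtype=float), dirs.shape)
    moments = np.cross(origins, dirs)
    return np.concatenate([dirs, moments], axis=-1)

K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 24.0],
              [0.0, 0.0, 1.0]])
rays = plucker_rays(K, np.eye(3), [1.0, 2.0, 3.0], height=48, width=64)
print(rays.shape)  # (48, 64, 6)
```

Because the encoding is computed per pose, it can be evaluated once for every raw frame, so the eight poses that fall inside a single latent frame each contribute their own ray map instead of being collapsed into one.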

The ablation is direct. No camera control produces a rotation error of 16.93 degrees. Plücker conditioning alone barely moves that number. UCPE alone reduces it to 7.73 degrees. The dual UCPE+Plücker combination achieves 6.21 degrees, the best result in the ablation.

On the full benchmark, SANA-WM with the refiner records rotation errors of 4.50 degrees on simple trajectories and 8.34 degrees on the hard split. Matrix-Game 3.0 records 12.96 and 18.79. LingBot-World's 14B cascade reaches 10.47 and 18.99.
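The article does not spell out how rotation error is computed. A common definition, assumed here, is the geodesic angle between the predicted and ground-truth camera rotations; the helper names below are illustrative.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    R_rel = R_pred.T @ R_gt
    cos_angle = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def rot_z(deg):
    """Rotation about the z axis by `deg` degrees (test helper)."""
    t = np.radians(deg)
    return np.array([[np.cos(t), -np.sin(t), 0.0],
                     [np.sin(t),  np.cos(t), 0.0],
                     [0.0,        0.0,       1.0]])

print(rotation_error_deg(rot_z(30.0), np.eye(3)))
```

Under this definition, a reported error of 4.50 degrees would mean the generated camera orientation deviates from the requested one by that angle on average; whether the benchmark averages per frame or per clip is not stated here.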

The Second-Stage Refiner

Stage-1 outputs at minute scale often develop structural artifacts and temporal drift, where image quality degrades in the final seconds of a clip. SANA-WM uses a dedicated long-video refiner, initialized from the 17B LTX-2 model, to correct that.

The refiner uses truncated-sigma flow matching: a noisy version of the stage-1 latent serves as the starting point, and the model maps it toward a high-fidelity target rather than reconstructing from pure noise. LoRA adapters trained at rank 384 are merged into the distilled refiner at inference time, adding only three denoising steps to the pipeline.
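The idea of starting from a partially noised latent rather than pure noise can be sketched with a toy Euler sampler. Everything specific here is an assumption: the starting noise level `sigma_start=0.5`, the linear schedule, and especially `oracle_velocity`, which stands in for the trained refiner network by simply knowing the target.

```python
import numpy as np

def refine(stage1_latent, velocity_fn, sigma_start=0.5, num_steps=3, seed=0):
    """Toy truncated-sigma sampler: noise the input to sigma_start, then
    integrate the velocity field down to sigma = 0 in a few Euler steps."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(stage1_latent.shape)
    # Start from a noised stage-1 latent; pure noise would be sigma = 1.
    x = (1.0 - sigma_start) * stage1_latent + sigma_start * noise
    sigmas = np.linspace(sigma_start, 0.0, num_steps + 1)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (s_next - s_cur) * velocity_fn(x, s_cur)
    return x

# Toy setup: the "high-fidelity" target and a degraded stage-1 version of it.
target = np.full((4, 4), 2.0)
stage1 = target + 0.3

def oracle_velocity(x, sigma):
    # Ideal velocity under x_sigma = (1 - sigma) * x0 + sigma * noise,
    # with x0 = target. A real refiner would predict this with a network.
    return (x - target) / sigma

refined = refine(stage1, oracle_velocity)
print(np.abs(refined - target).max())
```

Truncating the schedule means the sampler only has to travel from `sigma_start` to zero, which is why a handful of steps can be enough: most of the scene structure is already present in the stage-1 latent.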

The long-horizon quality drop statistic, delta-IQ, compares the first and last 10-second windows of a clip. For stage-1 autoregressive outputs, delta-IQ is 3.79 on simple trajectories and 3.09 on hard. After the refiner, those values fall to 1.17 and 0.31. HY-WorldPlay records 23.59 and 25.88, indicating severe quality collapse by the end of its rollouts. Applying the original LTX-2.3 short-video refiner directly to stage-1 SANA-WM outputs, without the long-video adaptation, produces VBench Overall scores of 71.37 and 71.16, well below the adapted refiner's 80.62 and 81.89.
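A minimal sketch of how such a statistic could be computed from per-frame image-quality scores follows. The sign convention (first window minus last window, so larger means more degradation) and the use of a simple mean are assumptions, as is the synthetic score curve used to exercise it.

```python
import numpy as np

def delta_iq(frame_scores, fps=16, window_seconds=10):
    """Mean quality of the first window minus mean quality of the last window."""
    scores = np.asarray(frame_scores, dtype=float)
    n = fps * window_seconds
    if scores.size < 2 * n:
        raise ValueError("clip is shorter than two windows")
    return float(scores[:n].mean() - scores[-n:].mean())

# Synthetic 60 s clip whose per-frame score decays linearly from 70 to 60.
decaying = np.linspace(70.0, 60.0, 60 * 16)
steady = np.full(60 * 16, 70.0)
print(delta_iq(decaying), delta_iq(steady))
```

A clip that holds its quality gives a value near zero, while one that degrades toward the end gives a large positive value, matching the way the numbers above are read.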

The Data Pipeline

Large-scale camera-controlled world models typically depend on proprietary pose-annotated datasets. SANA-WM uses seven open-source video sources and re-annotates them with metric-scale 6-DoF camera poses using a modified VIPE pose engine, with the depth backend replaced by Pi3X for multi-frame-consistent depth and MoGe-2 for per-frame metric scale. The resulting corpus contains 212,975 clips.

A secondary pass uses 3D Gaussian Splatting reconstructions of static DL3DV scenes to generate novel 60-second camera trajectories, producing 14,881 additional synthetic clips with known ground-truth poses. Captions deliberately omit camera motion language, preventing pose supervision from leaking through the text branch and forcing trajectory control through the pose conditioning.

Benchmark and Results

Because no prior benchmark targeted one-minute world modeling, the researchers built one: 80 initial scene images across four categories, each paired with a simple trajectory split (smooth planar paths) and a hard split (double loops, whip-pans, spiral descents, crane moves).

SANA-WM without the refiner achieves VBench Overall scores of 79.29 and 79.60 across the two splits. With the refiner, those scores rise to 80.62 and 81.89, against LingBot-World's 81.82 and 81.89, on one GPU per clip versus eight. The distilled variant's 34-second generation time on a single RTX 5090 brings that performance within reach of labs without datacenter access.

Limitations

SANA-WM does not maintain an explicit 3D scene representation and can drift in scenes with complex dynamic content, rare viewpoints, or rollouts beyond the training distribution. Revisit memory, the ability to reproduce a scene accurately when the camera returns to a previously visited viewpoint, is competitive but trails LingBot-World by a narrow margin on the simple split. Biases in publicly available video sources also translate into uneven coverage across environments, cultures, and rare viewpoints.

Those constraints aside, the efficiency result is the takeaway. Minute-scale, 720p, camera-controlled world modeling no longer requires industrial-scale infrastructure to approach industrial-scale visual quality.
