FLAT

Feedforward Latent Triangle Splatting for geometrically accurate scene generation.

Decode explicit surface-aligned triangle splats from video diffusion latents in a single forward pass.

Orest Kupyn^1,2, Goutam Bhat¹, Philipp Henzler¹, Fabian Manhardt¹, Christian Rupprecht^1,2, Federico Tombari^1,3

¹ Google Research

² University of Oxford, Visual Geometry Group

³ Technical University of Munich

Paper HuggingFace Video Interactive Demo

FLAT shows that compressed video diffusion latents can be mapped directly to explicit non-volumetric scene parameters. Instead of decoding 3D Gaussians, it predicts triangle splats in one pass, improving geometric accuracy while preserving competitive visual quality and enabling rasterization with simple triangle renderers and physics-based interaction after lightweight refinement.

Direct Triangle Decoding FLAT turns compressed video diffusion latents into explicit triangle splats directly, avoiding the usual generate-then-optimize path used by many feedforward scene pipelines.

Geometry-Specific Training Ray-centered triangle parameterization and a product window rendering function stabilize triangle regression, where small orientation errors would otherwise break gradient flow.

Refinement to Opaque Assets A lightweight test-time refinement step converts the predicted triangle soup into a fully opaque representation that fits standard rendering and game-engine-style interaction.

How FLAT turns video priors into scene geometry.

FLAT reuses the information already encoded in video diffusion latents, then predicts triangle-based surface primitives that are easier to export, refine, and physically use than volumetric feedforward outputs.

Architecture overview of FLAT from latent video features to triangle-splat scene geometry

1. Frozen video prior, geometry-aware decoder

A camera-conditioned video prior provides multi-view latent structure, while FLAT adds a feedforward decoder that regresses explicit triangle splats instead of volumetric blobs.

2. Triangle prediction needs special treatment

Triangles are more sensitive to orientation than Gaussian primitives, so the method centers rotations around viewing rays and uses the product window function to keep differentiable rendering gradients usable.

3. Raw prediction to usable asset

The direct output is a triangle soup optimized for geometric fidelity. A small refinement stage then makes it opaque and easier to deploy in standard graphics and physics pipelines.

Inspect generated scenes as explicit triangle geometry.

FLAT outputs scenes that can be explored immediately with a simple triangle renderer. This makes the viewer fast and portable across devices, without depending on a heavy rendering engine. On touch devices, drag inside the scene to look around and use the on-screen movement buttons to navigate.

Loading White Room...

Navigation W A S D move, drag to look, R to reset.

Tip Double-click anywhere in the viewport to snap back to the default view.

Touch Movement

Pipeline Flexibility

FLAT is trained to decode denoised Wan-2.1 latents directly, so at inference time it can replace the standard VAE image decoder with a scene decoder. Any Wan-2.1 variant that is finetuned from base model can generate explicit triangle-based geometry instead of RGB frames.

Flexible FLAT pipeline showing many Wan-2.1 generation modes feeding a shared FLAT latent scene decoder

1. Same latent space, different decoder

FLAT does not require a separate generator for each video model variant. It plugs into the latent space of the base video model and changes only the final decode target from pixels to scene geometry.

2. One FLAT decoder works across Wan-2.1 variants

Text-to-video, image-to-video, video-to-video, long-horizon, real-time, interactive, multi-conditioned, and world-consistent Wan-2.1 pipelines remain compatible as long as they produce the same denoised latent representation.

3. New upstream capabilities transfer automatically

As the video-model family gains new controls or better generation quality, the same latent scene decoder can inherit those improvements without training a different scene model for every pipeline mode.

Text-to-Video Output

FLAT + Wan-2.1 text-to-video.

Appearance and surface structure stay aligned.

We target geometric accuracy, not only image realism. These paired renders show that FLAT's novel views and surface normals stay consistent across viewpoints, making the geometry signal legible instead of hiding it behind appearance alone.

Novel View

Surface Normals

01 / 07

Refined FLAT scenes support direct physical interaction.

Geometric accuracy and representation choice matter in practice: after converting the predicted triangles into an opaque asset, the generated environment can be used directly in a simple rigid-body simulation rather than relying on a separately reconstructed collision proxy.

BibTeX

@misc{kupyn2026flat,
  title        = {FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation},
  author       = {Orest Kupyn and Goutam Bhat and Philipp Henzler and Fabian Manhardt and Christian Rupprecht and Federico Tombari},
  year         = {2026},
  note         = {Preprint}
}