Netflix VOID: Open Source Video Object Removal That Understands Physics


Most video inpainting tools treat object removal as a visual problem. Fill in the pixels, smooth out the edges, call it done. Netflix just released something that treats it as a physics problem. Their new open source model VOID (Video Object and Interaction Deletion) removes objects from videos and then reconstructs how the remaining scene would actually behave without them.

This matters because the gap between “technically removed” and “believably removed” has always been the expensive part of video post-production. VOID attempts to close that gap by understanding causality, not just appearance.

What Makes VOID Different

The key insight behind VOID is that objects in videos do not exist in isolation. They interact with everything around them. Remove a person holding a guitar, and the guitar needs somewhere to go. Remove someone jumping into a pool, and that splash needs to disappear. Previous tools handled visual artifacts such as shadows and reflections. VOID handles the physical consequences.

According to Netflix’s research paper, the system achieves this through what they call “interaction-aware mask conditioning.” Instead of a simple binary mask indicating what to remove, VOID uses a four-value quadmask that encodes: the primary object to delete, overlap regions, areas that will be physically affected by the removal, and background to preserve.

| Aspect | Key Point |
| --- | --- |
| What it is | Video object removal with physics simulation |
| Key benefit | Reconstructs physical interactions after deletion |
| Best for | Post-production editing, VFX cleanup |
| Limitation | Requires 40GB+ GPU VRAM |

Technical Architecture

VOID is built on Alibaba’s CogVideoX video diffusion model, specifically the CogVideoX-Fun-V1.5-5b-InP variant with 5 billion parameters. The Netflix team fine-tuned this base using synthetic training data from Google’s Kubric and Adobe’s HUMOTO datasets, which provide ground truth for how objects interact with each other.

The pipeline integrates multiple AI systems. Google’s Gemini 3 Pro identifies which areas of the scene will be affected after an object is removed. Meta’s SAM2 handles the actual segmentation for deletion. An optional second pass using optical flow corrects any remaining shape distortions for improved temporal consistency.
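The chaining described above can be sketched in Python. The function names, signatures, and data shapes below are illustrative stand-ins for the real components, not Netflix's actual interfaces:

```python
# Sketch of VOID's multi-stage pipeline. Each function is a stub standing in
# for a real model; the real system wires a VLM, SAM2, and a diffusion model.
def analyze_scene(video_frames):
    """Stand-in for the VLM step (Gemini): predict regions that will be
    physically affected once the object is removed."""
    return {"affected_boxes": [(10, 20, 30, 40)]}  # hypothetical output shape

def segment_object(video_frames, target):
    """Stand-in for SAM2: produce a per-frame mask of the object to delete."""
    return [[target] for _ in video_frames]

def inpaint(video_frames, quadmask, prompt):
    """Stand-in for the diffusion pass (pass 1): regenerate masked regions."""
    return video_frames  # placeholder; the real model synthesizes new pixels

def refine(video_frames):
    """Stand-in for the optional optical-flow second pass (pass 2)."""
    return video_frames

def remove_object(video_frames, target, prompt, second_pass=True):
    """Chain the stages: scene analysis -> segmentation -> inpainting -> refine."""
    affected = analyze_scene(video_frames)
    masks = segment_object(video_frames, target)
    quadmask = (masks, affected)  # the real pipeline encodes a quadmask video
    out = inpaint(video_frames, quadmask, prompt)
    return refine(out) if second_pass else out
```

The point of the sketch is the architectural pattern: each stage is a separately optimized model, coordinated through structured intermediate data rather than a single end-to-end network.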

The model operates at 384x672 resolution and can process up to 197 frames. It uses BF16 precision with FP8 quantization and a DDIM scheduler for inference.
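Those published settings can be collected in a small config sketch. The values come from the article; the `frame_budget` helper is an illustrative addition for planning clip lengths, not part of Netflix's tooling:

```python
# Inference settings reported for VOID (values from the article).
INFER_CONFIG = {
    "width": 672,
    "height": 384,
    "max_frames": 197,
    "precision": "bf16",      # with FP8 quantization of weights
    "scheduler": "DDIM",
}

def frame_budget(fps, seconds, max_frames=INFER_CONFIG["max_frames"]):
    """How many frames of a clip fit under the model's 197-frame cap."""
    return min(int(fps * seconds), max_frames)
```

At 24 fps, the 197-frame cap works out to roughly eight seconds of footage per pass, so longer shots would need to be processed in segments.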

Performance Against Competitors

In a human evaluation study with 25 participants, VOID-generated outputs were preferred 64.8 percent of the time. Runway came in second at 18.4 percent. The study compared VOID against ProPainter, DiffuEraser, Runway, MiniMax Remover, ROSE, and Gen Omnimatte across multiple video scenarios.

The performance advantage becomes most apparent in scenes with complex interactions. When removing a person who was physically supporting an object, VOID simulates the object falling naturally. When removing someone creating a splash in water, the water surface reconstructs smoothly. These are cases where traditional inpainting tools produce obvious artifacts.

Warning: The published benchmarks focus on relatively sparse scenes. It remains unclear how well VOID performs in densely populated environments like crowded streets or complex interiors. The examples Netflix shared feature open areas with limited visual clutter.

Hardware Requirements and Accessibility

Running VOID locally requires serious GPU resources. The minimum is 40GB of VRAM, which means an A100 or equivalent. Training required 8x A100 80GB GPUs with DeepSpeed ZeRO Stage 2.
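Before downloading weights, it is worth confirming your GPU clears the stated floor. A minimal check, with the PyTorch call shown as an optional comment since the helper itself is plain Python:

```python
def has_enough_vram(total_bytes, required_gb=40):
    """Check whether a GPU's reported memory meets VOID's stated 40 GB minimum."""
    return total_bytes / (1024 ** 3) >= required_gb

# With PyTorch installed, feed it the real number for device 0:
# import torch
# ok = has_enough_vram(torch.cuda.get_device_properties(0).total_memory)
```

An 80 GB A100 passes comfortably; a 24 GB consumer card such as an RTX 4090 does not.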

For those without enterprise hardware, Netflix provides a demo on Hugging Face and a Google Colab notebook that requires A100 runtime. The model weights are available directly from Hugging Face under the Apache 2.0 license, which permits commercial use.

This hardware barrier is significant for independent developers. If you are exploring running AI models locally, VOID represents the upper end of what is currently practical without cloud resources.

Practical Applications

The immediate use case is film and television post-production. Removing unwanted background elements, cleaning up continuity errors, or eliminating product placements after licensing changes are all tasks that currently require expensive manual frame-by-frame work. VOID could reduce these costs substantially.

For AI engineers building computer vision applications, VOID demonstrates an important architectural pattern. Rather than treating video editing as a single-model problem, Netflix chains multiple specialized models together. Scene analysis, segmentation, and generation are handled by different components, each optimized for its specific task.

Independent filmmakers and content creators gain access to capabilities that were previously exclusive to major studio budgets. The Apache 2.0 license makes this viable for commercial projects. Combined with accessible AI infrastructure, this could shift what is possible for smaller production teams.

How to Get Started

VOID is available on Hugging Face at netflix/void-model. The repository includes:

Two checkpoint files: void_pass1.safetensors for base inpainting and void_pass2.safetensors for optional refinement.

A notebook.ipynb for Colab experimentation.

Full inference scripts with example inputs.

The input format requires your source video, a quadmask video encoding the four mask regions, and a prompt JSON describing the scene after removal. The quadmask uses specific pixel values: 0 for remove, 63 for overlap, 127 for affected areas, and 255 for keep.
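The four pixel values above can be painted into a mask frame with plain Python. The helper below is an illustrative sketch assuming rectangular regions given as `(x0, y0, x1, y1)` boxes; real masks would come from SAM2 segmentations, and the overlap value (63) would be painted the same way where masks intersect:

```python
# Quadmask pixel values as documented for VOID's input format.
REMOVE, OVERLAP, AFFECTED, KEEP = 0, 63, 127, 255

def make_quadmask_frame(width, height, remove_box, affected_box):
    """Build one quadmask frame as a nested list of pixel values.

    Boxes are (x0, y0, x1, y1) with exclusive upper bounds (illustrative
    rectangles only; real pipelines use per-pixel segmentation masks).
    The remove region is painted last so it takes priority over affected.
    """
    frame = [[KEEP] * width for _ in range(height)]
    for (x0, y0, x1, y1), value in ((affected_box, AFFECTED),
                                    (remove_box, REMOVE)):
        for y in range(y0, y1):
            for x in range(x0, x1):
                frame[y][x] = value
    return frame
```

Stacking one such frame per video frame and encoding the result as a grayscale video yields the quadmask input; the prompt JSON then describes what the scene should look like after removal.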

Why This Release Matters

Netflix releasing their first open source AI model signals a shift in how large companies approach video AI development. Rather than keeping these capabilities proprietary, they are contributing to the broader ecosystem.

For AI engineers, VOID provides a production-ready reference implementation of physics-aware video manipulation. The architecture, training approach, and evaluation methodology are all documented in their arXiv paper (2604.02296).

The model also validates a specific technical approach: combining vision-language models for scene understanding with diffusion models for generation, all coordinated through structured masking. This pattern is likely to influence how future video AI tools are built.

Limitations to Consider

VOID has clear constraints. The 40GB VRAM requirement limits who can run it locally. The benchmarks focus on sparse scenes, leaving performance on complex environments uncertain. The quadmask input format adds preprocessing complexity compared to simpler click-to-remove interfaces.

Physics-aware generation is also not true physics simulation. The model produces plausible-looking outcomes learned from training data, not outcomes derived from actual physics calculations. For scenes requiring precise physical accuracy, the results may fall short.

Frequently Asked Questions

Can I use VOID for commercial projects?

Yes. VOID is released under the Apache 2.0 license, which permits commercial use, modification, and distribution. You can use it in commercial productions without licensing fees.

What GPU do I need to run VOID locally?

You need at least 40GB of VRAM. An NVIDIA A100 is the typical choice. For those without suitable hardware, Netflix provides a Hugging Face demo and Google Colab notebook with A100 runtime.

How does VOID compare to Runway’s object removal?

In Netflix’s evaluation study, VOID was preferred 64.8 percent of the time compared to Runway’s 18.4 percent. The primary advantage is handling physical interactions after object removal, not just visual cleanup.

Can VOID handle crowded scenes?

The published benchmarks focus on sparse environments. Netflix has not demonstrated performance on densely populated scenes, which may present challenges for the interaction simulation system.


To see how foundational AI engineering skills apply to video and computer vision projects, watch the full video tutorial on YouTube.

If you are building AI systems that process visual content, join the AI Engineering community where we discuss production implementation strategies for computer vision and multimodal AI.

Inside the community, you will find dedicated channels for discussing model deployment, hardware optimization, and real world implementation challenges.

Zen van Riel

Senior AI Engineer at GitHub | Ex-Microsoft

I went from a $500/month internship to Senior Engineer at GitHub. Now I teach 30,000+ engineers on YouTube and coach engineers toward $200K+ AI careers in the AI Engineering community.
