ICML 2026

Phys4D: Fine-Grained Physics-Consistent
4D Modeling from Video Diffusion

A three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations.

Anonymous Authors

Abstract

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, providing the model with an initial capability for 4D scene generation. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a suite of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance.

Method Overview

Phys4D converts a pretrained video diffusion model into a physics-consistent 4D world model via a three-stage training paradigm.

Phys4D Pipeline Diagram
Stage 1

Pseudo-Supervised Pretraining

We bootstrap geometry and motion representations using large-scale pseudo-supervision from off-the-shelf depth and optical flow estimators. The DiT backbone is frozen while auxiliary heads learn to predict depth maps and motion fields.
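The Stage 1 setup can be sketched as follows. This is a toy illustration, assuming random stand-in features and pseudo-labels; the shapes, the linear-head design, and all names here are assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: frames, height, width, feature channels.
T, H, W, C = 4, 8, 8, 16

def frozen_backbone(video_tokens):
    """Stand-in for the frozen DiT: emits per-pixel features, no updates."""
    return video_tokens

class LinearHead:
    """A minimal auxiliary head: one linear projection per pixel."""
    def __init__(self, in_dim, out_dim):
        self.W = rng.normal(0, 0.02, (in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, feats):
        return feats @ self.W + self.b

depth_head = LinearHead(C, 1)    # predicts a depth map per frame
motion_head = LinearHead(C, 2)   # predicts a 2D motion (flow) field

feats = frozen_backbone(rng.normal(size=(T, H, W, C)))
depth = depth_head(feats)        # (T, H, W, 1)
motion = motion_head(feats)      # (T, H, W, 2)

# Pseudo-labels from off-the-shelf estimators (random stand-ins here);
# only the heads would receive gradients under this L1 pseudo-supervision.
depth_pseudo = rng.normal(size=(T, H, W, 1))
flow_pseudo = rng.normal(size=(T, H, W, 2))
loss = np.abs(depth - depth_pseudo).mean() + np.abs(motion - flow_pseudo).mean()
print(depth.shape, motion.shape)
```

The key design point is that the backbone stays frozen, so pseudo-supervision cannot degrade the pretrained generative prior while the heads learn geometry and motion readouts.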

Stage 2

Physics-Grounded Fine-tuning

We selectively fine-tune on physics-simulation data with ground-truth annotations. A warp consistency loss couples depth and motion predictions to enforce temporally coherent 3D structure and physically plausible dynamics.
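A warp consistency term of this kind could be sketched as below. This is a simplified 2D version, assuming nearest-neighbor sampling and ignoring occlusions and depth changes under camera motion; the function names are hypothetical.

```python
import numpy as np

def warp_consistency_loss(depth_t, depth_t1, flow_t):
    """L1 discrepancy between the depth map at frame t and the depth map
    at frame t+1 sampled at flow-warped locations (nearest neighbor).

    depth_t, depth_t1: (H, W) depth maps; flow_t: (H, W, 2) flow t -> t+1.
    """
    H, W = depth_t.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Where each pixel of frame t lands in frame t+1, clamped to bounds.
    xw = np.clip(np.round(xs + flow_t[..., 0]).astype(int), 0, W - 1)
    yw = np.clip(np.round(ys + flow_t[..., 1]).astype(int), 0, H - 1)
    warped = depth_t1[yw, xw]
    return np.abs(depth_t - warped).mean()

# Sanity check: a static scene with zero flow incurs zero loss.
d = np.random.default_rng(0).uniform(1, 5, (8, 8))
print(warp_consistency_loss(d, d, np.zeros((8, 8, 2))))  # 0.0
```

In practice a differentiable bilinear sampler would replace the nearest-neighbor lookup so gradients flow into both the depth and motion predictions.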

Stage 3

Simulation-Grounded RL

We apply reinforcement learning with physics simulator rewards to correct residual physical violations. The reward is based on 4D Chamfer Distance between generated and physically valid object trajectories.
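A 4D Chamfer Distance of this kind could be computed as a per-frame symmetric Chamfer distance averaged over time; the exponential reward shaping below is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between 3D point sets (N, 3) and (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def chamfer_4d(traj_gen, traj_sim):
    """Per-frame Chamfer distance averaged over time, for trajectories
    given as lists of (N_t, 3) point arrays, one per frame."""
    return float(np.mean([chamfer(a, b) for a, b in zip(traj_gen, traj_sim)]))

def reward(traj_gen, traj_sim, scale=1.0):
    # Lower distance to the physically valid trajectory -> higher reward.
    return float(np.exp(-scale * chamfer_4d(traj_gen, traj_sim)))

pts = [np.random.default_rng(t).normal(size=(16, 3)) for t in range(4)]
print(reward(pts, pts))  # identical trajectories -> reward 1.0
```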

Demo Videos

Phys4D generates videos with physically consistent depth maps and motion fields, capturing coherent 4D scene evolution.

Physics Simulation Demo 1

Demonstrating temporally consistent depth and motion prediction with physically plausible object dynamics.

Physics Simulation Demo 2

Showcasing geometry-motion consistency with accurate depth warping across frames.

Physics Simulation Demo 3

Illustrating fine-grained physical consistency in complex multi-object interactions.

Results

Phys4D consistently improves physics understanding across multiple backbone architectures on the Physics-IQ benchmark.

Group        Model                     Params  MSE ↓  ST-IoU ↑  S-IoU ↑  WS-IoU ↑  Score (%) ↑
Ours         CogVideoX + Phys4D        5B      0.009  0.169     0.252    0.157     30.2
Ours         WAN2.2 + Phys4D           5B      0.014  0.107     0.214    0.122     25.6
Ours         Open-Sora V1.2 + Phys4D   1.1B    0.016  0.098     0.195    0.112     22.4
Open-source  CogVideoX                 5B      0.013  0.116     0.222    0.142     18.8
Open-source  WAN2.2                    5B      0.016  0.088     0.150    0.105     16.8
Open-source  Open-Sora V1.2            1.1B    0.021  0.072     0.135    0.092     14.5
Commercial   VideoPoet                 -       -      -         -        -         20.3
Commercial   Pika 1.0                  -       -      -         -        -         13.0
Commercial   Sora                      -       -      -         -        -         10.0

Key Contributions

1

Phys4D Framework: A physics-aware training framework for improving fine-grained physical consistency in video diffusion models, focusing on coherent geometry and motion over time.

2

Three-Stage Pipeline: A progressive training paradigm that injects physical structure into video diffusion models through pretraining, fine-tuning, and reinforcement learning.

3

Simulation Supervision: Leveraging physics-based simulation as a high-fidelity source of geometric, motion, and reward supervision for fine-grained physical alignment.

4

4D Consistency Metrics: A set of diagnostics that evaluate geometric coherence, motion stability, and long-horizon physical plausibility beyond appearance-based metrics.
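Two of the diagnostics above could be sketched as simple statistics over predicted flow and depth sequences. These definitions are illustrative assumptions, not the paper's metric formulas.

```python
import numpy as np

def motion_stability(flows):
    """Mean frame-to-frame change of the flow field (T, H, W, 2);
    lower values indicate temporally smoother, more stable motion."""
    diffs = np.diff(flows, axis=0)
    return float(np.linalg.norm(diffs, axis=-1).mean())

def geometric_coherence(depths, eps=1e-6):
    """Coefficient of variation of per-frame median depth (T, H, W);
    probes whether the scene's scale stays consistent over time."""
    med = np.median(depths.reshape(depths.shape[0], -1), axis=1)
    return float(med.std() / (med.mean() + eps))

flows = np.zeros((5, 4, 4, 2))   # perfectly static motion field
depths = np.ones((5, 4, 4))      # constant-depth scene
print(motion_stability(flows), geometric_coherence(depths))  # 0.0 0.0
```

Unlike appearance metrics such as FVD, both quantities are zero only when the underlying 4D structure is temporally consistent, independent of how the frames look.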

Simulation Data

We present a physics-based simulation pipeline supporting diverse object types and physical phenomena.

1.25M Videos

Generated across 9 physical categories including rigid bodies, fluids, soft bodies, thermodynamics, and more.

20,800 Hours

Total video content at 1920×1080 resolution and 60 FPS with over 15TB of multimodal annotations.

20,000+ Configs

Unique physical configurations with systematic randomization across material, environmental, and geometric attributes.
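Systematic randomization of this kind could be sketched as seeded sampling over attribute ranges. The attribute names and ranges below are hypothetical; the actual parameter space of the pipeline is not specified here.

```python
import random

# Hypothetical attribute space for illustration only.
MATERIALS = ["rigid", "fluid", "soft_body", "granular", "cloth"]

def sample_config(seed):
    """Draw one physical configuration deterministically from a seed."""
    rng = random.Random(seed)
    return {
        "material": rng.choice(MATERIALS),
        "density": round(rng.uniform(0.1, 10.0), 2),      # g/cm^3
        "friction": round(rng.uniform(0.0, 1.0), 2),
        "gravity": round(rng.uniform(8.0, 12.0), 2),      # m/s^2
        "object_scale": round(rng.uniform(0.5, 2.0), 2),
    }

configs = [sample_config(s) for s in range(20000)]
print(len(configs), sorted(configs[0]))
```

Seeding each configuration makes the dataset reproducible: the same seed always regenerates the same material, environmental, and geometric attributes.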

Supported Physical Object Types

Rigid Bodies · Articulated Structures · Garments · Fluids · Thermodynamics · Deformable Objects · Inflatables · Ropes · Granular Materials