A three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations.
Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, providing the model with an initial capability for 4D scene generation. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a suite of 4D world-consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance.
Phys4D converts a pretrained video diffusion model into a physics-consistent 4D world model via a three-stage training paradigm.
We bootstrap geometry and motion representations using large-scale pseudo-supervision from off-the-shelf depth and optical flow estimators. The DiT backbone is frozen while auxiliary heads learn to predict depth maps and motion fields.
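The exact pretraining objective is not spelled out on this page. As an illustration of why pseudo-supervision from off-the-shelf monocular depth estimators needs care: such pseudo-labels are typically only defined up to an affine transform, so a scale-and-shift-invariant alignment loss (in the style of MiDaS) is a natural choice for training the auxiliary depth head. A minimal NumPy sketch, with hypothetical function names and no claim that this is the loss Phys4D actually uses:

```python
import numpy as np

def ssi_depth_loss(pred, target):
    """Scale-and-shift-invariant depth loss (a common choice for
    pseudo-depth supervision, assumed here for illustration).

    Monocular pseudo-labels are only defined up to an affine
    transform, so we align `pred` to `target` with a closed-form
    least-squares scale/shift before measuring the residual.
    """
    p = pred.ravel()
    t = target.ravel()
    # Solve min_{s,b} ||s * p + b - t||^2 via least squares.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, t, rcond=None)
    # Mean absolute residual after affine alignment.
    return float(np.mean(np.abs(s * p + b - t)))
```

Because the alignment is solved in closed form, a prediction that matches the pseudo-label up to any scale and shift incurs (near-)zero loss, which is exactly the invariance the pseudo-labels demand.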
We selectively fine-tune the model on physics-simulation data with ground-truth annotations. A warp consistency loss couples the depth and motion predictions to enforce temporally coherent 3D structure and physically plausible dynamics.
We apply reinforcement learning with physics simulator rewards to correct residual physical violations. The reward is based on 4D Chamfer Distance between generated and physically valid object trajectories.
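A 4D Chamfer distance can be computed as the per-frame symmetric Chamfer distance between the generated and the physically valid point-cloud trajectories, averaged over time; the reward would then be, e.g., the negative of this distance. A minimal NumPy sketch under those assumptions:

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a: (N, 3), b: (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Nearest-neighbor distance in both directions, summed.
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def chamfer_4d(traj_a, traj_b):
    """4D Chamfer distance: per-frame Chamfer distance between two
    point-cloud trajectories of equal length, averaged over time.

    traj_a, traj_b: lists of (N, 3) / (M, 3) arrays, one per frame.
    """
    assert len(traj_a) == len(traj_b)
    return float(np.mean([chamfer(a, b) for a, b in zip(traj_a, traj_b)]))
```

Because the distance aggregates over frames, a trajectory that is geometrically correct in every frame but drifts from the simulator's dynamics over time still accrues a penalty, which is what lets the reward target long-horizon violations.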
Phys4D generates videos with physically consistent depth maps and motion fields, capturing coherent 4D scene evolution.
Demonstrating temporally consistent depth and motion prediction with physically plausible object dynamics.
Showcasing geometry-motion consistency with accurate depth warping across frames.
Illustrating fine-grained physical consistency in complex multi-object interactions.
Phys4D consistently improves physics understanding across multiple backbone architectures on the Physics-IQ benchmark.
| Group | Model | Params | MSE ↓ | ST-IoU ↑ | S-IoU ↑ | WS-IoU ↑ | Score (%) ↑ |
|---|---|---|---|---|---|---|---|
| Ours | CogVideoX + Phys4D | 5B | 0.009 | 0.169 | 0.252 | 0.157 | 30.2 |
| Ours | WAN2.2 + Phys4D | 5B | 0.014 | 0.107 | 0.214 | 0.122 | 25.6 |
| Ours | Open-Sora V1.2 + Phys4D | 1.1B | 0.016 | 0.098 | 0.195 | 0.112 | 22.4 |
| Open-source | CogVideoX | 5B | 0.013 | 0.116 | 0.222 | 0.142 | 18.8 |
| Open-source | WAN2.2 | 5B | 0.016 | 0.088 | 0.150 | 0.105 | 16.8 |
| Open-source | Open-Sora V1.2 | 1.1B | 0.021 | 0.072 | 0.135 | 0.092 | 14.5 |
| Commercial | VideoPoet | - | - | - | - | - | 20.3 |
| Commercial | Pika 1.0 | - | - | - | - | - | 13.0 |
| Commercial | Sora | - | - | - | - | - | 10.0 |
Phys4D Framework: A physics-aware training framework for improving fine-grained physical consistency in video diffusion models, focusing on coherent geometry and motion over time.
Three-Stage Pipeline: A progressive training paradigm that injects physical structure into video diffusion models through pretraining, fine-tuning, and reinforcement learning.
Simulation Supervision: Leveraging physics-based simulation as a high-fidelity source of geometric, motion, and reward supervision for fine-grained physical alignment.
4D Consistency Metrics: A set of diagnostics that evaluate geometric coherence, motion stability, and long-horizon physical plausibility beyond appearance-based metrics.
We present a physics-based simulation pipeline supporting diverse object types and physical phenomena.
Generated across 9 physical categories including rigid bodies, fluids, soft bodies, thermodynamics, and more.
Total video content rendered at 1920×1080 resolution and 60 FPS, with over 15 TB of multimodal annotations.
Unique physical configurations with systematic randomization across material, environmental, and geometric attributes.
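A seeded-randomization sketch of how such configurations might be drawn; the attribute names and ranges below are hypothetical placeholders, not the simulator's actual parameters:

```python
import random

# Hypothetical attribute pools/ranges for illustration only; the real
# simulator's material, environmental, and geometric parameters are
# not specified on this page.
MATERIALS = ["rubber", "steel", "glass", "cloth", "water"]

def sample_config(seed):
    """Draw one randomized physical configuration from a seeded RNG,
    so every generated scene is reproducible from its seed."""
    rng = random.Random(seed)
    return {
        "material": rng.choice(MATERIALS),          # material attribute
        "friction": rng.uniform(0.1, 1.0),          # environmental
        "gravity": rng.uniform(9.0, 10.0),          # environmental
        "object_scale": rng.uniform(0.5, 2.0),      # geometric
    }
```

Seeding per configuration keeps the randomization systematic: the same seed always reproduces the same scene, while sweeping seeds covers the attribute space.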