MORPHEUS: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments

1University of Trento, Italy
2University of Amsterdam, the Netherlands
3Archimedes, Athena Research Center, Greece
*Equal contribution of first authors
Equal contribution of second authors

Summary

As the development of Video Generation Models (VGMs) progresses, many are claimed to possess world modeling capabilities—the ability to generate arbitrary realistic videos. These models have significant potential for applications in fields such as robotics and autonomous driving as well as in scientific simulation and medical data augmentation. However, a critical question arises: to what extent do these models adhere to physical laws? To address this, we introduce MORPHEUS, a novel benchmark designed to evaluate the intrinsic physical reasoning capabilities of VGMs. MORPHEUS provides a curated dataset comprising 80 real-world videos of experiments that capture physical phenomena, guided by specific physical invariances such as energy or momentum conservation. These invariances reveal the specific physical phenomena missed by the models, enabling a fine-grained evaluation and forming the foundation of our interpretable Physical Score measure. We also propose a Statistical Score, based on Physics-Informed Neural Networks (PINNs), to provide a complementary evaluation across a broader range of physical scenarios. Our findings reveal that even with advanced prompting techniques, such as multi-frame prompting and enhanced textual descriptions, current VGMs demonstrate substantial limitations in their ability to model and understand physical phenomena.


Morpheus


The MORPHEUS benchmark. Physical evaluation of VGMs on the holonomic pendulum experiment, using the Dynamical and Physical Invariance scores computed from the extracted trajectories.
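To make the idea of an invariance check on an extracted trajectory concrete, here is a minimal sketch for an ideal pendulum. It is not the benchmark's Physical Invariance Score. It assumes the trajectory is already available as a sequence of angles (in radians) sampled at a fixed time step, and it ignores friction and tracking noise. The function names `pendulum_energy` and `energy_drift`, the `length` parameter, and the choice of standard deviation over mean as the drift measure are all illustrative assumptions.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2


def pendulum_energy(theta, dt, length=1.0):
    """Mechanical energy per unit mass of an ideal pendulum at each sample.

    theta: angles in radians, sampled every dt seconds (assumed input format).
    length: pendulum length in metres (hypothetical default).
    """
    theta = np.asarray(theta, dtype=float)
    omega = np.gradient(theta, dt)  # angular velocity via finite differences
    kinetic = 0.5 * (length * omega) ** 2
    potential = G * length * (1.0 - np.cos(theta))
    return kinetic + potential


def energy_drift(theta, dt, length=1.0):
    """Relative spread of the energy along the trajectory.

    A value near 0 means energy is approximately conserved; larger values
    mean the motion gains or loses energy in a way an ideal pendulum would not.
    """
    energy = pendulum_energy(theta, dt, length)
    mean = np.mean(energy)
    # A pendulum at rest at the bottom has zero energy throughout.
    return float(np.std(energy) / mean) if mean > 0 else 0.0
```

Under these assumptions, a small-amplitude swing at the natural frequency yields a drift close to zero, while a trajectory whose amplitude grows over time yields a clearly larger value. A real evaluation would also have to account for damping and for errors in trajectory extraction.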

Video examples

Falling Ball

Bouncing Ball

Projectile

Holonomic Pendulum

Non-holonomic Pendulum

Double Pendulum


MORPHEUS

real-world | single frame | multi frame | keyframe interpolation
Model Name | Experiment Type | Conditioning | Prompt Type | Discard Rate | Physical Invariance Score | Dynamical Score

Results

Discarded videos

An example of the disappearance of the object (the projectile ball). Model: PyramidalFlow, multi-frame conditioning, enhanced text prompt.

An example of the duplication of the object (non-holonomic pendulum). Model: COSMOS, multi-frame conditioning, plain text prompt.

An example of the stillness of the object (double pendulum). Model: LTX, single-frame conditioning, plain text prompt.