MORPHEUS

Comparison of evaluation methods for video generative models. Human or VLM-based judgments provide only qualitative and subjective assessments of physical plausibility. Trajectory matching compares generated and ground-truth paths but may misclassify physically valid trajectories. For example, many projectile parabolas remain plausible when a VGM is conditioned on a single image with unknown velocity. Morpheus instead evaluates generations with physics-informed scores that test conservation of invariants and consistency with governing equations of motion.

Recent advances in image and video generation raise hopes that these models possess world modeling capabilities, that is, the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: Do they adhere to physical laws? Current evaluation methods rely on subjective judgments or trajectory matching, which limits their usefulness for assessing physical reasoning, because many different generations can be physically plausible for the same conditioning input.

We introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It comprises 130 real-world videos of physical phenomena governed by conservation laws. Using these videos as conditioning for video generation, we assess physical plausibility with physics-informed metrics, evaluated against the conservation laws known to hold in each physical setting, leveraging advances in physics-informed neural networks and vision-language foundation models.
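To make the idea of a conservation-based check concrete, here is a minimal sketch for the projectile case. It is not the Morpheus metric: the function names, the finite-difference velocity estimate, and the score formula `1 / (1 + std / scale)` are all assumptions chosen for illustration. It shows that two parabolas with different launch velocities follow different paths yet both conserve mechanical energy, while a trajectory that ignores gravity does not.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def simulate_projectile(vx, vy, dt=0.01, steps=100):
    """Analytic projectile positions (x, y), sampled every dt seconds."""
    return [(vx * k * dt, vy * k * dt - 0.5 * G * (k * dt) ** 2)
            for k in range(steps)]


def energy_per_mass(traj, dt):
    """Mechanical energy per unit mass at interior samples.

    Velocities are estimated with central differences, as one might do
    with object positions tracked in a video.
    """
    energies = []
    for k in range(1, len(traj) - 1):
        vx = (traj[k + 1][0] - traj[k - 1][0]) / (2 * dt)
        vy = (traj[k + 1][1] - traj[k - 1][1]) / (2 * dt)
        energies.append(0.5 * (vx ** 2 + vy ** 2) + G * traj[k][1])
    return energies


def invariance_score(traj, dt):
    """Hypothetical score in (0, 1]; 1 means energy is exactly constant."""
    e = energy_per_mass(traj, dt)
    mean = sum(e) / len(e)
    std = math.sqrt(sum((x - mean) ** 2 for x in e) / len(e))
    scale = max(abs(x) for x in e) or 1.0  # avoid division by zero
    return 1.0 / (1.0 + std / scale)


dt = 0.01
slow = simulate_projectile(vx=2.0, vy=5.0, dt=dt)
fast = simulate_projectile(vx=4.0, vy=8.0, dt=dt)
# Same launch velocity as `slow`, but gravity is ignored (straight line).
no_gravity = [(2.0 * k * dt, 5.0 * k * dt) for k in range(100)]

for name, traj in [("slow", slow), ("fast", fast), ("no_gravity", no_gravity)]:
    print(name, round(invariance_score(traj, dt), 3))
```

Trajectory matching against either parabola would penalize the other, whereas the invariance check accepts both and flags only the gravity-free motion.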

Aggregated scores across all physical experiments (higher is better). For each model, we report the better of the scores obtained with the enhanced and plain versions of the textual prompt, so that each model is represented by its best ability. For comparison, real-world videos obtain Dynamical scores in the range of 0.98-0.99 and Physical Invariance scores above 0.93.


Discarded videos

An example of the disappearance of the object (the projectile ball). Model: PyramidalFlow, multi-frame conditioning, enhanced text prompt.

An example of the duplication of the object (non-holonomic pendulum). Model: COSMOS, multi-frame conditioning, plain text prompt.

An example of the stillness of the object (double pendulum). Model: LTX, single-frame conditioning, plain text prompt.

An analysis of the high discard rates shows that the major causes are the absence of motion (i.e., stillness) and the presence of duplicate objects, followed, to a lesser extent, by the disappearance of the object from the video.


MORPHEUS: Benchmarking Physical Reasoning of Video Generative Models with Real Physical Experiments
