Recent advances in image and video generation raise hopes that these models possess world modeling capabilities, that is, the ability to generate realistic, physically plausible videos. This could revolutionize applications in robotics, autonomous driving, and scientific simulation. However, before treating these models as world models, we must ask: do they adhere to physical laws? Current evaluation methods rely on subjective judgments or trajectory matching, which limits their usefulness for assessing physical reasoning, because many different generations can be physically plausible.
We introduce Morpheus, a benchmark for evaluating video generation models on physical reasoning. It features 130 real-world videos capturing physical phenomena governed by conservation laws. Using these videos as conditioning for video generation, we assess physical plausibility with physics-informed metrics, evaluated against the infallible conservation laws known for each physical setting and leveraging advances in physics-informed neural networks and vision-language foundation models.
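To make the idea of scoring against a conservation law concrete, here is a minimal sketch. It is not the Morpheus metric: the function name `energy_drift`, the choice of a projectile, and the normalization are all assumptions made for illustration. It measures how far the total mechanical energy of a tracked object drifts from its initial value.

```python
G = 9.81  # gravitational acceleration in m/s^2

def energy_drift(ys, vxs, vys, mass=1.0):
    """Largest relative deviation of total mechanical energy from its initial value.

    Hypothetical helper for illustration; assumes the initial energy is non-zero.
    """
    totals = [
        0.5 * mass * (vx * vx + vy * vy) + mass * G * y  # kinetic + potential
        for y, vx, vy in zip(ys, vxs, vys)
    ]
    e0 = totals[0]
    return max(abs(e - e0) for e in totals) / abs(e0)

# Synthetic trajectories sampled at 50 time steps over one second (made-up data).
ts = [i / 49 for i in range(50)]
vx0, vy0 = 3.0, 4.0

# An ideal projectile conserves energy, so the drift is at rounding-error level.
ideal = energy_drift(
    [vy0 * t - 0.5 * G * t * t for t in ts],
    [vx0] * len(ts),
    [vy0 - G * t for t in ts],
)

# An object that keeps rising at constant velocity ignores gravity and gains energy.
floating = energy_drift(
    [vy0 * t for t in ts],
    [vx0] * len(ts),
    [vy0] * len(ts),
)
```

A low drift indicates a trajectory consistent with energy conservation, whatever the specific trajectory is. This is why a conservation-based check can accept many different plausible generations, unlike trajectory matching.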
Aggregated scores across all physical experiments (higher is better). For each model, we report the better of the scores obtained with the enhanced and plain versions of the text prompt, so that each model is represented at its best. For comparison, real-world videos obtain Dynamical scores in the range of 0.98-0.99 and Physical Invariance scores above 0.93.
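The best-of-two-prompts selection can be sketched as follows. The function name, the dictionary layout, and the numbers are placeholders invented for illustration, not values from the benchmark.

```python
def best_prompt_score(scores_by_prompt):
    """Return the winning prompt variant and its score (higher is better)."""
    variant = max(scores_by_prompt, key=scores_by_prompt.get)
    return variant, scores_by_prompt[variant]

# Placeholder scores for one hypothetical model; not real benchmark results.
example_scores = {"plain": 0.61, "enhanced": 0.67}
variant, score = best_prompt_score(example_scores)
```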
An example of the disappearance of the object (the projectile ball). Model: PyramidalFlow, multi-frame conditioning, enhanced text prompt.
An example of the duplication of the object (non-holonomic pendulum). Model: COSMOS, multi-frame conditioning, plain text prompt.
An example of the stillness of the object (double pendulum). Model: LTX, single-frame conditioning, plain text prompt.
An analysis of the main reasons behind high discard rates points to the absence of motion (i.e., stillness) and the presence of duplicate objects and, to a lesser extent, the disappearance of the object from the video.
The relative difference between the scores on the original lab videos and on the augmented, realistic settings. The augmented realistic settings appear to be much harder to predict.
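The page does not spell out how the relative difference is computed. One common convention, assumed here, normalizes the change by the score on the original lab videos, so a negative value means the augmented setting scored lower. The function name and the example numbers are hypothetical.

```python
def relative_difference(original, augmented):
    """(augmented - original) / original; assumed convention, not confirmed by the source."""
    if original == 0:
        raise ValueError("original score must be non-zero")
    return (augmented - original) / original

# Placeholder scores, not benchmark results: a drop from 0.8 to 0.6 is a 25% decrease.
drop = relative_difference(0.8, 0.6)
```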