
Meta V-JEPA 2 world model uses raw video to train robots


Meta today introduced V-JEPA 2, a 1.2-billion-parameter world model trained primarily on video to support understanding, prediction, and planning in robotic systems. Built on the Joint Embedding Predictive Architecture (JEPA), the model is designed to help robots and other “AI agents” navigate unfamiliar environments and tasks with limited domain-specific training.

V-JEPA 2 follows a two-stage training process, all without additional human annotation. In the first, self-supervised stage, the model learns from over 1 million hours of video and 1 million images, capturing patterns of physical interaction. The second stage introduces action-conditioned learning using a small set of robot control data (about 62 hours), allowing the model to factor in agent actions when predicting outcomes. This makes the model usable for planning and closed-loop control tasks.
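To make the two-stage recipe concrete, here is a minimal PyTorch sketch. The encoder, predictor, dimensions, and toy data are illustrative assumptions, not Meta's actual architecture or released code; the point is the flow: stage one predicts future video in latent space with no labels, stage two conditions that prediction on robot actions.

```python
# Simplified sketch of the two training stages (all names/sizes are assumptions).
import torch
import torch.nn as nn

D = 256  # latent dimension (assumed)

class Encoder(nn.Module):
    """Maps a video clip (B, T, C, H, W) to per-frame latent embeddings (B, T, D)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(start_dim=2), nn.LazyLinear(D))
    def forward(self, video):
        return self.backbone(video)

encoder = Encoder()
# In practice the target encoder is an EMA copy of the encoder; here it is a
# separate copy used without gradients, purely for brevity.
target_encoder = Encoder()
predictor = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))

# --- Stage 1: self-supervised prediction in latent space (no labels, no actions) ---
video = torch.randn(4, 8, 3, 16, 16)          # toy batch of 8-frame clips
context, target = video[:, :4], video[:, 4:]  # predict later frames from earlier ones
with torch.no_grad():
    target_z = target_encoder(target)
pred_z = predictor(encoder(context))
stage1_loss = nn.functional.mse_loss(pred_z, target_z)

# --- Stage 2: action-conditioned prediction on a small amount of robot data ---
action_dim = 7                                 # e.g. an arm command vector (assumed)
action_predictor = nn.Sequential(
    nn.Linear(D + action_dim, D), nn.GELU(), nn.Linear(D, D))
actions = torch.randn(4, 4, action_dim)        # one action per context frame
z = encoder(context)
pred_next = action_predictor(torch.cat([z, actions], dim=-1))
stage2_loss = nn.functional.mse_loss(pred_next, target_z)
print(stage1_loss.item(), stage2_loss.item())
```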

Meta said it has already tested the new model on robots in its labs. Meta reports that V-JEPA 2 performs well on common robotic manipulation tasks such as pick-and-place, using vision-based goal representations. For simpler, short-horizon tasks, the system generates candidate actions and evaluates them based on predicted outcomes. For tougher, longer-horizon tasks, such as picking up an object and placing it in the right spot, V-JEPA 2 uses a sequence of visual subgoals to guide behavior.
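The candidate-and-evaluate loop Meta describes resembles model-predictive control in latent space. The sketch below is a hypothetical illustration, not V-JEPA 2's API: it samples random candidate actions, predicts each outcome with an action-conditioned predictor, and picks the action whose predicted embedding lands closest to the goal image's embedding. In practice the sampling is usually refined iteratively (e.g., with the cross-entropy method) rather than done in a single pass.

```python
# Hedged sketch of vision-based, goal-conditioned action selection.
# Encoder, predictor, and all dimensions are stand-ins for illustration.
import torch
import torch.nn as nn

D, ACTION_DIM, NUM_CANDIDATES = 256, 7, 128

encoder = nn.Sequential(nn.Flatten(start_dim=1), nn.LazyLinear(D))     # image -> latent
predictor = nn.Linear(D + ACTION_DIM, D)                               # (latent, action) -> next latent

def plan_one_step(current_image, goal_image):
    """Pick the sampled action whose predicted outcome best matches the goal embedding."""
    with torch.no_grad():
        z = encoder(current_image.unsqueeze(0))        # (1, D)
        z_goal = encoder(goal_image.unsqueeze(0))      # (1, D)
        candidates = torch.randn(NUM_CANDIDATES, ACTION_DIM)
        z_pred = predictor(
            torch.cat([z.expand(NUM_CANDIDATES, -1), candidates], dim=-1))
        costs = (z_pred - z_goal).pow(2).sum(dim=-1)   # distance to goal in latent space
        return candidates[costs.argmin()]

best_action = plan_one_step(torch.randn(3, 64, 64), torch.randn(3, 64, 64))
print(best_action.shape)  # torch.Size([7])
```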

In internal tests, Meta said the model showed promising ability to generalize to new objects and settings, with success rates ranging from 65% to 80% on pick-and-place tasks in previously unseen environments.

“We believe world models will usher a new era for robotics, enabling real-world AI agents to help with chores and physical tasks without needing astronomical amounts of robotic training data,” said Meta’s chief AI scientist Yann LeCun.

Although V-JEPA 2 shows improvements over prior models, Meta AI said there remains a noticeable gap between model and human performance on these benchmarks. Meta suggests this points to the need for models that can operate across multiple timescales and modalities, such as incorporating audio or tactile information.

To assess progress in physical understanding from video, Meta is also releasing the following three benchmarks:

IntPhys 2: evaluates the model’s ability to distinguish between physically plausible and implausible scenarios.
MVPBench: tests whether models rely on genuine understanding rather than dataset shortcuts in video question-answering.
CausalVQA: examines reasoning about cause-and-effect, anticipation, and counterfactuals.

The V-JEPA 2 code and model checkpoints are available for commercial and research use, with Meta aiming to encourage broader exploration of world models in robotics and embodied AI.

Meta joins other tech leaders in developing world models. Google DeepMind has been building its own version, Genie, which can simulate entire 3D environments. And World Labs, a startup founded by Fei-Fei Li, raised $230 million to build large world models.


