NVIDIA Cosmos: a system for generating physically accurate simulations for AI

17:02
10.01.2025
TechDed

This article is based on an analysis of publicly available information about the NVIDIA Cosmos platform, including official announcements and technical blogs.

At CES 2025, NVIDIA introduced the revolutionary Cosmos platform, which promises to radically change the approach to developing artificial intelligence systems that interact with the physical world. The platform has already attracted the attention of key market players such as Uber, Waabi, and XPENG, which suggests serious potential. In this article, I try to work out why Cosmos is generating such interest and what opportunities it opens up for developers.

What is NVIDIA Cosmos?

NVIDIA Cosmos is a fundamentally new approach to creating physical AI systems. Unlike large language models trained on textual data, Cosmos is trained to understand the physical world through video data analysis. The platform is based on a massive dataset of 20 million hours of video recordings (equivalent to 9,000 trillion tokens) containing various physical interactions: from simple human movements to complex object manipulations.
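To get a feel for the scale, the two dataset figures above can be combined into an average token density. This is plain arithmetic on the numbers quoted in the article, not a metric published by NVIDIA, and the variable names are illustrative.

```python
# Figures quoted in the article (not independently measured).
VIDEO_HOURS = 20_000_000        # 20 million hours of video
TOTAL_TOKENS = 9_000 * 10**12   # 9,000 trillion tokens

# Average number of tokens that one hour / one second of video turns into.
tokens_per_hour = TOTAL_TOKENS // VIDEO_HOURS
tokens_per_second = tokens_per_hour // 3600

print(f"{tokens_per_hour:,} tokens per hour of video")      # 450,000,000
print(f"{tokens_per_second:,} tokens per second of video")  # 125,000
```

In other words, the quoted figures imply roughly 125 thousand tokens for every second of footage, averaged over the whole dataset.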

The core of the platform is a set of World Foundation Models (WFMs): foundation models for understanding the physical world. They serve as a base on which more specialized artificial intelligence systems are built. Just as a person learns from childhood how the physical world around them works, these models are trained to recognize and predict how objects interact in the real world.

The platform includes two types of models:

Diffusion WFMs work like an artist who starts with a sketch and gradually adds details until a clear image is obtained. In the case of video, the model gradually creates an increasingly clear and realistic sequence of frames, being able to generate physically correct video from simple input data.
Autoregressive WFMs act like an experienced observer who, having watched the beginning of the video, can predict what will happen next. The model learns to understand the logic of what is happening and to continue the video sequence naturally, specializing in predicting future frames in the video sequence.
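The difference between the two generation styles can be sketched with a toy example. This is not Cosmos code: each "frame" is reduced to a single number, and `toy_denoiser` and `toy_predictor` are hypothetical stand-ins for the learned neural networks. In particular, the `target` argument stands in for what a trained denoiser has learned about plausible video; a real model never sees the answer directly.

```python
import random

def toy_denoiser(noisy, target, strength):
    """Stand-in for a learned denoiser: nudges every frame toward the target."""
    return [n + strength * (t - n) for n, t in zip(noisy, target)]

def diffusion_generate(target, steps=10, seed=0):
    """Diffusion style: start from pure noise, refine the WHOLE clip each step."""
    rng = random.Random(seed)
    clip = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(steps):
        clip = toy_denoiser(clip, target, strength=0.5)
    return clip

def toy_predictor(history):
    """Stand-in for a learned predictor: linear extrapolation of the last two frames."""
    return 2 * history[-1] - history[-2]

def autoregressive_generate(context, n_future):
    """Autoregressive style: append ONE frame at a time, feeding predictions back in."""
    frames = list(context)
    for _ in range(n_future):
        frames.append(toy_predictor(frames))
    return frames[len(context):]

# The diffusion loop converges on the whole clip at once ...
print(diffusion_generate([0.0, 1.0, 2.0, 3.0]))
# ... while the autoregressive loop continues an observed prefix.
print(autoregressive_generate([0.0, 1.0], n_future=3))  # [2.0, 3.0, 4.0]
```

The structural point is the shape of the loops: the diffusion loop iterates over refinement steps and touches every frame each time, while the autoregressive loop iterates over time and only ever adds the next frame.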

NVIDIA paid special attention to performance: new tokenizers provide 8-fold compression improvement and 12-fold processing acceleration compared to existing methods. In practice, this means that developers can iterate their solutions much faster and use computing resources more efficiently.

Practical Application

Robotics and Manufacturing

One of the most promising areas of application for Cosmos is robotics. In modern warehouses, robots have to handle a wide variety of goods, and training a robot to grasp different objects has traditionally taken months of real-world experiments. With Cosmos, this process is radically simplified: developers can create thousands of virtual scenarios in which the robot learns to interact with objects of different shapes, sizes, and physical properties.

For example, companies like 1X, Agility Robotics, and XPENG are already using Cosmos to train manipulator robots. In a virtual environment, robots can "gain" experience equivalent to years of real practice in just a few days of simulation. At the same time, all errors and potential damage occur only in virtual space, which significantly reduces development costs.

Autonomous Transport

In the field of autonomous transport, Cosmos addresses the critical problem of collecting data on rare and dangerous situations. Traditionally, self-driving vehicles needed millions of kilometers of real-world testing to learn how to behave in unusual conditions. Now developers can create and test such scenarios in a virtual environment.

Industrial Automation

In industry, Cosmos is used to optimize production processes by creating digital twins of production lines. This allows testing various automation scenarios without risk to real equipment. This is especially effective when combined with NVIDIA Omniverse, creating a complete environment for virtual testing and optimization.

Technical Features and Safety

An important aspect of Cosmos is its attention to safety. The platform includes the Cosmos Guardrails system, which works in two stages:

Pre-guard: scans incoming requests (prompts) for unsafe content using blocklist checks and a specially tuned Aegis AI Content Safety model, and rejects harmful requests.
Post-guard: evaluates the generated video frame by frame, rejecting unsafe videos. To protect privacy and reduce bias, the system automatically blurs people's faces. Additionally, videos generated through the NVIDIA catalog API contain invisible watermarks to identify AI-generated content.
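The two-stage flow described above can be sketched as a small pipeline. This is a minimal illustration of the control flow only, not the Cosmos Guardrails implementation: `BLOCKLIST`, `GuardResult`, and every function name are hypothetical, frames are represented as plain strings, and the classifier, frame checker, and face blurring are passed in as stand-in callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical blocklist; the real guardrail lists are not reproduced here.
BLOCKLIST = {"forbidden_term"}

@dataclass
class GuardResult:
    allowed: bool
    reason: Optional[str] = None

def pre_guard(prompt: str, classifier: Callable[[str], bool]) -> GuardResult:
    """Stage 1: blocklist check, then a content-safety classifier on the prompt."""
    if set(prompt.lower().split()) & BLOCKLIST:
        return GuardResult(False, "blocklist match")
    if not classifier(prompt):
        return GuardResult(False, "classifier flagged prompt")
    return GuardResult(True)

def post_guard(frames: List[str],
               frame_is_safe: Callable[[str], bool],
               blur_faces: Callable[[str], str]) -> Optional[List[str]]:
    """Stage 2: reject the video if any frame is unsafe, otherwise blur faces."""
    if not all(frame_is_safe(f) for f in frames):
        return None
    return [blur_faces(f) for f in frames]

def guarded_generate(prompt, generate, classifier, frame_is_safe, blur_faces):
    """Run generation only if the prompt passes, then filter the output."""
    if not pre_guard(prompt, classifier).allowed:
        return None
    return post_guard(generate(prompt), frame_is_safe, blur_faces)

# Usage with trivial stand-ins for the generator and the safety models.
video = guarded_generate(
    "a robot arm stacking boxes",
    generate=lambda p: ["frame0", "frame1"],
    classifier=lambda p: True,
    frame_is_safe=lambda f: True,
    blur_faces=lambda f: f + ":blurred",
)
print(video)  # ['frame0:blurred', 'frame1:blurred']
```

Running the cheap prompt check first means that a rejected request never reaches the expensive video generation step.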

Accelerated data processing is another key advantage of the platform. On NVIDIA H100 GPUs (Hopper architecture), 20 million hours of video can be processed in just 40 days, and on the latest NVIDIA Blackwell GPUs the figure drops to 14 days. By comparison, the same processing on CPUs would take more than three years. This dramatic acceleration means that companies can develop and test new solutions much faster, significantly reducing time to market.
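The processing times above translate into throughput and speedup figures with some simple arithmetic. The sketch below uses only the numbers quoted in the article; since the CPU baseline is given as "more than three years", the speedups are lower bounds, and the size of the GPU and CPU clusters being compared is not stated, so the ratios describe whole pipelines rather than individual chips.

```python
VIDEO_HOURS = 20_000_000   # dataset size quoted in the article
CPU_DAYS = 3 * 365         # "more than three years", so a lower bound

def throughput_hours_per_day(days: float) -> float:
    """Hours of video processed per day of wall-clock time."""
    return VIDEO_HOURS / days

def speedup_vs_cpu(days: float) -> float:
    """How many times faster than the (at least) three-year CPU pipeline."""
    return CPU_DAYS / days

for platform, days in {"Hopper (H100)": 40, "Blackwell": 14}.items():
    print(f"{platform}: {throughput_hours_per_day(days):,.0f} video hours/day, "
          f"at least {speedup_vs_cpu(days):.1f}x faster than CPU")
```

This works out to 500,000 hours of video per day on Hopper (at least 27x faster than CPU) and about 1.4 million hours per day on Blackwell (at least 78x faster).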

Available Models

At launch (January 2025), the following models are available through Hugging Face and the NVIDIA NGC catalog:

Diffusion Models (Diffusion WFMs)

Cosmos-1.0-Diffusion-7B-Text2World: a base model with 7 billion parameters for generating video from text, suitable for quick idea testing;
Cosmos-1.0-Diffusion-14B-Text2World: an extended version for more accurate generation;
Cosmos-1.0-Diffusion-7B-Video2World: a 7B model for continuing video from the first frame;
Cosmos-1.0-Diffusion-14B-Video2World: an improved version for predicting scene development.

Autoregressive Models (Autoregressive WFMs)

Cosmos-1.0-Autoregressive-4B: a base model for predicting the next frames;
Cosmos-1.0-Autoregressive-5B-Video2World: a model that adds support for text conditioning;
Cosmos-1.0-Autoregressive-12B: an extended version for complex scenarios;
Cosmos-1.0-Autoregressive-13B-Video2World: an advanced model for predicting future frames from video with text conditioning.

It is important to note that the smaller models (4B to 7B) are suitable for basic tasks and quick idea testing, while the larger models (12B to 14B) provide higher accuracy and are suitable for complex use cases.
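The choice between these eight models can be written down as a small lookup. The model names are taken from the lists above; the `MODELS` table, its keys ("text", "video", "video+text"), and the `pick_model` helper are hypothetical conveniences for illustration and are not part of any NVIDIA API.

```python
# (family, input type) -> [smaller model, larger model], as listed in the article.
MODELS = {
    ("diffusion", "text"): [
        "Cosmos-1.0-Diffusion-7B-Text2World",
        "Cosmos-1.0-Diffusion-14B-Text2World",
    ],
    ("diffusion", "video"): [
        "Cosmos-1.0-Diffusion-7B-Video2World",
        "Cosmos-1.0-Diffusion-14B-Video2World",
    ],
    ("autoregressive", "video"): [
        "Cosmos-1.0-Autoregressive-4B",
        "Cosmos-1.0-Autoregressive-12B",
    ],
    ("autoregressive", "video+text"): [
        "Cosmos-1.0-Autoregressive-5B-Video2World",
        "Cosmos-1.0-Autoregressive-13B-Video2World",
    ],
}

def pick_model(family: str, input_type: str, complex_task: bool = False) -> str:
    """Return the smaller model for quick tests, the larger one for complex tasks."""
    small, large = MODELS[(family, input_type)]
    return large if complex_task else small

print(pick_model("diffusion", "text"))
print(pick_model("autoregressive", "video+text", complex_task=True))
```

The first call returns the 7B text-to-world model for a quick experiment, and the second returns the 13B autoregressive model for a more demanding, text-conditioned prediction task.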

Performance and Scale

The training process of Cosmos models is impressive in its scale:

10,000 NVIDIA H100 GPUs used;
Training time was three months;
20 million hours of video data processed (9,000 trillion tokens).

For example, Cosmos-1.0-Autoregressive-4B on eight NVIDIA H100 GPUs can process 9 input frames (0.9 seconds, 1280 tokens) and generate 24 future frames (2.4 seconds, 1920 tokens) at a speed of 806 tokens per second, completing the task in just 2.38 seconds.
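The figures in this example are consistent with each other, which is easy to check. The sketch below uses only the numbers quoted above; the derived quantities (tokens per frame, frame rate, real-time factor) are my own arithmetic rather than published specifications.

```python
# Figures quoted in the article for Cosmos-1.0-Autoregressive-4B on eight H100 GPUs.
OUTPUT_FRAMES = 24
OUTPUT_SECONDS = 2.4
OUTPUT_TOKENS = 1920
TOKENS_PER_SECOND = 806

generation_time = OUTPUT_TOKENS / TOKENS_PER_SECOND   # wall-clock seconds to generate
tokens_per_frame = OUTPUT_TOKENS / OUTPUT_FRAMES      # tokens per generated frame
frames_per_second = OUTPUT_FRAMES / OUTPUT_SECONDS    # frame rate of the clip
realtime_factor = OUTPUT_SECONDS / generation_time    # > 1 means faster than playback

print(f"generation time:  {generation_time:.2f} s")   # 2.38 s
print(f"tokens per frame: {tokens_per_frame:.0f}")    # 80
print(f"clip frame rate:  {frames_per_second:.0f} fps")
print(f"real-time factor: {realtime_factor:.2f}")
```

The quoted 2.38 seconds is simply 1920 output tokens divided by 806 tokens per second. Since the clip itself lasts 2.4 seconds, generation in this configuration runs at approximately real-time speed.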

Development Prospects

The integration of Cosmos with other NVIDIA technologies, especially the Omniverse platform, opens up unique opportunities for creating full-fledged virtual testing environments. This allows physical AI systems to be trained in the most realistic conditions without the risks and costs associated with real experiments.

Conclusion

NVIDIA Cosmos represents a significant breakthrough in the development of physical AI, making the process of developing robots and autonomous systems faster, safer, and more efficient. The combination of powerful pre-trained models, optimized data processing, and strict attention to safety creates a reliable foundation for a new generation of AI systems capable of effectively interacting with the physical world.

Currently, NVIDIA offers the opportunity to test Cosmos models online through an interactive interface.

For example, you can try Cosmos-1.0-autoregressive-5b to generate future frames based on uploaded video or Cosmos-1.0-diffusion-7b to create videos from text descriptions. Each model can handle up to 20 requests, and generating one video takes about 60 seconds.

Interestingly, as noted by Reddit users, NVIDIA may have even more powerful models that the company is currently keeping for internal use. Considering that Cosmos is provided with an open licensing model, this may mean that we are only seeing the tip of the iceberg of this technology's capabilities.