Cosmos

Cosmos World Foundation Model Platform for Physical AI by NVIDIA.

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

Loading original format checkpoints

Original-format checkpoints that have not been converted to the layout diffusers expects can be loaded with the from_single_file method.

import torch
from diffusers import Cosmos2TextToImagePipeline, CosmosTransformer3DModel

model_id = "nvidia/Cosmos-Predict2-2B-Text2Image"
transformer = CosmosTransformer3DModel.from_single_file(
    "https://huggingface.co/nvidia/Cosmos-Predict2-2B-Text2Image/blob/main/model.pt",
    torch_dtype=torch.bfloat16,
).to("cuda")
pipe = Cosmos2TextToImagePipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."

output = pipe(
    prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
).images[0]
output.save("output.png")
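The URL passed to from_single_file above points at one file inside a Hub repository. As a rough illustration of what such a URL encodes, the sketch below splits it into a repository id and a file name using only the standard library. split_hub_url is a hypothetical helper written for this example, not part of diffusers, and it assumes the usual `<org>/<repo>/<blob|resolve>/<revision>/<path>` layout with a revision that contains no slashes.

```python
from urllib.parse import urlparse


def split_hub_url(url: str) -> tuple[str, str]:
    """Hypothetical helper: split a Hub file URL into (repo_id, filename)."""
    parts = urlparse(url).path.strip("/").split("/")
    # Assumed layout: <org>/<repo>/<blob|resolve>/<revision>/<path...>
    if len(parts) < 5 or parts[2] not in ("blob", "resolve"):
        raise ValueError(f"not a Hub file URL: {url}")
    repo_id = "/".join(parts[:2])
    filename = "/".join(parts[4:])
    return repo_id, filename


url = "https://huggingface.co/nvidia/Cosmos-Predict2-2B-Text2Image/blob/main/model.pt"
repo_id, filename = split_hub_url(url)
print(repo_id, filename)
```

Here the repository id matches the model_id used for from_pretrained, so the transformer weights and the remaining pipeline components come from the same repository.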

CosmosTextToWorldPipeline

class diffusers.CosmosTextToWorldPipeline

(
    text_encoder: T5EncoderModel,
    tokenizer: T5TokenizerFast,
    transformer: CosmosTransformer3DModel,
    vae: AutoencoderKLCosmos,
    scheduler: EDMEulerScheduler,
    safety_checker: CosmosSafetyChecker = None
)

Pipeline for text-to-world generation using Cosmos Predict1.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

(
    prompt: typing.Union[str, typing.List[str]] = None,
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None,
    height: int = 704,
    width: int = 1280,
    num_frames: int = 121,
    num_inference_steps: int = 36,
    guidance_scale: float = 7.0,
    fps: int = 30,
    num_videos_per_prompt: typing.Optional[int] = 1,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    latents: typing.Optional[torch.Tensor] = None,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    output_type: typing.Optional[str] = 'pil',
    return_dict: bool = True,
    callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None,
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'],
    max_sequence_length: int = 512
) → ~CosmosPipelineOutput or tuple

Returns

~CosmosPipelineOutput or tuple

If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

import torch
from diffusers import CosmosTextToWorldPipeline
from diffusers.utils import export_to_video

model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"
pipe = CosmosTextToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

output = pipe(prompt=prompt).frames[0]
export_to_video(output, "output.mp4", fps=30)
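The num_frames and fps defaults in the signatures on this page determine how long a generated clip plays. The short calculation below works that out for the CosmosTextToWorldPipeline defaults (121 frames at 30 fps) and the Cosmos2VideoToWorldPipeline defaults (93 frames at 16 fps). clip_seconds is a hypothetical helper, and the result assumes the clip is exported at the same fps it was generated with.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Playback length in seconds of a clip with num_frames frames at fps."""
    return num_frames / fps


# Defaults taken from the __call__ signatures documented on this page.
predict1 = clip_seconds(num_frames=121, fps=30)
predict2 = clip_seconds(num_frames=93, fps=16)

print(f"Predict1 default clip: {predict1:.2f} s")  # 4.03 s
print(f"Predict2 default clip: {predict2:.2f} s")  # 5.81 s
```

Passing a different fps to export_to_video than the one used for generation changes the playback speed, not the number of frames.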

encode_prompt

(
    prompt: typing.Union[str, typing.List[str]],
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None,
    do_classifier_free_guidance: bool = True,
    num_videos_per_prompt: int = 1,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    max_sequence_length: int = 512,
    device: typing.Optional[torch.device] = None,
    dtype: typing.Optional[torch.dtype] = None
)

Encodes the prompt into text encoder hidden states.
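encode_prompt accepts both a prompt and a negative_prompt because classifier-free guidance needs a prediction under each condition. As a minimal sketch of the idea, the function below applies the textbook combination of a conditional and an unconditional prediction on plain Python lists. This is an assumption-laden illustration: cfg_combine is a hypothetical name, and the Cosmos pipelines may combine the two predictions with a different expression internally.

```python
def cfg_combine(cond, uncond, guidance_scale):
    """Textbook classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # stand-in for the prediction given the prompt
uncond = [0.0, 1.0]  # stand-in for the prediction given the negative prompt

guided = cfg_combine(cond, uncond, guidance_scale=7.0)
unguided = cfg_combine(cond, uncond, guidance_scale=1.0)
print(guided, unguided)
```

In this formulation a scale of 1.0 reproduces the conditional prediction unchanged, and larger values push the result further away from the negative-prompt prediction.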

CosmosVideoToWorldPipeline

class diffusers.CosmosVideoToWorldPipeline

(
    text_encoder: T5EncoderModel,
    tokenizer: T5TokenizerFast,
    transformer: CosmosTransformer3DModel,
    vae: AutoencoderKLCosmos,
    scheduler: EDMEulerScheduler,
    safety_checker: CosmosSafetyChecker = None
)

Pipeline for image-to-world and video-to-world generation using Cosmos Predict1.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

(
    image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None,
    video: typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]] = None,
    prompt: typing.Union[str, typing.List[str]] = None,
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None,
    height: int = 704,
    width: int = 1280,
    num_frames: int = 121,
    num_inference_steps: int = 36,
    guidance_scale: float = 7.0,
    input_frames_guidance: bool = False,
    augment_sigma: float = 0.001,
    fps: int = 30,
    num_videos_per_prompt: typing.Optional[int] = 1,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    latents: typing.Optional[torch.Tensor] = None,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    output_type: typing.Optional[str] = 'pil',
    return_dict: bool = True,
    callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None,
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'],
    max_sequence_length: int = 512
) → ~CosmosPipelineOutput or tuple

Returns

~CosmosPipelineOutput or tuple

If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

Image conditioning:

import torch
from diffusers import CosmosVideoToWorldPipeline
from diffusers.utils import export_to_video, load_image

model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "The video depicts a long, straight highway stretching into the distance, flanked by metal guardrails. The road is divided into multiple lanes, with a few vehicles visible in the far distance. The surrounding landscape features dry, grassy fields on one side and rolling hills on the other. The sky is mostly clear with a few scattered clouds, suggesting a bright, sunny day."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input.jpg"
)

video = pipe(image=image, prompt=prompt).frames[0]
export_to_video(video, "output.mp4", fps=30)

Video conditioning:

import torch
from diffusers import CosmosVideoToWorldPipeline
from diffusers.utils import export_to_video, load_video

model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.transformer = torch.compile(pipe.transformer)
pipe.to("cuda")

prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
video = load_video(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
)[:21]

video = pipe(video=video, prompt=prompt).frames[0]
export_to_video(video, "output.mp4", fps=30)
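The callback_on_step_end parameter in the signature above lets a caller observe each denoising step. The sketch below shows the shape such a callback usually takes, assuming the Cosmos pipelines follow the common diffusers convention of calling it with the pipeline, the step index, the timestep, and a dict holding the tensors named in callback_on_step_end_tensor_inputs, and expecting that dict back. make_step_logger is a hypothetical helper, and the loop only simulates the calls so the example runs without a model.

```python
def make_step_logger(log):
    """Hypothetical factory: build a callback that records each step in log."""

    def on_step_end(pipeline, step_index, timestep, callback_kwargs):
        # With the default tensor inputs, callback_kwargs is expected to hold "latents".
        has_latents = callback_kwargs.get("latents") is not None
        log.append((step_index, timestep, has_latents))
        # Return the dict so the caller can pick up any (possibly modified) tensors.
        return callback_kwargs

    return on_step_end


# Simulated denoising loop; a real run would pass callback_on_step_end=callback to the pipeline.
log = []
callback = make_step_logger(log)
for step_index, timestep in enumerate([0.9, 0.5, 0.1]):
    returned = callback(None, step_index, timestep, {"latents": object()})

print(log)
```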

encode_prompt

(
    prompt: typing.Union[str, typing.List[str]],
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None,
    do_classifier_free_guidance: bool = True,
    num_videos_per_prompt: int = 1,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    max_sequence_length: int = 512,
    device: typing.Optional[torch.device] = None,
    dtype: typing.Optional[torch.dtype] = None
)

Encodes the prompt into text encoder hidden states.

Cosmos2TextToImagePipeline

class diffusers.Cosmos2TextToImagePipeline

(
    text_encoder: T5EncoderModel,
    tokenizer: T5TokenizerFast,
    transformer: CosmosTransformer3DModel,
    vae: AutoencoderKLWan,
    scheduler: FlowMatchEulerDiscreteScheduler,
    safety_checker: CosmosSafetyChecker = None
)

Pipeline for text-to-image generation using Cosmos Predict2.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

(
    prompt: typing.Union[str, typing.List[str]] = None,
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None,
    height: int = 768,
    width: int = 1360,
    num_inference_steps: int = 35,
    guidance_scale: float = 7.0,
    num_images_per_prompt: typing.Optional[int] = 1,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    latents: typing.Optional[torch.Tensor] = None,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    output_type: typing.Optional[str] = 'pil',
    return_dict: bool = True,
    callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None,
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'],
    max_sequence_length: int = 512
) → ~CosmosImagePipelineOutput or tuple

Returns

~CosmosImagePipelineOutput or tuple

If return_dict is True, CosmosImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

import torch
from diffusers import Cosmos2TextToImagePipeline

model_id = "nvidia/Cosmos-Predict2-2B-Text2Image"
pipe = Cosmos2TextToImagePipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."

output = pipe(
    prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
).images[0]
output.save("output.png")
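The examples on this page load checkpoints with torch_dtype=torch.bfloat16, which stores each weight in 2 bytes instead of the 4 used by float32. The back-of-the-envelope calculation below estimates the transformer weight footprint for the 2B and 7B checkpoints. It assumes the nominal parameter counts implied by the model names and ignores the text encoder, the VAE, and activation memory, so actual usage will be higher. weight_gigabytes is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_gigabytes(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (10**9 bytes) for num_params parameters."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal sizes taken from the checkpoint names; real parameter counts may differ.
for name, params in [("2B", 2e9), ("7B", 7e9)]:
    bf16 = weight_gigabytes(params, "bfloat16")
    fp32 = weight_gigabytes(params, "float32")
    print(f"{name}: {bf16:.0f} GB in bfloat16, {fp32:.0f} GB in float32")
```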

encode_prompt

(
    prompt: typing.Union[str, typing.List[str]],
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None,
    do_classifier_free_guidance: bool = True,
    num_images_per_prompt: int = 1,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    max_sequence_length: int = 512,
    device: typing.Optional[torch.device] = None,
    dtype: typing.Optional[torch.dtype] = None
)

Encodes the prompt into text encoder hidden states.

Cosmos2VideoToWorldPipeline

class diffusers.Cosmos2VideoToWorldPipeline

(
    text_encoder: T5EncoderModel,
    tokenizer: T5TokenizerFast,
    transformer: CosmosTransformer3DModel,
    vae: AutoencoderKLWan,
    scheduler: FlowMatchEulerDiscreteScheduler,
    safety_checker: CosmosSafetyChecker = None
)

Pipeline for video-to-world generation using Cosmos Predict2.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

(
    image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None,
    video: typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]] = None,
    prompt: typing.Union[str, typing.List[str]] = None,
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None,
    height: int = 704,
    width: int = 1280,
    num_frames: int = 93,
    num_inference_steps: int = 35,
    guidance_scale: float = 7.0,
    fps: int = 16,
    num_videos_per_prompt: typing.Optional[int] = 1,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    latents: typing.Optional[torch.Tensor] = None,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    output_type: typing.Optional[str] = 'pil',
    return_dict: bool = True,
    callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None,
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'],
    max_sequence_length: int = 512,
    sigma_conditioning: float = 0.0001
) → ~CosmosPipelineOutput or tuple

Returns

~CosmosPipelineOutput or tuple

If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content.

The call function to the pipeline for generation.

Examples:

import torch
from diffusers import Cosmos2VideoToWorldPipeline
from diffusers.utils import export_to_video, load_image

model_id = "nvidia/Cosmos-Predict2-2B-Video2World"
pipe = Cosmos2VideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yellow-scrubber.png"
)

video = pipe(
    image=image, prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
).frames[0]
export_to_video(video, "output.mp4", fps=16)

encode_prompt

(
    prompt: typing.Union[str, typing.List[str]],
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None,
    do_classifier_free_guidance: bool = True,
    num_videos_per_prompt: int = 1,
    prompt_embeds: typing.Optional[torch.Tensor] = None,
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None,
    max_sequence_length: int = 512,
    device: typing.Optional[torch.device] = None,
    dtype: typing.Optional[torch.dtype] = None
)

Encodes the prompt into text encoder hidden states.

CosmosPipelineOutput

class diffusers.pipelines.cosmos.pipeline_output.CosmosPipelineOutput

( frames: Tensor )

Output class for Cosmos any-to-world/video pipelines.

CosmosImagePipelineOutput

class diffusers.pipelines.cosmos.pipeline_output.CosmosImagePipelineOutput

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )

Output class for Cosmos any-to-image pipelines.
