AudioLDM

AudioLDM was proposed in "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models" by Haohe Liu et al. Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech, and music.

The abstract from the paper is:

Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at this https URL.

The original codebase can be found at haoheliu/AudioLDM.

Tips

When constructing a prompt, keep in mind:

During inference:

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
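As a rough sketch of both ideas, the helpers below swap a pipeline's scheduler while keeping its existing configuration, and build a second pipeline from modules that are already loaded. The helper names `swap_scheduler` and `share_components` are hypothetical. They rely on the `from_config` class method of schedulers and the `components` property of pipelines. The use of `DPMSolverMultistepScheduler` in the commented usage is only an illustrative assumption, not a recommendation from this page.

```python
def swap_scheduler(pipe, scheduler_cls):
    """Replace pipe.scheduler with scheduler_cls, reusing the current config.

    Hypothetical helper: `pipe` is any object with a `scheduler` attribute,
    and `scheduler_cls` is any class exposing a `from_config` class method.
    """
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    return pipe


def share_components(pipe, pipeline_cls):
    """Build a second pipeline from the modules already loaded in `pipe`.

    The modules are passed by reference, so no weights are loaded twice.
    """
    return pipeline_cls(**pipe.components)


# Usage with diffusers (downloads weights, so it is left commented out):
#
# import torch
# from diffusers import AudioLDMPipeline, DPMSolverMultistepScheduler
#
# pipe = AudioLDMPipeline.from_pretrained(
#     "cvssp/audioldm-s-full-v2", torch_dtype=torch.float16
# )
# pipe = swap_scheduler(pipe, DPMSolverMultistepScheduler)
# second_pipe = share_components(pipe, AudioLDMPipeline)
```

Different schedulers may need a different `num_inference_steps` to reach comparable quality, so it is worth comparing outputs by ear after a swap.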

AudioLDMPipeline

class diffusers.AudioLDMPipeline


(
    vae: AutoencoderKL
    text_encoder: ClapTextModelWithProjection
    tokenizer: typing.Union[transformers.models.roberta.tokenization_roberta.RobertaTokenizer, transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast]
    unet: UNet2DConditionModel
    scheduler: KarrasDiffusionSchedulers
    vocoder: SpeechT5HifiGan
)

Parameters

Pipeline for text-to-audio generation using AudioLDM.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__


(
    prompt: typing.Union[str, typing.List[str]] = None
    audio_length_in_s: typing.Optional[float] = None
    num_inference_steps: int = 10
    guidance_scale: float = 2.5
    negative_prompt: typing.Union[str, typing.List[str], NoneType] = None
    num_waveforms_per_prompt: typing.Optional[int] = 1
    eta: float = 0.0
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None
    latents: typing.Optional[torch.Tensor] = None
    prompt_embeds: typing.Optional[torch.Tensor] = None
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None
    return_dict: bool = True
    callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None
    callback_steps: typing.Optional[int] = 1
    cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None
    output_type: typing.Optional[str] = 'np'
) → AudioPipelineOutput or tuple

Parameters

If return_dict is True, AudioPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated audio.

The call function to the pipeline for generation.
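Because the return type depends on `return_dict`, calling code that must accept either form can normalize the result first. The following is a minimal sketch, assuming only what is stated above: the output object exposes an `audios` attribute, and the tuple form carries the generated audio as its first element. The helper name `extract_audios` is hypothetical.

```python
def extract_audios(result):
    """Return the generated audio batch from either pipeline return style.

    Hypothetical helper: handles the tuple returned when return_dict=False
    and the output object (with an `audios` attribute) returned otherwise.
    """
    if isinstance(result, tuple):
        # Tuple form: the first element holds the generated audio.
        return result[0]
    # Output-object form: the audio batch is stored on `.audios`.
    return result.audios


# Usage (assumes `pipe` is an already-loaded AudioLDMPipeline):
#
# audios = extract_audios(pipe(prompt, return_dict=False))
# first_waveform = audios[0]
```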

Examples:

import scipy.io.wavfile
import torch
from diffusers import AudioLDMPipeline

# Load the pretrained checkpoint in half precision and move it to the GPU.
repo_id = "cvssp/audioldm-s-full-v2"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate 5 seconds of audio from the text prompt.
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

# Save the waveform as a 16 kHz WAV file.
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
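If SciPy is not available, the same save step can be approximated with the Python standard library. This is a minimal sketch, not part of the pipeline API: the helper name `write_wav_mono16` is hypothetical, and it assumes a one-dimensional sequence of float samples in the range [-1.0, 1.0] and the 16 kHz rate used in the example above.

```python
import array
import sys
import wave


def write_wav_mono16(path, samples, rate=16000):
    """Write float samples in [-1.0, 1.0] as a 16-bit mono PCM WAV file.

    Hypothetical helper. Returns the number of frames written.
    """
    # Clip each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    )
    # WAV data is little-endian; swap bytes on big-endian machines.
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(pcm.tobytes())
    return len(pcm)


# Usage (assumes `audio` is the 1-D waveform from the example above):
#
# write_wav_mono16("techno.wav", audio, rate=16000)
```

At 16 kHz, a 5-second clip corresponds to 5 × 16000 = 80000 frames, which is a quick sanity check on the written file.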

AudioPipelineOutput

class diffusers.AudioPipelineOutput


( audios: ndarray )

Parameters

Output class for audio pipelines.
