SpeechT5

Overview

The SpeechT5 model was proposed in SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.

The abstract from the paper is the following:

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

This model was contributed by Matthijs. The original code can be found here.

SpeechT5Config

class transformers.SpeechT5Config

( vocab_size = 81 hidden_size = 768 encoder_layers = 12 encoder_attention_heads = 12 encoder_ffn_dim = 3072 encoder_layerdrop = 0.1 decoder_layers = 6 decoder_ffn_dim = 3072 decoder_attention_heads = 12 decoder_layerdrop = 0.1 hidden_act = 'gelu' positional_dropout = 0.1 hidden_dropout = 0.1 attention_dropout = 0.1 activation_dropout = 0.1 initializer_range = 0.02 layer_norm_eps = 1e-05 scale_embedding = False feat_extract_norm = 'group' feat_proj_dropout = 0.0 feat_extract_activation = 'gelu' conv_dim = (512, 512, 512, 512, 512, 512, 512) conv_stride = (5, 2, 2, 2, 2, 2, 2) conv_kernel = (10, 3, 3, 3, 3, 2, 2) conv_bias = False num_conv_pos_embeddings = 128 num_conv_pos_embedding_groups = 16 apply_spec_augment = True mask_time_prob = 0.05 mask_time_length = 10 mask_time_min_masks = 2 mask_feature_prob = 0.0 mask_feature_length = 10 mask_feature_min_masks = 0 pad_token_id = 1 bos_token_id = 0 eos_token_id = 2 decoder_start_token_id = 2 num_mel_bins = 80 speech_decoder_prenet_layers = 2 speech_decoder_prenet_units = 256 speech_decoder_prenet_dropout = 0.5 speaker_embedding_dim = 512 speech_decoder_postnet_layers = 5 speech_decoder_postnet_units = 256 speech_decoder_postnet_kernel = 5 speech_decoder_postnet_dropout = 0.5 reduction_factor = 2 max_speech_positions = 4000 max_text_positions = 450 encoder_max_relative_position = 160 use_guided_attention_loss = True guided_attention_loss_num_heads = 2 guided_attention_loss_sigma = 0.4 guided_attention_loss_scale = 10.0 use_cache = True is_encoder_decoder = True **kwargs )

Parameters

This is the configuration class to store the configuration of a SpeechT5Model. It is used to instantiate a SpeechT5 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the SpeechT5 microsoft/speecht5_asr architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import SpeechT5Model, SpeechT5Config

configuration = SpeechT5Config()

model = SpeechT5Model(configuration)

configuration = model.config
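
The conv_kernel and conv_stride defaults in the signature above determine how strongly the speech encoder prenet downsamples a raw waveform. The following is a minimal sketch in plain Python, assuming the convolutions are applied in sequence without padding or dilation; the helper names encoder_frames and total_stride are illustrative and not part of the library.

```python
# Default values copied from the SpeechT5Config signature.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def encoder_frames(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Output length of a stack of unpadded 1-D convolutions (assumed layout)."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # Standard length formula for a convolution without padding or dilation.
        length = (length - kernel) // stride + 1
    return length


def total_stride(strides=CONV_STRIDE):
    """Product of all strides: roughly the input samples consumed per output frame."""
    product = 1
    for stride in strides:
        product *= stride
    return product


one_second = 16000  # one second of audio at 16 kHz
print(total_stride())              # 320
print(encoder_frames(one_second))  # 49
```

Under these assumptions, one second of 16 kHz audio is reduced to 49 encoder frames, each covering about 20 ms.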

SpeechT5HifiGanConfig

class transformers.SpeechT5HifiGanConfig

( model_in_dim = 80 sampling_rate = 16000 upsample_initial_channel = 512 upsample_rates = [4, 4, 4, 4] upsample_kernel_sizes = [8, 8, 8, 8] resblock_kernel_sizes = [3, 7, 11] resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]] initializer_range = 0.01 leaky_relu_slope = 0.1 normalize_before = True **kwargs )

Parameters

This is the configuration class to store the configuration of a SpeechT5HifiGanModel. It is used to instantiate a SpeechT5 HiFi-GAN vocoder model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the SpeechT5 microsoft/speecht5_hifigan architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.

Example:

from transformers import SpeechT5HifiGan, SpeechT5HifiGanConfig

configuration = SpeechT5HifiGanConfig()

model = SpeechT5HifiGan(configuration)

configuration = model.config
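
As a rough illustration of what the upsample_rates default implies, the sketch below multiplies the rates to get the number of waveform samples produced per spectrogram frame. It assumes each upsampling stage scales the sequence length by exactly its rate; the variable names are illustrative.

```python
import math

UPSAMPLE_RATES = [4, 4, 4, 4]  # default from the SpeechT5HifiGanConfig signature
SAMPLING_RATE = 16000          # default from the SpeechT5HifiGanConfig signature

# Assumption: every stage multiplies the length by its rate, so the rates compound.
samples_per_frame = math.prod(UPSAMPLE_RATES)
frame_ms = 1000 * samples_per_frame / SAMPLING_RATE

print(samples_per_frame)  # 256
print(frame_ms)           # 16.0
```

A frame of 16 ms would line up with the feature extractor default hop_length = 16, if that value is read as milliseconds.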

SpeechT5Tokenizer

class transformers.SpeechT5Tokenizer

( vocab_file bos_token = '<s>' eos_token = '</s>' unk_token = '<unk>' pad_token = '<pad>' normalize = False sp_model_kwargs: Optional = None **kwargs )

Parameters

Construct a SpeechT5 tokenizer. Based on SentencePiece.

This tokenizer inherits from PreTrainedTokenizer which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

__call__

( text: Union = None text_pair: Union = None text_target: Union = None text_pair_target: Union = None add_special_tokens: bool = True padding: Union = False truncation: Union = None max_length: Optional = None stride: int = 0 is_split_into_words: bool = False pad_to_multiple_of: Optional = None padding_side: Optional = None return_tensors: Union = None return_token_type_ids: Optional = None return_attention_mask: Optional = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_offsets_mapping: bool = False return_length: bool = False verbose: bool = True **kwargs ) → BatchEncoding

Parameters

A BatchEncoding with the following fields:

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of sequences.

save_vocabulary

( save_directory: str filename_prefix: Optional = None )

decode

( token_ids: Union skip_special_tokens: bool = False clean_up_tokenization_spaces: bool = None **kwargs ) → str

Parameters

The decoded sentence.

Converts a sequence of ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.

Similar to doing self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids)).

batch_decode

( sequences: Union skip_special_tokens: bool = False clean_up_tokenization_spaces: bool = None **kwargs ) → List[str]

Parameters

The list of decoded sentences.

Convert a list of lists of token ids into a list of strings by calling decode.
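
The relationship between decode and batch_decode described above can be sketched with a toy character-level vocabulary. Everything here is hypothetical: the real tokenizer loads its vocabulary from a SentencePiece model file, and TOY_VOCAB only mimics the pad/bos/eos ids used by the default configuration.

```python
# Hypothetical toy vocabulary; ids 0, 1, 2 mirror bos, pad, eos in the default config.
TOY_VOCAB = ["<s>", "<pad>", "</s>", "<unk>", "▁", "h", "i", "o"]
SPECIAL_TOKENS = {"<s>", "<pad>", "</s>", "<unk>"}


def convert_ids_to_tokens(ids):
    return [TOY_VOCAB[i] for i in ids]


def convert_tokens_to_string(tokens):
    # SentencePiece marks word boundaries with "▁"; turn them back into spaces.
    return "".join(tokens).replace("▁", " ").strip()


def decode(ids, skip_special_tokens=False):
    tokens = convert_ids_to_tokens(ids)
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return convert_tokens_to_string(tokens)


def batch_decode(sequences, skip_special_tokens=False):
    # One decode call per sequence in the batch.
    return [decode(ids, skip_special_tokens) for ids in sequences]


batch = [[4, 5, 6, 2], [4, 5, 7, 2, 1]]
print(batch_decode(batch, skip_special_tokens=True))  # ['hi', 'ho']
```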

SpeechT5FeatureExtractor

( feature_size: int = 1 sampling_rate: int = 16000 padding_value: float = 0.0 do_normalize: bool = False num_mel_bins: int = 80 hop_length: int = 16 win_length: int = 64 win_function: str = 'hann_window' frame_signal_scale: float = 1.0 fmin: float = 80 fmax: float = 7600 mel_floor: float = 1e-10 reduction_factor: int = 2 return_attention_mask: bool = True **kwargs )

Parameters

Constructs a SpeechT5 feature extractor.

This class can pre-process a raw speech signal by (optionally) normalizing to zero-mean unit-variance, for use by the SpeechT5 speech encoder prenet.

This class can also extract log-mel filter bank features from raw speech, for use by the SpeechT5 speech decoder prenet.

This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
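
To make the optional zero-mean unit-variance normalization concrete, here is a minimal pure-Python sketch. The function name and the small eps added for numerical stability are assumptions for illustration; the library may compute this differently.

```python
import math


def zero_mean_unit_var(waveform, eps=1e-7):
    """Shift a waveform to zero mean and scale it to (approximately) unit variance."""
    mean = sum(waveform) / len(waveform)
    var = sum((x - mean) ** 2 for x in waveform) / len(waveform)
    scale = math.sqrt(var + eps)  # eps (assumed) avoids division by zero on silence
    return [(x - mean) / scale for x in waveform]


normalized = zero_mean_unit_var([0.1, 0.3, -0.2, 0.4])
mean_after = sum(normalized) / len(normalized)
var_after = sum(x * x for x in normalized) / len(normalized)
print(abs(mean_after) < 1e-9, abs(var_after - 1.0) < 1e-4)
```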

__call__

( audio: Union = None audio_target: Union = None padding: Union = False max_length: Optional = None truncation: bool = False pad_to_multiple_of: Optional = None return_attention_mask: Optional = None return_tensors: Union = None sampling_rate: Optional = None **kwargs )

Parameters

Main method to featurize and prepare for the model one or several sequence(s).

Pass in a value for audio to extract waveform features. Pass in a value for audio_target to extract log-mel spectrogram features.

SpeechT5Processor

class transformers.SpeechT5Processor

( feature_extractor tokenizer )

Parameters

Constructs a SpeechT5 processor which wraps a feature extractor and a tokenizer into a single processor.

SpeechT5Processor offers all the functionalities of SpeechT5FeatureExtractor and SpeechT5Tokenizer. See the docstring of __call__() and decode() for more information.

__call__

Processes audio and text input, as well as audio and text targets.

You can process audio by using the argument audio, or process audio targets by using the argument audio_target. This forwards the arguments to SpeechT5FeatureExtractor’s __call__().

You can process text by using the argument text, or process text labels by using the argument text_target. This forwards the arguments to SpeechT5Tokenizer’s __call__().

Valid input combinations are:

Please refer to the docstring of the above two methods for more information.

pad

Collates the audio and text inputs, as well as their targets, into a padded batch.

Audio inputs are padded by SpeechT5FeatureExtractor’s pad(). Text inputs are padded by SpeechT5Tokenizer’s pad().

Valid input combinations are:

Please refer to the docstring of the above two methods for more information.
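
The idea of collating variable-length inputs into a padded batch can be sketched as follows. This is a simplified stand-in, not the library's pad() implementation: it assumes right-padding, and the function name pad_batch and the returned keys are chosen for illustration only.

```python
def pad_batch(sequences, padding_value=0.0):
    """Right-pad sequences to a common length and build a matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [padding_value] * n_pad)
        # 1 marks real values, 0 marks padding that the model should ignore.
        mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_values": padded, "attention_mask": mask}


batch = pad_batch([[0.1, 0.2, 0.3], [0.5]])
print(batch["input_values"])    # [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```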

from_pretrained

( pretrained_model_name_or_path: Union cache_dir: Union = None force_download: bool = False local_files_only: bool = False token: Union = None revision: str = 'main' **kwargs )

Parameters

Instantiate a processor associated with a pretrained model.

This class method simply calls the from_pretrained() methods of the feature extractor, the image processor (ImageProcessingMixin), and the tokenizer (PreTrainedTokenizer.from_pretrained). Please refer to the docstrings of the methods above for more information.

save_pretrained

( save_directory push_to_hub: bool = False **kwargs )

Parameters

Saves the attributes of this processor (feature extractor, tokenizer…) in the specified directory so that it can be reloaded using the from_pretrained() method.

This method simply calls the feature extractor’s save_pretrained() and the tokenizer’s save_pretrained(). Please refer to the docstrings of the methods above for more information.

batch_decode

This method forwards all its arguments to SpeechT5Tokenizer’s batch_decode(). Please refer to the docstring of this method for more information.

decode

This method forwards all its arguments to SpeechT5Tokenizer’s decode(). Please refer to the docstring of this method for more information.

SpeechT5Model

class transformers.SpeechT5Model

( config: SpeechT5Config encoder: Optional = None decoder: Optional = None )

Parameters

The bare SpeechT5 Encoder-Decoder Model outputting raw hidden-states without any specific pre- or post-nets. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_values: Optional = None attention_mask: Optional = None decoder_input_values: Optional = None decoder_attention_mask: Optional = None head_mask: Optional = None decoder_head_mask: Optional = None cross_attn_head_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None use_cache: Optional = None speaker_embeddings: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None ) → transformers.modeling_outputs.Seq2SeqModelOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.Seq2SeqModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SpeechT5Config) and inputs.

The SpeechT5Model forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

SpeechT5ForSpeechToText

class transformers.SpeechT5ForSpeechToText

( config: SpeechT5Config )

Parameters

SpeechT5 Model with a speech encoder and a text decoder. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_values: Optional = None attention_mask: Optional = None decoder_input_ids: Optional = None decoder_attention_mask: Optional = None head_mask: Optional = None decoder_head_mask: Optional = None cross_attn_head_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None labels: Optional = None ) → transformers.modeling_outputs.Seq2SeqLMOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.Seq2SeqLMOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SpeechT5Config) and inputs.

The SpeechT5ForSpeechToText forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

from transformers import SpeechT5Processor, SpeechT5ForSpeechToText
from datasets import load_dataset

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_demo", "clean", split="validation", trust_remote_code=True
)
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

# Turn the raw waveform into model inputs and generate token ids
inputs = processor(audio=dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
predicted_ids = model.generate(**inputs, max_length=100)

# Decode the token ids into text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
transcription[0]
# 'mister quilter is the apostle of the middle classes and we are glad to welcome his gospel'

# Tokenize the reference text as labels and compute the loss
inputs["labels"] = processor(text_target=dataset[0]["text"], return_tensors="pt").input_ids

loss = model(**inputs).loss
round(loss.item(), 2)
# 19.68

SpeechT5ForTextToSpeech

class transformers.SpeechT5ForTextToSpeech

( config: SpeechT5Config )

Parameters

SpeechT5 Model with a text encoder and a speech decoder. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: Optional = None attention_mask: Optional = None decoder_input_values: Optional = None decoder_attention_mask: Optional = None head_mask: Optional = None decoder_head_mask: Optional = None cross_attn_head_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None speaker_embeddings: Optional = None labels: Optional = None stop_labels: Optional = None ) → transformers.modeling_outputs.Seq2SeqSpectrogramOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.Seq2SeqSpectrogramOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SpeechT5Config) and inputs.

The SpeechT5ForTextToSpeech forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan, set_seed
import torch

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text="Hello, my dog is cute", return_tensors="pt")
# All-zero placeholder speaker embedding of size speaker_embedding_dim (512)
speaker_embeddings = torch.zeros((1, 512))

# Fix the random seed so the output is reproducible
set_seed(555)

# Generate a waveform; the vocoder converts the mel spectrogram to audio
speech = model.generate(inputs["input_ids"], speaker_embeddings=speaker_embeddings, vocoder=vocoder)
speech.shape
# torch.Size([15872])

generate

( input_ids: LongTensor attention_mask: Optional = None speaker_embeddings: Optional = None threshold: float = 0.5 minlenratio: float = 0.0 maxlenratio: float = 20.0 vocoder: Optional = None output_cross_attentions: bool = False return_output_lengths: bool = False **kwargs ) → tuple(torch.FloatTensor) comprising various elements depending on the inputs

Parameters

Returns

tuple(torch.FloatTensor) comprising various elements depending on the inputs

Converts a sequence of input tokens into a sequence of mel spectrograms, which are subsequently turned into a speech waveform using a vocoder.
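
The threshold, minlenratio and maxlenratio arguments suggest a stop rule for the autoregressive decoder. The sketch below shows one plausible way such a rule could work, under loudly stated assumptions: one stop logit per decoder step, length bounds derived from the encoder length divided by reduction_factor, and reduction_factor spectrogram frames emitted per step. It is a simplified illustration, not the library's implementation, and count_decoder_steps is a hypothetical helper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def count_decoder_steps(stop_logits, encoder_len, threshold=0.5,
                        minlenratio=0.0, maxlenratio=20.0, reduction_factor=2):
    """Count decoder steps until the (assumed) stop rule fires."""
    # Assumed bounds: encoder length scaled by the ratios, in decoder steps.
    minlen = int(encoder_len * minlenratio / reduction_factor)
    maxlen = int(encoder_len * maxlenratio / reduction_factor)
    steps = 0
    for logit in stop_logits:
        steps += 1
        if steps < minlen:
            continue  # too early to stop, whatever the stop probability says
        if sigmoid(logit) >= threshold or steps >= maxlen:
            break
    return steps


logits = [-4.0, -3.0, -1.0, 0.5, 2.0]  # made-up stop logits, one per step
steps = count_decoder_steps(logits, encoder_len=10)
print(steps, steps * 2)  # 4 8
```

With the default reduction_factor of 2, four decoder steps would correspond to eight spectrogram frames in this sketch.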

SpeechT5ForSpeechToSpeech

class transformers.SpeechT5ForSpeechToSpeech

( config: SpeechT5Config )

Parameters

SpeechT5 Model with a speech encoder and a speech decoder. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_values: Optional = None attention_mask: Optional = None decoder_input_values: Optional = None decoder_attention_mask: Optional = None head_mask: Optional = None decoder_head_mask: Optional = None cross_attn_head_mask: Optional = None encoder_outputs: Optional = None past_key_values: Optional = None use_cache: Optional = None output_attentions: Optional = None output_hidden_states: Optional = None return_dict: Optional = None speaker_embeddings: Optional = None labels: Optional = None stop_labels: Optional = None ) → transformers.modeling_outputs.Seq2SeqSpectrogramOutput or tuple(torch.FloatTensor)

Parameters

A transformers.modeling_outputs.Seq2SeqSpectrogramOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SpeechT5Config) and inputs.

The SpeechT5ForSpeechToSpeech forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan, set_seed
from datasets import load_dataset
import torch

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_demo", "clean", split="validation", trust_remote_code=True
)
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Turn the raw waveform into model inputs
inputs = processor(audio=dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

# All-zero placeholder speaker embedding of size speaker_embedding_dim (512)
speaker_embeddings = torch.zeros((1, 512))

# Fix the random seed so the output is reproducible
set_seed(555)

# Generate the converted waveform
speech = model.generate_speech(inputs["input_values"], speaker_embeddings, vocoder=vocoder)
speech.shape
# torch.Size([77824])

generate_speech

( input_values: FloatTensor speaker_embeddings: Optional = None attention_mask: Optional = None threshold: float = 0.5 minlenratio: float = 0.0 maxlenratio: float = 20.0 vocoder: Optional = None output_cross_attentions: bool = False return_output_lengths: bool = False ) → tuple(torch.FloatTensor) comprising various elements depending on the inputs

Parameters

Returns

tuple(torch.FloatTensor) comprising various elements depending on the inputs

Converts a raw speech waveform into a sequence of mel spectrograms, which are subsequently turned back into a speech waveform using a vocoder.

SpeechT5HifiGan

class transformers.SpeechT5HifiGan

( config: SpeechT5HifiGanConfig )

Parameters

HiFi-GAN vocoder. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( spectrogram: FloatTensor ) → torch.FloatTensor

Parameters

Returns

torch.FloatTensor

Tensor containing the speech waveform. If the input spectrogram is batched, will be of shape (batch_size, num_frames,). If un-batched, will be of shape (num_frames,).

Converts a log-mel spectrogram into a speech waveform. Passing a batch of log-mel spectrograms returns a batch of speech waveforms. Passing a single, un-batched log-mel spectrogram returns a single, un-batched speech waveform.
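
The batched and un-batched cases can be illustrated with a small shape calculation. This sketch assumes the spectrogram is laid out as (num_frames, num_mel_bins), optionally with a leading batch dimension, and that the default upsample_rates turn each frame into exactly 256 samples; waveform_shape is a hypothetical helper, not a library function.

```python
SAMPLES_PER_FRAME = 4 * 4 * 4 * 4  # product of the default upsample_rates (assumed exact)


def waveform_shape(spectrogram_shape):
    """Expected waveform shape for a given log-mel spectrogram shape."""
    if len(spectrogram_shape) == 2:
        # Un-batched input: (num_frames, num_mel_bins)
        num_frames, _ = spectrogram_shape
        return (num_frames * SAMPLES_PER_FRAME,)
    # Batched input: (batch_size, num_frames, num_mel_bins)
    batch_size, num_frames, _ = spectrogram_shape
    return (batch_size, num_frames * SAMPLES_PER_FRAME)


print(waveform_shape((62, 80)))      # (15872,)
print(waveform_shape((3, 304, 80)))  # (3, 77824)
```

Under these assumptions, the 15872-sample and 77824-sample outputs in the examples on this page would correspond to 62 and 304 spectrogram frames respectively.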
