NeMo ASR API — NVIDIA NeMo Framework User Guide

Model Classes#

Modules#

class nemo.collections.asr.modules.ConvASREncoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable, AccessMixin

Convolutional encoder for ASR models. With this class you can implement JasperNet and QuartzNet models.

Based on these papers:

https://arxiv.org/pdf/1904.03288.pdf https://arxiv.org/pdf/1910.10261.pdf
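As a brief orientation, the following is a minimal sketch of pulling the ConvASREncoder out of a pretrained QuartzNet CTC model and running it on a batch of features. The checkpoint name, feature dimension, and lengths are illustrative assumptions, not prescribed values.

```python
# Minimal sketch: run the ConvASREncoder of a pretrained QuartzNet CTC model.
# The checkpoint name and feature shapes are illustrative assumptions.
import torch
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")
encoder = asr_model.encoder  # instance of ConvASREncoder

feats = torch.randn(2, 64, 400)          # (batch, feat_in, time) mel features
feat_len = torch.tensor([400, 320])
encoded, encoded_len = encoder(audio_signal=feats, length=feat_len)
print(encoded.shape, encoded_len)        # (batch, d_out, reduced_time), reduced lengths
```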

input_example(max_batch=1, max_dim=8192)#

Generates input examples for tracing etc.

Returns:

A tuple of input examples.

property input_types#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.

update_max_sequence_length(seq_length: int, device)#

Find global max audio length across all nodes in distributed training and update the max_audio_length

class nemo.collections.asr.modules.ConvASRDecoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable, AdapterModuleMixin

Simple ASR Decoder for use with CTC-based models such as JasperNet and QuartzNet

Based on these papers:

https://arxiv.org/pdf/1904.03288.pdf https://arxiv.org/pdf/1910.10261.pdf https://arxiv.org/pdf/2005.04290.pdf

add_adapter(name: str, cfg: omegaconf.DictConfig)#

Add an Adapter module to this module.

Parameters:

input_example(max_batch=1, max_dim=256)#

Generates input examples for tracing etc.

Returns:

A tuple of input examples.

property input_types#

Define these to enable input neural type checks

property output_types#

Define these to enable output neural type checks

class nemo.collections.asr.modules.ConvASRDecoderClassification(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable

Simple ASR Decoder for use with classification models such as JasperNet and QuartzNet

Based on these papers:

https://arxiv.org/pdf/2005.04290.pdf

input_example(max_batch=1, max_dim=256)#

Generates input examples for tracing etc.

Returns:

A tuple of input examples.

property input_types#

Define these to enable input neural type checks

property output_types#

Define these to enable output neural type checks

class nemo.collections.asr.modules.SpeakerDecoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable

Speaker Decoder creates the final neural layers that map the outputs of the Jasper Encoder to the embedding layer, followed by a speaker-based softmax loss.

Parameters:

input_example(max_batch=1, max_dim=256)#

Generates input examples for tracing etc.

Returns:

A tuple of input examples.

property input_types#

Define these to enable input neural type checks

property output_types#

Define these to enable output neural type checks

class nemo.collections.asr.modules.ConformerEncoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, StreamingEncoder, Exportable, AccessMixin

The encoder for the Conformer ASR model. Based on this paper: ‘Conformer: Convolution-augmented Transformer for Speech Recognition’ by Anmol Gulati et al. https://arxiv.org/abs/2005.08100

Parameters:

change_attention_model(

self_attention_model: str | None = None,

att_context_size: List[int] | None = None,

update_config: bool = True,

device: torch.device | None = None,

)#

Update the self_attention_model which changes the positional encoding and attention layers.

Parameters:

change_subsampling_conv_chunking_factor(

subsampling_conv_chunking_factor: int,

)#

Update the conv_chunking_factor (int). Default is 1 (auto). Set it to -1 (disabled) or to a specific value (a power of 2) if you run out of memory (OOM) in the conv subsampling layers.

Parameters:

subsampling_conv_chunking_factor (int)

property disabled_deployment_input_names#

Implement this method to return a set of input names disabled for export

property disabled_deployment_output_names#

Implement this method to return a set of output names disabled for export

enable_pad_mask(on=True)#

Enables or disables the pad mask, assigning it the boolean value on.

Returns:

The current state of the pad mask.

Return type:

mask (bool)

forward(

audio_signal,

length,

cache_last_channel=None,

cache_last_time=None,

cache_last_channel_len=None,

bypass_pre_encode=False,

)#

Forward function for the ConformerEncoder accepting an audio signal and its corresponding length. The audio_signal input supports two formats depending on the bypass_pre_encode boolean flag. This determines the required format of the input variable audio_signal:

(1) bypass_pre_encode = False (default): audio_signal must be a tensor containing audio features. Shape: (batch, self._feat_in, n_frames)

(2) bypass_pre_encode = True: audio_signal must be a tensor containing pre-encoded embeddings. Shape: (batch, n_frame, self.d_model)
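A minimal sketch of the default (bypass_pre_encode=False) call path follows; the constructor arguments and tensor shapes are illustrative assumptions rather than a recommended configuration.

```python
# Minimal sketch of ConformerEncoder.forward with bypass_pre_encode=False.
# Constructor arguments and shapes are illustrative assumptions.
import torch
from nemo.collections.asr.modules import ConformerEncoder

encoder = ConformerEncoder(feat_in=80, n_layers=2, d_model=176)

audio_signal = torch.randn(2, 80, 400)   # (batch, self._feat_in, n_frames)
length = torch.tensor([400, 256])
encoded, encoded_len = encoder(audio_signal=audio_signal, length=length)
# encoded comes back in (batch, d_model, subsampled_frames) layout,
# encoded_len holds the subsampled lengths.
print(encoded.shape, encoded_len)
```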

forward_for_export(

audio_signal,

length,

cache_last_channel=None,

cache_last_time=None,

cache_last_channel_len=None,

)#

Forward function for model export. Please see forward() for more details.

forward_internal(

audio_signal,

length,

cache_last_channel=None,

cache_last_time=None,

cache_last_channel_len=None,

bypass_pre_encode=False,

)#

The audio_signal input supports two formats depending on the bypass_pre_encode boolean flag. This determines the required format of the input variable audio_signal:

(1) bypass_pre_encode = False (default): audio_signal must be a tensor containing audio features. Shape: (batch, self._feat_in, n_frames)

(2) bypass_pre_encode = True: audio_signal must be a tensor containing pre-encoded embeddings. Shape: (batch, n_frame, self.d_model)

bypass_pre_encode=True is used in cases where frame-level, context-independent embeddings need to be saved or reused (e.g., the speaker cache in streaming speaker diarization).

input_example(max_batch=1, max_dim=256)#

Generates input examples for tracing etc.

Returns:

A tuple of input examples.

property input_types#

Returns definitions of module input ports.

property input_types_for_export#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.

property output_types_for_export#

Returns definitions of module output ports.

set_default_att_context_size(att_context_size)#

Sets the default attention context size from att_context_size argument.

Parameters:

att_context_size (list) – The attention context size to be set.

set_max_audio_length(max_audio_length)#

Sets maximum input length. Pre-calculates internal seq_range mask.

Parameters:

max_audio_length (int) – New maximum sequence length.

setup_streaming_params(

chunk_size: int | None = None,

shift_size: int | None = None,

left_chunks: int | None = None,

att_context_size: list | None = None,

max_context: int = 10000,

)#

This function sets the values and parameters needed to perform streaming. The configuration is stored in self.streaming_cfg and is needed to simulate streaming inference.

Parameters:

streaming_post_process(rets, keep_all_outputs=True)#

Post-process the output of the forward function for streaming.

Parameters:

update_max_seq_length(seq_length: int, device)#

Updates the maximum sequence length for the model.

Parameters:

class nemo.collections.asr.modules.SqueezeformerEncoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable, AccessMixin

The encoder for the Squeezeformer ASR model. Based on this paper: ‘Squeezeformer: An Efficient Transformer for Automatic Speech Recognition’ by Sehoon Kim et al. https://arxiv.org/abs/2206.00888

Parameters:

input_example(max_batch=1, max_dim=256)#

Generates input examples for tracing etc.

Returns:

A tuple of input examples.

property input_types#

Returns definitions of module input ports.

make_pad_mask(max_audio_length, seq_lens)#

Make masking for padding.

property output_types#

Returns definitions of module output ports.

set_max_audio_length(max_audio_length)#

Sets maximum input length. Pre-calculates internal seq_range mask.

class nemo.collections.asr.modules.RNNEncoder(*args: Any, **kwargs: Any)#

Bases: NeuralModule, Exportable

The RNN-based encoder for ASR models. Follows the architecture suggested in the following paper: ‘STREAMING END-TO-END SPEECH RECOGNITION FOR MOBILE DEVICES’ by Yanzhang He et al. https://arxiv.org/pdf/1811.06621.pdf

Parameters:

input_example()#

Generates input examples for tracing etc.

Returns:

A tuple of input examples.

property input_types#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.

class nemo.collections.asr.modules.RNNTDecoder(*args: Any, **kwargs: Any)#

Bases: AbstractRNNTDecoder, Exportable, AdapterModuleMixin

A Recurrent Neural Network Transducer Decoder / Prediction Network (RNN-T Prediction Network). An RNN-T Decoder/Prediction network, comprised of a stateful LSTM model.

Parameters:

add_adapter(name: str, cfg: omegaconf.DictConfig)#

Add an Adapter module to this module.

Parameters:

classmethod batch_aggregate_states_beam(

src_states: tuple[torch.Tensor, torch.Tensor],

batch_size: int,

beam_size: int,

indices: torch.Tensor,

dst_states: tuple[torch.Tensor, torch.Tensor] | None = None,

) → tuple[torch.Tensor, torch.Tensor]#

Aggregates decoder states based on the given indices.

Parameters:

src_states – source states of shape ([L x (batch_size * beam_size, H)], [L x (batch_size * beam_size, H)])

Return type:

Tuple[torch.Tensor, torch.Tensor]

Note

during the gathering operation.

batch_concat_states(

batch_states: List[List[torch.Tensor]],

) → List[torch.Tensor]#

Concatenate a batch of decoder state to a packed state.

Parameters:

batch_states (list) – batch of decoder states B x ([L x (H)], [L x (H)])

Returns:

decoder states

(L x B x H, L x B x H)

Return type:

(tuple)

batch_copy_states(

old_states: List[torch.Tensor],

new_states: List[torch.Tensor],

ids: List[int],

value: float | None = None,

) → List[torch.Tensor]#

Copy states from new state to old state at certain indices.

Parameters:

Returns:

batch of decoder states with partial copy at ids (or a specific value).

(L x B x H, L x B x H)

batch_initialize_states(

decoder_states: List[List[torch.Tensor]],

) → List[torch.Tensor]#

Creates stacked decoder states to be passed to the prediction network.

Parameters:

decoder_states (list of list of list of torch.Tensor) –

list of decoder states [B, C, L, H]

Returns:

batch of decoder states

[C x torch.Tensor[L x B x H]]

Return type:

batch_states (list of torch.Tensor)

classmethod batch_replace_states_all(

src_states: Tuple[torch.Tensor, torch.Tensor],

dst_states: Tuple[torch.Tensor, torch.Tensor],

batch_size: int | None = None,

)#

Replace states in dst_states with states from src_states

classmethod batch_replace_states_mask(

src_states: Tuple[torch.Tensor, torch.Tensor],

dst_states: Tuple[torch.Tensor, torch.Tensor],

mask: torch.Tensor,

other_src_states: Tuple[torch.Tensor, torch.Tensor] | None = None,

)#

Replaces states in dst_states with states from src_states based on the given mask.

Parameters:

Note

This operation is performed without CPU-GPU synchronization by using torch.where.

batch_score_hypothesis(

hypotheses: List[Hypothesis],

cache: Dict[Tuple[int], Any],

) → Tuple[List[torch.Tensor], List[List[torch.Tensor]]]#

Used for batched beam search algorithms. Similar to score_hypothesis method.

Parameters:

Returns:

batch_dec_out: a list of torch.Tensor [1, H] representing the prediction network outputs for the last tokens in the Hypotheses. batch_dec_states: a list of list of RNN states, each of shape [L, B, H]. Represented as B x List[states].

Return type:

Returns a tuple (batch_dec_out, batch_dec_states) such that

batch_select_state(

batch_states: List[torch.Tensor],

idx: int,

) → List[List[torch.Tensor]]#

Get decoder state from batch of states, for given id.

Parameters:

Returns:

decoder states for given id

([L x (1, H)], [L x (1, H)])

Return type:

(tuple)

classmethod batch_split_states(

batch_states: tuple[torch.Tensor, torch.Tensor],

) → list[tuple[torch.Tensor, torch.Tensor]]#

Split states into a list of states. Useful for splitting the final state for converting results of the decoding algorithm to Hypothesis class.

classmethod batch_unsplit_states(

batch_states: list[tuple[torch.Tensor, torch.Tensor]],

device=None,

dtype=None,

) → tuple[torch.Tensor, torch.Tensor]#

Concatenate a batch of decoder state to a packed state. Inverse of batch_split_states.

Parameters:

batch_states (list) – batch of decoder states B x ([L x (H)], [L x (H)])

Returns:

decoder states

(L x B x H, L x B x H)

Return type:

(tuple)

classmethod clone_state(

state: tuple[torch.Tensor, torch.Tensor],

) → tuple[torch.Tensor, torch.Tensor]#

Return copy of the states

initialize_state(

y: torch.Tensor,

) → Tuple[torch.Tensor, torch.Tensor]#

Initialize the state of the LSTM layers, with same dtype and device as input y. LSTM accepts a tuple of 2 tensors as a state.

Parameters:

y – A torch.Tensor whose device the generated states will be placed on.

Returns:

Tuple of 2 tensors, each of shape [L, B, H], where

L = Number of RNN layers

B = Batch size

H = Hidden size of RNN.

input_example(max_batch=1, max_dim=1)#

Generates input examples for tracing etc.

Returns:

A tuple of input examples.

property input_types#

Returns definitions of module input ports.

mask_select_states(

states: Tuple[torch.Tensor, torch.Tensor],

mask: torch.Tensor,

) → Tuple[torch.Tensor, torch.Tensor]#

Return states by mask selection.

Parameters:

states – states for the batch
mask – boolean mask for selecting states; batch dimension should be the same as for states

Returns:

states filtered by mask

property output_types#

Returns definitions of module output ports.

predict(

y: torch.Tensor | None = None,

state: List[torch.Tensor] | None = None,

add_sos: bool = True,

batch_size: int | None = None,

) → Tuple[torch.Tensor, List[torch.Tensor]]#

Stateful prediction of scores and state for a (possibly null) tokenset. This method takes various cases into consideration:

- No token, no state – used for priming the RNN
- No token, state provided – used for blank token scoring
- Given token, states – used for scores + new states

Here:

B = batch size
U = label length
H = Hidden dimension size of RNN
L = Number of RNN layers

Parameters:

Returns:

A tuple (g, hid) such that -

If add_sos is False:

g:

(B, U, H)

hid:

(h, c) where h is the final sequence hidden state and c is the final cell state:

h (tensor), shape (L, B, H)

c (tensor), shape (L, B, H)

If add_sos is True:

g:

(B, U + 1, H)

hid:

(h, c) where h is the final sequence hidden state and c is the final cell state:

h (tensor), shape (L, B, H)

c (tensor), shape (L, B, H)
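A minimal sketch of the three call patterns listed above, using the prediction network of a pretrained transducer model; the checkpoint name and token id are illustrative assumptions.

```python
# Minimal sketch of the three predict() call patterns on the prediction network
# of a pretrained RNN-T model. Checkpoint name and token id are assumptions.
import torch
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_transducer_large")
decoder = model.decoder  # RNNTDecoder

# 1) No token, no state: prime the RNN.
g, state = decoder.predict(y=None, state=None, add_sos=False, batch_size=1)

# 2) No token, state provided: blank-token scoring from an existing state.
g, state = decoder.predict(y=None, state=state, add_sos=False, batch_size=1)

# 3) Given token and state: score the next step after emitting token id 10.
y = torch.tensor([[10]], dtype=torch.long)
g, state = decoder.predict(y=y, state=state, add_sos=False)
print(g.shape)  # (B, U, H) when add_sos is False
```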

score_hypothesis(

hypothesis: Hypothesis,

cache: Dict[Tuple[int], Any],

) → Tuple[torch.Tensor, List[torch.Tensor], torch.Tensor]#

Similar to the predict() method, but this method scores a Hypothesis during beam search. Hypothesis is a dataclass representing one hypothesis in a Beam Search.

Parameters:

Returns:

y is a torch.Tensor of shape [1, 1, H] representing the score of the last token in the Hypothesis. state is a list of RNN states, each of shape [L, 1, H]. lm_token is the final integer token of the hypothesis.

Return type:

Returns a tuple (y, states, lm_token) such that

class nemo.collections.asr.modules.StatelessTransducerDecoder(*args: Any, **kwargs: Any)#

Bases: AbstractRNNTDecoder, Exportable

A Stateless Neural Network Transducer Decoder / Prediction Network. An RNN-T Decoder/Prediction stateless network that simply takes the concatenation of the embeddings of the history tokens as its output.

Parameters:

batch_concat_states(

batch_states: List[List[torch.Tensor]],

) → List[torch.Tensor]#

Concatenate a batch of decoder state to a packed state.

Parameters:

batch_states (list) – batch of decoder states B x [(C)]

Returns:

decoder states

[(B x C)]

Return type:

(tuple)

batch_copy_states(

old_states: List[torch.Tensor],

new_states: List[torch.Tensor],

ids: List[int],

value: float | None = None,

) → List[torch.Tensor]#

Copy states from new state to old state at certain indices.

Parameters:

Returns:

batch of decoder states with partial copy at ids (or a specific value). (B x C)

batch_initialize_states(

decoder_states: List[List[torch.Tensor]],

)#

Creates a stacked decoder states to be passed to prediction network.

Parameters:

decoder_states (list of list of torch.Tensor) –

list of decoder states [B, 1, C]

Returns:

batch of decoder states [[B x C]]

Return type:

batch_states (list of torch.Tensor)

classmethod batch_replace_states_all(

src_states: list[torch.Tensor],

dst_states: list[torch.Tensor],

batch_size: int | None = None,

)#

Replace states in dst_states with states from src_states

classmethod batch_replace_states_mask(

src_states: tuple[torch.Tensor, torch.Tensor] | list[torch.Tensor],

dst_states: tuple[torch.Tensor, torch.Tensor] | list[torch.Tensor],

mask: torch.Tensor,

other_src_states: tuple[torch.Tensor, torch.Tensor] | list[torch.Tensor] | None = None,

)#

Replaces states in dst_states with states from src_states based on the given mask.

Parameters:

Note

This operation is performed without CPU-GPU synchronization by using torch.where.

batch_score_hypothesis(

hypotheses: List[Hypothesis],

cache: Dict[Tuple[int], Any],

) → Tuple[List[torch.Tensor], List[List[torch.Tensor]]]#

Used for batched beam search algorithms. Similar to score_hypothesis method.

Parameters:

Returns:

batch_dec_out: a list of torch.Tensor [1, H] representing the prediction network outputs for the last tokens in the Hypotheses. batch_dec_states: a list of list of RNN states, each of shape [L, B, H]. Represented as B x List[states].

Return type:

Returns a tuple (batch_dec_out, batch_dec_states) such that

batch_select_state(

batch_states: List[torch.Tensor],

idx: int,

) → List[List[torch.Tensor]]#

Get decoder state from batch of states, for given id.

Parameters:

Returns:

decoder states for given id

[(C)]

Return type:

(tuple)

classmethod batch_split_states(

batch_states: list[torch.Tensor],

) → list[list[torch.Tensor]]#

Split states into a list of states. Useful for splitting the final state for converting results of the decoding algorithm to Hypothesis class.

classmethod batch_unsplit_states(

batch_states: list[list[torch.Tensor]],

device=None,

dtype=None,

) → list[torch.Tensor]#

Concatenate a batch of decoder state to a packed state. Inverse of batch_split_states.

classmethod clone_state(

state: list[torch.Tensor],

) → list[torch.Tensor]#

Return copy of the states

initialize_state(

y: torch.Tensor,

) → List[torch.Tensor]#

Initialize the state of the RNN layers, with same dtype and device as input y.

Parameters:

y – A torch.Tensor whose device the generated states will be placed on.

Returns:

List of torch.Tensor, each of shape [L, B, H], where

L = Number of RNN layers B = Batch size H = Hidden size of RNN.

input_example(max_batch=1, max_dim=1)#

Generates input examples for tracing etc.

Returns:

A tuple of input examples.

property input_types#

Returns definitions of module input ports.

mask_select_states(

states: List[torch.Tensor] | None,

mask: torch.Tensor,

) → List[torch.Tensor] | None#

Return states by mask selection.

Parameters:

states – states for the batch
mask – boolean mask for selecting states; batch dimension should be the same as for states

Returns:

states filtered by mask

property output_types#

Returns definitions of module output ports.

predict(

y: torch.Tensor | None = None,

state: torch.Tensor | None = None,

add_sos: bool = True,

batch_size: int | None = None,

) → Tuple[torch.Tensor, List[torch.Tensor]]#

Stateful prediction of scores and state for a tokenset.

Here:

B = batch size
U = label length
C = context size for stateless decoder
D = total embedding size

Parameters:

Returns:

A tuple (g, state) such that -

If add_sos is False:

g:

(B, U, D)

state:

[(B, C)] storing the history context including the new words in y.

If add_sos is True:

g:

(B, U + 1, D)

state:

[(B, C)] storing the history context including the new words in y.

score_hypothesis(

hypothesis: Hypothesis,

cache: Dict[Tuple[int], Any],

) → Tuple[torch.Tensor, List[torch.Tensor], torch.Tensor]#

Similar to the predict() method, but this method scores a Hypothesis during beam search. Hypothesis is a dataclass representing one hypothesis in a Beam Search.

Parameters:

Returns:

y is a torch.Tensor of shape [1, 1, H] representing the score of the last token in the Hypothesis. state is a list of RNN states, each of shape [L, 1, H]. lm_token is the final integer token of the hypothesis.

Return type:

Returns a tuple (y, states, lm_token) such that

class nemo.collections.asr.modules.RNNTJoint(*args: Any, **kwargs: Any)#

Bases: AbstractRNNTJoint, Exportable, AdapterModuleMixin

A Recurrent Neural Network Transducer Joint Network (RNN-T Joint Network). An RNN-T Joint network, comprised of a feedforward model.

Parameters:

add_adapter(name: str, cfg: omegaconf.DictConfig)#

Add an Adapter module to this module.

Parameters:

property disabled_deployment_input_names#

Implement this method to return a set of input names disabled for export

input_example(max_batch=1, max_dim=8192)#

Generates input examples for tracing etc.

Returns:

A tuple of input examples.

property input_types#

Returns definitions of module input ports.

joint_after_projection(

f: torch.Tensor,

g: torch.Tensor,

) → torch.Tensor#

Compute the joint step of the network after projection.

Here:

B = Batch size
T = Acoustic model timesteps
U = Target sequence length
H1, H2 = Hidden dimensions of the Encoder / Decoder respectively
H = Hidden dimension of the Joint hidden step
V = Vocabulary size of the Decoder (excluding the RNNT blank token)

Note

The implementation of this model is slightly modified from the original paper. The original paper proposes the following steps:

(enc, dec) -> Expand + Concat + Sum [B, T, U, H1+H2] -> Forward through joint hidden [B, T, U, H] – *1
*1 -> Forward through joint final [B, T, U, V + 1]

We instead split the joint hidden into joint_hidden_enc and joint_hidden_dec and act as follows:

enc -> Forward through joint_hidden_enc -> Expand [B, T, 1, H] – *1
dec -> Forward through joint_hidden_dec -> Expand [B, 1, U, H] – *2
(*1, *2) -> Sum [B, T, U, H] -> Forward through joint final [B, T, U, V + 1]

Parameters:

Returns:

Logits / log softmaxed tensor of shape (B, T, U, V + 1).
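The split-projection scheme described in the note above can be sketched in plain PyTorch as follows. This is an illustrative re-implementation, not the NeMo code itself, and all dimensions are assumed for the example.

```python
# Illustrative re-implementation of the split joint described above (not the
# actual NeMo code). f: (B, T, H1) encoder output, g: (B, U, H2) decoder output.
import torch
import torch.nn as nn

B, T, U, H1, H2, H, V = 2, 50, 10, 512, 640, 320, 128
joint_hidden_enc = nn.Linear(H1, H)
joint_hidden_dec = nn.Linear(H2, H)
joint_final = nn.Sequential(nn.ReLU(), nn.Linear(H, V + 1))

f = torch.randn(B, T, H1)
g = torch.randn(B, U, H2)

f_proj = joint_hidden_enc(f).unsqueeze(2)   # (B, T, 1, H) – *1
g_proj = joint_hidden_dec(g).unsqueeze(1)   # (B, 1, U, H) – *2
logits = joint_final(f_proj + g_proj)       # (B, T, U, V + 1)
print(logits.shape)
```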

property output_types#

Returns definitions of module output ports.

project_encoder(

encoder_output: torch.Tensor,

) → torch.Tensor#

Project the encoder output to the joint hidden dimension.

Parameters:

encoder_output – A torch.Tensor of shape [B, T, D]

Returns:

A torch.Tensor of shape [B, T, H]

project_prednet(

prednet_output: torch.Tensor,

) → torch.Tensor#

Project the Prediction Network (Decoder) output to the joint hidden dimension.

Parameters:

prednet_output – A torch.Tensor of shape [B, U, D]

Returns:

A torch.Tensor of shape [B, U, H]

class nemo.collections.asr.modules.SampledRNNTJoint(*args: Any, **kwargs: Any)#

Bases: RNNTJoint

A Sampled Recurrent Neural Network Transducer Joint Network (RNN-T Joint Network). An RNN-T Joint network, comprised of a feedforward model, where the vocab size will be sampled instead of computing the full vocabulary joint.

Parameters:

sampled_joint(

f: torch.Tensor,

g: torch.Tensor,

transcript: torch.Tensor,

transcript_lengths: torch.Tensor,

) → torch.Tensor#

Compute the sampled joint step of the network.

Reference: Memory-Efficient Training of RNN-Transducer with Sampled Softmax.

Here:

B = Batch size
T = Acoustic model timesteps
U = Target sequence length
H1, H2 = Hidden dimensions of the Encoder / Decoder respectively
H = Hidden dimension of the Joint hidden step
V = Vocabulary size of the Decoder (excluding the RNNT blank token)
S = Sample size of vocabulary

Note

The implementation of this joint model is slightly modified from the original paper. The original paper proposes the following steps:

(enc, dec) -> Expand + Concat + Sum [B, T, U, H1+H2] -> Forward through joint hidden [B, T, U, H] – *1
*1 -> Forward through joint final [B, T, U, V + 1]

We instead split the joint hidden into joint_hidden_enc and joint_hidden_dec and act as follows:

enc -> Forward through joint_hidden_enc -> Expand [B, T, 1, H] – *1
dec -> Forward through joint_hidden_dec -> Expand [B, 1, U, H] – *2
(*1, *2) -> Sum [B, T, U, H]
-> Sample Vocab V_Pos (for target tokens) and V_Neg (V_Neg is sampled non-uniformly, as a random permutation of all vocab tokens; all tokens in Intersection(V_Pos, V_Neg) are then eliminated to avoid duplication of loss)
-> Concat new Vocab V_Sampled = Union(V_Pos, V_Neg)
-> Forward partially through the joint final to create [B, T, U, V_Sampled]

Parameters:

Returns:

Logits / log softmaxed tensor of shape (B, T, U, V + 1).

Parts#

class nemo.collections.asr.parts.submodules.jasper.JasperBlock(*args: Any, **kwargs: Any)#

Bases: Module, AdapterModuleMixin, AccessMixin

Constructs a single “Jasper” block. With modified parameters, also constructs other blocks for models such as QuartzNet and Citrinet.

Note that the above are general distinctions; each model has intricate differences that extend over multiple such blocks.

For further information about the differences between models which use JasperBlock, please review the configs for ASR models found in the ASR examples directory.

Parameters:

forward(

input_: Tuple[List[torch.Tensor], torch.Tensor | None],

) → Tuple[List[torch.Tensor], torch.Tensor | None]#

Forward pass of the module.

Parameters:

input – The input is a tuple of two values - the preprocessed audio signal as well as the lengths of the audio signal. The audio signal is padded to the shape [B, D, T] and the lengths are a torch vector of length B.

Returns:

The output of the block after processing the input through repeat number of sub-blocks, as well as the lengths of the encoded audio after padding/striding.

Mixins#

class nemo.collections.asr.parts.mixins.mixins.ASRBPEMixin#

Bases: ABC

ASR BPE Mixin class that sets up a Tokenizer via a config

This mixin class adds the method _setup_tokenizer(…), which can be used by ASR models which depend on subword tokenization.

The setup_tokenizer method adds the following parameters to the class -

In addition to these variables, the method will also instantiate and preserve a tokenizer (subclass of TokenizerSpec) if successful, and assign it to self.tokenizer.

The mixin also supports aggregate tokenizers, which consist of ordinary, monolingual tokenizers. If a conversion between a monolingual and an aggregate tokenizer (or vice versa) is detected, all registered artifacts will be cleaned up.

save_tokenizers(directory: str)#

Save the model tokenizer(s) to the specified directory.

Parameters:

directory – The directory to save the tokenizer(s) to.

class nemo.collections.asr.parts.mixins.mixins.ASRModuleMixin#

Bases: ASRAdapterModelMixin

ASRModuleMixin is a mixin class added to ASR models in order to add methods that are specific to a particular instantiation of a module inside of an ASRModel.

Each method should first check that the module is present within the subclass, and support additional functionality if the corresponding module is present.

change_attention_model(

self_attention_model: str | None = None,

att_context_size: List[int] | None = None,

update_config: bool = True,

)#

Update the self_attention_model if function is available in encoder.

Parameters:
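For example, a pretrained Conformer model can be switched to limited-context local attention roughly as follows; the checkpoint name and context sizes are illustrative assumptions.

```python
# Sketch: switch a pretrained Conformer to limited-context local attention.
# Checkpoint name and context sizes are illustrative assumptions.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_large")
asr_model.change_attention_model(
    self_attention_model="rel_pos_local_attn",  # relative-position local attention
    att_context_size=[128, 128],                # left / right context in frames
)
```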

change_conv_asr_se_context_window(

context_window: int,

update_config: bool = True,

)#

Update the context window of the SqueezeExcitation module if the provided model contains an encoder which is an instance of ConvASREncoder.

Parameters:

change_subsampling_conv_chunking_factor(

subsampling_conv_chunking_factor: int,

update_config: bool = True,

)#

Update the conv_chunking_factor (int) if the function is available in the encoder. Default is 1 (auto). Set it to -1 (disabled) or to a specific value (a power of 2) if you run out of memory (OOM) in the conv subsampling layers.

Parameters:

conv_chunking_factor (int)

conformer_stream_step(

processed_signal: torch.Tensor,

processed_signal_length: torch.Tensor | None = None,

cache_last_channel: torch.Tensor | None = None,

cache_last_time: torch.Tensor | None = None,

cache_last_channel_len: torch.Tensor | None = None,

keep_all_outputs: bool = True,

previous_hypotheses: List[Hypothesis] | None = None,

previous_pred_out: torch.Tensor | None = None,

drop_extra_pre_encoded: int | None = None,

return_transcription: bool = True,

return_log_probs: bool = False,

)#

It simulates a forward step with caching for streaming purposes. It supports ASR models whose encoder supports streaming, like Conformer.

Parameters:

processed_signal – the input audio signals
processed_signal_length – the length of the audios
cache_last_channel – the cache tensor for last channel layers like MHA
cache_last_channel_len – lengths for cache_last_channel
cache_last_time – the cache tensor for last time layers like convolutions
keep_all_outputs – if set to True, would not drop the extra outputs specified by encoder.streaming_cfg.valid_out_len
previous_hypotheses – the hypotheses from the previous step for RNNT models
previous_pred_out – the predicted outputs from the previous step for CTC models
drop_extra_pre_encoded – number of steps to drop from the beginning of the outputs after the downsampling module. This can be used if extra paddings are added on the left side of the input.
return_transcription – whether to decode and return the transcriptions. It cannot be disabled for Transducer models.
return_log_probs – whether to return the log probs, only valid for CTC models

Returns:

the greedy predictions from the decoder
all_hyp_or_transcribed_texts: the decoder hypotheses for Transducer models and the transcriptions for CTC models
cache_last_channel_next: the updated tensor cache for last channel layers to be used for next streaming step
cache_last_time_next: the updated tensor cache for last time layers to be used for next streaming step
cache_last_channel_next_len: the updated lengths for cache_last_channel
best_hyp: the best hypotheses for the Transducer models
log_probs: the logits tensor of current streaming chunk, only returned when return_log_probs=True
encoded_len: the length of the output log_probs + history chunk log_probs, only returned when return_log_probs=True

Return type:

greedy_predictions
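A rough sketch of a cache-aware streaming loop built on this method follows. It assumes a cache-aware streaming checkpoint, a user-provided iterator of preprocessed feature chunks, and that the returned tuple follows the ordering in the Returns section above; all of these are assumptions for illustration.

```python
# Rough sketch of cache-aware streaming with conformer_stream_step().
# Assumes a cache-aware streaming checkpoint and a user-provided iterator
# `feature_chunks` yielding (chunk, chunk_len) of preprocessed features.
import torch
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "stt_en_fastconformer_hybrid_large_streaming_multi"
)
asr_model.eval()

# Initial (empty) caches for batch size 1.
cache_ch, cache_t, cache_ch_len = asr_model.encoder.get_initial_cache_state(batch_size=1)

prev_hyps = None
for chunk, chunk_len in feature_chunks:
    with torch.inference_mode():
        results = asr_model.conformer_stream_step(
            processed_signal=chunk,
            processed_signal_length=chunk_len,
            cache_last_channel=cache_ch,
            cache_last_time=cache_t,
            cache_last_channel_len=cache_ch_len,
            keep_all_outputs=False,
            previous_hypotheses=prev_hyps,
            return_transcription=True,
        )
    # Per the Returns section above: predictions, transcriptions, then the
    # updated caches to feed into the next streaming step.
    transcripts = results[1]
    cache_ch, cache_t, cache_ch_len = results[2], results[3], results[4]
    print(transcripts)
```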

transcribe_simulate_cache_aware_streaming(

paths2audio_files: List[str],

batch_size: int = 4,

logprobs: bool = False,

return_hypotheses: bool = False,

online_normalization: bool = False,

)#

Parameters:

Returns:

A list of transcriptions (or raw log probabilities if logprobs is True) in the same order as paths2audio_files

class nemo.collections.asr.parts.mixins.transcription.TranscriptionMixin#

Bases: ABC

An abstract class for transcribe-able models.

Creates a template function transcribe() that provides an interface to perform transcription of audio tensors or filepaths.

The following abstract classes must be implemented by the subclass:

transcribe(

audio: str | List[str] | numpy.ndarray | torch.utils.data.DataLoader,

batch_size: int = 4,

return_hypotheses: bool = False,

num_workers: int = 0,

channel_selector: int | Iterable[int] | str | None = None,

augmentor: omegaconf.DictConfig | None = None,

verbose: bool = True,

timestamps: bool | None = None,

override_config: TranscribeConfig | None = None,

**config_kwargs,

) → List[Any] | List[List[Any]] | Tuple[Any] | Tuple[List[Any]] | Dict[str, List[Any]]#

Template function that defines the execution strategy for transcribing audio.

Parameters:

Returns:

Output is defined by the subclass implementation of TranscriptionMixin._transcribe_output_processing(). It can be:
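A minimal sketch of the most common call follows; the checkpoint and file names are illustrative assumptions.

```python
# Minimal sketch of transcribe(); checkpoint and file names are assumptions.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_small")
outputs = asr_model.transcribe(
    ["audio_1.wav", "audio_2.wav"],   # file paths; np.ndarray or a DataLoader also work
    batch_size=2,
    return_hypotheses=False,
)
print(outputs)
```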

transcribe_generator(

audio,

override_config: TranscribeConfig | None,

)#

A generator version of transcribe function.

class nemo.collections.asr.parts.mixins.transcription.TranscribeConfig(

batch_size: int = 4,

return_hypotheses: bool = False,

num_workers: int | None = None,

channel_selector: int | Iterable[int] | str = None,

augmentor: omegaconf.DictConfig | None = None,

timestamps: bool | None = None,

verbose: bool = True,

partial_hypothesis: List[Any] | None = None,

_internal: nemo.collections.asr.parts.mixins.transcription.InternalTranscribeConfig | None = None,

)#

Bases: object

class nemo.collections.asr.parts.mixins.interctc_mixin.InterCTCMixin#

Bases: object

Adds utilities for computing interCTC loss from https://arxiv.org/abs/2102.03216.

To use, make sure the encoder accesses the interctc['capture_layers'] property in the AccessMixin and registers interctc/layer_output_X and interctc/layer_length_X for all layers that we want to get loss from. Additionally, specify the following config parameters to set up loss:

interctc:
    # can use different values
    loss_weights: [0.3]
    apply_at_layers: [8]

Then call

add_interctc_losses(

loss_value: torch.Tensor,

transcript: torch.Tensor,

transcript_len: torch.Tensor,

compute_wer: bool,

compute_loss: bool = True,

log_wer_num_denom: bool = False,

log_prefix: str = '',

) → Tuple[torch.Tensor | None, Dict]#

Adding interCTC losses if required.

Will also register loss/wer metrics in the returned dictionary.

Parameters:

Returns:

tuple of new loss tensor and dictionary with logged metrics.

Return type:

tuple[Optional[torch.Tensor], Dict]

finalize_interctc_metrics(

metrics: Dict,

outputs: List[Dict],

prefix: str,

)#

Finalizes InterCTC WER and loss metrics for logging purposes.

Should be called inside multi_validation_epoch_end (with prefix="val_") or multi_test_epoch_end (with prefix="test_").

Note that metrics dictionary is going to be updated in-place.

get_captured_interctc_tensors() → List[Tuple[torch.Tensor, torch.Tensor]]#

Returns a list of captured tensors from encoder: tuples of (output, length).

Will additionally apply ctc_decoder to the outputs.

get_interctc_param(param_name)#

Either directly get parameter from self._interctc_params or call getattr with the corresponding name.

is_interctc_enabled() → bool#

Returns whether interCTC loss is enabled.

set_interctc_enabled(enabled: bool)#

Can be used to enable/disable InterCTC manually.

set_interctc_param(param_name, param_value)#

Setting the parameter to the self._interctc_params dictionary.

Raises an error if trying to set decoder, loss or wer as those should always come from the main class.

setup_interctc(decoder_name, loss_name, wer_name)#

Sets up all interctc-specific parameters and checks config consistency.

Caller has to specify names of attributes to perform CTC-specific WER, decoder and loss computation. They will be looked up in the class state with getattr.

The reason we store the names and look up the objects later is that those objects might change without re-calling the setup of this class. So we always want to look up the most up-to-date object instead of "caching" it here.

Datasets#

Character Encoding Datasets#

class nemo.collections.asr.data.audio_to_text.AudioToCharDataset(*args: Any, **kwargs: Any)#

Bases: _AudioTextDataset

Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:

{"audio_filepath": "/path/to/audio.wav", "text_filepath": "/path/to/audio.txt", "duration": 23.147}
...
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription", "offset": 301.75, "duration": 0.82, "utt": "utterance_id", "ctm_utt": "en_4156", "side": "A"}

Parameters:
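A minimal sketch of writing a manifest in the format shown above and constructing the dataset; the file paths, label set, and keyword arguments are illustrative assumptions.

```python
# Sketch: write a manifest in the format above and build an AudioToCharDataset.
# File paths, labels, and keyword arguments are illustrative assumptions.
import json
from nemo.collections.asr.data.audio_to_text import AudioToCharDataset

samples = [
    {"audio_filepath": "/path/to/audio_0.wav", "text": "the transcription", "duration": 0.82},
    {"audio_filepath": "/path/to/audio_1.wav", "text": "another utterance", "duration": 1.47},
]
with open("train_manifest.json", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")   # one JSON object per line

dataset = AudioToCharDataset(
    manifest_filepath="train_manifest.json",
    labels=list(" abcdefghijklmnopqrstuvwxyz'"),
    sample_rate=16000,
)
```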

property output_types_: Dict[str, NeuralType] | None_#

Returns definitions of module output ports.

class nemo.collections.asr.data.audio_to_text.TarredAudioToCharDataset(*args: Any, **kwargs: Any)#

Bases: decorator

A similar Dataset to the AudioToCharDataset, but which loads tarred audio files.

Accepts a single comma-separated JSON manifest file (in the same style as for the AudioToCharDataset), as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should contain the information for one audio file, including at least the transcript and name of the audio file within the tarball.

Valid formats for the audio_tar_filepaths argument include: (1) a single string that can be brace-expanded, e.g. ‘path/to/audio.tar’ or ‘path/to/audio_{1..100}.tar.gz’, or (2) a list of file paths that will not be brace-expanded, e.g. [‘audio_1.tar’, ‘audio_2.tar’, …].

See the WebDataset documentation for more information about accepted data and input formats.

If using multiple workers, the number of shards should be divisible by world_size to ensure an even split among workers. If it is not divisible, logging will give a warning but training will proceed. In addition, if using multiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering is applied. We currently do not check for this, but your program may hang if the shards are uneven!

Notice that a few arguments are different from the AudioToCharDataset; for example, shuffle (bool) has been replaced by shuffle_n (int).

Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest after filtering. An incorrect manifest length may lead to some DataLoader issues down the line.

Parameters:

Text-to-Text Datasets for Hybrid ASR-TTS models#

class nemo.collections.asr.data.text_to_text.TextToTextDataset(*args: Any, **kwargs: Any)#

Bases: TextToTextDatasetBase, Dataset

Text-to-Text Map-style Dataset for hybrid ASR-TTS models

collate_fn(

batch: List[TextToTextItem | tuple],

) → TextToTextBatch | TextOrAudioToTextBatch | tuple#

Collate function for dataloader Can accept mixed batch of text-to-text items and audio-text items (typical for ASR)

class nemo.collections.asr.data.text_to_text.TextToTextIterableDataset(*args: Any, **kwargs: Any)#

Bases: TextToTextDatasetBase, IterableDataset

Text-to-Text Iterable Dataset for hybrid ASR-TTS models Only part necessary for current process should be loaded and stored

collate_fn(

batch: List[TextToTextItem | tuple],

) → TextToTextBatch | TextOrAudioToTextBatch | tuple#

Collate function for dataloader Can accept mixed batch of text-to-text items and audio-text items (typical for ASR)

Subword Encoding Datasets#

class nemo.collections.asr.data.audio_to_text.AudioToBPEDataset(*args: Any, **kwargs: Any)#

Bases: _AudioTextDataset

Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations (in seconds). Each new line is a different sample. Example below:

{"audio_filepath": "/path/to/audio.wav", "text_filepath": "/path/to/audio.txt", "duration": 23.147}
...
{"audio_filepath": "/path/to/audio.wav", "text": "the transcription", "offset": 301.75, "duration": 0.82, "utt": "utterance_id", "ctm_utt": "en_4156", "side": "A"}

In practice, the dataset and manifest used for character encoding and byte pair encoding are exactly the same. The only difference lies in how the dataset tokenizes the text in the manifest.

Parameters:

property output_types_: Dict[str, NeuralType] | None_#

Returns definitions of module output ports.

class nemo.collections.asr.data.audio_to_text.TarredAudioToBPEDataset(*args: Any, **kwargs: Any)#

Bases: decorator

A similar Dataset to the AudioToBPEDataset, but which loads tarred audio files.

Accepts a single comma-separated JSON manifest file (in the same style as for the AudioToBPEDataset), as well as the path(s) to the tarball(s) containing the wav files. Each line of the manifest should contain the information for one audio file, including at least the transcript and name of the audio file within the tarball.

Valid formats for the audio_tar_filepaths argument include: (1) a single string that can be brace-expanded, e.g. ‘path/to/audio.tar’ or ‘path/to/audio_{1..100}.tar.gz’, or (2) a list of file paths that will not be brace-expanded, e.g. [‘audio_1.tar’, ‘audio_2.tar’, …].

See the WebDataset documentation for more information about accepted data and input formats.

If using multiple workers, the number of shards should be divisible by world_size to ensure an even split among workers. If it is not divisible, logging will give a warning but training will proceed. In addition, if using multiprocessing, each shard MUST HAVE THE SAME NUMBER OF ENTRIES after filtering is applied. We currently do not check for this, but your program may hang if the shards are uneven!

Notice that a few arguments are different from the AudioToBPEDataset; for example, shuffle (bool) has been replaced by shuffle_n (int).

Additionally, please note that the len() of this DataLayer is assumed to be the length of the manifest after filtering. An incorrect manifest length may lead to some DataLoader issues down the line.

Parameters:

Audio Preprocessors#

class nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor(*args: Any, **kwargs: Any)#

Bases: AudioPreprocessor, Exportable

Featurizer module that converts wavs to mel spectrograms.

Parameters:
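A minimal sketch of converting raw waveforms to mel-spectrogram features; the argument values and shapes are illustrative assumptions.

```python
# Sketch: convert a batch of raw waveforms to mel-spectrogram features.
# Argument values and shapes are illustrative assumptions.
import torch
from nemo.collections.asr.modules import AudioToMelSpectrogramPreprocessor

preprocessor = AudioToMelSpectrogramPreprocessor(sample_rate=16000, features=80)

waveforms = torch.randn(2, 16000)        # (batch, time) raw audio at 16 kHz
lengths = torch.tensor([16000, 12000])
mel, mel_len = preprocessor(input_signal=waveforms, length=lengths)
print(mel.shape)                         # (batch, features, frames)
```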

input_example(

max_batch: int = 8,

max_dim: int = 32000,

min_length: int = 200,

)#

Override this method if random inputs won’t work.

Returns:

A tuple sample of valid input data.

property input_types#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.

processed_signal:

0: AxisType(BatchTag)
1: AxisType(MelSpectrogramSignalTag)
2: AxisType(ProcessedTimeTag)

processed_length:

0: AxisType(BatchTag)

classmethod restore_from(restore_path: str)#

Restores model instance (weights and configuration) from a .nemo file

Parameters:

save_to(save_path: str)#

Standardized method to save a tarfile containing the checkpoint, config, and any additional artifacts. Implemented via nemo.core.connectors.save_restore_connector.SaveRestoreConnector.save_to().

Parameters:

save_path – str, path to where the file should be saved.

class nemo.collections.asr.modules.AudioToMFCCPreprocessor(*args: Any, **kwargs: Any)#

Bases: AudioPreprocessor

Preprocessor that converts wavs to MFCCs. Uses torchaudio.transforms.MFCC.

Parameters:

property input_types#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.

classmethod restore_from(restore_path: str)#

Restores model instance (weights and configuration) from a .nemo file

Parameters:

save_to(save_path: str)#

Standardized method to save a tarfile containing the checkpoint, config, and any additional artifacts. Implemented via nemo.core.connectors.save_restore_connector.SaveRestoreConnector.save_to().

Parameters:

save_path – str, path to where the file should be saved.

Audio Augmentors#

class nemo.collections.asr.modules.SpectrogramAugmentation(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Performs time and freq cuts in one of two ways. SpecAugment zeroes out vertical and horizontal sections as described in SpecAugment (https://arxiv.org/abs/1904.08779). Arguments for use with SpecAugment are freq_masks, time_masks, freq_width, and time_width. SpecCutout zeroes out rectangular regions as described in Cutout (https://arxiv.org/abs/1708.04552). Arguments for use with Cutout are rect_masks, rect_freq, and rect_time.

Parameters:
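A minimal sketch of applying SpecAugment-style masking to a batch of spectrograms; mask counts and widths are illustrative assumptions.

```python
# Sketch: apply SpecAugment-style masking. Mask counts/widths are assumptions.
import torch
from nemo.collections.asr.modules import SpectrogramAugmentation

spec_augment = SpectrogramAugmentation(
    freq_masks=2, time_masks=2, freq_width=27, time_width=0.05,
)

spec = torch.randn(2, 80, 400)           # (batch, features, frames)
spec_len = torch.tensor([400, 320])
augmented_spec = spec_augment(input_spec=spec, length=spec_len)
```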

property input_types#

Returns definitions of module input types

property output_types#

Returns definitions of module output types

class nemo.collections.asr.modules.CropOrPadSpectrogramAugmentation(*args: Any, **kwargs: Any)#

Bases: NeuralModule

Pad or Crop the incoming Spectrogram to a certain shape.

Parameters:

audio_length (int) – the final number of timesteps that is required. The signal will be either padded or cropped temporally to this size.

property input_types#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.

classmethod restore_from(restore_path: str)#

Restores model instance (weights and configuration) from a .nemo file

Parameters:

save_to(save_path: str)#

Standardized method to save a tarfile containing the checkpoint, config, and any additional artifacts. Implemented via nemo.core.connectors.save_restore_connector.SaveRestoreConnector.save_to().

Parameters:

save_path – str, path to where the file should be saved.

class nemo.collections.asr.parts.preprocessing.perturb.SpeedPerturbation(

sr,

resample_type,

min_speed_rate=0.9,

max_speed_rate=1.1,

num_rates=5,

rng=None,

)#

Bases: Perturbation

Performs Speed Augmentation by re-sampling the data to a different sampling rate, which does not preserve pitch.

Note: This is a very slow operation for online augmentation. If space allows, it is preferable to pre-compute and save the files to augment the dataset.

Parameters:
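A minimal sketch of perturbing a single utterance offline; the file path and rate bounds are illustrative assumptions.

```python
# Sketch: speed-perturb one utterance offline. File path and rates are assumptions.
from nemo.collections.asr.parts.preprocessing.perturb import SpeedPerturbation
from nemo.collections.asr.parts.preprocessing.segment import AudioSegment

perturber = SpeedPerturbation(
    sr=16000, resample_type="kaiser_fast",
    min_speed_rate=0.9, max_speed_rate=1.1, num_rates=5,
)

segment = AudioSegment.from_file("utterance.wav", target_sr=16000)
perturber.perturb(segment)      # modifies segment.samples in place
print(segment.samples.shape)
```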

class nemo.collections.asr.parts.preprocessing.perturb.TimeStretchPerturbation(

min_speed_rate=0.9,

max_speed_rate=1.1,

num_rates=5,

n_fft=512,

rng=None,

)#

Bases: Perturbation

Time-stretch an audio series by a fixed rate while preserving pitch, based on [1], [2].

Note: This is a simplified implementation, intended primarily for reference and pedagogical purposes. It makes no attempt to handle transients, and is likely to produce audible artifacts.

References

Parameters:

class nemo.collections.asr.parts.preprocessing.perturb.GainPerturbation(min_gain_dbfs=-10, max_gain_dbfs=10, rng=None)#

Bases: Perturbation

Applies random gain to the audio.

Parameters:

class nemo.collections.asr.parts.preprocessing.perturb.ImpulsePerturbation(

manifest_path=None,

audio_tar_filepaths=None,

shuffle_n=128,

normalize_impulse=False,

shift_impulse=False,

rng=None,

)#

Bases: Perturbation

Convolves audio with a Room Impulse Response.

Parameters:

class nemo.collections.asr.parts.preprocessing.perturb.ShiftPerturbation(min_shift_ms=-5.0, max_shift_ms=5.0, rng=None)#

Bases: Perturbation

Perturbs audio by shifting the audio in time by a random amount between min_shift_ms and max_shift_ms. The final length of the audio is kept unaltered by padding the audio with zeros.

Parameters:

class nemo.collections.asr.parts.preprocessing.perturb.NoisePerturbation(

manifest_path=None,

min_snr_db=10,

max_snr_db=50,

max_gain_db=300.0,

rng=None,

audio_tar_filepaths=None,

shuffle_n=100,

orig_sr=16000,

)#

Bases: Perturbation

Perturbation that adds noise to input audio.

Parameters:

perturb(data, ref_mic=0)#

Parameters:

perturb_with_foreground_noise(

data,

noise,

data_rms=None,

max_noise_dur=2,

max_additions=1,

ref_mic=0,

)#

Parameters:

perturb_with_input_noise(

data,

noise,

data_rms=None,

ref_mic=0,

)#

Parameters:

class nemo.collections.asr.parts.preprocessing.perturb.WhiteNoisePerturbation(min_level=-90, max_level=-46, rng=None)#

Bases: Perturbation

Perturbation that adds white noise to an audio file in the training dataset.

Parameters:

class nemo.collections.asr.parts.preprocessing.perturb.RirAndNoisePerturbation(

rir_manifest_path=None,

rir_prob=0.5,

noise_manifest_paths=None,

noise_prob=1.0,

min_snr_db=0,

max_snr_db=50,

rir_tar_filepaths=None,

rir_shuffle_n=100,

noise_tar_filepaths=None,

apply_noise_rir=False,

orig_sample_rate=None,

max_additions=5,

max_duration=2.0,

bg_noise_manifest_paths=None,

bg_noise_prob=1.0,

bg_min_snr_db=10,

bg_max_snr_db=50,

bg_noise_tar_filepaths=None,

bg_orig_sample_rate=None,

rng=None,

)#

Bases: Perturbation

RIR augmentation with additive foreground and background noise. In this implementation audio data is augmented by first convolving the audio with a Room Impulse Response and then adding foreground noise and background noise at various SNRs. RIR, foreground and background noises should either be supplied with a manifest file or as tarred audio files (faster).

Different sets of noise audio files can be supplied based on the original sampling rate of the noise. This is useful while training a mixed sample rate model. For example, when training a mixed model with 8 kHz and 16 kHz audio with a target sampling rate of 16 kHz, one would want to augment 8 kHz data with 8 kHz noise rather than 16 kHz noise.

Parameters:

class nemo.collections.asr.parts.preprocessing.perturb.TranscodePerturbation(codecs=None, rng=None)#

Bases: Perturbation

Audio codec augmentation. This implementation uses sox to transcode audio with low rate audio codecs, so users need to make sure that the installed sox version supports the codecs used here (G711 and amr-nb).

Parameters:

Miscellaneous Classes#

CTC Decoding#

class nemo.collections.asr.parts.submodules.ctc_decoding.CTCDecoding(decoding_cfg, vocabulary)#

Bases: AbstractCTCDecoding

Used for performing CTC auto-regressive / non-auto-regressive decoding of the logprobs for character based models.

Parameters:
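A minimal sketch of greedy decoding of character-model log-probs with CTCDecoding; the vocabulary, config construction, and return handling are assumptions for illustration.

```python
# Sketch: decode CTC log-probs for a character model. Vocabulary, config
# construction, and return handling are assumptions for illustration.
import torch
from omegaconf import OmegaConf
from nemo.collections.asr.parts.submodules.ctc_decoding import CTCDecoding, CTCDecodingConfig

vocabulary = [" ", "a", "b", "c"]   # blank is assumed to be the last (appended) index
decoding = CTCDecoding(
    decoding_cfg=OmegaConf.structured(CTCDecodingConfig()),
    vocabulary=vocabulary,
)

log_probs = torch.randn(2, 50, len(vocabulary) + 1).log_softmax(dim=-1)  # (B, T, V+1)
lengths = torch.tensor([50, 40])
hypotheses = decoding.ctc_decoder_predictions_tensor(log_probs, decoder_lengths=lengths)
print(hypotheses)   # decoded hypotheses (exact return structure depends on the config/version)
```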

decode_ids_to_tokens(

tokens: List[int],

) → List[str]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters:

tokens – List of int representing the token ids.

Returns:

A list of decoded tokens.

decode_tokens_to_str(tokens: List[int]) → str#

Implemented by subclass in order to decode a token list into a string.

Parameters:

tokens – List of int representing the token ids.

Returns:

A decoded string.

static get_words_offsets(

char_offsets: List[Dict[str, str | float]],

encoded_char_offsets: List[Dict[str, str | float]],

word_delimiter_char: str = ' ',

supported_punctuation: Set | None = None,

) → List[Dict[str, str | float]]#

Utility method which constructs word time stamps out of character time stamps.

References

This code is a port of the Hugging Face code for word time stamp construction.

Parameters:

Returns:

A list of dictionaries containing the word offsets. Each item contains “word”, “start_offset” and “end_offset”.

class nemo.collections.asr.parts.submodules.ctc_decoding.CTCBPEDecoding(

decoding_cfg,

tokenizer: TokenizerSpec,

)#

Bases: AbstractCTCDecoding

Used for performing CTC auto-regressive / non-auto-regressive decoding of the logprobs for subword based models.

Parameters:

decode_ids_to_tokens(

tokens: List[int],

) → List[str]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters:

tokens – List of int representing the token ids.

Returns:

A list of decoded tokens.

decode_tokens_to_str(tokens: List[int]) → str#

Implemented by subclass in order to decoder a token list into a string.

Parameters:

tokens – List of int representing the token ids.

Returns:

A decoded string.

static define_tokenizer_type(

vocabulary: List[str],

) → str#

Define the tokenizer type based on the vocabulary.

static define_word_start_condition(

tokenizer_type: str,

word_delimiter_char: str,

) → Callable[[str, str], bool]#

Define the word start condition based on the tokenizer type and word delimiter character.

get_words_offsets(

char_offsets: List[Dict[str, str | float]],

encoded_char_offsets: List[Dict[str, str | float]],

word_delimiter_char: str = ' ',

supported_punctuation: Set | None = None,

) → List[Dict[str, str | float]]#

Utility method which constructs word time stamps out of sub-word time stamps.

Note: Only supports SentencePiece-based tokenizers!

Parameters:

Returns:

A list of dictionaries containing the word offsets. Each item contains “word”, “start_offset” and “end_offset”.

class nemo.collections.asr.parts.submodules.ctc_greedy_decoding.GreedyCTCInfer(

blank_id: int,

preserve_alignments: bool = False,

compute_timestamps: bool = False,

preserve_frame_confidence: bool = False,

confidence_method_cfg: omegaconf.DictConfig | None = None,

)#

Bases: Typing, ConfidenceMethodMixin

A greedy CTC decoder.

Provides a common abstraction for sample level and batch level greedy decoding.

Parameters:

forward(

decoder_output: torch.Tensor,

decoder_lengths: torch.Tensor | None,

)#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

Returns:

packed list containing batch number of sentences (Hypotheses).

property input_types#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.

class nemo.collections.asr.parts.submodules.ctc_beam_decoding.BeamCTCInfer(

blank_id: int,

beam_size: int,

search_type: str = 'default',

return_best_hypothesis: bool = True,

preserve_alignments: bool = False,

compute_timestamps: bool = False,

ngram_lm_alpha: float = 0.3,

beam_beta: float = 0.0,

ngram_lm_model: str | None = None,

flashlight_cfg: FlashlightConfig | None = None,

pyctcdecode_cfg: PyCTCDecodeConfig | None = None,

)#

Bases: AbstractBeamCTCInfer

A beam CTC decoder.

Provides a common abstraction for sample level and batch level greedy decoding.

Parameters:

default_beam_search(

x: torch.Tensor,

out_len: torch.Tensor,

) → List[Hypothesis | NBestHypotheses]#

Open Seq2Seq Beam Search Algorithm (DeepSpeed)

Parameters:

Returns:

A list of NBestHypotheses objects, one for each sequence in the batch.

flashlight_beam_search(

x: torch.Tensor,

out_len: torch.Tensor,

) → List[Hypothesis | NBestHypotheses]#

Flashlight Beam Search Algorithm. Should support Char and Subword models.

Parameters:

Returns:

A list of NBestHypotheses objects, one for each sequence in the batch.

forward(

decoder_output: torch.Tensor,

decoder_lengths: torch.Tensor,

) → Tuple[List[Hypothesis | NBestHypotheses]]#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

Returns:

packed list containing batch number of sentences (Hypotheses).

set_decoding_type(decoding_type: str)#

Sets the decoding type of the framework. Can support either char or subword models.

Parameters:

decoding_type – Str corresponding to decoding type. Only supports “char” and “subword”.

RNNT Decoding#

class nemo.collections.asr.parts.submodules.rnnt_decoding.RNNTDecoding(decoding_cfg, decoder, joint, vocabulary)#

Bases: AbstractRNNTDecoding

Used for performing RNN-T auto-regressive decoding of the Decoder+Joint network given the encoder state.

Parameters:
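A minimal sketch of transducer decoding; a pretrained model already holds a configured decoding instance at model.decoding, so the sketch reuses it rather than constructing RNNTDecoding by hand. The checkpoint name and tensor shapes are illustrative assumptions.

```python
# Sketch: RNN-T decoding of encoder outputs via the decoding instance a
# pretrained transducer model already holds. Checkpoint name and shapes are
# illustrative assumptions.
import torch
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_transducer_large")
model.eval()

enc_out = torch.randn(1, 512, 120)       # (B, D, T) encoder output, D = model d_model
enc_len = torch.tensor([120])
with torch.inference_mode():
    hyps = model.decoding.rnnt_decoder_predictions_tensor(enc_out, enc_len)
print(hyps)
```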

decode_ids_to_langs(

tokens: List[int],

) → List[str]#

Decode a token id list into language ID (LID) list.

Parameters:

tokens – List of int representing the token ids.

Returns:

A list of decoded LIDS.

decode_ids_to_tokens(

tokens: List[int],

) → List[str]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters:

tokens – List of int representing the token ids.

Returns:

A list of decoded tokens.

decode_tokens_to_lang(tokens: List[int]) → str#

Compute the most likely language ID (LID) string given the tokens.

Parameters:

tokens – List of int representing the token ids.

Returns:

A decoded LID string.

decode_tokens_to_str(tokens: List[int]) → str#

Implemented by subclass in order to decode a token list into a string.

Parameters:

tokens – List of int representing the token ids.

Returns:

A decoded string.

static get_words_offsets(

char_offsets: List[Dict[str, str | float]],

encoded_char_offsets: List[Dict[str, str | float]],

word_delimiter_char: str = ' ',

supported_punctuation: Set | None = None,

) → List[Dict[str, str | float]]#

Utility method which constructs word time stamps out of character time stamps.

References

This code is a port of the Hugging Face code for word time stamp construction.

Parameters:

Returns:

A list of dictionaries containing the word offsets. Each item contains “word”, “start_offset” and “end_offset”.

class nemo.collections.asr.parts.submodules.rnnt_decoding.RNNTBPEDecoding(

decoding_cfg,

decoder,

joint,

tokenizer: TokenizerSpec,

)#

Bases: AbstractRNNTDecoding

Used for performing RNN-T auto-regressive decoding of the Decoder+Joint network given the encoder state.

Parameters:

decode_hypothesis(

hypotheses_list: List[Hypothesis],

) → List[Hypothesis | NBestHypotheses]#

Decode a list of hypotheses into a list of strings. Overrides the super() method optionally adding lang information

Parameters:

hypotheses_list – List of Hypothesis.

Returns:

A list of strings.

decode_ids_to_langs(

tokens: List[int],

) → List[str]#

Decode a token id list into language ID (LID) list.

Parameters:

tokens – List of int representing the token ids.

Returns:

A list of decoded LIDS.

decode_ids_to_tokens(

tokens: List[int],

) → List[str]#

Implemented by subclass in order to decode a token id list into a token list. A token list is the string representation of each token id.

Parameters:

tokens – List of int representing the token ids.

Returns:

A list of decoded tokens.

decode_tokens_to_lang(

tokens: List[int],

) → str#

Compute the most likely language ID (LID) string given the tokens.

Parameters:

tokens – List of int representing the token ids.

Returns:

A decoded LID string.

decode_tokens_to_str(tokens: List[int]) → str#

Implemented by subclass in order to decode a token list into a string.

Parameters:

tokens – List of int representing the token ids.

Returns:

A decoded string.

static define_tokenizer_type(

vocabulary: List[str],

) → str#

Define the tokenizer type based on the vocabulary.

static define_word_start_condition(

tokenizer_type: str,

word_delimiter_char: str,

) → Callable[[str, str], bool]#

Define the word start condition based on the tokenizer type and word delimiter character.

get_words_offsets(

char_offsets: List[Dict[str, str | float]],

encoded_char_offsets: List[Dict[str, str | float]],

word_delimiter_char: str = ' ',

supported_punctuation: Set | None = None,

) → List[Dict[str, str | float]]#

Utility method which constructs word time stamps out of sub-word time stamps.

Note: Only supports SentencePiece-based tokenizers!

Parameters:

Returns:

A list of dictionaries containing the word offsets. Each item contains “word”, “start_offset” and “end_offset”.
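
A similar, hedged sketch for subword models follows; asr_model is assumed to be a BPE-based transducer model (for example, EncDecRNNTBPEModel) exposing a tokenizer attribute. As noted above, word-offset computation for this class only supports SentencePiece tokenizers.

    from nemo.collections.asr.parts.submodules.rnnt_decoding import RNNTBPEDecoding

    # Hedged sketch: `asr_model` is assumed to be a subword (BPE) transducer
    # model exposing .decoder, .joint, .tokenizer and a decoding config section.
    bpe_decoding = RNNTBPEDecoding(
        decoding_cfg=asr_model.cfg.decoding,
        decoder=asr_model.decoder,
        joint=asr_model.joint,
        tokenizer=asr_model.tokenizer,
    )

    # Maps illustrative token ids back to text via the tokenizer.
    print(bpe_decoding.decode_tokens_to_str([11, 29, 104]))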

class nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyRNNTInfer(

decoder_model: AbstractRNNTDecoder,

joint_model: AbstractRNNTJoint,

blank_index: int,

max_symbols_per_step: int | None = None,

preserve_alignments: bool = False,

preserve_frame_confidence: bool = False,

confidence_method_cfg: omegaconf.DictConfig | None = None,

)#

Bases: _GreedyRNNTInfer

A greedy transducer decoder.

Sequence level greedy decoding, performed auto-regressively.

Parameters:

forward(

encoder_output: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: List[Hypothesis] | None = None,

)#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

Returns:

A packed list containing one hypothesis per sequence in the batch.
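
A hedged construction-and-call sketch follows. The encoder output shape (batch, features, timesteps) matches the forward() contract documented later in this section, while the model attributes and the blank-index convention (blank as the last index) are assumptions to verify for your model.

    import torch
    from nemo.collections.asr.parts.submodules.rnnt_greedy_decoding import GreedyRNNTInfer

    # Hedged sketch: `asr_model` is assumed to expose AbstractRNNTDecoder /
    # AbstractRNNTJoint implementations; blank is assumed to be the last index.
    greedy = GreedyRNNTInfer(
        decoder_model=asr_model.decoder,
        joint_model=asr_model.joint,
        blank_index=asr_model.joint.num_classes_with_blank - 1,
        max_symbols_per_step=10,
    )

    encoded = torch.randn(2, 512, 120)        # placeholder encoder output (batch, features, timesteps)
    encoded_len = torch.tensor([120, 96])

    with torch.no_grad():
        (hypotheses,) = greedy(encoder_output=encoded, encoded_lengths=encoded_len)

    print(hypotheses[0].y_sequence)           # best token-id sequence for the first utterance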

class nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyBatchedRNNTInfer(

decoder_model: AbstractRNNTDecoder,

joint_model: AbstractRNNTJoint,

blank_index: int,

max_symbols_per_step: int | None = None,

preserve_alignments: bool = False,

preserve_frame_confidence: bool = False,

confidence_method_cfg: omegaconf.DictConfig | None = None,

loop_labels: bool = True,

use_cuda_graph_decoder: bool = True,

ngram_lm_model: str | Path | None = None,

ngram_lm_alpha: float = 0.0,

)#

Bases: _GreedyRNNTInfer, WithOptionalCudaGraphs

A batch level greedy transducer decoder.

Batch level greedy decoding, performed auto-regressively.

Parameters:

disable_cuda_graphs()#

Disable CUDA graphs (e.g., for decoding in training)

forward(

encoder_output: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: List[Hypothesis] | None = None,

)#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

Returns:

A packed list containing one hypothesis per sequence in the batch.

maybe_enable_cuda_graphs()#

Enable CUDA graphs (if allowed)
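
The batched variant is constructed in the same way. The sketch below, under the same assumptions about asr_model as the previous example, also shows the CUDA-graph toggles, which are useful when decoding inside a training loop.

    from nemo.collections.asr.parts.submodules.rnnt_greedy_decoding import GreedyBatchedRNNTInfer

    # Hedged sketch, same assumptions about `asr_model` as in the previous example.
    batched_greedy = GreedyBatchedRNNTInfer(
        decoder_model=asr_model.decoder,
        joint_model=asr_model.joint,
        blank_index=asr_model.joint.num_classes_with_blank - 1,
        max_symbols_per_step=10,
        use_cuda_graph_decoder=True,
    )

    batched_greedy.disable_cuda_graphs()       # e.g., while running validation decoding during training
    batched_greedy.maybe_enable_cuda_graphs()  # re-enable for pure inference, if allowed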

class nemo.collections.asr.parts.submodules.rnnt_beam_decoding.BeamRNNTInfer(

decoder_model: AbstractRNNTDecoder,

joint_model: AbstractRNNTJoint,

beam_size: int,

search_type: str = 'default',

score_norm: bool = True,

return_best_hypothesis: bool = True,

tsd_max_sym_exp_per_step: int | None = 50,

alsd_max_target_len: int | float = 1.0,

nsc_max_timesteps_expansion: int = 1,

nsc_prefix_alpha: int = 1,

maes_num_steps: int = 2,

maes_prefix_alpha: int = 1,

maes_expansion_gamma: float = 2.3,

maes_expansion_beta: int = 2,

language_model: Dict[str, Any] | None = None,

softmax_temperature: float = 1.0,

preserve_alignments: bool = False,

ngram_lm_model: str | None = None,

ngram_lm_alpha: float = 0.0,

hat_subtract_ilm: bool = False,

hat_ilm_weight: float = 0.0,

max_symbols_per_step: int | None = None,

blank_lm_score_mode: str | None = 'no_score',

pruning_mode: str | None = 'early',

allow_cuda_graphs: bool = False,

)#

Bases: Typing

Beam search implementation ported from the ESPnet implementation (espnet/espnet).

Sequence-level beam decoding or batched beam decoding, performed auto-regressively depending on the search type chosen.

Parameters:

align_length_sync_decoding(

h: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: Hypothesis | None = None,

) → List[Hypothesis]#

Alignment-length synchronous beam search implementation. Based on https://ieeexplore.ieee.org/document/9053040

Parameters:

h – Encoded speech features (1, T_max, D_enc)

Returns:

N-best decoding results

Return type:

nbest_hyps

compute_ngram_score(

current_lm_state: kenlm.State,

label: int,

) → Tuple[float, kenlm.State]#

Score computation for the KenLM n-gram language model.

default_beam_search(

h: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: Hypothesis | None = None,

) → List[Hypothesis]#

Beam search implementation.

Parameters:

h – Encoded speech features (1, T_max, D_enc)

Returns:

N-best decoding results

Return type:

nbest_hyps

greedy_search(

h: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: Hypothesis | None = None,

) → List[Hypothesis]#

Greedy search implementation for transducer. Generic case when beam size = 1. Results might differ slightly due to implementation details as compared to GreedyRNNTInfer and GreedyBatchedRNNTInfer.

Parameters:

h – Encoded speech features (1, T_max, D_enc)

Returns:

1-best decoding results

Return type:

hyp

property input_types#

Returns definitions of module input ports.

modified_adaptive_expansion_search(

h: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: Hypothesis | None = None,

) → List[Hypothesis]#

Based on/modified from https://ieeexplore.ieee.org/document/9250505

Parameters:

h – Encoded speech features (1, T_max, D_enc)

Returns:

N-best decoding results

Return type:

nbest_hyps

property output_types#

Returns definitions of module output ports.

prefix_search(

hypotheses: List[Hypothesis],

enc_out: torch.Tensor,

prefix_alpha: int,

) → List[Hypothesis]#

Prefix search for NSC and mAES strategies. Based on https://arxiv.org/pdf/1211.3711.pdf

recombine_hypotheses(

hypotheses: List[Hypothesis],

) → List[Hypothesis]#

Recombine hypotheses with equivalent output sequence.

Parameters:

hypotheses (list) – list of hypotheses

Returns:

list of recombined hypotheses

Return type:

final (list)

resolve_joint_output(

enc_out: torch.Tensor,

dec_out: torch.Tensor,

) → Tuple[torch.Tensor, torch.Tensor]#

Resolve output types for RNNT and HAT joint models

set_decoding_type(decoding_type: str)#

Sets the decoding type. Refer to train_kenlm.py in scripts/asr_language_modeling/ to see why this is needed.

Parameters:

decoding_type – decoding type

sort_nbest(

hyps: List[Hypothesis],

) → List[Hypothesis]#

Sort hypotheses by score or score given sequence length.

Parameters:

hyps – list of hypotheses

Returns:

sorted list of hypotheses

Return type:

hyps

time_sync_decoding(

h: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: Hypothesis | None = None,

) → List[Hypothesis]#

Time synchronous beam search implementation. Based on https://ieeexplore.ieee.org/document/9053040

Parameters:

h – Encoded speech features (1, T_max, D_enc)

Returns:

N-best decoding results

Return type:

nbest_hyps
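
The sketch below configures the modified adaptive expansion search ("maes") with n-gram shallow fusion. It reuses the assumed asr_model and placeholder encoder tensors from the greedy sketches above; the KenLM path is a placeholder and every hyper-parameter value is illustrative rather than a tuned recommendation.

    from nemo.collections.asr.parts.submodules.rnnt_beam_decoding import BeamRNNTInfer

    # Hedged sketch, same assumptions about `asr_model`; the LM path is hypothetical.
    beam = BeamRNNTInfer(
        decoder_model=asr_model.decoder,
        joint_model=asr_model.joint,
        beam_size=4,
        search_type="maes",
        maes_num_steps=2,
        maes_expansion_gamma=2.3,
        maes_expansion_beta=2,
        ngram_lm_model="/path/to/lm.bin",      # hypothetical KenLM binary
        ngram_lm_alpha=0.3,
        return_best_hypothesis=False,          # keep the full N-best list
    )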

class nemo.collections.asr.parts.submodules.rnnt_beam_decoding.BeamBatchedRNNTInfer(

decoder_model: AbstractRNNTDecoder,

joint_model: AbstractRNNTJoint,

blank_index: int,

beam_size: int,

search_type: str = 'malsd_batch',

score_norm: bool = True,

maes_num_steps: int | None = 2,

maes_expansion_gamma: float | None = 2.3,

maes_expansion_beta: int | None = 2,

max_symbols_per_step: int | None = 10,

preserve_alignments: bool = False,

ngram_lm_model: str | Path | None = None,

ngram_lm_alpha: float = 0.0,

blank_lm_score_mode: str | BlankLMScoreMode | None = BlankLMScoreMode.LM_WEIGHTED_FULL,

pruning_mode: str | PruningMode | None = PruningMode.LATE,

allow_cuda_graphs: bool | None = True,

return_best_hypothesis: str | None = True,

)#

Bases: Typing, ConfidenceMethodMixin

forward(

encoder_output: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: list[Hypothesis] | None = None,

) → Tuple[list[Hypothesis] | List[NBestHypotheses]]#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

encoder_output – A tensor of size (batch, features, timesteps).

encoded_lengths – list of int representing the length of each output sequence.

Returns:

A tuple containing a list of hypotheses, one per sequence in the batch. Each hypothesis contains the decoded sequence, timestamps, and associated scores. The format of the returned hypotheses depends on the return_best_hypothesis attribute.

Return type:

Tuple[list[Hypothesis] | List[NBestHypotheses]]

property input_types#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.
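
Its batched counterpart uses the default "malsd_batch" search, as in the hedged sketch below; argument names mirror the constructor signature above, and all values are illustrative only.

    from nemo.collections.asr.parts.submodules.rnnt_beam_decoding import BeamBatchedRNNTInfer

    # Hedged sketch, same assumptions about `asr_model` as earlier.
    batched_beam = BeamBatchedRNNTInfer(
        decoder_model=asr_model.decoder,
        joint_model=asr_model.joint,
        blank_index=asr_model.joint.num_classes_with_blank - 1,
        beam_size=4,
        search_type="malsd_batch",
        max_symbols_per_step=10,
        allow_cuda_graphs=True,
        return_best_hypothesis=False,          # each batch entry is then an NBestHypotheses
    )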

TDT Decoding#

class nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyTDTInfer(

decoder_model: AbstractRNNTDecoder,

joint_model: AbstractRNNTJoint,

blank_index: int,

durations: list,

max_symbols_per_step: int | None = None,

preserve_alignments: bool = False,

preserve_frame_confidence: bool = False,

include_duration: bool = False,

include_duration_confidence: bool = False,

confidence_method_cfg: omegaconf.DictConfig | None = None,

)#

Bases: _GreedyRNNTInfer

A greedy TDT decoder.

Sequence level greedy decoding, performed auto-regressively.

Parameters:

forward(

encoder_output: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: List[Hypothesis] | None = None,

)#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

encoder_output – A tensor of size (batch, features, timesteps).

encoded_lengths – list of int representing the length of each output sequence.

Returns:

A packed list containing one hypothesis per sequence in the batch.

class nemo.collections.asr.parts.submodules.rnnt_greedy_decoding.GreedyBatchedTDTInfer(

decoder_model: AbstractRNNTDecoder,

joint_model: AbstractRNNTJoint,

blank_index: int,

durations: List[int],

max_symbols_per_step: int | None = None,

preserve_alignments: bool = False,

preserve_frame_confidence: bool = False,

include_duration: bool = False,

include_duration_confidence: bool = False,

confidence_method_cfg: omegaconf.DictConfig | None = None,

use_cuda_graph_decoder: bool = True,

ngram_lm_model: str | Path | None = None,

ngram_lm_alpha: float = 0.0,

)#

Bases: _GreedyRNNTInfer, WithOptionalCudaGraphs

A batch level greedy TDT decoder. Batch level greedy decoding, performed auto-regressively.

Parameters:

decoder_model – rnnt_utils.AbstractRNNTDecoder implementation.

joint_model – rnnt_utils.AbstractRNNTJoint implementation.

blank_index – int index of the blank token. Must be len(vocabulary) for TDT models.

durations – a list containing durations.

max_symbols_per_step – Optional int. The maximum number of symbols that can be added to a sequence in a single time step; if set to None then there is no limit.

disable_cuda_graphs()#

Disable CUDA graphs (e.g., for decoding in training)

forward(

encoder_output: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: List[Hypothesis] | None = None,

)#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

encoder_output – A tensor of size (batch, features, timesteps).

encoded_lengths – list of int representing the length of each output sequence.

Returns:

A packed list containing one hypothesis per sequence in the batch.

maybe_enable_cuda_graphs()#

Enable CUDA graphs (if allowed)
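
A hedged sketch for TDT checkpoints follows. Here, tdt_model is an assumed TDT transducer model, the durations list must match the one used at training time (the [0, 1, 2, 3, 4] shown is only a common example), and, as documented above, blank_index must equal len(vocabulary).

    from nemo.collections.asr.parts.submodules.rnnt_greedy_decoding import GreedyBatchedTDTInfer

    # Hedged sketch: `tdt_model` is an assumed TDT model; verify attribute names
    # and the durations list against the training configuration.
    tdt_greedy = GreedyBatchedTDTInfer(
        decoder_model=tdt_model.decoder,
        joint_model=tdt_model.joint,
        blank_index=len(tdt_model.joint.vocabulary),   # blank must be len(vocabulary) for TDT
        durations=[0, 1, 2, 3, 4],                     # illustrative; must match training
        max_symbols_per_step=10,
        include_duration=True,                         # attach per-token durations to hypotheses
        use_cuda_graph_decoder=True,
    )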

class nemo.collections.asr.parts.submodules.tdt_beam_decoding.BeamTDTInfer(

decoder_model: AbstractRNNTDecoder,

joint_model: AbstractRNNTJoint,

durations: list,

beam_size: int,

search_type: str = 'default',

score_norm: bool = True,

return_best_hypothesis: bool = True,

maes_num_steps: int = 2,

maes_prefix_alpha: int = 1,

maes_expansion_gamma: float = 2.3,

maes_expansion_beta: int = 2,

softmax_temperature: float = 1.0,

preserve_alignments: bool = False,

ngram_lm_model: str | None = None,

ngram_lm_alpha: float = 0.3,

max_symbols_per_step: int | None = None,

blank_lm_score_mode: str | None = 'no_score',

pruning_mode: str | None = 'early',

allow_cuda_graphs: bool = False,

)#

Bases: Typing

Beam search implementation for Token-and-Duration Transducer (TDT) models.

Sequence-level beam decoding or batched beam decoding, performed auto-regressively depending on the search type chosen.

Parameters:

compute_ngram_score(

current_lm_state: kenlm.State,

label: int,

) → Tuple[float, kenlm.State]#

Computes the score for the KenLM n-gram language model.

Parameters:

Returns:

score for label.

Return type:

lm_score

default_beam_search(

encoder_outputs: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: Hypothesis | None = None,

) → List[Hypothesis]#

Default Beam search implementation for TDT models.

Parameters:

Returns:

N-best decoding results

Return type:

nbest_hyps

property input_types#

Returns definitions of module input ports.

merge_duplicate_hypotheses(hypotheses)#

Merges hypotheses with identical token sequences and lengths. The combined hypothesis’s probability is the sum of the probabilities of all duplicates. Duplicate hypotheses occur when two consecutive blank tokens are predicted and their duration values sum up to the same number.

Parameters:

hypotheses – list of hypotheses.

Returns:

list of hypotheses without duplicates.

Return type:

hypotheses

modified_adaptive_expansion_search(

encoder_outputs: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: Hypothesis | None = None,

) → List[Hypothesis]#

Modified Adaptive Expansion Search algorithm for TDT models. Based on/modified from https://ieeexplore.ieee.org/document/9250505. Supports N-gram language model shallow fusion.

Parameters:

Returns:

N-best decoding results

Return type:

nbest_hyps

property output_types#

Returns definitions of module output ports.

prefix_search(

hypotheses: List[Hypothesis],

encoder_output: torch.Tensor,

prefix_alpha: int,

) → List[Hypothesis]#

Performs a prefix search and updates the scores of the hypotheses in place. Based on https://arxiv.org/pdf/1211.3711.pdf.

Parameters:

Returns:

list of hypotheses with updated scores.

Return type:

hypotheses

set_decoding_type(decoding_type: str)#

Sets the decoding type. Refer to train_kenlm.py in scripts/asr_language_modeling/ to see why this is needed.

Parameters:

decoding_type – decoding type

sort_nbest(

hyps: List[Hypothesis],

) → List[Hypothesis]#

Sort hypotheses by score or score given sequence length.

Parameters:

hyps – list of hypotheses

Returns:

sorted list of hypotheses

Return type:

hyps
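
For beam search over TDT models, the hedged sketch below mirrors the constructor above and enables n-gram shallow fusion with the modified adaptive expansion search; the LM path, durations, and hyper-parameter values are placeholders, and tdt_model is the assumed TDT model from the greedy TDT sketch.

    from nemo.collections.asr.parts.submodules.tdt_beam_decoding import BeamTDTInfer

    # Hedged sketch, same assumptions about `tdt_model` and the durations list.
    tdt_beam = BeamTDTInfer(
        decoder_model=tdt_model.decoder,
        joint_model=tdt_model.joint,
        durations=[0, 1, 2, 3, 4],
        beam_size=4,
        search_type="maes",
        maes_num_steps=2,
        ngram_lm_model="/path/to/lm.bin",      # hypothetical KenLM binary
        ngram_lm_alpha=0.3,
    )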

class nemo.collections.asr.parts.submodules.tdt_beam_decoding.BeamBatchedTDTInfer(

decoder_model: AbstractRNNTDecoder,

joint_model: AbstractRNNTJoint,

durations: list,

blank_index: int,

beam_size: int,

search_type: str = 'malsd_batch',

score_norm: bool = True,

max_symbols_per_step: int | None = None,

preserve_alignments: bool = False,

ngram_lm_model: str | Path | None = None,

ngram_lm_alpha: float = 0.0,

blank_lm_score_mode: str | BlankLMScoreMode | None = BlankLMScoreMode.NO_SCORE,

pruning_mode: str | PruningMode | None = PruningMode.EARLY,

allow_cuda_graphs: bool | None = True,

return_best_hypothesis: str | None = True,

)#

Bases: Typing, ConfidenceMethodMixin

forward(

encoder_output: torch.Tensor,

encoded_lengths: torch.Tensor,

partial_hypotheses: list[Hypothesis] | None = None,

) → Tuple[list[Hypothesis] | List[NBestHypotheses]]#

Returns a list of hypotheses given an input batch of the encoder hidden embedding. Output token is generated auto-regressively.

Parameters:

encoder_output – A tensor of size (batch, features, timesteps).

encoded_lengths – list of int representing the length of each output sequence.

Returns:

A tuple containing a list of hypotheses, one per sequence in the batch. Each hypothesis contains the decoded sequence, timestamps, and associated scores. The format of the returned hypotheses depends on the return_best_hypothesis attribute.

Return type:

Tuple[list[Hypothesis] | List[NBestHypotheses]]

property input_types#

Returns definitions of module input ports.

property output_types#

Returns definitions of module output ports.

Hypotheses#

class nemo.collections.asr.parts.utils.rnnt_utils.Hypothesis(

score: float,

y_sequence: ~typing.List[int] | torch.Tensor,

text: str | None = None,

dec_out: ~typing.List[torch.Tensor] | None = None,

dec_state: ~typing.List[~typing.List[torch.Tensor]] | ~typing.List[torch.Tensor] | None = None,

timestamp: ~typing.List[int] | torch.Tensor = ,

alignments: ~typing.List[int] | ~typing.List[~typing.List[int]] | None = None,

frame_confidence: ~typing.List[float] | ~typing.List[~typing.List[float]] | None = None,

token_confidence: ~typing.List[float] | None = None,

word_confidence: ~typing.List[float] | None = None,

length: int | torch.Tensor = 0,

y: ~typing.List[torch.tensor] | None = None,

lm_state: ~typing.Dict[str, ~typing.Any] | ~typing.List[~typing.Any] | None = None,

lm_scores: torch.Tensor | None = None,

ngram_lm_state: ~typing.Dict[str, ~typing.Any] | ~typing.List[~typing.Any] | None = None,

tokens: ~typing.List[int] | torch.Tensor | None = None,

last_token: torch.Tensor | None = None,

token_duration: torch.Tensor | None = None,

last_frame: int | None = None,

)#

Bases: object

Hypothesis class for beam search algorithms.

score: A float score obtained from an AbstractRNNTDecoder module’s score_hypothesis method.

y_sequence: Either a sequence of integer ids pointing to some vocabulary, or a packed torch.Tensor

behaving in the same manner. dtype must be torch.Long in the latter case.

dec_state: A list (or list of list) of LSTM-RNN decoder states. Can be None.

text: (Optional) A decoded string after processing via CTC / RNN-T decoding (removing the CTC/RNNT

blank tokens, and optionally merging word-pieces). Should be used as decoded string for Word Error Rate calculation.

timestamp: (Optional) A list of integer indices representing at which index in the decoding

process did the token appear. Should be of same length as the number of non-blank tokens.

alignments: (Optional) Represents the CTC / RNNT token alignments as integer tokens along an axis of

time T (for CTC) or Time x Target (TxU). For CTC, represented as a single list of integer indices. For RNNT, represented as a dangling list of list of integer indices. Outer list represents Time dimension (T), inner list represents Target dimension (U). The set of valid indices includes the CTC / RNNT blank token in order to represent alignments.

frame_confidence: (Optional) Represents the CTC / RNNT per-frame confidence scores as token probabilities

along an axis of time T (for CTC) or Time x Target (TxU). For CTC, represented as a single list of float indices. For RNNT, represented as a dangling list of list of float indices. Outer list represents Time dimension (T), inner list represents Target dimension (U).

token_confidence: (Optional) Represents the CTC / RNNT per-token confidence scores as token probabilities

along an axis of Target U. Represented as a single list of float indices.

word_confidence: (Optional) Represents the CTC / RNNT per-word confidence scores as token probabilities

along an axis of Target U. Represented as a single list of float indices.

length: Represents the length of the sequence (the original length without padding), otherwise

defaults to 0.

y: (Unused) A list of torch.Tensors representing the list of hypotheses.

lm_state: (Unused) A dictionary state cache used by an external Language Model.

lm_scores: (Unused) Score of the external Language Model.

ngram_lm_state: (Optional) State of the external n-gram Language Model.

tokens: (Optional) A list of decoded tokens (can be characters or word-pieces).

last_token (Optional): A token or batch of tokens which was predicted in the last step.

last_frame (Optional): Index of the last decoding step at which the hypothesis was updated, including blank token predictions.

class nemo.collections.asr.parts.utils.rnnt_utils.NBestHypotheses(

n_best_hypotheses: List[Hypothesis] | None,

)#

Bases: object

List of N best hypotheses
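
The hedged helper below shows one way to unpack results that mix both classes. It assumes that beam decoders return NBestHypotheses entries when return_best_hypothesis is False and plain Hypothesis entries otherwise, and that n_best_hypotheses is score-sorted.

    from nemo.collections.asr.parts.utils.rnnt_utils import Hypothesis, NBestHypotheses

    def best_transcripts(results):
        """Collect the best transcript (or raw token ids) per utterance."""
        texts = []
        for item in results:
            # n_best_hypotheses is assumed to be sorted best-first.
            best = item.n_best_hypotheses[0] if isinstance(item, NBestHypotheses) else item
            # Fall back to the raw token-id sequence when text was not populated.
            texts.append(best.text if best.text is not None else best.y_sequence)
        return texts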

Adapter Networks#

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.MultiHeadAttentionAdapter(*args: Any, **kwargs: Any)#

Bases: MultiHeadAttention, AdapterModuleUtil

Multi-Head Attention layer of Transformer.

Parameters:

forward(

query,

key,

value,

mask,

pos_emb=None,

cache=None,

)#

Compute ‘Scaled Dot Product Attention’.

Parameters:

query (torch.Tensor) – (batch, time1, size)

key (torch.Tensor) – (batch, time2, size)

value (torch.Tensor) – (batch, time2, size)

mask (torch.Tensor) – (batch, time1, time2)

cache (torch.Tensor) – (batch, time_cache, size)

Returns:

transformed value (batch, time1, d_model) weighted by the query dot key attention

cache (torch.Tensor) – (batch, time_cache_next, size)

Return type:

output (torch.Tensor)

get_default_strategy_config() → dataclass#

Returns a default adapter module strategy.
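
A hedged sketch of attaching this adapter to an ASR model via the adapter mixin API follows. MultiHeadAttentionAdapterConfig, the "encoder:" name prefix, and the config field names are assumptions to check against your NeMo version.

    from nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module import (
        MultiHeadAttentionAdapterConfig,
    )

    # Hedged sketch: `asr_model` is assumed to support the adapter mixin API
    # (add_adapter / set_enabled_adapters / unfreeze_enabled_adapters), and
    # MultiHeadAttentionAdapterConfig is assumed to be the companion config
    # dataclass of the adapter class above.
    adapter_cfg = MultiHeadAttentionAdapterConfig(
        n_head=4,
        n_feat=asr_model.cfg.encoder.d_model,          # assumed Conformer-style config layout
    )

    asr_model.add_adapter(name="encoder:mha_adapter", cfg=adapter_cfg)
    asr_model.set_enabled_adapters(name="encoder:mha_adapter", enabled=True)
    asr_model.freeze()                     # freeze base weights ...
    asr_model.unfreeze_enabled_adapters()  # ... and train only the adapter parameters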

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.RelPositionMultiHeadAttentionAdapter(*args: Any, **kwargs: Any)#

Bases: RelPositionMultiHeadAttention, AdapterModuleUtil

Multi-Head Attention layer of Transformer-XL with support of relative positional encoding. Paper: https://arxiv.org/abs/1901.02860

Parameters:

forward(

query,

key,

value,

mask,

pos_emb,

cache=None,

)#

Compute ‘Scaled Dot Product Attention’ with relative positional encoding.

Parameters:

query (torch.Tensor) – (batch, time1, size)

key (torch.Tensor) – (batch, time2, size)

value (torch.Tensor) – (batch, time2, size)

mask (torch.Tensor) – (batch, time1, time2)

pos_emb (torch.Tensor) – (batch, time1, size)

cache (torch.Tensor) – (batch, time_cache, size)

Returns:

transformed value (batch, time1, d_model) weighted by the query dot key attention

cache_next (torch.Tensor) – (batch, time_cache_next, size)

Return type:

output (torch.Tensor)

get_default_strategy_config() → dataclass#

Returns a default adapter module strategy.

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.PositionalEncodingAdapter(*args: Any, **kwargs: Any)#

Bases: PositionalEncoding, AdapterModuleUtil

Absolute positional embedding adapter.

Note

Absolute positional embedding value is added to the input tensor without a residual connection! Therefore, the input is changed; if you only require the positional embedding, drop the returned x.

Parameters:

get_default_strategy_config() → dataclass#

Returns a default adapter module strategy.

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.RelPositionalEncodingAdapter(*args: Any, **kwargs: Any)#

Bases: RelPositionalEncoding, AdapterModuleUtil

Relative positional encoding for TransformerXL’s layers See : Appendix B in https://arxiv.org/abs/1901.02860

Note

Relative positional embedding value is not added to the input tensor! Therefore, the input remains unchanged; if you only require the positional embedding, drop the returned x.

Parameters:

get_default_strategy_config() → dataclass#

Returns a default adapter module strategy.

Adapter Strategies#

class nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module.MHAResidualAddAdapterStrategy(

stochastic_depth: float = 0.0,

l2_lambda: float = 0.0,

)#

Bases: ResidualAddAdapterStrategy

An implementation of residual addition of an adapter module with its input for the MHA Adapters.

forward(

input: dict,

adapter: torch.nn.Module,

*,

module: AdapterModuleMixin,

)#

A basic strategy comprising a residual connection over the input after the forward pass of the underlying adapter. Additional work is done to pack and unpack the dictionary of inputs and outputs.

Note: The value tensor is added to the output of the attention adapter as the residual connection.

Parameters:

Returns:

The result tensor, after one of the active adapters has finished its forward passes.

compute_output(

input: torch.Tensor,

adapter: torch.nn.Module,

*,

module: AdapterModuleMixin,

) → torch.Tensor#

Compute the output of a single adapter to some input.

Parameters:

Returns:

The result tensor, after one of the active adapters has finished its forward passes.
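
To close, a hedged sketch of the strategy itself: stochastic_depth randomly skips the adapter branch during training, and l2_lambda adds an auxiliary L2 penalty on the adapter output. How a custom strategy is wired into an adapter config is an assumption here and may differ by version; each adapter module above exposes its default via get_default_strategy_config().

    from nemo.collections.asr.parts.submodules.adapters.multi_head_attention_adapter_module import (
        MHAResidualAddAdapterStrategy,
    )

    # Hedged sketch: constructor arguments follow the signature above.
    strategy = MHAResidualAddAdapterStrategy(stochastic_depth=0.1, l2_lambda=1e-4)

    # Adapter modules documented above expose their default strategy config:
    default_strategy_cfg = mha_adapter.get_default_strategy_config()   # `mha_adapter` assumed constructed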