Audio Seq2seq Model using Transformers (original) (raw)

Last Updated : 23 Jul, 2025

The article explores the various applications of the Seq2Seq model in various fields, delving into its complexities. We'll also look at how audio transformation can be used in practice.

What is Seq2Seq model?

Seq2Seq are encoder and decoder models allowing for different lengths of inputs and outputs as the input is processed by the encoder and the output is processed by the decoder. These are typically used for Automatic Speech Recognizations (ASR), Speech to Speech Translation, and Speech Synthesis. Let us see how input is processed through a Seq2Seq model.

**Encoder: The encoder takes a log-mel spectrogram as input and transforms it into a sequence of hidden states. These encoder hidden states capture essential features from the spoken speech, representing the overall meaning of the input speech.
**Decoder: Subsequently, the encoder's output is inputted into the transformer decoder using a mechanism known as cross-attention. This mechanism, akin to self-attention but focused on the encoder output, allows the decoder to predict a sequence of text tokens in an autoregressive manner. Starting with an initial sequence containing only a "start" token (SOT in the case of Whisper), the decoder generates one token at a time. At each step, the previously generated sequence becomes the new input, progressively extending the output sequence. This process continues until the decoder predicts an "end" token or reaches a predefined maximum number of timesteps. In this architectural design, the decoder acts as a language model, leveraging the hidden-state representations from the encoder to produce corresponding text transcriptions.

For a more detailed understanding of how audio transformers work kindly look into the article Audio Transformer

In this article, we will focus on the implementation of the seq-to-seq model using a transformer. We will use the pre-trained whisper model from Hugginface and fine-tune it.

Audio Seq2seq Model Implementation using Transformers

**Install the Necessary Libraries

Install the below libraries if not available in your environment. These are required to run the subsequent code.

The 'Datasets' library is commonly used for working with machine learning datasets. The specific functions used from this library are load_dataset, Audio, and DatasetDict. These functions are used for loading datasets and dealing with audio data.
'Torch' is the PyTorch library, a popular deep-learning framework. This library provides tools for building and training neural networks
'Transformers' is a popular library for working with pre-trained language models from hugging face
'evaluate' contains functions contains related to evaluating models

Import the necessary libraries

!pip install datasets !pip install transformers !pip install torch !pip install evaluate !pip install jiwer !pip install transformers[torch] !pip install numpy

Step 1: Import the Necessary Libraries

And then import the libraries into your notebook

Python3 `

##Imports required import numpy as np from datasets import load_dataset, Audio, DatasetDict import torch import evaluate from dataclasses import dataclass, field from typing import Any, Dict, List, Optional, Union from transformers import Seq2SeqTrainingArguments,Seq2SeqTrainer device = 'cuda' if torch.cuda.is_available() else 'cpu' print(device)

**Step 2: Loading Dataset

About the PloyAi/minds14 dataset- MINDS-14 is a training and evaluation resource for intent detection tasks with spoken data. It covers 14 intents extracted from a commercial system in the e-banking domain, associated with spoken examples in 14 diverse language varieties

Using the load_dataset function of the datasets library we load the minds14 dataset in 'English' from hugging face. We load only 80 samples for our demo purpose.
- Data Fields
- path (str): Path to the audio file
- audio (dict): Audio object including loaded audio array, sampling rate, and path to audio
- transcription (str): Transcription of the audio file
- english_transcription (str): English transcription of the audio file
- intent_class (int): Class id of intent
- lang_id (int): Id of language
We remove the unnecessary columns using the remove_columns method on the dataset
We then split the dataset in the ratio of 80: 20. We specify shuffle= false so that we get some examples while running the code Python3 `

dataset = DatasetDict()

Load the PolyAI dataset.

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train[:80]")

Remove unnecessary columns

dataset.remove_columns( ['path', 'english_transcription', 'intent_class', 'lang_id'])

Split the datasedataset into train and test

dataset = dataset.train_test_split(test_size=0.2, shuffle=False)

**Step3: Data Pre-Processing and Tokenization

We first resample our audio data to 16khz from 8kz using an audio library as the whimper seq to seq model is trained on 16 kHz

Python3 `

dataset['train'] = dataset['train'].cast_column("audio", Audio(sampling_rate=16000)) dataset['test'] = dataset['test'].cast_column("audio", Audio(sampling_rate=16000))

Let us import the whisper model and processor from Hugging Face using the Transformers library

We need to crate two variables with name 'input_featrues'(input array of sound wave in raw form) and 'labels'(transcription)
We use the processor library from the transformer. It contains a feature extractor and tokenizer.
- The feature extractor is used to convert our raw audio data into the input form expected by the model.
- The tokenizer converts our target sentence into labels expected by the model Python3 `

from transformers import WhisperProcessor, WhisperForConditionalGeneration processor = WhisperProcessor.from_pretrained( "openai/whisper-tiny.en", task="transcribe", model_max_length=225) model = WhisperForConditionalGeneration.from_pretrained( 'openai/whisper-tiny.en') model.to(device)

Preparing a function to process the entire dataset

def prepare_dataset(batch): audio = batch["audio"]

batch["input_ids"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"],

return_tensor = "pt").input_features[0]

batch["input_features"] = processor.feature_extractor(
    audio["array"], sampling_rate=audio["sampling_rate"], return_tensor="pt").input_features[0]

batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
return batch

np.object = object encoded_dataset = dataset.map( prepare_dataset, remove_columns=data.column_names["train"], num_proc=4)

**Step 4: Preparing data collator class

The data collator class is used for the dynamic padding of our data during the training time.
For audio, we need to pad differently for our input and output sentences as input is raw audio form and output is text.
We pad the input and output and return them as tensors. Python3 `

Creating a DataCollatorClass

from dataclasses import dataclass, field from typing import Any, Dict, List, Optional, Union

@dataclass class DataCollatorSpeechSeq2SeqWithPadding: processor: processor padding: Union[bool, str] = "longest"

def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
    # split inputs and labels since they have to be of different lengths and need
    # different padding methods
    # print(features)
    input_features = [{"input_features": feature["input_features"]}
                      for feature in features]
    label_features = [{"input_ids": feature["labels"]}
                      for feature in features]

    batch = self.processor.feature_extractor.pad(
        input_features, return_tensors="pt")

    labels_batch = self.processor.tokenizer.pad(
        label_features, return_tensors="pt")

    # replace padding with -100 to ignore loss correctly
    labels = labels_batch["input_ids"].masked_fill(
        labels_batch.attention_mask.ne(1), -100)

    batch["labels"] = labels
    # print(batch)
    return batch  # batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

Step 5: Model Evaluation

We will be evaluating our model on word error rate

Python3 `

Evalution metric-

import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):

wer = evaluate.load("wer")

pred_ids = pred.predictions
label_ids = pred.label_ids

# replace -100 with the pad_token_id
label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

# we do not want to group tokens when computing the metrics
pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

wer =  metric.compute(predictions=pred_str, references=label_str)

return {"wer": wer}

**Step 6: Define our trainer

training_args: This section initializes an object of the Seq2SeqTrainingArguments class, which contains various training-related settings. Some notable parameters include:
- output_dir: Directory where the trained model will be saved.
- gradient_checkpointing: Enables gradient checkpointing, a memory-saving technique during training.
- per_device_train_batch_size: Number of training samples processed in each forward/backward pass on each device.
- learning_rate: Initial learning rate for the optimizer.
- fp16: Enables mixed-precision training using a 16-bit floating-point format for improved training speed and reduced memory usage.
- optim: The optimizer used for training; in this case, it's set to 'adafactor'.
- predict_with_generate: Indicates that the model should use generation during prediction.
- evaluation_strategy: Defines when to evaluate the model during training, in this case, after a certain number of steps.
- per_device_eval_batch_size: Batch size for evaluation.
- eval_steps: Number of steps between evaluations.
- load_best_model_at_end: Indicates whether to load the best model checkpoint at the end of training.
- metric_for_best_model: The metric used to determine the best model checkpoint.
- report_to: List of integrations to which training results should be reported, in this case, TensorBoard.
trainer: This section initializes an object of the Seq2SeqTrainer class, which is responsible for handling the training process. It takes in the model, training arguments, datasets, tokenizers, data collator, and a function for computing metrics. Python3 `

model.config.forced_decoder_ids = None model.config.suppress_tokens = [] training_args = Seq2SeqTrainingArguments( output_dir="seqtoseq-trained", gradient_checkpointing=True, per_device_train_batch_size=2, learning_rate=1e-5, warmup_steps=2, max_steps=2000, fp16=True ,#False ,#True, optim='adafactor',

group_by_length=True,

predict_with_generate=True,
evaluation_strategy="steps",
per_device_eval_batch_size=2,
eval_steps=100,
load_best_model_at_end=True,
metric_for_best_model="wer",
report_to = ["tensorboard"],
#data_parallel=False

)

trainer = Seq2SeqTrainer( model=model, args=training_args, train_dataset=encoded_dataset["train"], eval_dataset=encoded_dataset["test"], tokenizer=processor, data_collator=data_collator, compute_metrics=compute_metrics, #data_parallel=False

sampler = None

)

**Step 6: Define Computation metric

Load Word Error Rate (WER) Metric: The function loads a Word Error Rate metric. WER is commonly used to evaluate the performance of automatic speech recognition or sequence-to-sequence models by measuring the difference between predicted and reference sequences.
Extract Predicted and Label IDs: The predicted and label IDs are extracted from the pred object, which is the output of the model during evaluation, and passed to this object
Replace -100 with pad_token_id: Pad token IDs are replaced where the prediction is -100
Decode Sequences: The predicted and label sequences are decoded using the tokenizer's batch_decode method, skipping special tokens. This step converts the token IDs back into human-readable text.
Compute Word Error Rate (WER): The WER is computed using the external WER metric, comparing the predicted and reference sequences. The result is multiplied by 100 to obtain a percentage.
Return WER in a Dictionary: The function returns a dictionary containing the computed WER under the key "wer". This format is suitable for reporting multiple metrics during training and evaluation. Python3 `

import evaluate

metric = evaluate.load("wer")

def compute_metrics(pred):

wer = evaluate.load("wer")

pred_ids = pred.predictions
label_ids = pred.label_ids

# replace -100 with the pad_token_id
label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

# we do not want to group tokens when computing the metrics
pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

wer = 100 * metric.compute(predictions=pred_str, references=label_str)

return {"wer": wer}

To start training run the below command

Python3 `

Requires GPU for training

trainer.train()

**Output:

Logs of training: Step Training Loss Validation Loss Wer 100 No log 0.525059 4.193548 200 No log 0.532363 1.846774 300 No log 0.553872 1.161290 400 No log 0.568876 1.161290 500 0.000000 0.590014 1.169355

**Step 7: Drawing inferences

Let us check the output of our model after training

Python3 `

getting test data

inputs = processor(dataset['test'][8]["audio"]["array"], sampling_rate=16000, return_tensors="pt").to(device).input_features print(f"The input test audio is: {dataset['test'][8]['transcription']}")

generated_ids = model.generate(inputs=inputs)

transcription = processor.batch_decode( generated_ids, skip_special_tokens=True)[0] print(f'The output prediction is : {transcription}')

**Output:

The input test audio is: how much do I have in my account The output prediction is : 'm much do I have in my account

Conclusion:

In this article, we saw how to fine-tune an audio seq2seq model using the transformers library.