Speech Emotion Recognition using Transfer Learning

Last Updated : 23 Jul, 2025

This article walks through implementing Speech Emotion Recognition (SER) with transfer learning, using Librosa for audio feature extraction and a pre-trained VGG16 network for classification.

**Prerequisites:** VGG-16

Need for Speech Emotion Recognition

**Speech emotion recognition (SER)** analyzes the pitch, tone, loudness, and frequency of sound to identify emotions in speech. It plays an important role in industries such as entertainment, customer service, robotics, and security by providing insight into customer sentiment and human interactions.

**Transfer learning** is a technique in which a model pre-trained on one task is reused and fine-tuned on a new dataset. It removes the need to train a model from scratch, which significantly reduces training time.

Why Use a CNN-Based Model for Speech Emotion Recognition?

Techniques and Tools

In this project, we use Python due to its robust library ecosystem. Speech data contains features such as pitch, loudness, and frequency that need to be accurately captured for analysis.
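The Mel-spectrogram used later in this tutorial captures frequency on the perceptual Mel scale and loudness in decibels. As a rough illustration, the sketch below converts a frequency in Hz to Mels and a power value to decibels using only the standard library. It uses the HTK-style Mel formula, which is an assumption made for illustration (Librosa's default is the Slaney variant), and `hz_to_mel_htk` and `power_to_db_sketch` are hypothetical helpers, not the library functions.

```python
import math


def hz_to_mel_htk(freq_hz):
    """Convert Hz to Mels with the HTK formula: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def power_to_db_sketch(power, ref=1.0, amin=1e-10):
    """Convert a power value to decibels relative to `ref`.

    `amin` floors tiny values so that log10 never receives zero.
    """
    return 10.0 * math.log10(max(power, amin)) - 10.0 * math.log10(max(ref, amin))


# Equal steps in Hz shrink on the Mel scale as frequency rises.
low_step = hz_to_mel_htk(1000) - hz_to_mel_htk(500)
high_step = hz_to_mel_htk(4000) - hz_to_mel_htk(3500)

# A value equal to the reference maps to 0 dB; one tenth of it maps to -10 dB.
db_at_ref = power_to_db_sketch(1.0, ref=1.0)
db_tenth = power_to_db_sketch(0.1, ref=1.0)

print(f"500-1000 Hz spans {low_step:.1f} Mels, 3500-4000 Hz spans {high_step:.1f} Mels")
print(f"ref -> {db_at_ref:.1f} dB, ref/10 -> {db_tenth:.1f} dB")
```

Because the tutorial passes `ref=np.max` to `librosa.power_to_db`, the loudest bin of each spectrogram sits at 0 dB and every other bin is negative.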

For this task, we will use the **Toronto Emotional Speech Set (TESS)**, which includes 2,800 samples covering seven emotions, recorded by a 64-year-old woman and a young woman in her 20s.

The emotions are anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral.

You can download the dataset from here.
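From the figures above (2,800 samples, seven emotions, two speakers), a quick back-of-the-envelope check gives the expected size of each class and the accuracy of random guessing. The sketch assumes the samples are spread evenly across emotions and speakers, which the text does not state explicitly.

```python
total_samples = 2800
num_emotions = 7
num_speakers = 2

# Assumption: samples are evenly distributed across emotions and speakers.
samples_per_emotion = total_samples // num_emotions
samples_per_speaker_emotion = samples_per_emotion // num_speakers

# With balanced classes, random guessing is right about 1 time in 7.
chance_accuracy = 1 / num_emotions

print(samples_per_emotion, samples_per_speaker_emotion, round(chance_accuracy, 4))
```

The chance level is a useful baseline: any accuracy reported for the model should be read against roughly 14%, not against 50%.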

Step 1: Import Required Libraries

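Import the necessary libraries for data preprocessing, model creation, and training. Key libraries include Librosa for audio loading and feature extraction, NumPy for array handling, and PyTorch with torchvision for the dataset, model, and training loop.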

```python
import os

import librosa
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models
from torch.utils.data import DataLoader, Dataset, random_split
```

Step 2: Define the Custom Dataset Class

The **EmotionDataset** class loads audio files, converts them into Mel-spectrograms, and prepares the data for model training.

```python
class EmotionDataset(Dataset):
    """Loads TESS audio files and returns (Mel-spectrogram, label) pairs."""

    def __init__(self, data_path, emotions, transform=None):
        self.data_path = data_path
        self.emotions = emotions
        self.file_list = []
        self.labels = []
        self.transform = transform

        # Each emotion is expected in two folders, one per speaker prefix (YAF, OAF).
        for idx, emotion in enumerate(emotions):
            emotion_folders = [f'YAF_{emotion}', f'OAF_{emotion}']
            for folder in emotion_folders:
                folder_path = os.path.join(data_path, folder)
                if os.path.exists(folder_path):
                    for file_name in os.listdir(folder_path):
                        file_path = os.path.join(folder_path, file_name)
                        self.file_list.append(file_path)
                        self.labels.append(idx)

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        file_path = self.file_list[idx]
        label = self.labels[idx]

        # Load at 16 kHz and compute a 128-band Mel-spectrogram in decibels.
        y, sr = librosa.load(file_path, sr=16000)
        mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)

        # Zero-pad short clips or truncate long ones to 128 time frames.
        max_length = 128
        pad_width = max_length - mel_spectrogram_db.shape[1]
        if pad_width > 0:
            mel_spectrogram_db = np.pad(mel_spectrogram_db, pad_width=((0, 0), (0, pad_width)), mode='constant')
        else:
            mel_spectrogram_db = mel_spectrogram_db[:, :max_length]

        # Repeat the single channel three times to match VGG16's 3-channel input.
        mel_spectrogram_3ch = np.repeat(mel_spectrogram_db[np.newaxis, :, :], 3, axis=0)
        return torch.tensor(mel_spectrogram_3ch, dtype=torch.float32), torch.tensor(label)
```
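The fixed width of 128 frames determines how much audio each spectrogram keeps. The sketch below estimates the frame count for a clip, assuming Librosa's default `hop_length=512` with centered frames (the dataset code does not override it), and mirrors the pad-or-truncate rule on a plain list. `num_frames` and `fit_width` are hypothetical helper names used only for this illustration.

```python
def num_frames(num_samples, hop_length=512):
    """Frame count for a centered STFT: one frame per hop, plus one."""
    return 1 + num_samples // hop_length


def fit_width(row, width, fill=0.0):
    """Pad `row` on the right with `fill`, or cut it, so it has `width` items."""
    if len(row) < width:
        return row + [fill] * (width - len(row))
    return row[:width]


sr = 16000
max_length = 128

# A 2-second clip yields fewer than 128 frames and gets padded.
frames_2s = num_frames(2 * sr)
# A 5-second clip yields more than 128 frames and gets truncated.
frames_5s = num_frames(5 * sr)

# Roughly how many seconds of audio fit into 128 frames.
seconds_covered = (max_length - 1) * 512 / sr

padded = fit_width([1.0] * frames_2s, max_length)
cut = fit_width([1.0] * frames_5s, max_length)

print(frames_2s, frames_5s, len(padded), len(cut), seconds_covered)
```

One design point to keep in mind: the spectrogram is in decibels relative to its own maximum, so a fill value of 0 corresponds to the loudest level rather than to silence. Padding with the spectrogram's minimum value would be one alternative if that matters for a given dataset.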

Step 3: Define the Emotion Recognition Model

Use a pre-trained VGG16 model for transfer learning. Freeze the existing layers and replace the final layer with a custom classification layer for emotion recognition.

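```python
class EmotionRecognitionModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Load ImageNet weights (torchvision >= 0.13; older releases use pretrained=True).
        self.vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Freeze every pre-trained parameter.
        for param in self.vgg.parameters():
            param.requires_grad = False
        # Replace the last classifier layer; the new layer is trainable by default.
        self.vgg.classifier[6] = nn.Linear(self.vgg.classifier[6].in_features, num_classes)

    def forward(self, x):
        return self.vgg(x)
```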
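To see how little of the network is actually trained, the arithmetic below compares the parameters of the new output layer with the frozen remainder. The figures assume the standard torchvision VGG16 (138,357,544 parameters in total, with a 4096-to-1000 final layer) and the seven emotion classes used in this tutorial.

```python
vgg16_total_params = 138_357_544   # standard VGG16 with a 1000-class head
in_features = 4096                 # input size of classifier[6]
imagenet_classes = 1000
num_classes = 7

# A linear layer has in_features * out_features weights plus out_features biases.
old_head_params = in_features * imagenet_classes + imagenet_classes
new_head_params = in_features * num_classes + num_classes

frozen_params = vgg16_total_params - old_head_params
trainable_fraction = new_head_params / (frozen_params + new_head_params)

print(f"trainable: {new_head_params:,}  frozen: {frozen_params:,}")
print(f"trainable fraction: {trainable_fraction:.6f}")
```

Only the replaced layer receives gradient updates, which is why training is much cheaper than fitting the whole network.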

Step 4: Initialize Dataset and DataLoader

```python
emotions = ['anger', 'disgust', 'fear', 'happiness', 'pleasant_surprise', 'sadness', 'neutral']
data_path = 'TESS Toronto emotional speech set data'
dataset = EmotionDataset(data_path, emotions)

# 70% training, 15% validation, and the remainder for testing.
train_size = int(0.7 * len(dataset))
val_size = int(0.15 * len(dataset))
test_size = len(dataset) - train_size - val_size
train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)
```
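The split sizes can be checked without loading any audio. The sketch below applies the same arithmetic to a hypothetical dataset length of 2,800 (assuming every TESS file was found) and shows why the test size is computed as the remainder: `int()` truncates, so the three parts would not otherwise be guaranteed to add up. `split_sizes` is a helper introduced here for illustration.

```python
def split_sizes(n, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) sizes; test takes whatever is left over."""
    train_size = int(train_frac * n)
    val_size = int(val_frac * n)
    test_size = n - train_size - val_size
    return train_size, val_size, test_size


# Assumption: all 2,800 TESS files were discovered by the dataset class.
sizes = split_sizes(2800)
print(sizes, sum(sizes))

# An awkward length still sums correctly because of the remainder rule.
odd_sizes = split_sizes(101)
print(odd_sizes, sum(odd_sizes))
```

`random_split` also accepts a `generator` argument; passing a `torch.Generator` with a fixed seed makes the split reproducible across runs.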

Step 5: Training the Model

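```python
model = EmotionRecognitionModel(num_classes=len(emotions))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    total_train_correct = 0
    total_train_samples = 0

    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        total_train_correct += (outputs.argmax(dim=1) == labels).sum().item()
        total_train_samples += labels.size(0)

    avg_train_loss = train_loss / len(train_loader)
    train_accuracy = total_train_correct / total_train_samples
    print(f"Epoch [{epoch+1}/{num_epochs}], Training Loss: {avg_train_loss:.4f}, Training Accuracy: {train_accuracy:.4f}")

    # Validation pass: no gradient tracking, dropout disabled by eval().
    model.eval()
    val_loss = 0.0
    total_val_correct = 0
    total_val_samples = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
            total_val_correct += (outputs.argmax(dim=1) == labels).sum().item()
            total_val_samples += labels.size(0)

    avg_val_loss = val_loss / len(val_loader)
    val_accuracy = total_val_correct / total_val_samples
    print(f"Epoch [{epoch+1}/{num_epochs}], Validation Loss: {avg_val_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}")

print(f"Final Training Accuracy: {train_accuracy:.4f}")
print(f"Final Validation Accuracy: {val_accuracy:.4f}")
```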

**Output:**

Epoch [1/10], Training Loss: 3.5698, Training Accuracy: 0.3829
Epoch [1/10], Validation Loss: 0.6287, Validation Accuracy: 0.7867
Epoch [2/10], Training Loss: 1.6390, Training Accuracy: 0.4850
Epoch [2/10], Validation Loss: 0.2506, Validation Accuracy: 0.8433
.
.
.
Epoch [10/10], Training Loss: 0.3281, Training Accuracy: 0.7450
Epoch [10/10], Validation Loss: 0.0285, Validation Accuracy: 0.9493
Final Training Accuracy: 0.7450
Final Validation Accuracy: 0.9493
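The loss and accuracy figures in this log are derived from the model's raw scores (logits). As a minimal sketch of what `nn.CrossEntropyLoss` and the `argmax` comparison compute for a batch, the pure-Python version below uses made-up logits and labels. It illustrates the formulas and is not the PyTorch implementation.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[label]


def batch_metrics(batch_logits, labels):
    """Return (mean loss, accuracy) for a batch of logit rows."""
    losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
    correct = sum(
        1 for row, y in zip(batch_logits, labels) if row.index(max(row)) == y
    )
    return sum(losses) / len(losses), correct / len(labels)


# Made-up logits for three samples over three classes.
batch_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.0, 1.5, 0.3]]
labels = [0, 2, 0]

mean_loss, accuracy = batch_metrics(batch_logits, labels)
print(f"loss={mean_loss:.4f} accuracy={accuracy:.4f}")
```

Averaging the per-batch loss over the number of batches, as the training loop does with `len(train_loader)`, gives the per-epoch loss shown in the log.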

Step 7: Predict an Emotion

Use the trained model to predict the emotion of a new audio file.

```python
def predict_emotion(audio_path):
    # Apply the same preprocessing as the training dataset.
    y, sr = librosa.load(audio_path, sr=16000)
    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)

    max_length = 128
    pad_width = max_length - mel_spectrogram_db.shape[1]
    if pad_width > 0:
        mel_spectrogram_db = np.pad(mel_spectrogram_db, pad_width=((0, 0), (0, pad_width)), mode='constant')
    else:
        mel_spectrogram_db = mel_spectrogram_db[:, :max_length]

    mel_spectrogram_3ch = np.repeat(mel_spectrogram_db[np.newaxis, :, :], 3, axis=0)
    # Add a batch dimension: (3, 128, 128) -> (1, 3, 128, 128).
    input_tensor = torch.tensor(mel_spectrogram_3ch, dtype=torch.float32).unsqueeze(0)

    model.eval()
    with torch.no_grad():
        output = model(input_tensor)
        predicted_class = output.argmax(dim=1).item()
    return emotions[predicted_class]


audio_file_path = '/path/to/audio.wav'  # Replace with your audio file path
predicted_emotion = predict_emotion(audio_file_path)
print(f"Predicted Emotion: {predicted_emotion}")
```

**Output:**

Predicted Emotion: fear
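A single label hides how confident the model is. The sketch below turns a vector of logits into probabilities and ranks the emotions. The logits are made up for illustration, and `rank_emotions` is a hypothetical helper, not part of the tutorial's code.

```python
import math

emotions = ['anger', 'disgust', 'fear', 'happiness',
            'pleasant_surprise', 'sadness', 'neutral']


def rank_emotions(logits, labels, top_k=3):
    """Return the top_k (label, probability) pairs, most likely first."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up logits, one per emotion in the order listed above.
example_logits = [0.3, -1.2, 2.5, 0.1, 1.0, -0.4, 0.0]
top3 = rank_emotions(example_logits, emotions)
for label, prob in top3:
    print(f"{label}: {prob:.3f}")
```

Inside `predict_emotion`, comparable probabilities could be obtained by applying `torch.softmax(output, dim=1)` to the model output before taking the `argmax`.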

Complete Code

```python
import os

import librosa
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models
from torch.utils.data import DataLoader, Dataset, random_split


class EmotionDataset(Dataset):
    """Loads TESS audio files and returns (Mel-spectrogram, label) pairs."""

    def __init__(self, data_path, emotions, transform=None):
        self.data_path = data_path
        self.emotions = emotions
        self.file_list = []
        self.labels = []
        self.transform = transform

        for idx, emotion in enumerate(emotions):
            emotion_folders = [f'YAF_{emotion}', f'OAF_{emotion}']
            for folder in emotion_folders:
                folder_path = os.path.join(data_path, folder)
                if os.path.exists(folder_path):
                    for file_name in os.listdir(folder_path):
                        file_path = os.path.join(folder_path, file_name)
                        self.file_list.append(file_path)
                        self.labels.append(idx)

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        file_path = self.file_list[idx]
        label = self.labels[idx]
        y, sr = librosa.load(file_path, sr=16000)
        mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)
        # Zero-pad or truncate to 128 time frames.
        max_length = 128
        pad_width = max_length - mel_spectrogram_db.shape[1]
        if pad_width > 0:
            mel_spectrogram_db = np.pad(mel_spectrogram_db, pad_width=((0, 0), (0, pad_width)), mode='constant')
        else:
            mel_spectrogram_db = mel_spectrogram_db[:, :max_length]
        # Repeat the single channel three times to match VGG16's 3-channel input.
        mel_spectrogram_3ch = np.repeat(mel_spectrogram_db[np.newaxis, :, :], 3, axis=0)
        return torch.tensor(mel_spectrogram_3ch, dtype=torch.float32), torch.tensor(label)


class EmotionRecognitionModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Load ImageNet weights (torchvision >= 0.13; older releases use pretrained=True).
        self.vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        for param in self.vgg.parameters():
            param.requires_grad = False
        # The replacement layer is the only trainable part of the network.
        self.vgg.classifier[6] = nn.Linear(self.vgg.classifier[6].in_features, num_classes)

    def forward(self, x):
        return self.vgg(x)


emotions = ['anger', 'disgust', 'fear', 'happiness', 'pleasant_surprise', 'sadness', 'neutral']
data_path = '/content/drive/MyDrive/extract_speech/TESS Toronto emotional speech set data'
dataset = EmotionDataset(data_path, emotions)
train_size = int(0.7 * len(dataset))
val_size = int(0.15 * len(dataset))
test_size = len(dataset) - train_size - val_size
train_dataset, val_dataset, test_dataset = random_split(dataset, [train_size, val_size, test_size])
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)

model = EmotionRecognitionModel(num_classes=len(emotions))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    # Reset the counters every epoch so the accuracy is per-epoch, not cumulative.
    total_train_correct = 0
    total_train_samples = 0

    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        total_train_correct += (outputs.argmax(dim=1) == labels).sum().item()
        total_train_samples += labels.size(0)

    avg_train_loss = train_loss / len(train_loader)
    train_accuracy = total_train_correct / total_train_samples
    print(f"Epoch [{epoch+1}/{num_epochs}], Training Loss: {avg_train_loss:.4f}, Training Accuracy: {train_accuracy:.4f}")

    model.eval()
    val_loss = 0.0
    total_val_correct = 0
    total_val_samples = 0

    with torch.no_grad():
        for inputs, labels in val_loader:
            outputs = model(inputs)
            loss = criterion(outputs, labels)

            val_loss += loss.item()
            total_val_correct += (outputs.argmax(dim=1) == labels).sum().item()
            total_val_samples += labels.size(0)

    avg_val_loss = val_loss / len(val_loader)
    val_accuracy = total_val_correct / total_val_samples
    print(f"Epoch [{epoch+1}/{num_epochs}], Validation Loss: {avg_val_loss:.4f}, Validation Accuracy: {val_accuracy:.4f}")

# Accuracies from the last epoch.
final_train_accuracy = train_accuracy
final_val_accuracy = val_accuracy
print(f"Final Training Accuracy: {final_train_accuracy:.4f}")
print(f"Final Validation Accuracy: {final_val_accuracy:.4f}")

# Save the trained weights, then reload them before testing.
torch.save(model.state_dict(), 'emotion_recognition_model.pth')

test_loader = DataLoader(test_dataset, batch_size=32)

model.load_state_dict(torch.load('emotion_recognition_model.pth'))

model.eval()
test_loss = 0.0
total_test_correct = 0
total_test_samples = 0

with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        test_loss += loss.item()
        total_test_correct += (outputs.argmax(dim=1) == labels).sum().item()
        total_test_samples += labels.size(0)

avg_test_loss = test_loss / len(test_loader)
test_accuracy = total_test_correct / total_test_samples
print(f"Test Loss: {avg_test_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")


def predict_emotion(audio_path):
    # Apply the same preprocessing as the training dataset.
    y, sr = librosa.load(audio_path, sr=16000)
    mel_spectrogram = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)
    max_length = 128
    pad_width = max_length - mel_spectrogram_db.shape[1]
    if pad_width > 0:
        mel_spectrogram_db = np.pad(mel_spectrogram_db, pad_width=((0, 0), (0, pad_width)), mode='constant')
    else:
        mel_spectrogram_db = mel_spectrogram_db[:, :max_length]
    mel_spectrogram_3ch = np.repeat(mel_spectrogram_db[np.newaxis, :, :], 3, axis=0)
    input_tensor = torch.tensor(mel_spectrogram_3ch, dtype=torch.float32).unsqueeze(0)
    model.eval()
    with torch.no_grad():
        output = model(input_tensor)
        predicted_class = output.argmax(dim=1).item()
    return emotions[predicted_class]


# Replace with your audio file path
audio_file_path = '/content/drive/MyDrive/extract_speech/TESS Toronto emotional speech set data/OAF_Fear/OAF_bar_fear.wav'
predicted_emotion = predict_emotion(audio_file_path)
print(f'Predicted Emotion: {predicted_emotion}')
```

Speech emotion recognition makes it possible to infer a speaker's emotional state from audio alone. Combining Librosa's feature extraction with a pre-trained VGG16 classifier gives a practical starting point for applications in the industries mentioned earlier, such as customer service, where it can complement sentiment analysis.