Mel-frequency Cepstral Coefficients (MFCC) for Speech Recognition

Last Updated : 23 Jul, 2025

Have you ever wondered how your smartphone comprehends voice instructions? Or how voice assistants such as Alexa and Siri process your commands? The mechanism behind this remarkable capability is largely attributed to a method known as Mel-Frequency Cepstral Coefficients (MFCCs).

While the concept may initially appear daunting, this article aims to demystify MFCCs and present them in a way that readers new to the topic can follow.

Speech Recognition Technology

Speech recognition technology allows machines to interpret human speech, transforming spoken words into a format that computers can manipulate. This technology is pivotal in developing interactive and responsive AI, such as voice-activated assistants, automated customer service systems, and real-time translation services.

What are MFCCs?

MFCC stands for Mel-frequency Cepstral Coefficients. MFCCs are features used in automatic speech and speaker recognition. Essentially, they represent the short-term power spectrum of a sound in a compact form, which helps machines process human speech more effectively. Imagine your voice as a fingerprint: MFCCs act like a code that captures the salient features of your speech and lets computers distinguish between different words and sounds. This code is especially helpful in speech recognition applications, where computers must translate spoken words into text.

Role of Mel-Frequency Cepstral Coefficients (MFCCs)

MFCCs are compact numerical descriptions of the spectral envelope that the vocal tract imposes on the sound as a person speaks. Computing them involves several steps designed to capture the characteristics of speech that are most discernible to the human ear.

Here’s how MFCCs contribute to understanding speech:

  1. **Signal Analysis:** Speech is a complex signal whose frequency content and amplitude vary over time. MFCCs break it down into simpler components that describe how the spectrum changes from one short frame to the next.
  2. **Frequency Transformation:** Humans do not perceive frequencies on a linear scale. MFCCs therefore use the Mel scale, which approximates the response of the human auditory system: it is more sensitive to changes at lower frequencies than at higher ones.
  3. **Cepstral Representation:** After the Mel-scale transformation, the log spectrum is transformed once more to obtain the cepstrum, a representation over a time-like axis called quefrency. The cepstrum separates the signal's rapid periodic variation (pitch) from the slow variation of the spectral envelope (timbre), and MFCCs keep the latter, which carries most of the information relevant to recognizing speech.

Basics of Fourier Transform

The Fourier Transform is based on the premise that any periodic signal can be represented as a sum of simple oscillating functions, namely sines and cosines. These functions are characterized by their frequencies, and the Fourier Transform identifies the component frequencies in a signal and measures their amplitude and phase.

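The Fourier Transform of a continuous-time signal f(t) is given by:

F(\omega) = \int_{-\infty}^{\infty} f(t) e^{-i\omega t} dt

where f(t) is the time-domain signal, F(\omega) is its frequency-domain representation, \omega is the angular frequency in radians per second, and i is the imaginary unit.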
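For sampled audio, the integral is replaced by the discrete Fourier transform. The sketch below is only an illustration on a made-up two-tone signal (the sample rate, frequencies, and amplitudes are assumptions chosen so that each tone falls exactly on a frequency bin); it shows the transform recovering the component frequencies and their amplitudes.

```python
import numpy as np

# Hypothetical test signal: two sine waves sampled for exactly one second.
sample_rate = 1000                                  # samples per second (assumed)
t = np.arange(sample_rate) / sample_rate
signal = 1.0 * np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# Real-input FFT: one complex value per frequency bin from 0 Hz to sample_rate / 2.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# Scale the magnitudes so that a sine of amplitude A shows up as A.
amplitudes = 2 * np.abs(spectrum) / len(signal)

# The two largest peaks sit at the component frequencies.
top_two = np.sort(freqs[np.argsort(amplitudes)[-2:]])
print(top_two)
```

With real speech the tones rarely line up with the bins this neatly, which is one reason the later windowing step matters.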

Mel-Scale for Audio Analysis

The Mel-scale is specifically designed to mimic the way humans perceive sound, particularly how we discern differences in pitch. Human hearing is more sensitive to changes in lower frequencies than to equivalent changes in higher frequencies.

The Mel-scale addresses this by using approximately linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. This scaling gives a more perceptually relevant representation of audio signals, aligning the frequency axis with the non-linear human auditory system.

This dual approach helps in various applications like speech processing and music analysis, where capturing the nuances of how humans actually hear can significantly enhance the effectiveness and accuracy of the technology.
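As a small sketch of the conversion, the helpers below use the same Hz-to-Mel formula that appears in the filterbank code later in this article (2595 * log10(1 + f / 700)). The function names `hz_to_mel` and `mel_to_hz` are illustrative, and other Mel variants exist.

```python
import numpy as np

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# The same 100 Hz step covers fewer mels at high frequencies than at low ones.
low_step = hz_to_mel(600) - hz_to_mel(500)
high_step = hz_to_mel(5100) - hz_to_mel(5000)
print(low_step, high_step)
```

With this formula, 1000 Hz maps to roughly 1000 mels, and equal steps in Hz shrink on the Mel axis as frequency rises.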

Pre-emphasis in Audio Signal Processing

Pre-emphasis is a preprocessing technique used in audio signal processing, especially in speech recognition, to boost the high-frequency components of a speech signal. This is necessary because speech naturally carries less energy at higher frequencies, owing to the physiological characteristics of human speech production and the properties of sound transmission.

By amplifying these frequencies, pre-emphasis balances the spectrum and facilitates the subsequent processing stages, including feature extraction, by ensuring that key speech characteristics are preserved and highlighted.
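To see what "amplifying higher frequencies" means in numbers, the sketch below evaluates the gain of the first-order filter y[n] = x[n] - alpha * x[n-1], which is the form used in the implementation later in this article. The 16 kHz sample rate and the helper name `preemphasis_gain` are assumptions made for illustration.

```python
import numpy as np

ALPHA = 0.97  # pre-emphasis coefficient; 0.97 is the value used later in this article

def preemphasis_gain(freq_hz, sample_rate, alpha=ALPHA):
    """Magnitude response of y[n] = x[n] - alpha * x[n-1] at freq_hz."""
    omega = 2 * np.pi * np.asarray(freq_hz, dtype=float) / sample_rate
    return np.abs(1 - alpha * np.exp(-1j * omega))

sample_rate = 16000  # assumed sample rate
test_freqs = np.array([0, 100, 1000, 4000, 8000])
gains = preemphasis_gain(test_freqs, sample_rate)
for f, g in zip(test_freqs, gains):
    print(f"{f:>5} Hz -> gain {g:.3f}")
```

The gain rises steadily from 1 - alpha at 0 Hz to 1 + alpha at the Nyquist frequency, so low frequencies are attenuated and high frequencies are boosted relative to them.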

Framing the Signals

In speech processing, the continuous speech stream is divided into shorter segments called frames, typically lasting between 20 and 40 milliseconds. This segmentation is necessary because speech characteristics, like pitch and tone, change over time, while within such a short segment the signal can be treated as approximately stationary. By analyzing these short, stable segments, we can more effectively capture and examine the speech's dynamic properties.

Additionally, frames often overlap by about 50%, ensuring that no important information is missed and smoothing the transitions between segments. This overlap helps prevent discontinuities and ensures comprehensive analysis of the speech stream.
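The arithmetic behind framing can be sketched as follows. The helper `frame_layout` and the 16 kHz sample rate are hypothetical; the 25 ms frame and 10 ms step match the values used in the implementation later in this article, which gives an overlap of 60% rather than 50%.

```python
def frame_layout(num_samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Return (frame_length, hop_length, num_full_frames, overlap_fraction)."""
    frame_length = int(round(sample_rate * frame_ms / 1000))
    hop_length = int(round(sample_rate * hop_ms / 1000))
    if num_samples < frame_length:
        num_frames = 0  # not even one full frame fits without padding
    else:
        num_frames = 1 + (num_samples - frame_length) // hop_length
    overlap = 1 - hop_length / frame_length
    return frame_length, hop_length, num_frames, overlap

# One second of audio at an assumed 16 kHz sample rate.
layout = frame_layout(16000, 16000)
print(layout)

# A 20 ms frame with a 10 ms step gives the 50% overlap mentioned above.
half_overlap = frame_layout(16000, 16000, frame_ms=20.0, hop_ms=10.0)
print(half_overlap)
```

This counts only complete frames; implementations that zero-pad the end of the signal, as the one later in this article does, can produce an extra frame.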

Windowing

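To prevent unwanted artifacts such as spectral leakage, caused by the abrupt start and end of each frame, windowing is applied. This involves multiplying each frame by a window function, such as the Hamming window, that tapers the samples toward the frame edges.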
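A rough way to see the effect is to measure how much spectral energy ends up far from the true frequency of a pure tone, with and without a window. The sketch below uses an artificial tone that falls between two FFT bins; the frame length, the tone, and the `leakage_fraction` helper are all assumptions chosen for illustration.

```python
import numpy as np

frame_length = 400
n = np.arange(frame_length)
# 10.5 cycles per frame: the tone falls between two FFT bins, a bad case for leakage.
tone = np.sin(2 * np.pi * 10.5 * n / frame_length)

def leakage_fraction(frame, guard_bins=5):
    """Share of spectral energy lying more than guard_bins away from the peak bin."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    peak = int(np.argmax(power))
    lo, hi = max(0, peak - guard_bins), peak + guard_bins + 1
    return 1.0 - power[lo:hi].sum() / power.sum()

rect_leak = leakage_fraction(tone)                               # no window (rectangular)
hamming_leak = leakage_fraction(tone * np.hamming(frame_length))  # Hamming window
print(rect_leak, hamming_leak)
```

The Hamming-windowed frame keeps a larger share of its energy near the peak, at the cost of a slightly wider main lobe.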

Fast Fourier Transform (FFT)

The Fast Fourier Transform (FFT) is an algorithm that efficiently computes the discrete Fourier Transform. It converts each windowed frame from the time domain into the frequency domain, from which the power spectrum of the frame is obtained.
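As a minimal sketch of this step for a single frame, the snippet below uses random noise as a stand-in for a 25 ms frame. The 16 kHz sample rate is an assumption, and 512 matches the FFT size used in the implementation later in this article.

```python
import numpy as np

sample_rate = 16000   # assumed sample rate
n_fft = 512           # FFT size
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)   # placeholder for one 25 ms frame

# rfft zero-pads the 400-sample frame to n_fft points and keeps the
# non-negative frequencies only, giving n_fft // 2 + 1 bins.
spectrum = np.fft.rfft(frame, n=n_fft)
power = (np.abs(spectrum) ** 2) / n_fft   # periodogram estimate of the power spectrum

# Centre frequency of each bin in Hz: bin k corresponds to k * sample_rate / n_fft.
bin_freqs = np.fft.rfftfreq(n_fft, d=1 / sample_rate)
print(spectrum.shape, bin_freqs[1], bin_freqs[-1])
```

Each frame therefore becomes a vector of 257 power values, spaced 31.25 Hz apart under these assumed settings.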

Mel-filterbank

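Once the signal is in the frequency domain, a Mel-filterbank is applied. It consists of overlapping triangular filters spaced according to the Mel-scale, and each filter sums the spectral energy in its band.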

Log Mel-spectrum

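Our perception of loudness is logarithmic rather than linear. To replicate this, the logarithm of each filterbank output is taken, which yields the log Mel-spectrum.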
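The compression that the logarithm provides can be sketched with a few made-up energy values. The helper `power_to_db` and its floor value are illustrative assumptions; 10 * log10 is the usual decibel convention for power quantities.

```python
import numpy as np

def power_to_db(power, floor=1e-10):
    """10 * log10 of a power value, with a small floor to avoid log(0)."""
    return 10.0 * np.log10(np.maximum(power, floor))

# Hypothetical filterbank energies spanning six orders of magnitude.
energies = np.array([1e-6, 1e-3, 1.0, 2.0])
db = power_to_db(energies)
print(db)
```

A millionfold range of energies collapses into a span of about 60 dB, and doubling the energy adds only about 3 dB.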

Discrete Cosine Transform (DCT)

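Finally, a DCT is applied to the log Mel-spectrum. Because neighboring filterbank outputs are strongly correlated, the DCT decorrelates them and concentrates most of the information in the first few coefficients.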
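The energy-compaction property can be illustrated on a synthetic log Mel-spectrum. The vector below is invented for the purpose (a downward tilt plus one broad bump) and is only a stand-in for real data; the sketch uses `scipy.fft.dct` with orthonormal scaling.

```python
import numpy as np
from scipy.fft import dct

n_filters = 40
idx = np.arange(n_filters)
# Hypothetical smooth log Mel-spectrum in dB: a downward tilt plus one broad bump.
log_mel = np.linspace(0.0, -30.0, n_filters) + 10.0 * np.exp(-((idx - 10) / 6.0) ** 2)

# The orthonormal DCT-II preserves the total energy of the vector.
coeffs = dct(log_mel, type=2, norm='ortho')

num_ceps = 12
kept_energy = np.sum(coeffs[:num_ceps] ** 2) / np.sum(coeffs ** 2)
print(kept_energy)
```

For a smooth envelope like this one, the first 12 of the 40 coefficients retain almost all of the energy, which is why only the leading coefficients are kept as MFCCs.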

How to compute MFCC?

Finally, by taking the first few coefficients from the DCT output, we obtain the MFCCs, which represent a compact and informative description of the speech signal in each frame. To calculate MFCCs, we follow these steps:

  1. **Pre-emphasize the signal:** Amplify higher frequencies to balance the spectrum.
  2. **Framing:** Break the signal into small, overlapping frames.
  3. **Windowing:** Apply a Hamming window to soften the edges of each frame.
  4. **FFT:** Convert each frame from the time domain to the frequency domain.
  5. **Mel-filterbank:** Apply overlapping triangular filters spaced according to the Mel-scale.
  6. **Logarithm:** Take the logarithm of the filterbank outputs to replicate the way the human ear responds to sound intensity.
  7. **DCT:** Apply the DCT to the log Mel-spectrum to obtain the Mel-frequency Cepstral Coefficients.

Calculating MFCCs from Speech Signal in Python

In this example, we'll go over how to use Python to calculate the MFCCs from a speech signal. We'll use librosa for audio loading, together with numpy, scipy, and matplotlib. Lastly, we'll use ipywidgets to build a basic GUI that lets users run the same pipeline on their own audio files.

Original Signal -> Pre-emphasis -> Framing -> Windowing -> FFT -> Mel-filterbank -> Logarithm -> DCT -> MFCCs

Step 1: Install Required Libraries

We must install the required libraries first. In Google Colab or a Jupyter notebook, you can use the following command (in a regular terminal, omit the leading `!`):

!pip install numpy scipy matplotlib librosa ipywidgets requests

Step 2: Load and Visualize the Audio Signal

We'll start by loading an audio file and visualizing its waveform.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Load the audio file (librosa resamples to 22050 Hz by default)
audio_path = librosa.example('trumpet')
y, sr = librosa.load(audio_path)

# Plot the waveform
plt.figure(figsize=(14, 5))
plt.plot(y)
plt.title('Waveform of the Audio Signal')
plt.xlabel('Time (samples)')
plt.ylabel('Amplitude')
plt.show()
```

**Output:**

Downloading file 'sorohanro_-_solo-trumpet-06.ogg' from 'https://librosa.org/data/audio/sorohanro_-_solo-trumpet-06.ogg' to '/root/.cache/librosa'.

[Figure: Waveform of the Audio Signal]

Step 3: Pre-emphasis

Pre-emphasizing the audio signal helps to balance the spectrum by amplifying higher frequencies.

```python
# Apply pre-emphasis filter: y[n] = x[n] - 0.97 * x[n-1]
pre_emphasis = 0.97
y_preemphasized = np.append(y[0], y[1:] - pre_emphasis * y[:-1])

# Plot the pre-emphasized signal
plt.figure(figsize=(14, 5))
plt.plot(y_preemphasized)
plt.title('Pre-emphasized Signal')
plt.xlabel('Time (samples)')
plt.ylabel('Amplitude')
plt.show()
```

**Output:**

[Figure: Pre-emphasized Signal]

Step 4: Framing

We'll break the audio signal into small frames.

```python
frame_size = 0.025    # 25 ms
frame_stride = 0.01   # 10 ms

# Convert from seconds to samples
frame_length = int(round(frame_size * sr))
frame_step = int(round(frame_stride * sr))
signal_length = len(y_preemphasized)
num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step))

# Pad signal to ensure all frames have equal number of samples
pad_signal_length = num_frames * frame_step + frame_length
z = np.zeros(pad_signal_length - signal_length)
pad_signal = np.append(y_preemphasized, z)

# Slice the signal into frames: row i holds the sample indices of frame i
indices = (
    np.tile(np.arange(0, frame_length), (num_frames, 1))
    + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
)
frames = pad_signal[indices.astype(np.int32, copy=False)]

# Plot the first frame
plt.figure(figsize=(14, 5))
plt.plot(frames[0])
plt.title('First Frame of the Signal')
plt.xlabel('Samples')
plt.ylabel('Amplitude')
plt.show()
```

**Output:**

[Figure: First Frame of the Signal]

Step 5: Windowing

Apply a window function to each frame to minimize discontinuities at the edges.

```python
# Apply Hamming window to every frame (broadcast across rows)
frames *= np.hamming(frame_length)

# Plot the first frame after windowing
plt.figure(figsize=(14, 5))
plt.plot(frames[0])
plt.title('First Frame after Windowing')
plt.xlabel('Samples')
plt.ylabel('Amplitude')
plt.show()
```

**Output:**

[Figure: First Frame after Windowing]

Step 6: Fast Fourier Transform (FFT)

Convert each frame from the time domain to the frequency domain.

```python
NFFT = 512
mag_frames = np.absolute(np.fft.rfft(frames, NFFT))  # Magnitude of the FFT
pow_frames = (1.0 / NFFT) * (mag_frames ** 2)        # Power spectrum

# Plot the magnitude spectrum of the first frame
plt.figure(figsize=(14, 5))
plt.plot(mag_frames[0])
plt.title('Magnitude Spectrum of the First Frame')
plt.xlabel('Frequency Bin')
plt.ylabel('Amplitude')
plt.show()
```

**Output:**

[Figure: Magnitude Spectrum of the First Frame]

Step 7: Apply Mel-filterbank

Apply a filterbank to the power spectra to get the energy in each Mel-frequency bin.

```python
nfilt = 40
low_freq_mel = 0
high_freq_mel = 2595 * np.log10(1 + (sr / 2) / 700)               # Convert Hz to Mel
mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2)  # Equally spaced on the Mel scale
hz_points = 700 * (10 ** (mel_points / 2595) - 1)                 # Convert Mel to Hz
# FFT bin index of each filter edge (named bins to avoid shadowing the built-in bin)
bins = np.floor((NFFT + 1) * hz_points / sr)

fbank = np.zeros((nfilt, int(np.floor(NFFT / 2 + 1))))
for m in range(1, nfilt + 1):
    f_m_minus = int(bins[m - 1])  # left edge
    f_m = int(bins[m])            # center
    f_m_plus = int(bins[m + 1])   # right edge

    # Rising slope of the triangle
    for k in range(f_m_minus, f_m):
        fbank[m - 1, k] = (k - bins[m - 1]) / (bins[m] - bins[m - 1])
    # Falling slope of the triangle
    for k in range(f_m, f_m_plus):
        fbank[m - 1, k] = (bins[m + 1] - k) / (bins[m + 1] - bins[m])

filter_banks = np.dot(pow_frames, fbank.T)
filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks)  # Numerical stability
filter_banks = 10 * np.log10(filter_banks)  # Power to dB (10 * log10 for power values)

# Plot the filter bank energies
plt.figure(figsize=(14, 5))
plt.imshow(filter_banks.T, cmap='hot', aspect='auto')
plt.title('Filter Bank Energies')
plt.xlabel('Frame Index')
plt.ylabel('Filter Index')
plt.show()
```

**Output:**

[Figure: Filter Bank Energies]

Step 8: Discrete Cosine Transform (DCT)

Apply DCT to the filter bank energies to get the MFCCs.

```python
from scipy.fftpack import dct

num_ceps = 12
# Keep the first num_ceps coefficients of the DCT of each frame
mfcc = dct(filter_banks, type=2, axis=1, norm='ortho')[:, :num_ceps]

# Plot the MFCCs
plt.figure(figsize=(14, 5))
plt.imshow(mfcc.T, cmap='hot', aspect='auto')
plt.title('MFCC')
plt.xlabel('Frame Index')
plt.ylabel('Cepstral Coefficient Index')
plt.show()
```

**Output:**

[Figure: MFCC]

Step 9: Interactive GUI with ipywidgets

Let's create an interactive GUI where users can upload their own audio files or use a sample audio file from the web to compute MFCCs.

Load Sample Audio File from Web

We'll first demonstrate how to download a sample audio file from the web and use it to compute MFCCs.

```python
import requests

# Download a sample audio file from the web
url = 'https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the download failed
sample_audio_path = 'sample_audio.wav'

with open(sample_audio_path, 'wb') as f:
    f.write(response.content)
```

Interactive GUI for Audio File Upload and MFCC Computation

```python
import io

import librosa.display
import ipywidgets as widgets
from IPython.display import display
from scipy.fftpack import dct

# File uploader widget
uploader = widgets.FileUpload(accept='.wav', multiple=False)


def compute_mfcc(source):
    """Compute and plot MFCCs for a file path or a file-like object."""
    # Resample to the global sr so that frame_length, frame_step and fbank
    # (all derived from sr in the earlier steps) still apply.
    y, _ = librosa.load(source, sr=sr)
    y_preemphasized = np.append(y[0], y[1:] - pre_emphasis * y[:-1])
    signal_length = len(y_preemphasized)
    num_frames = int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step))
    pad_signal_length = num_frames * frame_step + frame_length
    z = np.zeros(pad_signal_length - signal_length)
    pad_signal = np.append(y_preemphasized, z)
    indices = (
        np.tile(np.arange(0, frame_length), (num_frames, 1))
        + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
    )
    frames = pad_signal[indices.astype(np.int32, copy=False)]
    frames *= np.hamming(frame_length)
    mag_frames = np.absolute(np.fft.rfft(frames, NFFT))
    pow_frames = (1.0 / NFFT) * (mag_frames ** 2)
    filter_banks = np.dot(pow_frames, fbank.T)
    filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks)
    filter_banks = 10 * np.log10(filter_banks)  # Power to dB
    mfcc = dct(filter_banks, type=2, axis=1, norm='ortho')[:, :num_ceps]

    plt.figure(figsize=(14, 5))
    plt.subplot(2, 1, 1)
    librosa.display.waveshow(y, sr=sr)
    plt.title('Waveform')
    plt.subplot(2, 1, 2)
    plt.imshow(mfcc.T, cmap='hot', aspect='auto')
    plt.title('MFCC')
    plt.xlabel('Frame Index')
    plt.ylabel('Cepstral Coefficient Index')
    plt.tight_layout()
    plt.show()


def on_upload_change(change):
    """Handle a file upload and compute MFCCs for the uploaded audio."""
    value = uploader.value
    if not value:
        return
    # ipywidgets 7 stores uploads in a dict; ipywidgets 8 uses a tuple of dicts
    upload = list(value.values())[0] if isinstance(value, dict) else value[0]
    # The upload content is raw bytes, so wrap it in a file-like object
    compute_mfcc(io.BytesIO(bytes(upload['content'])))


uploader.observe(on_upload_change, names='value')

# Button to compute MFCCs for the sample audio file
sample_button = widgets.Button(description='Use Sample Audio')


def on_sample_button_click(b):
    compute_mfcc(sample_audio_path)


sample_button.on_click(on_sample_button_click)

# Display widgets
display(uploader, sample_button)
```

**Output:**

[Figure: Waveform and MFCC plots for the selected audio file]

Explanation

This interactive GUI lets users either upload their own audio files or use a sample file to visualize and understand MFCC computation.

Conclusion

MFCCs are a cornerstone of speech recognition technology, providing a robust way to represent speech signals. By imitating human hearing and extracting the important aspects of sound waves, they have made many developments in speech recognition and other speech-based technologies possible. Understanding and using MFCCs can improve the precision and effectiveness of diverse audio processing applications. Whether the task is speaker recognition or speech-to-text conversion, MFCCs remain an essential tool for helping machines comprehend human speech.