Mel-Frequency Cepstral Coefficients (MFCC) for Speech Recognition
Last Updated : 23 Jul, 2025
Have you ever wondered how your smartphone comprehends voice instructions? Or how voice assistants such as Alexa and Siri process your commands? The mechanism behind this remarkable capability is largely attributed to a method known as Mel-Frequency Cepstral Coefficients (MFCCs).
While the concept may initially appear daunting, this article aims to demystify MFCCs and present them in a way that newcomers to the topic can follow.
Table of Contents
- Speech Recognition Technology
- What are MFCCs?
- Role of Mel-Frequency Cepstral Coefficients (MFCCs)
- Basics of Fourier Transform
- Mel-Scale for Audio Analysis
- Pre-emphasis in Audio Signal Processing
- Framing the Signals
- Windowing
- Fast Fourier Transform (FFT)
- Mel-filterbank
- Log Mel-spectrum
- Discrete Cosine Transform (DCT)
- How to compute MFCC?
- Calculating MFCCs from Speech Signal in Python
- Applications of MFCC
- Comparison with Other Features
- Conclusion
- MFCC for Speech Recognition - FAQs
Speech Recognition Technology
Speech recognition technology allows machines to interpret human speech, transforming spoken words into a format that computers can manipulate. This technology is pivotal in developing interactive and responsive AI, such as voice-activated assistants, automated customer service systems, and real-time translation services.
What are MFCCs?
MFCC stands for Mel-Frequency Cepstral Coefficients, a feature representation widely used in automatic speech and speaker recognition. Essentially, MFCCs represent the short-term power spectrum of a sound in a way that helps machines process human speech more effectively. Think of your voice as a fingerprint: MFCCs act like a compact code that captures the salient features of your speech, enabling computers to distinguish between different words and sounds. This code is especially helpful in speech recognition applications, where computers must translate spoken words into text.
Role of Mel-Frequency Cepstral Coefficients (MFCCs)
MFCCs are a compact mathematical representation of the shape of the speech spectrum, which is largely determined by the vocal tract as a person speaks. Computing them involves several steps designed to capture the characteristics of speech that are most discernible to the human ear.
Here’s how MFCCs contribute to understanding speech:
- **Signal Analysis:** Speech is a complex signal whose frequency content and amplitude vary over time. MFCCs break this signal down into simpler components that describe how the sound changes from one short time slice to the next.
- **Frequency Transformation:** Humans do not perceive frequencies on a linear scale. MFCCs therefore use the mel scale, which approximates the response of the human auditory system: it is more sensitive to changes at lower frequencies than at higher ones.
- **Cepstral Representation:** After mapping to the mel scale, the log spectrum is transformed once more to produce the cepstrum, a time-like representation. The cepstrum separates the signal's rapid periodic variation (pitch) from its slow variation (the spectral envelope, or timbre), and MFCCs focus on the latter, which carries most of the information relevant to recognizing speech.
Basics of Fourier Transform
The Fourier Transform is based on the premise that any periodic signal can be represented as a sum of simple oscillating functions, namely sines and cosines. These functions are characterized by their frequencies, and the Fourier Transform identifies the component frequencies in a signal and measures their amplitude and phase.
The Fourier Transform of a continuous-time signal $f(t)$ is given by:
$$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt$$
where:
- $F(\omega)$ is the Fourier Transform of $f(t)$,
- $\omega$ is the angular frequency in radians per second,
- $t$ represents time,
- $e$ is the base of the natural logarithm,
- $i$ is the imaginary unit.
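As a small illustration, the sketch below uses NumPy's discrete FFT (the sampled, finite-length counterpart of the integral) to recover the frequency of a pure tone. The sampling rate, tone frequency, and variable names are illustrative assumptions, not values taken from this article.

```python
import numpy as np

sample_rate = 8000   # assumed sampling rate in Hz
tone_hz = 440.0      # assumed test-tone frequency in Hz

# One second of a pure sine tone
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * tone_hz * t)

# Complex amplitude of each frequency component (non-negative frequencies only)
spectrum = np.fft.rfft(signal)
# Frequency in Hz that each FFT bin corresponds to
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# The bin with the largest magnitude identifies the dominant frequency
peak_hz = freqs[np.argmax(np.abs(spectrum))]
print(peak_hz)
```

The magnitude of each entry in `spectrum` gives the strength of that frequency component, and its angle gives the phase.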
Mel-Scale for Audio Analysis
The Mel-scale is specifically designed to mimic the way humans perceive sound, particularly how we discern differences in pitch. Human hearing is more sensitive to changes in lower frequencies than to equivalent changes in higher frequencies.
The Mel-scale addresses this by using approximately linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. This scaling gives a more perceptually relevant representation of audio signals, aligning the scale with the non-linear human auditory system:
- **Linear Region:** At lower frequencies (below 1000 Hz), our ears can detect small differences in pitch. The Mel-scale mirrors this sensitivity by spacing these frequencies roughly linearly, meaning that a change in frequency corresponds to a proportional change on the scale.
- **Logarithmic Region:** Above 1000 Hz, our ears are less sensitive to changes in frequency. Here the Mel-scale becomes logarithmic, grouping together frequencies that sound perceptually similar. As frequency increases, larger changes are required to produce the same perceived difference in pitch.
This dual approach helps in various applications like speech processing and music analysis, where capturing the nuances of how humans actually hear can significantly enhance the effectiveness and accuracy of the technology.
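One widely used closed-form approximation of the Mel-scale, and the one the filterbank code later in this article relies on, is m = 2595 · log10(1 + f / 700). The sketch below implements that formula and its inverse; the function names `hz_to_mel` and `mel_to_hz` are illustrative choices.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal 1000 Hz steps cover fewer and fewer mels as frequency rises
for low_hz in (0, 1000, 4000, 7000):
    step_in_mels = hz_to_mel(low_hz + 1000) - hz_to_mel(low_hz)
    print(low_hz, low_hz + 1000, round(step_in_mels, 1))
```

With this formula, 1000 Hz maps to roughly 1000 mels, and the same 1000 Hz step spans fewer mels the higher it sits in the spectrum.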
Pre-emphasis in Audio Signal Processing
Pre-emphasis is a preprocessing technique used in audio signal processing, especially in speech recognition, to artificially enhance high-frequency components of a speech signal. This is necessary because speech naturally loses energy at higher frequencies due to the physiological characteristics of the human vocal tract and the properties of sound transmission. By amplifying these frequencies:
- **Speech clarity is improved:** Boosting high frequencies makes speech details such as higher formants and consonants more discernible. These details are essential for distinguishing different sounds and words.
- **Signal quality is enhanced:** It can improve the signal-to-noise ratio at high frequencies, making the important features of speech stand out more prominently against background noise.
Pre-emphasis facilitates more effective subsequent processing stages, including feature extraction, by ensuring that key speech characteristics are preserved and highlighted.
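A common form of pre-emphasis is the first-order filter y[n] = x[n] − α·x[n−1], with α close to 1 (the Python walkthrough later in this article uses 0.97). The sketch below evaluates the gain of that filter at a few frequencies to show that low frequencies are attenuated while high frequencies are boosted; the 16 kHz sampling rate is an assumption made for illustration.

```python
import numpy as np

alpha = 0.97          # pre-emphasis coefficient (matches the later walkthrough)
sample_rate = 16000   # assumed sampling rate in Hz

def preemphasis_gain(freq_hz):
    """Magnitude response of y[n] = x[n] - alpha * x[n-1] at freq_hz."""
    omega = 2 * np.pi * freq_hz / sample_rate   # normalized angular frequency
    return float(np.abs(1 - alpha * np.exp(-1j * omega)))

for freq_hz in (0, 1000, 4000, 8000):
    print(freq_hz, round(preemphasis_gain(freq_hz), 3))
```

The gain rises from 1 − α at 0 Hz to 1 + α at the Nyquist frequency, which is the high-frequency boost described above.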
Framing the Signals
In speech processing, the continuous speech stream is divided into shorter segments called frames, typically lasting between 20 and 40 milliseconds. This segmentation is necessary because speech characteristics, such as pitch and tone, change over time. Within such a short segment the signal is approximately stable, so analyzing frame by frame lets us capture how the speech evolves.
Additionally, frames often overlap by about 50%, ensuring that no important information is missed and smoothing the transitions between segments. This overlap helps prevent discontinuities and ensures comprehensive analysis of the speech stream.
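To make the arithmetic concrete, the sketch below converts a frame duration and overlap into sample counts and works out how many frames are needed to cover a signal. The 16 kHz sampling rate, 25 ms frame, 50% overlap, and one-second signal are illustrative assumptions.

```python
import math

sample_rate = 16000   # assumed sampling rate in Hz
frame_ms = 25         # assumed frame duration in milliseconds
overlap = 0.5         # assumed fraction of each frame shared with the next

frame_length = int(sample_rate * frame_ms / 1000)   # samples per frame
frame_step = int(frame_length * (1 - overlap))      # samples between frame starts
signal_length = sample_rate * 1                     # one second of audio

# One frame at the start, plus one for every further step needed to reach the end
num_frames = 1 + math.ceil(max(signal_length - frame_length, 0) / frame_step)
print(frame_length, frame_step, num_frames)
```

If the last frame runs past the end of the signal, it is usually zero-padded so that every frame has the same length.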
Windowing
To prevent unwanted artifacts such as spectral leakage caused by the abrupt starts and ends of each frame, windowing is applied. This involves:
- **Smoothing the Edges:** Each frame is multiplied by a window function, typically a Hamming window. This tapers the edges of the frame, reducing sudden jumps in signal amplitude and minimizing the discontinuities at the frame borders.
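The Hamming window of length N is defined as w[n] = 0.54 − 0.46 · cos(2πn / (N − 1)). The sketch below builds it from that formula and checks it against NumPy's `np.hamming`; the frame length of 400 samples is an assumed value.

```python
import numpy as np

frame_length = 400   # assumed number of samples per frame

# Hamming window built directly from its definition
n = np.arange(frame_length)
window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_length - 1))

# The edges are tapered towards 0.08 while the middle stays close to 1
print(window[0], window[frame_length // 2], window[-1])
print(np.allclose(window, np.hamming(frame_length)))
```

Multiplying a frame by `window` element-wise scales its first and last samples down to about 8% of their original amplitude.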
Fast Fourier Transform (FFT)
The Fast Fourier Transform (FFT) is an efficient algorithm for computing the discrete Fourier Transform, which converts each windowed frame from the time domain into the frequency domain:
- **Frequency Content Analysis:** The Fourier Transform identifies the different frequency components within a frame, and the FFT allows this to be done quickly and efficiently.
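Each FFT output bin corresponds to a specific frequency that depends on the FFT size and the sampling rate. The sketch below lists that mapping for a real-valued input; the 16 kHz sampling rate and 512-point FFT are assumptions chosen for illustration.

```python
import numpy as np

sample_rate = 16000   # assumed sampling rate in Hz
n_fft = 512           # assumed FFT size

# Centre frequency in Hz of each bin returned by np.fft.rfft
bin_freqs = np.fft.rfftfreq(n_fft, d=1 / sample_rate)

print(len(bin_freqs))   # n_fft // 2 + 1 bins for a real-valued input
print(bin_freqs[1])     # spacing between neighbouring bins: sample_rate / n_fft
print(bin_freqs[-1])    # highest bin sits at the Nyquist frequency, sample_rate / 2
```

A larger FFT size gives finer frequency spacing at the cost of more computation per frame.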
Mel-filterbank
Once the signal is in the frequency domain, a Mel-filterbank is applied:
- **Frequency Band Separation:** This involves a set of filters, each tuned to a specific range of frequencies according to the Mel scale. The Mel-filterbank divides the FFT output into these bands and captures the energy in each band.
- **Emphasis on Important Frequencies:** The Mel-filterbank highlights frequencies that are perceptually important to human hearing, reducing the complexity of the data by focusing on relevant frequencies.
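The sketch below shows where the filters of a small Mel-filterbank would sit. Filter edges are spaced evenly in mels and then converted back to Hz, so the filters become wider as frequency increases. The 16 kHz sampling rate and the choice of 10 filters are assumptions made to keep the output short; the conversion formula is the one used in the filterbank code later in this article.

```python
import numpy as np

sample_rate = 16000   # assumed sampling rate in Hz
n_filters = 10        # assumed (small) number of filters, for readability

# Edge points: evenly spaced in mels between 0 Hz and the Nyquist frequency
mel_max = 2595 * np.log10(1 + (sample_rate / 2) / 700)
mel_points = np.linspace(0, mel_max, n_filters + 2)
hz_points = 700 * (10 ** (mel_points / 2595) - 1)

# Filter m spans hz_points[m] .. hz_points[m + 2] and peaks at hz_points[m + 1]
centers = hz_points[1:-1]
bandwidths = hz_points[2:] - hz_points[:-2]
for center, width in zip(centers, bandwidths):
    print(f"center {center:7.1f} Hz, width {width:7.1f} Hz")
```

Neighbouring filters overlap: each filter's peak coincides with the edges of the filters on either side of it.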
Log Mel-spectrum
Our perception of loudness is logarithmic rather than linear:
- **Logarithmic Compression:** Taking the logarithm of the Mel-filterbank output compresses the dynamic range of the signal. This stage creates a representation that more closely matches how humans perceive sound intensity.
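The effect of the logarithm can be seen on a handful of made-up band energies. The sketch below uses 10 · log10, the usual decibel convention for power values; this is an assumption for illustration, and any other logarithm would give the same compression up to a constant factor.

```python
import numpy as np

# Hypothetical filterbank energies spanning nine orders of magnitude
energies = np.array([1e-6, 1e-3, 1.0, 1e3])

# Guard against log(0) before compressing
safe_energies = np.maximum(energies, np.finfo(float).eps)
log_energies = 10 * np.log10(safe_energies)

print(log_energies)
```

Energies that differ by a factor of a billion end up only 90 units apart, and each tenfold increase in energy adds the same fixed amount.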
Discrete Cosine Transform (DCT)
Finally, a DCT is applied to the log Mel-spectrum:
- **Decorrelation of Filterbank Coefficients:** The DCT reduces redundancy among the filterbank coefficients, highlighting the most significant features of the sound in each frame.
- **Efficient Feature Representation:** The result is a set of coefficients known as the Mel-frequency cepstrum, which effectively captures the essential characteristics of the sound, aiding in tasks such as speech recognition and speaker identification.
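The sketch below illustrates why only the first few DCT coefficients are usually kept. It implements an orthonormal type-II DCT directly in NumPy and applies it to a made-up, smoothly varying log Mel-spectrum; the helper name `dct2_ortho`, the 40 filters, and the ramp-shaped input are all illustrative assumptions.

```python
import numpy as np

def dct2_ortho(x):
    """Orthonormal type-II DCT of a 1-D array, written out explicitly."""
    num = len(x)
    n = np.arange(num)
    k = n[:, None]   # one row of the basis matrix per output coefficient
    basis = np.sqrt(2.0 / num) * np.cos(np.pi * (n + 0.5) * k / num)
    basis[0] /= np.sqrt(2.0)   # extra scaling for the constant (k = 0) row
    return basis @ x

# Hypothetical log Mel-spectrum: energy falling smoothly from low to high bands
n_filters = 40
log_mel = np.linspace(0.0, -30.0, n_filters)

coeffs = dct2_ortho(log_mel)

# Fraction of the total energy captured by the first 12 coefficients
energy_fraction = np.sum(coeffs[:12] ** 2) / np.sum(coeffs ** 2)
print(energy_fraction)
```

Because neighbouring filterbank outputs are strongly correlated, a smooth spectrum is described almost entirely by the low-order coefficients, which is what makes truncating the DCT output a compact representation.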
How to compute MFCC?
Finally, by taking the first few coefficients from the DCT output, we obtain the MFCCs, which represent a compact and informative description of the speech signal in each frame. To calculate MFCCs, we follow these steps:
- **Pre-emphasize the signal:** Amplify higher frequencies to balance the spectrum.
- **Framing:** Break the signal into small, overlapping frames.
- **Windowing:** Apply a Hamming window to soften the edges of each frame.
- **FFT:** Convert each frame from the time domain to the frequency domain.
- **Mel-filterbank:** Apply overlapping triangular filters spaced according to the Mel-scale.
- **Logarithm:** Take the logarithm of the filterbank outputs to mimic the way the human ear responds to sound intensity.
- **DCT:** Apply the DCT to the log Mel-spectrum to obtain the Mel-frequency Cepstral Coefficients.
Calculating MFCCs from Speech Signal in Python
In this example, we'll walk through how to calculate MFCCs from a speech signal in Python. We'll use librosa for audio loading, along with numpy, scipy, and matplotlib. Lastly, we'll use ipywidgets to build a basic GUI that lets users run the computation on their own audio files.
Original Signal -> Pre-emphasis -> Framing -> Windowing -> FFT -> Mel-filterbank -> Logarithm -> DCT -> MFCCs
Step 1: Install Required Libraries
We must install the required libraries first. In your Google Colab/System environment, you can use the following commands:
!pip install numpy scipy matplotlib librosa ipywidgets requests
Step 2: Load and Visualize the Audio Signal
We'll start by loading an audio file and visualizing its waveform.
```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Load the audio file (librosa resamples to 22050 Hz by default)
audio_path = librosa.example('trumpet')
y, sr = librosa.load(audio_path)

# Plot the waveform
plt.figure(figsize=(14, 5))
plt.plot(y)
plt.title('Waveform of the Audio Signal')
plt.xlabel('Sample Index')
plt.ylabel('Amplitude')
plt.show()
```
**Output:**
Downloading file 'sorohanro_-_solo-trumpet-06.ogg' from 'https://librosa.org/data/audio/sorohanro_-_solo-trumpet-06.ogg' to '/root/.cache/librosa'.
[Figure: waveform of the audio signal]
Step 3: Pre-emphasis
Pre-emphasizing the audio signal helps to balance the spectrum by amplifying higher frequencies.
```python
# Apply pre-emphasis filter: y[n] = x[n] - 0.97 * x[n-1]
pre_emphasis = 0.97
y_preemphasized = np.append(y[0], y[1:] - pre_emphasis * y[:-1])

# Plot the pre-emphasized signal
plt.figure(figsize=(14, 5))
plt.plot(y_preemphasized)
plt.title('Pre-emphasized Signal')
plt.xlabel('Sample Index')
plt.ylabel('Amplitude')
plt.show()
```
**Output:**
[Figure: pre-emphasized signal]
Step 4: Framing
We'll break the audio signal into small frames.
```python
frame_size = 0.025    # 25 ms
frame_stride = 0.01   # 10 ms

# Convert from seconds to samples
frame_length = int(round(frame_size * sr))
frame_step = int(round(frame_stride * sr))
signal_length = len(y_preemphasized)

# Number of frames needed to cover the whole signal (at least one)
num_frames = 1 + int(np.ceil(max(signal_length - frame_length, 0) / frame_step))

# Pad signal to ensure all frames have equal number of samples
pad_signal_length = (num_frames - 1) * frame_step + frame_length
z = np.zeros(pad_signal_length - signal_length)
pad_signal = np.append(y_preemphasized, z)

# Slice the signal into frames: row i starts at sample i * frame_step
indices = np.arange(frame_length)[None, :] + frame_step * np.arange(num_frames)[:, None]
frames = pad_signal[indices]

# Plot the first frame
plt.figure(figsize=(14, 5))
plt.plot(frames[0])
plt.title('First Frame of the Signal')
plt.xlabel('Samples')
plt.ylabel('Amplitude')
plt.show()
```
**Output:**
[Figure: first frame of the signal]
Step 5: Windowing
Apply a window function to each frame to minimize discontinuities at the edges.
```python
# Apply Hamming window to every frame (broadcast across rows)
frames *= np.hamming(frame_length)

# Plot the first frame after windowing
plt.figure(figsize=(14, 5))
plt.plot(frames[0])
plt.title('First Frame after Windowing')
plt.xlabel('Samples')
plt.ylabel('Amplitude')
plt.show()
```
**Output:**
[Figure: first frame after windowing]
Step 6: Fast Fourier Transform (FFT)
Convert each frame from the time domain to the frequency domain.
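```python
# Use an FFT size that is a power of two and at least as long as a frame,
# so that np.fft.rfft zero-pads each frame instead of truncating it
NFFT = max(512, int(2 ** np.ceil(np.log2(frame_length))))
mag_frames = np.absolute(np.fft.rfft(frames, NFFT))  # Magnitude of the FFT
pow_frames = (1.0 / NFFT) * (mag_frames ** 2)         # Power spectrum

# Plot the magnitude spectrum of the first frame
plt.figure(figsize=(14, 5))
plt.plot(mag_frames[0])
plt.title('Magnitude Spectrum of the First Frame')
plt.xlabel('Frequency Bin')
plt.ylabel('Amplitude')
plt.show()
```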
**Output:**
[Figure: magnitude spectrum of the first frame]
Step 7: Apply Mel-filterbank
Apply a filterbank to the power spectra to get the energy in each Mel-frequency bin.
```python
nfilt = 40
low_freq_mel = 0
high_freq_mel = 2595 * np.log10(1 + (sr / 2) / 700)  # Convert Hz to Mel
mel_points = np.linspace(low_freq_mel, high_freq_mel, nfilt + 2)  # Equally spaced on the Mel scale
hz_points = 700 * (10 ** (mel_points / 2595) - 1)  # Convert Mel to Hz
bins = np.floor((NFFT + 1) * hz_points / sr)  # FFT bin index of each filter edge

# Build one triangular filter per row
fbank = np.zeros((nfilt, NFFT // 2 + 1))
for m in range(1, nfilt + 1):
    f_m_minus = int(bins[m - 1])  # left edge
    f_m = int(bins[m])            # center
    f_m_plus = int(bins[m + 1])   # right edge

    for k in range(f_m_minus, f_m):  # rising slope
        fbank[m - 1, k] = (k - bins[m - 1]) / (bins[m] - bins[m - 1])
    for k in range(f_m, f_m_plus):   # falling slope
        fbank[m - 1, k] = (bins[m + 1] - k) / (bins[m + 1] - bins[m])

filter_banks = np.dot(pow_frames, fbank.T)
filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks)  # Avoid log(0)
filter_banks = 20 * np.log10(filter_banks)  # Log compression

# Plot the filter bank energies
plt.figure(figsize=(14, 5))
plt.imshow(filter_banks.T, cmap='hot', aspect='auto')
plt.title('Filter Bank Energies')
plt.xlabel('Frame Index')
plt.ylabel('Filter Index')
plt.show()
```
**Output:**
[Figure: filter bank energies, filter index versus frame index]
Step 8: Discrete Cosine Transform (DCT)
Apply DCT to the filter bank energies to get the MFCCs.
```python
from scipy.fft import dct

num_ceps = 12  # Number of cepstral coefficients to keep
mfcc = dct(filter_banks, type=2, axis=1, norm='ortho')[:, :num_ceps]

# Plot the MFCCs
plt.figure(figsize=(14, 5))
plt.imshow(mfcc.T, cmap='hot', aspect='auto')
plt.title('MFCC')
plt.xlabel('Frame Index')
plt.ylabel('Cepstral Coefficient Index')
plt.show()
```
**Output:**
[Figure: MFCC, cepstral coefficient index versus frame index]
Step 9: Interactive GUI with ipywidgets
Let's create an interactive GUI where users can upload their own audio files or use a sample audio file from the web to compute MFCCs.
Load Sample Audio File from Web
We'll first demonstrate how to download a sample audio file from the web and use it to compute MFCCs.
```python
import requests

# Download a sample audio file from the web
url = 'https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav'
response = requests.get(url, timeout=30)
response.raise_for_status()  # Stop early if the download failed
sample_audio_path = 'sample_audio.wav'

with open(sample_audio_path, 'wb') as f:
    f.write(response.content)
```
Interactive GUI for Audio File Upload and MFCC Computation
```python
import io

import librosa.display
import ipywidgets as widgets
from IPython.display import display
from scipy.fft import dct

# File uploader widget
uploader = widgets.FileUpload(accept='.wav', multiple=False)

# Load and compute MFCCs for a given audio file (a path or a file-like object).
# Reuses sr, pre_emphasis, frame_length, frame_step, NFFT, fbank and num_ceps
# from the previous steps.
def compute_mfcc(file):
    # Resample to the rate used above so the frame sizes and filterbank still apply
    y, _ = librosa.load(file, sr=sr)
    y_preemphasized = np.append(y[0], y[1:] - pre_emphasis * y[:-1])

    # Framing
    signal_length = len(y_preemphasized)
    num_frames = 1 + int(np.ceil(max(signal_length - frame_length, 0) / frame_step))
    pad_signal_length = (num_frames - 1) * frame_step + frame_length
    z = np.zeros(pad_signal_length - signal_length)
    pad_signal = np.append(y_preemphasized, z)
    indices = np.arange(frame_length)[None, :] + frame_step * np.arange(num_frames)[:, None]
    frames = pad_signal[indices]

    # Windowing, FFT, Mel-filterbank, logarithm and DCT
    frames *= np.hamming(frame_length)
    mag_frames = np.absolute(np.fft.rfft(frames, NFFT))
    pow_frames = (1.0 / NFFT) * (mag_frames ** 2)
    filter_banks = np.dot(pow_frames, fbank.T)
    filter_banks = np.where(filter_banks == 0, np.finfo(float).eps, filter_banks)
    filter_banks = 20 * np.log10(filter_banks)
    mfcc = dct(filter_banks, type=2, axis=1, norm='ortho')[:, :num_ceps]

    # Plot the waveform and the MFCCs
    plt.figure(figsize=(14, 5))
    plt.subplot(2, 1, 1)
    librosa.display.waveshow(y, sr=sr)
    plt.title('Waveform')
    plt.subplot(2, 1, 2)
    plt.imshow(mfcc.T, cmap='hot', aspect='auto')
    plt.title('MFCC')
    plt.xlabel('Frame Index')
    plt.ylabel('Cepstral Coefficient Index')
    plt.tight_layout()
    plt.show()

# Handle file upload and compute MFCCs
def on_upload_change(change):
    value = uploader.value
    if not value:
        return
    # ipywidgets 7 stores uploads in a dict; ipywidgets 8 uses a tuple of dicts
    uploaded = list(value.values())[0] if isinstance(value, dict) else value[0]
    # The uploaded content is raw bytes, so wrap it in a file-like object
    compute_mfcc(io.BytesIO(bytes(uploaded['content'])))

uploader.observe(on_upload_change, names='value')

# Button to compute MFCCs for the sample audio file
sample_button = widgets.Button(description='Use Sample Audio')

def on_sample_button_click(b):
    compute_mfcc(sample_audio_path)

sample_button.on_click(on_sample_button_click)

# Display widgets
display(uploader, sample_button)
```
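**Output:**
[Figure: file upload widget and "Use Sample Audio" button, followed by the waveform and MFCC plots]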

Explanation
- **File Uploader Widget:** Allows users to upload their own .wav files.
- **Sample Button:** Computes MFCCs using a sample audio file downloaded from the web.
- **Compute MFCC Function:** Processes the audio file to calculate and display its MFCCs.
- **Visualization:** Displays the waveform and MFCCs using matplotlib.
This interactive GUI lets users either upload their own audio files or use a sample file to visualize and understand MFCC computation.
Conclusion
MFCCs are a cornerstone of speech recognition technology, providing a compact and robust way to represent speech signals. By imitating aspects of human hearing and extracting the most informative features of a sound wave, they have enabled many advances in speech recognition and other speech-based technologies. Understanding and applying MFCCs can improve the accuracy and efficiency of a wide range of audio processing applications, whether the goal is recognizing spoken commands or converting speech to text.