Automatic Speech Recognition using Whisper

Last Updated : 16 Mar, 2026

Automatic Speech Recognition (ASR) is a technology that converts spoken audio into written text. In simple terms, it lets machines interpret human speech and turn it into a readable format, much like automatically converting a voice message into text.


Why use Whisper

Whisper is a speech recognition model developed by OpenAI and is widely accessible through Hugging Face. It stands out for its performance, flexibility and ease of integration.

Implementing Automatic Speech Recognition

Step 1: Set Up the Environment

First, install the required libraries. Run the following commands one at a time in your terminal or command prompt.

```bash
pip install transformers --upgrade
pip install torch torchaudio requests
```
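Once the commands finish, it can help to confirm that the packages are visible to the interpreter that will run the rest of the steps. The following is a minimal sketch that uses only the standard library's `importlib.metadata`; the helper name `installed_versions` is illustrative and not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    report = installed_versions(["transformers", "torch", "torchaudio"])
    for name, ver in report.items():
        print(f"{name}: {ver or 'not installed'}")
```

Any package reported as "not installed" should be installed again in the same environment before continuing.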

Step 2: Import Required Modules

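This step imports the Whisper processor and model classes from transformers, along with torch and torchaudio for tensor and audio handling. These are the building blocks of the speech-to-text pipeline.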

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio
```

Step 3: Load Model and Processor

We load the pretrained Whisper Small model, developed by OpenAI, from Hugging Face.

```python
model_name = "openai/whisper-small"

processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
```

**Output:**

[Figure: Loading pretrained model]

Step 4: Download and Load Audio

This step downloads a sample audio file from Hugging Face and saves it locally. Then, torchaudio.load() reads the file and returns the audio waveform and its sampling rate. This prepares the speech input for the Whisper model.

You can also download the audio file from here

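```python
import requests

url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"

# Download the sample and stop early on an HTTP error
r = requests.get(url, timeout=30)
r.raise_for_status()

with open("sample.flac", "wb") as f:
    f.write(r.content)

# Load the file saved above (or pass the path to your own audio file)
audio, sampling_rate = torchaudio.load("sample.flac")
```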
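When the steps are re-run often, downloading the same file each time is unnecessary. The sketch below shows one possible way to reuse a local copy. It swaps `requests` for the standard library's `urllib.request` so that it has no extra dependencies, and the helper name `fetch_audio` is hypothetical, not something provided by Hugging Face or torchaudio.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_audio(url, dest, timeout=30):
    """Download url to dest unless a non-empty dest already exists; return the path."""
    dest = Path(dest)
    if dest.exists() and dest.stat().st_size > 0:
        return dest  # reuse the local copy, no network access
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    dest.write_bytes(data)
    return dest
```

The returned path can then be passed to the loader, for example `torchaudio.load(str(fetch_audio(url, "sample.flac")))`.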

Step 5: Resampling Audio to 16kHz

Whisper expects audio sampled at 16 kHz. If the loaded audio has a different sampling rate, we resample it.

```python
# Resample only when the source rate differs from Whisper's 16 kHz
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    audio = resampler(audio)
```
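Resampling changes how many samples represent the clip, not how long the clip lasts. The arithmetic can be sketched as below, assuming the output length is the input length scaled by the rate ratio and rounded up (the exact rounding is an implementation detail of the resampler). The function name and the 3-second, 44.1 kHz clip are hypothetical.

```python
def resampled_num_samples(num_samples, orig_rate, target_rate=16000):
    """Expected sample count after resampling, using integer ceiling division."""
    return (num_samples * target_rate + orig_rate - 1) // orig_rate


num_samples = 3 * 44100  # hypothetical 3-second clip recorded at 44.1 kHz
new_count = resampled_num_samples(num_samples, 44100)

print(new_count)          # 48000
print(new_count / 16000)  # 3.0 (seconds, same duration as before)
```

Comparing this number with `audio.shape[-1]` after the resampling step is a quick sanity check that the right source rate was used.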

Step 6: Preprocess Audio

The processor converts the raw audio waveform into log-Mel spectrogram features, the numerical input that the Whisper model expects.

```python
inputs = processor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)
```
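To see what the processor call above produces, it helps to work out the expected shape of `input_features`. The sketch below is plain arithmetic, not a call into the library. The constants are assumptions based on the default feature extractor settings for whisper-small (a 30-second window, a hop of 160 samples and 80 Mel bins); other checkpoints may differ.

```python
SAMPLING_RATE = 16000  # Hz, the rate Whisper expects
HOP_LENGTH = 160       # samples between spectrogram frames (assumed default)
CHUNK_SECONDS = 30     # fixed window length in seconds (assumed default)
N_MELS = 80            # Mel bins for whisper-small (assumed default)


def expected_feature_shape(batch_size=1):
    """Shape of input_features for a batch of fixed-length windows."""
    samples_per_window = CHUNK_SECONDS * SAMPLING_RATE  # 480000
    frames = samples_per_window // HOP_LENGTH           # 3000
    return (batch_size, N_MELS, frames)


print(expected_feature_shape())  # (1, 80, 3000)
```

Printing `inputs["input_features"].shape` should give a matching result if these defaults apply to the loaded checkpoint.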

Step 7: Generate Transcription

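The model generates the token IDs of the transcription from the input features. torch.no_grad() disables gradient tracking, since this is inference only.

```python
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])
```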
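With its default settings, the processor pads or truncates each input to a single 30-second window, so the generation step above transcribes at most 30 seconds of audio. For longer recordings, one simple approach is to split the waveform into consecutive windows and run each through the same steps. The sketch below only computes the window boundaries; the function name `chunk_bounds` and the 70-second recording are hypothetical, and a naive split like this can cut a word at a boundary.

```python
def chunk_bounds(num_samples, sampling_rate=16000, chunk_seconds=30):
    """Return (start, end) sample indices of consecutive windows covering the audio."""
    chunk_len = sampling_rate * chunk_seconds
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# A hypothetical 70-second recording at 16 kHz splits into three windows.
print(chunk_bounds(70 * 16000))
# [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each slice `audio[:, start:end]` can then be preprocessed, passed to `model.generate()` and decoded on its own, and the resulting texts joined in order.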

Step 8: Decode Output

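processor.batch_decode() converts the predicted token IDs back into text. Setting skip_special_tokens=True drops Whisper's control tokens from the result.

```python
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]

print("Transcription:", transcription)
```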

**Output:**

Transcription: He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered, flour-fattened sauce.
