OpenAI Whisper

Last Updated : 14 Apr, 2026

OpenAI Whisper is a speech recognition model that converts audio into text. It supports transcription, translation into English and language detection, which makes it useful for processing spoken audio in many languages.

Working of OpenAI Whisper

Whisper processes audio through multiple stages to convert speech into accurate text.

  1. **Audio Preprocessing:** The input audio is split into 30-second segments and converted into log-Mel spectrograms, which represent how the energy at each sound frequency changes over time.
  2. **Feature Extraction:** The Transformer encoder extracts acoustic and linguistic patterns from these spectrograms.
  3. **Language Identification:** If the language is not specified, the model detects it automatically.
  4. **Speech Recognition:** The decoder predicts the most likely sequence of text tokens based on the extracted features.
  5. **Translation (Optional):** Instead of transcribing in the source language, the model can translate the speech into English.
  6. **Post-processing:** The predicted tokens are decoded into readable text, with special tokens removed.
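The segmentation in the preprocessing stage can be illustrated with a small sketch. It assumes a 16 kHz sampling rate and 30-second windows, represents the waveform as a plain Python list, and uses a hypothetical helper name, `split_into_windows`, that is not part of any Whisper library.

```python
SAMPLE_RATE = 16_000          # samples per second expected by Whisper
WINDOW_SECONDS = 30           # length of one model input window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS  # 480,000 samples


def split_into_windows(samples, window=WINDOW_SAMPLES):
    """Split a waveform into fixed-size windows, zero-padding the last one."""
    windows = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        # Pad a short final chunk with silence so every window has equal length
        chunk += [0.0] * (window - len(chunk))
        windows.append(chunk)
    return windows


# 70 seconds of silence becomes three 30-second windows
waveform = [0.0] * (70 * SAMPLE_RATE)
windows = split_into_windows(waveform)
print(len(windows), len(windows[-1]))
```

In practice the library code handles this step; the sketch only shows why a long recording reaches the model as several fixed-length inputs.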

Implementation Using the OpenAI API

Step 1: Install the OpenAI library

!pip install -q openai

Step 2: Import the Library

Import the OpenAI client class and create a client with your API key by replacing "YOUR_API_KEY" in the code below. The following steps use this client object.

To learn how to get an OpenAI API key, refer to: OpenAI API Key

```python
from openai import OpenAI

# Create the client used in the following steps
client = OpenAI(api_key="YOUR_API_KEY")
```
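Hard-coding a key in source code makes it easy to leak. A minimal sketch of an alternative follows, assuming the key is stored in the `OPENAI_API_KEY` environment variable, which the `openai` library also reads by default. The helper `load_api_key` is a hypothetical name used only for illustration.

```python
import os


def load_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# Usage (requires the openai package and a real key):
#   from openai import OpenAI
#   client = OpenAI(api_key=load_api_key())
```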

Step 3: Transcribe Audio

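This step converts speech into text in the language spoken in the audio.

```python
# Open the audio file in binary mode; replace the placeholder with your file path
with open("Path to an audio file", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```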
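Before uploading, it can help to check the file locally. The sketch below assumes a 25 MB upload limit and a particular list of accepted extensions; both are assumptions to confirm against the current OpenAI documentation, and `check_audio_file` is a hypothetical helper.

```python
import os

# Assumed limits; confirm them against the current OpenAI documentation
MAX_BYTES = 25 * 1024 * 1024
ALLOWED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def check_audio_file(path):
    """Raise ValueError if the file looks unsuitable for upload."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {ext or 'none'}")
    size = os.path.getsize(path)
    if size > MAX_BYTES:
        raise ValueError(f"File is {size} bytes; the limit is {MAX_BYTES}")
    return path
```

The function returns the path unchanged, so it can wrap the argument of `open(...)` in the transcription call.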

Step 4: Translate Audio to English

This step translates speech in another language into English text.

```python
with open("audio.mp3", "rb") as audio_file:
    # Translation uses its own endpoint and always produces English text
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

print(translation.text)
```
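Since transcription and translation differ mainly in the endpoint that is called, the choice can be wrapped in one function. The sketch below assumes `client` is an `OpenAI` client object from the `openai` library; the function name `speech_to_text` and its `to_english` flag are hypothetical.

```python
def speech_to_text(client, path, to_english=False, model="whisper-1"):
    """Transcribe the audio at `path`, or translate it into English."""
    # Pick the endpoint: translations always returns English text
    if to_english:
        endpoint = client.audio.translations
    else:
        endpoint = client.audio.transcriptions
    with open(path, "rb") as audio_file:
        result = endpoint.create(model=model, file=audio_file)
    return result.text
```

For example, `speech_to_text(client, "audio.mp3", to_english=True)` would return the English translation of the recording.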

Implementation Using Hugging Face

Step 1: Set Up the Environment

First, install the required libraries. Run the following commands one by one in your command prompt.

pip install transformers --upgrade
pip install torch torchaudio

Step 2: Import Required Modules

This step sets up the foundational components required to build the speech to text pipeline.

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio
```

Step 3: Load Model and Processor

We load the pre-trained Whisper Small model developed by OpenAI from Hugging Face.

```python
model_name = "openai/whisper-small"

processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
```

Step 4: Download and Load Audio

This step downloads a sample audio file from Hugging Face and saves it locally. Then, torchaudio.load() reads the file and returns the audio waveform and its sampling rate, which prepares the speech input for the Whisper model.

```python
import requests

url = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/1.flac"

# Download the sample file and stop early if the request failed
r = requests.get(url)
r.raise_for_status()

with open("sample.flac", "wb") as f:
    f.write(r.content)

# Load the downloaded file; use your own audio file path here if needed
audio, sampling_rate = torchaudio.load("sample.flac")
```

Step 5: Resampling Audio to 16kHz

Whisper expects audio sampled at 16 kHz. If the loaded audio has a different sampling rate, we resample it.

Python `

```python
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    audio = resampler(audio)
```
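Resampling keeps the duration of the clip and changes only the number of samples. A small worked sketch of that arithmetic in plain Python follows; `expected_length` is a hypothetical helper, and the exact length returned by a resampling library may differ by a sample because of rounding.

```python
def expected_length(num_samples, orig_rate, target_rate=16_000):
    """Approximate sample count after resampling a clip to target_rate."""
    # Integer ceiling division avoids floating-point rounding surprises
    return (num_samples * target_rate + orig_rate - 1) // orig_rate


# A 2-second clip recorded at 44.1 kHz has 88,200 samples
print(expected_length(88_200, 44_100))
```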

Step 6: Preprocess Audio

The processor converts the raw audio waveform into numerical features that the Whisper model can understand.

```python
inputs = processor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt",
)
```
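For orientation, the size of the resulting features can be worked out by hand. The sketch assumes the default settings commonly used with openai/whisper-small (80 Mel bins, a hop of 160 samples, 30-second windows); treat these values as assumptions to verify against the loaded processor's configuration. The function name `feature_shape` is hypothetical.

```python
def feature_shape(batch_size=1, n_mels=80, hop_length=160,
                  sample_rate=16_000, window_seconds=30):
    """Return the expected (batch, mel bins, frames) shape of the features."""
    samples_per_window = sample_rate * window_seconds   # 480,000 samples
    frames = samples_per_window // hop_length            # one frame every 10 ms
    return (batch_size, n_mels, frames)


print(feature_shape())
```

Comparing this tuple with `inputs["input_features"].shape` is a quick sanity check that preprocessing produced what the model expects.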

Step 7: Generate Transcription

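The model generates a sequence of token IDs from the input features.

```python
# Disable gradient tracking, since we are only running inference
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])
```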

Step 8: Decode Output

The processor converts the predicted token IDs back into text.

```python
transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True,
)[0]

print("Transcription:", transcription)
```

Advantages

Applications

Limitations