Speech Recognition Using Whisper | OpenVINO GenAI

Convert and Optimize Model

Download and convert a model (e.g. openai/whisper-base) from Hugging Face to OpenVINO format:

optimum-cli export openvino --model openai/whisper-base --trust-remote-code whisper_ov

See all supported Speech Recognition Models.

info

Refer to the Model Preparation guide for detailed instructions on how to download, convert and optimize models for OpenVINO GenAI.
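
To optimize the model further, you can also compress weights during export. As an optional variant (adjust to your accuracy requirements), optimum-cli supports 8-bit weight compression via the --weight-format option:

optimum-cli export openvino --model openai/whisper-base --trust-remote-code --weight-format int8 whisper_ov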

Run Model Using OpenVINO GenAI

OpenVINO GenAI introduces WhisperPipeline for inference of Whisper speech recognition models. You can construct it directly from the folder with the converted model. The pipeline automatically loads the model, tokenizer, detokenizer, and default generation configuration.

info

WhisperPipeline expects normalized audio in WAV format with a sampling rate of 16 kHz as input.

import openvino_genai as ov_genai
import librosa

def read_wav(filepath):
    # Load audio and resample to the 16 kHz mono input Whisper expects
    raw_speech, samplerate = librosa.load(filepath, sr=16000)
    return raw_speech.tolist()

raw_speech = read_wav('sample.wav')

model_path = "whisper_ov"  # folder created by the optimum-cli export above
pipe = ov_genai.WhisperPipeline(model_path, "CPU")
result = pipe.generate(raw_speech, max_new_tokens=100)
print(result)

tip

Use CPU or GPU as the device without any other code changes.
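
For example, to run the same pipeline on a GPU, change only the device string:

pipe = ov_genai.WhisperPipeline(model_path, "GPU")
result = pipe.generate(raw_speech, max_new_tokens=100)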

Additional Usage Options

tip

Check out Python and C++ Whisper speech recognition samples.

Use Different Generation Parameters

Generation Configuration Workflow

  1. Get the model default config with get_generation_config()
  2. Modify parameters
  3. Apply the updated config using one of the following methods (illustrated in the sketch after this list):
    • Use set_generation_config(config)
    • Pass config directly to generate() (e.g. generate(raw_speech, config))
    • Specify options as inputs in the generate() method (e.g. generate(raw_speech, max_new_tokens=100))
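
The following sketch illustrates all three methods, reusing the pipe and raw_speech objects from the example above:

config = pipe.get_generation_config()
config.max_new_tokens = 100

# Method 1: set the updated config on the pipeline
pipe.set_generation_config(config)
result = pipe.generate(raw_speech)

# Method 2: pass the config object directly to generate()
result = pipe.generate(raw_speech, config)

# Method 3: specify options as keyword arguments
result = pipe.generate(raw_speech, max_new_tokens=100)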

Basic Generation Configuration

import openvino_genai as ov_genai

pipe = ov_genai.WhisperPipeline(model_path, "CPU")

# Get default configuration
config = pipe.get_generation_config()

# Modify parameters
config.max_new_tokens = 100
config.temperature = 0.7
config.top_k = 50
config.top_p = 0.9
config.repetition_penalty = 1.2

# Generate text with custom configuration
result = pipe.generate(raw_speech, config)

Understanding Basic Generation Parameters

For the full list of generation parameters, refer to the Generation Config API.

Beam search helps explore multiple possible text completions simultaneously, often leading to higher quality outputs.

import openvino_genai as ov_genai

pipe = ov_genai.WhisperPipeline(model_path, "CPU")

# Get default generation config
config = pipe.get_generation_config()

# Modify parameters
config.max_new_tokens = 256
config.num_beams = 15
config.num_beam_groups = 3
config.diversity_penalty = 1.0

# Generate text with custom configuration
result = pipe.generate(raw_speech, config)

Understanding Beam Search Generation Parameters

For the full list of generation parameters, refer to the Generation Config API.

Transcription

Whisper models can automatically detect the language of the input audio, or you can specify the language to improve accuracy:

pipe = ov_genai.WhisperPipeline(model_path, "CPU")

# Automatic language detection
raw_speech = read_wav("speech_sample.wav")
result = pipe.generate(raw_speech)

# Explicitly specify language (English)
result = pipe.generate(raw_speech, language="<|en|>")

# French speech sample
raw_speech = read_wav("french_sample.wav")
result = pipe.generate(raw_speech, language="<|fr|>")

Translation

By default, Whisper performs transcription, keeping the output in the same language as the input. To translate non-English speech to English, use the translate task:

pipe = ov_genai.WhisperPipeline(model_path, "CPU")

# Translate French audio to English
raw_speech = read_wav("french_sample.wav")
result = pipe.generate(raw_speech, task="translate")

Timestamps Prediction

Whisper can predict timestamps for each segment of speech, which is useful for synchronization or creating subtitles:

pipe = ov_genai.WhisperPipeline(model_path, "CPU")

# Enable timestamp prediction
result = pipe.generate(raw_speech, return_timestamps=True)

# Print timestamps and text segments
for chunk in result.chunks:
    print(f"timestamps: [{chunk.start_ts:.2f}, {chunk.end_ts:.2f}] text: {chunk.text}")

Long-Form Audio Processing

Whisper models are designed for audio segments up to 30 seconds in length. For longer audio, the OpenVINO GenAI Whisper pipeline automatically handles the processing using a sequential chunking algorithm ("sliding window"):

  1. The audio is divided into 30-second segments
  2. Each segment is processed sequentially
  3. Results are combined to produce the complete transcription

This happens automatically when you input longer audio files.
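
For example, a multi-minute recording (hypothetical file name below) can be passed directly; combining it with timestamps makes the 30-second windows visible:

# A file longer than 30 seconds is chunked automatically
raw_speech = read_wav("long_recording.wav")
result = pipe.generate(raw_speech, return_timestamps=True)

# Timestamps reveal how the sliding window segmented the audio
for chunk in result.chunks:
    print(f"[{chunk.start_ts:.2f}, {chunk.end_ts:.2f}] {chunk.text}")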

Using Initial Prompts and Hotwords

You can improve transcription quality and guide the model's output style by providing initial prompts or hotwords using the following parameters:

  • initial_prompt: initial prompt tokens passed as a previous transcription to the first processing window
  • hotwords: hotwords tokens passed as a previous transcription to all processing windows

Whisper models can use this context to better understand the speech and maintain a consistent writing style. However, prompts do not need to be genuine transcripts from prior audio segments; they can be used to steer the model toward particular spellings or styles:

pipe = ov_genai.WhisperPipeline(model_path, "CPU")

result = pipe.generate(raw_speech)
# He has gone and gone for good answered Paul Icrom who...

result = pipe.generate(raw_speech, initial_prompt="Polychrome")
# He has gone and gone for good answered Polychrome who...
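
The hotwords parameter described above works similarly, but the provided context is applied to all processing windows rather than only the first:

result = pipe.generate(raw_speech, hotwords="Polychrome")
# He has gone and gone for good answered Polychrome who...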

Streaming the Output

Refer to the Streaming guide for more information on streaming the output with OpenVINO GenAI.
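
As a minimal sketch, assuming generate() accepts a streamer callback as described in the Streaming guide, you can print decoded text as it is produced:

def streamer(subword):
    # Print decoded text as soon as it is available
    print(subword, end="", flush=True)
    # Return False to continue generation (newer releases use an
    # ov_genai streaming status value instead of a bool; see the Streaming guide)
    return False

result = pipe.generate(raw_speech, streamer=streamer)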