Wav2Vec2 Model

Last Updated : 14 Apr, 2026

Wav2Vec2 is a self-supervised learning model designed for speech recognition. It learns meaningful representations directly from raw audio using large amounts of unlabeled data, and can later be fine-tuned for tasks such as transcription with minimal labeled data.

Architecture of Wav2Vec2 Model

Figure: Architecture of Wav2Vec2

1. Feature encoder

The feature encoder is the first component of Wav2Vec2 and the only one that sees the raw audio. It passes the waveform through a stack of 1-D convolutional layers and outputs a sequence of latent feature vectors, roughly one for every 20 ms of audio.

Figure: Feature Encoder of Wav2Vec2
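To make the downsampling done by the feature encoder concrete, the sketch below computes how many feature frames come out for a given number of input samples. It assumes the kernel sizes and strides of the base Wav2Vec2 configuration and unpadded convolutions; the function names are illustrative and not part of any library.

```python
# Kernel sizes and strides of the seven 1-D convolution layers,
# assumed here to match the base Wav2Vec2 configuration.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Return how many feature frames are produced for num_samples of audio."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


def total_stride() -> int:
    """Product of all strides: input samples between consecutive frames."""
    hop = 1
    for stride in CONV_STRIDES:
        hop *= stride
    return hop


one_second = 16000  # samples at 16 kHz
print(num_output_frames(one_second))  # 49
print(total_stride())                 # 320
```

With these assumed settings, one second of 16 kHz audio becomes 49 frames, and consecutive frames are 320 samples (20 ms) apart.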

2. Transformer Encoder (Context Network)

The Transformer encoder takes the sequence produced by the feature encoder and uses self-attention to relate each frame to every other frame in the utterance. The result is a context-aware representation for each time step.
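The idea of relating frames to each other over time can be illustrated with a bare-bones self-attention step. This is only a sketch: it uses a single head with identity projections, whereas the real model uses learned query, key and value projections, several heads and many stacked layers. The toy frame values are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head self-attention with identity projections (illustrative only)."""
    dim = len(frames[0])
    contextualized, all_weights = [], []
    for query in frames:
        # Scaled dot-product similarity of this frame to every frame.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        weights = softmax(scores)
        # New representation: attention-weighted mix of all frames.
        mixed = [
            sum(w * frame[i] for w, frame in zip(weights, frames))
            for i in range(dim)
        ]
        contextualized.append(mixed)
        all_weights.append(weights)
    return contextualized, all_weights


# Three toy 2-dimensional "frames" (hypothetical values).
toy_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
context, weights = self_attention(toy_frames)
print(weights[0])  # attention of frame 0 over frames 0, 1 and 2
```

In this toy input, frame 0 attends more strongly to the similar frame 1 than to the dissimilar frame 2, which is the behaviour that lets each output frame draw on related parts of the utterance.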

3. Quantization module

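The quantization module converts continuous audio features into discrete representations that act like speech units. During pretraining, these discrete units serve as the targets the model learns to identify for masked time steps.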

Figure: Wav2Vec2 quantization process
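The mapping from a continuous vector to discrete units can be sketched with a small product quantizer: the feature vector is split into chunks, and each chunk is replaced by one entry from its own codebook. Note the simplification: this sketch picks the nearest entry by Euclidean distance, while Wav2Vec2 itself selects entries with a Gumbel-softmax over learned logits. The codebooks and the input vector below are hypothetical.

```python
def nearest_entry(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def quantize(feature, codebooks):
    """Split feature into one chunk per codebook and pick an entry for each chunk."""
    chunk = len(feature) // len(codebooks)
    indices, quantized = [], []
    for g, codebook in enumerate(codebooks):
        part = feature[g * chunk:(g + 1) * chunk]
        idx = nearest_entry(part, codebook)
        indices.append(idx)
        # The quantized vector is the concatenation of the chosen entries.
        quantized.extend(codebook[idx])
    return indices, quantized


# Two hypothetical codebooks with three 2-dimensional entries each.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.5, 0.5], [0.0, -1.0], [2.0, 0.0]],
]
indices, quantized = quantize([0.9, 1.2, 0.1, -0.8], toy_codebooks)
print(indices)    # [1, 1]
print(quantized)  # [1.0, 1.0, 0.0, -1.0]
```

The pair of indices is the discrete "speech unit" for that frame; combining several small codebooks gives many possible units without storing each one separately.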

Implementation

Step 1: Install Libraries

Install the libraries required for audio processing and model usage.

!pip install transformers datasets torch -q

Step 2: Import Libraries

import torch
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

Step 3: Loading Dataset and Preprocessing

Load the validation split of the small LibriSpeech dummy dataset and resample its audio to 16 kHz, the sampling rate Wav2Vec2 expects.

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy",
    "clean",
    split="validation",
)

# Resample every clip to 16 kHz when it is accessed.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

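Step 4: Load Lightweight Wav2Vec2 Model

Load the processor and model from the facebook/wav2vec2-base-960h checkpoint, which has been fine-tuned for transcription with a CTC head. The pretrained-only facebook/wav2vec2-base checkpoint does not ship with a tokenizer, so it cannot produce text without further fine-tuning.

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")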

Output:

Figure: Output

Step 5: Select an Audio Sample

Extract the raw waveform of the first example in the dataset.

sample = dataset[0]
audio_input = sample["audio"]["array"]

Step 6: Convert Audio to Model Input

inputs = processor(audio_input, sampling_rate=16000, return_tensors="pt")
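Besides packing the waveform into a tensor, the processor typically rescales it before it reaches the model. The sketch below shows that kind of zero-mean, unit-variance normalization in plain Python, assuming normalization is enabled for the checkpoint. The function name, the epsilon value and the sample values are illustrative, not taken from the library.

```python
def normalize_waveform(samples, eps=1e-7):
    """Scale a waveform to zero mean and (approximately) unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    # eps guards against division by zero for silent clips.
    scale = (variance + eps) ** 0.5
    return [(s - mean) / scale for s in samples]


toy_audio = [0.2, -0.1, 0.4, 0.0, -0.5]  # hypothetical samples
normalized = normalize_waveform(toy_audio)
print(normalized)
```

Normalizing this way makes the model input independent of the recording's overall loudness and offset.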

Step 7: Run the Model

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    logits = model(**inputs).logits

Step 8: Decode Output to Text

# Pick the most likely token at every time step.
predicted_ids = torch.argmax(logits, dim=-1)

# Collapse the per-frame ids into text.
transcription = processor.batch_decode(predicted_ids)

print("Predicted Text:", transcription[0])
print("Actual Text:", sample["text"])
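What the decoding step does with the per-frame ids can be sketched as greedy CTC decoding: merge consecutive repeats, drop the blank token, and turn the word delimiter into spaces. The vocabulary below is a hypothetical toy one, not the checkpoint's actual vocabulary, and the function is an illustration rather than the library's implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank token.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "I", 4: "L", 5: "O", 6: "E"}
BLANK_ID = 0
WORD_DELIMITER = "|"


def ctc_greedy_decode(frame_ids):
    """Collapse repeated ids, drop blanks, and map the rest to characters."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    chars = [TOY_VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids: H H _ E L _ L L O | H I I
frame_ids = [2, 2, 0, 6, 4, 0, 4, 4, 5, 1, 2, 3, 3]
print(ctc_greedy_decode(frame_ids))  # HELLO HI
```

The blank between the two L runs is what keeps the double letter in HELLO; without it the repeats would be merged into a single L.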

Output:

Figure: Output


Applications

Limitations