Audio Classification using Hugging Face

Last Updated : 14 Apr, 2026

Audio classification using models from Hugging Face enables developers to automatically categorize audio data into predefined classes such as speech, music, emotions or environmental sounds. With pre-trained transformer and audio models, it becomes easy to build audio‑based AI applications without training from scratch.

[Figure: Audio Classification]

Working

  1. **Waveform Input:** The raw audio signal is captured as a waveform representing sound intensity over time.
  2. **Feature Extraction:** The waveform is converted into numerical features (like spectrograms) so the model can process it mathematically.
  3. **Pattern Learning:** The model analyzes patterns across time and frequency to understand the characteristics of the sound.
  4. **Prediction:** It outputs the most probable sound category based on learned patterns.
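The first two steps can be illustrated with a toy example. The sketch below uses only the standard library: it builds a synthetic 440 Hz waveform and computes a naive discrete Fourier transform of one frame to find its dominant frequency. The 16 kHz rate, the 400-sample frame and all function names are assumptions made for illustration; real feature extractors compute mel spectrograms over many overlapping frames with far faster algorithms.

```python
import cmath
import math

SAMPLE_RATE = 16000  # samples per second (assumed for this toy example)
FRAME_SIZE = 400     # 25 ms of audio at 16 kHz (assumed frame length)


def sine_wave(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Step 1: a waveform is just a list of amplitudes over time."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def magnitude_spectrum(frame):
    """Step 2: naive DFT magnitudes for bins 0..N/2 (slow but explicit)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def dominant_frequency(frame, sample_rate=SAMPLE_RATE):
    """Return the frequency (Hz) of the strongest bin in the frame."""
    mags = magnitude_spectrum(frame)
    peak_bin = max(range(len(mags)), key=mags.__getitem__)
    return peak_bin * sample_rate / len(frame)


frame = sine_wave(440.0, FRAME_SIZE)
peak_hz = dominant_frequency(frame)
print(f"Dominant frequency: {peak_hz:.1f} Hz")
```

Stacking such spectra for successive frames gives a time-frequency image, which is the kind of input a spectrogram-based model learns patterns from.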

Implementation

Step 1: Set Up the Environment

First, install the required libraries. Run the following command in your command prompt.

pip install transformers torch torchaudio requests
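To confirm that the installation succeeded, one option is to check whether Python can locate each package. This is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and not part of any of the libraries above.

```python
import importlib.util

REQUIRED = ["transformers", "torch", "torchaudio", "requests"]


def check_packages(names):
    """Map each package name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

Any package reported as missing can be installed again with the pip command above.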

Step 2: Import Required Libraries

```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
import torch
import torchaudio
import requests
```

Step 3: Download and Save the Audio File

This step downloads a sample clip from the ESC-50 dataset repository on GitHub and saves it locally as sample.wav.

```python
url = "https://github.com/karolpiczak/ESC-50/raw/master/audio/1-100032-A-0.wav"
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed

with open("sample.wav", "wb") as f:
    f.write(r.content)

print("Audio downloaded.")
```

Step 4: Load the Model

In this step, we load MIT/ast-finetuned-audioset-10-10-0.4593, an Audio Spectrogram Transformer (AST) model developed at the Massachusetts Institute of Technology and fine-tuned on AudioSet. It is a general-purpose audio event classifier whose 527 labels cover speech, music, animal sounds and environmental sounds.

```python
model_name = "MIT/ast-finetuned-audioset-10-10-0.4593"

feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name)
```

**Output:**

[Figure: Loading pretrained model]

Step 5: Load Audio File

```python
# audio is a tensor of shape (channels, samples); sampling_rate is in Hz
audio, sampling_rate = torchaudio.load("sample.wav")
```
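Before loading a clip it can help to inspect its header, since the sampling rate and channel count determine the preprocessing in the following steps. The sketch below uses only the standard-library `wave` module, which handles uncompressed PCM WAV files. To stay self-contained it writes a synthetic one-second tone to a temporary file as a stand-in for sample.wav; `write_tone` and `describe_wav` are hypothetical helper names.

```python
import math
import os
import struct
import tempfile
import wave


def write_tone(path, freq_hz=440.0, seconds=1.0, sample_rate=44100):
    """Write a mono 16-bit PCM sine tone (a stand-in for a real recording)."""
    n_frames = int(seconds * sample_rate)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * n / sample_rate))
        for n in range(n_frames)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))


def describe_wav(path):
    """Read basic properties from a WAV header without decoding the audio."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_rate": wf.getframerate(),
            "num_frames": wf.getnframes(),
            "duration_s": wf.getnframes() / wf.getframerate(),
        }


tmp_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_tone(tmp_path)
info = describe_wav(tmp_path)
print(info)
```

Calling `describe_wav("sample.wav")` on the downloaded clip would report its properties in the same way, provided the file is uncompressed PCM.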

Step 6: Resample the Audio

The model expects audio sampled at 16 kHz. If the clip has a different sampling rate, it is resampled to 16 kHz.

```python
if sampling_rate != 16000:
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    audio = resampler(audio)
```
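To see what resampling does to the signal, here is a deliberately simplified sketch in plain Python. It uses linear interpolation, which is not the method torchaudio applies, so treat it only as an illustration of how the number of samples changes with the rate. The function name `resample_linear` and the ramp signal are assumptions for the example.

```python
import math


def resample_linear(samples, orig_rate, target_rate):
    """Resample a list of amplitudes by linear interpolation (toy version)."""
    if orig_rate == target_rate:
        return list(samples)
    n_out = math.ceil(len(samples) * target_rate / orig_rate)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * orig_rate / target_rate  # fractional index into the input
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# 441 samples at 44.1 kHz cover 10 ms; at 16 kHz the same 10 ms needs 160.
ramp = [float(n) for n in range(441)]
resampled = resample_linear(ramp, 44100, 16000)
print(len(ramp), "->", len(resampled))
```

The duration of the audio is unchanged; only the number of samples used to represent it shrinks in proportion to the rate.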

Step 7: Preprocess Audio

```python
# Average the channels to get a 1-D mono signal (a no-op for a mono clip)
inputs = feature_extractor(
    audio.mean(dim=0).numpy(),
    sampling_rate=16000,
    return_tensors="pt",
)
```

Step 8: Run Inference

```python
with torch.no_grad():
    outputs = model(**inputs)
```

Step 9: Get Predictions

```python
logits = outputs.logits
probs = torch.nn.functional.softmax(logits, dim=1)

top_k = torch.topk(probs, k=5)

for score, idx in zip(top_k.values[0], top_k.indices[0]):
    label = model.config.id2label[idx.item()]
    print(f"{label}: {score.item():.4f}")
```
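The post-processing in this step can be reproduced in plain Python to show what it computes. The sketch below applies a numerically stable softmax to a handful of made-up logits and picks the highest-scoring entries. The logits and the four-entry `id2label` mapping are hypothetical and much smaller than the model's real label set.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Hypothetical labels and logits, for illustration only
id2label = {0: "Speech", 1: "Music", 2: "Bark", 3: "Rain"}
logits = [0.5, -1.0, 3.0, 0.0]

probs = softmax(logits)
best = top_k(probs, k=2)
for idx, p in best:
    print(f"{id2label[idx]}: {p:.4f}")
```

Because softmax preserves the ordering of the logits, the top-ranked labels are the same whether the ranking is done on logits or on probabilities; softmax only makes the scores easier to read.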

**Output:**

[Figure: Dog Bark Detected]
