GitHub - wheevu/nemo-vietnamese-asr: Pipeline for extracting, validating, and training Vietnamese ASR models with NVIDIA NeMo.
Pipeline for harvesting, validating, and training Vietnamese ASR models with NVIDIA NeMo.
This repo focuses on data preparation, validation, and training. Serving an ASR model in production is a separate step. This separation follows the architecture defined in NVIDIA’s NIM Microservices coursework.
Deep dive
This project uses a Local-to-Cloud workflow: development happens on a standard laptop (Mac/Windows), and cloud GPUs are only paid for when actual training is required.
- Local Data Prep (CPU): Doing the "Heavy Lifting" of downloading and processing audio locally.
- Cloud Training (GPU): Uploading the clean data to Google Colab to run the actual NeMo training.
- Strict Validation: Prioritizing manual transcripts over auto-generated ones to make sure the model learns from high-quality data.
Scope & Non-Goals
- Goal: Build a clean dataset and a training pipeline.
- Non-Goal: A live API endpoint or a web app.
- Constraint: Designed to prevent "Out of Memory" (OOM) errors on T4 GPUs by chopping audio into 30-second chunks.
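The 30-second constraint can be sketched as a helper that computes chunk boundaries for a clip of known duration. This is a minimal sketch, not the repo's actual code: the name `chunk_bounds` and the decision to keep a shorter final chunk are assumptions made for illustration.

```python
from __future__ import annotations


def chunk_bounds(duration_s: float, chunk_s: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds, each span at most chunk_s long.

    Hypothetical helper: the last chunk is kept even if it is shorter.
    """
    if duration_s <= 0 or chunk_s <= 0:
        return []
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    # A 75-second clip is covered by two full chunks and one shorter chunk.
    for start, end in chunk_bounds(75.0):
        print(f"{start:6.1f}s -> {end:6.1f}s")
```

Capping the clip length bounds the size of the activations the model has to hold for a single utterance, which is what keeps memory use predictable on a T4.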
Project Structure
nemo-vietnamese-asr/
├── src/
│   └── yt_harvester/        # The Tool: Downloads & Cleans Data
│       ├── main.py          # Entry point
│       ├── downloader.py    # Logic to fetch YouTube video/audio
│       └── processor.py     # Logic to analyze text (sentiment)
├── audio/                   # Output: Clean 16kHz WAV files
├── transcripts/             # Output: Clean text files
├── prepare_data.py          # Script: Generates NeMo manifest files
├── benchmark.py             # Script: Tests model speed (FPS/WER)
├── NVIDIA_NeMo_ASR.ipynb    # Notebook: Run this in Google Colab
└── tests/                   # Quality Assurance (QA)
Step 1: Local Data Engineering
The tool inside src/yt_harvester (reused legacy code) turns messy YouTube links into a clean dataset.
What it does
- Best Transcript First: It looks for a human-written transcript. If none exists, it falls back to auto-generated captions.
- Audio Formatting: It automatically converts audio to 16kHz Mono WAV, which is the standard required by NeMo models.
- Smart Skipping: If run twice, it skips files that were already downloaded to save time.
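The three behaviours above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in src/yt_harvester: the function names are hypothetical, and the only external facts relied on are ffmpeg's standard `-ar` (sample rate) and `-ac` (channel count) options.

```python
from __future__ import annotations

from pathlib import Path


def pick_transcript(manual: str | None, auto: str | None) -> tuple[str, str] | None:
    """Prefer a human-written transcript; fall back to auto-generated captions."""
    if manual and manual.strip():
        return ("manual", manual)
    if auto and auto.strip():
        return ("auto", auto)
    return None  # nothing usable for this video


def build_ffmpeg_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command that converts src to 16 kHz mono audio at dst."""
    return [
        "ffmpeg", "-y",      # overwrite the output without prompting
        "-i", str(src),
        "-ar", "16000",      # resample to 16 kHz
        "-ac", "1",          # downmix to mono
        str(dst),
    ]


def needs_processing(dst: Path) -> bool:
    """Smart skipping: process only when the output is missing or empty."""
    return not (dst.is_file() and dst.stat().st_size > 0)


if __name__ == "__main__":
    out = Path("audio/VIDEO_ID.wav")
    if needs_processing(out):
        print(" ".join(build_ffmpeg_cmd(Path("VIDEO_ID.m4a"), out)))
```

The command list is meant to be passed to `subprocess.run(cmd, check=True)`; it is only printed here so the sketch runs without ffmpeg installed.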
How to use it
1. Process a single video
python -m src.yt_harvester "https://www.youtube.com/watch?v=VIDEO_ID"
2. Process a list of videos
python -m src.yt_harvester --bulk links.txt --workers 4
3. Create the training files (Manifests)
python prepare_data.py --seed 42
The Validation Layer (prepare_data.py)
Before sending data to the GPU, this script checks the work. It removes empty files, normalizes the text (lowercase), and splits the data into Train/Validation/Test sets (80/10/10).
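As a rough sketch of what such a validation step involves, the code below drops empty transcripts, lowercases the text, performs a seeded three-way 80/10/10 split, and serializes entries as JSON lines. The keys `audio_filepath`, `duration`, and `text` follow the NeMo ASR manifest convention; the function names and the exact cleaning rules are assumptions, not the contents of prepare_data.py.

```python
import json
import random


def clean_text(text: str) -> str:
    """Lowercase and collapse whitespace; diacritics are left untouched."""
    return " ".join(text.lower().split())


def build_entries(records):
    """records: iterable of (audio_path, duration_seconds, transcript)."""
    entries = []
    for path, duration, text in records:
        text = clean_text(text)
        if not text or duration <= 0:
            continue  # skip empty transcripts and zero-length audio
        entries.append({"audio_filepath": path, "duration": duration, "text": text})
    return entries


def split_entries(entries, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Shuffle deterministically, then cut into three consecutive slices."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_first = int(len(shuffled) * ratios[0])
    n_second = int(len(shuffled) * ratios[1])
    return (
        shuffled[:n_first],
        shuffled[n_first:n_first + n_second],
        shuffled[n_first + n_second:],
    )


def to_manifest_lines(entries) -> str:
    """One JSON object per line; ensure_ascii=False keeps Vietnamese readable."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in entries)
```

Passing the seed to a private `random.Random` instance, rather than seeding the global generator, keeps the split reproducible without affecting other code that uses `random`.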
Step 2: Cloud Workflow (Google Colab)
Open NVIDIA_NeMo_ASR.ipynb in Google Colab to handle the GPU work.
Workflow Logic
- Load Data: Unzips the dataset directly to the Colab disk for speed.
- Load Model: Downloads the stt_en_conformer_ctc_large model from NVIDIA.
- Evaluate: Runs the model on the Vietnamese data.
- Note: Since I am using an English model on Vietnamese audio without fine-tuning, the accuracy will be low (high WER). This proves the pipeline works before spending hours fine-tuning.
Results & Analysis
I performed a "Zero-shot" test (running the English model on Vietnamese audio).
- Result: The model attempts to map Vietnamese sounds to English words phonetically.
| Original Vietnamese Audio | Model Transcription (English Phonetics) | Analysis |
|---|---|---|
| "Giang Ơi Radio" | "the radio" | Recognized the English loanword |
| "Chào bạn" | "ta bak" | Acoustic approximation (sounds similar) |
Conclusion: The pipeline successfully feeds audio to the model. The next logical step is Transfer Learning: freezing the model's "ear" (Encoder) and retraining its "brain" (Decoder) to understand Vietnamese text.
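To make "high WER" concrete, here is a dependency-free sketch of word error rate: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The Dependencies list names jiwer for this job; the function below is a stand-in written for illustration, and it assumes both strings are already lowercased.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # The two phrase pairs from the results table, lowercased.
    print(word_error_rate("giang ơi radio", "the radio"))
    print(word_error_rate("chào bạn", "ta bak"))
```

For "chào bạn" against "ta bak", both words are substituted, so every reference word counts as an error.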
Optimizing for Speed (Quantization)
To make the model run faster on smaller GPUs (like the free Colab T4), I use Quantization (following “Quantization Fundamentals with Hugging Face”). This reduces the numeric precision of the model's weights and arithmetic from 32-bit floating point (float32) to 16-bit (float16), which halves the storage per weight and speeds up inference.
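The trade-off can be illustrated without a GPU. The sketch below stores one number in 32-bit and 16-bit floating point using Python's struct module (format characters `f` and `e`) and compares size and rounding error. The sample value is arbitrary. In PyTorch the equivalent conversion of a whole model is typically done with `model.half()`; whether benchmark.py does it that way is an assumption.

```python
import struct


def roundtrip(value: float, fmt: str) -> float:
    """Store value in the given struct float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


weight = 0.1234567                 # arbitrary example value
fp32 = roundtrip(weight, "<f")     # IEEE 754 single precision
fp16 = roundtrip(weight, "<e")     # IEEE 754 half precision

print("bytes per value:", struct.calcsize("<f"), "vs", struct.calcsize("<e"))
print("float32 rounding error:", abs(fp32 - weight))
print("float16 rounding error:", abs(fp16 - weight))
```

Half precision stores each value in half the bytes at the cost of a larger rounding error, which is why accuracy should be re-checked after the conversion.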
Benchmark Results (Colab T4 GPU)
I wrote benchmark.py to measure exactly how much faster the optimized model is.
| Precision | Speed (Latency) | VRAM Usage | Notes |
|---|---|---|---|
| float32 (Standard) | 151 ms/file | ~731 MB | Baseline speed. |
| float16 (Optimized) | 89 ms/file | ~888 MB | ~41% lower latency (about 1.7x faster). Recommended for T4. |
| int8 | N/A | ~166 MB | Currently incompatible with this model type. |
To run this benchmark yourself in Colab:
python benchmark.py --model stt_en_conformer_ctc_large --manifest val_manifest.json
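A per-file latency figure like the ones in the table can be obtained with a loop along these lines. This is a generic timing sketch, not the contents of benchmark.py: `mean_latency_ms`, `latency_reduction`, and the stand-in workload are hypothetical. When timing real GPU inference with PyTorch, `torch.cuda.synchronize()` should be called before reading the clock, because CUDA kernels run asynchronously.

```python
import statistics
import time


def mean_latency_ms(fn, items, warmup=1):
    """Average wall-clock time of fn(item) in milliseconds.

    items must be a list; the first `warmup` items are run once beforehand
    so that one-time setup costs do not distort the measurement.
    """
    for item in items[:warmup]:
        fn(item)
    timings = []
    for item in items:
        start = time.perf_counter()
        fn(item)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


def latency_reduction(baseline_ms: float, optimized_ms: float) -> float:
    """Fraction of latency removed relative to the baseline."""
    return 1.0 - optimized_ms / baseline_ms


if __name__ == "__main__":
    fake_files = list(range(5))
    # Stand-in workload; a real run would call the model's transcribe step.
    print(mean_latency_ms(lambda _: sum(range(10_000)), fake_files))
    # Figures from the table above: 151 ms/file vs 89 ms/file.
    print(f"{latency_reduction(151, 89):.1%}")
```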
Testing & Quality Assurance
Bad data ruins training runs. Following "Testing Machine Learning Systems: Code, Data and Models" by Made With ML, I included a test suite that catches errors before they crash the training script.
What I test
- Text Processing: Does the code handle YouTube URLs correctly? Does it preserve Vietnamese diacritics (accents)?
- Data Integrity: Are the audio files actually 16kHz mono? Do the JSON manifests point to real files?
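The data integrity checks can be written with the standard library alone. The sketch below is an assumption about what such checks might look like, not a copy of the files under tests/: the helper names are hypothetical, and it reads WAV headers with the built-in wave module rather than soundfile.

```python
from __future__ import annotations

import json
import wave
from pathlib import Path


def is_16khz_mono(wav_path) -> bool:
    """True if the WAV header reports a 16000 Hz sample rate and one channel."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1


def missing_audio_files(manifest_path) -> list[str]:
    """Return every audio_filepath in a JSON-lines manifest that does not exist."""
    missing = []
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            if not Path(entry["audio_filepath"]).is_file():
                missing.append(entry["audio_filepath"])
    return missing
```

In a pytest suite, each helper would sit behind a single assertion, for example `assert missing_audio_files("train_manifest.json") == []`, where the manifest name is a placeholder.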
How to run tests
Run all tests
pytest tests/ -v
Example Output
tests/test_data_integrity.py::TestAudioFormatCompliance::test_audio_sample_rate_is_16khz PASSED
tests/test_text_processing.py::TestCleanCaptionLines::test_vietnamese_diacritics_preserved PASSED
Continuous Integration (CI)
I use GitHub Actions to automatically run these tests every time code is pushed. This ensures that a code change doesn't accidentally break the data processing pipeline.
Pipeline: Checkout Code -> Install Audio Libs -> Run pytest.
Dependencies
- Local: yt-dlp (Downloading), ffmpeg (Audio conversion), textblob (Analysis), soundfile.
- Cloud: nemo_toolkit[all], pytorch-lightning, jiwer (Error rate calculation).
- Testing: pytest, pytest-cov.
Future Work: Fine-Tuning Strategy
Now that the pipeline is validated, the next steps for high-accuracy Vietnamese ASR are:
- Select Model: Switch to stt_en_conformer_ctc_small for faster training.
- Fine-Tune: Freeze the Encoder, retrain the Decoder on the Vietnamese corpus.
- Tokenizer: Replace the English tokenizer with a Vietnamese character-based tokenizer.
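As a first step toward the tokenizer item, a character vocabulary can be collected directly from the cleaned transcripts. The sketch below is one possible approach, offered as an assumption: it normalizes text to Unicode NFC so that a base letter plus a combining accent and the precomposed letter map to the same symbol, then keeps letters and the space character. How the resulting vocabulary is attached to the NeMo model is not shown.

```python
import unicodedata


def build_char_vocab(transcripts):
    """Collect the sorted set of letters (plus space) used in the transcripts."""
    chars = set()
    for text in transcripts:
        # NFC merges "a" + combining grave accent into the single letter "à".
        normalized = unicodedata.normalize("NFC", text.lower())
        chars.update(ch for ch in normalized if ch.isalpha() or ch == " ")
    return sorted(chars)


if __name__ == "__main__":
    print(build_char_vocab(["Chào bạn", "Giang Ơi Radio"]))
```

Without the normalization step, the same Vietnamese letter could appear in the vocabulary under two different encodings, which would split its training examples across two output symbols.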
