How to download with a script the full CommonVoice21 dataset?

Hello,

I would like to download the entire CommonVoice 21 dataset for all languages, in order to train and test different models for a speaker verification task, but I have to manually download them via the link for each dataset. However, I would need to be able to run a script to download all the datasets at once. Is there a way to access all the data at once? I saw on HuggingFace that it was possible to download all the data for CommonVoice 17, but I would like CommonVoice 21.

Have a nice day, thank you in advance.

Emrah_Sarisoy (Emrah Sarısoy) April 27, 2025, 12:22pm 2

Same challenge here. I even tried using the link behind the download button (on the CV dataset download website) with the GCS bucket transfer-in function, but with no success. This is quite annoying, especially when working with cloud GPU instances.

Hi @MangoH and @Emrah_Sarisoy,

Unfortunately, we do not currently have a mechanism for downloading all the language datasets at once, as the most common use case is for researchers or ML engineers to work with a specific language or a small group of languages. Alternative download mechanisms, such as an API, are a feature we are considering for the future, as this is a frequent request. However, as a very small team, it may be some time before we’re able to deliver this functionality.

Unfortunately, the copies of Common Voice data hosted on Hugging Face are managed and controlled by Hugging Face, not by the Common Voice team. It might be worth asking on the Hugging Face forum whether Hugging Face has any plans to make CV 18, 19, 20 and 21 available on its platform - https://discuss.huggingface.co/

If you are working with speaker verification models, there are some characteristics of the Common Voice dataset that you may need to be aware of, if you are not already. Firstly, logging in with a particular profile is not mandatory, so the same speaker may have contributions ascribed to multiple client_id values, for example if they contributed while logged in and then contributed again while logged out. Another edge case to consider is that contributions by multiple speakers may sometimes have been made under one client_id value, for example during a contribution festival. These may need to be flagged as limitations for any speaker verification study.
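As a rough illustration of the second edge case, here is a minimal Python sketch that counts clips per client_id and lists the values with unusually many clips so they can be reviewed by hand. It assumes a tab-separated metadata file with a `client_id` column (the file name `validated.tsv` in the usage note is an assumption about the release layout), and `max_clips` is an arbitrary, hypothetical threshold. A high count does not prove that several speakers shared one client_id, and this check cannot detect one speaker spread across several client_id values.

```python
import csv
from collections import Counter


def clips_per_client(tsv_path):
    """Count how many clips are attributed to each client_id in a TSV file."""
    counts = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE: treat quote characters inside sentences as plain text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            counts[row["client_id"]] += 1
    return counts


def flag_heavy_contributors(counts, max_clips):
    """Return client_id values with more than max_clips clips, for manual review."""
    return sorted(cid for cid, n in counts.items() if n > max_clips)
```

For example, `flag_heavy_contributors(clips_per_client("validated.tsv"), max_clips=5000)` would give a list of client_id values to inspect before building verification trials.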

Thanks for your answer. In fact, I meant for a specific language.

No, currently the datasets are only available from the “Download” link on the platform.

I had the same issue, so I hope this repo may help some people: GitHub - cjweaver/common-voice-utils: Utilities for working with the Common Voice dataset

The script common_voice_downloader.py uses Selenium to automate the download process for Common Voice datasets by:

  1. Scraping dataset URLs from the Common Voice website
  2. Creating language-specific directories
  3. Downloading datasets with resume capability (allows downloading on another machine)
  4. Optionally extracting archives
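
To give an idea of how steps 2 and 3 can work in general, here is a minimal standard-library sketch of a resumable download into a per-language directory. It is not the code from the repo above. The function names are hypothetical, the URL must be a download link you obtained yourself, and it assumes the server honours HTTP `Range` requests; if it does not, the file is simply fetched again from the start.

```python
import os
import urllib.error
import urllib.request


def resume_offset(dest_path):
    """Return how many bytes of dest_path are already on disk (0 if absent)."""
    return os.path.getsize(dest_path) if os.path.exists(dest_path) else 0


def download_with_resume(url, dest_path, chunk_size=1 << 20):
    """Download url to dest_path, continuing a partial file when possible."""
    # Create the language-specific directory if it does not exist yet.
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    offset = resume_offset(dest_path)
    request = urllib.request.Request(url)
    if offset:
        # Ask the server for the remaining bytes only.
        request.add_header("Range", f"bytes={offset}-")
    try:
        with urllib.request.urlopen(request) as response:
            # 206 means the Range header was honoured, so append;
            # any other status means the full file is being sent, so overwrite.
            mode = "ab" if response.status == 206 else "wb"
            with open(dest_path, mode) as out:
                while True:
                    chunk = response.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
    except urllib.error.HTTPError as err:
        # 416 (Range Not Satisfiable) usually means the file is already complete.
        if err.code != 416:
            raise
    return os.path.getsize(dest_path)
```

Calling `download_with_resume(url, "cv21/en/archive.tar.gz")` a second time after an interrupted run would continue from the bytes already saved; the path here is only an example.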