What is the purpose of train-all.csv
I am following this doc to train my own English model using CommonVoice data:
https://deepspeech.readthedocs.io/en/r0.9/TRAINING.html
After running this command:
bin/import_cv2.py --filter_alphabet path/to/some/alphabet.txt /path/to/extracted/language/archive
these files are generated:
clips/dev.csv
clips/test.csv
clips/train.csv
clips/train-all.csv
Then the next step is to train the model using clips/dev.csv, clips/test.csv and clips/train.csv.
Why don’t we use clips/train-all.csv as training data? This file has a lot more data than clips/train.csv and also comes from the validated dataset, so I think it should produce a better model. But the doc does not mention this file at all.
Also, was the DeepSpeech pre-trained model trained on clips/train.csv or clips/train-all.csv?
lissyx ((slow to reply) [NOT PROVIDING SUPPORT]) May 31, 2021, 5:16pm 2
No, if you train with the validation dataset, you just overfit and learn nothing.
chibt (Chibt) June 1, 2021, 3:06am 3
Hi, I do not train with the validation dataset.
What I mean by “validated dataset” is the file en/validated.tsv, whose quality has already been validated by up votes and down votes. It is a different thing and is not used for validation during training.
Anyway, I just want to know if I should use en/clips/train-all.csv instead of en/clips/train.csv for training. I am sure that they do not include the dev and test datasets.
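The claim that the training file shares no clips with dev and test can be checked directly rather than assumed. Below is a minimal sketch, assuming the imported CSVs use the columns wav_filename, wav_filesize and transcript; the function names and the sample rows are made up for illustration and are not part of DeepSpeech.

```python
import csv
import io


def clip_names(csv_file):
    """Return the set of wav_filename values from a CSV file object."""
    return {row["wav_filename"] for row in csv.DictReader(csv_file)}


def shared_clips(train_file, *held_out_files):
    """Return clips present in the training CSV and in any held-out CSV."""
    train = clip_names(train_file)
    held_out = set()
    for f in held_out_files:
        held_out |= clip_names(f)
    return train & held_out


# In-memory stand-ins for clips/train-all.csv and clips/dev.csv (made-up rows).
train_all = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,100,hello world\n"
    "b.wav,120,hello world\n"
)
dev = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "c.wav,90,good morning\n"
)

overlap = shared_clips(train_all, dev)
print(sorted(overlap))  # an empty list means no clip is shared
```

With real data, the StringIO objects would be replaced by files opened with open(path, newline="", encoding="utf-8").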
ftyers (Francis Tyers) June 1, 2021, 6:32pm 4
train-all.csv contains the training data without the one-recording-per-transcript restriction applied. See: https://github.com/mozilla/CorporaCreator/issues/113
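To see what dropping that restriction means in practice, one can count how many recordings each transcript has in a given CSV. This is a rough sketch, assuming the imported CSV has a transcript column; the helper names and sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def recordings_per_transcript(csv_file):
    """Count how many rows (recordings) each transcript has."""
    return Counter(row["transcript"] for row in csv.DictReader(csv_file))


def repeated_transcripts(csv_file):
    """Return only the transcripts that have more than one recording."""
    counts = recordings_per_transcript(csv_file)
    return {text: n for text, n in counts.items() if n > 1}


# In-memory stand-in for clips/train-all.csv (made-up rows).
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,100,hello world\n"
    "b.wav,120,hello world\n"
    "c.wav,90,good morning\n"
)

repeats = repeated_transcripts(sample)
print(repeats)
```

If the restriction works as described, running this on train.csv should return an empty dict, while train-all.csv may list many repeated sentences.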
Ugur_Turkdamar (Uğur Türkdamar) February 14, 2023, 9:14am 5
Hey, my folder doesn’t have a clips/train-all.csv file. Any solution? I cannot use the import_cv2 script…