Using Deep Speech (original)

November 1, 2017, 2:28pm 1

Covers topics concerned with the use of Deep Speech

maboa (Mark Boas) November 14, 2017, 11:31am 2

Just wanted to say this is great to see. Since I'm working in the area of STT, I'm very much looking forward to the discussion on this topic and hoping to contribute 🙂

psukys (Paulius Šukys) November 17, 2017, 4:48pm 3

Is there a user guide for using pre-trained models?

kdavis (kdavis) November 21, 2017, 7:21am 4

In the coming days we will release an American English model and info on its use.

elpimous_robot (Vincent Foucault) November 21, 2017, 3:44pm 5

Thanks a lot for DeepSpeech.
It really improves STT accuracy (considerably better than CMU Sphinx!).

Nik (Nik) November 29, 2017, 9:02pm 6

I’ve been testing the model that was released a few days ago. I recorded myself saying a few lines from the readme.

The expected result:

Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPU’s are supported.) This is done by instead installing the GPU specific package with the command:
pip install deepspeech-gpu

The actual result can be seen in the screenshot below:

[screenshot of the transcription output]

Unfortunately I can’t upload the .wav file here; if necessary, I can upload it somewhere else.

Is this the expected performance of Deep Speech? My hypothesis is that the language model was not trained on the vocabulary I’m using. Is there anything to gain from trying another language model?

reuben November 29, 2017, 9:11pm 7

To test whether the language model is negatively influencing the results, simply omit the last two parameters (lm.binary and trie) and see if the output improves.
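
For example, assuming the released model files live in a models/ directory (the paths and file names here are illustrative):

# With the language model:
deepspeech models/output_graph.pb recording.wav models/alphabet.txt models/lm.binary models/trie

# Without the language model (acoustic model only):
deepspeech models/output_graph.pb recording.wav models/alphabet.txt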

Nik (Nik) November 29, 2017, 9:16pm 8

Output without the language model:

olteritof me quicker in fraens can be perforemed asing i supported and veny a gpi on lenices se belo fight which ipis are spoied tet is tom by instar and soim igi butsiv package heth ti comand pep install deep speech hian gi pu

reuben November 29, 2017, 9:25pm 9

Yeah, it looks like it’s not the language model; rather, the acoustic model is struggling with the audio 😕

Could be due to noise in the recording, or maybe your accent. We definitely want to make our models more robust to things like that, for example by training on more varied data.

Nik (Nik) November 30, 2017, 6:15pm 10

Do you think there is much to gain from using the 250 hours of Common Voice and running the whole training process myself? Or would it be better to wait until there are about 5,000 hours of data, which is what the Baidu paper used?

readwrite November 30, 2017, 9:15pm 11

How can one do transfer learning using the pre-trained DeepSpeech model?

yesterdays (Yesterdays) December 1, 2017, 12:08am 12

The line from deepspeech.model import Model produces the following error:

[screenshot of the error]
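
For reference, a minimal sketch of how that import is meant to be used, modelled on the example client shipped with the 0.1 release; the constants and paths below are assumptions, so check them against your installed version:

import scipy.io.wavfile as wav
from deepspeech.model import Model

# Feature/decoder constants used by the release's example client (assumed values).
N_FEATURES = 26   # MFCC features per frame
N_CONTEXT = 9     # context window size
BEAM_WIDTH = 500  # CTC decoder beam width

# Paths are illustrative.
ds = Model("models/output_graph.pb", N_FEATURES, N_CONTEXT,
           "models/alphabet.txt", BEAM_WIDTH)

fs, audio = wav.read("recording.wav")  # expects 16-bit, 16 kHz, mono audio
print(ds.stt(audio, fs))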

sawantilak (Sawantilak) December 19, 2017, 11:20am 15

Hey, did you find a solution to this issue? I am facing the same issue.

mark2 (Matti Meikäläinen) December 22, 2017, 9:35am 16

Hi!

I am testing basic use of DeepSpeech with the pre-trained model downloaded from https://github.com/mozilla/DeepSpeech/releases and some test WAV files downloaded from https://www.dropbox.com/s/xecprghgwbbuk3m/vctk-pc225.tar.gz?dl=1. The correct transcriptions for the three cases below are “It is linked to the row over proposed changes at Scottish Ballet”, “Please call Stella”, and “Ask her to bring these things with her from the store”, respectively. The results produced by the default model are something totally different:

AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech ../models/output_graph.pb p225_366.wav ../models/alphabet.txt ../models/lm.binary ../models/trie
Loading model from file ../models/output_graph.pb
Loaded model in 1.071s.
Loading language model from files ../models/lm.binary ../models/trie
Loaded language model in 3.408s.
Running inference.
i do
Inference took 8.283s for 15.900s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech ../models/output_graph.pb p225_001.wav ../models/alphabet.txt ../models/lm.binary ../models/trie
Loading model from file ../models/output_graph.pb
Loaded model in 0.920s.
Loading language model from files ../models/lm.binary ../models/trie
Loaded language model in 3.111s.
Running inference.
huh
Inference took 4.822s for 6.155s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech ../models/output_graph.pb p225_002.wav ../models/alphabet.txt ../models/lm.binary ../models/trie
Loading model from file ../models/output_graph.pb
Loaded model in 1.026s.
Loading language model from files ../models/lm.binary ../models/trie
Loaded language model in 3.217s.
Running inference.
a cage
Inference took 7.021s for 12.176s audio file.

Any ideas about this behaviour?

BR,
Mark

kdavis (kdavis) December 22, 2017, 9:46am 17

Are the WAV audio files 16-bit, 16 kHz, and mono? If not, deepspeech won’t produce correct transcripts for them.
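
A quick way to check those properties is a minimal sketch using only the Python standard library (the file name is just an example):

import wave

with wave.open("p225_366.wav", "rb") as w:
    print("sample width:", w.getsampwidth() * 8, "bit")  # want 16
    print("sample rate: ", w.getframerate(), "Hz")       # want 16000
    print("channels:    ", w.getnchannels())             # want 1 (mono)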

lissyx ((slow to reply) [NOT PROVIDING SUPPORT]) December 22, 2017, 9:47am 18

@mark2 I just had a look at your files, and as mentioned, they are 48 kHz instead of the expected 16 kHz, which explains the completely unexpected output.

lissyx ((slow to reply) [NOT PROVIDING SUPPORT]) December 22, 2017, 9:49am 19

FTR:

alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:34.494758: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
u to o
cpu_time_overall=23.77523 cpu_time_mfcc=0.00953 cpu_time_infer=23.76570
alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:54.894628: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
it is lind to the row everyprprose changes at scosish balle
cpu_time_overall=22.01665 cpu_time_mfcc=0.00750 cpu_time_infer=22.00915

And one can do the conversion like this:

alex@portable-alex:~/tmp/deepspeech/cpu$ ffmpeg -i ../test-data/vctk-p225/wav48/p225/p225_366.wav -acodec pcm_s16le -ac 1 -ar 16000 ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav
alex@portable-alex:~/tmp/deepspeech/cpu$ 
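
If a whole directory of 48 kHz recordings needs converting, a small shell loop with the same ffmpeg flags works too (the directory below is just the example path from above):

for f in ../test-data/vctk-p225/wav48/p225/*.wav; do
  ffmpeg -i "$f" -acodec pcm_s16le -ac 1 -ar 16000 "${f%.wav}.16k.wav"
done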

mark2 (Matti Meikäläinen) December 22, 2017, 10:12am 20

Thanks! Now it gives more reasonable answers.

b.r (Buvana R) February 23, 2018, 5:48pm 21

Hello, what are the training data sets that went into the model that is available at https://github.com/mozilla/DeepSpeech/releases?

kdavis (kdavis) February 23, 2018, 5:55pm 22

LibriSpeech[1], Fisher[2,3,4,5], and Switchboard[6]