Unsupervised training of a speech recognizer using TV broadcasts (original) (raw)

Current speech recognition systems require large amounts of transcribed data for parameter estimation. The transcription, however, is tedious and expensive. In this work we describe our experiments which are aimed at training a speech recognizer without transcriptions. The experiments were carried out with TV newscasts, that were recorded using a satellite receiver and a simple MPEG coding hardware. The newscasts were automatically segmented into segments of similar acoustic background condition. This material is inexpensive and can be made available in large quantities, but there are no transcriptions available. We develop a training scheme, where a recognizer is bootstrapped using very little transcribed data and is improved using new, untranscribed speech. We show that it is necessary to use a con dence measure to judge the initial transcriptions of the recognizer before using them. Higher improvements can be achieved if the number of parameters in the system is increased when more data becomes available. We show, that the bene cial e ect of unsupervised training is not compensated by MLLR adaptation on the hypothesis. In a nal experiment, the e ect of untranscribed data is compared with the e ect of transcribed speech. Using the described methods, we found that the untranscribed data gives roughly one third of the improvement of the transcribed material.