Avoiding speaker variability in pronunciation verification of children's disordered speech
Related papers
A study of pronunciation verification in a speech therapy application
2009
Techniques are presented for detecting phoneme-level mispronunciations in utterances obtained from a population of impaired child speakers. The intended application of these approaches is to use the resulting confidence measures to provide feedback to patients concerning the quality of pronunciations in utterances arising within interactive speech therapy sessions. The pronunciation verification scenario involves presenting utterances of known words to a phonetic decoder and generating confusion networks from the resulting phone lattices. Confidence measures are derived from the posterior probabilities obtained from the confusion networks. Phoneme-level mispronunciation detection performance was significantly improved with respect to a baseline system by optimizing acoustic models and pronunciation models in the phonetic decoder and applying a nonlinear mapping to the confusion network posteriors.
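As a rough illustration of the last step in the abstract above, the sketch below maps confusion-network posteriors through a nonlinear function and flags phones whose mapped confidence is low. The abstract does not say which mapping was used; the sigmoid form, the `slope`, `midpoint`, and `threshold` values, and the function names are all assumptions made for this example.

```python
import math

def map_posterior(posterior, slope=8.0, midpoint=0.5):
    """Map a posterior in [0, 1] to a confidence score with a sigmoid.

    The sigmoid and its parameters are hypothetical; the source only says
    that "a nonlinear mapping" was applied to the posteriors.
    """
    return 1.0 / (1.0 + math.exp(-slope * (posterior - midpoint)))

def flag_mispronunciations(expected_phones, posteriors, threshold=0.5):
    """Return the expected phones whose mapped confidence is below threshold.

    posteriors[i] is the confusion-network posterior assigned to
    expected_phones[i] in the slot aligned with that phone.
    """
    flagged = []
    for phone, posterior in zip(expected_phones, posteriors):
        if map_posterior(posterior) < threshold:
            flagged.append(phone)
    return flagged

# Toy values, not taken from any experiment.
phones = ["k", "a", "s", "a"]
posteriors = [0.92, 0.85, 0.20, 0.70]
print(flag_mispronunciations(phones, posteriors))
```

In a real system the mapping parameters would be tuned on held-out data labelled by human listeners rather than fixed by hand.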
Automatic Assessment of Pronunciation Quality of Children within Assisted Speech Therapy
Electronics and Electrical Engineering, 2012
In this paper we present our results in the automatic evaluation of the pronunciation quality of children with dyslalia (mispronunciation of specific phonemes). Our aim is to offer real-time, quality feedback so as to reduce the gap between human-assisted and artificial speech therapy. We present both theoretical and practical issues, such as data acquisition, human scoring, Hidden Markov Model training and classification, and the performance of our system. The results encourage us to continue developing Logomon, the first computer-based speech therapy system for the Romanian language. Ill. 1, bibl. 17, tabl. 2 (in English; abstracts in English and Lithuanian). O. A. Schipor, S. G. Pentiuc, M. D. Schipor. Automatinis vaikų tarsenos kokybės vertinimas pagalbinio kalbėjimo terapijoje // Elektronika ir elektrotechnika. Kaunas: Technologija, 2012. No. 6(122). P. 15-18.
Evaluating acoustic speaker normalization algorithms: Evidence from longitudinal child data
The Journal of the Acoustical Society of America, 2012
Speaker vowel formant normalization, a technique that controls for variation introduced by physical differences between speakers, is necessary in variationist studies to compare speakers of different ages, genders, and physiological makeup in order to understand non-physiological variation patterns within populations. Many algorithms have been established to reduce variation introduced into vocalic data from physiological sources. The lack of real-time studies tracking the effectiveness of these normalization algorithms from childhood through adolescence inhibits exploration of child participation in vowel shifts. This analysis compares normalization techniques applied to data collected from ten African American children across five time points. Linear regressions compare the reduction in variation attributable to age and gender for each speaker for the vowels BEET, BAT, BOT, BUT, and BOAR. A normalization technique is successful if it maintains variation attributable to a reference sociolinguistic variable, while reducing variation attributable to age. Results indicate that normalization techniques which rely on both a measure of central tendency and range of the vowel space perform best at reducing variation attributable to age, although some variation attributable to age persists after normalization for some sections of the vowel space.
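The abstract above does not name the algorithms it compares. As one well-known example of a technique that uses both a speaker's central tendency and the spread of their vowel space, the sketch below applies Lobanov-style z-score normalization to one speaker's measurements of a single formant. The function name and the toy formant values are assumptions for illustration only.

```python
import statistics

def lobanov_normalize(formant_values_hz):
    """Z-score one speaker's measurements of a single formant.

    Each value is centred on the speaker's mean and divided by the
    speaker's standard deviation, so uniform scaling differences between
    speakers (for example, a shorter vocal tract) cancel out.
    """
    mean = statistics.mean(formant_values_hz)
    spread = statistics.stdev(formant_values_hz)
    return [(value - mean) / spread for value in formant_values_hz]

# Toy F1 values (Hz) for five vowels; the "child" is the "adult" scaled by 1.3.
adult_f1 = [300.0, 450.0, 600.0, 700.0, 520.0]
child_f1 = [1.3 * value for value in adult_f1]

print(lobanov_normalize(adult_f1))
print(lobanov_normalize(child_f1))
```

Both printed lists agree up to rounding, which is the sense in which this kind of normalization removes variation attributable to physical size.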
Acoustic normalization of children's speech
2003
Young speakers are not represented adequately in current speech recognizers. In this paper we focus on the problem of adapting the acoustic frontend of a speech recognizer that has been trained on adults' speech to achieve better performance on children's speech. We introduce and evaluate a method to perform non-linear VTLN by an unconstrained data-driven optimization of the filterbank. A second approach normalizes the speaking rate of the young speakers with the PSOLA algorithm. Significant reductions in word error rate have been achieved.
Capturing Local Variability for Speaker Normalization in Speech Recognition
IEEE Transactions on Audio, Speech, and Language Processing, 2000
The new model reduces the impact of local spectral and temporal variability by estimating a finite set of spectral and temporal warping factors which are applied to speech at the frame level. Optimum warping factors are obtained while decoding in a locally constrained search. The model involves augmenting the states of a standard hidden Markov model (HMM), providing an additional degree of freedom. It is argued in this paper that this represents an efficient and effective method for compensating local variability in speech which may have potential application to a broader array of speech transformations. The technique is presented in the context of existing methods for frequency-warping-based speaker normalization for ASR. The new model is evaluated in clean and noisy task domains using subsets of the Aurora 2, the Spanish SpeechDat-Car, and the TIDIGITS corpora. In addition, some experiments are performed on a Spanish language corpus collected from a population of speakers with a range of speech disorders. It has been found that, under clean or not severely degraded conditions, the new model provides improvements over the standard HMM baseline. It is argued that the framework of local warping is an effective general approach to providing more flexible models of speaker variability.
The Goodness of Pronunciation algorithm applied to disordered speech
In this paper, we report on a study with the aim of automatically detecting phoneme-level mispronunciations in 32 French speakers suffering from unilateral facial palsy at four different clinical severity grades. We sought to determine if the Goodness of Pronunciation (GOP) algorithm, which is commonly used in Computer-Assisted Language Learning systems to detect learners' individual errors, could also detect segmental deviances in disordered speech. For this purpose, speech read by the 32 speakers was aligned and GOP scores were computed for each phone realization. The highest scores, which indicate large dissimilarities with standard phone realizations, were obtained for the most severely impaired speakers. The corresponding speech subset was manually transcribed at phone level. 8.3% of the phones differed from standard pronunciations extracted from our lexicon. The GOP technique detected 70.2% of mispronunciations, with an equal rate of about 30% of false rejections and false acceptances. The phone substitutions detected by the algorithm confirmed that some of the speakers have difficulty producing bilabial plosives, and showed that other sounds such as sibilants are prone to mispronunciation. Another interesting finding was that speakers diagnosed with the same pathology grade do not necessarily share the same pronunciation issues.
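For readers unfamiliar with GOP, the sketch below shows the commonly cited form of the score: the absolute difference between the log-likelihood of the expected phone and that of the best-scoring phone over the same segment, divided by the segment length in frames. The abstract does not give the exact variant used in the study, so this form, the function name, and the toy log-likelihoods are assumptions.

```python
def gop_score(expected_phone, segment_logliks, num_frames):
    """Goodness of Pronunciation for one aligned phone segment.

    segment_logliks maps each candidate phone to the acoustic
    log-likelihood of the segment under that phone's model. A score of 0
    means the expected phone was also the best-scoring one; larger values
    mean the realization is further from the expected phone.
    """
    best = max(segment_logliks.values())
    return abs(segment_logliks[expected_phone] - best) / num_frames

# Toy log-likelihoods for a 10-frame segment where /p/ was expected.
logliks = {"p": -120.0, "b": -100.0, "m": -130.0}
print(gop_score("p", logliks, num_frames=10))  # 2.0
```

A mispronunciation detector would compare this score against a threshold, typically tuned per phone on annotated data.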
A simple approach to non-uniform vowel normalization
2002
In this paper, we present results of non-uniform vowel normalization and show that the frequency warping necessary to do non-uniform vowel normalization is similar to the mel scale. We compare our methods to Fant's non-uniform vowel normalization method and show that with the proposed frequency-warping approach we can achieve similar performance without any knowledge of the spoken vowel and the formant number. The proposed approach is motivated by a desire to perform non-uniform speaker normalization in automatic speech recognition systems. We also present results of a more comprehensive study of our earlier work on non-uniform scaling, which again shows that the mel scale is the appropriate warping function. All the results in this paper are based on data from the Peterson & Barney and Hillenbrand et al. vowel databases.
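To make the mel-scale comparison concrete, the sketch below implements one widely used analytic form of the mel scale, `2595 * log10(1 + f / 700)`, and its inverse. The paper may rely on a different variant; the formula choice and the function names are assumptions for this example.

```python
import math

def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels using a common analytic form."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The warping is close to linear at low frequencies and compressive at
# high ones: equal steps in Hz span fewer mels higher up.
low_step = hz_to_mel(2000.0) - hz_to_mel(1000.0)
high_step = hz_to_mel(4000.0) - hz_to_mel(3000.0)
print(low_step > high_step)  # True
```

This compression of high frequencies is what makes a mel-like warping non-uniform: higher formants are shifted by a different proportion than lower ones.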
Assessment of non-native children's pronunciation: Human marking and automatic scoring
2005
The paper investigates automatic rating of non-native children's pronunciation. We have designed a set of 28 pronunciation features; when classification is performed in a high-dimensional feature space, the best recognition results can be achieved. Different measures to evaluate inter-rater agreement and the machine score are proposed. In the European project Pf-Star, data from native and non-native children have been recorded; the German children reading English texts have been graded by 13-14 raters. When classifying 5 sentence-level marks, the result can be interpreted as 73% correct. Looking at a longer context, recognition becomes more robust. At the speaker level, error and correlation are comparable with those of some of the human raters.
Speaker normalization for automatic speech recognition: an on-line approach
We propose a method to transform the on-line speech signal so as to comply with the specifications of an HMM-based automatic speech recognizer. The spectrum of the input signal undergoes a vocal tract length (VTL) normalization based on differences of the average third formant F3. The high-frequency gap which is generated after scaling is estimated by means of an extrapolation scheme. Mel-scale cepstral coefficients (MFCC) are used along with delta and delta-delta cepstra as well as delta and delta-delta energy. The method has been tested on the TI digits database, which contains adult and children's speech, providing substantial gains with respect to non-normalized speech.
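A minimal sketch of the scaling idea in the abstract above, assuming the warp factor is the ratio of a reference average F3 to the speaker's average F3 and that the frequency axis is scaled linearly. The abstract only says the normalization is "based on differences of the average third formant", so the ratio form, the reference value, and the function names are hypothetical.

```python
def warp_factor(speaker_f3_values_hz, reference_f3_hz):
    """Estimate a linear warp factor from a speaker's average F3.

    A factor below 1 compresses the spectrum of a speaker whose vocal
    tract is shorter than the reference (for example, a child).
    """
    average_f3 = sum(speaker_f3_values_hz) / len(speaker_f3_values_hz)
    return reference_f3_hz / average_f3

def warp_frequencies(freqs_hz, alpha):
    """Scale each frequency by the warp factor alpha."""
    return [alpha * freq for freq in freqs_hz]

# Toy numbers: a child with average F3 of 3200 Hz, reference F3 of 2500 Hz.
alpha = warp_factor([3100.0, 3300.0], reference_f3_hz=2500.0)
print(alpha)                              # 0.78125
print(warp_frequencies([4000.0], alpha))  # [3125.0]
```

With these toy numbers the top of a 4 kHz band lands at 3125 Hz, leaving the region above it empty. That is the high-frequency gap the abstract says is filled by extrapolation.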
In this paper we discuss the need for compiling non-native speech corpora for the development of Computer Assisted Pronunciation Training (CAPT) applications that address specific language pairs. Learner’s L1-oriented CAPT tools enhanced with Automatic Speech Recognition (ASR) technology can perform better in automatically identifying common errors of speakers of a specific L1 when learning an L2. Nevertheless, the adaptation of an ASR system to non-native speech is a complex and time-consuming task which demands large quantities of speech data annotated at different transcription levels, and this data is generally not easily available to developers. Background research on CAPT development is presented by reporting on various projects that make use of non-native corpora for developing CAPT applications. We then present two different studies which share the objective of compiling an L1-L2 specific non-native corpus with the purpose of developing a CAPT system for the addressed language pair. The results of the two studies show that specific language combinations trigger specific pronunciation errors, and that these errors should be properly described in order to incorporate this information into the CAPT system and provide students with meaningful and effective feedback on their specific mispronunciations. Procedures for the compilation, annotation and transcription of non-native speech corpora that will improve the reusability and exchange of the databases will also be addressed. We discuss some possibilities for facilitating this task by using materials already provided with an orthographic transcription, which can be employed to automatically generate the phonemic transcription, or by obtaining transcriptions through other methods such as crowdsourcing. These procedures are less time-consuming and should make it easier to develop effective applications. Enhancing awareness of the need for learner's speech databases is another priority that should be addressed, not only for research purposes, but also for the development of applications.