Multidialectal Acoustic Modeling: a Comparative Study

Multidialectal Spanish acoustic modeling for speech recognition

Speech Communication, 2009

In recent years, language resources for speech recognition have been collected for many languages, and in particular for global languages. One characteristic of global languages is their wide geographical dispersion and, consequently, their wide phonetic, lexical, and semantic dialectal variability. Even when the collected data is very large, it is difficult to represent dialectal variants accurately.

Clustering of triphones using phoneme similarity estimation for the definition of a multilingual set of triphones

Speech Communication, 2003

This paper addresses the problem of multilingual acoustic modelling for the design of multilingual speech recognisers. An agglomerative clustering algorithm for the definition of a multilingual set of triphones is proposed. The clustering algorithm is based on an indirect distance measure for triphones, defined as a weighted sum of explicit estimates of context similarity at the monophone level. The monophone similarity estimation method is based on Houtgast's algorithm. The new clustering algorithm was tested in a multilingual speech recognition experiment for three languages. The algorithm was applied to the monolingual triphone sets of language-specific recognisers for all three languages. To evaluate the clustering algorithm, the performance of the multilingual set of triphones was compared to that of a reference system composed of the three language-specific recognisers operating in parallel, and to that of a multilingual set of triphones produced by a tree-based clustering algorithm. All experiments were based on the 1000 FDB SpeechDat(II) databases (Slovenian, Spanish and German). The experiments showed that the clustering algorithm yields a significant reduction in the number of triphones with only minor degradation of the recognition rate.
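To make the idea of an indirect triphone distance concrete, the sketch below combines monophone-level similarities into a weighted sum and feeds it to a simple average-linkage agglomerative clustering loop. This is an illustrative reading of the approach, not the authors' implementation: the similarity matrix `sim`, the context weights, and the stopping threshold are hypothetical placeholders.

```python
# Hypothetical sketch: indirect triphone distance as a weighted sum of
# monophone similarities, plus bottom-up (agglomerative) clustering.
from itertools import combinations

def triphone_distance(t1, t2, sim, weights=(0.25, 0.5, 0.25)):
    """t1, t2 are (left, centre, right) monophone triples; sim[p][q] in [0, 1]."""
    return sum(w * (1.0 - sim[a][b]) for w, a, b in zip(weights, t1, t2))

def _avg_dist(ca, cb, sim):
    pairs = [(a, b) for a in ca for b in cb]
    return sum(triphone_distance(a, b, sim) for a, b in pairs) / len(pairs)

def agglomerative_cluster(triphones, sim, stop_distance=0.3):
    """Merge the closest pair of clusters until no pair is closer than the threshold."""
    clusters = [[t] for t in triphones]
    while len(clusters) > 1:
        (i, j), dist = min(
            (((a, b), _avg_dist(clusters[a], clusters[b], sim))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda x: x[1])
        if dist > stop_distance:
            break
        clusters[i] += clusters[j]   # i < j, so deleting j is safe
        del clusters[j]
    return clusters
```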

Minimum risk acoustic clustering for multilingual acoustic model combination

2000

In this paper we describe procedures for combining multiple acoustic models, obtained using training corpora from different languages, in order to improve ASR performance in languages for which large amounts of training data are not available. We treat these models as multiple sources of information whose scores are combined in a log-linear model to compute the hypothesis likelihood. The model combination can either be performed in a static way, with constant combination weights, or in a dynamic way, with parameters that can vary for different segments of a hypothesis. The aim is to optimize the parameters so as to achieve minimum word error rate. In order to achieve robust parameter estimation in the dynamic combination case, the parameters are defined to be piecewise constant on different phonetic classes that form a partition of the space of hypothesis segments. The partition is defined, using phonological knowledge, on segments that correspond to hypothesized phones. We examine different ways to define such a partition, including an automatic approach that gives a binary tree structured partition which tries to achieve the minimum WER with the minimum number of classes.
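A minimal sketch of the combination step described above is given below, assuming per-segment log-likelihoods from several language-specific models and weights that are piecewise constant over phonetic classes. The class names, the `segments` interface and the weight tables are assumptions for illustration, not the paper's actual data structures.

```python
# Hypothetical sketch: log-linear combination of acoustic scores from several
# language-specific models, with weights that vary by phonetic class.
def combined_log_score(segments, models, class_weights):
    """segments: objects with .features and .phone_class
    models: dict  model name -> log-likelihood function of the features
    class_weights: dict  phone_class -> {model name -> weight}"""
    total = 0.0
    for seg in segments:
        weights = class_weights[seg.phone_class]   # piecewise-constant weights
        total += sum(weights[name] * model(seg.features)
                     for name, model in models.items())
    return total
```

In the static variant described in the paper, the same weight vector would be used for every segment; the dynamic variant corresponds to letting `class_weights` differ across the phonetic classes of the partition.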

Decision Tree-Based Context Dependent Sublexical Units for Spanish Continuous Speech Recognition Tasks

This paper presents a new methodology, based on the classical decision tree classification scheme proposed by Bahl [1], to obtain a suitable set of context dependent sublexical units for Spanish continuous speech recognition tasks. The original method was applied as a first baseline approach. Then two new features were added: a discriminative function to evaluate the quality of the splits, and the use of discrete HMMs to compute the likelihoods. A second approach was explored, based on the fast and efficient Growing and Pruning algorithm, which fits both the size and the acoustic modelling capability of the decision trees. In addition, the use of these units to build word models was addressed, considering only intraword contexts. The baseline approach gave recognition rates clearly outperforming those of context independent phone-like units. The two new features and the alternative methodology outlined above were then evaluated. Recognition rates were similar to those of the baseline approach, with the discriminative function being the most promising feature. Finally, a prospective attempt was made by explicitly modelling the between-word contexts appearing in the test database. This approach gave the best results, suggesting further work in pronunciation modelling using context dependent phone-like units.
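The core operation in this family of methods is choosing, at each tree node, the phonetic question whose split most improves an evaluation function. The sketch below shows that selection step with a generic `loglik` scorer; in the paper this role is played by the likelihood of discrete HMMs or by the proposed discriminative function, so the scorer and the question set here are illustrative assumptions only.

```python
# Hypothetical sketch: picking the best phonetic question at a decision-tree
# node by maximising the gain of an evaluation function over the split.
def best_split(samples, questions, loglik):
    """loglik(subset) returns the score (e.g. log-likelihood) of the subset
    under a single model; questions are predicates over a sample's context."""
    parent = loglik(samples)
    best_q, best_gain = None, float("-inf")
    for q in questions:
        yes = [s for s in samples if q(s)]
        no = [s for s in samples if not q(s)]
        if not yes or not no:
            continue                      # a split must leave data on both sides
        gain = loglik(yes) + loglik(no) - parent
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q, best_gain
```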

Computational Comparison and Classification of Dialects

Dialectologia et Geolinguistica, 2000

In this paper a range of methods for measuring the phonetic distance between dialectal variants is described. These are variants of the frequency method, the frequency per word method, and Levenshtein distance, both simple (based on atomic characters) and complex (based on feature bundles). The measurements between feature bundles used Manhattan distance, Euclidean distance, or (a measure using) Pearson's correlation coefficient. Variants of these using feature weighting by entropy reduction were systematically compared, as was the representation of diphthongs (as one symbol or two). The dialects were compared with each other directly and indirectly via a standard dialect. The results of comparison were classified by clustering and by training of a Kohonen map. The results were compared to well-established scholarship in dialectology, yielding a calibration of the methods. These results indicate that the frequency per word method and the Levenshtein distance outperform the frequency method, that feature representations are more sensitive, that Manhattan distance and Euclidean distance are good measures of phonetic overlap of feature bundles, that weighting is not useful, that two-phone representations of diphthongs mostly outperform one-phone representations, and that dialects should be directly compared to each other. The results of clustering give a sharper classification, but the Kohonen map is a nice supplement.
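The two Levenshtein flavours compared in the paper differ only in the substitution cost: identity on phone symbols in the simple case, a graded distance between feature bundles in the complex case. The sketch below shows both, using Manhattan distance between feature vectors for the complex variant; the feature values and the normalisation are illustrative assumptions.

```python
# Illustrative sketch: Levenshtein distance over phone symbols (simple) and
# over feature bundles with a Manhattan substitution cost (complex).
def levenshtein(a, b, sub_cost=lambda x, y: 0.0 if x == y else 1.0):
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                      # deletion
                          d[i][j - 1] + 1,                      # insertion
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[m][n]

def manhattan(u, v):
    # per-feature absolute difference, scaled so the cost stays in [0, 1]
    return sum(abs(x - y) for x, y in zip(u, v)) / len(u)

simple = levenshtein("melk", "malk")                              # -> 1.0
graded = levenshtein([(1, 0, 0.5)], [(1, 0, 0.75)], sub_cost=manhattan)
```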

Analyzing phonetic variation in the traditional English dialects: Simultaneously clustering dialects and phonetic features

Literary and Linguistic Computing, 2013

This study explores the linguistic application of bipartite spectral graph partitioning, a graph-theoretic technique that simultaneously identifies clusters of similar localities as well as clusters of features characteristic of those localities. We compare the results using this approach to previously published results on the same dataset using cluster and principal component analysis (Shackleton, 2007). While the results of the spectral partitioning method and Shackleton's approach overlap to a broad extent, the analyses offer complementary insights into the data. The traditional cluster analysis detects some clusters which are not identified by the spectral partitioning analysis, while the reverse also occurs. Similarly, the principal component analysis and the spectral partitioning analysis detect many overlapping, but also some different, linguistic variants. The main benefit of the bipartite spectral graph partitioning method over the alternative approaches remains its ability to simultaneously identify sensible geographical clusters of localities with their corresponding linguistic features.
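For readers unfamiliar with the technique, the sketch below shows a two-way bipartite spectral split in the spirit of Dhillon's (2001) co-clustering: localities and features are partitioned simultaneously from the singular vectors of the degree-normalised co-occurrence matrix. It is a minimal illustration under the assumption of a binary locality-by-feature matrix, not the code used in this study.

```python
# Illustrative sketch of bipartite spectral graph partitioning (co-clustering).
import numpy as np

def bipartite_spectral_split(A):
    """A[i, j] = 1 if locality i exhibits linguistic feature j, else 0."""
    d1 = np.sqrt(A.sum(axis=1))            # locality degrees
    d2 = np.sqrt(A.sum(axis=0))            # feature degrees
    An = A / np.outer(d1, d2)              # D1^{-1/2} A D2^{-1/2}
    U, _, Vt = np.linalg.svd(An, full_matrices=False)
    loc_coord = U[:, 1] / d1               # second singular vectors carry the split
    feat_coord = Vt[1, :] / d2
    # a simple sign split assigns localities and features to one of two clusters
    return loc_coord >= 0, feat_coord >= 0

localities, features = bipartite_spectral_split(
    np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float))
```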

Taking Advantage of Spanish Speech Resources to Improve Catalan Acoustic HMMs

At TALP, we are working on speech recognition of the official languages in Catalonia, i.e. Spanish and Catalan. These two languages share approximately 80% of their allophones. The speech databases available to us for training HMMs in Catalan are smaller than the Spanish databases. This difference in training database size results in poorer phonetic unit models for Catalan than for Spanish. The Catalan database is not large enough to allow correct training of more complex models such as triphones. The aim of this work is to find segments in the Spanish databases that, used in conjunction with the Catalan utterances to train the HMMs, improve the speech recognition rate for Catalan. To make this selection, the following information is used: the distance between the HMMs trained separately for Spanish and Catalan, and the phonetic attributes of each allophone. A contextual acoustic unit, the demiphone, and a state tying approach are used. The tying is done by tree clustering, using the phonetic attributes of the units and the distances between the HMM states. Different tests were carried out using different percentages of tied states when training simultaneously on Catalan and Spanish. In this way, Catalan models are obtained that generally give better results than models trained only with the Catalan utterances. However, one of the tests shows that, when the number of Gaussians is increased, the improvement turns into a loss of performance. Currently, we are working on including additional labels to prevent the tree clustering from pooling phoneme realizations that are too different.
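The tying decision hinges on a distance between HMM states trained separately for the two languages. The paper does not spell out the distance in this abstract, so the sketch below uses a symmetrised Kullback-Leibler divergence between single-Gaussian, diagonal-covariance states as one plausible choice; the threshold and the attribute check are likewise hypothetical.

```python
# Hedged sketch: a symmetric divergence between two diagonal-Gaussian HMM
# states, of the kind that could drive tree-based tying of Catalan and Spanish
# states. The specific distance and threshold are assumptions, not the paper's.
import numpy as np

def gaussian_state_distance(mu1, var1, mu2, var2):
    """Symmetrised KL divergence between two diagonal Gaussians."""
    kl12 = 0.5 * np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2 - 1 + np.log(var2 / var1))
    kl21 = 0.5 * np.sum(var2 / var1 + (mu1 - mu2) ** 2 / var1 - 1 + np.log(var1 / var2))
    return 0.5 * (kl12 + kl21)

def can_tie(state_cat, state_spa, threshold=2.0, same_attributes=True):
    """Tie a Catalan and a Spanish state if their phonetic attributes agree
    and the states are acoustically close (both criteria illustrative)."""
    return same_attributes and gaussian_state_distance(*state_cat, *state_spa) < threshold
```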

A comparison of data-derived and knowledge-based modeling of pronunciation variation

2000

This paper focuses on modeling pronunciation variation in two different ways: data-derived and knowledge-based. The knowledge-based approach consists of using phonological rules to generate variants. The data-derived approach consists of performing phone recognition, followed by various pruning and smoothing methods to alleviate some of the errors in the phone recognition. Using phonological rules led to a small improvement in WER, whereas using a data-derived approach in which the phone recognition was smoothed with simple decision trees (d-trees) prior to lexicon generation led to a significant improvement over the baseline. Furthermore, we found that 10% of the variants generated by the phonological rules were also found using phone recognition, and this increased to 23% when the phone recognition output was smoothed using d-trees. In addition, we propose a metric to measure confusability in the lexicon, and we found that employing this confusion metric to prune variants results in roughly the same improvement as using the d-tree method.
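The abstract does not define the confusability metric, so the sketch below only illustrates the general idea: count a pronunciation variant as confusable when the same phone string also serves as a variant of another word, and prune variants whose confusability exceeds a threshold. The lexicon format, the threshold, and the exact counting rule are assumptions for illustration.

```python
# Hedged sketch of a lexicon confusability measure and variant pruning.
from collections import defaultdict

def prune_confusable_variants(lexicon, max_confusions=0):
    """lexicon: dict  word -> set of pronunciation variants (phone strings)."""
    owners = defaultdict(set)
    for word, variants in lexicon.items():
        for v in variants:
            owners[v].add(word)
    pruned = {}
    for word, variants in lexicon.items():
        # keep a variant only if few other words share the same phone string
        kept = {v for v in variants if len(owners[v] - {word}) <= max_confusions}
        pruned[word] = kept or variants   # never drop every variant of a word
    return pruned

lexicon = {"weather": {"w eh dh er"},
           "whether": {"w eh dh er", "hh w eh dh er"}}
print(prune_confusable_variants(lexicon))
```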