Nearest neighbor approach in speaker adaptation for HMM-based speech synthesis
Related papers
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014
Statistical speech synthesis (SSS) systems have the ability to adapt to a target speaker with a couple of minutes of adaptation data. Developing adaptation algorithms that further reduce the required adaptation data to a few seconds of speech can have a substantial effect on the deployment of the technology in real-life applications such as consumer electronics devices. The traditional way to achieve such rapid adaptation is the eigenvoice technique, which works well in speech recognition but is known to generate perceptual artifacts in statistical speech synthesis. Here, we propose three methods to alleviate the quality problems of the baseline eigenvoice adaptation algorithm while allowing speaker adaptation with minimal data. Our first method uses a Bayesian eigenvoice approach to constrain the adaptation algorithm to move in realistic directions in the speaker space and thereby reduce artifacts. Our second method finds pre-trained reference speakers that are close to the target speaker and uses only those reference speaker models in a second eigenvoice adaptation iteration. Both techniques performed significantly better than the baseline eigenvoice method in the objective tests, and both improved speech quality in subjective tests compared to the baseline. In the third method, tandem use of the proposed eigenvoice method with a state-of-the-art linear-regression-based adaptation technique is found to improve adaptation of excitation features.
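The neighbor-restricted second eigenvoice pass described above can be illustrated with a minimal sketch, assuming each reference speaker is represented by a mean supervector, the eigenvoice basis is built by PCA, and the weights are estimated by least squares (a stand-in for the paper's EM-based estimation on HMM statistics); all function and variable names here are hypothetical.

```python
import numpy as np

def nearest_neighbor_eigenvoice(ref_supervectors, target_supervector, n_components=10, k=20):
    """Two-pass eigenvoice adaptation sketch.

    ref_supervectors:  (S, D) mean supervectors of pre-trained reference speakers
    target_supervector: (D,) rough target estimate from the few adaptation frames
    Returns the adapted supervector after the second, neighbor-restricted pass.
    """
    def eigenvoice_fit(refs, target, n):
        mean = refs.mean(axis=0)
        # PCA over the reference speakers gives the eigenvoice basis
        _, _, vt = np.linalg.svd(refs - mean, full_matrices=False)
        basis = vt[:n]                                          # (n, D)
        w, *_ = np.linalg.lstsq(basis.T, target - mean, rcond=None)
        return mean + basis.T @ w, w, basis, mean

    # Pass 1: adapt in the full speaker space
    _, w1, basis, mean = eigenvoice_fit(ref_supervectors, target_supervector, n_components)

    # Find the k reference speakers closest to the target in eigenvoice weight space
    ref_w = (ref_supervectors - mean) @ basis.T                 # (S, n) weights per reference speaker
    dists = np.linalg.norm(ref_w - w1, axis=1)
    neighbors = np.argsort(dists)[:k]

    # Pass 2: rebuild the eigenspace from the neighbors only and re-adapt
    adapted, *_ = eigenvoice_fit(ref_supervectors[neighbors], target_supervector,
                                 min(n_components, k - 1))
    return adapted
```

Restricting the second pass to the k nearest reference speakers keeps the adapted model from drifting toward speakers far from the target, which is the intuition behind the reduced artifacts reported above.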
Finding Relevant Features for Statistical Speech Synthesis Adaptation
Statistical speech synthesis (SSS) models typically lie in a very high-dimensional space. They can be used to allow speech synthesis on digital devices using only a few sentences of input by the user. However, the adaptation algorithms for such weakly trained models suffer from the high dimensionality of the feature space. Because creating new voices is easy with the SSS approach, thousands of voices can be trained and a nearest-neighbor (NN) algorithm can be used to obtain better speaker similarity in these limited-data cases. NN methods require good distance measures that correlate well with human perception. This paper investigates the problem of finding good low-cost metrics, i.e. simple functions of feature values that map to objective signal quality metrics. We show this is an ill-posed problem and study its conversion to a tractable form. Tentative solutions are found using statistical analyses. With a performance index improved by 36% w.r.t. a naive solution, while using only 0.77% of the respective number of features, our results are promising. Deeper insights into our results are then unveiled using visual methods, namely high-dimensional data visualization and dimensionality reduction techniques. Perspectives on new adaptation algorithms and on tighter integration of data mining and visualization principles are finally given.
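A sketch of the nearest-neighbor voice selection that such a low-cost metric enables, assuming each voice in the bank is summarized by a small fixed-length feature vector (e.g. mean mel-cepstrum and log-F0 statistics) and that per-dimension weights have been tuned against an objective quality metric; the names and the particular feature summary are assumptions, not the paper's recipe.

```python
import numpy as np

def select_nearest_voice(target_feats, voice_bank_feats, weights=None):
    """Pick the pre-trained voice whose low-cost feature summary is closest to the target.

    target_feats:     (d,) summary vector for the target speaker, computed from a
                      few adaptation sentences
    voice_bank_feats: (V, d) the same summary for each of the V pre-trained voices
    weights:          optional per-dimension weights learned so the distance
                      correlates with an objective quality metric
    """
    w = np.ones(target_feats.shape[0]) if weights is None else weights
    diffs = voice_bank_feats - target_feats
    dists = np.sqrt((w * diffs ** 2).sum(axis=1))   # weighted Euclidean distance
    best = int(np.argmin(dists))
    return best, dists
```

Because the summary vector is tiny compared to the full model, this selection step costs almost nothing even with thousands of candidate voices, which is the point of the "low-cost metric" framing above.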
2010
The EMIME project aims to build a personalized speech-to-speech translator, such that a user's spoken input in one language is used to produce spoken output in another language that still sounds like the user's voice. This distinctiveness makes unsupervised cross-lingual speaker adaptation one key to the project's success. So far, research has been conducted on the unsupervised and cross-lingual cases separately, by means of decision tree marginalization and HMM state mapping respectively. In this paper we combine the two techniques to perform unsupervised cross-lingual speaker adaptation. The performance of eight speaker adaptation systems (supervised vs. unsupervised, intra-lingual vs. cross-lingual) is compared using objective and subjective evaluations. Experimental results show that the performance of unsupervised cross-lingual speaker adaptation is comparable to that of the supervised case in terms of spectrum adaptation in the EMIME scenario, even though the automatically obtained transcriptions have a very high phoneme error rate.
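The state-mapping half of the combination can be sketched as follows, assuming each decision-tree leaf state is summarized by a single diagonal-covariance Gaussian and states are mapped across languages by minimum KL divergence; the function names are hypothetical, and the decision tree marginalization used for the unsupervised part is not shown.

```python
import numpy as np

def kl_gauss_diag(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def build_state_mapping(states_src, states_tgt):
    """Map each target-language leaf state to its closest source-language state.

    states_src, states_tgt: lists of (mean, var) tuples, one per leaf state.
    Adaptation transforms estimated on the source-language states can then be
    copied to the mapped target-language states.
    """
    mapping = []
    for mu_t, var_t in states_tgt:
        kls = [kl_gauss_diag(mu_t, var_t, mu_s, var_s) for mu_s, var_s in states_src]
        mapping.append(int(np.argmin(kls)))
    return mapping
```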
Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis
2010
This paper demonstrates how unsupervised cross-lingual adaptation of HMM-based speech synthesis models may be performed without explicit knowledge of the adaptation data language. A two-pass decision tree construction technique is deployed for this purpose. Using parallel translated datasets, cross-lingual and intralingual adaptation are compared in a controlled manner. Listener evaluations reveal that the proposed method delivers performance approaching that of unsupervised intralingual adaptation.
VTLN adaptation for statistical speech synthesis
2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010
The advent of statistical speech synthesis has enabled the unification of the basic techniques used in speech synthesis and recognition. Adaptation techniques that have been successfully used in recognition systems can now be applied to synthesis systems to improve the quality of the synthesized speech. The application of vocal tract length normalization (VTLN) to synthesis is explored in this paper. VTLN-based adaptation requires estimation of a single warping factor, which can be accurately estimated from very little adaptation data and gives additive improvements over CMLLR adaptation. The challenge of estimating accurate warping factors from higher-order features is solved by initializing the warping factor estimation with values calculated from lower-order features.
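A minimal sketch of the single warping factor estimation, assuming the synthesis front end provides a scoring function that returns the adaptation-data likelihood for a given warping factor; the grid limits, the optional coarse initialization from lower-order features, and all names are assumptions.

```python
import numpy as np

def estimate_warping_factor(score_fn, coarse_alpha=None,
                            search_range=0.08, step=0.01):
    """Grid-search a single VTLN warping factor.

    score_fn(alpha): returns the adaptation-data log-likelihood under the average
                     voice model after warping the features with factor alpha
                     (assumed to be provided by the synthesis toolkit).
    coarse_alpha:    optional initial estimate obtained from low-order features,
                     used to narrow the search when using high-order features.
    """
    if coarse_alpha is None:
        candidates = np.arange(0.80, 1.20 + 1e-9, step)              # full search
    else:
        candidates = np.arange(coarse_alpha - search_range,
                               coarse_alpha + search_range + 1e-9, step)
    scores = [score_fn(a) for a in candidates]
    return float(candidates[int(np.argmax(scores))])
```

Since only one scalar is estimated, even a handful of adaptation frames yields a stable value, which is why VTLN is attractive in the very-low-data regime discussed above.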
Computer Speech & Language
In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation, in which we employ an HMM-based statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker-adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios and that our proposed algorithms appear to generalise well across several language pairs. We also discuss important future directions, including the need for better evaluation metrics.
► We examine different approaches for personalising HMM-based speech-to-speech translation systems using speaker adaptation.
► Two adaptation approaches based on a ‘pipeline’ and a novel ‘unified’ framework are presented.
► Using both frameworks, unsupervised adaptation of HMM-based TTS is able to preserve speaker similarity in the presence of recognition errors.
► There is a need to better understand perception of speaker similarity and to develop better evaluation metrics for this task.
Analysis of speaker similarity in the statistical speech synthesis systems using a hybrid approach
The statistical speech synthesis (SSS) approach has become one of the most popular and successful methods in the speech synthesis field. Smooth speech transitions, without the spurious errors observed in unit selection systems, can be generated with the SSS approach. However, a well-known issue with SSS is the lack of voice similarity to the target speaker. The issue arises both in speaker-dependent models and in models adapted from average voices. Moreover, in speaker adaptation, similarity to the target speaker does not increase significantly after around one minute of adaptation data, which potentially indicates inherent bottleneck(s) in the system. Here, we propose using the hybrid speech synthesis approach to understand the key factors behind the speaker similarity problem. To that end, we try to answer the following question: which segments and parameters of speech, if generated/synthesized better, would substantially improve speaker similarity? In this work, our hybrid methods are described and listening test results are presented and discussed.
Rapid Unsupervised Speaker Adaptation Using Single Utterance Based on MLLR and Speaker Selection
2007
In this paper, we employ the concept of HMM sufficient statistics (HMM-Suff Stat) and N-best speaker selection to realize a rapid implementation of Baum-Welch and MLLR adaptation. Only a single arbitrary utterance is required, which is used to select the N-best speakers' HMM-Suff Stat from the training database as adaptation data. Since the HMM-Suff Stat are pre-computed offline, the computational load is minimized. Moreover, adaptation data from the target speaker is not needed. An absolute improvement of 1.8% WA is achieved with the rapid Baum-Welch adaptation over the SI model, and an improvement of 1.1% WA is achieved when the rapid MLLR is used compared to the rapid Baum-Welch adaptation using HMM-Suff Stat. Adaptation times are as fast as 6 s and 7 s, respectively. Evaluation is done in noisy environment conditions, where the adaptation algorithm is integrated into a speech dialogue system. Additional experiments with VTLN, MAP, and conventional MLLR are also performed.
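A rough sketch of the N-best speaker selection and statistics pooling described above, assuming per-speaker zeroth and first-order statistics were accumulated offline and showing only the mean re-estimation of the Baum-Welch step; the data layouts and names are hypothetical.

```python
import numpy as np

def rapid_adaptation_from_suff_stats(utt_feats, speaker_scorers, suff_stats, n_best=40):
    """Select the N-best reference speakers for one utterance and pool their
    pre-computed statistics to update the model means, so no adaptation data
    from the target speaker beyond the single utterance is needed.

    utt_feats:       (T, d) features of the single arbitrary utterance
    speaker_scorers: list of callables, scorer(frames) -> total log-likelihood
                     of the frames under that reference speaker's model
    suff_stats:      per-speaker dicts accumulated offline:
                     {'gamma': (M,), 'gamma_x': (M, d)}  (occupancies and
                     occupancy-weighted feature sums for the M Gaussians)
    """
    # 1) Rank reference speakers by likelihood of the test utterance
    scores = np.array([scorer(utt_feats) for scorer in speaker_scorers])
    best = np.argsort(scores)[::-1][:n_best]

    # 2) Pool the offline statistics of the selected speakers
    gamma = sum(suff_stats[i]['gamma'] for i in best)            # (M,)
    gamma_x = sum(suff_stats[i]['gamma_x'] for i in best)        # (M, d)

    # 3) One Baum-Welch style mean re-estimation from the pooled statistics
    new_means = gamma_x / np.maximum(gamma, 1e-8)[:, None]
    return new_means, best
```

Because the heavy accumulation happens offline, the online cost is one likelihood pass over a single utterance plus a weighted average, which is consistent with the few-second adaptation times reported above.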
Analysis of speaker clustering strategies for HMM-based speech synthesis
This paper describes a method for speaker clustering, with the application of building average voice models for speaker-adaptive HMM-based speech synthesis that are a good basis for adapting to specific target speakers. Our main hypothesis is that using perceptually similar speakers to build the average voice model will be better than using unselected speakers, even if the amount of data available from perceptually similar speakers is smaller. We measure the perceived similarities among a group of 30 female speakers in a listening test and then apply multiple linear regression to automatically predict these listener judgements of speaker similarity and thus identify similar speakers automatically. We then compare a variety of average voice models trained on speakers who were perceptually judged to be similar to the target speaker, on speakers selected by the multiple linear regression, or on a large global set of unselected speakers. We find that the average voice model trained on perceptually similar speakers performs better than the global model, even though the latter is trained on more data, confirming our main hypothesis. However, the average voice model using speakers selected automatically by the multiple linear regression does not reach the same level of performance.
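A sketch of the automatic selection step, assuming the listener similarity judgements are predicted from pairwise acoustic distance features with scikit-learn's multiple linear regression; the feature construction and all names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def train_similarity_predictor(pair_features, listener_scores):
    """Fit multiple linear regression from acoustic distance features of speaker
    pairs (e.g. mel-cepstral distance, differences in F0 statistics) to the mean
    perceptual similarity score collected in the listening test."""
    model = LinearRegression()
    model.fit(pair_features, listener_scores)
    return model

def select_similar_speakers(model, candidate_pair_features, candidate_ids, top_k=10):
    """Predict similarity of each candidate speaker to the target and keep the
    top_k speakers as training data for the average voice model."""
    predicted = model.predict(candidate_pair_features)
    order = np.argsort(predicted)[::-1][:top_k]
    return [candidate_ids[i] for i in order], predicted[order]
```

The regression is only a proxy for the listening test, which matches the finding above that automatically selected speakers help but do not reach the performance of perceptually selected ones.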