Application of neural networks to duration modeling in a Spanish text-to-speech system
Related papers
Automatic modeling of duration in a Spanish text-to-speech system using neural networks
1999
Accurate prediction of segmental duration from text in a text-to-speech system is difficult for several reasons. An especially relevant one is the large number of contextual factors that affect timing, and the question of how to model them. Many parameters affect duration, but not all of them are always relevant. We present a complete environment for deciding which parameters are most relevant in different situations and the best way to code them. The system is based on a fully configurable neural network, and the main effort goes into the choice of input parameters, including contextual effects captured with windows of variable length.
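As an illustration of coding contextual effects with a window of variable length, the following is a minimal sketch, not the authors' implementation. It assumes one-hot coding of phoneme identity and a padding symbol for positions outside the utterance; the names `PAD`, `one_hot`, and `context_window` and the toy phoneme inventory are hypothetical.

```python
from typing import List, Sequence

PAD = "_"  # hypothetical padding symbol for positions beyond the utterance edges


def one_hot(symbol: str, inventory: Sequence[str]) -> List[float]:
    """Return a one-hot vector for `symbol` over `inventory`."""
    return [1.0 if symbol == s else 0.0 for s in inventory]


def context_window(phonemes: Sequence[str], index: int, left: int, right: int,
                   inventory: Sequence[str]) -> List[float]:
    """Concatenate one-hot codes for positions index-left .. index+right."""
    features: List[float] = []
    for pos in range(index - left, index + right + 1):
        # Positions outside the utterance are coded with the padding symbol.
        symbol = phonemes[pos] if 0 <= pos < len(phonemes) else PAD
        features.extend(one_hot(symbol, inventory))
    return features


# Toy example: the window sizes are the configurable part.
inventory = [PAD, "k", "a", "s"]
phonemes = ["k", "a", "s", "a"]
vec = context_window(phonemes, index=0, left=1, right=2, inventory=inventory)
```

Changing `left` and `right` changes only the length of the input vector, so different window sizes can be compared without touching the rest of the network configuration.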
Duration modeling in a restricted-domain female-voice synthesis in Spanish using neural networks
2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001
The objective of this paper is the accurate prediction of segmental duration in a Spanish text-to-speech system. Many parameters affect duration, but not all of them are always relevant. We present a complete environment for deciding which parameters are most relevant and the best way to code them. This work is the continuation of [1], where all efforts were dedicated to an unrestricted-domain database for a male voice. In this case, we consider a female voice in a restricted-domain environment. The restricted domain offers several advantages for modeling: the variation across patterns is reduced, so most of the decisions we have made about the parameters now rest on more significant results. The conclusions we present therefore show clearly which parameters are best. The system is based on a fully configurable neural network.
Neural Network-Based Modeling of Phonetic Durations
Interspeech 2019
A deep neural network (DNN)-based model has been developed to predict non-parametric distributions of durations of phonemes in specified phonetic contexts and used to explore which factors influence durations most. Major factors in US English are pre-pausal lengthening, lexical stress, and speaking rate. The model can be used to check that text-to-speech (TTS) training speech follows the script and words are pronounced as expected. Duration prediction is poorer with training speech for automatic speech recognition (ASR) because the training corpus typically consists of single utterances from many speakers and is often noisy or casually spoken. Low probability durations in ASR training material nevertheless mostly correspond to non-standard speech, with some having disfluencies. Children's speech is disproportionately present in these utterances, since children show much more variation in timing.
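The idea of flagging low-probability durations can be sketched as follows. This is an assumption-laden illustration rather than the paper's method: it supposes the predicted non-parametric distribution is a histogram over duration bins, and the bin edges, probabilities, threshold, and function names are all hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def duration_probability(duration_ms: float, bin_edges_ms: Sequence[float],
                         bin_probs: Sequence[float]) -> float:
    """Probability of the bin containing `duration_ms`.

    `bin_edges_ms` holds len(bin_probs) + 1 ascending edges; each bin is
    half-open, [low, high). Durations outside all bins get probability 0.0.
    """
    if duration_ms < bin_edges_ms[0] or duration_ms >= bin_edges_ms[-1]:
        return 0.0
    return bin_probs[bisect_right(bin_edges_ms, duration_ms) - 1]


def is_suspicious(duration_ms: float, bin_edges_ms: Sequence[float],
                  bin_probs: Sequence[float], threshold: float = 0.15) -> bool:
    """Flag a segment whose observed duration the model finds unlikely."""
    return duration_probability(duration_ms, bin_edges_ms, bin_probs) < threshold


# Hypothetical predicted distribution for one phoneme in one context.
edges = [0.0, 50.0, 100.0, 200.0]
probs = [0.2, 0.7, 0.1]
```

With these toy values, a 60 ms segment falls in the most likely bin and passes, while a 150 ms segment falls below the threshold and would be queued for a check against the script.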
Use of phoneme dedicated artificial neural networks to predict segmental durations
2005
The results of two alternative models for predicting segmental durations in speech synthesis, both based on Artificial Neural Networks (ANNs), are discussed. The ANN model consists of a single ANN trained to predict the segmental durations of all phonemes. The phoneme-dedicated ANN model consists of a set of ANNs, each dedicated to predicting the segmental duration of one specific phoneme. Both models are given the same input information, extracted from one European Portuguese database, and objective and subjective measurements of their performance are compared. A slight preference was found for the phoneme-dedicated ANN model.
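The structural difference between the two models can be sketched as below. To keep the example self-contained, a trivial mean-duration predictor stands in for each trained ANN; that stand-in, the class and function names, and the toy samples are all assumptions made for illustration, not part of the paper.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[str, List[float], float]  # (phoneme, input features, duration in ms)


class MeanDurationModel:
    """Stand-in for one trained ANN: always predicts the mean training duration."""

    def __init__(self) -> None:
        self.mean_ms = 0.0

    def fit(self, features: Sequence[List[float]], durations_ms: Sequence[float]) -> None:
        self.mean_ms = mean(durations_ms)

    def predict(self, feature_vector: List[float]) -> float:
        return self.mean_ms


def train_single(samples: Sequence[Sample]) -> MeanDurationModel:
    """One model trained on the segments of all phonemes."""
    model = MeanDurationModel()
    model.fit([f for _, f, _ in samples], [d for _, _, d in samples])
    return model


def train_dedicated(samples: Sequence[Sample]) -> Dict[str, MeanDurationModel]:
    """One model per phoneme, each trained only on that phoneme's segments."""
    grouped: Dict[str, List[Sample]] = defaultdict(list)
    for sample in samples:
        grouped[sample[0]].append(sample)
    return {phoneme: train_single(rows) for phoneme, rows in grouped.items()}


samples = [("a", [1.0], 90.0), ("a", [0.0], 110.0), ("s", [1.0], 60.0)]
single = train_single(samples)
dedicated = train_dedicated(samples)
```

At prediction time the dedicated variant selects a model by phoneme identity, for example `dedicated["s"].predict([1.0])`, whereas the single variant must infer phoneme effects from its input features.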
Argentine Spanish segmental duration prediction
2012
In this paper we model the segmental duration of Spanish as spoken in Buenos Aires, considering its application in a text-to-speech system. The work was performed on two hand-labeled databases. We use artificial neural networks as predictors, and all the input features can be extracted automatically from the text. We experimented with a single neural network for all phonemes and with one neural network per phoneme. In both cases the results are very promising for the two databases used. The order of importance of the input features proved to be different for each of the methods tested and to vary with speaker style.
Generating Segment Durations in a Text-To-Speech System: A Hybrid Rule-Based/Neural Network Approach
arXiv e-print cs/9811030, 1998
A combination of a neural network with rule firing information from a rule-based system is used to generate segment durations for a text-to-speech system. The system shows a slight improvement in performance over a neural network system without the rule firing information. Synthesized speech using segment durations was accepted by listeners as having about the same quality as speech generated using segment durations extracted from natural speech.
Evaluation of a Segmental Durations Model for TTS
Lecture Notes in Computer Science, 2003
In this paper we present a condensed description of a European Portuguese segmental duration model for TTS purposes and concentrate on its evaluation. The model is based on artificial neural networks. Model quality was evaluated by comparison with read speech. The standard deviation reached on the test set is 19.5 ms and the linear correlation coefficient is 0.84. In perceptual evaluation the model scored 4.12, against 4.30 for natural human read speech, on a 5-point scale.
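The two objective measures quoted above can be computed as in the following sketch. It assumes that "standard deviation" refers to the standard deviation of the prediction error (predicted minus observed duration), which the abstract does not state explicitly; the function names and toy durations are hypothetical.

```python
import math
from typing import Sequence


def error_std(predicted_ms: Sequence[float], observed_ms: Sequence[float]) -> float:
    """Population standard deviation of the prediction error, in ms."""
    errors = [p - o for p, o in zip(predicted_ms, observed_ms)]
    mean_error = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean_error) ** 2 for e in errors) / len(errors))


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear (Pearson) correlation coefficient between two sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Toy durations in ms, not data from the paper.
predicted = [80.0, 100.0, 120.0]
observed = [90.0, 100.0, 110.0]
```

The toy example shows why both measures are reported: the predictions correlate perfectly with the observations, yet the error still has a non-zero spread because the predicted range is too wide.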
Evaluation of a Neural Network Segmental Durations Model for Portuguese
This paper presents a description and evaluation of what is, as far as the authors know, the first published segmental duration model for European Portuguese for TTS purposes. The model is based on artificial neural networks trained with the resilient backpropagation algorithm. Using a substantial amount of training data and a selected set of input factors, the standard deviation reaches 19 ms in several paragraphs, with a linear correlation above 0.9. The paper presents the methodology, the topology of the neural network, and the training algorithm, and discusses the importance of the factors used, the evaluation of the model, and a comparison with other models.
Phoneme dedicated ANN improves segmental duration model
2008
The Phoneme Dedicated Artificial Neural Network (PDANN) segmental duration model consists of a set of ANNs, each trained specifically for one phoneme segment in order to avoid interference between different types of phoneme segments. Each ANN is therefore dedicated to predicting the duration of a specific phoneme segment. Objective and subjective measurements of the performance of the PDANN model were compared with those of a typical ANN model using the same input features and database. The results indicate a slight but clear perceptual preference for the PDANN.