Evaluation of a Neural Network Segmental Durations Model for Portuguese (original) (raw)

Evaluation of a Segmental Durations Model for TTS

Lecture Notes in Computer Science, 2003

In this paper we present a condensed description of a European Portuguese segmental duration's model for TTS purposes and concentrate on its evaluation. This model is based on artificial neural networks. The evaluation of the model quality was made by comparison with read speech. The standard deviation reached in test set is 19.5 ms and the linear correlation coefficient is 0.84. The model is perceptually evaluated with 4.12 against 4.30 for natural human read speech in a scale of 5.

Application of neural networks to duration modeling in a Spanish text-to-speech system

We are going to show the application of neural networks in one of the critical modules of a text -to-speech system: duration modeling. Our objective is the accurate prediction of segmental duration. We present a complete environment in which to decide which are the most relevant parameters and how to code them. We will compare two systems: an unrestricted-domain database for a male voice and a restricted-domain environment for a female voice. The restricted-domain offers several advantages to the modeling: the variation in the different patterns is reduced, and so most of the decisions about the parameters are based in more significant results. The result is a very low prediction error, specially in the restricted-domain case.

Automatic modeling of duration in a Spanish text-to-speech system using neural networks

1999

Accurate prediction of segmental duration from text in a textto-speech system is difficult for several reasons. One specially relevant is the great quantity of contextual factors that affect timing and how to model them. There are many parameters that affect duration, but not all of them are always relevant. We present a complete environment in which to decide which parameters are more relevant in different situations and the best way to code them. The system is based in a neural network absolutely configurable, and the main effort is made in the parameters to be used, including the contextual effects using windows of variable length.

Use of phoneme dedicated artificial neural networks to predict segmental durations

2005

The results of two alternative models to predict segmental durations in speech synthesis, both based on Artificial Neural Networks (ANNs) are discussed. The ANN model consists in just one ANN trained to predict the segmental durations for all phonemes. The phoneme dedicated ANN model consists in a set of ANNs, each one dedicated to predict the segmental duration of a specific phoneme. Both models are compared with the same input information extracted from one European Portuguese database. Objective and subjective measurements of performance of both approaches are compared. A slight preference was denoted for the phoneme dedicated ANN model.

Argentine Spanish segmental duration prediction

2012

In this paper we model the segmental duration of Spanish spoken in Buenos Aires, considering its application in a text-to-speech system. The work was performed on two hand labeled databases. We use articial neural networks as predictor, and all the input features can be extracted automatically from the speech text. We experimented with a neural network for all phonemes and one neural network for phoneme. In both cases the results are very promising for the two databases used. The order of importance of input features revealed to be dierent for each of the methods tested and dierent according to the speaker style.

Phoneme dedicated ANN improves segmental duration model

2008

The Phoneme Dedicated Artificial Neural Network (PDANN) segmental duration model consists of a set of ANNs trained specifically for each phoneme segment in order to avoid miscellaneous influence of different types of phoneme segments. Therefore, each ANN is dedicated to predict the duration of a specific phoneme segment. Objective and subjective measurements of the performance of the PDANN model were compared with those of a typical ANN model using the same input features and database. The results indicate a slight, but clear, perceptually perceived preference towards the PDANN.

Neural Network-Based Modeling of Phonetic Durations

Interspeech 2019

A deep neural network (DNN)-based model has been developed to predict non-parametric distributions of durations of phonemes in specified phonetic contexts and used to explore which factors influence durations most. Major factors in US English are pre-pausal lengthening, lexical stress, and speaking rate. The model can be used to check that text-to-speech (TTS) training speech follows the script and words are pronounced as expected. Duration prediction is poorer with training speech for automatic speech recognition (ASR) because the training corpus typically consists of single utterances from many speakers and is often noisy or casually spoken. Low probability durations in ASR training material nevertheless mostly correspond to non-standard speech, with some having disfluencies. Children's speech is disproportionately present in these utterances, since children show much more variation in timing.

Duration modeling in a restricted-domain female-voice synthesis in Spanish using neural networks

2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2001

The objective of this paper is the accurate prediction of segmental duration in a Spanish text-to-speech system. There are many parameters that affect duration, but not all of them are always relevant. We present a complete environment in which to decide which parameters are more relevant and the best way to code them. This work is the continuation of [1], where all efforts were dedicated to an unrestricted-domain database for a male voice. In this case, we are considering a female voice in a restricted-domain environment. This restricted-domain offers several advantages to the modeling: the variation in the different patterns is reduced, and so most of the decisions we have made about the parameters are now based in more significant results. So, the conclusions that we present now show clearly which parameters are best. The system is based in a neural network absolutely configurable.

Segmental Duration Modelling in Turkish

Lecture Notes in Computer Science, 2006

Naturalness of synthetic speech highly depends on appropriate modeling of prosodic aspects. Mostly, three prosody components are modeled: segmental duration, pitch contour and intensity. In this study, we present our work on modeling segmental duration in Turkish using machinelearning algorithms, especially Classification and Regression Trees (CART). The models predict phone durations based on attributes such as phone identity, neighboring phone identities, lexical stress, position of syllable in word, part-ofspeech (POS) information, word length in number of syllables and position of word in utterance extracted from a speech corpus of approximately 700 sentences. Obtained models predict segment durations better than mean duration approximations (~0.77 Correlation Coefficient, CC, and 20.4 ms Root-Mean Squared Error, RMSE). Attributes phone identity, neighboring phone identities, lexical stress, syllable type, POS, phrase break information, and location of word in the phrase constitute best predictor set for phoneme duration modeling.