HMM Training Strategy for Incremental Speech Synthesis
Related papers
Evaluating prosodic processing for incremental speech synthesis
2012
Incremental speech synthesis (iSS) accepts input and produces output in consecutive chunks that only together result in a full utterance. Systems that use iSS thus have the ability to adapt their utterances while they are ongoing. However, starting to process with less than the full utterance available prohibits global optimization, leading to potentially suboptimal solutions. In this paper, we present a method for incrementalizing the symbolic pre-processing component of speech synthesis and assess the influence of varying "lookahead", i.e., knowledge about the rest of the utterance, on prosodic quality. We found that high-quality incremental output can be achieved even with a lookahead of less than one phrase, allowing for timely system reaction.
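The lookahead idea from this abstract can be sketched as follows; this is a hypothetical illustration of bounded-lookahead symbolic pre-processing, not the paper's actual pipeline:

```python
from typing import Iterator, List, Tuple

def incremental_preprocess(words: List[str], lookahead: int) -> Iterator[Tuple[str, List[str]]]:
    """Yield each word together with the bounded window of future words
    that symbolic pre-processing is allowed to see (a hypothetical
    sketch of the lookahead idea, not the paper's implementation)."""
    for i, word in enumerate(words):
        yield word, words[i + 1 : i + 1 + lookahead]

# With lookahead=1 each word "sees" only its immediate successor.
steps = list(incremental_preprocess(["take", "the", "next", "exit"], lookahead=1))
```

Prosodic features for each word would then be assigned using only the visible future as right context, trading global optimality for responsiveness.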
Towards Improved HMM-based Speech Synthesis Using High-Level Syntactical Features.
2010
A major drawback of current Hidden Markov Model (HMM)-based speech synthesis is the monotony of the generated speech, which is closely related to the monotony of the generated prosody. Complementary to model-oriented approaches that aim to increase the prosodic variability by reducing the "over-smoothing" effect, this paper presents a linguistic-oriented approach in which high-level linguistic features are extracted from text in order to improve prosody modeling. A linguistic processing chain based on linguistic preprocessing, morpho-syntactical labeling, and syntactical parsing is used to extract high-level syntactical features from an input text. Such linguistic features are then introduced into an HMM-based speech synthesis system to model prosodic variations (f0, duration, and spectral variations). Subjective evaluation reveals that the proposed approach significantly improves speech synthesis compared to a baseline model, even if such improvement depends on the observed linguistic phenomenon.
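As a hedged illustration of how such features enter an HMM system: high-level syntactic information is typically appended as extra fields of the full-context label that the clustering decision trees can then query. The field names below are invented for illustration and do not reproduce the paper's feature set:

```python
def context_label(phone, left, right, pos_tag=None, depth=None):
    """Build a simplified HTS-style context label.  `pos_tag` and
    `depth` stand in for added high-level syntactic features; the
    field names are hypothetical."""
    label = f"{left}-{phone}+{right}"   # quinphone reduced to triphone for brevity
    if pos_tag is not None:
        label += f"/POS:{pos_tag}"      # part-of-speech of the current word
    if depth is not None:
        label += f"/DEPTH:{depth}"      # depth in the syntactic parse tree
    return label

lab = context_label("a", "k", "t", pos_tag="NOUN", depth=2)
```

Decision-tree questions over the new fields (e.g. "is the word a noun?") then let f0 and duration models vary with syntax.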
Partial representations improve the prosody of incremental speech synthesis
2014
When humans speak, they do not plan their full utterance in all detail before beginning to speak, nor do they speak piece-by-piece while ignoring their full message – instead, humans use partial representations in which they fill in the missing parts as the utterance unfolds. Incremental speech synthesizers, in contrast, have not yet made use of partial representations and the information contained therein. We analyze the quality of prosodic parameter assignments (pitch and duration) generated from partial utterance specifications (substituting defaults for missing features) in order to determine the requirements that symbolic incremental prosody modelling should meet. We find that broader, higher-level information helps to improve prosody even if lower-level information about the near future is not yet available. Furthermore, we find that symbolic phrase-level or utterance-level information is most helpful towards the end of the phrase or utterance, respectively, that is, when this inf...
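The "substituting defaults for missing features" step can be sketched as completing a partial feature dictionary with default values before prosody assignment; the feature names and default values below are hypothetical, not the paper's inventory:

```python
# Hypothetical feature names and defaults, for illustration only.
DEFAULTS = {"phrase_type": "decl", "words_to_phrase_end": 3, "pos": "x"}

def complete_features(partial):
    """Fill a partial utterance specification with defaults for the
    not-yet-known features before prosodic parameter assignment."""
    return {**DEFAULTS, **partial}

spec = complete_features({"pos": "NOUN"})  # only the POS is known so far
```

Known features override the defaults; everything still unknown falls back to a neutral value until the utterance unfolds further.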
2014
This paper describes an approach to text-to-speech synthesis (TTS) based on HMMs. In the proposed approach, speech spectral parameter sequences are generated directly from HMMs based on the maximum likelihood criterion. By considering the relationship between static and dynamic features during parameter generation, smooth spectral sequences are generated according to the statistics of the static and dynamic parameters modelled by the HMMs, resulting in natural-sounding speech. In this paper, first, the algorithm for parameter generation is derived, and then the basic structure of an HMM-based TTS system is described. Results of subjective experiments show the effectiveness of the dynamic features.
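The parameter-generation algorithm this abstract refers to has a well-known closed form: with the static/dynamic means stacked into mu and W mapping a static trajectory c to stacked observations, the ML trajectory solves (W' S^-1 W) c = W' S^-1 mu. Below is a minimal one-dimensional sketch with a simple first-order delta, not the full multi-stream implementation:

```python
import numpy as np

def mlpg(mu, var, T):
    """ML parameter generation for one 1-dim stream with static +
    first-order delta features (a simplified sketch).

    mu, var: length 2*T arrays, interleaved per frame as
             [static_mean(t), delta_mean(t)] and matching variances.
    Returns the smooth static trajectory c of length T solving
             (W' S^-1 W) c = W' S^-1 mu.
    """
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0          # static row: o_static(t) = c(t)
        W[2 * t + 1, t] = 1.0      # delta row:  o_delta(t)  = c(t) - c(t-1)
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    Sinv = np.diag(1.0 / np.asarray(var, dtype=float))
    A = W.T @ Sinv @ W
    b = W.T @ Sinv @ np.asarray(mu, dtype=float)
    return np.linalg.solve(A, b)
```

With tight static variances and loose delta variances the trajectory follows the static means; tightening the delta variances pulls neighbouring frames together, which is exactly the smoothing effect the abstract describes.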
Dialogue context sensitive HMM-based speech synthesis
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
The focus of this work is speech synthesis tailored to the needs of spoken dialogue systems. More specifically, the framework of HMM-based speech synthesis is utilized to train an emphatic synthetic voice that also considers dialogue context for decision tree state clustering. To achieve this, we designed and recorded a speech corpus comprising of system turns from human-computer interaction, as well as additional prompts for slot-level emphasis. This corpus, combined with a general purpose text-to-speech one, was used to train HMM-based synthetic voices using a) baseline context features, b) additional slot-level emphasis features, and c) additional dialogue context features extracted from the dialogue act semantic representation. The voices were evaluated in pairs for dialogue appropriateness using a preference listening test. The results show that the emphatic voice is more preferable than the baseline when emphasis markup is present, while the dialogue context-sensitive voice is more preferable than the plain emphatic one when no emphasis markup is present and more preferable than the baseline in both cases.
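One way to picture the three training conditions (a–c) is as progressively larger question sets for decision tree state clustering; the question names below are invented for illustration and do not reproduce the paper's actual set:

```python
# Hypothetical question names for each training condition.
BASELINE_QUESTIONS = ["C-Vowel?", "R-Silence?"]                 # (a) standard context
EMPHASIS_QUESTIONS = ["C-Word-Emphasized?"]                     # (b) slot-level emphasis
DIALOGUE_QUESTIONS = ["DA-Is-Confirmation?", "DA-Has-Slot?"]    # (c) dialogue act features

def question_set(emphasis=False, dialogue=False):
    """Assemble the clustering question set for one training condition."""
    qs = list(BASELINE_QUESTIONS)
    if emphasis:
        qs += EMPHASIS_QUESTIONS
    if dialogue:
        qs += DIALOGUE_QUESTIONS
    return qs
```

The clustering trees can then split states on emphasis or dialogue-act context wherever the data supports it, which is what makes the resulting voice context-sensitive.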
Reactive and continuous control of HMM-based speech synthesis
2012 IEEE Spoken Language Technology Workshop (SLT), 2012
In this paper, we present a modified version of HTS, called performative HTS or pHTS. The objective of pHTS is to enhance the control ability and reactivity of HTS. pHTS reduces the phonetic context used for training the models and generates the speech parameters within a 2-label window. Speech waveforms are generated on-the-fly and the models can be reactively modified, impacting the synthesized speech with a delay of only one phoneme. It is shown that HTS and pHTS have comparable output quality. We use this new system to achieve reactive model interpolation and conduct a new test where articulation degree is modified within the sentence.
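The 2-label window can be sketched as a generation loop that never looks more than one label ahead; `generate` below is a stand-in for the actual pHTS parameter generation, which is not reproduced here:

```python
def synthesize_stream(labels, generate):
    """Generate parameters label-by-label using only a 2-label window
    (current + next), so output lags the input by one phoneme.
    `generate` is a placeholder for pHTS parameter generation and may
    be re-parameterized between calls (reactive control)."""
    out = []
    for i, cur in enumerate(labels):
        nxt = labels[i + 1] if i + 1 < len(labels) else None
        out.append(generate(cur, nxt))
    return out

chunks = synthesize_stream(["h", "e", "l", "o"], lambda cur, nxt: (cur, nxt))
```

Because each call sees only the current and next label, model changes made mid-utterance take effect after at most one phoneme of delay.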
Prerequisites for Building an Intelligible HMM-based Speech Synthesis
The purpose of the current study was to identify key factors that have the most significant impact on the final intelligibility of an HMM-based speech synthesizer, one of the most promising new technologies in recent speech synthesis. Based on a speech corpus designed for the Polish BOSS synthesizer, two synthetic voices were built: a cluster unit selection voice for Festival and an HMM-based voice. Speech samples synthesized using both approaches were comparatively analysed. Erroneous pitch extraction, insufficient database coverage, a high ratio of distorted samples, a poorly designed contextual question list, and over-pruning of the data proved to influence the final intelligibility of HMM-based speech synthesis the most.
Decision tree usage for incremental parametric speech synthesis
2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014
Human speakers plan and deliver their utterances incrementally, piece-by-piece, and it is obvious that their choice regarding phonetic details (and the details' peculiarities) is rarely determined by globally optimal solutions. In contrast, parametric speech synthesizers use a full-utterance context when optimizing vocoding parameters and when determining HMM states. Apart from being cognitively implausible, this impedes incremental use-cases, where the future context is often at least partially unavailable. This paper investigates the 'locality' of features in parametric speech synthesis voices and takes some missing steps towards better HMM state selection and prosody modelling for incremental speech synthesis.
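One simple way to quantify feature 'locality' in the spirit of this paper is to audit how many decision-tree clustering questions depend on right (future) context; the marker strings below are assumptions about a question naming scheme, not HTS's actual conventions:

```python
# Hypothetical markers for questions that inspect right (future) context.
FUTURE_MARKERS = ("next_", "r_")

def locality_ratio(questions):
    """Fraction of clustering questions answerable without future
    context, i.e. that mention no right-context field."""
    local = [q for q in questions if not any(m in q for m in FUTURE_MARKERS)]
    return len(local) / len(questions)
```

A voice whose trees mostly ask local questions degrades less when the future context is replaced by defaults, which is the regime incremental synthesis operates in.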
Minimum Generation Error Training for HMM-Based Speech Synthesis
2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings
A minimum generation error (MGE) criterion has previously been proposed to solve issues with maximum likelihood (ML)-based HMM training in HMM-based speech synthesis. In this paper, we improve the MGE criterion by using a log spectral distortion (LSD) instead of the Euclidean distance to define the generation error between the original and generated line spectral pair (LSP) coefficients. Moreover, we investigate the effect of different sampling strategies for calculating the integral of the LSD function. The experimental results show that using LSDs calculated by sampling at the LSP frequencies achieved the best performance, and that the quality of synthesized speech after MGE-LSD training was improved over the original MGE training.
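The sampled approximation of the LSD integral can be sketched as an RMS difference of two log-power spectra evaluated at chosen sample points (per the abstract, sampling at the LSP frequencies worked best); the LSP-to-spectrum conversion itself is omitted here:

```python
import numpy as np

def sampled_lsd(logspec_a, logspec_b, sample_idx):
    """Approximate the LSD integral by sampling: RMS difference of two
    log-power spectra at the chosen bins.  `sample_idx` would hold the
    bins nearest the LSP frequencies under the paper's best strategy."""
    a = np.asarray(logspec_a)[sample_idx]
    b = np.asarray(logspec_b)[sample_idx]
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

During MGE training this distortion, rather than the Euclidean distance between LSP vectors, would serve as the generation error to be minimized.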