Investigating Spectral Amplitude Modulation Phase Hierarchy Features in Speech Synthesis

Probabilistic Amplitude Demodulation Features in Speech Synthesis for Improving Prosody

Interspeech 2016, 2016

Amplitude demodulation (AM) is a signal decomposition technique by which a signal can be decomposed into a product of two signals, i.e., a quickly varying carrier and a slowly varying modulator. In this work, probabilistic amplitude demodulation (PAD) features are used to improve prosody in speech synthesis. PAD is applied iteratively, generating syllable- and stress-level amplitude modulations in a cascade manner. The PAD features are used as a secondary input scheme alongside the standard text-based input features in statistical parametric speech synthesis. Specifically, deep neural network (DNN)-based speech synthesis is used to evaluate the importance of these features. Objective evaluation has shown that the proposed system using the PAD features mainly improves prosody modelling; it outperforms the baseline system by approximately 5% in terms of relative reduction in the root mean square error (RMSE) of the fundamental frequency (F0). The significance of this improvement is validated by a subjective evaluation of overall speech quality, achieving a 38.6% preference score versus 19.5% for the baseline system in an ABX test.
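The decomposition described above (signal ≈ slowly varying modulator × quickly varying carrier) can be illustrated with a plain envelope-based sketch. Note this is not the probabilistic PAD inference used in the paper, just a simple non-probabilistic analogue; the window length and the toy signal are illustrative assumptions.

```python
import numpy as np
from scipy.signal import hilbert

def simple_amplitude_demodulation(x, win=400):
    """Split x into a slowly varying modulator and a quickly varying
    carrier so that x == modulator * carrier (envelope-based analogue,
    not the probabilistic PAD of the paper)."""
    envelope = np.abs(hilbert(x))
    # Moving-average smoothing keeps only slow (e.g. syllable-rate)
    # variation in the modulator.
    kernel = np.ones(win) / win
    modulator = np.convolve(envelope, kernel, mode="same")
    carrier = x / np.maximum(modulator, 1e-8)
    return modulator, carrier

# Toy signal: a 5 Hz modulator imposed on a 200 Hz carrier.
fs = 8000
t = np.arange(fs) / fs
x = (1 + 0.5 * np.sin(2 * np.pi * 5 * t)) * np.sin(2 * np.pi * 200 * t)
mod, car = simple_amplitude_demodulation(x)
```

By construction the product `mod * car` reconstructs the signal exactly wherever the modulator is non-negligible; the paper's cascaded use corresponds to demodulating the recovered modulator again at a slower rate.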

Multiple-prosody speech databases and their effectiveness in high-quality speech synthesis at arbitrary rates

Electronics and Communications in Japan (Part II: Electronics), 2005

This paper discusses a method of high-quality speech synthesis in which the speech rate can be controlled in various ways. When the prosody is adjusted by the PSOLA method or by the synthesis-by-analysis method in the waveform segment connection process, the quality declines as the extent of modification increases. To deal with this problem, this paper proposes a method in which modification of the segment duration is reduced and quality degradation is alleviated by using a speech database for each speech rate. The proposed method has the following features. (1) Synthesized speech with the target speech rate is produced for each utterance, and is recorded. (2) Speech databases of the same text at different speech rates are constructed. In this study, speech databases at three different speech rates, fast, medium, and slow, were acquired. Speech at two different speech rates (fast and slow) was synthesized by using the acquired speech databases and by the conventional method (using a speech database at the standard speech rate). Listening experiments showed that the proposed method can synthesize higher-quality speech than the conventional method. When speech databases with different speech rates are combined, there is a danger that the speech quality may be degraded due to differences in voice quality among the databases. The effect of voice quality was investigated in a listening experiment, and was found to be within the tolerable range. © 2005 Wiley Periodicals, Inc. Electron Comm Jpn Pt 2, 88(9): 38-47, 2005; Published online in Wiley InterScience (www.interscience.wiley.com).

Speech Synthesis: A Review

International journal of engineering research and technology, 2013

Attempts to control the quality of voice of synthesized speech have existed for more than a decade now. Several prototypes and fully operating systems have been built based on different synthesis techniques. This article reviews recent research advances in the R&D of speech synthesis, with a focus on one of the key approaches, i.e., the statistical parametric approach to speech synthesis based on HMMs, so as to provide a technological perspective. In this approach, the spectrum, excitation, and duration of speech are simultaneously modelled by context-dependent HMMs, and speech waveforms are generated from the HMMs themselves. This paper aims to give an overview of what has been done in this field and to summarize and compare the characteristics of the various synthesis techniques used. It is expected that this study will be a contribution to the field of speech synthesis and enable identification of the research topics and applications at the forefront of this exciting and challenging field.

ProSynth: an integrated prosodic approach to device-independent, natural-sounding speech synthesis

Computer Speech & Language, 2000

This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the acoustic richness of the speech signal reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved by paying attention to systematic phonetic detail in the spectral, temporal and intonational domains produces a perceptually robust signal that is intelligible in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic-phonetic detail of real speech. We present examples of our approach to modelling systematic segmental, temporal and intonational detail and show how all are integrated in the prosodic structure. Preliminary tests to evaluate the effects of modelling systematic fine spectral detail, timing, and intonation suggest that the approach increases intelligibility and naturalness.

Unsupervised prominence prediction for speech synthesis

Interspeech 2013, 2013

We propose an unsupervised prominence prediction method for expressive speech synthesis. Prominence patterns are learned by statistical analysis of prosodic features extracted from speech data. The advantages of our unsupervised data-driven prominence prediction include easy adaptation to new speakers, speech styles, and even languages, without requiring expert knowledge or complicated linguistic rules. In this approach, first, prominence-predictive prosodic features are extracted at the foot level. Next, the extracted prosodic features are clustered, with each cluster representing a prominence level. Based on just-noticeable differences of prosodic features, the optimal number of perceptually distinct prominence levels is determined. Finally, the proposed prominence prediction is applied to prosody prediction for unit selection speech synthesis. Perceptual evaluation results show a preference for a 4-level unsupervised prominence prediction over a rule-based baseline in terms of naturalness and expressiveness of synthesized speech.
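The clustering step above can be sketched as follows. The abstract does not name a clustering algorithm, so a minimal k-means stands in for it, and the foot-level feature values (mean F0, duration, energy) are synthetic stand-ins, not data from the paper.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal k-means, standing in for the paper's (unspecified)
    clustering step. Initial centers are the points with the extreme
    first-feature values, which keeps the sketch deterministic."""
    centers = X[[X[:, 0].argmin(), X[:, 0].argmax()]].copy() if k == 2 \
        else X[:k].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Hypothetical foot-level prosodic features:
# (mean F0 in semitones, duration in s, energy in dB).
rng = np.random.default_rng(1)
low = rng.normal([88, 0.20, 60], [1.5, 0.02, 1.5], size=(50, 3))
high = rng.normal([96, 0.30, 68], [1.5, 0.02, 1.5], size=(50, 3))
feats = np.vstack([low, high])
z = (feats - feats.mean(0)) / feats.std(0)  # standardize each feature
levels = kmeans(z, k=2)  # each cluster id = one prominence level
```

In the paper the number of clusters is not fixed at 2 but chosen from just-noticeable differences of the prosodic features, arriving at the 4-level system evaluated in the listening test.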

The voice synthesis business: 2022 update

Natural Language Engineering

In the past few years, high-quality automated text-to-speech synthesis has effectively become a commodity, with easy access to cloud-based APIs provided by a number of major players. At the same time, developments in deep learning have broadened the scope of voice synthesis functionalities that can be delivered, leading to a growth in the range of commercially viable use cases. We take a look at the technology features and use cases that have attracted attention and investment in the past few years, identifying the major players and recent start-ups in the space.

Adaptation of prosody in speech synthesis by changing command values of the generation process model of fundamental frequency

Interspeech 2011, 2011

A method was developed to adapt prosody to a new speaker/style in speech synthesis. It predicts the differences between the target and original speakers/styles and applies them to the original prosody. Differences in fundamental frequency (F0) contours are represented within the framework of the generation process model as differences in the command magnitudes/amplitudes. While training the original model requires a certain amount of corpus, the corpus for training the command differences can be small. Furthermore, in the case of style adaptation, the adaptation corpus need not be uttered by the same speaker as the original style. Speech synthesis was conducted using an HMM-based speech synthesis system, with prosody controlled by the proposed method. Listening experiments on synthetic speech with style adaptation and voice conversion both showed the validity of the method.
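The generation process model referred to above represents ln F0(t) as a base value plus phrase- and accent-command filter responses, so adaptation can be expressed as adding predicted differences to the command amplitudes rather than retraining the whole model. A sketch under assumed time constants (ALPHA, BETA) and a hypothetical +0.2 accent-amplitude difference:

```python
import numpy as np

ALPHA, BETA = 3.0, 20.0  # assumed phrase/accent time constants (1/s)

def phrase_response(t):
    # Second-order critically damped response to a phrase impulse.
    return np.where(t >= 0, ALPHA**2 * t * np.exp(-ALPHA * t), 0.0)

def accent_response(t):
    # Step response for an accent command, ceiling-limited at 0.9.
    tc = np.clip(t, 0, None)
    r = 1.0 - (1.0 + BETA * tc) * np.exp(-BETA * tc)
    return np.where(t >= 0, np.minimum(r, 0.9), 0.0)

def lnf0_contour(t, fb, phrases, accents):
    """ln F0(t) = ln(base) + phrase responses + accent responses.
    phrases: (onset, amplitude); accents: (onset, offset, amplitude)."""
    y = np.full_like(t, np.log(fb))
    for t0, ap in phrases:
        y += ap * phrase_response(t - t0)
    for t1, t2, aa in accents:
        y += aa * (accent_response(t - t1) - accent_response(t - t2))
    return y

t = np.linspace(0, 2, 200)
orig = lnf0_contour(t, 120, [(0.0, 0.5)], [(0.3, 0.8, 0.4)])
# Adaptation as in the abstract: apply a predicted command-amplitude
# difference (hypothetical +0.2 on the accent command) to the
# original commands instead of retraining the full model.
adapted = lnf0_contour(t, 120, [(0.0, 0.5)], [(0.3, 0.8, 0.4 + 0.2)])
```

Because only command magnitudes/amplitudes change, the difference model can be trained from a much smaller corpus than the original contour model, which is the point the abstract makes.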

Performance Evaluation of Speech Synthesis Techniques for English Language

The conversion of text into the synthetic production of speech is known as text-to-speech synthesis (TTS). This can be achieved by the methods of concatenative speech synthesis (CSS) and hidden Markov model techniques. Quality is the important paradigm for the artificial speech produced. The study involves a comparative analysis of the quality of speech synthesis using the hidden Markov model and unit selection approaches. The quality of the synthesized speech is evaluated with two methods, i.e., subjective measurement using the mean opinion score and objective measurement based on mean square error and peak signal-to-noise ratio (PSNR). Mel-frequency cepstral coefficient features are also extracted for the synthesized speech. The experimental analysis shows that the unit selection method results in a better synthesized voice than the hidden Markov model.
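The objective measures named above are straightforward to compute on aligned waveforms. A minimal sketch, assuming the peak in PSNR is taken as the reference waveform's maximum absolute amplitude (one common convention; the abstract does not specify its exact definition), with a noisy tone standing in for a vocoder's output:

```python
import numpy as np

def mse(ref, syn):
    """Mean square error between reference and synthesized signals."""
    return float(np.mean((np.asarray(ref) - np.asarray(syn)) ** 2))

def psnr(ref, syn):
    """Peak signal-to-noise ratio in dB; the peak is assumed to be
    the reference waveform's maximum absolute amplitude."""
    peak = float(np.max(np.abs(ref)))
    return 10.0 * np.log10(peak**2 / mse(ref, syn))

# Toy "reference" and "synthesized" waveforms: a 220 Hz tone and a
# noisy copy of it standing in for a synthesizer's output.
fs = 16000
t = np.arange(fs) / fs
reference = np.sin(2 * np.pi * 220 * t)
synthesized = reference + 0.01 * np.random.default_rng(0).normal(size=fs)
```

Higher PSNR (equivalently, lower MSE) indicates a synthesized waveform closer to the reference; the subjective mean opinion score has no such closed form and requires listening tests.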

Continuous vocoder in feed-forward deep neural network based speech synthesis

2018

Recently, in statistical parametric speech synthesis, we proposed a vocoder using continuous F0 in combination with Maximum Voiced Frequency (MVF), which was successfully used with hidden Markov model (HMM) based text-to-speech (TTS). However, HMMs often generate over-smoothed and muffled synthesized speech. Therefore, we propose here to use a modified version of our continuous vocoder with deep neural networks (DNNs) to further improve its quality. Evaluations of DNN-TTS using the Continuous and WORLD vocoders are also presented. Experimental results from objective and subjective tests have shown that the DNN-TTS has higher naturalness than the HMM-TTS, and that the proposed framework provides quality similar to the WORLD vocoder while being simpler in terms of the number of excitation parameters, and models the voiced/unvoiced speech regions better than the WORLD vocoder.
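The "continuous F0" idea above means the F0 track has a value in every frame, including unvoiced ones, so the statistical model never has to handle an undefined parameter. A minimal sketch of that idea, assuming plain linear interpolation through unvoiced frames (the actual vocoder's contour estimation is more elaborate):

```python
import numpy as np

def continuous_f0(f0, unvoiced=0.0):
    """Replace unvoiced frames (marked with the sentinel value 0) by
    interpolating between neighbouring voiced frames, yielding an F0
    track that is defined everywhere."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 != unvoiced
    idx = np.arange(len(f0))
    # np.interp also extends the edge values past leading/trailing
    # unvoiced runs.
    return np.interp(idx, idx[voiced], f0[voiced])

track = [120, 118, 0, 0, 0, 110, 112]   # Hz, 0 = unvoiced frame
filled = continuous_f0(track)           # unvoiced gap filled: 116, 114, 112
```

A separate binary or MVF stream then carries the voicing decision, which is why the excitation needs fewer parameters than a vocoder that mixes voicing into the F0 track itself.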