FlexVoice: A Parametric Approach to High-Quality Speech Synthesis
Related papers
The voice synthesis business: 2022 update
Natural Language Engineering
In the past few years, high-quality automated text-to-speech synthesis has effectively become a commodity, with easy access to cloud-based APIs provided by a number of major players. At the same time, developments in deep learning have broadened the scope of voice synthesis functionalities that can be delivered, leading to a growth in the range of commercially viable use cases. We take a look at the technology features and use cases that have attracted attention and investment in the past few years, identifying the major players and recent start-ups in the space.
Speech synthesis systems: disadvantages and limitations
International Journal of Engineering & Technology
Present-day speech synthesis systems can be used successfully for a wide range of purposes. However, there are serious limitations in using various synthesizers. Many of these problems can be identified and resolved. The aim of this paper is to present the current state of development of speech synthesis systems and to examine their drawbacks and limitations. The paper discusses the current classification, construction and functioning of speech synthesis systems, which gives an insight into the synthesizers implemented so far. The analysis of disadvantages and limitations of speech synthesis systems focuses on identifying the weak points of these systems, namely: the impact of emotions and prosody, spontaneous speech in terms of naturalness and intelligibility, preprocessing and text analysis, the problem of ambiguity, natural sounding, adaptation to the situation, variety of systems, sparsely spoken languages, speech synthesis for older people, and some other minor...
Parameterization of vocal fry in HMM-based speech synthesis
2009
HMM-based speech synthesis offers a way to generate speech with different voice qualities. However, sometimes databases contain certain inherent voice qualities that need to be parametrized properly. One example of this is vocal fry typically occurring at the end of utterances. A popular mixed excitation vocoder for HMM-based speech synthesis is STRAIGHT. The standard STRAIGHT is optimized for modal voices and may not produce high quality with other voice types. Fortunately, due to the flexibility of STRAIGHT, different F0 and aperiodicity measures can be used in the synthesis without any inherent degradations in speech quality. We have replaced the STRAIGHT excitation with a representation based on a robust F0 measure and a carefully determined two-band voicing. According to our analysis-synthesis experiments, the new parameterization can improve the speech quality. In HMM-based speech synthesis, the quality is significantly improved especially due to the better modeling of vocal fry. Index Terms: speech synthesis, hidden Markov models, vocal fry, mixed excitation, STRAIGHT
Electronics and Communications in Japan (Part II: Electronics), 2005
This paper discusses a method of high-quality speech synthesis in which the speech rate can be controlled in various ways. When the prosody is adjusted by the PSOLA method or by the synthesis-by-analysis method in the waveform segment connection process, the quality declines as the extent of modification increases. To deal with this problem, this paper proposes a method in which modification of the segment duration is reduced and quality degradation is alleviated by using a speech database for each speech rate. The proposed method has the following features. (1) Synthesized speech with the target speech rate is produced for each utterance, and is recorded. (2) Speech databases of the same text at different speech rates are constructed. In this study, speech databases at three different speech rates, fast, medium, and slow, were acquired. Speech at two different speech rates (fast and slow) was synthesized by using the acquired speech databases and by the conventional method (using a speech database at the standard speech rate). Listening experiments showed that the proposed method can synthesize higher-quality speech than the conventional method. When speech databases with different speech rates are combined, there is a danger that the speech quality may be degraded due to differences in voice quality among the databases. The effect of voice quality was investigated in a listening experiment, and was found to be within the tolerable range. © 2005 Wiley Periodicals, Inc. Electron Comm Jpn Pt 2, 88(9): 38-47, 2005; Published online in Wiley InterScience (www.interscience.wiley.com).
A Comparative Performance of Various Speech Analysis-Synthesis Techniques
International Journal of Signal Processing Systems, 2014
In this paper, we compare the performance of various analysis-synthesis techniques that separate the acoustic parameters and allow reconstruction of a speech signal very close to the original speech. Analysis-synthesis of the speech signal is used for speech enhancement, speech coding, speech synthesis, speech modification and voice conversion. Our comparative study includes the Linear Predictive Coder, the Cepstral Coder, the Harmonic Noise Model based coder and the Mel-Cepstrum Envelope with Mel Log Spectral Approximation. The performance of these vocoders is evaluated using different objective measures, namely line spectral distortion, Mel cepstral distortion and signal-to-noise ratio. Along with the objective measures, a subjective measure, the mean opinion score, is also considered to evaluate the quality and naturalness of the resynthesized speech relative to the original speech.
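Two of the objective measures named in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used definitions of Mel cepstral distortion (in dB, excluding the zeroth coefficient) and of signal-to-noise ratio; the paper may use different variants. The function names and the frame layout (one list of coefficients per time-aligned frame) are hypothetical, chosen only for illustration.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean Mel cepstral distortion in dB over time-aligned frames.

    Assumption: each frame is a list of mel-cepstral coefficients and
    index 0 holds the energy term, which is excluded from the distance.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame sequences")
    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(squared)
    return total / len(ref_frames)


def snr_db(reference, resynthesized):
    """Signal-to-noise ratio in dB between equally long waveforms."""
    if len(reference) != len(resynthesized):
        raise ValueError("waveforms must have the same length")
    signal_energy = sum(x * x for x in reference)
    noise_energy = sum((x - y) ** 2 for x, y in zip(reference, resynthesized))
    if noise_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / noise_energy)
```

Lower distortion and higher SNR both indicate a resynthesized signal closer to the original, which is why such measures are usually reported alongside a listening test rather than in place of one.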
Prosody Modifications for Voice Conversion
2013
Generally defined, speech modification is the process of changing certain perceptual properties of speech while leaving other properties unchanged. Among the many types of speech information that may be altered are rate of articulation, pitch and formant characteristics. Modifying speech parameters such as pitch, duration and strength of excitation by a desired factor is termed prosody modification. In this thesis, prosody modifications for a voice conversion framework are presented. Among all the speech modifications for prosody, two are important: first, modification of duration and pauses in a speech utterance (time-scale modification), and second, modification of the pitch (pitch-scale modification). Prosody modification involves changing the pitch and duration of speech without affecting the message and naturalness. In this work, time-scale and pitch-scale modifications of speech are discussed using two methods: Time-Domain Pitch-Synchronous Overlap-Add (TD-PSOLA) and an epoch-based approach. Although there are many variations of TD-PSOLA, the version discussed in this thesis works directly on speech in the time domain to apply the desired modifications. The epoch-based approach involves modification of the LP residual. Among the various perceptual properties of speech, the pitch contour plays a key role, as it defines the intonation patterns of the speaker. Prosody modification in a voice conversion framework involves modifying the source pitch contour to follow the pitch contour of the target, which requires prediction of the target pitch contour. The mean/variance method for pitch contour prediction is explored. Sinusoidal modeling has been successfully applied to a broad range of speech processing problems. It offers advantages over linear predictive modeling and the short-time Fourier transform for speech analysis/synthesis and modification. 
The parameter estimation of sinusoidal modeling, which permits flexible time-scale and frequency-scale voice modifications, is presented. Speech synthesis using three models (sinusoidal, harmonic and harmonic-plus-residual) is discussed.
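The mean/variance method for pitch contour prediction mentioned in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it assumes the widely used variant that matches the mean and standard deviation of log-F0 between speakers, and it assumes unvoiced frames are marked with an F0 of 0 and passed through unchanged. The function names are hypothetical.

```python
import math


def log_f0_stats(f0_contour):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    voiced = [math.log(f) for f in f0_contour if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    mean = sum(voiced) / len(voiced)
    variance = sum((v - mean) ** 2 for v in voiced) / len(voiced)
    return mean, math.sqrt(variance)


def convert_f0(source_f0, source_stats, target_stats):
    """Map a source F0 contour onto the target speaker's log-F0 statistics.

    Assumption: frames with F0 <= 0 are unvoiced and are emitted as 0.0.
    """
    src_mean, src_std = source_stats
    tgt_mean, tgt_std = target_stats
    if src_std == 0.0:
        raise ValueError("source log-F0 has zero variance")
    converted = []
    for f in source_f0:
        if f <= 0:
            converted.append(0.0)
        else:
            shifted = tgt_mean + (tgt_std / src_std) * (math.log(f) - src_mean)
            converted.append(math.exp(shifted))
    return converted
```

In use, the statistics would be estimated once from training utterances of each speaker and then applied frame by frame to new source contours. The transform only shifts and scales the contour, so it preserves the shape of the source intonation rather than predicting the target speaker's own intonation patterns.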