Voice transformations: from speech synthesis to mammalian vocalizations

Voice conversion based on parameter transformation

1998

This paper describes a voice conversion system based on parameter transformation [1]. Voice conversion is the process of making one person's voice (the "source") sound like another person's voice (the "target") [2]. We will present a voice conversion scheme consisting of three stages. First, an analysis is performed on the natural speech to obtain the acoustic parameters: voiced and unvoiced regions, the glottal source model, pitch, energy, formants and bandwidths. Once these parameters have been obtained for two different speakers, they are transformed using linear functions. Finally, the transformed parameters are synthesized by means of a formant synthesizer. Experiments will show that this scheme is effective in transforming speaker individuality. It will also be shown that a single transformation from one speaker to another is not sufficient: the transformation has to be divided into several functions, each transforming a certain part of the speech signal. Segmentation based on spectral stability will divide the sentence into parts, and a transformation function will be applied to each segment.
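The per-segment linear mapping described in this abstract can be illustrated with a minimal sketch. It assumes that source and target parameter values (here pitch) are already time-aligned and that each linear function is fitted by least squares; the abstract specifies neither, and the names `fit_linear`, `convert` and the sample values are hypothetical.

```python
def fit_linear(source, target):
    """Least-squares fit of target ~ a * source + b for one segment (assumed fitting rule)."""
    n = len(source)
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    var_s = sum((s - mean_s) ** 2 for s in source)
    if var_s == 0.0:
        # Constant source values: fall back to a pure offset.
        return 1.0, mean_t - mean_s
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(source, target))
    a = cov / var_s
    return a, mean_t - a * mean_s


def convert(track, segments, functions):
    """Apply one (a, b) pair to each (start, end) segment of a parameter track."""
    out = []
    for (start, end), (a, b) in zip(segments, functions):
        out.extend(a * v + b for v in track[start:end])
    return out


# Hypothetical, time-aligned pitch values in Hz for two segments.
src_f0 = [100.0, 110.0, 120.0, 130.0, 200.0, 210.0, 220.0, 230.0]
tgt_f0 = [1.5 * v + 10.0 for v in src_f0[:4]] + [0.8 * v + 20.0 for v in src_f0[4:]]
segments = [(0, 4), (4, 8)]  # in the paper these come from spectral-stability segmentation

functions = [fit_linear(src_f0[s:e], tgt_f0[s:e]) for s, e in segments]
converted = convert(src_f0, segments, functions)
```

A single global fit over both segments could not reproduce `tgt_f0` here, which is the point the abstract makes about needing one function per segment.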

Vivos Voco: A Survey of Recent Research on Voice Transformations at IRCAM

IRCAM has long experience in the analysis, synthesis and transformation of voice. Natural voice transformations are of great interest for many applications and can be combined with text-to-speech systems, leading to a powerful creation tool. We present research conducted at IRCAM on voice transformations over the last few years. Transformations can be achieved in a global way by modifying pitch, spectral envelope, durations, etc. While it sacrifices the possibility of attaining a specific target voice, this approach allows the production of new voices with a high degree of naturalness and with different gender and age, modified vocal quality, or another speech style. These transformations can be applied in real time using ircamTools TRAX. Transformation can also be done in a more specific way in order to transform a voice towards the voice of a target speaker. Finally, we present some recent research on the transformation of expressivity.

Voice conversion through transformation of spectral and intonation features

2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004

This paper presents a voice conversion method based on transformation of the characteristic features of a source speaker towards a target. Voice characteristic features are grouped into two main categories: (a) the spectral features at formants and (b) the pitch and intonation patterns. Signal modelling and transformation methods for each group of voice features are outlined. The spectral features at formants are modelled using a set of two-dimensional phoneme-dependent HMMs. Subband frequency warping is used for spectrum transformation with the subbands centred on the estimates of the formant trajectories. The F0 contour is used for modelling the pitch and intonation patterns of speech. A PSOLA based method is employed for transformation of pitch, intonation patterns and speaking rate. The experiments present illustrations and perceptual evaluations of the results of transformations of the various voice features.

Prosody Modifications for Voice Conversion

2013

Generally defined, speech modification is the process of changing certain perceptual properties of speech while leaving other properties unchanged. Among the many types of speech information that may be altered are rate of articulation, pitch and formant characteristics. Modifying speech parameters such as pitch, duration and strength of excitation by a desired factor is termed prosody modification. In this thesis, prosody modifications for a voice conversion framework are presented. Among the speech modifications for prosody, two are important: first, modification of duration and pauses in a speech utterance (time-scale modification), and second, modification of the pitch (pitch-scale modification). Prosody modification involves changing the pitch and duration of speech without affecting the message and naturalness. In this work, time-scale and pitch-scale modifications of speech are discussed using two methods: Time Domain Pitch Synchronous Overlap-Add (TD-PSOLA) and epoch b...
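As a rough illustration of time-scale modification, the sketch below uses plain overlap-add (OLA) with fixed-size Hann-windowed frames. This is a deliberately simpler stand-in, not TD-PSOLA: it does not place frames at pitch marks, and the function name `ola_time_stretch`, the frame length and the test tone are assumptions made for the example.

```python
import math


def ola_time_stretch(x, alpha, frame_len=256):
    """Stretch x in time by factor alpha (>1 lengthens) using plain overlap-add."""
    if len(x) < frame_len:
        raise ValueError("signal shorter than one frame")
    hop_s = frame_len // 2   # synthesis hop (50% overlap)
    hop_a = hop_s / alpha    # analysis hop: frames are read more densely when alpha > 1
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_frames = int((len(x) - frame_len) / hop_a) + 1
    out_len = (n_frames - 1) * hop_s + frame_len
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for k in range(n_frames):
        # Clamp so the last frame never reads past the end of the input.
        a = min(int(round(k * hop_a)), len(x) - frame_len)
        s = k * hop_s
        for n in range(frame_len):
            out[s + n] += window[n] * x[a + n]
            norm[s + n] += window[n]
    # Divide by the summed window so overlapping frames do not change the level.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


# Hypothetical input: a 200 Hz tone sampled at 8 kHz, stretched to roughly twice its length.
tone = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(2048)]
stretched = ola_time_stretch(tone, 2.0)
```

Because the frames are not aligned to pitch periods, plain OLA introduces phase discontinuities on voiced speech. Pitch-synchronous methods such as TD-PSOLA address this by centring the frames on pitch marks.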

A Phase Vocoder Model of the Glottis for Expressive Voice Synthesis

Proceedings of 9th SONY Research Forum, Tokyo, Japan, 1999

In this paper we explain how we are improving the source component of a source-filter vocal synthesis system. Our strategy for this improvement involves the replacement of the pulse generator by a phase vocoder module whose coefficients are derived from the analysis of speech signals. Firstly, we introduce the context of our research and then indicate the problem; finally, we present our solution followed by our conclusions and suggestions for further work.

Transformation of speaker characteristics for voice conversion

2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721), 2003

This paper presents a voice conversion method based on analysis and transformation of the characteristics that define a speaker's voice. Voice characteristic features are grouped into three main categories: (a) the spectral features at formants, (b) the pitch and intonation pattern and (c) the glottal pulse shape. Modelling and transformation methods for each group of voice features are outlined. The spectral features at formants are modelled using two-dimensional phoneme-dependent HMMs. Subband frequency warping is used for spectrum transformation, where the subbands are centred on estimates of formant trajectories. The F0 contour, extracted from autocorrelation-based pitchmarks, is used for modelling the pitch and intonation patterns of speech. A PSOLA-based method is used for transformation of pitch, intonation patterns and speaking rate. Finally, a method based on de-convolution of the vocal tract is used for modelling and mapping of the glottal pulse. The experimental results present illustrations of transformations of the various characteristics and perceptual evaluations.

Applying voice conversion to concatenative singing-voice synthesis

Proc. of Interspeech 2010, 2010

This work addresses the application of Voice Conversion to singing voice. The GMM-based approach was applied to VOCALOID, a concatenative singing synthesizer, to perform singer timbre conversion. The conversion framework was applied to full-quality singing databases, achieving a satisfactory conversion effect on the synthesized utterances issued by VOCALOID. In this paper we describe our implementation and report the results of experiments that study spectral conversion performance when applied to specific pitch-range data.

VHDL Implementation of Phase Vocoder for Voice Morphing

2015

Speech is usually characterized as voiced, unvoiced or transient. Voiced speech is produced by an air flow of pulses caused by the vibration of the vocal cords. The resulting signal can be described as a quasi-periodic waveform with high energy and high correlation between adjacent samples. The two broad categories of pitch-estimation algorithms are time-domain algorithms and frequency-domain algorithms. Voice morphing is the process of producing intermediate or hybrid voices between the utterances of two speakers. It can also be defined as the process of gradually transforming the voice of one speaker to that of another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals, while generating a smooth transition between them. This paper describes the phase vocoder method of voice morphing. We extract the feature difference between the source and target voices and apply this phase difference to the source voice.

Parametric Formant Modelling and Transformation in Voice Conversion

International Journal of Speech Technology, 2005

This paper presents a method for the estimation and mapping of parametric models of speech resonance at formants for voice conversion. The spectral features at formants that contribute to voice characteristics are the trajectories of the frequencies, the bandwidths and intensities of the resonance at formants. The formant features are extracted from the poles of a linear prediction (LP) model of speech. The statistical distributions of formants are modelled by a two-dimensional hidden Markov model (HMM) spanning the time and frequency dimensions. Experimental results are presented which show a close match between HMM-based formant models and the histograms of formants. For voice conversion two alternative methods are explored for mapping the formants of a source speaker to those of a target speaker. The first method is based on an adaptive formant-tracking warping of the frequency response of the LP model and the second method is based on the rotation of the poles of the LP model of speech. Both methods transform all spectral parameters of the resonance at formants of the source speaker towards those of the target speaker. In addition, the issues affecting the selection of the warping ratios for the mapping functions are investigated. Experimental results of formant estimation and perceptual evaluation of voice morphing based on parametric formant models are presented.
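The relation between LP poles and formant parameters, and the idea of rotating a pole to move a formant, can be sketched as follows. The frequency and bandwidth formulas are the standard ones for a pole at radius r and angle θ (f = θ·fs/2π, B = −fs·ln(r)/π). The helper names, the sampling rate and the formant values are hypothetical, and the sketch changes only the formant frequency, whereas the paper's methods also map bandwidth and intensity.

```python
import cmath
import math


def pole_to_formant(pole, fs):
    """Return (frequency_hz, bandwidth_hz) of the resonance given by one complex pole."""
    freq = abs(cmath.phase(pole)) * fs / (2.0 * math.pi)
    bandwidth = -math.log(abs(pole)) * fs / math.pi
    return freq, bandwidth


def formant_to_pole(freq, bandwidth, fs):
    """Inverse mapping: upper-half-plane pole for a given formant frequency and bandwidth."""
    radius = math.exp(-math.pi * bandwidth / fs)
    return cmath.rect(radius, 2.0 * math.pi * freq / fs)


def rotate_pole(pole, target_freq, fs):
    """Rotate a pole so its formant frequency becomes target_freq; the radius is kept."""
    freq, _ = pole_to_formant(pole, fs)
    delta = 2.0 * math.pi * (target_freq - freq) / fs
    sign = 1.0 if pole.imag >= 0 else -1.0  # rotate conjugate poles in opposite directions
    return pole * cmath.exp(1j * sign * delta)


def section_coeffs(pole):
    """Coefficients [1, a1, a2] of 1 + a1*z^-1 + a2*z^-2 for the pole and its conjugate."""
    return [1.0, -2.0 * pole.real, abs(pole) ** 2]


fs = 16000.0                                      # assumed sampling rate in Hz
source_pole = formant_to_pole(500.0, 80.0, fs)    # hypothetical source formant
target_pole = rotate_pole(source_pole, 650.0, fs)  # moved to a hypothetical target frequency
new_freq, new_bw = pole_to_formant(target_pole, fs)
```

Rebuilding the LP denominator from the rotated conjugate pairs with `section_coeffs` gives the filter for the converted frame. Keeping the radius fixed preserves the bandwidth, so a bandwidth mapping would need a separate change of radius.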