A novel technique for voice conversion based on style and content decomposition with bilinear models

A study of bilinear models in voice conversion

2011

ABSTRACT This paper presents a voice conversion technique based on bilinear models and introduces the concept of contextual modeling. The bilinear approach reformulates the spectral envelope representation from line spectral frequency features to a two-factor parameterization corresponding to speaker identity and phonetic information, the so-called style and content factors.
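
The style/content factorization described above can be illustrated with the classic asymmetric bilinear model of Tenenbaum and Freeman, which has a closed-form SVD fit. The sketch below is illustrative only: the dimensions, the toy random "spectral" data, and the rank `J` are assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: S speakers ("styles") x C phonetic classes ("content"),
# each observation a D-dim spectral feature vector (stand-in for LSFs).
S, C, D, J = 3, 4, 5, 2          # J = content-factor dimension (assumed)
Y = rng.standard_normal((S, C, D))

# Asymmetric bilinear model: y_{sc} ~= A_s @ b_c, with A_s a D x J
# style matrix per speaker and b_c a J-dim content vector per class.
# Closed-form fit: stack observations into an (S*D) x C matrix and
# take a rank-J SVD; the left factor gives styles, the right contents.
Ystack = Y.transpose(0, 2, 1).reshape(S * D, C)
U, s, Vt = np.linalg.svd(Ystack, full_matrices=False)
A = (U[:, :J] * s[:J]).reshape(S, D, J)   # per-speaker style matrices
B = Vt[:J, :]                             # content vectors (columns)

# Reconstruct observations from the two factors
Yhat = np.einsum('sdj,jc->scd', A, B)
err = np.linalg.norm(Y - Yhat) / np.linalg.norm(Y)
```

Conversion then amounts to swapping the source speaker's style matrix for the target's while keeping the content vector of each frame.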

A voice conversion method based on joint pitch and spectral envelope transformation

… International Conference on …, 2004

Most of the research in Voice Conversion (VC) is devoted to spectral transformation, while the conversion of prosodic features is essentially obtained through a simple linear transformation of pitch. These separate transformations lead to unsatisfactory speech conversion quality, especially when the speaking styles of the source and target speakers are different. In this paper, we propose a method capable of jointly converting pitch and spectral envelope information. The parameters to be transformed are obtained by combining scaled pitch values with the spectral envelope parameters for the voiced frames, and only spectral envelope parameters for the unvoiced ones. These parameters are clustered using a Gaussian Mixture Model (GMM). Then the transformation functions are determined using a conditional expectation estimator. Tests carried out show that this process leads to a satisfactory pitch transformation. Moreover, it makes the spectral envelope transformation more robust.
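
The conditional expectation estimator mentioned above is the standard joint-density GMM conversion function. The sketch below evaluates it with hand-set mixture parameters rather than trained ones; the two-component toy model and all numbers are assumptions for illustration.

```python
import numpy as np

def convert(x, weights, mu_x, mu_y, S_xx, S_yx):
    """Map a source frame x to E[y | x] under a joint-density GMM."""
    K = len(weights)
    resp = np.empty(K)
    for k in range(K):
        diff = x - mu_x[k]
        inv = np.linalg.inv(S_xx[k])
        # Log-density of x under component k (shared constants dropped,
        # which cancel in the normalized posteriors below)
        ll = -0.5 * diff @ inv @ diff - 0.5 * np.log(np.linalg.det(S_xx[k]))
        resp[k] = weights[k] * np.exp(ll)
    resp /= resp.sum()                  # posteriors p(k | x)
    y = np.zeros(mu_y.shape[1])
    for k in range(K):
        y += resp[k] * (mu_y[k] + S_yx[k] @ np.linalg.inv(S_xx[k]) @ (x - mu_x[k]))
    return y

# Two toy components in 2-D. In the paper's setting, voiced frames
# would append a scaled pitch value to the spectral parameters first.
w = np.array([0.5, 0.5])
mu_x = np.array([[0.0, 0.0], [4.0, 4.0]])
mu_y = np.array([[1.0, 1.0], [5.0, 5.0]])
S_xx = np.stack([np.eye(2), np.eye(2)])
S_yx = np.stack([0.5 * np.eye(2), 0.5 * np.eye(2)])

y = convert(np.array([0.0, 0.0]), w, mu_x, mu_y, S_xx, S_yx)
```

A frame at the first component's mean is mapped essentially to that component's target mean, since the cross-covariance correction term vanishes there.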

Efficient model re-estimation in voice conversion

2008 16th European Signal Processing Conference, 2008

Voice conversion systems aim at converting an utterance spoken by one speaker to sound as if uttered by a second speaker. Over the last few years, interest in voice conversion has risen immensely. Gaussian mixture model (GMM) based techniques have been found to be efficient in the transformation of features represented as scalars or vectors. However, a reasonably large amount of aligned training data is needed to achieve good results. To solve this problem, this paper presents an efficient model re-estimation scheme. The proposed technique is based on adjusting an existing well-trained conversion model for a new target speaker with only a very small amount of training data. The experimental results provided in the paper demonstrate the efficiency of the re-estimation approach in line spectral frequency conversion and show that the proposed approach can reach good performance while using only a very limited amount of adaptation data.
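
One common way to adjust a well-trained model with very little data is a MAP-style relevance update of the Gaussian means; the paper's exact re-estimation scheme may differ, so the sketch below is only an assumed illustration of the idea: each component's mean moves toward the adaptation data in proportion to how much data it claims.

```python
import numpy as np

def map_adapt_means(means, resp, X, tau=8.0):
    """MAP-style mean adaptation.
    means: (K, D) prior component means
    resp:  (N, K) posteriors of the adaptation frames under the model
    X:     (N, D) adaptation frames
    tau:   relevance factor (larger = trust the prior more)"""
    n_k = resp.sum(axis=0)                          # soft counts per component
    xbar = (resp.T @ X) / np.maximum(n_k[:, None], 1e-12)
    alpha = (n_k / (n_k + tau))[:, None]            # data weight per component
    return alpha * xbar + (1.0 - alpha) * means

# Toy case: two frames, both assigned to component 0, so only that
# component's mean moves; the unobserved component keeps its prior.
means = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[2.0, 2.0], [2.0, 2.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0]])
new_means = map_adapt_means(means, resp, X, tau=2.0)
```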

Local linear transformation for voice conversion

2012

Abstract Many popular approaches to spectral conversion involve linear transformations determined for particular acoustic classes and compute the converted result as a linear combination of different local transformations in an attempt to ensure a continuous conversion. These methods often produce over-smoothed spectra and parameter tracks. The proposed method computes an individual linear transformation for every feature vector based on a small neighborhood in the acoustic space, thus preserving local details.
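
A minimal sketch of this per-frame idea (an assumed reading of the approach, not the paper's exact algorithm): for each source vector, fit an affine map on its k nearest aligned training pairs and apply it, instead of blending fixed class-wise transforms.

```python
import numpy as np

def local_linear_convert(x, X_src, Y_tgt, k=4):
    """Convert one source frame x using an affine map fitted on its
    k nearest neighbors among the aligned training pairs."""
    d = np.linalg.norm(X_src - x, axis=1)
    idx = np.argsort(d)[:k]                         # k-nearest source frames
    Xn = np.hstack([X_src[idx], np.ones((k, 1))])   # affine design matrix
    W, *_ = np.linalg.lstsq(Xn, Y_tgt[idx], rcond=None)
    return np.append(x, 1.0) @ W

# Toy aligned data where the true map is y = 2x + 1; the local fit
# should recover it exactly in the neighborhood of any frame.
rng = np.random.default_rng(1)
X_src = rng.standard_normal((50, 3))
Y_tgt = X_src * 2.0 + 1.0
y = local_linear_convert(np.zeros(3), X_src, Y_tgt, k=8)
```

Because each frame gets its own transform, sharp local structure is not averaged away as it is in a global mixture of fixed transforms.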

Dynamic Model Selection for Spectral Voice Conversion

Eleventh Annual Conference of the …, 2010

Statistical methods for voice conversion are usually based on a single model selected in order to represent a tradeoff between goodness of fit and complexity. In this paper we assume that the best model may change over time, depending on the source acoustic features. We present a new method for spectral voice conversion called Dynamic Model Selection (DMS), in which a set of potential best models with increasing complexity, including mixtures of Gaussians and probabilistic principal component analyzers, are considered during the conversion of a source speech signal into a target speech signal. This set is built during the learning phase, according to the Bayesian information criterion (BIC). During the conversion, the best model is dynamically selected among the models in the set, according to the acoustical features of each source frame. Subjective tests show that the method improves the conversion in terms of proximity to the target and quality.
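
The two phases can be sketched as follows (a hedged reading of the method, with toy one-dimensional "models"): the learning phase scores candidate models with BIC, and at conversion time each source frame picks the retained model that explains it best.

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion (lower is better)."""
    return -2.0 * loglik + n_params * np.log(n_obs)

def gauss_loglik(x, mu, var):
    """Log-likelihood of frame x under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Learning phase (illustrative): BIC for a hypothetical fitted model
b = bic(loglik=-120.0, n_params=5, n_obs=200)

# Conversion phase: two retained toy models, broad vs. narrow
models = [{'mu': 0.0, 'var': 4.0},
          {'mu': 3.0, 'var': 0.25}]

def select(frame):
    scores = [gauss_loglik(frame, m['mu'], m['var']) for m in models]
    return int(np.argmax(scores))       # per-frame dynamic choice

chosen = [select(np.array([v])) for v in (0.0, 3.0)]
```

A frame near 0 is best explained by the broad model and a frame near 3 by the narrow one, so the selected model changes frame by frame.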

Voice conversion through transformation of spectral and intonation features

2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004

This paper presents a voice conversion method based on transformation of the characteristic features of a source speaker towards a target. Voice characteristic features are grouped into two main categories: (a) the spectral features at formants and (b) the pitch and intonation patterns. Signal modelling and transformation methods for each group of voice features are outlined. The spectral features at formants are modelled using a set of two-dimensional phoneme-dependent HMMs. Subband frequency warping is used for spectrum transformation with the subbands centred on the estimates of the formant trajectories. The F0 contour is used for modelling the pitch and intonation patterns of speech. A PSOLA based method is employed for transformation of pitch, intonation patterns and speaking rate. The experiments present illustrations and perceptual evaluations of the results of transformations of the various voice features.
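
The subband frequency warping step can be illustrated with a piecewise-linear warp anchored at formant frequencies (an illustrative simplification, not the paper's HMM-driven formant-track version; the sample rate and formant values are assumptions).

```python
import numpy as np

def warp_spectrum(env, src_formants, tgt_formants, sr=16000):
    """Resample a spectral envelope so that source formant frequencies
    land on the target formant frequencies (piecewise-linear warp)."""
    n = len(env)
    freqs = np.linspace(0, sr / 2, n)
    anchors_s = np.concatenate(([0.0], src_formants, [sr / 2]))
    anchors_t = np.concatenate(([0.0], tgt_formants, [sr / 2]))
    # For each output frequency, where it comes from in the source
    src_freq = np.interp(freqs, anchors_t, anchors_s)
    return np.interp(src_freq, freqs, env)

# Toy downward-sloping envelope; shift F1 500->600 Hz, F2 1500->1800 Hz
env = np.linspace(1.0, 0.0, 257)
warped = warp_spectrum(env, [500.0, 1500.0], [600.0, 1800.0])
```

Pitch and duration changes would then be applied separately with PSOLA, as the abstract describes.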

Voice conversion based on parameter transformation

1998

This paper describes a voice conversion system based on parameter transformation [1]. Voice conversion is the process of making one person's voice (the "source") sound like another person's voice (the "target") [2]. We will present a voice conversion scheme consisting of three stages. First, an analysis is performed on the natural speech to obtain the acoustical parameters. These parameters include voiced and unvoiced regions, the glottal source model, pitch, energy, and formant frequencies and bandwidths. Once these parameters have been obtained for two different speakers, they are transformed using linear functions. Finally, the transformed parameters are synthesized by means of a formant synthesizer. Experiments will show that this scheme is effective in transforming the speaker individuality. It will also be shown that the transformation cannot be unique from one speaker to another but has to be divided into several functions, each transforming a certain part of the speech signal. Segmentation based on spectral stability divides the sentence into parts; a transformation function is then applied to each segment.
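
The per-segment linear functions can be sketched as simple least-squares fits, one per parameter per segment (the formant values below are toy stand-ins, and the single-segment fit is an assumed simplification of the scheme).

```python
import numpy as np

def fit_linear(src, tgt):
    """Fit y = a*x + b by least squares for one parameter track."""
    a, b = np.polyfit(src, tgt, deg=1)
    return a, b

# Toy aligned F1 tracks for one spectrally stable segment, where the
# true source-to-target relation is y = 1.2*x + 30
src_f1 = np.array([500.0, 520.0, 540.0, 560.0])
tgt_f1 = 1.2 * src_f1 + 30.0
a, b = fit_linear(src_f1, tgt_f1)
converted = a * src_f1 + b
```

In the full scheme, a separate (a, b) pair would be estimated for each acoustical parameter within each spectrally stable segment.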

Statistical Voice Conversion Based on Noisy Channel Model

IEEE Transactions on Audio, Speech, and Language Processing, 2000

This paper describes a novel framework of voice conversion that effectively uses both a joint density model and a speaker model. In voice conversion studies, approaches based on the Gaussian mixture model (GMM), with probabilistic densities of joint vectors of a source and a target speaker, are widely used to estimate a transform function between the two speakers. However, to achieve sufficient quality, these approaches require a parallel corpus containing many utterances with the same linguistic content spoken by both speakers. In addition, the joint density GMM methods often suffer from overtraining effects when the amount of training data is small. To compensate for these problems, we propose a voice conversion framework which integrates the speaker GMM of the target with the joint density model using a noisy channel model. The proposed method trains the joint density model with a few parallel utterances, and the speaker model with nonparallel data of the target, independently. It can ease the burden on the source speaker. Experiments demonstrate the effectiveness of the proposed method, especially when the amount of the parallel corpus is small.
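
The noisy-channel combination can be caricatured as MAP decoding: score each target candidate y by a channel term p(x | y) from the (parallel-trained) joint model times a prior p(y) from the (nonparallel-trained) target speaker model. Everything below — the scalar Gaussians, the candidate grid, and the variances — is a toy stand-in, not the paper's actual models.

```python
import numpy as np

def gauss_logpdf(v, mu, var):
    """Log-density of a scalar Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

def map_convert(x, candidates, channel_var, tgt_mu, tgt_var):
    """Pick argmax_y  log p(x | y) + log p(y), with a trivial channel
    model x ~= y + Gaussian noise and a single-Gaussian target prior."""
    scores = [gauss_logpdf(x, y, channel_var) + gauss_logpdf(y, tgt_mu, tgt_var)
              for y in candidates]
    return candidates[int(np.argmax(scores))]

cands = np.array([-1.0, 0.0, 1.0, 2.0])
best = map_convert(x=0.9, candidates=cands, channel_var=1.0,
                   tgt_mu=2.0, tgt_var=1.0)
```

The prior pulls the answer toward the target speaker's typical values even when the channel term alone would prefer a closer candidate, which is how the nonparallel target data contributes.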