A first step towards text-independent voice conversion

Voice conversion using exclusively unaligned training data

Spanish Society for …, 2004

Although all conventional voice conversion approaches require equivalent training utterances from the source and target speakers, several recently proposed applications call for this requirement to be dropped. In this paper, we present an algorithm that finds corresponding time frames within unaligned training data. The performance of this algorithm is tested by means of a voice conversion framework based on linear transformation of the spectral envelope. Experimental results are reported on a Spanish cross-gender corpus using several objective error measures.
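
As a rough sketch of the two ingredients this abstract names, the following Python fragment pairs each source frame with its nearest target frame in unaligned data and then fits an affine spectral-envelope transform by least squares. The feature dimension, the nearest-neighbor matching, and all variable names are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def align_frames(src, tgt):
    """Pair every source frame with its nearest target frame (Euclidean).

    src: (N, D) source envelope features, tgt: (M, D) target features.
    Brute force is fine for a sketch; a KD-tree would scale better.
    """
    dists = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)  # (N, M)
    return dists.argmin(axis=1)

def fit_linear_transform(src, tgt_aligned):
    """Least-squares affine map: [src | 1] @ W ~= tgt_aligned."""
    X = np.hstack([src, np.ones((len(src), 1))])  # append bias column
    W, *_ = np.linalg.lstsq(X, tgt_aligned, rcond=None)
    return W

# Toy usage with random stand-ins for spectral envelope features.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 24))
tgt = rng.normal(size=(180, 24))
idx = align_frames(src, tgt)
W = fit_linear_transform(src, tgt[idx])
converted = np.hstack([src, np.ones((len(src), 1))]) @ W
```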

Text-independent voice conversion based on unit selection

2006

So far, most voice conversion training procedures have been text-dependent, i.e., based on parallel training utterances of the source and target speakers. Since several applications (e.g. speech-to-speech translation or dubbing) require text-independent training, training techniques that use non-parallel data have been proposed over the last two years. In this paper, we present a new approach that applies unit selection to find corresponding time frames in source and target speech. A subjective experiment shows that this technique achieves the same performance as conventional text-dependent training.
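
A minimal sketch of unit-selection-style alignment, assuming the selection trades a target cost (spectral distance to the current source frame) against a concatenation cost (continuity between consecutive chosen target frames). The costs, the weight, and the dynamic-programming formulation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def unit_select_alignment(src, tgt, w_concat=1.0):
    """Pick one target frame per source frame by dynamic programming.

    src: (N, D) source features, tgt: (M, D) target features.
    Returns a list of N target-frame indices.
    """
    N, M = len(src), len(tgt)
    target_cost = np.linalg.norm(src[:, None] - tgt[None, :], axis=-1)  # (N, M)
    concat = np.linalg.norm(tgt[:, None] - tgt[None, :], axis=-1)       # (M, M)

    acc = target_cost[0].copy()          # best accumulated cost ending at each m
    back = np.zeros((N, M), dtype=int)   # backpointers for traceback
    for n in range(1, N):
        total = acc[:, None] + w_concat * concat   # rows: previous m, cols: current m
        back[n] = total.argmin(axis=0)
        acc = total.min(axis=0) + target_cost[n]
    # Trace back the cheapest path of target-frame indices.
    path = [int(acc.argmin())]
    for n in range(N - 1, 0, -1):
        path.append(int(back[n, path[-1]]))
    return path[::-1]
```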

Text-independent cross-language voice conversion

Interspeech 2006

So far, cross-language voice conversion requires at least one bilingual speaker and parallel speech data to perform the training. This paper shows how these obstacles can be overcome by means of a recently presented text-independent training method based on unit selection. The new method is evaluated in the framework of the European speech-to-speech translation project TC-Star and achieves a performance similar to that of text-dependent intralingual voice conversion.

A voice conversion method based on joint pitch and spectral envelope transformation

… International Conference on …, 2004

Most of the research in Voice Conversion (VC) is devoted to spectral transformation, while the conversion of prosodic features is essentially obtained through a simple linear transformation of pitch. These separate transformations lead to unsatisfactory speech conversion quality, especially when the speaking styles of the source and target speakers differ. In this paper, we propose a method capable of jointly converting pitch and spectral envelope information. The parameters to be transformed are obtained by combining scaled pitch values with the spectral envelope parameters for the voiced frames, and only spectral envelope parameters for the unvoiced ones. These parameters are clustered using a Gaussian Mixture Model (GMM). The transformation functions are then determined using a conditional expectation estimator. Tests carried out show that this process leads to a satisfactory pitch transformation. Moreover, it makes the spectral envelope transformation more robust.
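
The conditional-expectation estimator over a joint GMM can be written down compactly. Below is a sketch assuming joint vectors z = [x; y] of source and target features (for voiced frames, x and y would include a scaled log-F0 value alongside the envelope parameters, which the toy usage glosses over); scikit-learn's GaussianMixture stands in for whatever EM training the paper used.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=8):
    """Fit a GMM with full covariances on stacked joint vectors [x; y]."""
    return GaussianMixture(n_components, covariance_type="full",
                           random_state=0).fit(np.hstack([X, Y]))

def convert(gmm, X, dx):
    """Conditional expectation E[y | x] under the joint GMM."""
    mu_x, mu_y = gmm.means_[:, :dx], gmm.means_[:, dx:]
    S = gmm.covariances_
    Sxx, Syx = S[:, :dx, :dx], S[:, dx:, :dx]
    # Posterior p(k | x) from the marginal mixture over x.
    log_p = np.stack([multivariate_normal.logpdf(X, mu_x[k], Sxx[k])
                      for k in range(gmm.n_components)], axis=1)
    log_p += np.log(gmm.weights_)
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    Y_hat = np.zeros((len(X), mu_y.shape[1]))
    for k in range(gmm.n_components):
        A = Syx[k] @ np.linalg.inv(Sxx[k])          # per-component regression
        Y_hat += post[:, [k]] * (mu_y[k] + (X - mu_x[k]) @ A.T)
    return Y_hat

# Toy usage: x could be [scaled log-F0; envelope] for voiced frames.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
Y = X @ rng.normal(size=(4, 4)) + 0.1 * rng.normal(size=(500, 4))
Y_hat = convert(fit_joint_gmm(X, Y), X, dx=4)
```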

Text-Independent F0 Transformation with Non-Parallel Data for Voice Conversion

In voice conversion, frame-level mean and variance normalization is typically used for fundamental frequency (F0) transformation, which is text-independent and requires no parallel training data. Some advanced methods transform pitch contours instead, but require either parallel training data or syllabic annotations. We propose a method which retains the simplicity and text-independence of the frame-level conversion while yielding high-quality conversion. We achieve these goals by (1) introducing a text-independent tri-frame alignment method, (2) including delta features of F0 into Gaussian mixture model (GMM) conversion and (3) reducing the well-known GMM oversmoothing effect by F0 histogram equalization. Our objective and subjective experiments on the CMU Arctic corpus indicate improvements over both the mean/variance normalization and the baseline GMM conversion.
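
For reference, the frame-level baseline the paper starts from, plus one possible histogram-equalization step, look roughly like this. The quantile-mapping form of the equalizer and the voiced/unvoiced handling are assumptions for illustration; the paper's tri-frame alignment and delta-feature GMM are not reproduced here.

```python
import numpy as np

def mv_convert_f0(f0, src_stats, tgt_stats):
    """Frame-level mean/variance normalization of log-F0: the
    text-independent baseline. Stats are (mean, std) of voiced log-F0."""
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    out = np.zeros_like(f0, dtype=float)
    voiced = f0 > 0                        # unvoiced frames stay at 0
    out[voiced] = np.exp((np.log(f0[voiced]) - mu_s) / sd_s * sd_t + mu_t)
    return out

def histogram_equalize(converted, target_f0):
    """Quantile-map converted voiced F0 onto the target F0 histogram,
    one way to counteract the GMM over-smoothing effect."""
    q = np.linspace(0.0, 1.0, 101)
    return np.interp(converted, np.quantile(converted, q),
                     np.quantile(target_f0, q))
```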

Including dynamic and phonetic information in voice conversion systems

Proc. of the ICSLP'04, 2004

Voice Conversion (VC) systems modify a speaker's voice (the source speaker) so that it is perceived as if another speaker (the target speaker) had uttered it. Previously published VC approaches using Gaussian Mixture Models [1] perform the conversion on a frame-by-frame basis using only spectral information. In this paper, two new approaches are studied in order to extend GMM-based VC systems. First, dynamic information is used to build the speaker acoustic model, so that the transformation is carried out over sequences of frames. Second, phonetic information is introduced into the training of the VC system. Objective and perceptual results compare the performance of the proposed systems.
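
The "dynamic information" step can be illustrated with standard regression-based delta features appended to each frame, so the acoustic model sees short frame sequences rather than isolated frames. The window width and the way these features enter the paper's model are assumptions here.

```python
import numpy as np

def add_deltas(feats, width=2):
    """Append regression-based delta features to static features.

    feats: (T, D) static feature matrix. Uses the common delta formula
    delta_t = sum_k k * (x[t+k] - x[t-k]) / (2 * sum_k k^2) with edge padding.
    """
    T, _ = feats.shape
    num = np.zeros_like(feats, dtype=float)
    den = 2 * sum(k * k for k in range(1, width + 1))
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    for k in range(1, width + 1):
        num += k * (padded[width + k:width + k + T]
                    - padded[width - k:width - k + T])
    return np.hstack([feats, num / den])
```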

Voice conversion through transformation of spectral and intonation features

2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004

This paper presents a voice conversion method based on transformation of the characteristic features of a source speaker towards those of a target speaker. Voice characteristic features are grouped into two main categories: (a) the spectral features at formants and (b) the pitch and intonation patterns. Signal modelling and transformation methods for each group of voice features are outlined. The spectral features at formants are modelled using a set of two-dimensional phoneme-dependent HMMs. Subband frequency warping is used for spectrum transformation, with the subbands centred on the estimates of the formant trajectories. The F0 contour is used for modelling the pitch and intonation patterns of speech. A PSOLA-based method is employed for transformation of pitch, intonation patterns and speaking rate. The experiments include illustrations and perceptual evaluations of the transformations of the various voice features.
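
A heavily simplified stand-in for formant-anchored warping: map the frequency axis piecewise-linearly so that estimated source formant frequencies land on the target's, then resample the magnitude spectrum. The anchor scheme and the assumption of distinct in-band formants are illustrative; the paper's subband warping around formant trajectories is more involved.

```python
import numpy as np

def warp_spectrum(mag, src_formants, tgt_formants, sr=16000):
    """Piecewise-linear frequency warping of one magnitude spectrum frame.

    mag: magnitude spectrum covering 0..sr/2 with len(mag) bins.
    Formant lists must be distinct frequencies inside (0, sr/2) so the
    anchor sequences are strictly increasing.
    """
    n_bins = len(mag)
    freqs = np.linspace(0.0, sr / 2, n_bins)
    # Anchor points: DC, the formants, and Nyquist.
    src_anchor = np.concatenate(([0.0], np.sort(src_formants), [sr / 2]))
    tgt_anchor = np.concatenate(([0.0], np.sort(tgt_formants), [sr / 2]))
    # For each output (target-axis) frequency, find its source frequency...
    src_of = np.interp(freqs, tgt_anchor, src_anchor)
    # ...and sample the original spectrum there.
    return np.interp(src_of, freqs, mag)
```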

Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance

Interspeech 2018

Developing a voice conversion (VC) system for a particular speaker typically requires considerable data from both the source and target speakers. This paper aims to effectuate VC across arbitrary speakers, which we call any-to-any VC, with only a single target-speaker utterance. Two systems are studied: (1) the i-vector-based VC (IVC) system and (2) the speaker-encoder-based VC (SEVC) system. Phonetic PosteriorGrams are adopted as speaker-independent linguistic features extracted from speech samples. Both systems train a multi-speaker deep bidirectional long short-term memory (DBLSTM) VC model, taking in additional inputs that encode speaker identities, in order to generate the outputs. In the IVC system, the speaker identity of a new target speaker is represented by i-vectors. In the SEVC system, the speaker identity is represented by a speaker embedding predicted from a separately trained model. Experiments verify the effectiveness of both systems in achieving VC based only on a single target-speaker utterance. Furthermore, the IVC approach is superior to SEVC in terms of the quality of the converted speech and its similarity to the utterance produced by the genuine target speaker.
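
The conditioning mechanism both systems share can be sketched as a BLSTM that consumes per-frame linguistic features concatenated with a tiled per-utterance speaker vector (an i-vector in IVC, a learned embedding in SEVC). All layer sizes, the two-layer depth, and the output dimension below are invented for illustration; PyTorch is used for convenience.

```python
import torch
import torch.nn as nn

class ConditionedBLSTM(nn.Module):
    """Multi-speaker BLSTM converter conditioned on a speaker vector,
    in the spirit of the IVC/SEVC systems (dimensions illustrative)."""

    def __init__(self, ppg_dim=144, spk_dim=128, hidden=256, out_dim=80):
        super().__init__()
        self.rnn = nn.LSTM(ppg_dim + spk_dim, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, ppg, spk_vec):
        # Tile the per-utterance speaker vector (i-vector or embedding)
        # across time and feed it alongside the linguistic features.
        spk = spk_vec.unsqueeze(1).expand(-1, ppg.size(1), -1)
        h, _ = self.rnn(torch.cat([ppg, spk], dim=-1))
        return self.proj(h)

model = ConditionedBLSTM()
acoustic = model(torch.randn(2, 100, 144), torch.randn(2, 128))  # (2, 100, 80)
```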

Non-Parallel Voice Conversion Using Joint Optimization of Alignment by Temporal Context and Spectral Distortion

Many voice conversion systems require parallel training sets of the source and target speakers. Non-parallel training is more complicated, as it involves evaluation of source-target correspondence along with the conversion function itself. INCA is a recently proposed method for non-parallel training, based on iterative estimation of alignment and conversion function. The alignment is evaluated using a simple nearest-neighbor search, which often leads to phonetically mismatched source-target pairs. We propose here a generalized approach, denoted as Temporal-Context INCA (TC-INCA), based on matching temporal context vectors. We formulate the training stage as a minimization problem of a joint cost, considering both context-based alignment and conversion function. We show that TC-INCA reduces the joint cost and prove its convergence. Experimental results indicate that TC-INCA significantly improves the alignment accuracy, compared to INCA. Moreover, subjective evaluations show that TC-INCA...
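
The alternation the abstract describes can be condensed to a few lines: align context vectors of the currently converted source to the target, refit the conversion, and repeat. An affine map stands in for the conversion function, and the context width, iteration count, and brute-force nearest-neighbor search are simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def context_vectors(feats, c=1):
    """Stack each frame with its +/-c neighbours (temporal context)."""
    T = len(feats)
    padded = np.pad(feats, ((c, c), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * c + 1)])

def tc_inca(src, tgt, iters=5, c=1):
    """Alternate context-based alignment and conversion-function updates."""
    conv = src.copy()
    for _ in range(iters):
        cs, ct = context_vectors(conv, c), context_vectors(tgt, c)
        d = np.linalg.norm(cs[:, None] - ct[None, :], axis=-1)
        idx = d.argmin(axis=1)                       # context-based alignment
        X = np.hstack([src, np.ones((len(src), 1))])
        W, *_ = np.linalg.lstsq(X, tgt[idx], rcond=None)  # conversion update
        conv = X @ W                                 # re-convert, then realign
    return W, idx
```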