Vincent Pollet - Academia.edu

Papers by Vincent Pollet

Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning

ArXiv, 2021

Deep learning models are becoming predominant in many fields of machine learning. Text-to-Speech (TTS), the process of synthesizing artificial speech from text, is no exception. To this end, a deep neural network is usually trained on a corpus of several hours of recorded speech from a single speaker. Producing the voice of a speaker other than the one learned is expensive and labor-intensive, since a new dataset must be recorded and the model retrained; this is the main reason why TTS models are usually single-speaker. The proposed approach aims to overcome these limitations by obtaining a system able to model a multi-speaker acoustic space. This allows the generation of speech audio similar to the voices of different target speakers, even if they were not observed during the training phase.
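
As a rough illustration of the transfer-learning idea, the sketch below conditions a toy TTS decoder on a fixed-size speaker embedding produced by a separately trained encoder. All module names, layer sizes, and the PyTorch framing are assumptions for illustration, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps a reference utterance (mel frames) to a fixed speaker embedding."""
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, emb_dim, batch_first=True)

    def forward(self, mels):                  # mels: (B, T, n_mels)
        _, (h, _) = self.lstm(mels)
        return torch.nn.functional.normalize(h[-1], dim=-1)  # (B, emb_dim)

class MultiSpeakerTTS(nn.Module):
    """Toy decoder whose input is text embeddings concatenated with the
    speaker embedding, so an unseen voice only needs a new embedding."""
    def __init__(self, vocab=100, text_dim=512, emb_dim=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, text_dim)
        self.decoder = nn.GRU(text_dim + emb_dim, 512, batch_first=True)
        self.to_mel = nn.Linear(512, n_mels)

    def forward(self, text_ids, spk_emb):     # text_ids: (B, L)
        x = self.embed(text_ids)
        spk = spk_emb.unsqueeze(1).expand(-1, x.size(1), -1)
        out, _ = self.decoder(torch.cat([x, spk], dim=-1))
        return self.to_mel(out)               # coarse mel prediction
```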

Statistical corpus-based speech segmentation

Interspeech 2004, 2004

An automatic speech segmentation technique is presented that is based on the alignment of a target speech signal with a set of different reference speech signals generated by a specially designed corpus-based speech synthesis system that additionally generates phoneme boundary markers. Each reference signal is then warped to the target speech signal. By synthesizing and warping many different reference speech signals, each phoneme boundary of the target signal is characterized by a distribution of warped phoneme boundary positions. The boundary distributions are statistically and acoustically processed to generate the final segmentation. First, some problems related to manual and automatic phoneme segmentation are addressed. Then the technique of Statistical Corpus-based Segmentation (SCS) is introduced. Finally, intra- and inter-speaker segmentation results are presented.
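
The warping step can be pictured with a plain dynamic-time-warping alignment: each synthesized reference contributes one warped position per phoneme boundary, and a robust statistic over those positions yields the final segmentation. The sketch below is a minimal stand-in; the function names, frame features, and the median vote are illustrative assumptions, not the SCS processing described in the paper.

```python
import numpy as np

def dtw_path(ref, tgt):
    """Return a monotonic alignment path between two feature sequences."""
    n, m = len(ref), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - tgt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from the end; the inf border keeps indices valid
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda ij: cost[ij])
    return path[::-1]

def warp_boundaries(ref_feats, tgt_feats, ref_boundaries):
    """Map reference phoneme-boundary frame indices onto the target signal."""
    path = dtw_path(ref_feats, tgt_feats)
    mapping = {}
    for i, j in path:
        mapping.setdefault(i, j)              # first target frame aligned to i
    return [mapping[b] for b in ref_boundaries]

def combine(warped_per_reference):
    """Each reference gives one warped position per boundary; a robust
    statistic over the distribution yields the final segmentation."""
    return np.median(np.array(warped_per_reference), axis=0)
```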

Refined statistical model tuning for speech synthesis

This paper describes a number of approaches to refine and tune statistical models for speech synthesis. The first approach is to tune the sizes of the decision trees for central phonemes in a context. The second approach is a refinement technique for HMM models: a variable number of states for hidden semi-Markov models is emulated. A so-called "hard state-skip" training technique is introduced into the standard forward-backward training. The results show that both the tuning and refinement techniques lead to increased flexibility for speech synthesis modeling.
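
The state-skip idea can be pictured on the transition matrix of a left-to-right model: forcing predecessors to hop over a state removes it from the effective topology, emulating a model with fewer states. The construction below is a heavily simplified sketch with invented probabilities, not the paper's forward-backward training modification.

```python
import numpy as np

def left_to_right_transitions(n_states, skipped=frozenset()):
    """Left-to-right topology in which every state in `skipped` is
    bypassed: predecessors jump straight over it, so it never emits and
    the model behaves as if it had fewer states."""
    A = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        nxt = s + 1
        while nxt in skipped:           # hard skip: hop over removed states
            nxt += 1                    # (assumes the final state is kept)
        A[s, s] = 0.6                   # dwell probability (illustrative)
        A[s, nxt] = 0.4                 # advance probability
    A[-1, -1] = 1.0                     # absorbing final state
    return A

# A 5-state chain with state 2 hard-skipped is effectively a 4-state model.
print(left_to_right_transitions(5, skipped={2}))
```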

Synthèse par génération et concaténation de segments multiforme (Synthesis by generation and concatenation of multiform segments)

The invention concerns a speech synthesis system or method. A speech-segment database references speech segments having a variety of different speech-representation structures. A speech-segment selector selects from the database a sequence of candidate speech segments corresponding to a target text. A speech-segment sequencer generates, from the candidate segments, sequenced speech segments corresponding to the target text. A speech-segment synthesizer combines the selected sequenced speech segments to produce a synthesized speech signal output corresponding to the target text.
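
For orientation, a skeleton of the pipeline the claims describe (database, selector, sequencer, synthesizer) might look like the following; every class and method name here is hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpeechSegment:
    representation: str        # e.g. "template" or "model"
    payload: object            # waveform fragment or model parameters

class SegmentDatabase:
    def __init__(self, segments: List[SpeechSegment]):
        self.segments = segments

class SegmentSelector:
    def candidates(self, db: SegmentDatabase, target_text: str) -> List[SpeechSegment]:
        # Look up candidate segments matching the target text (stubbed).
        return db.segments

class SegmentSequencer:
    def sequence(self, candidates: List[SpeechSegment]) -> List[SpeechSegment]:
        # Order and prune candidates into the segment sequence for the text.
        return candidates

class SegmentSynthesizer:
    def combine(self, sequenced: List[SpeechSegment]) -> bytes:
        # Concatenate and render the sequenced segments into audio (stubbed).
        return b""

def synthesize(db: SegmentDatabase, text: str) -> bytes:
    sel, seq, syn = SegmentSelector(), SegmentSequencer(), SegmentSynthesizer()
    return syn.combine(seq.sequence(sel.candidates(db, text)))
```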

Corpus-based speech synthesis on the basis of segment recombination

Synthesis by generation and concatenation of multiform segments

Interspeech 2008, 2008

Machine-generated speech can be produced in different ways; however, two basic methods for synthesizing speech are in widespread use. One method generates speech from models, while the other concatenates pre-stored speech segments. This paper presents a speech synthesis technique in which these two basic synthesis methods are combined in a statistical framework. Synthetic speech is constructed by generation and concatenation of so-called "multiform segments". Multiform segments are different speech signal representations: synthesis models, templates, and synthesis models augmented with template information. An evaluation of the multiform segment synthesis technique shows improvements over traditional concatenative methods of synthesis.
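
A toy version of the per-segment decision might compare a template cost against a model cost inside one framework, as sketched below; the cost terms and threshold rule are invented placeholders rather than the paper's statistical criteria.

```python
def choose_form(template_cost, model_cost, concat_penalty):
    """Pick a stored waveform template when a good natural fragment
    exists, else fall back to model generation; the concatenation
    penalty discourages templates that join poorly."""
    return "template" if template_cost + concat_penalty < model_cost else "model"

def build_multiform_sequence(units):
    """units: list of dicts with per-unit candidate costs."""
    return [(u["name"],
             choose_form(u["template_cost"], u["model_cost"], u["concat_penalty"]))
            for u in units]

units = [
    {"name": "h@",  "template_cost": 0.2, "model_cost": 0.5, "concat_penalty": 0.1},
    {"name": "l@U", "template_cost": 0.9, "model_cost": 0.4, "concat_penalty": 0.3},
]
print(build_multiform_sequence(units))  # [('h@', 'template'), ('l@U', 'model')]
```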

Psychoacoustic segment scoring for multi-form speech synthesis

Interspeech 2012, 2012

In multi-form segment synthesis, output speech is constructed by splicing waveform segments with statistically modeled and regenerated parametric speech segments. The fraction of model-derived segments is called the model-template ratio. The motivation of this work is to further increase the flexibility of multi-form synthesis while maintaining high speech quality at high model-template ratios. An approach is presented in which the representation type of a segment is selected per acoustic leaf. We introduce a novel method for leaf representation selection based on a psychoacoustic segment stationarity score. Additionally, refinements in multi-form segment concatenation are presented, including boundary-constrained statistical parametric synthesis and time-domain alignment based on multi-peak analysis of cross-correlation for high model-template ratio multi-form synthesis.
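
One plausible stand-in for a segment stationarity score is normalized spectral flux: a segment whose spectra change little from frame to frame is a safer candidate for model-based regeneration. The sketch below illustrates that intuition only; it is not the psychoacoustic measure proposed in the paper.

```python
import numpy as np

def stationarity_score(frames):
    """frames: (T, n_bins) magnitude spectra of one segment.
    Returns a score in (0, 1]; higher means more stationary."""
    if len(frames) < 2:
        return 1.0
    flux = np.linalg.norm(np.diff(frames, axis=0), axis=1)   # frame-to-frame change
    norm = np.linalg.norm(frames[:-1], axis=1) + 1e-9
    return float(1.0 / (1.0 + np.mean(flux / norm)))

# Segments scoring above a threshold would be rendered from models,
# keeping templates for transient, hard-to-model material.
```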

Uniform speech parameterization for multi-form segment synthesis

Interspeech 2011, 2011

In multi-form segment synthesis, speech is constructed by sequencing speech segments of a different nature: model segments, i.e., mathematical abstractions of speech, and template segments, i.e., speech waveform fragments. These multi-form segments can have shared, layered, or alternate speech parameterization schemes. This paper introduces an advanced uniform speech parameterization scheme for the statistical model segments and waveform segments employed in our multi-form segment synthesis system. A Mel-Regularized Cepstrum derived from amplitude and phase spectra forms its basic framework. Furthermore, a new adaptive enhancement technique for model segments is presented that reduces the perceived gap in quality and similarity between model and template segments.
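
To make the parameterization concrete, the sketch below computes a mel-warped cepstrum from an amplitude spectrum using a first-order all-pass frequency warp (alpha near 0.42 approximates the mel scale at 16 kHz sampling). This is generic mel-cepstral analysis for illustration only; it is not the paper's Mel-Regularized Cepstrum and it ignores the phase-spectrum component.

```python
import numpy as np

def mel_cepstrum(amplitude_spectrum, n_coeffs=25, alpha=0.42):
    """Warp the frequency axis toward a mel-like scale with a first-order
    all-pass factor alpha, then take the real cepstrum of the log spectrum."""
    n_bins = len(amplitude_spectrum)
    omega = np.pi * np.arange(n_bins) / (n_bins - 1)
    # phase response of the all-pass warp; monotone from 0 to pi
    warped = np.angle((np.exp(1j * omega) - alpha) / (1 - alpha * np.exp(1j * omega)))
    log_amp = np.log(np.maximum(amplitude_spectrum, 1e-9))
    # resample the log spectrum uniformly on the warped frequency axis
    log_warped = np.interp(np.linspace(0, np.pi, n_bins), warped, log_amp)
    # symmetric extension + inverse FFT gives real cepstral coefficients
    sym = np.concatenate([log_warped, log_warped[-2:0:-1]])
    return np.fft.ifft(sym).real[:n_coeffs]
```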

Refined inter-segment joining in multi-form speech synthesis

Interspeech 2014, 2014

In multi-form speech synthesis, speech output is constructed by splicing waveform segments and parametric speech segments generated from statistical models. The decision whether to use the waveform or the statistical parametric form is made per segment. This approach faces certain challenges in the context of inter-segment joining. In this work, we present a novel method whereby all non-contiguous joints are represented by statistically generated speech frames without compromising naturalness. Speech frames surrounding non-contiguous joints between the waveform segments are regenerated from the models and optimized for concatenation. In addition, a novel pitch smoothing algorithm that preserves the original intonation trajectory while maintaining smoothness is applied. We implemented the spectrum and pitch smoothing algorithms within a multi-form speech synthesis framework that employs a uniform parametric representation for the natural and statistically modeled speech segments. This framework facilitates pitch modification in natural segments. Subjective evaluation results reveal that the proposed smoothing methods significantly improve perceived speech quality.
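
The flavor of trajectory-preserving smoothing can be shown on a logF0 contour: measure the discontinuity at a join once, then fade a correction out over a short window on both sides, leaving the rest of the contour untouched. The window length and linear fade below are illustrative choices, not the paper's algorithm.

```python
import numpy as np

def smooth_pitch_at_join(logf0, join_idx, half_window=10):
    """logf0: per-frame log-pitch contour (assumed voiced/continuous);
    join_idx: first frame of the right-hand segment. Returns a smoothed
    copy whose global shape is preserved away from the join."""
    out = logf0.copy()
    jump = logf0[join_idx] - logf0[join_idx - 1]          # discontinuity size
    n = min(half_window, join_idx, len(logf0) - join_idx)
    fade = np.linspace(0.5, 0.0, n, endpoint=False)       # decaying correction
    out[join_idx:join_idx + n] -= jump * fade             # pull right side in
    out[join_idx - n:join_idx] += jump * fade[::-1]       # and left side up
    return out
```

Splitting the correction evenly across both sides closes the step at the join while each side moves by at most half the jump, so the original intonation trajectory is largely preserved.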

Unit Selection with Hierarchical Cascaded Long Short Term Memory Bidirectional Recurrent Neural Nets

Interspeech 2017, 2017

Bidirectional recurrent neural nets have demonstrated state-of-the-art performance for parametric speech synthesis. In this paper, we introduce a top-down application of recurrent neural net models to unit-selection synthesis. A hierarchical cascaded network graph predicts context phone duration, speech unit encoding, and frame-level logF0 information that serve as targets for the unit search. The new approach is compared with an existing state-of-the-art hybrid system that uses Hidden Markov Models as the basis for the statistical unit search.
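
A compact sketch of such a cascade is given below: a phone-level bidirectional LSTM predicts durations and unit encodings, and its hidden states are upsampled to frame rate to drive a second bidirectional LSTM that predicts logF0. The layer sizes and cascade wiring are assumptions, not the paper's network graph.

```python
import torch
import torch.nn as nn

class CascadedUnitTargetNet(nn.Module):
    def __init__(self, feat_dim=60, unit_dim=32, hidden=128):
        super().__init__()
        self.phone_rnn = nn.LSTM(feat_dim, hidden, batch_first=True,
                                 bidirectional=True)
        self.duration_head = nn.Linear(2 * hidden, 1)       # frames per phone
        self.unit_head = nn.Linear(2 * hidden, unit_dim)    # unit encoding target
        self.frame_rnn = nn.LSTM(2 * hidden, hidden, batch_first=True,
                                 bidirectional=True)
        self.logf0_head = nn.Linear(2 * hidden, 1)          # frame-level logF0

    def forward(self, phone_feats, frames_per_phone):
        """phone_feats: (1, P, feat_dim) context features for one utterance;
        frames_per_phone: LongTensor of length P (shared across the batch)."""
        h, _ = self.phone_rnn(phone_feats)                  # (1, P, 2H)
        durations = self.duration_head(h).squeeze(-1)
        unit_codes = self.unit_head(h)
        # upsample phone-level states to frame rate per the given durations
        frame_in = h.repeat_interleave(frames_per_phone, dim=1)
        f, _ = self.frame_rnn(frame_in)
        logf0 = self.logf0_head(f).squeeze(-1)
        return durations, unit_codes, logf0
```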

System and Method for Automatic Prediction of Speech Suitability for Statistical Modeling

Corpus-based speech synthesis based on segment recombination

Coherent modification of pitch and energy for expressive prosody implantation

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015

In expressive TTS and voice transformation systems, implantation of expressive prosody derived from external out-of-domain sources often leads to extreme pitch modification that compromises the naturalness of the synthesized speech. In this work we investigate and prove the hypothesis that the naturalness loss is partly attributable to a violation of a fundamental relationship between the instantaneous pitch frequency and the instantaneous energy of a speech signal. We propose an enhancement for pitch modification in which the instantaneous energy is modified coherently with the pitch frequency, and demonstrate the potential of this method in a subjective listening evaluation. The proposed approach is complementary to, and can be combined with, spectrum shape transformation methods to achieve the maximal possible quality of pitch modification.
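
The coherence idea can be caricatured in a few lines: when a frame's pitch is scaled, its energy is scaled in tandem rather than left at the original level. The power-law coupling below is an assumed placeholder, not the relationship established in the paper.

```python
import numpy as np

def modify_pitch_coherently(frame, pitch_factor, coupling=1.0):
    """Scale a voiced frame's energy together with its pitch factor.
    frame: waveform samples; pitch_factor: new_f0 / old_f0."""
    gain = pitch_factor ** coupling          # energy follows pitch
    return frame * np.sqrt(gain)             # amplitude scales as sqrt(energy)

# Raising pitch by an octave (factor 2) with coupling=1 doubles the frame
# energy, mimicking the natural covariation of pitch and energy.
```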
