Ignasi Sanz - Academia.edu
Papers by Ignasi Sanz
Annual Conference of the International Speech Communication Association, 2003
A Hybrid Method Oriented to Concatenative Text-to-Speech Synthesis. Ignasi Iriondo, Francesc Alías, Javier Sanchis, Javier Melenchón. [Garbled excerpt; the recoverable equation (4) appears to be a weighted squared error, ε = Σ_t w²(t)·(s(t) − h(t))², where w(t) represents the Hanning window and V is the closest integer to the local pitch period.]
Annual Conference of the International Speech Communication Association, 2009
Annual Conference of the International Speech Communication Association, 2009
This paper describes our participation in the INTERSPEECH 2009 Emotion Challenge (1). Starting from our previous experience in the use of automatic classification for the validation of an expressive corpus, we have tackled the difficult task of emotion recognition from speech with real-life data. Our main contribution to this work is related to the classifier sub-challenge, for which we tested
This paper describes a multi-domain text-to-speech (MD-TTS) synthesis strategy for generating speech across different domains, thereby increasing the flexibility of high-quality TTS systems. To that effect, MD-TTS introduces a flexible TTS architecture that includes an automatic domain classification module, which allows MD-TTS systems to be implemented with different synthesis strategies and speech corpus typologies. In this
The quality of corpus-based text-to-speech systems depends on the accuracy of the unit selection process, which relies on the values of the weights of the cost function. This paper focuses on defining a new framework for tuning these weights. We propose a technique for taking the subjective perception of speech into account in the selection process
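The weighted cost function mentioned above can be sketched as follows. This is a minimal illustration of the usual unit-selection formulation (a weighted sum of target and concatenation subcosts), not the paper's actual implementation; all names and numbers are illustrative.

```python
# Hypothetical sketch of a weighted unit-selection cost: each candidate
# unit is scored by a weighted sum of target subcosts (how well it fits
# the desired prosodic/linguistic targets) plus weighted join subcosts
# (how smoothly it concatenates with its neighbour).

def selection_cost(target_subcosts, join_subcosts, target_weights, join_weights):
    """Weighted sum of target and concatenation (join) subcosts."""
    target_cost = sum(w * c for w, c in zip(target_weights, target_subcosts))
    join_cost = sum(w * c for w, c in zip(join_weights, join_subcosts))
    return target_cost + join_cost

# Example: two target subcosts (e.g. pitch, duration) and one join subcost.
cost = selection_cost([0.2, 0.5], [0.1], [1.0, 0.5], [2.0])
```

Tuning frameworks such as the one the abstract proposes adjust the weight vectors so that the lowest-cost unit sequence matches listeners' perceptual preferences.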
Proceedings. IEEE International Conference on Multimedia and Expo, 2002
This paper describes a 2D realistic talking face. The facial appearance model is constructed with a parameterised 2D sample-based model. This representation supports moderate head movements, facial gestures and emotional expressions. Two main contributions for talking-head applications are proposed. First, the image of the lips is synthesized by means of shape and texture information. Second, a nearly automated training process makes talking-face personalization easier, thanks to the use of mouth tracking. Additionally, the lips are synchronized in real time with speech generated using a SAPI-compliant text-to-speech engine.
ABSTRACT: This work presents a new procedure for measuring the voice quality (VoQ) parameters jitter and shimmer. The new procedure takes the prosody of the utterance into account, so that its effect is attenuated before each parameter is measured. The goal, beyond obtaining a more reliable measurement, is to modify these parameters so that they can be used in expressive speech synthesis; accordingly, alongside the new analysis procedure, we present how to carry out the modification of both. Finally, an evaluation is performed by means of a CMOS perceptual test on four expressive styles (aggressive, happy, sensual and sad) produced by a text-to-speech system with prosodic modelling, studying the usefulness of these parameters under different conditions.
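The jitter and shimmer measures discussed above can be sketched with their common local definitions (relative average perturbation of consecutive cycles). This is a baseline illustration, not the prosody-compensated procedure the paper proposes; the input values are made up.

```python
# Hedged sketch of local jitter and shimmer, assuming pitch periods and
# cycle peak amplitudes have already been extracted from the waveform.

def local_jitter(periods):
    """Mean absolute difference of consecutive periods, relative to the mean period."""
    diffs = [abs(periods[i] - periods[i - 1]) for i in range(1, len(periods))]
    return (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def local_shimmer(amplitudes):
    """The same perturbation measure applied to cycle peak amplitudes."""
    diffs = [abs(amplitudes[i] - amplitudes[i - 1]) for i in range(1, len(amplitudes))]
    return (sum(diffs) / len(diffs)) / (sum(amplitudes) / len(amplitudes))

periods = [7.9, 8.0, 8.1, 8.0]      # pitch periods in ms (illustrative)
amps = [0.50, 0.52, 0.49, 0.51]     # cycle peak amplitudes (illustrative)
jitter = local_jitter(periods)
shimmer = local_shimmer(amps)
```

The paper's contribution is to attenuate the prosodic (F0 and intensity contour) component before computing these perturbations, so that slow prosodic movement is not counted as cycle-to-cycle irregularity.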
This article presents the use of analogical learning, in particular case-based reasoning, as a tool for the automatic generation of prosody from text that has been automatically labelled with prosodic attributes. It is a corpus-based method for the quantitative modelling of prosody and its estimation in a text-to-speech system. The main objective is to obtain a common method for predicting the three main prosodic features: the fundamental frequency (F0) contour, segmental duration and intensity. An objective and subjective evaluation has been carried out to assess its use in expressive speech synthesis.
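The retrieval step at the heart of case-based reasoning can be sketched as a nearest-neighbour lookup: given a new text unit encoded as prosodic attributes, reuse the prosody of the most similar labelled case. The attribute encoding, distance, and feature names below are assumptions for illustration, not the paper's actual scheme.

```python
# Illustrative CBR retrieval: find the stored case whose attribute
# vector is closest (squared Euclidean distance) to the query, and
# reuse its prosodic features (F0, duration, energy).

def retrieve(case_base, query):
    """Return the prosody of the nearest case to the query attributes."""
    def dist(attrs):
        return sum((a - q) ** 2 for a, q in zip(attrs, query))
    best = min(case_base, key=lambda case: dist(case["attrs"]))
    return best["prosody"]

case_base = [
    {"attrs": [1, 0, 2], "prosody": {"f0": 180.0, "dur": 90.0, "energy": 0.6}},
    {"attrs": [0, 1, 1], "prosody": {"f0": 120.0, "dur": 110.0, "energy": 0.4}},
]
pred = retrieve(case_base, [1, 0, 1])
```

A full CBR cycle would also adapt the retrieved solution to the query context and retain the new case; only retrieval is shown here.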
This paper presents the text-to-speech (TTS) synthesis system of La Salle (Universitat Ramon Llull, URL) and its adaptation to the Albayzin Evaluation Campaign of the FALA2010 conference. The URL-TTS system follows the classical scheme of unit-selection TTS synthesis systems. However, it presents two distinguishing particularities: i) prosody prediction learned from labelled data by means of Case-Based Reasoning (CBR), and ii) perceptual weight tuning by means of active interactive Genetic Algorithms (aiGA). The aiGA-based weights are compared to multilinear regression (MLR) weights, considering both the classical averaged cost function and its root-mean-squared variant. The internal validation tests and the results of the evaluation campaign are described and finally discussed.
Lecture Notes in Computer Science, 2000
... paper describes a Unit Selection system based on diphones that was developed by the Speech Technology Group of the Enginyeria Arquitectura La Salle ... It is well known that the segmental quality of synthetic speech is limited by the number of joins that can be encountered in ...
Lecture Notes in Computer Science, 2004
This paper describes an initial approach to emotional speech synthesis in Catalan based on a diphone concatenation TTS system. The main goal of this work is to develop a simple prosodic model for expressive synthesis. This model is obtained from an emotional speech collection artificially generated by means of a copy-prosody experiment. After validating the emotional content of this collection, the model was automated and incorporated into our TTS system. Finally, the automatic speech synthesis system has been evaluated by means of a perceptual test, obtaining encouraging results.
Lecture Notes in Computer Science, 2004
A new algorithm is presented for the incremental learning and non-intrusive tracking of the appearance of a previously unseen face.
Lecture Notes in Computer Science, 2007
This paper presents the validation of the expressive content of an acted corpus produced to be used in speech synthesis. Acted speech can be rather lacking in authenticity, and therefore validation of its expressiveness is required. The goal is to obtain an automatic classifier able to prune bad utterances, i.e. those with wrong expressiveness. Firstly, a subjective test has
Lecture Notes in Computer Science, 2007
Hidden Markov Model based text-to-speech (HMM-TTS) synthesis is a technique for generating speech from trained statistical models in which the spectrum, pitch and durations of basic speech units are modelled together. The aim of this work is to describe a Spanish HMM-TTS system that uses an external machine learning technique to help improve expressiveness. System performance is analysed objectively and subjectively. The experiments were conducted on a reliably labelled speech corpus, whose units were clustered using contextual factors based on the Spanish language. The results show that the CBR-based F0 estimation improves the HMM-based baseline when synthesizing non-declarative short sentences, while duration accuracy is similar for the CBR and HMM systems.
Lecture Notes in Computer Science, 2007
... 2 Building Emotional Speech Corpora ... This corpus had a twofold purpose: to learn the acoustic models of emotional speech and to be used as the speech unit database for the synthesizer. This section describes the steps followed in the production of the corpus. ...
Lecture Notes in Computer Science, 2010
This paper describes high-quality Spanish HMM-based speech synthesis of emotional speaking styles. The quality of the HMM-based speech synthesis is enhanced by using the most recent features presented for the Blizzard system (i.e. STRAIGHT spectrum extraction and mixed excitation). Two techniques are evaluated: first, a method that simultaneously models all emotions within a single acoustic model; second, an adaptation technique that converts a neutral style to a target emotion. We consider three kinds of emotional expression: neutral, happy and sad. A subjective evaluation shows the quality of the system and the intensity of the produced emotion, while an objective evaluation based on voice quality parameters assesses the effectiveness of the approaches.
Emulating Subjective Criteria in Corpus Validation (9781599048499): Ignasi Iriondo, Santiago Planet, Francesc Alías, Joan-Claudi Socoró, Elisa Martínez: Book Chapters.
Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429), 2003
This paper presents a new method named text-to-visual synthesis with appearance models (TEVISAM) for generating videorealistic talking heads. In a first step, the system automatically learns a person-specific facial appearance model (PSFAM). The PSFAM allows all facial components (e.g. eyes, mouth, etc.) to be modelled independently, and it is used to animate the face dynamically from the input text. As reported by other researchers, one of the key aspects in visual synthesis is the coarticulation effect. To solve this problem, we introduce a new interpolation method in the high-dimensional appearance space, allowing the creation of photorealistic and videorealistic avatars. In this work, preliminary experiments synthesizing virtual avatars from text are reported. Summarizing, this paper introduces three novelties: first, we make use of colour PSFAMs to animate virtual avatars; second, we introduce a non-linear high-dimensional interpolation to achieve videorealistic animations; finally, the method allows the generation of new expressions by modelling the different facial elements.
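The interpolation idea above can be illustrated in its simplest form: blending between two appearance parameter vectors to produce intermediate frames of a viseme transition. Plain linear interpolation is shown as a baseline; the paper's non-linear scheme is not reproduced, and the vectors are made up.

```python
# Baseline sketch: linear blend between two appearance-model parameter
# vectors (e.g. two mouth shapes), with alpha in [0, 1] controlling the
# position along the transition.

def interpolate(v0, v1, alpha):
    """Return the elementwise blend (1 - alpha) * v0 + alpha * v1."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(v0, v1)]

# Frame halfway between two illustrative viseme parameter vectors.
mid = interpolate([0.0, 2.0], [1.0, 4.0], 0.5)
```

Linear blends in appearance space tend to produce ghosting on fast articulator movement, which is one motivation for interpolating non-linearly as the abstract proposes.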
Lecture Notes in Computer Science, 2007
Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (IEEE Cat. No.03EX795), 2004
This paper proposes a new method for lip animation of a personalized facial model from auditory speech. It is based on Bayesian estimation and person-specific appearance models (PSFAM). Initially, a video of a speaking person is recorded, from which the visual and acoustic features of the speaker and their relationship are learnt. First, the visual information of the speaker is stored in a colour PSFAM by means of a registration algorithm. Second, the auditory features are extracted from the waveform attached to the recorded video sequence. Third, the relationship between the learnt PSFAM and the auditory features of the speaker is represented by Bayesian estimators. Finally, subjective perceptual tests are reported in order to measure the intelligibility of the preliminary results when synthesizing isolated words.