Asunción Moreno - Academia.edu (original) (raw)

Papers by Asunción Moreno

This paper describes a joint initiative of the Catalan and Spanish Government to produce Language... more This paper describes a joint initiative of the Catalan and Spanish Government to produce Language Resources for the Catalan language. A similar methodology to the Basic Language Resource Kit (BLARK) concept was applied to determine the priorities on the production of the Language Resources. The paper shows the LR and tools currently available for the Catalan Language both for Language and Speech technologies. The production of large databases for Automatic Speech Recognition purposes already started. All the resources generated in the project follow EU standards, will be validated by an external centre and will be free and public available through ELRA.

In this paper, our flexible harmonic/stochastic waveform generator for a speech synthesis system ... more In this paper, our flexible harmonic/stochastic waveform generator for a speech synthesis system is presented. The speech is modeled as the superposition of two components: a harmonic component and a stochastic or aperiodic component. The purpose of this representation is to provide a framework with maximum flexibility for all kind of speech transformations. In contrast to other similar systems found in the literature, like HNM, our system can operate using constant frame rate instead of a pitch-synchronous scheme. Thus, the analysis process is simplified, while the phase coherence is guaranteed by the new prosodic modification and concatenation procedures that have been designed for this scheme. As the system was created for voice conversion applications, in this work, as a previous step, we validate its performance in a speech synthesis context by comparing it to the well-known TD-PSOLA technique, using four different voices and different synthesis database sizes. The opinions of the listeners indicate that the methods and algorithms described are preferred rather than PSOLA, and thus are suitable for high-quality speech synthesis and for further voice transformations.

TC-STAR Workshop on …, 2006

This paper presents the baseline text-to-speech system developed at UPC (Ogmios) plus our recent ... more This paper presents the baseline text-to-speech system developed at UPC (Ogmios) plus our recent work on speech prosody generation and the procedures to create high quality language resources for speech synthesis. These contributions have been evaluated within the TC-STAR European project, which is focused on speech-to-speech translation. Several presented contributions have been developed in order to adapt the TTS component to the speech-to-speech translation framework. In this application, the input text is not writtenstyle text but transcriptions of talks. Moreover, we have to cope with errors coming from the speech recognition and speech translation engines. However, in speech-to-speech translation, the source speech can be used as a valuable source of information to generate the target prosody. The general framework and first results are presented in the paper.

As a part of the IST project Interface ("Multimodal Analysis/Synthesis System for Human Interacti... more As a part of the IST project Interface ("Multimodal Analysis/Synthesis System for Human Interaction to Virtual and Augmented environments"), an emotional speech database for Slovenian, English, Spanish, and French language has been recorded. The database is designed for general study of emotional speech as well as analysis of emotion characteristics for speech synthesis and for automatic emotion classification purposes. Six emotions have been defined: anger, sadness, joy, fear, disgust and surprise. The neutral styles were also recorded. One male speaker and one female speaker have been recorded, except for English language where two mail and one female speaker have been recorded. All the speakers are actors. The corpuses consist of 175-190 sentences for each language. For Spanish and Slovenian databases subjective evaluation tests have been made. The recorded Interface emotional speech database represents a good basis for emotional speech analysis and is also useful in synthesis of emotional speech.

First European Conference on Speech Communication and Technology (Eurospeech 1989)

This communication reports the use of demisyllables for continuous speech recognition in a specif... more This communication reports the use of demisyllables for continuous speech recognition in a specific application: the recognition of spanish numbers. After a briet outline of the recognition system, a description of demisyllable syntactic constraints and one-speaker reference generation is provided. Finally, the recognition performance is assessed by means of two experiments: the recognition of integer numbers from zero to one thousand and telephone numbers uttered in a spanish way (strings of integers from zero to ninety nine); in both applications the results that the system yielded were excellent.

Interspeech 2005, 2005

In the framework of the EU-funded Project LC-STAR, a set of Language Resources (LR) for all the S... more In the framework of the EU-funded Project LC-STAR, a set of Language Resources (LR) for all the Speech to Speech Translation components (Speech recognition, Machine Translation and Speech Synthesis) was developed. This paper deals with the development of bilingual corpora in Spanish, US English and Catalan. The corpora were obtained from spontaneous dialogues in one of these three languages which were translated to the other two languages. The paper describes the translation methodology, specific problems of translating spontaneous dialogues to be used for MT training, formats and the validation criteria.

Interspeech 2007, 2007

Most of the existing voice conversion methods calculate the optimal transformation function from ... more Most of the existing voice conversion methods calculate the optimal transformation function from a given set of paired acoustic vectors of the source and target speakers. The alignment of the phonetically equivalent source and target frames is problematic when the training corpus available is not parallel, although this is the most realistic situation. The alignment task is even more difficult in cross-lingual applications because the phoneme sets may be different in the involved languages. In this paper, a new iterative alignment method based on acoustic distances is proposed. The method is shown to be suitable for text-independent and cross-lingual voice conversion, and the conversion scores obtained in our evaluation experiments are not far from the performance achieved by using parallel training corpora.

This work describes a coder that works at bit rates of 9.6Kbits/s to 32 Kb/ s. Basically the syst... more This work describes a coder that works at bit rates of 9.6Kbits/s to 32 Kb/ s. Basically the system consist on a waveform DPCM coder with an improvement in the quantizer. This new feature consist on quantizing the prediction error taking into account its waveform characteristics. The prediction residual of a voiced signal is characterized by an energy sychronous with pitch. The envelope of this signal holds this information and can be used to quantize properly the prediction error. In this work a parametric version of the residual envelope is used in two ways: Dynamic bit assignment in time domain and adaptive control of the dynamic range of the quantizer.

We study some speech enhancement algorithms based on the iterative Wiener filtering method due to... more We study some speech enhancement algorithms based on the iterative Wiener filtering method due to Lim-Oppenhcim [2], where the AR spectral estimation of the speech is carried out using a 2nd-order analysis. But in our algorithms we consider an AR estimation by means of cumulant analysis. This work extends some preceding papers due to the authors, where information of previous speech frames is taken to initiate speech AR modelling of the current frame. Two parameters are introduced to dessign Wiener filter at first iteration of this iterative algorithm. These parameters are the Interframe Factor IF and the Previous Frame Iteration PFI. A detailed study of them shows they allow a very important noise suppression after processing only first iteration of this algorithm, without any appreciable increase of distortion. Two different ways to combine current and previous frame AR modelling are evaluated.

The goal of the LILA project was the collection of speech databases over cellular telephone netwo... more The goal of the LILA project was the collection of speech databases over cellular telephone networks of five languages in three Asian countries. Three languages were recorded in India: Hindi by first language speakers, Hindi by second language speakers and Indian English. Furthermore, Mandarin was recorded in China and Korean in South-Korea. The databases are part of the SpeechDat-family and follow the SpeechDat rules in many respects. All databases have been finished and have passed the validation tests. Both Hindi databases and the Korean database will be available to the public for sale.

Speech Communication, 2003

This paper introduces a first approach to emotion recognition using RAMSES, the UPC's speech reco... more This paper introduces a first approach to emotion recognition using RAMSES, the UPC's speech recognition system. The approach is based on standard speech recognition technology using hidden semi-continuous Markov models. Both the selection of low level features and the design of the recognition system are addressed. Results are given on speaker dependent emotion recognition using the Spanish corpus of INTERFACE Emotional Speech Synthesis Database. The accuracy recognising seven different emotions-the six ones defined in MPEG-4 plus neutral style-exceeds 80% using the best combination of low level features and HMM structure. This result is very similar to that obtained with the same database in subjective evaluation by human judges.

Under the SpeechDat specifications, the Spanish member of SpeechDat consortium has recorded a Cat... more Under the SpeechDat specifications, the Spanish member of SpeechDat consortium has recorded a Catalan database that includes one thousand speakers. This communication describes some experimental work that has been carried out using both the Spanish and the Catalan speech material. A speech recognition system has been trained for the Spanish language using a selection of the phonetically balanced utterances from the 4500 SpeechDat training sessions. Utterances with mispronounced or incomplete words and with intermittent noise were discarded. A set of 26 allophones was selected to account for the Spanish sounds and clustered demiphones have been used as context dependent sub-lexical units. Following the same methodology, a recognition system was trained from the Catalan SpeechDat database. Catalan sounds were described with 32 allophones. Additionally, a bilingual recognition system was built for both the Spanish and Catalan languages. By means of clustering techniques, the suitable s...

Multilingual Speech and …, 2006

In this paper, multidialectal acoustic modeling based on sharing data across dialects is addresse... more In this paper, multidialectal acoustic modeling based on sharing data across dialects is addressed. A comparative study of different methods of combining data based on decision tree clustering algorithms is presented. Approaches evolved differ in the way of evaluating the similarity of sounds between dialects, and the decision tree structure applied. Proposed systems are tested with Spanish dialects across Spain and Latin America. All multidialectal proposed systems improve monodialectal performance using data from another dialect but it is shown that the way to share data is critical. The best combination between similarity measure and tree structure achieves an improvement of 7% over the results obtained with monodialectal systems.

Proceedings EAEC 99, …, 1999

The SpeechDat-Car project included in the 4 th framework of the European Community's Language Eng... more The SpeechDat-Car project included in the 4 th framework of the European Community's Language Engineering Programme, started in April 1998 with a duration of 30 months. It is a common initiative of car manufacturers, telephone communications operators, companies active in voice operated services and Universities that aims at collecting a set of speech databases in nine different languages to support training and testing of robust multilingual speech recognition for in-car applications. This paper describes the database requirements, the background of the project, the design and validation of the databases, the definition of recording platforms, and main goals from the automotive exploitation point of view.

This paper presents our work around the FESTCAT project, whose main goal was the development of v... more This paper presents our work around the FESTCAT project, whose main goal was the development of voices for the Festival suite in Catalan. In the first year, we produced the corpus and the speech data needed for build 10 voices using the Clunits (unit selection) and the HTS (Markov models) methods. The resulting voices are freely available on the web page of the project and included in Linkat, a Catalan distribution of Linux. More recently, we have updated the voices using new versions of HTS, other technology (Multisyn) and we have produced a child voice. Furthermore, we have performed a prosodic labeling and analysis of the database using the break index labels proposed in the ToBI system aimed to improve the intonation of the synthetic speech.

One of the most common approaches to speech synthesis is the concatenation of diphones, extracted... more One of the most common approaches to speech synthesis is the concatenation of diphones, extracted from a previously recorded database. The prosodic parameters of the recorded speech fragments have to be adapted to the specifications of the new utterances to be synthesized. In this paper, the deterministic plus stochastic model of speech is used to modify and smoothly concatenate the analyzed diphones. A very high quality is reached without pitch-synchronism, and complex calculations like the vocal tract estimation are avoided. Instead, simple linear interpolations and fast calculations are performed, and only harmonically related sinusoids are taken into account. The resynthesis of the concatenated data is carried out by the overlap-add method.

This paper deals with the design of a synthesis database for a high quality corpus-based Speech S... more This paper deals with the design of a synthesis database for a high quality corpus-based Speech Synthesis system in Spanish. The database has been designed for speech synthesis, speech conversion and expressive speech. The design follows the specifications of TC-STAR project and has been applied to collect equivalent English and Mandarin synthesis databases. The sentences of the corpus have been selected mainly from transcribed speech and novels. The selection criterion is a phonetic and prosodic coverage. The corpus was completed with sentences specifically designed to cover frequent phrases and words. Two baseline speakers and four bilingual speakers were recorded. Recordings consist of 10 hours of speech for each baseline speaker and one hour of speech for each voice conversion bilingual speaker. The database is labelled and segmented. Pitch marks and phonetic segmentation was done automatically and up to 50% manually supervised. The database will be available at ELRA.

Language Resources and Evaluation, 2016

This article provides an overview of the dissemination work carried out in META-NET from 2010 unt... more This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative's work throughout Europe in order to boost progress and innovation in our field.

In this paper we present a method for defining the question set for the induction of acoustic pho... more In this paper we present a method for defining the question set for the induction of acoustic phonetic decision trees. The method is data driven resulting in a continuous feature space in contrast to the usual categorical one. We apply the features to a multilingual speech recognition task, outperforming consistently the standard method using IPA-based characteristics. An extension to cross-lingual applications together with first preliminary results are given too.