MÓNICA DOMINGUEZ BAJO | Pompeu Fabra University
Papers by MÓNICA DOMINGUEZ BAJO
This paper presents a demonstration of a stochastic prosody tool for enrichment of synthesized speech using SSML prosody tags applied over hierarchical thematicity spans in the context of a content-to-speech (CTS) application. The motivation for using hierarchical thematicity is exemplified, together with the capabilities of the module to generate a variety of SSML prosody tags within a controlled range of values depending on the input thematicity label.
We present work in progress on an intelligent embodied conversation agent in the basic care and healthcare domain. In contrast to most existing agents, the presented agent is designed to have the linguistic, cultural, social and emotional competence needed to interact with elderly people and migrants. It is composed of an ontology-based and reasoning-driven dialogue manager, multimodal communication analysis and generation modules, and a search engine for retrieving from the web the multimedia background content needed to conduct a conversation on a given topic.
The development of conversational agents with human interaction capabilities requires advanced affective state recognition that integrates non-verbal cues from the different modalities constituting what, in human communication, we perceive as an overall affective state. Each modality is often handled by a different subsystem that conveys only a partial interpretation of the whole and, as such, is evaluated only in terms of its partial view. To tackle this shortcoming, we investigate the generation of a unified multimodal annotation schema of non-verbal cues from the perspective of an interdisciplinary group of experts. We aim to obtain a common ground truth with a unique representation using the Valence and Arousal space and a discrete non-linear scale of values. The proposed annotation schema is demonstrated on a corpus in the healthcare domain but is scalable to other purposes. Preliminary results on inter-rater variability show a positive correlation of consensus level with high (absolute) values of Valence and Arousal as well as with the number of annotators labeling a given video sequence.
Intonation is traditionally considered to be the most important prosodic feature, and a substantial research effort has therefore been devoted to automatic segmentation and labeling of speech samples to capture intonation cues. A number of studies also show that automatic prosody labeling improves further when duration or intensity is incorporated. However, the combination of word-level acoustic features still attains poor results when machine learning techniques are applied to annotated corpora to derive intonation for speech synthesis applications. To address this problem, we present an experimental setup for the development of a hierarchical prosodic structure model which combines linguistic features, including information structure, and three acoustic elements (intensity, pitch and duration). We show empirically that this combination leads to a considerably more accurate representation of prosody and, consequently, a more reliable automatic labeling of speech corpora for machine learning.
State-of-the-art prosody modelling in content-to-speech (CTS) applications still uses the same methodology to predict intonation cues as text-to-speech (TTS) applications, namely the analysis of the generated surface sentences with respect to part of speech, syntactic dependency relations and word order. On the other hand, several theoretical studies argue that morphology, syntax, and the information (or communicative) structure that organizes a given content (semantic or deep-syntactic structure) with respect to the intention of the speaker show a strong correlation with intonation. However, little empirical work based on sufficiently large corpora has been carried out so far to support this argument. We present empirical evidence for the Information Structure–Prosody correlation using the Wall Street Journal Penn Treebank corpus recorded by native American English speakers. Our experiments reach a prosody prediction accuracy of 80% using the hierarchical information structure from the Meaning-Text Theory, compared to 59% for the baseline.
Several grammar theories relate information structure (IS) and prosody, highlighting a major correspondence between theme and rheme, and intonation patterns. Although these theories have been successfully exploited in some specific speech synthesis applications, they are mainly based on short default-order sentences, which limits their expressiveness for real discourse with longer sentences and complex structures. Our experiments performed on real discourse from the Wall Street Journal corpus show that we need a model that: (1) foresees a hierarchical theme/rheme structure, and (2) introduces, apart from the traditional theme and rheme, a new element: the specifier.
Speech synthesis applications use various layers of linguistic annotation (syntax, semantics, information and prosody structures), and therefore a labeling of intonation patterns at the intonational phrase level is essential. We present a rule-based procedure for initial word-by-word AuToBI annotation and its adaptation to a phrase-based annotation. To validate our proposal, the outcome of the procedure is compared with manual annotation and with the patterns predicted by the information structure-prosody correlation argued for by mainstream theories.
Several grammar theories relate information structure and prosody, highlighting a major correspondence between theme and rheme, and intonation patterns. Although these theories have been successfully exploited in some specific speech synthesis applications, they are mainly based on short default-order sentences, which limits their expressiveness for real discourse with longer sentences and complex structures. This paper revisits these theories, identifying cases in which they are valid, and providing a new proposal for cases in which a more complex model is needed. Specifically, our experiments performed on real discourse from the Wall Street Journal corpus show that we need a model that: (1) foresees a hierarchical theme/rheme structure, and (2) introduces, apart from the traditional theme and rheme, a new element: the specifier.
This paper deals with the adaptation of AuToBI annotation for speech synthesis purposes. AuToBI is a tool that automatically determines and classifies the standard ToBI labels for American English. AuToBI annotation is performed word-by-word. However, for speech synthesis applications that use various layers of linguistic annotation (syntax, semantic information and prosody structures) and, in particular, for the detection of the correlation between the information structure and prosody, a labeling of intonation patterns at the intonational phrase level is essential. We present a rule-based procedure for initial AuToBI annotation and its adaptation to a phrase-based annotation, thus avoiding a post-processing stage of the extracted labels. To validate our proposal, the outcome of the procedure is compared with manual annotation and with the patterns predicted by the information structure-prosody correlation argued for by mainstream theories.
Conference Presentations by MÓNICA DOMINGUEZ BAJO
Speech synthesizers (TTS) use surface textual features to predict prosody, acting as reading applications. However, read speech and spoken speech differ in the degree of expressiveness that prosody must be able to cater for.