Towards a common framework for multimodal generation: The behavior markup language

Realizing Multimodal Behavior: Closing the gap between behavior planning and embodied agent presentation

Lecture Notes in Computer Science, 2010

Generating coordinated multimodal behavior for an embodied agent (speech, gesture, facial expression, ...) is challenging. It requires a high degree of animation control, in particular when reactive behaviors are required. We propose to distinguish realization planning, where gesture and speech are processed symbolically using the Behavior Markup Language (BML), from presentation, which is controlled by a lower-level animation language (EMBRScript). Reactive behaviors can bypass planning and control presentation directly. In this paper, we show how to define a behavior lexicon, how this lexicon relates to BML, and how to resolve timing using formal constraint solvers. We conclude by demonstrating how to integrate reactive emotional behaviors.
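
To make the planning-versus-presentation split concrete, the sketch below shows how symbolic BML-style timing constraints (gesture stroke on a stressed word, preparation and retraction durations taken from a behavior lexicon) might be resolved into absolute times. The behavior names, sync-point labels, offsets, and the naive propagation loop are illustrative assumptions; the paper itself resolves timing with a formal constraint solver and realizes the result in EMBRScript.

```python
from dataclasses import dataclass, field

@dataclass
class Behavior:
    bml_id: str
    sync: dict = field(default_factory=dict)  # known absolute times (s) per sync point

# Speech sync points come from TTS timestamps; gesture times must be derived.
speech = Behavior("s1", {"start": 0.0, "tm1": 0.42, "end": 1.10})  # "tm1": marker on the stressed word
gesture = Behavior("g1", {})
behaviors = {"s1": speech, "g1": gesture}

# Constraints of the form: (behavior, sync_point) = (ref_behavior, ref_sync_point) + offset
constraints = [
    (("g1", "stroke"), ("s1", "tm1"), 0.0),      # stroke on the stressed word
    (("g1", "start"),  ("g1", "stroke"), -0.3),  # preparation duration (assumed lexicon value)
    (("g1", "end"),    ("g1", "stroke"), 0.5),   # retraction duration (assumed lexicon value)
]

# Naive fixed-point propagation; a real realizer hands the whole constraint
# set to a solver and reports inconsistencies instead of ignoring them.
changed = True
while changed:
    changed = False
    for (b, p), (rb, rp), off in constraints:
        if p not in behaviors[b].sync and rp in behaviors[rb].sync:
            behaviors[b].sync[p] = behaviors[rb].sync[rp] + off
            changed = True

print(gesture.sync)  # -> stroke 0.42, start ~0.12, end 0.92
```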

Embodied conversational characters: representation formats for multimodal communicative behaviours

This contribution deals with the requirements for representation languages employed in planning and displaying the communicative multimodal behaviour of Embodied Conversational Agents (ECAs). We focus on the role of behaviour representation frameworks as part of the processing chain from intent planning to the planning and generation of multimodal communicative behaviours. On the one hand, the field is fragmented, with almost everyone working on ECAs developing their own tailor-made representations, which is reflected, among other things, in the extensive reference list. On the other hand, there are general aspects that need to be modelled in order to generate multimodal behaviour. Throughout the chapter we take different perspectives on existing representation languages and outline the foundations of a common framework.

Multimodal expressive embodied conversational agents

2005

In this paper we present our work toward the creation of a multimodal expressive Embodied Conversational Agent (ECA). Our agent, called Greta, exhibits nonverbal behaviors synchronized with speech. We use the taxonomy of communicative functions developed by Isabella Poggi [22] to specify the behavior of the agent. Based on this taxonomy, a representation language, the Affective Presentation Markup Language (APML), has been defined to drive the animation of the agent [4].
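
As a rough illustration of the taxonomy-driven idea, the sketch below maps communicative functions to candidate nonverbal signals for tagged text spans. The function names and signal choices are assumptions made for illustration and do not reproduce the actual APML tag set or Greta's behavior lexicon.

```python
# Illustrative only: mapping communicative functions to nonverbal signals.
COMMUNICATIVE_FUNCTIONS = {
    "performative:greet":    ["smile", "head_nod", "eyebrow_raise"],
    "affective:joy":         ["smile", "open_posture"],
    "emphasis":              ["beat_gesture", "eyebrow_raise"],
    "belief-relation:doubt": ["frown", "head_tilt"],
}

def plan_signals(tagged_spans):
    """tagged_spans: list of (text, function) pairs, e.g. parsed from an
    APML-like document. Returns per-span nonverbal signals to be
    synchronized with the speech for that span."""
    return [(text, COMMUNICATIVE_FUNCTIONS.get(function, []))
            for text, function in tagged_spans]

print(plan_signals([("Hello", "performative:greet"),
                    ("really", "emphasis")]))
```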

Thoughts on FML: Behavior generation in the virtual human communication architecture

2008

We discuss our current architecture for the generation of natural language and non-verbal behavior in ICT virtual humans. We draw on our experience developing this architecture to present our current perspective on several issues related to the standardization of FML and to the SAIBA framework more generally. In particular, we discuss our current use, and non-use, of FML-inspired representations in generating natural language, eye gaze, and emotional displays. We also comment on some of the shortcomings of our design as currently implemented.
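
For readers unfamiliar with the SAIBA framework mentioned here, the sketch below caricatures its three-stage pipeline: intent planning feeding an FML-like description into behavior planning, which feeds a BML-like schedule into realization. The dict-based payloads and the toy rules are placeholders, not the ICT virtual human architecture, which as the paper notes departs from this idealized flow in places.

```python
# Toy SAIBA-style pipeline: intent planning -> (FML) -> behavior planning -> (BML) -> realization.

def plan_intent(dialogue_state):
    # Communicative intent, roughly what FML is meant to carry (contents assumed).
    return {"intent": "inform", "content": "the door is locked", "emotion": "concern"}

def plan_behavior(fml):
    # Behavior planning maps intent to concrete, scheduled behaviors (BML-like).
    behaviors = [{"type": "speech", "text": fml["content"]}]
    if fml.get("emotion") == "concern":
        behaviors.append({"type": "face", "lexeme": "brow_frown", "sync_with": "speech"})
        behaviors.append({"type": "gaze", "target": "listener"})
    return {"bml": behaviors}

def realize(bml):
    # A realizer (e.g. an animation engine) would turn this into motion;
    # here we just print the schedule.
    for b in bml["bml"]:
        print(b)

realize(plan_behavior(plan_intent({})))
```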

Smartbody: Behavior realization for embodied conversational agents

2008

Researchers demand much from their embodied conversational agents (ECAs), requiring them to be both lifelike and responsive to events in an interactive setting. We find that a flexible combination of animation approaches may be needed to satisfy these needs. In this paper we present SmartBody, an open-source modular framework for animating ECAs in real time, based on the notion of hierarchically connected animation controllers. Controllers in SmartBody can employ arbitrary animation algorithms such as keyframe interpolation, motion capture, or procedural animation. Controllers can also schedule or combine other controllers. We discuss our architecture in detail, including how we incorporate traditional approaches and develop the notion of a controller as a reactive module within a generic framework for realizing modular animation control. To illustrate the versatility of the architecture, we also discuss a range of applications that have used SmartBody successfully.
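
The notion of hierarchically connected controllers can be sketched as follows; the class and method names are invented for illustration and this is not SmartBody's actual API.

```python
class Controller:
    """A controller maps time to a partial pose (joint -> value)."""
    def evaluate(self, t, pose):
        return pose

class KeyframeController(Controller):
    def __init__(self, joint, keys):          # keys: list of (time, value)
        self.joint, self.keys = joint, sorted(keys)
    def evaluate(self, t, pose):
        # piecewise-linear interpolation between keyframes
        prev = self.keys[0]
        for k in self.keys:
            if k[0] > t:
                w = (t - prev[0]) / (k[0] - prev[0])
                pose[self.joint] = prev[1] * (1 - w) + k[1] * w
                return pose
            prev = k
        pose[self.joint] = prev[1]
        return pose

class ScheduleController(Controller):
    """Combines other controllers: each child runs in its own time window;
    later children override earlier ones on the joints they touch."""
    def __init__(self, tracks):               # tracks: list of (start, end, controller)
        self.tracks = tracks
    def evaluate(self, t, pose):
        for start, end, child in self.tracks:
            if start <= t <= end:
                pose = child.evaluate(t - start, pose)
        return pose

nod = KeyframeController("head_pitch", [(0.0, 0.0), (0.2, 0.3), (0.4, 0.0)])
sched = ScheduleController([(1.0, 1.4, nod)])
print(sched.evaluate(1.2, {}))                 # {'head_pitch': 0.3}
```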

Nonverbal Behavior Generator for Embodied Conversational Agents

Lecture Notes in Computer Science, 2006

Believable nonverbal behaviors for embodied conversational agents (ECAs) can create a more immersive experience for users and improve the effectiveness of communication. This paper describes a nonverbal behavior generator that analyzes the syntactic and semantic structure of the surface text as well as the affective state of the ECA, and annotates the surface text with appropriate nonverbal behaviors. A number of video clips of people conversing were analyzed to extract the nonverbal behavior generation rules. The system works in real time and is user-extensible, so that users can easily modify or extend the current behavior generation rules.
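
A toy version of such rule-based annotation is sketched below; the keyword rules and the single affect rule are illustrative assumptions, whereas the actual generator also exploits syntactic and semantic structure extracted from the surface text.

```python
import re

# Illustrative rules only: keyword-level cues of the kind derived from video analysis.
RULES = [
    (re.compile(r"\b(no|not|never|nothing)\b", re.I), "head_shake"),
    (re.compile(r"\b(yes|yeah|sure)\b", re.I),        "head_nod"),
    (re.compile(r"\b(very|really|extremely)\b", re.I), "eyebrow_raise"),
    (re.compile(r"\b(I|me|my)\b"),                    "self_point"),
]

def annotate(surface_text, affective_state="neutral"):
    """Return (word_index, behavior) annotations for the surface text."""
    annotations = []
    for i, word in enumerate(surface_text.split()):
        for pattern, behavior in RULES:
            if pattern.search(word):
                annotations.append((i, behavior))
    if affective_state == "sad":
        annotations.append(("all", "gaze_down"))   # assumed affect rule
    return annotations

print(annotate("No, I really don't think so"))
# -> [(0, 'head_shake'), (1, 'self_point'), (2, 'eyebrow_raise')]
```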

The TTS-driven affective embodied conversational agent EVA, based on a novel conversational-behavior generation algorithm

Engineering Applications of Artificial Intelligence, 2017

As a result of the convergence of different services delivered over the internet protocol, internet protocol television (IPTV) may be regarded as one of the most widespread user interfaces accepted by a highly diverse user base. Every generation, from children to the elderly, can use IPTV for recreation, as well as for gaining social contact and stimulating the mind. However, technological advances in digital platforms go hand in hand with increasingly complex user interfaces, and thus induce technological disinterest and technological exclusion. Interactivity and affective content presentation are therefore, from the perspective of advanced user interfaces, two key factors in any application incorporating human-computer interaction (HCI). Furthermore, the perception and understanding of the information conveyed is closely interlinked with the visual cues and non-verbal elements that speakers generate throughout human-human dialogues. In this regard, co-verbal behavior contributes information to the communicative act: it supports the speaker's communicative goal and allows a variety of other information to be added to his/her messages, including (but not limited to) psychological states, attitudes, and personality. In the present paper, we address complexity and technological disinterest through the integration of natural, human-like multimodal output that incorporates a novel, combined data- and rule-driven co-verbal behavior generator able to extract features from unannotated, general text. The core of the paper discusses the processes that model and synchronize non-verbal features with verbal features, even when dealing with unknown context and/or limited contextual information. The proposed algorithm incorporates data-driven resources (speech prosody, a repository of motor skills) and rule-based concepts (a grammar, a gesticon). It first classifies the communicative intent, then plans the co-verbal cues and their form within the gesture unit, generates temporally synchronized co-verbal cues, and finally realizes them as human-like co-verbal movements. In this way, information can be presented as co-verbal cues that are both meaningfully and temporally synchronized with the accompanying synthesized speech, using the communication channels to which people are most accustomed.
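
The four stages the abstract lists (classify intent, plan co-verbal cues, synchronize, realize) can be caricatured as a small pipeline; the function names, toy gesticon, and per-word timings below are assumptions for illustration, not the EVA implementation.

```python
# Toy pipeline: classify intent -> plan cues -> synchronize with prosody -> realize.

def classify_intent(sentence):
    # The real system extracts features from unannotated text; this is a stand-in.
    return "question" if sentence.strip().endswith("?") else "statement"

def plan_cues(sentence, intent):
    # Choose cue shapes from a (here: toy) gesticon per intent.
    gesticon = {"question": ["palm_up_open_hand"], "statement": ["beat"]}
    return [{"cue": g, "anchor_word": len(sentence.split()) // 2} for g in gesticon[intent]]

def synchronize(cues, word_timings):
    # Align each cue's stroke with the onset of its anchor word.
    for cue in cues:
        cue["stroke_time"] = word_timings[cue["anchor_word"]]
    return cues

def realize(cues):
    for cue in cues:
        print(f"play {cue['cue']} with stroke at {cue['stroke_time']:.2f}s")

sentence = "Could you repeat that please?"
timings = [0.0, 0.25, 0.45, 0.85, 1.10]   # assumed per-word onsets from TTS
realize(synchronize(plan_cues(sentence, classify_intent(sentence)), timings))
```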

Multimodal adapted robot behavior synthesis within a narrative human-robot interaction

2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

In human-human interaction, three modalities of communication (i.e., verbal, nonverbal, and paraverbal) are naturally coordinated so as to enhance the meaning of the conveyed message. In this paper, we try to create a similar coordination between these modalities of communication in order to make the robot behave as naturally as possible. The proposed system uses a group of videos to elicit specific target emotions in a human user, after which interactive narratives begin (i.e., interactive discussions between the participant and the robot around each video's content). During each interaction experiment, the expressive humanoid ALICE robot engages the participant and generates a multimodal behavior adapted to the emotional content of the projected video, using speech, head-arm metaphoric gestures, and/or facial expressions. The robot's interactive speech is synthesized using Mary-TTS (a text-to-speech toolkit), which is used in parallel to generate adapted head-arm gestures [1]. The synthesized multimodal robot behavior is evaluated by the interacting human at the end of each emotion-eliciting experiment. The results obtained confirm the positive effect of the multimodality of the generated robot behavior on the interaction.