The TTS-driven affective embodied conversational agent EVA, based on a novel conversational-behavior generation algorithm (original) (raw)

2017, Engineering Applications of Artificial Intelligence

As a result of the convergence of different services delivered over the internet protocol, internet protocol television (IPTV) may be regarded as the one of the most widespread user interfaces accepted by a highly diverse user domain. Every generation, from children to the elderly, can use IPTV for recreation, as well as for gaining social contact and stimulating the mind. However, technological advances in digital platforms go hand in hand with the complexity of their user interfaces, and thus induce technological disinterest and technological exclusion. Therefore, interactivity and affective content presentations are, from the perspective of advanced user interfaces, two key factors in any application incorporating human-computer interaction (HCI). Furthermore, the perception and understanding of the information (meaning) conveyed is closely interlinked with visual cues and non-verbal elements that speakers generate throughout human-human dialogues. In this regard, co-verbal behavior provides information to the communicative act. It supports the speaker's communicative goal and allows for a variety of other information to be added to his/her messages, including (but not limited to) psychological states, attitudes, and personality. In the present paper, we address complexity and technological disinterest through the integration of natural, human-like multimodal output that incorporates a novel combined data-and rule-driven co-verbal behavior generator that is able to extract features from unannotated, general text. The core of the paper discusses the processes that model and synchronize non-verbal features with verbal features even when dealing with unknown context and/or limited contextual information. In addition, the proposed algorithm incorporates data-driven (speech prosody, repository of motor skills) and rule-based concepts (grammar, gesticon). The algorithm firstly classifies the communicative intent, then plans the coverbal cues and their form within the gesture unit, generates temporally synchronized co-verbal cues, and finally realizes them in the form of human-like co-verbal movements. In this way, the information can be represented in the form of both meaningfully and temporally synchronized co-verbal cues with accompanying synthesized speech, using communication channels to which people are most accustomed.