Evaluating a virtual speech cuer

Evaluation of a virtual speech cuer

2006

This paper presents the virtual speech cuer built in the context of the ARTUS project, which aims at watermarking hand and face gestures of a virtual animated agent in a broadcast audiovisual sequence. For deaf televiewers who master cued speech, the animated agent can then be superimposed, on demand and at the receiver, on the original broadcast as an alternative to subtitling. The paper presents the multimodal text-to-speech synthesis system and the first evaluation performed by deaf users.
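
As a rough illustration of what such a cuer must compute (this is not the ARTUS pipeline itself), the sketch below groups a phoneme string into cued-speech keys: French Cued Speech codes consonants with one of 8 handshapes and vowels with one of 5 hand positions near the face. The phoneme inventory and table entries are simplified placeholders, not the full LPC coding.

```python
# Illustrative mapping from phonemes to cued-speech keys (handshape, position).
# Table entries are simplified placeholders, not the official LPC assignments.

CONSONANT_SHAPE = {"p": 1, "d": 1, "k": 2, "v": 2, "s": 3, "r": 3, "b": 4, "m": 5, "l": 6}
VOWEL_POSITION = {"a": "side", "o": "side", "i": "mouth", "e": "chin", "u": "cheekbone"}

def phonemes_to_cues(phonemes):
    """Group a phoneme sequence into CV cues: (handshape, hand position) pairs."""
    cues, i = [], 0
    while i < len(phonemes):
        shape = CONSONANT_SHAPE.get(phonemes[i])
        if shape is not None and i + 1 < len(phonemes) and phonemes[i + 1] in VOWEL_POSITION:
            cues.append((shape, VOWEL_POSITION[phonemes[i + 1]]))      # consonant + vowel
            i += 2
        elif shape is not None:
            cues.append((shape, "side"))                               # isolated consonant
            i += 1
        else:
            cues.append((5, VOWEL_POSITION.get(phonemes[i], "side")))  # isolated vowel
            i += 1
    return cues

print(phonemes_to_cues(["s", "a", "l", "u"]))   # [(3, 'side'), (6, 'cheekbone')]
```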

A System to Generate SignWriting for Video Tracks Enhancing Accessibility of Deaf People

International Journal of Interactive Multimedia and Artificial Intelligence, 2017

A web platform that also aims at facilitating the delivery of accessible content via the web is presented in [6] and [7]. This platform is a global solution targeting every disability, as different adapters can be integrated to provide accessible content according to the different limitations of users. Specifically, the Deaf People Accessibility Adapter has already been developed and described in [7]. This component adapts application content for people with severe auditory disabilities by translating standard web applications into web applications based on SignWriting, a way of writing Sign Language. The present paper describes a solution that complements this adapter, eliminating existing barriers in video subtitling. If a web page provides a standard subtitling file, the proposed solution translates the available plain text into vector graphics representing SignWriting, which accompany the video sequence, allowing Deaf people to perceive the audio information of the video. Specifically, the adapter supports translation of subtitles written in English into SignWriting for American Sign Language.
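
A minimal sketch of the kind of subtitle-to-SignWriting step described here, assuming SRT-style subtitle text and an invented word-to-symbol lexicon (the `SIGNWRITING_LEXICON` names and IDs are placeholders); the real adapter's lexicon, symbol identifiers, and fallback strategy are not specified in this abstract.

```python
import re

SIGNWRITING_LEXICON = {"hello": "SW:S14c20", "world": "SW:S15a01"}  # placeholder symbol IDs

def parse_srt(srt_text):
    """Yield (start, end, text) cues from a minimal SRT string."""
    pattern = re.compile(
        r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(.+?)(?:\n\n|\Z)",
        re.S,
    )
    for start, end, text in pattern.findall(srt_text):
        yield start, end, " ".join(text.split())

def cue_to_glyphs(text):
    """Map each word to a SignWriting symbol ID, or a fingerspelling fallback."""
    return [SIGNWRITING_LEXICON.get(w.lower().strip(".,!?"), f"FS:{w}") for w in text.split()]

srt = "1\n00:00:01,000 --> 00:00:02,500\nHello world\n\n"
for start, end, text in parse_srt(srt):
    print(start, end, cue_to_glyphs(text))
```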

From audio-only to audio and video text-to-speech

Acta Acustica united with Acustica

Assessing the quality of Text-to-Speech (TTS) systems is a complex problem due to the many modules involved that address different subtasks during synthesis. Adding face synthesis – the animation of a "talking head" and its rendering to video – to a TTS system makes evaluation even more difficult. For talking heads, research on evaluating such systems is still in its infancy. This paper reports on progress made with the AT&T sample-based Visual TTS (VTTS) system. Our system incorporates unit-selection synthesis (now well known from audio TTS) and a moderate-size recorded database of video segments that are modified and concatenated to render the desired output. Given the high quality the system achieves, we feel for the first time that we are close to passing the Turing test, that is, that we are almost able to synthesize "talking heads" that look like recordings of real people. We demonstrate this point in applications, either over the we...
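
The unit-selection idea this abstract refers to can be sketched as a dynamic-programming search over candidate recorded units, balancing a target cost against a concatenation cost. The candidate lists and cost functions below are toy stand-ins, not the AT&T system.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Viterbi search over per-target candidate unit lists."""
    # best[i][c] = (total cost up to target i ending in candidate c, backpointer)
    best = [{c: (target_cost(targets[0], c), None) for c in candidates[0]}]
    for i in range(1, len(targets)):
        layer = {}
        for c in candidates[i]:
            prev_cost, prev = min(
                ((best[i - 1][p][0] + join_cost(p, c), p) for p in candidates[i - 1]),
                key=lambda x: x[0],
            )
            layer[c] = (prev_cost + target_cost(targets[i], c), prev)
        best.append(layer)
    # backtrack from the cheapest final candidate
    c = min(best[-1], key=lambda k: best[-1][k][0])
    path = [c]
    for i in range(len(targets) - 1, 0, -1):
        c = best[i][c][1]
        path.append(c)
    return list(reversed(path))

# Toy usage: two targets, toy costs penalising label mismatch and joins.
targets = ["a", "b"]
candidates = [["a1", "a2"], ["b1"]]
tc = lambda t, c: 0.0 if c.startswith(t) else 1.0
jc = lambda p, c: 0.5 if p[-1] != c[-1] else 0.0
print(select_units(targets, candidates, tc, jc))   # ['a1', 'b1']
```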

Multimodal speech synthesis

2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532), 2000

Evaluation of a Speech Cuer: From Motion Capture to a Concatenative Text-to-Cued Speech System

We present our efforts to characterize the 3D movements of the right hand and the face of a French female speaker during the production of manual cued speech. We analyzed the 3D trajectories of 50 hand and 63 facial fleshpoints during the production of 238 utterances carefully designed to cover all possible diphones of the French language. Linear and non-linear statistical models of the hand and face deformations and postures have been developed using both separate and joint corpora. We then implemented a concatenative audiovisual text-to-cued speech synthesis system.
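
A minimal sketch of the kind of linear statistical model referred to here, assuming frames of concatenated 3D fleshpoint coordinates (e.g. 50 hand points giving 150 values per frame) and synthetic data; the actual modelling choices in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 150))      # 1000 synthetic motion-capture frames

mean = frames.mean(axis=0)
centered = frames - mean
# SVD-based PCA: rows of vt are the deformation components
_, s, vt = np.linalg.svd(centered, full_matrices=False)

n_components = 10                          # keep a small articulatory subspace
components = vt[:n_components]
weights = centered @ components.T          # per-frame component weights

# A frame is approximated from its low-dimensional weights
# (the error is large here because the data are random, not real motion):
reconstructed = mean + weights[0] @ components
print(np.abs(reconstructed - frames[0]).max())
```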

Audiovisual speech synthesis

International Journal of Speech …, 2003

This paper presents the main approaches used to synthesize talking faces, and provides greater detail on a handful of these approaches. An attempt is made to distinguish between facial synthesis itself (i.e. the manner in which facial movements are rendered on a computer screen) and the way these movements may be controlled and predicted using phonetic input. The two main ...

Feasibility of Face Animation on Mobile Phones for Deaf Users

2007 16th IST Mobile and Wireless Communications Summit, 2007

Mobile telephone systems have made fantastic progress in general, but deaf people are practically excluded from the benefits of mobile telephony. Our aim was to develop a new communication aid for deaf users. Our system directly converts the audio speech signal into video of an animated face, so that deaf users can receive voice messages by lipreading. Our system was implemented and tested in a PC environment earlier. This paper reports on the implementation on mobile phones and PDAs. Implementation problems and potential steps for further improvements are also discussed.

Speech to Facial Animation Conversion for Deaf Customers

2006 14th European Signal Processing Conference, 2006

A speech-to-facial-animation direct conversion system was developed as a communication aid for deaf people. Based on the preliminary test results, a specific database was constructed from audio and visual recordings of professional lipspeakers. The standardized MPEG-4 system was used to animate the speaking face model. The trained neural net calculates the principal component weights of the feature points from speech frames, and the control coordinates are then computed from these PC weights. The whole system can be implemented on standard mobile phones. In the final test based on our facial animation model, deaf persons were able to correctly recognize about 50% of words from limited sets.
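
A hedged sketch of the inference path this abstract describes: a small trained network maps a speech feature frame to principal-component (PC) weights, and feature-point coordinates are reconstructed from those weights. All matrices and dimensions below are random placeholders standing in for trained values, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(1)

SPEECH_DIM, HIDDEN, N_PC, N_COORDS = 26, 32, 6, 2 * 84   # e.g. 84 MPEG-4 feature points in 2D

W1, b1 = rng.normal(size=(HIDDEN, SPEECH_DIM)), np.zeros(HIDDEN)   # stand-in "trained" net
W2, b2 = rng.normal(size=(N_PC, HIDDEN)), np.zeros(N_PC)
pc_basis = rng.normal(size=(N_PC, N_COORDS))                       # stand-in PCA basis
mean_face = rng.normal(size=N_COORDS)

def speech_frame_to_feature_points(frame):
    """Speech frame -> PC weights -> feature point coordinates."""
    hidden = np.tanh(W1 @ frame + b1)
    pc_weights = W2 @ hidden + b2
    return mean_face + pc_weights @ pc_basis

frame = rng.normal(size=SPEECH_DIM)             # one frame of acoustic features
coords = speech_frame_to_feature_points(frame)
print(coords.shape)                             # (168,) -> drives the animated face model
```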

SynFace: speech-driven facial animation for virtual speech-reading support

EURASIP Journal on …, 2009

This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for Swedish wide-band speech quality available in TV, radio, and Internet communication. Lastly, the paper covers experiments with nonverbal motions driven from the speech signal. It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).
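
An illustrative sketch (not SynFace's actual implementation) of how a recognised phone sequence can drive an animated face: each phone is mapped to a viseme target and interpolated into per-frame articulation parameters. The phone labels, viseme targets, and timings below are invented for illustration.

```python
VISEME_TARGET = {"m": {"jaw": 0.0, "lips": 1.0},
                 "a": {"jaw": 0.8, "lips": 0.2},
                 "sil": {"jaw": 0.0, "lips": 0.0}}

def phones_to_frames(phones, fps=25):
    """phones: list of (label, start_s, end_s) from a phonetic recogniser."""
    frames = []
    for label, start, end in phones:
        target = VISEME_TARGET.get(label, VISEME_TARGET["sil"])
        n = max(1, round((end - start) * fps))
        prev = frames[-1] if frames else VISEME_TARGET["sil"]
        for i in range(n):                      # linear ramp from the previous frame to the target
            t = (i + 1) / n
            frames.append({k: (1 - t) * prev[k] + t * target[k] for k in target})
    return frames

print(len(phones_to_frames([("sil", 0.0, 0.1), ("m", 0.1, 0.2), ("a", 0.2, 0.4)])))
```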