Narendra Rajput - Academia.edu


Papers by Narendra Rajput

Antibiotic sensitivity of microorganisms identified from camel wound

Using viseme based acoustic models for speech driven lip synthesis

2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03).

Speech driven lip synthesis is an interesting and important step toward human-computer interaction. An incoming speech signal is time aligned using a speech recognizer to generate a phonetic sequence, which is then converted to the corresponding viseme sequence to be animated. In this paper, we present a novel method for generating the viseme sequence that uses viseme based acoustic models, instead of the usual phone based acoustic models, to align the input speech signal. This results in higher accuracy and speed of the alignment procedure and allows a much simpler implementation of the speech driven lip synthesis system, as it completely obviates the need for acoustic-unit-to-visual-unit conversion. We show through various experiments that the proposed method yields about a 53% relative improvement in classification accuracy and about a 52% reduction in the time required to compute alignments.
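The acoustic-unit-to-visual-unit conversion that this paper eliminates can be illustrated with a minimal sketch. The phone-to-viseme table and time units below are hypothetical (not the paper's actual mapping); the point is that a phone-level alignment must be collapsed into a viseme sequence, merging adjacent phones that share a mouth shape:

```python
# Hypothetical phone-to-viseme table (illustrative subset, not from the paper).
PHONE_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "iy": "spread", "ih": "spread",
    "aa": "open", "ao": "open",
}

def phones_to_visemes(alignment):
    """Collapse a time-aligned phone sequence into a viseme sequence,
    merging consecutive phones that map to the same viseme."""
    visemes = []
    for phone, start, end in alignment:
        v = PHONE_TO_VISEME.get(phone, "neutral")
        if visemes and visemes[-1][0] == v:
            # Extend the previous viseme's span instead of adding a duplicate.
            visemes[-1] = (v, visemes[-1][1], end)
        else:
            visemes.append((v, start, end))
    return visemes

alignment = [("p", 0.00, 0.08), ("b", 0.08, 0.15), ("aa", 0.15, 0.30), ("f", 0.30, 0.40)]
print(phones_to_visemes(alignment))
# [('bilabial', 0.0, 0.15), ('open', 0.15, 0.3), ('labiodental', 0.3, 0.4)]
```

Viseme-based acoustic models sidestep this step entirely: the aligner emits visemes directly, which is where the paper's speed and simplicity gains come from.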

Audio driven facial animation for audio-visual reality

IEEE International Conference on Multimedia and Expo, 2001. ICME 2001., 2001

In this paper, we demonstrate a morphing based automated audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and expression. Given an incoming audio stream and still pictures of a face speaking different visemes, an animation sequence is constructed using optical flow between visemes. Rules based on coarticulation and viseme duration control continuity in the shape and extent of lip opening. In addition, new viseme-expression combinations are synthesized so that animations with new facial expressions can be generated. Finally, various applications of this system are discussed in the context of creating audiovisual reality.
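The idea of generating intermediate frames between two viseme key images can be sketched as a linear cross-dissolve. This is a simplified stand-in: the paper warps pixels along optical-flow vectors rather than blending intensities directly, and the toy 2x2 images are purely illustrative:

```python
import numpy as np

def morph_sequence(img_a, img_b, n_frames):
    """Generate intermediate frames between two viseme images by linear
    cross-dissolve (a simplified stand-in for optical-flow-based morphing)."""
    frames = []
    for i in range(n_frames):
        alpha = i / (n_frames - 1)  # 0.0 -> first viseme, 1.0 -> second
        frames.append((1 - alpha) * img_a + alpha * img_b)
    return frames

# Two toy 2x2 grayscale "viseme" images.
a = np.zeros((2, 2))
b = np.ones((2, 2)) * 100.0
seq = morph_sequence(a, b, 5)
print([f[0, 0] for f in seq])  # [0.0, 25.0, 50.0, 75.0, 100.0]
```

Flow-based warping improves on this by moving lip pixels along motion vectors, avoiding the ghosting that a plain blend produces.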

Improving Automatic Call Classification using Machine Translation

2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07, 2007

Utterance classification is an important task in spoken-dialog systems. The response of the system depends on the category assigned to the speaker's utterance by the classifier. However, the input speech is often spontaneous and noisy, which results in high word error rates. ...
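The classification task itself can be illustrated with a minimal sketch. The category names and keyword lists below are hypothetical, and a deployed system would use a trained statistical classifier rather than keyword overlap; the sketch only shows why noisy ASR output (dropped or substituted words) degrades category assignment:

```python
from collections import Counter

# Hypothetical category keyword sets (illustrative, not from the paper).
CATEGORIES = {
    "billing": {"bill", "charge", "payment", "invoice"},
    "tech_support": {"internet", "router", "slow", "reset"},
}

def classify(utterance):
    """Assign the category whose keyword set overlaps most with the
    (possibly noisy) ASR transcript of the utterance."""
    tokens = Counter(utterance.lower().split())
    scores = {cat: sum(tokens[w] for w in kws) for cat, kws in CATEGORIES.items()}
    return max(scores, key=scores.get)

print(classify("my internet is slow after the reset"))  # tech_support
```

When ASR errors replace the evidence words, overlap scores collapse, which is the failure mode the paper targets.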

Sensei: Spoken language assessment for call center agents

2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2007

In this paper, we present a system, called Sensei, for assessment of the spoken English skills of call center agents. Sensei evaluates multiple parameters of spoken English, i.e., articulation of sounds, correctness of lexical stress in words, and spoken grammar proficiency. Sensei provides an assessment test to be taken by a call center agent (or candidate) and generates a score on each of the spoken English parameters as well as a combined score. It is implemented as a web application, so it can be accessed through a web browser and does not require any software to be installed on the client side. We describe how the individual parameters are assessed in Sensei using various speech processing techniques and the experiments conducted to evaluate these techniques. The performance is compared with assessment performed by human assessors. A correlation of 0.8 is obtained between the overall scores generated by Sensei and by human assessors on a real-life test dataset of 243 candidates, which compares well with the corresponding human-to-human correlation of 0.91.
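The system-versus-human comparison reported above is a Pearson correlation between two lists of overall scores. A minimal sketch (the five candidate scores below are hypothetical, not the paper's data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two score lists, e.g. automatic
    scores versus human assessor scores over the same candidates."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for five candidates (illustrative only).
system = [3.1, 4.0, 2.5, 3.8, 4.5]
human  = [3.0, 4.2, 2.7, 3.6, 4.4]
print(round(pearson(system, human), 3))
```

The paper's 0.8 system-to-human correlation is benchmarked against the 0.91 human-to-human correlation, which bounds how well any automatic scorer can be expected to agree with a single human rater.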

Translingual visual speech synthesis

The Journal of the Acoustical Society of America, 2005

Audio-driven facial animation is an interesting and evolving technique for human-computer interaction. Based on an incoming audio stream, a face image is animated with full lip synchronization. This requires a speech recognition system in the language of the audio to obtain the time alignment for its phonetic sequence. However, building a speech recognition system is data intensive and a very tedious and time-consuming task. We present a novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our case English. The method presented here can also be used for text-to-audiovisual speech synthesis.

Animating Expressive Faces Across Languages

IEEE Transactions on Multimedia, 2004

This paper describes a morphing-based audio driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. We present a novel scheme to implement a language independent system for audio-driven facial animation given a speech recognition system for just one language, in our case English. The method can also be used for text-to-audiovisual speech synthesis. Visemes in new expressions are synthesized so that animations with different facial expressions can be generated. Given an incoming audio stream and still pictures of a face representing different visemes, an animation sequence is constructed using optical flow between visemes. The presented techniques give improved lip synchronization and naturalness to the animated video.

Audio-visual large vocabulary continuous speech recognition in the broadcast domain

Multimedia Signal Processing, 1999

Considers the problem of combining visual cues with audio signals for the purpose of improved automatic machine recognition of speech. Although significant progress has been made in the machine transcription of large-vocabulary continuous speech (LVCSR) over the last few years, the technology to date is most effective only under controlled conditions, such as low noise, speaker-dependent recognition, read speech (as ...
