Arni Fp | Universitas Gadjah Mada (Yogyakarta) (original) (raw)
Uploads
Papers by Arni Fp
AVSP 2003-International …, Jan 1, 2003
This paper proposes an audio-visual speech recognition method using lip movement extracted from s... more This paper proposes an audio-visual speech recognition method using lip movement extracted from side-face images to attempt to increase noise-robustness in mobile environments. Although most previous bimodal speech recognition methods use frontal face (lip) images, these methods are not easy for users since they need to hold a device with a camera in front of their face when talking. Our proposed method capturing lip movement using a small camera installed in a handset is more natural, easy and convenient. This method also effectively avoids a decrease of signal-to-noise ratio (SNR) of input speech. Visual features are extracted by optical-flow analysis and combined with audio features in the framework of HMM-based recognition. Phone HMMs are built by the multi-stream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions, and the best improvement is approximately 6% at 5dB SNR.
AVSP 2003-International …, Jan 1, 2003
This paper proposes an audio-visual speech recognition method using lip movement extracted from s... more This paper proposes an audio-visual speech recognition method using lip movement extracted from side-face images to attempt to increase noise-robustness in mobile environments. Although most previous bimodal speech recognition methods use frontal face (lip) images, these methods are not easy for users since they need to hold a device with a camera in front of their face when talking. Our proposed method capturing lip movement using a small camera installed in a handset is more natural, easy and convenient. This method also effectively avoids a decrease of signal-to-noise ratio (SNR) of input speech. Visual features are extracted by optical-flow analysis and combined with audio features in the framework of HMM-based recognition. Phone HMMs are built by the multi-stream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions, and the best improvement is approximately 6% at 5dB SNR.