Atsuhiko Kai - Academia.edu

Papers by Atsuhiko Kai

Hands-free speaker identification based on spectral subtraction using a multi-channel least mean square approach

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013

Speech recognition using blind source separation and dereverberation method for mixed sound of speech and music

2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013

In this paper, we propose a method for non-stationary noise reduction and dereverberation. We use a blind dereverberation method based on spectral subtraction using a multi-channel least mean square algorithm, which was proposed in our previous study. To suppress non-stationary noise, we use blind source separation based on an efficient fast independent component analysis algorithm. The method is evaluated on mixed sound of speech and music and achieves average relative word error reduction rates of 41.9% and 7.9% over a baseline method and state-of-the-art multi-step linear prediction-based dereverberation, respectively, in a real environment.
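The spectral-subtraction side of this approach can be sketched as follows. This is a minimal single-channel illustration, not the paper's multi-channel LMS channel estimate, and the parameter values (`delay`, `alpha`, `floor`) are hypothetical:

```python
import numpy as np

def spectral_subtraction_dereverb(power_spec, delay=4, alpha=0.5, floor=0.01):
    """Crude late-reverberation suppression by spectral subtraction.

    power_spec: (frames, bins) STFT power spectrogram.
    delay: frame delay approximating the onset of late reverberation.
    alpha: over-subtraction factor (illustrative value, not from the paper).
    floor: spectral floor to avoid negative power after subtraction.
    """
    est = np.copy(power_spec)
    # Model late reverberation as a scaled, delayed copy of the power
    # spectrum and subtract it frame by frame.
    for t in range(delay, power_spec.shape[0]):
        late = alpha * power_spec[t - delay]
        est[t] = np.maximum(power_spec[t] - late, floor * power_spec[t])
    return est
```

In the paper the reverberant channel is estimated blindly with a multi-channel LMS algorithm rather than fixed by hand as here.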

Using acoustic dissimilarity measures based on state-level distance vector representation for improved spoken term detection

2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, 2013

This paper proposes a simple approach to subword-based spoken term detection (STD) that uses improved acoustic dissimilarity measures based on a distance-vector representation at the state level. Our approach assumes that both the query term and the spoken documents are represented by subword units and then converted to sequences of HMM states. The set of all distributions in the subword-based HMMs is used to generate a distance-vector representation of each state of every subword unit. Each element of a distance vector is the distance between the distributions of two different states, so a vector represents a structural feature at the state level. Experimental results showed that the proposed method significantly outperforms the baseline method, which employs a conventional subword-level acoustic dissimilarity measure, with very little increase in the required search time.
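The distance-vector idea can be sketched like this, assuming diagonal-Gaussian state distributions and a symmetric KL divergence as the element-wise distance (one reasonable choice; the paper's exact distance measure may differ):

```python
import numpy as np

def distance_vectors(means, variances):
    """Represent each HMM state by its vector of distances to all states.

    means, variances: (n_states, dim) diagonal-Gaussian parameters.
    Returns D where row i is the distance-vector representation of state i;
    D[i, j] is the symmetric KL divergence between states i and j.
    """
    n = means.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Symmetric KL between diagonal Gaussians (log terms cancel).
            D[i, j] = 0.5 * np.sum(
                (variances[i] + (means[i] - means[j]) ** 2) / variances[j]
                + (variances[j] + (means[j] - means[i]) ** 2) / variances[i]
                - 2.0)
    return D

def state_dissimilarity(D, i, j):
    """Compare two states via their distance vectors instead of directly."""
    return np.linalg.norm(D[i] - D[j])
```

Two states that sit in the same position relative to all other states get similar distance vectors, which is the structural feature the measure exploits.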

Evaluation of unknown word processing in a spoken word recognition system

An understanding strategy based on plausibility score in recognition history using CSR confidence measure

Environment-dependent denoising autoencoder for distant-talking speech recognition

EURASIP Journal on Advances in Signal Processing, 2015

Evaluation of Hands-Free Large Vocabulary Continuous Speech Recognition by Blind Dereverberation Based on Spectral Subtraction by Multi-channel LMS Algorithm

Lecture Notes in Computer Science, 2011

Combination of bottleneck feature extraction and dereverberation for distant-talking speech recognition

Multimedia Tools and Applications, 2015

Distant Speaker Recognition Based on the Automatic Selection of Reverberant Environments Using GMMs

2009 Chinese Conference on Pattern Recognition, 2009

Channel distortion in a distant environment may drastically degrade speaker recognition performance because the training and test conditions differ significantly. In this paper, we propose robust distant speaker recognition based on the automatic selection of reverberant environments using Gaussian mixture models. Three methods are proposed: (I) optimum channel determination, (II) joint optimum speaker and channel determination, and (III) optimum channel determination at the utterance level. Real-world speech data and simulated reverberant speech data are used to evaluate the proposed methods. The third method achieves a relative error reduction of 69.6% over baseline speaker recognition using a reverberant environment-independent method, and its performance is equivalent to that of a reverberant environment-dependent method (an ideal-condition method).
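Utterance-level environment selection reduces to a maximum-likelihood model choice. A minimal sketch, with a single diagonal Gaussian per environment as a stand-in for the paper's GMMs (the names and values below are illustrative):

```python
import numpy as np

def select_environment(features, env_models):
    """Pick the reverberant-environment model with the highest average
    per-frame log-likelihood over the whole utterance (one decision per
    utterance, in the spirit of method III).

    features: (frames, dim) feature vectors of one utterance.
    env_models: dict mapping an environment label to (mean, var) arrays.
    """
    best, best_ll = None, -np.inf
    for label, (mean, var) in env_models.items():
        # Diagonal-Gaussian log-likelihood, averaged over frames.
        ll = -0.5 * np.mean(
            np.sum(np.log(2 * np.pi * var) + (features - mean) ** 2 / var,
                   axis=1))
        if ll > best_ll:
            best, best_ll = label, ll
    return best
```

The selected environment then determines which environment-dependent speaker models are used for recognition.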

Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification

EURASIP Journal on Audio, Speech, and Music Processing, 2015

Deep neural network (DNN)-based approaches have been shown to be effective in many automatic speech recognition systems, but few works have focused on DNNs for distant-talking speaker recognition. In this study, a bottleneck feature derived from a DNN and cepstral-domain denoising autoencoder (DAE)-based dereverberation are presented for distant-talking speaker identification, and a combination of the two approaches is proposed. For the DNN-based bottleneck feature, we note that DNNs can transform reverberant speech features into a new feature space with greater discriminative ability for distant-talking speaker recognition. Conversely, cepstral-domain DAE-based dereverberation suppresses reverberation by mapping the cepstrum of reverberant speech to that of clean speech, with the expectation of improving distant-talking speaker recognition. Since the DNN-based discriminative bottleneck feature and DAE-based dereverberation are strongly complementary, their combination is expected to be very effective for distant-talking speaker identification. A speaker identification experiment was performed on a distant-talking speech set with reverberant environments differing from the training environments. In suppressing late reverberation, our method outperformed state-of-the-art dereverberation approaches such as multichannel least mean squares (MCLMS). Compared with MCLMS, we obtained relative error rate reductions of 21.4% for the bottleneck feature and 47.0% for the autoencoder feature. Moreover, combining the likelihoods of the DNN-based bottleneck feature and DAE-based dereverberation further improved performance.
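Extracting a bottleneck feature amounts to forwarding a frame through a trained network and reading out the narrow hidden layer. A minimal sketch with sigmoid layers (the paper's DNN architecture and training details are not reproduced here):

```python
import numpy as np

def bottleneck_features(x, weights, biases, bottleneck_index):
    """Forward a feature frame through an MLP and return the activations
    of the narrow (bottleneck) hidden layer as the new feature vector.

    weights, biases: per-layer parameters of a trained network.
    bottleneck_index: index of the narrow layer whose output is kept;
    the layers after it are used only during training.
    """
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid hidden layer
        if k == bottleneck_index:
            return h
    raise ValueError("bottleneck_index beyond the last layer")
```

The bottleneck activations then replace (or augment) the original cepstral features for the speaker models.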

Least Distance Based Inefficiency Measures on the Pareto-Efficient Frontier in DEA

Journal of the Operations Research Society of Japan, 2012

Since Briec developed a family of least distance based inefficiency measures satisfying weak monotonicity over the weakly efficient frontier, the existence of a least distance based efficiency measure satisfying strong monotonicity on the strongly efficient frontier has remained an open problem. This paper gives a negative answer to the open problem and to a relaxed version of it. Modifying Briec's inefficiency measures yields an alternative solution to the relaxed open problem that can be used for theoretical and practical applications.

A frame-synchronous continuous speech recognition algorithm using a top-down parsing of context-free grammar

A pairwise discriminant approach using artificial neural networks for continuous speech recognition

Journal of the Acoustical Society of Japan, Nov 1, 1992

Single-channel dereverberation for distant-talking speech recognition by combining denoising autoencoder and temporal structure normalization

The 9th International Symposium on Chinese Spoken Language Processing, Sep 1, 2014

In this paper, we propose robust distant-talking speech recognition by combining a cepstral-domain denoising autoencoder (DAE) and a temporal structure normalization (TSN) filter. Because a DAE has a deep structure and nonlinear processing steps, it is flexible enough to model a highly nonlinear mapping between the input and output spaces. We train a DAE to map reverberant and noisy speech features to the underlying clean speech features in the cepstral domain. After applying the DAE to suppress reverberation, we apply a post-processing technique based on the TSN filter, which reduces noise and reverberation effects by normalizing the modulation spectra to reference spectra of clean speech. The proposed method was evaluated using speech in simulated and real reverberant environments. By combining the cepstral-domain DAE and TSN, the average word error rate (WER) was reduced from 25.2% for the baseline system to 21.2% in simulated environments, and from 47.5% to 41.3% in real environments.
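The TSN post-processing step can be sketched as follows: rescale the modulation spectrum (an FFT along the time axis) of each cepstral trajectory toward a clean-speech reference magnitude profile while keeping the input phase. This is a simplified illustration; the paper's exact filter design may differ:

```python
import numpy as np

def tsn_filter(cepstra, ref_mag):
    """Sketch of temporal structure normalization (TSN).

    cepstra: (frames, coeffs) cepstral features of one utterance.
    ref_mag: (frames // 2 + 1, coeffs) reference modulation magnitudes
             precomputed from clean speech.
    """
    spec = np.fft.rfft(cepstra, axis=0)
    mag = np.abs(spec)
    # Impose the reference modulation magnitude, avoiding division by
    # zero in near-silent modulation bins; phase is preserved.
    scale = ref_mag / np.maximum(mag, 1e-10)
    return np.fft.irfft(spec * scale, n=cepstra.shape[0], axis=0)
```

In the combined system this filter runs on the DAE output, so residual reverberation left by the cepstral mapping is further attenuated.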

Usability of Browser-Based Pen-Touch/Speech User Interfaces for Form-Based Applications in Mobile Environment

Lecture Notes in Computer Science, 2000

This paper describes a speech interface system for information retrieval services on the WWW and the results of a usability evaluation on form-based information retrieval tasks. We have presented a general speech interface system that can be applied to many menu-based information retrieval services on the WWW; the system adds speech input capability to a general WWW browser. A usability evaluation experiment of the speech-enabled system was conducted on several existing menu-based information retrieval services, and the results are compared with those of a conventional system with pen-touch input. We also investigated how usability differs across operating conditions.

Comparison of continuous speech recognition systems with unknown-word processing for speech disfluencies

Systems and Computers in Japan, 1998

This paper describes speech recognition systems for dealing with spontaneous speech, in which an unknown-word processing method based on subword sequence decoding is employed. We propose an efficient algorithm for unknown-word processing that runs an independent subword sequence decoding process while an utterance is verified and searched under an appropriate linguistic constraint. The algorithm is applied to both one-pass-based and spotting-based search algorithms after small modifications. We compared speech understanding systems whose recognition strategies differ in their search strategies and in their handling of unknown words and interjections. We observed that the effectiveness of the unknown-word processing technique depends on the accuracy of the acoustic model, and that a one-pass-based search algorithm with unknown-word processing attains the best performance in phrase/sentence accuracy and computational efficiency, although its sentence understanding rate for the evaluated task is comparable to or slightly below the best among the other methods. The experimental results showed that when the unknown-word processing technique was employed to deal with extraneous speech, a sentence understanding rate of 80% was attained for a task with a test-set perplexity of 40. © 1998 Scripta Technica, Syst Comp Jpn, 29(9): 43–53, 1998

Distant-talking speech recognition using multi-channel LMS and multiple-step linear prediction

The 9th International Symposium on Chinese Spoken Language Processing, 2014

Dereverberation Based on Spectral Subtraction by Multi-channel LMS Algorithm for Hands-free Speech Recognition

Modern Speech Recognition Approaches with Case Studies, 2012

A context-free grammar-driven, one-pass HMM-based continuous speech recognition method

Systems and Computers in Japan, 1994

This paper describes a frame-synchronous continuous speech recognition algorithm using a context-free grammar, together with its evaluation. A frame-synchronous parsing algorithm for context-free grammar based on a top-down strategy is used to predict successive words. The parsing process is incorporated into the one-pass search algorithm with dynamic expansion of a finite-state automaton. Since the number of states becomes large, beam search is used both in prediction and in pruning unreliable candidate branches from the search space. Using the proposed algorithm, a sentence recognition rate of 90 percent is obtained in a speaker adaptation mode, while a conventional system based on word spotting and a lattice parsing algorithm obtained 80.7 percent on a task of perplexity 10.

Denoising autoencoder and environment adaptation for distant-talking speech recognition with asynchronous speech recording

Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific, 2014
