Masafumi Nishimura - Academia.edu

Uploads

Books by Masafumi Nishimura

Throat Microphone Speech Enhancement using Machine Learning Technique

2nd International Conference on Innovative Computing and Cutting-edge Technologies (ICICCT 2020); book chapter in "Intelligent Computing Paradigm and Cutting-edge Technologies", part of the Learning and Analytics in Intelligent Systems book series, 2021

Throat Microphone (TM) speech has a narrow bandwidth and sounds unnatural compared with an acoustic microphone (AM) recording. Although TM-captured speech is not affected by environmental noise, it suffers from naturalness and intelligibility problems. In this paper, we focus on enhancing the perceptual quality of TM speech with a machine learning technique that modifies the spectral envelope and vocal tract parameters. Mel-Frequency Cepstral Coefficients (MFCCs) are extracted as speech features, and a neural network then maps the features of TM speech to those of AM speech. This improves the perceptual quality of TM speech relative to AM speech by estimating and correcting the missing high-frequency components between 4 kHz and 8 kHz from the low-frequency band (0 kHz to 4 kHz) of the TM speech signal. Least-squares estimation and the Inverse Short-Time Fourier Transform Magnitude method are then applied to the estimated power spectrum to reconstruct the speech signal. The ATR503 dataset is used to test the proposed technique. The simulation results show a visible improvement in speech enhancement in adverse environments. The aim of this study is natural human-machine interaction for people with vocal tract disorders.

Papers by Masafumi Nishimura

A method for style adaptation to spontaneous speech by using a semi-linear interpolation technique

This paper deals with a method for adapting a language model created from written-text corpora to spontaneous speech by using a semi-linear interpolation technique. The sizes and topic coverages of spoken-language corpora are usually far smaller than those of written-text corpora. We propose an approach to adapt a base language model to the styles of spontaneous speech on the basis of the following assumptions. Words that are topic-independent, that is to say, common in spontaneous speech, should be predicted mainly by a model created from spontaneous-speech corpora (style model), while the base model is more reliable for predicting topic-related words, because they are difficult to predict from a model based on a small corpus. We classified all words into disfluencies and normal words, and further classified the normal words into two categories, common words and topic words, according to mutual information. For each category, the qualified model (base or style) with the optimal weights for linear interpolation is selected. In other words, a different linear combination of the models is used for each category of a predicted word. We conducted experiments using a spoken-language corpus of Japanese for creating the style model. We achieved a test-set perplexity of 159.1, compared with the baseline of 189.3 (simple linear interpolation) and the perplexity of the style-specific model, which was 230.7.

Automatic Cough Detection Using Deep Neural Network and a Throat Microphone

Screening of Mild Cognitive Impairment Through Conversations with Humanoid Robots: Preliminary Study (Preprint)

BACKGROUND The rising number of dementia patients has become a serious social problem worldwide. To detect dementia at an early stage, many studies have attempted to detect signs of cognitive decline from prosodic and acoustic features. However, many of these methods are not suitable for everyday use because they focus on examinations of cognitive function or on conversational speech recorded during such examinations. On the other hand, conversational humanoid robots are expected to be used in elderly care to help reduce the burden of care and monitoring through interaction. OBJECTIVE To achieve early detection of mild cognitive impairment (MCI) through conversations between elderly people and humanoid robots, without specific examinations such as neuropsychological tests. METHODS We collected conversation data during a neuropsychological examination (MMSE) and daily conversation with a humanoid robot from a total of 94 participants (47 cognitively normal and 47 patients with MCI). W...

IPSJ Trial Standard Concerning Spoken Language Interface

Speech recognition method

The Journal of the Acoustical Society of America, 1990

Throat microphone speech recognition using wav2vec 2.0 and feature mapping

2022 IEEE 11th Global Conference on Consumer Electronics (GCCE), Oct 18, 2022

Throat microphones can record the voice while simultaneously suppressing the impact of external noise. This work aims to improve speech recognition performance using throat microphones in a high-noise environment. However, as there is no large database of throat microphone speech, training data are insufficient. This study proposes a method to improve throat microphone speech recognition by utilizing self-supervised learning models such as wav2vec 2.0. Because the volume of throat microphone speech data available for training the model is rather limited, linguistic information is not particularly well learned. We therefore apply feature mapping to a large Japanese speech corpus to generate a quantity of pseudo-throat-microphone speech features. It was confirmed that a significant improvement in the recognition rate could be achieved by using the generated data for fine-tuning the wav2vec 2.0 model.

An Improvement of HMM Separation for Reverberant Speech Recognition

IPSJ SIG Notes, Feb 7, 2003

Identification of vocal tract state before and after swallowing using acoustic features

2022 IEEE 11th Global Conference on Consumer Electronics (GCCE), Oct 18, 2022

Automatic Accent Labeling for a Text-to-Speech System

IPSJ SIG Notes, Feb 10, 2007

Local Peak Enhancement for In-Car Speech Recognition in Noisy Environment

IEICE Transactions on Information and Systems, Mar 1, 2008

DNN-based feature transformation for speech recognition using throat microphone

In this paper, we focus on utilizing a throat microphone as a noise-robust device because its signal is much less affected by surrounding noise than a conventional acoustic microphone signal. However, it can only record narrow frequency bands, and its microphone characteristics also differ from those of an acoustic microphone. Therefore, speech recognition performance is greatly degraded when a throat microphone is used as-is in place of a conventional acoustic microphone. To overcome this problem, we propose a deep neural network (DNN)-based feature transformation method combined with model adaptation. We conducted a continuous digit recognition experiment. The results revealed that the proposed method improved the word error rate (WER) of the throat microphone from 41.4% to 17.6%.

A Metric for Evaluating Speech Recognition Accuracy Based on Human Perception (Speech; 16th Spoken Language Symposium)

IEICE technical report. Speech, Dec 15, 2014

Telephony Speech Phrasing based on Breath Event Detection

IEICE Technical Report, Feb 2, 2012

A Metric for Evaluating Speech Recognition Accuracy Based on Human Perception

IEICE technical report. Speech, Dec 8, 2014

Discriminative re-ranking for automatic speech recognition by leveraging invariant structures

Speech Communication, Sep 1, 2015

An invariant structure was proposed by Minematsu (2004) and Minematsu et al. (2010) as a long-span feature that suppresses non-linguistic factors. In contrast to frame-based features such as Mel-Frequency Cepstrum Coefficients (MFCC), invariant structures are extracted as contrasts between speech events in a given utterance. Because the invariant structure is not a time series of short-term features, it is difficult to use directly in the general framework of Automatic Speech Recognition (ASR), although its robustness against non-linguistic factors is desirable for ASR. To introduce the invariant structure effectively into ASR, we are working on a method that leverages it in a discriminative re-ranking paradigm. In our paradigm, a baseline ASR system generates N-best lists with hypothesized phoneme-level alignments so that one invariant structure can be extracted for each hypothesis. We also propose methods to convert an extracted invariant structure into a fixed-dimensional feature vector for use in discriminative re-ranking. Experimental results on three tasks (continuous digit recognition, digit recognition in noisy environments, and large-vocabulary continuous speech recognition) showed significant error reductions and improved robustness against noisy environments.
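As a toy illustration of why contrast-based structures resist non-linguistic variation: the pairwise distances between speech events in an utterance are unchanged by a global shift of the feature space (one very simple model of a speaker- or channel-dependent factor). The event matrices and the flattening below are invented for illustration; they are not the paper's actual structure-extraction or vectorization method.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "speech events": 5 events in an utterance, each a 12-dim feature vector.
events = rng.normal(size=(5, 12))
# The "same" utterance under a non-linguistic factor modeled as a global
# translation of the feature space.
shifted = events + rng.normal(size=(1, 12))

def structure_vector(ev):
    # Pairwise Euclidean distances between events, upper triangle flattened
    # into a fixed-dimensional vector usable by a re-ranker.
    d = np.linalg.norm(ev[:, None, :] - ev[None, :, :], axis=-1)
    i, j = np.triu_indices(len(ev), k=1)
    return d[i, j]

v1 = structure_vector(events)
v2 = structure_vector(shifted)
# Distances depend only on differences between events, so the translation
# cancels and the structure vector is identical.
same = bool(np.allclose(v1, v2))
```

For 5 events this yields a 10-dimensional vector regardless of utterance length over those events, which is the property that makes such structures usable as fixed-size re-ranking features.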

Speech recognition method

Journal of the Acoustical Society of America, Dec 1, 1993

Automatic Detection of the Chewing Side Using Two-channel Recordings under the Ear

2020 IEEE 2nd Global Conference on Life Sciences and Technologies (LifeTech), Mar 1, 2020

Eating behavior is an important indicator of the state of health. A previous study confirmed that recording eating sounds under the ear, combined with long short-term memory-connectionist temporal classification (LSTM-CTC), was effective for detecting chewing events. This study examined the possibility of identifying the left and right sides of chewing to improve the analysis of eating behavior. More accurate detection was achieved by using the two-channel recordings and their cross-correlation as a new feature than with the conventional mel-frequency cepstral coefficient (MFCC) features.
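The cross-correlation cue mentioned above can be sketched as follows: a chewing burst reaches the nearer ear a few samples earlier, and the lag of the cross-correlation peak recovers that inter-channel delay. The synthetic burst, the delay value, and the sign convention for "left"/"right" are all illustrative assumptions, not the paper's actual feature pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-channel "chewing sound": the left-ear channel receives the burst
# a few samples earlier than the right-ear channel when chewing on the left.
burst = rng.normal(size=200) * np.hanning(200)
delay = 5  # samples of inter-channel delay (the cue being exploited)
left = np.zeros(1000)
right = np.zeros(1000)
left[100:300] = burst
right[100 + delay:300 + delay] = burst

# Full cross-correlation; the offset of its peak gives the inter-channel lag.
xcorr = np.correlate(left, right, mode="full")
lag = int(np.argmax(xcorr)) - (len(right) - 1)

# Negative lag here means the left channel leads (burst arrived there first).
side = "left" if lag < 0 else "right"
```

In practice such a lag feature would be computed per detected chewing event and combined with spectral features, rather than on a whole recording at once.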

Effects of Mounting Position on Throat Microphone Speech Recognition

Speech recognition using a throat microphone is studied. Good signal-to-noise ratios are obtained from throat microphones even under heavy environmental noise; however, the spectrum differs greatly from that of an acoustic microphone and varies with mounting position. To clarify the effect of mounting position, we first measured the spectral distance between close-talk and throat microphones. Second, we gathered a medium-sized corpus of training data using a throat microphone mounted at a suitable position and used it to improve speech recognition accuracy. When knowledge distillation (KD) was applied to the DNN-HMM with this training data in a quiet environment, throat microphone speech recognition achieved performance close to that of close-talk microphones.
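One common way to quantify a "spectral distance" like the one measured above is the log-spectral distance (RMS difference of log-magnitude spectra in dB). The abstract does not name its metric, so this choice, and the low-pass filter standing in for a throat microphone's band limitation, are assumptions for illustration.

```python
import numpy as np

def log_spectral_distance(x, y, n_fft=256):
    # RMS difference of the log-magnitude spectra (in dB) of two signals.
    # A small floor avoids log-of-zero on near-null spectral bins.
    X = np.abs(np.fft.rfft(x, n_fft)) + 1e-10
    Y = np.abs(np.fft.rfft(y, n_fft)) + 1e-10
    diff = 20 * np.log10(X / Y)
    return float(np.sqrt(np.mean(diff ** 2)))

fs = 8000
t = np.linspace(0, 1, fs, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
# Crude stand-in for a throat microphone: a moving-average low-pass filter.
muffled = np.convolve(clean, np.ones(8) / 8, mode="same")

d_same = log_spectral_distance(clean, clean)   # identical signals -> 0 dB
d_diff = log_spectral_distance(clean, muffled)  # band-limited -> nonzero
```

A position comparison would compute such a distance between recordings from each candidate mounting position and a close-talk reference, framed and averaged over an utterance rather than over a raw signal as here.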

Bottleneck feature-mediated DNN-based feature mapping for throat microphone speech recognition

2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

Throat microphones are more robust to environmental noise than usual acoustic microphones, such as close-talk microphones, because they detect speech signals through skin vibrations rather than air transmission. Throat microphones, however, cannot be used in conventional speech recognition systems because their acoustic characteristics differ greatly from those of acoustic microphones. In this study, we propose a deep neural network (DNN)-based feature mapping method for throat microphone speech recognition. To utilize a large amount of training data recorded by acoustic microphones and effectively reduce the acoustic mismatch between the throat and acoustic microphones, we use bottleneck features to mediate between them. Evaluation results for a large-vocabulary speech recognition task of Japanese free conversation revealed that the proposed system had a 45.8% lower character error rate (75.5% → 40.9%) than a typical MFCC system trained on the acoustic microphone data.

Speech Input Method in Automobiles Reflecting Analysis on How Users Speak