Xinhui Hu - Academia.edu

Papers by Xinhui Hu

Research paper thumbnail of The RoyalFlush System of Speech Recognition for M2MeT Challenge

arXiv (Cornell University), Feb 3, 2022

This paper describes our RoyalFlush system for the multi-speaker automatic speech recognition (ASR) track of the M2MeT challenge. We adopted a serialized output training (SOT) based multi-speaker ASR system with large-scale simulated data. First, we investigated a set of front-end methods, including multi-channel weighted prediction error (WPE) dereverberation, beamforming, speech separation, and speech enhancement, to process the training, evaluation, and test sets; based on their experimental results, we selected only WPE and beamforming as our front-end methods. Second, we invested heavily in data augmentation for multi-speaker ASR, including adding noise and reverberation, overlapped-speech simulation, multi-channel speech simulation, speed perturbation, and front-end processing, which brought us a significant performance improvement. Finally, to exploit the complementary performance of different model architectures, we trained the standard Conformer-based joint CTC/attention model (Conformer) and a U2++ ASR model with a bidirectional attention decoder, a modification of the Conformer, and fused their results. Compared with the official baseline system, our system achieved an absolute character error rate (CER) reduction of 12.22% on the evaluation set and 12.11% on the test set.
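
The overlapped-speech simulation mentioned in the abstract can be reproduced with a few lines of NumPy. The sketch below is illustrative, not the paper's recipe: the overlap ratio, SNR range, and random signals standing in for real utterances are all assumptions.

```python
import numpy as np

def mix_utterances(s1, s2, overlap_ratio=0.5, snr_db=0.0, rng=None):
    """Mix two single-speaker signals into one overlapped mixture.

    s1, s2: 1-D float arrays at the same sample rate.
    overlap_ratio: fraction of the shorter signal that overlaps.
    snr_db: energy ratio of s1 to s2 in the mixture.
    """
    rng = rng or np.random.default_rng()
    # Scale s2 so the mixture has the requested SNR.
    p1 = np.mean(s1 ** 2) + 1e-10
    p2 = np.mean(s2 ** 2) + 1e-10
    s2 = s2 * np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    # Start s2 before s1 ends so the two speakers overlap.
    offset = int(len(s1) - overlap_ratio * min(len(s1), len(s2)))
    total = max(len(s1), offset + len(s2))
    mix = np.zeros(total)
    mix[:len(s1)] += s1
    mix[offset:offset + len(s2)] += s2
    return mix, offset  # offset marks the speaker-change point for SOT labels

# Example with random "utterances" standing in for real speech.
rng = np.random.default_rng(0)
a, b = rng.standard_normal(16000), rng.standard_normal(12000)
mixture, change_point = mix_utterances(a, b, overlap_ratio=0.3, snr_db=2.0, rng=rng)
```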

Research paper thumbnail of Combination of multiple acoustic models with unsupervised adaptation for lecture speech transcription

Speech Communication, 2016

Automatic speech recognition (ASR) systems have achieved considerable progress in real applications thanks to careful architectural design with advanced techniques and algorithms. However, how to design a system that efficiently integrates these techniques to obtain advanced performance is still a challenging task. In this paper, we introduce an ASR system based on ensemble model combination and adaptation with two characteristics: (1) large-scale combination of multiple ASR systems via Recognizer Output Voting Error Reduction (ROVER), and (2) multi-pass unsupervised speaker adaptation of deep neural network acoustic models together with topic adaptation of the language model. The multiple acoustic models were trained with different acoustic features and model architectures, which provided complementary and discriminative information in the ROVER process. With these multiple acoustic models, a better estimate of word confidence could be obtained from the ROVER process, which helped in selecting data for unsupervised adaptation of the previously trained acoustic models. The final recognition result was obtained through multi-pass decoding, ROVER, and adaptation. We tested the system on lecture speech on topics related to Technology, Entertainment and Design (TED) used in the International Workshop on Spoken Language Translation (IWSLT) evaluation campaign, and obtained word error rates of 6.5%, 7.0%, 10.6%, and 8.4% on the 2011, 2012, 2013, and 2014 test sets, which to our knowledge are the best results for these evaluation sets.
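
ROVER's core step is word-level voting over aligned hypotheses. A minimal sketch, assuming the hypotheses have already been aligned into correspondence slots (the real ROVER tool builds these with dynamic-programming alignment), that combines voting frequency with word confidence:

```python
from collections import defaultdict

def rover_vote(aligned_slots, alpha=0.7):
    """Pick one word per slot from multiple aligned hypotheses.

    aligned_slots: list of slots; each slot holds (word, confidence)
        pairs, one per system, with '' marking a deletion.
    alpha: weight on voting frequency vs. mean confidence.
    """
    n_systems = len(aligned_slots[0])
    output = []
    for slot in aligned_slots:
        scores = defaultdict(lambda: [0, 0.0])  # word -> [count, conf sum]
        for word, conf in slot:
            scores[word][0] += 1
            scores[word][1] += conf
        best = max(
            scores.items(),
            key=lambda kv: alpha * kv[1][0] / n_systems
            + (1 - alpha) * kv[1][1] / kv[1][0],
        )[0]
        if best:  # skip the empty "deletion" token
            output.append(best)
    return output

slots = [
    [("the", 0.9), ("the", 0.8), ("a", 0.6)],
    [("cat", 0.7), ("hat", 0.5), ("cat", 0.9)],
]
print(rover_vote(slots))  # ['the', 'cat']
```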

Research paper thumbnail of Spoken document retrieval using topic models

Proceedings of the 3rd International Universal Communication Symposium, 2009

In this paper, we propose a document topic model (DTM) based on non-negative matrix factorization (NMF) for spontaneous spoken document retrieval. The model uses latent semantic indexing to detect underlying semantic relationships within documents. Each document is interpreted as a generative topic model belonging to many topics, and the relevance of a document to a query is expressed by the probability of the query being generated by the model. The term-document matrix used for NMF is built stochastically from the N-best speech recognition results, so that multiple recognition hypotheses can be used to compensate for word recognition errors. Experiments with this approach are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ), with 39 queries over more than 600 hours of spontaneous Japanese speech. The retrieval performance of this model proves superior to the conventional vector space model (VSM) when the dimension (topic number) exceeds a certain threshold. Moreover, in terms of both retrieval performance and topic expressiveness, the NMF-based topic model is verified to surpass another latent indexing method based on singular value decomposition (SVD). The extent to which this topic model can resist speech recognition errors, a problem specific to spoken document retrieval, is also investigated.
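
The retrieval idea can be sketched with scikit-learn's NMF: factorize the term-document matrix, fold the query into the topic space, and rank documents by similarity. Everything below (toy documents, cosine ranking, plain term counts) is illustrative; the paper additionally builds the matrix stochastically from N-best hypotheses and scores by generation probability.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

docs = ["speech recognition of spontaneous speech",
        "topic models for document retrieval",
        "retrieval of spoken documents with topic models"]
query = ["spoken document retrieval"]

vec = CountVectorizer()
X = vec.fit_transform(docs)              # documents x terms
nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                 # document-topic weights
q = nmf.transform(vec.transform(query))  # fold query into the topic space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

# Rank documents by similarity to the query in the topic space.
ranking = sorted(range(len(docs)), key=lambda i: -cosine(W[i], q[0]))
print([docs[i] for i in ranking])
```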

Research paper thumbnail of A Myanmar large vocabulary continuous speech recognition system

2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2015

Research paper thumbnail of Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge

This paper describes the Royalflush speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription Challenge. Our system comprises speech enhancement, overlapped speech detection, speaker embedding extraction, speaker clustering, speech separation, and system fusion. We make three contributions. First, we propose an architecture combining multi-channel and U-Net-based models for far-field overlapped speech detection, aiming to exploit the benefits of both. Second, to let the overlapped speech detection model aid speaker diarization, we propose a speech-separation-based approach to handling overlapped speech in which speaker verification is further applied. Third, we explore three speaker embedding methods and obtain state-of-the-art performance on the CNCeleb-E test set. With these proposals, our best individual system significantly reduces DER from 15.25% to 6.40%, and ...
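
The speaker-clustering stage can be illustrated with agglomerative hierarchical clustering over cosine distances between speaker embeddings. A minimal sketch with random vectors standing in for real embeddings; the system's actual embedding extractors, linkage, and stopping threshold are not specified here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Toy "embeddings": two synthetic speakers, 5 segments each.
emb = np.vstack([rng.normal(0, 0.1, (5, 16)) + 1.0,
                 rng.normal(0, 0.1, (5, 16)) - 1.0])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # length-normalize

# Average-linkage clustering on cosine distance; the threshold t
# would normally be tuned on a development set.
dists = pdist(emb, metric="cosine")
labels = fcluster(linkage(dists, method="average"), t=0.5, criterion="distance")
print(labels)  # segments grouped into speaker clusters
```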

Research paper thumbnail of ACOUSTIC SCENE CLASSIFICATION WITH DEVICE MISMATCH USING DATA AUGMENTATION BY SPECTRUM CORRECTION Technical Report

This report describes the RoyalFlush submissions to DCASE2020 Task 1a. Our aim is an acoustic scene classification system that is robust across multiple devices. We use log-Mel features and their first and second derivatives as inputs, and fully convolutional deep neural networks as the classification model, applying strategies such as pre-activation, L2 regularization, dropout, and feature normalization. To mitigate the data imbalance caused by the different devices, we generate additional training data using a device-related spectrum correction method.
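
Device-related spectrum correction can be sketched as follows: estimate the average magnitude spectrum of each device and map recordings from one device toward a reference device by multiplying with the ratio of the two averages. The shapes and values below are illustrative assumptions, not the report's configuration.

```python
import numpy as np

def spectrum_correction(specs_src, specs_ref, eps=1e-8):
    """Per-frequency correction filter from a source to a reference device.

    specs_src, specs_ref: magnitude spectrograms per device,
        shape (n_clips, n_freq, n_frames).
    """
    mean_src = specs_src.mean(axis=(0, 2)) + eps  # avg spectrum, source device
    mean_ref = specs_ref.mean(axis=(0, 2)) + eps  # avg spectrum, reference device
    return mean_ref / mean_src                    # shape (n_freq,)

def apply_correction(spec, filt):
    """Augment one clip: push its spectrum toward the reference device."""
    return spec * filt[:, None]

rng = np.random.default_rng(0)
src = rng.uniform(0.5, 1.0, (20, 257, 100))  # toy device-A spectrograms
ref = rng.uniform(0.8, 1.3, (20, 257, 100))  # toy device-B spectrograms
filt = spectrum_correction(src, ref)
augmented = apply_correction(src[0], filt)
```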

Research paper thumbnail of Progress Report of Spoken Document Processing Working Group

Scientific Programming, 2011

This report describes the activities of the SLP Spoken Document Processing Working Group (SDPWG). The SDPWG was organized in 2006 and reorganized in 2009; this report mainly covers the activities of the second period, during which the SDPWG organized the Interspeech 2010 special session, the NTCIR-9 workshop, and the special issue on spoken document processing of the IPSJ journal. The report focuses on the planning process behind these activities.

Research paper thumbnail of The NCT ASR system for IWSLT 2014

This paper describes our automatic speech recognition system for the IWSLT 2014 evaluation campaign. The system is based on weighted finite-state transducers and a combination of multiple subsystems built from four types of acoustic feature sets, four types of acoustic models, and N-gram and recurrent neural network language models. Compared with the system we used last year, we added subsystems based on deep neural network modeling of filter bank features and convolutional deep neural network modeling of filter bank features with tonal features. In addition, we applied modifications and improvements to automatic acoustic segmentation and deep neural network speaker adaptation. In speech recognition experiments, the new system achieved a 21.5% relative improvement in word error rate over last year's system on the 2013 English test data set.
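
The 21.5% figure is a relative reduction, i.e. the drop in word error rate expressed as a fraction of the old rate. For instance, with a hypothetical baseline (the abstract gives only the relative number):

```python
def relative_improvement(wer_old, wer_new):
    """Relative WER reduction, as a percentage of the old rate."""
    return 100.0 * (wer_old - wer_new) / wer_old

# Hypothetical: a 20.0% WER dropping to 15.7% is a 21.5% relative gain.
print(round(relative_improvement(20.0, 15.7), 1))  # 21.5
```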

Research paper thumbnail of An End-to-End Dialect Identification System with Transfer Learning from a Multilingual Automatic Speech Recognition Model

Interspeech 2021, 2021

In this paper, we propose an end-to-end (E2E) dialect identification system trained using transfer learning from a multilingual automatic speech recognition (ASR) model. This is an extension of the system we submitted to the Oriental Language Recognition Challenge 2020 (AP20-OLR), and we verified its applicability on the dialect identification (DID) task of the AP20-OLR. First, we trained a robust conformer-based joint connectionist temporal classification (CTC)/attention multilingual E2E ASR model using training corpora of eight languages, independent of the target dialects. Second, we initialized the E2E classifier with the ASR model's shared encoder using a transfer learning approach. Finally, we trained the classifier on the target dialect corpus. We obtained the final classifier by selecting the better of two candidates: (1) the model averaged in terms of loss value, and (2) the model averaged in terms of classification accuracy. Our experiments on the DID test set of the AP20-OLR demonstrated significant identification improvements for three Chinese dialects. Our system outperforms the winning team of the AP20-OLR, with largest relative reductions of 19.5% in Cavg and 25.2% in EER.
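
The transfer-learning step amounts to copying the pretrained ASR encoder's weights into the classifier and training a new classification head. A PyTorch sketch with made-up module names, sizes, and checkpoint path; the paper's actual conformer encoder and training setup are not reproduced here.

```python
import torch
import torch.nn as nn

class DialectClassifier(nn.Module):
    def __init__(self, encoder, enc_dim=256, n_dialects=3):
        super().__init__()
        self.encoder = encoder                 # shared ASR encoder
        self.head = nn.Linear(enc_dim, n_dialects)

    def forward(self, feats):
        h = self.encoder(feats)                # (batch, time, enc_dim)
        return self.head(h.mean(dim=1))        # pool over time, then classify

# Stand-in encoder; in the paper this is the multilingual conformer encoder.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
clf = DialectClassifier(encoder)

# Initialize from an ASR checkpoint (hypothetical path and key);
# strict=False tolerates the decoder/CTC parameters a real checkpoint
# would also contain.
# asr_state = torch.load("asr_checkpoint.pt")
# clf.encoder.load_state_dict(asr_state["encoder"], strict=False)

logits = clf(torch.randn(4, 100, 80))          # 4 utterances, 100 frames
print(logits.shape)                            # torch.Size([4, 3])
```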

Research paper thumbnail of Data Augmentation for Code-Switch Language Modeling by Fusing Multiple Text Generation Methods

Interspeech 2020, 2020

To deal with the problem of data scarcity when training language models (LMs) for code-switching (CS) speech recognition, we propose an approach that obtains augmentation texts from three different viewpoints. The first enhances the monolingual LM by selecting corresponding sentences from existing conversational corpora. The second is based on syntactically constrained replacements in a monolingual Chinese corpus, with the help of an aligned word list obtained from a pseudo-parallel corpus and the part-of-speech (POS) tags of words. The third uses text generation based on a pointer-generator network with a copy mechanism, trained on real CS text data. Sentences from all of these approaches improve CS LMs, and they are finally fused into a single LM for CS ASR tasks. LMs built from the augmented data were evaluated on two Mandarin-English CS speech sets, DTANG and SEAME. Perplexities were greatly reduced with all kinds of augmented text, and speech recognition performance improved steadily: the mixed word error rate (MER) on the DTANG and SEAME evaluation sets was relatively reduced by 9.10% and 29.73%, respectively.
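
The second augmentation method (POS-constrained replacement) can be sketched as a simple substitution pass: walk a POS-tagged monolingual Chinese sentence and, where a word appears in the aligned Chinese-English word list with a matching POS tag, replace it with its English counterpart with some probability. The word list, tags, and replacement probability below are all toy assumptions.

```python
import random

# Toy aligned word list from a pseudo-parallel corpus: zh -> (en, POS).
aligned = {"会议": ("meeting", "NN"), "项目": ("project", "NN"),
           "取消": ("cancel", "VV")}

def make_code_switch(tagged_sentence, p=0.5, rng=None):
    """tagged_sentence: list of (word, POS) pairs for one Chinese sentence."""
    rng = rng or random.Random()
    out = []
    for word, pos in tagged_sentence:
        entry = aligned.get(word)
        # Replace only when the POS constraint is satisfied.
        if entry and entry[1] == pos and rng.random() < p:
            out.append(entry[0])
        else:
            out.append(word)
    return out

rng = random.Random(0)
sent = [("明天", "NT"), ("的", "DEG"), ("会议", "NN"), ("取消", "VV"), ("了", "SP")]
print(make_code_switch(sent, rng=rng))
```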

Research paper thumbnail of Collecting sentences from web resources for constructing spontaneous Chinese language model

2012 8th International Symposium on Chinese Spoken Language Processing, 2012

In this paper, we present our work on collecting spontaneous texts from the Web for constructing a language model for a Chinese speech recognition system. The selection of spontaneous-like texts involves two steps. First, word-segmented web texts are selected using a perplexity-based approach in which style-related words are emphasized by omitting infrequent topic words from the similarity measurements. Second, the selected texts are clustered based on non-noun part-of-speech (POS) words, and the optimal clusters are chosen by reference to a set of spontaneous seed sentences. Using a language model that interpolates the model trained on the selected sentences with a baseline model, speech recognition evaluations were conducted on an open-domain spontaneous test set. We reduced the character error rate (CER) by 1.64% absolute (6.5% relative) compared with the baseline model, and verified that the proposed method is superior to the conventional perplexity-based approach by about 1% absolute (4.0% relative) in CER.
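
The perplexity-based selection step scores each candidate sentence with a style model trained on in-domain spontaneous text and keeps the lowest-perplexity sentences. A minimal unigram sketch with toy data; the paper works with n-gram models and word-segmented Chinese, and the threshold would be tuned rather than fixed.

```python
import math
from collections import Counter

def train_unigram(corpus):
    counts = Counter(w for sent in corpus for w in sent)
    total = sum(counts.values())
    vocab = len(counts) + 1
    # Add-one smoothing so unseen words get nonzero probability.
    return lambda w: (counts[w] + 1) / (total + vocab)

def perplexity(model, sent):
    logp = sum(math.log(model(w)) for w in sent)
    return math.exp(-logp / max(len(sent), 1))

style_corpus = [["well", "you", "know", "I", "think"],
                ["yeah", "I", "mean", "it", "was", "fun"]]
model = train_unigram(style_corpus)

candidates = [["I", "think", "you", "know"],
              ["quarterly", "revenue", "grew", "strongly"]]
# Keep candidates whose style perplexity falls below a tuned threshold.
selected = [s for s in candidates if perplexity(model, s) < 20.0]
print(selected)  # only the spontaneous-sounding candidate survives
```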

Research paper thumbnail of Construction of Chinese segmented and POS-tagged conversational corpora and their evaluations on spontaneous speech recognitions

Proceedings of the 7th Workshop on Asian Language Resources - ALR7, 2009

The performance of a corpus-based language and speech processing system depends heavily on the quantity and quality of its training corpora. Although several well-known Chinese corpora have been developed, most consist mainly of written text, and even the existing corpora that contain spoken data are insufficient in quantity and limited in domain. In this paper, we describe the development of the Chinese conversational annotated textual corpora currently used in the NICT/ATR speech-to-speech translation system. A total of 510K manually checked utterances provide 3.5M words of Chinese text; as far as we know, this is the largest conversational textual corpus in the travel domain. A set of three parallel corpora is obtained with the corresponding pairs of Japanese and English words from which the Chinese words are translated. Evaluation experiments on these corpora compared language-model parameters, test-set perplexities, and speech recognition performance with Japanese and English. The characteristics of the Chinese corpora, their limitations, and solutions to these limitations are analyzed and discussed.

Research paper thumbnail of Overview of the NTCIR-10 SpokenDoc-2 task

Research paper thumbnail of Collecting Colloquial and Spontaneous-like Sentences from Web Resources for Constructing Chinese Language Models of Speech Recognition

Journal of Information Processing, 2013

In this paper, we present our work on collecting training texts from the Web for constructing language models for colloquial and spontaneous Chinese automatic speech recognition systems. The selection involves two steps. First, web texts are selected using a perplexity-based approach in which style-related words are emphasized by omitting infrequent topic words. Second, the selected texts are clustered based on non-noun part-of-speech words, and the optimal clusters are chosen by reference to a set of spontaneous seed sentences. With the proposed method, we selected over 3.80M sentences. Qualitative analysis of the selected results shows that colloquial and spontaneous-speech-like texts are effectively selected, and the effectiveness of the selection is also quantitatively verified by speech recognition experiments. Using a language model that interpolates the model trained on these selected sentences with a baseline model, speech recognition evaluations were conducted on an open-domain colloquial and spontaneous test set. We reduced the character error rate by 4.0% over the baseline model while also greatly increasing word coverage, and verified that the proposed method is superior to a conventional perplexity-based approach by 1.57% in character error rate.
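
The second selection step, clustering on non-noun POS distributions and picking clusters close to the seed sentences, can be sketched with KMeans: represent each sentence by its normalized non-noun POS histogram, cluster, and keep the cluster whose centroid is nearest the seed centroid. The tag inventory, sentences, and cluster count below are toy assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

POS_TAGS = ["VV", "AD", "SP", "PN", "DEG"]  # toy non-noun tag inventory

def pos_histogram(tags):
    """Normalized histogram of non-noun POS tags for one sentence."""
    v = np.array([tags.count(t) for t in POS_TAGS], dtype=float)
    return v / max(v.sum(), 1.0)

# Toy candidate sentences (POS tag sequences) and spontaneous seeds.
candidates = [["VV", "SP", "SP"], ["VV", "AD"], ["DEG", "PN"], ["SP", "PN", "SP"]]
seeds = [["SP", "PN"], ["SP", "VV", "SP"]]

X = np.vstack([pos_histogram(t) for t in candidates])
seed_centroid = np.vstack([pos_histogram(t) for t in seeds]).mean(axis=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Keep sentences in the cluster whose centroid is closest to the seeds.
best = np.argmin(np.linalg.norm(km.cluster_centers_ - seed_centroid, axis=1))
kept = [i for i, c in enumerate(km.labels_) if c == best]
print(kept)  # indices of the spontaneous-like candidates
```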

Research paper thumbnail of The RoyalFlush System of Speech Recognition for M2MeT Challenge

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper describes our RoyalFlush system for the multi-speaker automatic speech recognition (ASR) track of the M2MeT challenge. We adopted a serialized output training (SOT) based multi-speaker ASR system with large-scale simulated data. First, we investigated a set of front-end methods, including multi-channel weighted prediction error (WPE) dereverberation, beamforming, speech separation, and speech enhancement, to process the training, evaluation, and test sets; based on their experimental results, we selected only WPE and beamforming as our front-end methods. Second, we invested heavily in data augmentation for multi-speaker ASR, including adding noise and reverberation, overlapped-speech simulation, multi-channel speech simulation, speed perturbation, and front-end processing, which brought us a significant performance improvement. Finally, to exploit the complementary performance of different model architectures, we trained the standard Conformer-based joint CTC/attention model (Conformer) and a U2++ ASR model with a bidirectional attention decoder, a modification of the Conformer, and fused their results. Compared with the official baseline system, our system achieved an absolute character error rate (CER) reduction of 12.22% on the evaluation set and 12.11% on the test set.
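
Since this ICASSP version shares its abstract with the arXiv entry above, a sketch of the other front-end component, beamforming, may be more useful here. Below is simple delay-and-sum beamforming in the STFT domain for known per-microphone delays; the challenge system would use a stronger beamformer with estimated statistics, so treat this purely as an illustration with toy shapes.

```python
import numpy as np

def delay_and_sum(stft_channels, delays, sr=16000, n_fft=512):
    """Delay-and-sum beamforming in the frequency domain.

    stft_channels: complex array (n_mics, n_freq, n_frames).
    delays: per-microphone time delays in seconds toward the target.
    """
    n_mics, n_freq, _ = stft_channels.shape
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)[:n_freq]
    # Phase shifts that time-align each channel to the target direction.
    steer = np.exp(2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    return (steer[:, :, None] * stft_channels).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 257, 50)) + 1j * rng.standard_normal((8, 257, 50))
delays = np.linspace(0, 7, 8) * 1e-4   # toy delays for a linear array
Y = delay_and_sum(X, delays)           # (257, 50) enhanced STFT
```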

Research paper thumbnail of An Investigation of Using Hybrid Modeling Units for Improving End-to-End Speech Recognition System

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021

The acoustic modeling unit is crucial for an end-to-end speech recognition system, especially for the Mandarin language. Until now, most studies on Mandarin speech recognition have focused on individual units, and few have paid attention to combinations of these units. This paper uses a hybrid of syllables, Chinese characters, and subwords as the modeling units for an end-to-end speech recognition system based on CTC/attention multi-task learning. In this approach, the character-subword unit is used to train the transformer model in the main task, while the syllable unit is used to enhance the transformer's shared encoder in the auxiliary task with the Connectionist Temporal Classification (CTC) loss function. Recognition experiments were conducted on AISHELL-1 and on an open 1200-hour Mandarin speech corpus collected from OpenSLR. The results demonstrate that the syllable-char-subword hybrid modeling unit achieves better performance than the conventional char-subword units, with a 6.6% relative CER reduction on our 1200-hour data; the substitution error rate is also considerably reduced.
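
The hybrid-unit training objective can be sketched as a weighted sum of an attention cross-entropy loss over character-subword targets and a CTC loss over syllable targets computed on the shared encoder's output. A PyTorch sketch with toy shapes and an assumed task weight; the real system is a full transformer trained in an ESPnet-style recipe.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss = nn.CrossEntropyLoss()

def hybrid_unit_loss(syll_logits, att_logits, syll_tgt, char_tgt,
                     enc_lens, syll_lens, weight=0.3):
    """weight * CTC(syllables) + (1 - weight) * CE(char-subword)."""
    # Auxiliary task: CTC over syllable units on the shared encoder output.
    log_probs = syll_logits.log_softmax(-1).transpose(0, 1)  # (T, B, V)
    l_ctc = ctc_loss(log_probs, syll_tgt, enc_lens, syll_lens)
    # Main task: attention-decoder cross-entropy over char-subword units.
    l_att = ce_loss(att_logits.reshape(-1, att_logits.size(-1)),
                    char_tgt.reshape(-1))
    return weight * l_ctc + (1 - weight) * l_att

B, T, V_syll, U, V_char = 2, 50, 100, 10, 300
syll_logits = torch.randn(B, T, V_syll)           # encoder-side CTC logits
att_logits = torch.randn(B, U, V_char)            # decoder-side logits
syll_tgt = torch.randint(1, V_syll, (B, 8))       # syllable targets (no blank)
char_tgt = torch.randint(0, V_char, (B, U))       # char-subword targets
loss = hybrid_unit_loss(syll_logits, att_logits, syll_tgt, char_tgt,
                        enc_lens=torch.full((B,), T),
                        syll_lens=torch.full((B,), 8))
```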

Research paper thumbnail of TDCGAN: Temporal Dilated Convolutional Generative Adversarial Network for End-to-end Speech Enhancement

arXiv: Audio and Speech Processing, 2020

In this paper, to further address the performance degradation caused by ignoring phase information in conventional speech enhancement systems, we propose a temporal dilated convolutional generative adversarial network (TDCGAN) within an end-to-end speech enhancement architecture. For the first time, we introduce a temporal dilated convolutional network with depthwise separable convolutions into the GAN structure, so that the receptive field can be greatly increased without increasing the number of parameters. We also first explore the effect of a signal-to-noise ratio (SNR) penalty term, used to regularize the generator's loss function, on improving the SNR of the enhanced speech. The experimental results demonstrate that our proposed method outperforms the state-of-the-art end-to-end GAN-based speech enhancement. Moreover, compared with previous GAN-based methods, the proposed TDCGAN greatly decreases the number of parameters. As expected, the work also dem...
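
The building block the abstract names, a dilated depthwise separable 1-D convolution, can be sketched in PyTorch as below; channel counts, dilation schedule, and activation are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DilatedDepthwiseSeparable(nn.Module):
    """Dilated depthwise conv + pointwise conv: a large receptive field
    with far fewer parameters than a full dilated convolution."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2      # keep length unchanged
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation,
                                   groups=channels)  # one filter per channel
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x):                            # x: (batch, channels, time)
        return self.act(self.pointwise(self.depthwise(x)))

# Stacking blocks with doubling dilations grows the receptive field
# exponentially while the parameter count grows only linearly.
net = nn.Sequential(*[DilatedDepthwiseSeparable(64, dilation=2 ** i)
                      for i in range(5)])
y = net(torch.randn(1, 64, 16000))                   # waveform-rate features
print(y.shape)                                       # torch.Size([1, 64, 16000])
```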

Research paper thumbnail of The RoyalFlush Synthesis System for Blizzard Challenge 2020

Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020

The paper presents the RoyalFlush synthesis system for the Blizzard Challenge 2020. Two required voices were built from the released Mandarin and Shanghainese data. Based on end-to-end speech synthesis technology, several improvements were introduced over our system of last year. First, a Mandarin front end that transforms input text into a phoneme sequence with prosody labels is employed. Then, to improve speech stability, a modified Tacotron acoustic model is proposed. Moreover, we apply a GMM-based attention mechanism for robust long-form speech synthesis. Finally, a lightweight LPCNet-based neural vocoder is adopted to achieve a good tradeoff between effectiveness and efficiency. Among all participating teams in the Challenge, the identifier for our system is N. Evaluation results demonstrate that our system performs relatively well in intelligibility, but it still needs improvement in naturalness and similarity.
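
The GMM-based attention referred to here computes, at each decoder step, alignment weights from a mixture of Gaussians whose means move monotonically along the encoder axis, which is what makes it robust on long-form inputs. A minimal sketch of the weight computation; parameterizations vary across GMM-attention variants, and this follows the common "shift the means by a positive increment" form with toy values.

```python
import torch

def gmm_attention_weights(prev_means, delta, sigma, mix_logits, enc_len):
    """Alignment weights over encoder positions for one decoder step.

    prev_means: (batch, K) mixture means from the previous step.
    delta:      (batch, K) positive mean increments (keeps attention monotonic).
    sigma:      (batch, K) positive standard deviations.
    mix_logits: (batch, K) unnormalized mixture weights.
    """
    means = prev_means + delta                 # means only move forward
    pos = torch.arange(enc_len).float()        # encoder positions (enc_len,)
    w = torch.softmax(mix_logits, dim=-1)      # (batch, K)
    # Gaussian bump per mixture component, combined with mixture weights.
    g = torch.exp(-0.5 * ((pos[None, None, :] - means[..., None])
                          / sigma[..., None]) ** 2)
    align = (w[..., None] * g).sum(dim=1)      # (batch, enc_len)
    return align / (align.sum(dim=-1, keepdim=True) + 1e-8), means

B, K, T = 2, 3, 40
align, means = gmm_attention_weights(
    torch.zeros(B, K), delta=torch.rand(B, K),
    sigma=torch.ones(B, K) * 2.0, mix_logits=torch.zeros(B, K), enc_len=T)
print(align.shape)  # torch.Size([2, 40])
```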

Research paper thumbnail of Exploration of Feature Extraction Methods and Dimension for sEMG Signal Classification

Applied Sciences, 2019

To realize gesture control of an automatic pruning machine, both gesture recognition and wireless remote control must be completed. For the gesture recognition part, this paper investigates recognition technology based on surface electromyography (sEMG) signals and discusses the influence of different numbers and combinations of gestures on the optimal feature dimension. We calculated 630-dimensional feature vectors from a benchmark scientific database of sEMG signals and extracted features using principal component analysis (PCA). Discriminant analysis (DA) was used to compare the processing effects of each feature extraction method. The experimental results show that the recognition rate for four gestures can reach 100.0%, the recognition rate for six gestures can reach 98.29%, and the optimal size is 516~523 dimensions. This study lays a foundation for follow-up work on pruning machine gesture control, and p...
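
The PCA-plus-discriminant-analysis pipeline can be sketched with scikit-learn; the random data below merely stands in for the 630-dimensional sEMG feature vectors, and the PCA dimension is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Toy stand-in: 400 samples of 630-dim features for 4 gesture classes.
X = rng.standard_normal((400, 630))
y = rng.integers(0, 4, 400)
X += y[:, None] * 0.5                  # make the classes weakly separable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
# PCA reduces the 630-dim features; LDA then classifies the gestures.
clf = make_pipeline(PCA(n_components=50), LinearDiscriminantAnalysis())
clf.fit(X_tr, y_tr)
print(f"recognition rate: {clf.score(X_te, y_te):.3f}")
```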

Research paper thumbnail of Chinese Character-based Segmentation & POS-tagging and Named Entity Identification with a CRF Chunker

In this paper, we propose a character-based conditional random field (CRF) chunker to identify Chinese named entities in text files. Its input comes from a character-based tagger in which segmentation and part-of-speech (POS) tagging are conducted simultaneously. The character-based tagger is trained using a corpus in which each character is tagged with both its position in the word (POC) and the POS tag of the word. The chunker is trained on an IOB2-tagged corpus in which each character is labelled with POC, POS, and chunk tags (one of B, I, O). Four kinds of named entities, including personal names, location names, organization names, and other proper nouns, are the identification targets. In experiments using the People's Daily corpus, we found that the CRF chunker obtains better results than the maximum entropy model and the support vector machine model when similar features are used. We also confirmed that the bigram features for the CRF chunker is...
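
A character-based CRF chunker of this kind can be sketched with the sklearn-crfsuite package (assumed installed); the feature template and one-sentence training set below are simplified stand-ins for the paper's POC/POS/bigram features and the People's Daily corpus.

```python
import sklearn_crfsuite

def char_features(sent, i):
    """Features for character i: the character, its POS, and neighbors."""
    ch, pos = sent[i]
    feats = {"char": ch, "pos": pos}
    if i > 0:
        feats["prev_char"] = sent[i - 1][0]   # bigram-style left context
        feats["prev_pos"] = sent[i - 1][1]
    if i < len(sent) - 1:
        feats["next_char"] = sent[i + 1][0]
    return feats

# Toy training data: (char, POS) sequences with IOB2 chunk labels.
train = [
    ([("张", "NR"), ("三", "NR"), ("去", "VV"), ("北", "NR"), ("京", "NR")],
     ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]),
]
X = [[char_features(s, i) for i in range(len(s))] for s, _ in train]
y = [labels for _, labels in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])   # IOB2 chunk tags per character
```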
