Vladimir Bataev | Moscow State University

Speech Recognition by Vladimir Bataev

Research paper thumbnail of The STC ASR System for the VOiCES from a Distance Challenge 2019

Interspeech 2019

This paper is a description of the Speech Technology Center (STC) automatic speech recognition (ASR) system for the "VOiCES from a Distance Challenge 2019". We participated in the Fixed condition of the ASR task, which means that the only training data available was an 80-hour subset of the LibriSpeech corpus. The main difficulty of the challenge is the mismatch between clean training data and distant noisy development/evaluation data. To tackle this, we applied room acoustics simulation and weighted prediction error (WPE) dereverberation. We also utilized well-known speaker adaptation using x-vector speaker embeddings, as well as novel room acoustics adaptation with R-vector room impulse response (RIR) embeddings. The system used a lattice-level combination of 6 acoustic models based on different pronunciation dictionaries and input features. N-best hypotheses were rescored with 3 neural network language models (NNLMs) trained on both words and sub-word units. NNLMs were also explored for handling out-of-vocabulary (OOV) words by means of artificial text generation. The final system achieved a Word Error Rate (WER) of 14.7% on the evaluation data, the best result in the challenge.
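The N-best rescoring step described above can be illustrated with a minimal sketch. The scores, hypotheses, and interpolation weight below are hypothetical (the actual system combined three NNLMs), but the mechanism — interpolating first-pass ASR scores with neural LM scores and re-ranking — is the standard one:

```python
def rescore_nbest(hypotheses, lm_weight=0.5):
    """Pick the best hypothesis after interpolating the first-pass ASR
    score with a neural LM score (both in log-space, higher is better)."""
    best_text, best_score = None, float("-inf")
    for text, asr_score, nnlm_score in hypotheses:
        combined = (1.0 - lm_weight) * asr_score + lm_weight * nnlm_score
        if combined > best_score:
            best_text, best_score = text, combined
    return best_text

# Hypothetical 3-best list: (text, ASR log-score, NNLM log-score).
# The LM demotes the acoustically best but ungrammatical second entry.
nbest = [("the cat sat", -10.0, -12.0),
         ("the cats at", -9.5, -20.0),
         ("the cat sad", -11.0, -15.0)]
rescore_nbest(nbest)  # -> "the cat sat"
```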

Research paper thumbnail of The STC System for the CHiME 2018 Challenge

CHiME 2018 Workshop on Speech Processing in Everyday Environments

This paper describes the Speech Technology Center (STC) system for the 5th CHiME challenge. This challenge considers the problem of distant multi-microphone conversational speech recognition in everyday home environments. Our efforts were focused on the single-array track; however, we participated in the multiple-array track as well. The system falls into ranking A of the challenge: acoustic models retain frame-level tied phonetic targets, and the lexicon and language model are unchanged compared to the conventional ASR baseline. Our system employs a combination of 4 acoustic models based on convolutional and recurrent neural networks. Speaker adaptation with target speaker masks and a multi-channel speaker-aware acoustic model with neural network beamforming are the two major features of the system. Moreover, various techniques for improving acoustic models are applied, including array synchronization, data cleanup, alignment transfer, mixup, speed perturbation data augmentation, room simulation, and backstitch training. Our system scored 3rd in the single-array track with a Word Error Rate (WER) of 55.5% and 4th in the multiple-array track with a WER of 55.6% on the evaluation data, achieving a substantial improvement over the baseline system.

Research paper thumbnail of R-Vectors: New Technique for Adaptation to Room Acoustics

Papers by Vladimir Bataev

Research paper thumbnail of NVIDIA NeMo Offline Speech Translation Systems for IWSLT 2023

Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper provides an overview of NVIDIA NeMo's speech translation systems for the IWSLT 2023 Offline Speech Translation Task. This year, we focused on an end-to-end system which capitalizes on pre-trained models and synthetic data to mitigate the problem of direct speech translation data scarcity. When trained on IWSLT 2022 constrained data, our best En→De end-to-end model achieves an average score of 31 BLEU on 7 test sets from IWSLT 2010-2020, which improves over our last year's cascade (28.4) and end-to-end (25.7) submissions. When trained on IWSLT 2023 constrained data, the average score drops to 29.5 BLEU.

Research paper thumbnail of Powerful and Extensible WFST Framework for RNN-Transducer Losses

arXiv (Cornell University), Mar 18, 2023

Research paper thumbnail of Digital Peter: Dataset, Competition and Handwriting Recognition Methods

arXiv (Cornell University), Mar 16, 2021

This paper presents a new dataset of Peter the Great's manuscripts and describes a segmentation procedure that converts initial images of documents into lines. The new dataset may serve as a benchmark for researchers training and comparing handwriting text recognition models. It consists of 9694 images and text files corresponding to different lines in historical documents. The open machine learning competition "Digital Peter" was held based on this dataset. The baseline solution for the competition and advanced methods for handwritten text recognition are described in the article. The full dataset and all code are publicly available.

Research paper thumbnail of Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems

arXiv, 2020

The problem of out-of-vocabulary (OOV) words is typical for any speech recognition system: hybrid systems are usually constructed to recognize a fixed set of words and can rarely include all the words that will be encountered during operation of the system. One popular approach to covering OOVs is to use subword units rather than words. Such a system can potentially recognize any previously unseen word if the word can be constructed from the available subword units, but non-existent words may also be recognized. The other popular approach is to modify the HMM part of the system so that it can be easily and effectively expanded with a custom set of words we want to add. In this paper we explore different existing methods of this kind at both the graph construction and search method levels. We also present novel vocabulary expansion techniques which solve some common internal subroutine problems regarding recognition graph processing.
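The subword idea above can be made concrete with a toy sketch. The unit inventory and the greedy longest-match strategy here are hypothetical illustrations, not the segmentation used in the paper, but they show how a previously unseen word is covered by known units while a word outside the inventory's coverage fails:

```python
def segment(word, units):
    """Greedy longest-match segmentation of a word into subword units.
    Returns the list of pieces, or None if the word cannot be covered."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in units:
                pieces.append(word[i:j])
                i = j
                break
        else:  # no unit matches at position i
            return None
    return pieces

# Hypothetical subword inventory; "unseen" is an OOV word for a
# word-level system but decomposes into in-inventory units.
units = {"speech", "re", "cog", "nition", "un", "seen"}
segment("unseen", units)       # -> ['un', 'seen']
segment("recognition", units)  # -> ['re', 'cog', 'nition']
```

Greedy longest-match can fail on words where only a different split works, which is one reason real systems use learned segmentations (e.g. BPE) instead.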

Research paper thumbnail of Exploring End-to-End Techniques for Low-Resource Speech Recognition

Speech and Computer, 2018

In this work we present a simple grapheme-based system for low-resource speech recognition using Babel data for Turkish spontaneous speech (80 hours). We have investigated the performance of different neural network architectures, including fully-convolutional, recurrent, and ResNet with GRU. Different features and normalization techniques are compared as well. We also propose a CTC-loss modification that uses segmentation during training, which leads to improvement when decoding with a small beam size. Our best model achieved a word error rate of 45.8%, which is, to our knowledge, the best reported result for end-to-end systems using in-domain data for this task.
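For context on the CTC decoding mentioned above: the simplest (beam size 1) decoder takes the best label per frame and collapses the sequence by merging repeats and dropping blanks. A minimal sketch of that collapse rule (not the paper's modified loss):

```python
def ctc_collapse(frame_labels, blank=0):
    """Collapse a frame-level CTC label sequence: merge consecutive
    repeats, then remove blank symbols."""
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Blank (0) separates the two 1s, so both survive the collapse;
# the repeated 2s merge into one.
ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0])  # -> [1, 1, 2]
```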

Research paper thumbnail of Powerful and Extensible WFST Framework for RNN-Transducer Losses

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

This paper presents a framework based on Weighted Finite-State Transducers (WFST) to simplify the development of modifications for the RNN-Transducer (RNN-T) loss. Existing implementations of RNN-T use CUDA-related code, which is hard to extend and debug. WFSTs are easy to construct and extend, and allow debugging through visualization. We introduce two WFST-powered RNN-T implementations: (1) "Compose-Transducer", based on a composition of WFST graphs from acoustic and textual schemas, which is computationally competitive and easy to modify; (2) "Grid-Transducer", which constructs the lattice directly for further computations and is the most compact and computationally efficient. We illustrate the ease of extensibility through the introduction of a new W-Transducer loss, an adaptation of Connectionist Temporal Classification with wild cards. W-Transducer (W-RNNT) consistently outperforms the standard RNN-T in a weakly-supervised data setup with missing parts of transcriptions at the beginning and end of utterances. All RNN-T losses are implemented with the k2 framework and are available in the NeMo toolkit.
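The lattice that "Grid-Transducer" builds is the T×(U+1) alignment grid of the standard RNN-T forward algorithm. A pure-Python sketch of that dynamic program (an illustration of the underlying recursion, not the k2/NeMo implementation) is:

```python
import math

def logadd(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def rnnt_forward(log_probs, labels, blank=0):
    """RNN-T forward algorithm over the T x (U+1) alignment grid.
    log_probs[t][u][k] is the log-probability of emitting symbol k after
    consuming t frames and outputting u labels. Returns the negative
    log-likelihood of `labels`."""
    T, U = len(log_probs), len(labels)
    NEG_INF = float("-inf")
    alpha = [[NEG_INF] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:  # blank transition: advance one frame
                alpha[t][u] = logadd(
                    alpha[t][u], alpha[t - 1][u] + log_probs[t - 1][u][blank])
            if u > 0:  # label transition: emit labels[u-1]
                alpha[t][u] = logadd(
                    alpha[t][u],
                    alpha[t][u - 1] + log_probs[t][u - 1][labels[u - 1]])
    # Terminate with the final blank from the top-right grid node.
    return -(alpha[T - 1][U] + log_probs[T - 1][U][blank])

# Toy check: T=2 frames, U=1 label, 2-symbol vocabulary, all emission
# probabilities 0.5. There are 2 alignment paths of 3 emissions each,
# so the total likelihood is 2 * 0.5**3 = 0.25.
lp = math.log(0.5)
uniform = [[[lp, lp] for _ in range(2)] for _ in range(2)]
rnnt_forward(uniform, [1])  # -> -log(0.25) ≈ 1.386
```

Swapping the transition structure of this grid (e.g. adding the wildcard skips of W-Transducer) is exactly the kind of modification the WFST formulation makes easy to express.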
