Krishna Bharat - Academia.edu
Papers by Krishna Bharat
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020
The task of automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. Conventional approaches to LID (and to speaker recognition) ignore this sequence information by extracting a long-term statistical summary of the recording, assuming independence of the feature frames. In this paper, we propose a neural network framework utilizing short-sequence information in language recognition. In particular, a new model is proposed for incorporating relevance in language recognition, in which parts of the speech data are weighted more heavily based on their relevance to the language recognition task. This relevance weighting is achieved using a bidirectional long short-term memory (BLSTM) network with attention modeling. We explore two approaches: the first aggregates segment-level i-vector/x-vector representations in the neural model, while the second models the acoustic features directly in an end-to-end neural model. Experiments are performed on the language recognition task of the NIST LRE 2017 challenge using clean, noisy, and multi-speaker speech data, as well as on the RATS language recognition corpus. On the noisy LRE tasks as well as the RATS dataset, the proposed approach yields significant improvements over conventional i-vector/x-vector based language recognition approaches and over previous models incorporating sequence information.
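The relevance-weighting architecture described in this abstract lends itself to a compact illustration. Below is a minimal sketch, assuming PyTorch, of the first approach: a sequence of segment-level i-vector/x-vector embeddings is passed through a BLSTM, an attention layer assigns a relevance score to each segment, and the attention-weighted summary feeds a language classifier. All dimensions, layer sizes, and names (`AttentiveBLSTMLID`, `embed_dim`, and so on) are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AttentiveBLSTMLID(nn.Module):
    """Sketch of relevance weighting over segment embeddings for LID."""

    def __init__(self, embed_dim=512, hidden_dim=256, num_languages=14):
        super().__init__()
        self.blstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        # One scalar relevance score per segment.
        self.attention = nn.Linear(2 * hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_languages)

    def forward(self, x):                      # x: (batch, segments, embed_dim)
        h, _ = self.blstm(x)                   # (batch, segments, 2*hidden_dim)
        scores = self.attention(h)             # (batch, segments, 1)
        alpha = torch.softmax(scores, dim=1)   # relevance weights over segments
        pooled = (alpha * h).sum(dim=1)        # weighted utterance summary
        return self.classifier(pooled)         # language logits
```

The softmax over segment scores is what lets the model emphasize the parts of the utterance where the dialect identity is most reliably present, rather than treating all frames as equally informative.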
Interspeech 2019, 2019
In this paper, a hybrid i-vector neural network framework (i-BLSTM) is proposed which models the sequence information present in a series of short-segment i-vectors for the task of spoken language recognition (LRE). A sequence of short-segment i-vectors is extracted for every speech utterance and then modeled using a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN). An attention mechanism inside the neural network weights segments of the speech utterance by their relevance, and the model learns to give higher weights to the parts of the speech data that are more helpful for the classification task. The proposed framework performs better in short-duration and noisy environments than the conventional i-vector system. Experiments are performed on clean, noisy, and multi-speaker speech data from the NIST LRE 2017 and RATS language recognition corpora. In these experiments, the proposed approach yields significant improvements (relative improvements of 7.6-13% in accuracy under noisy conditions) over the conventional i-vector based language recognition approach and over an end-to-end LSTM-RNN based approach.
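To make the segment-level pipeline concrete, here is a hedged sketch of how short-segment i-vectors might be assembled into the sequence the i-BLSTM consumes: the utterance's acoustic features are cut into short overlapping windows and one i-vector is extracted per window. The window and hop lengths are assumptions, and `extract_ivector` is a hypothetical stand-in for a trained GMM-UBM/total-variability extractor; the paper's exact segmentation parameters are not given here.

```python
import numpy as np

def segment_ivectors(features, extract_ivector, win=200, hop=100):
    """Build a sequence of short-segment i-vectors for one utterance.

    features: (frames, feat_dim) acoustic features.
    extract_ivector: callable mapping a feature chunk to an i-vector
                     (hypothetical; stands in for a trained extractor).
    Returns: (num_segments, ivec_dim) array, ready for a BLSTM.
    """
    segments = []
    for start in range(0, max(1, features.shape[0] - win + 1), hop):
        chunk = features[start:start + win]      # ~2 s at 100 frames/s
        segments.append(extract_ivector(chunk))  # one i-vector per chunk
    return np.stack(segments)
```

The resulting sequence can then be fed to a BLSTM-with-attention classifier such as the one sketched after the previous abstract.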
Odyssey 2018: The Speaker and Language Recognition Workshop, 2018
The Language Recognition Evaluation (LRE) 2017 challenge comprises an open evaluation of the language identification (LID) task on a set of 14 languages/dialects. In this paper, we describe our submission to the LRE 2017 challenge fixed condition, which consisted of developing various LID systems using i-vector based modeling. Front-end processing is performed using deep neural network (DNN) based bottleneck features for i-vector modeling with a Gaussian mixture model (GMM) universal background model (UBM) approach. Several back-end systems consisting of support vector machines (SVMs) and deep neural network (DNN) models were used for language/dialect classification. The submitted system achieved significant improvements over the evaluation baseline provided by NIST (relative improvements of more than 50%). In the latter part of the paper, we detail our post-evaluation efforts to improve the language recognition system for short-duration speech data using novel approaches to sequence modeling of segment i-vectors. These post-evaluation efforts resulted in further improvements over the submitted system (relative improvements of about 22%). An error analysis is also presented, highlighting the confusions and errors of the final system.
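As one concrete illustration of the back-end stage described above, the sketch below trains a linear SVM on utterance-level i-vectors for the 14-way language/dialect classification. It assumes scikit-learn, and that `train_ivecs` and `train_labels` come from the bottleneck-feature i-vector front end; it is a generic SVM back-end, not the submission's exact configuration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def train_svm_backend(train_ivecs, train_labels):
    """train_ivecs: (n_utterances, ivec_dim); train_labels: language ids."""
    # Standardize i-vector dimensions, then fit a linear SVM classifier.
    backend = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
    backend.fit(train_ivecs, train_labels)
    return backend

# Usage on held-out evaluation i-vectors (names assumed):
# backend = train_svm_backend(train_ivecs, train_labels)
# predictions = backend.predict(eval_ivecs)
```

In practice such an SVM back-end would be one of several classifiers (alongside the DNN back-ends the abstract mentions) whose scores are fused for the final decision.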