Hossein Hadian | Sharif University of Technology

Papers by Hossein Hadian

Continual learning using lattice-free MMI for speech recognition

Continual learning (CL), or domain expansion, recently became a popular topic for automatic speech recognition (ASR) acoustic modeling because practical systems have to be updated frequently in order to work robustly on types of speech not observed during initial training. While sequential adaptation allows tuning a system to a new domain, it may result in performance degradation on the old domains due to catastrophic forgetting. In this work we explore regularization-based CL for neural network acoustic models trained with the lattice-free maximum mutual information (LF-MMI) criterion. We simulate domain expansion by incrementally adapting the acoustic model on different public datasets that include several accents and speaking styles. We investigate two well-known CL techniques, elastic weight consolidation (EWC) and learning without forgetting (LWF), which aim to reduce forgetting by preserving model weights or network outputs. We additionally introduce a sequence-level LWF regularization…
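
The EWC technique mentioned above penalizes movement away from the old-domain parameters in proportion to their estimated importance. A minimal numpy sketch of the penalty term, assuming the usual diagonal-Fisher formulation (the toy parameter values and the weight `lam` are illustrative, not from the paper):

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """Elastic weight consolidation penalty: a quadratic pull toward the
    old-domain parameters theta_old, weighted per-parameter by the
    (diagonal) Fisher information. High-Fisher parameters resist change."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)

# Toy illustration: parameter 0 is "important" on the old domain
# (high Fisher), parameter 1 is not, so moving it is cheap.
theta_old = np.array([1.0, -2.0])
fisher = np.array([10.0, 0.1])
theta_new = np.array([1.5, -1.0])

print(ewc_penalty(theta_new, theta_old, fisher, lam=1.0))  # 0.5*(10*0.25 + 0.1*1.0)
```

During adaptation this term is simply added to the new-domain training loss, so the trade-off between plasticity and forgetting is controlled by `lam`.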

Improving LF-MMI Using Unconstrained Supervisions for ASR

We present our work on improving the numerator graph for discriminative training using the lattice-free maximum mutual information (MMI) criterion. Specifically, we propose a scheme for creating unconstrained numerator graphs by removing time constraints from the baseline numerator graphs. This leads to much smaller graphs and therefore faster preparation of training supervisions. Testing the proposed unconstrained supervisions with factorized time-delay neural network (TDNN) models, we observe 0.5% to 2.6% relative improvement over state-of-the-art word error rates on various large-vocabulary speech recognition databases.

Speaker Verification Using I-Vector Representation

I-vectors have proved to be the most effective features for text-independent speaker verification in recent research. In this article, a new scheme is proposed to utilize i-vectors in text-prompted speaker verification in a simple yet effective manner. To examine this scheme empirically, a telephony dataset of Persian month names is introduced. Experiments show that the proposed scheme reduces the EER by 31% compared to the state-of-the-art State-GMM-MAP method. Furthermore, it is shown that using an HMM instead of a GMM for universal background modeling leads to a 15% reduction in EER.
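
The EER figure quoted above is the operating point where the false-accept and false-reject rates coincide. A minimal sketch of how it can be computed from target and impostor scores (toy scores and a simple threshold sweep, not the paper's evaluation code):

```python
def eer(target_scores, impostor_scores):
    """Equal error rate: sweep candidate thresholds and return the point
    where the false-accept rate (impostors accepted) and false-reject
    rate (targets rejected) are closest, averaging the two."""
    best = None
    for t in sorted(set(target_scores) | set(impostor_scores)):
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

targets = [0.9, 0.8, 0.7, 0.3]    # genuine-trial scores (toy values)
impostors = [0.6, 0.4, 0.2, 0.1]  # impostor-trial scores (toy values)
print(eer(targets, impostors))    # FAR = FRR = 0.25 at threshold 0.6
```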

Investigation of transfer learning for ASR using LF-MMI trained neural networks

It is common in ASR applications to have a large amount of data that is out-of-domain with respect to the test data and a smaller amount of in-domain data similar to the test data. In this paper, we investigate different ways to utilize this out-of-domain data to improve ASR models based on lattice-free MMI (LF-MMI). In particular, we experiment with multi-task training using a network with shared hidden layers, and we try various ways of adapting previously trained models to a new domain. Both types of methods are effective in reducing the WER relative to in-domain-only models, with the jointly trained models generally giving more improvement.

Semi-Supervised Training of Acoustic Models Using Lattice-Free MMI

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The lattice-free MMI objective (LF-MMI) has been used in supervised training of state-of-the-art neural network acoustic models for automatic speech recognition (ASR). With large amounts of unsupervised data available, extending this approach to the semi-supervised scenario is of significance. Finite-state transducer (FST) based supervision used with LF-MMI provides a natural way to incorporate uncertainties when dealing with unsupervised data. In this paper, we describe various extensions to standard LF-MMI training that allow lattices obtained by decoding unsupervised data to be used as supervision. The lattices are rescored with a strong LM. We investigate different methods for splitting the lattices and incorporating frame tolerances into the supervision FST. We report results on different subsets of Fisher English, where we achieve WER recovery of 59-64% using lattice supervision, which is significantly better than using just the best-path transcription.
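
"WER recovery" here measures how much of the gap between a seed model (trained only on the transcribed subset) and an oracle model (trained with all transcripts) the semi-supervised system closes. A sketch of the common definition, with illustrative numbers that are not from the paper's tables:

```python
def wer_recovery(wer_seed, wer_semisup, wer_oracle):
    """Fraction of the seed-to-oracle WER gap recovered by the
    semi-supervised system: 1.0 would mean matching the oracle,
    0.0 would mean no gain over the seed model."""
    return (wer_seed - wer_semisup) / (wer_seed - wer_oracle)

# Toy example: semi-supervised training closes 3 points of a 5-point gap.
print(wer_recovery(20.0, 17.0, 15.0))  # 0.6, i.e. 60% WER recovery
```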

Speech Activity Detection Using Deep Neural Networks

2017 Iranian Conference on Electrical Engineering (ICEE), 2017

In this paper, we introduce a new dataset for speech activity detection (SAD) and evaluate common methods such as GMM, DNN, and RNN on it. We collected our dataset in a semi-supervised manner, using subtitled movies, with a labeling accuracy of 95%. This semi-automatic method can help collect huge amounts of labeled audio data with very high diversity in language, speaker, and channel. We model SAD as a binary classification task with two classes, speech and non-speech. When using a GMM, we use two separate mixtures to model speech and non-speech. In the case of neural networks, we use a softmax layer at the end of the network, with two neurons representing speech and non-speech, and train the network using stochastic gradient descent to minimize the cross-entropy loss. The input to our models is MFCC and PLP features extracted from audio frames and concatenated to each other. We also investigate the effect of context by taking into account past and future frames. …
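
The context mentioned at the end is typically obtained by splicing each frame together with its neighbors before feeding the classifier. A minimal numpy sketch of such splicing (the window sizes and edge-padding policy are illustrative assumptions, not details from the paper):

```python
import numpy as np

def splice(frames, left=2, right=2):
    """Stack each feature frame with `left` past and `right` future
    frames, repeating edge frames at the boundaries, so the classifier
    sees a short temporal context around every frame."""
    T, D = frames.shape
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(left + right + 1)])

feats = np.arange(12, dtype=float).reshape(6, 2)  # 6 frames, 2-dim features
spliced = splice(feats, left=2, right=2)
print(spliced.shape)  # (6, 10): 5 frames of 2 dims per spliced vector
```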

Active Learning in Noisy Conditions for Spoken Language Understanding

Active learning has proved effective in many fields of natural language processing. However, in the field of spoken language understanding, which always deals with noise, no complete comparison between different active learning methods has been done. This paper compares the best-known active learning methods in noisy conditions for spoken language understanding. Additionally, a new method based on Fisher information, named Weighted Gradient Uncertainty (WGU), is proposed. Furthermore, the Strict Local Density (SLD) method is proposed, based on a new concept of local density and a new technique for utilizing information density measures. Results demonstrate that both proposed methods outperform the best of the previous methods in both noisy and noise-free conditions, with SLD slightly outperforming WGU.
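
For context, the simplest baseline these methods build on is plain uncertainty sampling: label the pool items whose predicted distribution has the highest entropy. A sketch of that baseline (WGU's Fisher-information weighting and SLD's local-density term are not reproduced here):

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a discrete label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_batch(pool_posteriors, k):
    """Uncertainty sampling: return indices of the k unlabeled items
    whose predicted label distribution is most uncertain (highest
    entropy), i.e. the items the model would learn most from."""
    ranked = sorted(range(len(pool_posteriors)),
                    key=lambda i: entropy(pool_posteriors[i]),
                    reverse=True)
    return ranked[:k]

# Toy pool of three items with binary label posteriors.
pool = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]]
print(select_batch(pool, 2))  # most uncertain first: [1, 2]
```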

Using ASR Methods for OCR

2019 International Conference on Document Analysis and Recognition (ICDAR)

An Alternative to MFCCs for ASR

Flat-Start Single-Stage Discriminatively Trained HMM-Based Models for ASR

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Online signature verification using i-vector representation

Acoustic Modeling from Frequency Domain Representations of Speech

Interspeech 2018

In recent years, different studies have proposed new methods for DNN-based feature extraction and joint acoustic model training and feature learning from the raw waveform for large vocabulary speech recognition. However, conventional pre-processed features such as MFCC and PLP are still preferred in state-of-the-art speech recognition systems, as they are perceived to be more robust. Besides, the raw-waveform methods, most of which are based on the time-domain signal, do not significantly outperform the conventional methods. In this paper, we propose a frequency-domain feature-learning layer which allows acoustic model training directly from the waveform. The main distinctions from previous works are a new normalization block and a short-range constraint on the filter weights. The proposed setup achieves consistent performance improvements compared to the baseline MFCC and log-Mel features as well as other proposed time- and frequency-domain setups on different LVCSR tasks. Finally, based on the learned filters in our feature-learning layer, we propose a new set of analytic filters using polynomial approximation, which outperform log-Mel filters significantly while being equally fast.
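
The log-Mel baseline referred to above applies triangular filters equally spaced on the mel scale to the power spectrum. A minimal sketch of how such a filterbank is constructed (this is the standard baseline, not the paper's learned or analytic filters; the sizes are toy values):

```python
import numpy as np

def mel_filterbank(n_filters=8, n_fft_bins=64, sr=8000):
    """Standard triangular mel filterbank: filter centers are equally
    spaced on the mel scale and each filter is triangular in linear
    frequency, rising from the previous center and falling to the next."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft_bins - 1) * mel_to_hz(mel_pts) / (sr / 2)).astype(int)
    fb = np.zeros((n_filters, n_fft_bins))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for b in range(l, c):          # rising edge
            fb[i - 1, b] = (b - l) / max(c - l, 1)
        for b in range(c, r):          # falling edge (peak value 1 at c)
            fb[i - 1, b] = (r - b) / max(r - c, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # (8, 64): one row of spectral weights per filter
```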

A Time-Restricted Self-Attention Layer for ASR

Self-attention, an attention mechanism where the input and output sequence lengths are the same, has recently been successfully applied to machine translation, caption generation, and phoneme recognition. In this paper we apply a restricted self-attention mechanism (with multiple heads) to speech recognition. By "restricted" we mean that the mechanism at a particular frame only sees input from a limited number of frames to the left and right. Restricting the context makes it easier to encode the position of the input: we use a 1-hot encoding of the frame offset. We try introducing attention layers into TDNN architectures, and replacing LSTM layers with attention layers in TDNN+LSTM architectures. We show experiments on a number of ASR setups and observe improvements compared to the TDNN and TDNN+LSTM baselines. Attention layers are also faster than LSTM layers at test time, since they lack recurrence.
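
The restriction can be implemented by masking attention scores outside a fixed left/right window before the softmax. A single-head numpy sketch of that idea (the window sizes are illustrative, and the 1-hot frame-offset encoding the paper appends is omitted for brevity):

```python
import numpy as np

def restricted_attention(queries, keys, values, left=3, right=3):
    """Single-head self-attention where frame t attends only to frames
    in [t-left, t+right]; all other positions get a -inf score, so
    their softmax weight is exactly zero."""
    T, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)
    mask = np.full((T, T), -np.inf)
    for t in range(T):
        lo, hi = max(0, t - left), min(T, t + right + 1)
        mask[t, lo:hi] = 0.0
    scores = scores + mask
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))          # 10 frames, 4-dim features
out, w = restricted_attention(x, x, x, left=2, right=2)
print(out.shape)  # (10, 4): same sequence length in and out
```

Because each output frame depends only on a bounded window, the layer is parallelizable over frames, which is why it is faster than a recurrent LSTM layer at test time.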

End-to-end speech recognition using lattice-free MMI

We present our work on end-to-end training of acoustic models using the lattice-free maximum mutual information (LF-MMI) objective function in the context of hidden Markov models. By end-to-end training, we mean flat-start training of a single DNN in one stage, without using any previously trained models, forced alignments, or state-tying decision trees. We use full biphones to enable context-dependent modeling without trees, and show that our end-to-end LF-MMI approach can achieve results comparable to regular LF-MMI on well-known large vocabulary tasks. We also compare with other end-to-end methods such as CTC in character-based and lexicon-free settings, and show 5 to 25 percent relative reduction in word error rates on different large vocabulary tasks while using significantly smaller models.

Persian large vocabulary name recognition system (FarsName)

There has been no isolated-word recognition database for the Persian language so far. In this paper we introduce the FarsName dataset, which contains 20,000 isolated-word Persian utterances spoken by 226 speakers from all regions of the country, each saying an average of 88 Persian names. There are a total of 5,235 unique names in the dataset. Various cell phone brands were used for recording, which adds to the diversity of the utterances. We achieve 10.34% WER on this set using Kaldi, which is a good result considering that the recording environments were uncontrolled and potentially noisy.
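
The WER figure above is the standard word-level edit distance normalized by the reference length. A self-contained sketch of that computation (not Kaldi's scoring tool, just the textbook dynamic program):

```python
def wer(ref, hyp):
    """Word error rate: minimum number of word substitutions, deletions,
    and insertions turning the reference into the hypothesis, divided by
    the number of reference words (classic Levenshtein DP)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match/substitution
    return d[-1][-1] / len(r)

print(wer("the cat sat", "the cat sat down"))  # one insertion over 3 ref words
```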

Telephony text-prompted speaker verification using i-vector representation

2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015

Real-time speaker identification using speaker model distance

2015 23rd Iranian Conference on Electrical Engineering, 2015
