Biswajit Das | TCS Innovation Labs
Papers by Biswajit Das
Pattern Recognition Letters, 2013
The article describes the development of a Bengali speech recognition system for the aging population using various adaptation techniques. Variability in acoustic characteristics among speakers degrades speech recognition accuracy. In general, perceptual as well as acoustic variations exist among speakers, but they are more pronounced between young and aged populations. Deviations in voice source features between the two age groups affect speech recognition performance. Existing automatic speech recognition algorithms demand a large amount of training data covering all of this variability in order to build a robust system. Speaker normalization and adaptation techniques, however, attempt to reduce inter-speaker or intra-speaker acoustic variability without requiring a large amount of training data. In the current study, conventional acoustic model adaptation methods, e.g. vocal tract length normalization, maximum likelihood linear regression and/or maximum a posteriori adaptation, are combined to improve recognition accuracy. Moreover, the maximum mutual information estimation technique has been implemented.
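As a rough illustration of one of the adaptation methods named above, maximum a posteriori (MAP) adaptation of a single Gaussian mean can be sketched as follows. This is the standard textbook mean update, not code from the article; the function name `map_adapt_mean`, the prior weight `tau` and the toy values are assumptions chosen for illustration.

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP update of one Gaussian mean vector.

    prior_mean : mean of the speaker-independent model (list of floats)
    frames     : adaptation feature vectors (list of lists of floats)
    posteriors : occupation probability of this Gaussian for each frame
    tau        : prior weight (hypothetical default); larger values keep
                 the result closer to the prior mean
    """
    dim = len(prior_mean)
    occupancy = sum(posteriors)
    weighted_sum = [0.0] * dim
    for frame, gamma in zip(frames, posteriors):
        for d in range(dim):
            weighted_sum[d] += gamma * frame[d]
    # Interpolate between the prior mean and the adaptation data.
    return [
        (tau * prior_mean[d] + weighted_sum[d]) / (tau + occupancy)
        for d in range(dim)
    ]


# Toy example: two adaptation frames fully assigned to this Gaussian.
adapted = map_adapt_mean([0.0, 0.0], [[2.0, 4.0], [2.0, 4.0]], [1.0, 1.0], tau=2.0)
```

With little adaptation data the estimate stays near the prior mean, and it moves toward the data average as the occupancy grows, which is why this kind of update suits small adaptation sets.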
ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 1, 2019
Parkinsonism refers to Parkinson's Disease (PD) and Atypical Parkinsonian Syndromes (APS), such as Progressive Supranuclear Palsy (PSP) and Multiple System Atrophy (MSA). Discrimination between PD and APS, and within APS groups, in early disease stages is a very challenging task. Interestingly, speech disorder is frequently an early and prominent clinical feature of both PD and APS. This renders speech/voice analysis a promising tool for the development of an objective marker to assist neurologists in their diagnosis. This paper is a continuation of recent work on speech-based differential diagnosis within APS. We address the difficult problem of defining disease-specific speech features, which is crucial from the perspective of early differential diagnosis. We investigate this problem under the constraint that only a small amount of training data is available in this setting. To do so, we perform univariate statistical analysis followed by a supervised learning step that forces the newly designed features to be 1-dimensional. We carry out experiments using speech recordings of MSA and PSP patients. We show that linear classification models allow the definition of new scalar variables that can be considered as speech features specific to each disease, MSA and PSP.
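The idea of a linear model that collapses several speech measurements into one scalar variable can be sketched as below. This is a minimal illustration assuming a two-class linear discriminant with a diagonal pooled covariance; it is not the paper's actual model, and the names `fit_scalar_feature` and `project` as well as the toy data are hypothetical.

```python
from statistics import mean


def fit_scalar_feature(class_a, class_b):
    """Fit a linear projection that maps feature vectors to one scalar.

    Assumption: diagonal pooled covariance (each input feature is
    treated independently), which keeps the number of estimated
    parameters low when training data is scarce.
    """
    dim = len(class_a[0])
    weights, bias = [], 0.0
    for d in range(dim):
        col_a = [x[d] for x in class_a]
        col_b = [x[d] for x in class_b]
        mean_a, mean_b = mean(col_a), mean(col_b)
        squared_dev = (sum((v - mean_a) ** 2 for v in col_a)
                       + sum((v - mean_b) ** 2 for v in col_b))
        pooled_var = squared_dev / (len(col_a) + len(col_b) - 2)
        if pooled_var == 0:
            raise ValueError("feature %d has zero pooled variance" % d)
        w = (mean_a - mean_b) / pooled_var
        weights.append(w)
        # Centre the score on the midpoint between the class means.
        bias -= w * (mean_a + mean_b) / 2
    return weights, bias


def project(weights, bias, x):
    """Scalar feature: positive values lean toward class A."""
    return sum(w * v for w, v in zip(weights, x)) + bias


# Toy data: the first measurement separates the groups, the second does not.
group_a = [[1.0, 5.0], [3.0, 5.5]]
group_b = [[-1.0, 5.0], [-3.0, 5.5]]
weights, bias = fit_scalar_feature(group_a, group_b)
```

The sign and magnitude of `project(weights, bias, x)` then act as a single derived feature, and the learned weights show which input measurements contribute to it.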
► We have developed an automatic speech recognition system in Bengali for the aged population.
► We have analyzed phoneme and word recognition performance for aged speakers using several acoustic models.
► We have combined speaker normalization and model adaptation techniques to improve recognition performance.
► We have identified the more strongly affected phones, which motivates us to incorporate these findings in future acoustic model creation.
The article studies age-related variations in the speech characteristics of two age groups in the Bengali language. The study considers 60 speakers in each age group, 60-80 years and 20-40 years, respectively. We have considered different voice source features such as fundamental frequency, formant frequencies, jitter, shimmer and harmonic-to-noise ratio. A cepstral domain feature, Mel Frequency Cepstral Coefficients (MFCC), of different voiced Bengali vowels is also analyzed for the younger and older adult groups. MFCC features and Hidden Markov Model parameters of different voiced vowels are used to study a phoneme dissimilarity measure between the two age groups. Age-related changes in elderly speech affect automatic speech recognition performance, as was observed in our study, raising the need for specific acoustic models for elderly persons.
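For readers unfamiliar with two of the voice source features listed above, the following sketch computes local jitter and local shimmer from already extracted pitch periods and peak amplitudes. It uses the common definition (mean absolute difference between consecutive cycles divided by the mean value); whether the article used exactly this variant is an assumption, and the function name and sample values are hypothetical.

```python
def local_perturbation(values):
    """Mean absolute difference between consecutive cycles, divided by
    the mean value. Applied to pitch periods this gives local jitter,
    applied to peak amplitudes it gives local shimmer.
    """
    if len(values) < 2:
        raise ValueError("need at least two consecutive cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


# Hypothetical cycle-level measurements for one sustained vowel.
periods_s = [0.0100, 0.0102, 0.0099, 0.0101]   # pitch periods in seconds
amplitudes = [0.80, 0.78, 0.82, 0.79]          # peak amplitude per cycle

jitter = local_perturbation(periods_s)
shimmer = local_perturbation(amplitudes)
```

Both measures are ratios, so they are usually reported as percentages; a perfectly periodic signal gives zero for both.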
A speech corpus plays a key role in the construction of automatic speech recognition (ASR), text-to-speech (TTS) synthesis and phone recognition (PR) systems. PR and ASR systems are quite similar in functionality. The difference between the two is that a PR system converts the speech signal to phone text (a phone being the smallest discrete segment of sound in uttered speech), whereas an ASR system converts the speech signal to word text. A speech corpus for a PR system usually consists of a text corpus, recordings corresponding to the text corpus, a phonetic representation of the text corpus and a pronunciation dictionary. Selecting an optimal subset of the available text with a balanced phone distribution is an important task in developing a high-quality PR system. In this paper, we describe our text selection technique and discuss the performance of the phone recognition system.
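The abstract does not spell out the selection technique, so the sketch below only illustrates the general idea of choosing text for a balanced phone distribution. It assumes a simple greedy strategy that rewards phones that are still rare in the selected set; the function name `select_sentences`, the scoring rule and the toy phone strings are all hypothetical.

```python
from collections import Counter


def select_sentences(sentence_phones, budget):
    """Greedily pick up to `budget` sentences, favouring phones that are
    still under-represented among the sentences chosen so far.

    sentence_phones : list where item i is the phone sequence of sentence i
    Returns the indices of the selected sentences in selection order.
    """
    counts = Counter()
    remaining = list(range(len(sentence_phones)))
    selected = []
    while remaining and len(selected) < budget:
        # A phone seen n times so far contributes 1 / (1 + n) to the score.
        best = max(
            remaining,
            key=lambda i: sum(1.0 / (1 + counts[p]) for p in sentence_phones[i]),
        )
        selected.append(best)
        remaining.remove(best)
        counts.update(sentence_phones[best])
    return selected


# Toy corpus: sentences 0 and 1 cover the same phones, sentence 2 adds new ones.
corpus = [["a", "b"], ["a", "b"], ["c", "d"]]
chosen = select_sentences(corpus, budget=2)
```

This scoring favours longer sentences, so a practical variant might normalise by sentence length or target a specific phone distribution instead.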
This paper presents Bengali speech corpus development for speaker-independent continuous speech recognition. A speech corpus is the backbone of an automatic speech recognition (ASR) system. Speech corpora can be classified into several classes; they may be language dependent or age dependent. We have developed speech corpora for two age groups: a younger group aged 20 to 40 years and an older group aged 60 to 80 years. We have created phone- and triphone-labeled speech corpora. Initially, speech samples are aligned using a statistical modeling technique. The statistically labeled files are then refined by manual correction. The Hidden Markov Model Toolkit (HTK) has been used for aligning the speech data. We have measured phoneme recognition and continuous word recognition performance to check the quality of the speech corpus.