Shefali Waldekar - Academia.edu
Papers by Shefali Waldekar
International Journal of Speech Technology
With the rise in multimedia content over the years, more variety is observed in the recording environments of audio. An audio processing system might benefit when it has a module to identify the acoustic domain at its front-end. In this paper, we demonstrate the idea of acoustic domain identification (ADI) for speaker diarization. For this, we first present a detailed study of the various domains of the third DIHARD challenge, highlighting the factors that differentiate them from each other. Our main contribution is to develop a simple and efficient solution for ADI. In the present work, we explore speaker embeddings for this task. Next, we integrate the ADI module with the speaker diarization framework of the DIHARD III challenge. The performance substantially improved over that of the baseline when the thresholds for agglomerative hierarchical clustering were optimized according to the respective domains. We achieved a relative improvement of more than 5% and 8% in DER for the core and full conditions, respectively, on Track 1 of the DIHARD III evaluation set.
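The domain-dependent clustering step lends itself to a compact illustration. Below is a minimal sketch, assuming scikit-learn, of per-domain threshold lookup for agglomerative hierarchical clustering (AHC) over segment embeddings; the domain names and threshold values are illustrative placeholders, not the tuned settings from the paper.

```python
# A minimal sketch (not the authors' code) of domain-dependent AHC for
# diarization: segment embeddings are clustered with a stopping threshold
# looked up per acoustic domain. Thresholds below are placeholders.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical per-domain AHC distance thresholds (placeholders).
DOMAIN_THRESHOLDS = {
    "broadcast_interview": 0.6,
    "meeting": 0.8,
    "restaurant": 1.0,
}
DEFAULT_THRESHOLD = 0.7  # fallback when the domain is unknown

def diarize(segment_embeddings: np.ndarray, domain: str) -> np.ndarray:
    """Cluster per-segment speaker embeddings into speaker labels,
    using an AHC threshold chosen for the identified domain."""
    threshold = DOMAIN_THRESHOLDS.get(domain, DEFAULT_THRESHOLD)
    ahc = AgglomerativeClustering(
        n_clusters=None,               # let the threshold decide cluster count
        distance_threshold=threshold,
        metric="cosine",
        linkage="average",
    )
    return ahc.fit_predict(segment_embeddings)

# Example: 10 random 128-dim "embeddings" from a meeting recording.
labels = diarize(np.random.randn(10, 128), "meeting")
print(labels)
```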
arXiv, 2021
This report describes the speaker diarization system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. Our primary contribution is to develop an acoustic domain identification (ADI) system for speaker diarization. We investigate a speaker-embedding-based ADI system. We apply a domain-dependent threshold for agglomerative hierarchical clustering. Besides, we optimize the parameters for PCA-based dimensionality reduction in a domain-dependent way. Our method of integrating domain-based processing schemes into the baseline system of the challenge achieved a relative improvement of 9.63% and 10.64% in DER for the core and full conditions, respectively, for Track 1 of the DIHARD III evaluation set.

I. NOTABLE HIGHLIGHTS

We participated in Track 1 of the third DIHARD challenge [1]. Our main focus was to apply domain-dependent processing, which was found promising in preliminary studies with the second DIHARD dataset [2], [3]. We propose a simple modification...
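The domain-dependent PCA step mentioned above can be sketched in a few lines. The following is a minimal illustration, assuming scikit-learn; the per-domain component counts are placeholders, not the values tuned in the report.

```python
# A minimal sketch of domain-dependent PCA before scoring: the number of
# retained components is looked up per acoustic domain. Component counts
# are illustrative assumptions, not the report's tuned values.
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical per-domain PCA dimensionalities (placeholders).
DOMAIN_PCA_DIMS = {"audiobook": 30, "meeting": 50, "restaurant": 70}

def reduce_for_scoring(embeddings: np.ndarray, domain: str) -> np.ndarray:
    """Project embeddings to a domain-specific dimensionality
    before similarity scoring."""
    n_components = DOMAIN_PCA_DIMS.get(domain, 50)  # fallback dimension
    pca = PCA(n_components=n_components, whiten=True)
    return pca.fit_transform(embeddings)

reduced = reduce_for_scoring(np.random.randn(200, 256), "meeting")
print(reduced.shape)  # (200, 50)
```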
Applied Acoustics, 2020
Growing demands from applications like surveillance, archiving, and context-aware devices have fuelled research towards efficient extraction of useful information from environmental sounds. Assigning a textual label to an audio segment based on the general characteristics of locations or situations is dealt with in acoustic scene classification (ASC). Because of the different nature of audio scenes, a single feature-classifier pair may not efficiently discriminate among environments. Also, the acoustic scenes might vary with the problem under investigation. However, for most ASC applications, rather than giving explicit scene labels (like home, park, etc.), a general estimate of the type of surroundings (e.g., indoor or outdoor) might be enough. In this paper, we propose a two-level hierarchical framework for ASC wherein finer labels follow coarse classification. At the first level, texture features extracted from the time-frequency representation of the audio samples are used to generate the coarse labels. The system then explores combinations of six well-known spectral features, successfully used in different audio processing fields, for second-level classification to give finer details of the audio scene. The performance of the proposed system is compared with baseline methods using the Detection and Classification of Acoustic Scenes and Events (DCASE 2016 and 2017) ASC databases, and found to be superior in terms of classification accuracy. Additionally, the proposed hierarchical method provides important intermediate results as coarse labels that may be useful in certain applications.
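The coarse-to-fine routing described here can be made concrete with a short sketch, assuming scikit-learn. Feature extraction is abstracted away; the indoor/outdoor split and plain SVMs are stand-ins for the paper's texture and spectral feature stages.

```python
# A minimal sketch of two-level hierarchical ASC: a coarse indoor/outdoor
# classifier routes each sample to a scene classifier trained only on that
# coarse group. X matrices are precomputed features; SVC() stands in for
# the paper's tuned feature-classifier pairs.
import numpy as np
from sklearn.svm import SVC

class HierarchicalASC:
    """Two-level ASC: coarse indoor/outdoor gate, then per-group scene SVMs."""
    def __init__(self):
        self.coarse = SVC()                      # indoor vs. outdoor
        self.fine = {"indoor": SVC(), "outdoor": SVC()}

    def fit(self, X, coarse_labels, scene_labels):
        coarse_labels = np.asarray(coarse_labels)
        scene_labels = np.asarray(scene_labels)
        self.coarse.fit(X, coarse_labels)
        for group, clf in self.fine.items():
            mask = coarse_labels == group
            clf.fit(X[mask], scene_labels[mask])
        return self

    def predict(self, X):
        groups = self.coarse.predict(X)          # intermediate coarse labels
        scenes = np.empty(len(X), dtype=object)
        for group, clf in self.fine.items():
            mask = groups == group
            if mask.any():
                scenes[mask] = clf.predict(X[mask])
        return groups, scenes   # coarse labels are a useful by-product
```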
Multimedia Tools and Applications, 2020
Analysis of audio from real-life environments and their categorization into different acoustic scenes can make context-aware devices and applications more efficient. Unlike speech, such signals have overlapping frequency content while spanning a much larger audible frequency range. Also, they are less structured than speech/music signals. Wavelet transform has good time-frequency localization ability owing to its variable-length basis functions. Consequently, it facilitates the extraction of more characteristic information from environmental audio. This paper attempts to classify acoustic scenes by a novel use of wavelet-based mel-scaled features. The design of the proposed framework is based on experiments conducted on two datasets that have the same scene classes but differ with regard to sample length and amount of data (in hours). It outperformed two benchmark systems, one based on mel-frequency cepstral coefficients and Gaussian mixture models and the other based on log mel-band energies and a multi-layer perceptron. We also present an investigation on the use of different train and test sample durations for acoustic scene classification.

Keywords: DCASE • Environmental sounds • Haar function • MFCC • SVM

1 Introduction

Machine hearing has emerged as a rapidly growing field in audio signal processing [19]. Research domains such as computational auditory scene analysis [4], soundscape cognition [9], acoustic event detection (AED) [18, 20, 22], and acoustic scene classification (ASC) [2] come under its fold. ASC involves analysis of an audio signal based on the acoustic characteristics of the recording location and labeling it textually. The advances in this area can be attributed to the importance associated with the knowledge of environmental sounds...
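One plausible reading of wavelet-derived sub-band features can be sketched as below, assuming PyWavelets and NumPy. This computes log energies of dyadic Haar sub-bands; it is an illustrative stand-in, not the paper's exact mel-scaled recipe.

```python
# A minimal sketch of wavelet-derived sub-band energy features: a Haar DWT
# decomposes the signal and each sub-band's log mean power is kept. This is
# an assumption-laden stand-in for the paper's wavelet mel-scaled features.
import numpy as np
import pywt

def haar_subband_logenergies(x: np.ndarray, levels: int = 8) -> np.ndarray:
    """Decompose a mono signal with a Haar DWT and return the log energy
    of each sub-band (coarsest approximation + each detail band)."""
    coeffs = pywt.wavedec(x, "haar", level=levels)
    energies = [np.sum(c ** 2) / len(c) for c in coeffs]  # mean power per band
    return np.log(np.asarray(energies) + 1e-10)           # floor avoids log(0)

# Example on one second of noise at 16 kHz; frame-wise use would stack these.
feat = haar_subband_logenergies(np.random.randn(16000))
print(feat.shape)  # (levels + 1,) = (9,)
```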
Interspeech 2018, 2018
Acoustic scene classification (ASC) is an audio signal processing task where mel-scaled spectral features are widely used by researchers. These features, considered a de facto baseline in speech processing, traditionally employ Fourier-based transforms. Unlike speech, environmental audio spans a larger range of audible frequency and might contain short high-frequency transients and continuous low-frequency background noise simultaneously. Wavelets, with a better time-frequency localization capacity, can be considered more suitable for dealing with such signals. This paper attempts ASC by a novel use of wavelet-transform-based mel-scaled features. The proposed features are shown to possess better discriminative properties than other spectral features while using a similar classification framework. The experiments are performed on two datasets, similar in scene classes but differing by dataset size and length of the audio samples. When compared with two benchmark systems, one based on mel-frequency cepstral coefficients and Gaussian mixture models, and the other based on log mel-band energies and a multi-layer perceptron, the proposed system performed considerably better on the test data.
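The variable time-frequency resolution the abstract appeals to is easy to see numerically. The snippet below, assuming PyWavelets, simply counts coefficients per band; it is a didactic illustration, not part of the paper's pipeline.

```python
# An illustrative look at wavelet time-frequency localization: higher-
# frequency detail bands keep more coefficients (finer time resolution),
# lower-frequency bands keep fewer.
import numpy as np
import pywt

x = np.random.randn(16000)                 # one second at 16 kHz
coeffs = pywt.wavedec(x, "haar", level=5)  # [cA5, cD5, cD4, cD3, cD2, cD1]
for name, c in zip(["cA5", "cD5", "cD4", "cD3", "cD2", "cD1"], coeffs):
    print(f"{name}: {len(c)} coefficients")
# cD1 (highest band) has ~8000 coefficients, cD5 (lowest) only ~500:
# short high-frequency transients are localized finely in time, while
# slowly varying low-frequency background is summarized coarsely.
```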
Cornell University - arXiv, Jan 24, 2021
This report presents the system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. Our main contribution in this work is to develop a simple and efficient solution for acoustic domain dependent speech diarization. We explore speaker embeddings for the acoustic domain identification (ADI) task. Our study reveals that the i-vector-based method achieves considerably better performance than the x-vector-based approach on the third DIHARD challenge dataset. Next, we integrate the ADI module with the diarization framework. The performance substantially improved over that of the baseline when we optimized the thresholds for agglomerative hierarchical clustering and the parameters for dimensionality reduction during scoring for individual acoustic domains. We achieved a relative improvement of 9.63% and 10.64% in DER for the core and full conditions, respectively, for Track 1 of the DIHARD III evaluation set.
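A bare-bones version of embedding-based ADI is nearest-domain-centroid scoring under cosine similarity, sketched below with NumPy. The report compares i-vectors and x-vectors; here the extractor is abstracted to "one vector per recording", and the centroid scoring rule is an assumption for illustration.

```python
# A minimal sketch of embedding-based acoustic domain identification (ADI):
# each domain is represented by its mean length-normalized embedding, and a
# test recording is assigned to the most cosine-similar domain centroid.
import numpy as np

def l2norm(v: np.ndarray) -> np.ndarray:
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-10)

def train_centroids(embeddings: np.ndarray, domains: list) -> dict:
    """Mean length-normalized embedding per domain label."""
    domains = np.asarray(domains)
    return {d: l2norm(embeddings[domains == d].mean(axis=0))
            for d in set(domains)}

def identify_domain(embedding: np.ndarray, centroids: dict) -> str:
    e = l2norm(embedding)
    return max(centroids, key=lambda d: float(e @ centroids[d]))

# Toy usage with random 256-dim "i-vectors".
cents = train_centroids(np.random.randn(30, 256), ["meeting"] * 15 + ["cts"] * 15)
print(identify_domain(np.random.randn(256), cents))
```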
Digital Signal Processing
2019 IEEE Region 10 Symposium (TENSYMP)
Context-aware devices and applications can benefit when audio from real-life environments is categorized into different acoustic scenes. Such categorization is referred to as acoustic scene classification (ASC). However, the scene labels are database dependent. For most ASC applications, rather than giving explicit scene labels (like home, park, etc.), a general estimate of the type of surroundings (e.g., indoor or outdoor) might be enough. ASC has generally been achieved with mel-scaled cepstral features by state-of-the-art systems. The characteristics that differentiate one scene class from another are embedded in the texture of the time-frequency representation of the audio. In this paper, we propose to capture this textural information through statistics of the local binary pattern of the mel-filterbank energies. The experiments were conducted on two datasets having the same scene classes but varying audio sample duration and unequal total amounts of data. The proposed framework outperforms two mel-scale based benchmark systems.
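The texture-feature idea can be sketched compactly, assuming librosa and scikit-image: a local binary pattern (LBP) map is computed over the log-mel spectrogram and summarized by its histogram. The parameter values (n_mels, neighbors P, radius R) are illustrative, not the paper's settings.

```python
# A minimal sketch of LBP texture statistics over mel-filterbank energies:
# the log-mel spectrogram is treated as an image, its uniform-LBP code map
# is computed, and the code histogram serves as the texture signature.
import numpy as np
import librosa
from skimage.feature import local_binary_pattern

def lbp_mel_features(y: np.ndarray, sr: int, n_mels: int = 40) -> np.ndarray:
    logmel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    P, R = 8, 1                                   # 8 neighbors, radius 1
    lbp = local_binary_pattern(logmel, P, R, method="uniform")
    # "uniform" LBP yields P + 2 distinct codes; their normalized histogram
    # summarizes the time-frequency texture of the scene.
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

# Toy usage on one second of noise; real use would feed scene recordings.
print(lbp_mel_features(np.random.randn(22050), 22050))
```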
IIT Kharagpur, Sep 1, 2020
Classifying an audio stream as either speech or music is receiving widespread attention due to its varied applications. In this paper, we propose a novel block-based mel frequency cepstral coefficient (MFCC) feature extraction method for music and speech classification. We found that the proposed features give better classification accuracy as compared to conventional MFCC features and zero crossing rate (ZCR) features. Here, we use a support vector machine (SVM) classifier with a 3-fold cross-validation scheme. Evaluation is done on the GTZAN music/speech dataset. Further, we investigate the effect of the number of blocks, the size of each block, and the number of filter banks on the classification performance.
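A sketch of the evaluation setup, assuming librosa and scikit-learn, follows. "Block-based" is read here as: split each clip into fixed-length blocks, compute MFCCs per block, and concatenate per-block means; this reading, like the toy data, is an assumption rather than the paper's exact method.

```python
# A minimal sketch of block-based MFCC features with 3-fold SVM evaluation.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def block_mfcc(y: np.ndarray, sr: int, n_blocks: int = 4,
               n_mfcc: int = 13) -> np.ndarray:
    """Split the clip into blocks, average MFCCs within each block,
    and concatenate the per-block means into one feature vector."""
    blocks = np.array_split(y, n_blocks)
    means = [librosa.feature.mfcc(y=b, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
             for b in blocks]
    return np.concatenate(means)       # n_blocks * n_mfcc dimensions

# Toy data standing in for GTZAN music/speech clips (labels 0/1 per clip).
X = np.stack([block_mfcc(np.random.randn(22050), 22050) for _ in range(20)])
labels = np.array([0, 1] * 10)
print(cross_val_score(SVC(), X, labels, cv=3).mean())  # 3-fold accuracy
```

Varying n_blocks, the block size, and n_mfcc in this harness mirrors the sensitivity study the abstract mentions.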
This report describes a submission for the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 for Task 1 (acoustic scene classification (ASC)), sub-task A (basic ASC) and sub-task B (ASC with mismatched recording devices). The system exploits the time-frequency representation of audio to obtain the scene labels. It follows a simple pattern classification framework employing wavelet transform based mel-scaled features along with a support vector machine as classifier. The proposed system outperforms the deep-learning based baseline system by almost 8% (relative) for sub-task A and 26% (relative) for sub-task B on the development datasets provided for the respective sub-tasks.
This report describes two submissions for the Acoustic Scene Classification (ASC) task of the IEEE AASP challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2017. The first system follows an approach based on a score-level fusion of some well-known spectral features of audio processing. The second system uses the first proposed system in a two-stage hierarchical classification framework. The two systems respectively show 18% and 21% better performance on the development dataset, and 10% and 6% better performance on the evaluation dataset, relative to the MLP-based baseline system of DCASE 2017.
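Score-level fusion of per-feature classifiers can be sketched as below, assuming scikit-learn: one SVM per spectral feature type, with per-class scores averaged before the argmax. Uniform weighting and probability scores are assumptions here, not necessarily the submission's fusion rule.

```python
# A minimal sketch of score-level fusion for ASC: each feature type gets
# its own SVM, and per-class probability scores are averaged across SVMs.
import numpy as np
from sklearn.svm import SVC

class ScoreFusionASC:
    def __init__(self, n_feature_types: int):
        self.clfs = [SVC(probability=True) for _ in range(n_feature_types)]

    def fit(self, feature_sets, y):
        # feature_sets: list of (n_samples, d_i) matrices, one per feature type
        for clf, X in zip(self.clfs, feature_sets):
            clf.fit(X, y)
        self.classes_ = self.clfs[0].classes_
        return self

    def predict(self, feature_sets):
        scores = np.mean([clf.predict_proba(X)
                          for clf, X in zip(self.clfs, feature_sets)], axis=0)
        return self.classes_[np.argmax(scores, axis=1)]  # fused decision
```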
This report describes a submission for the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 for Task 1 (Acoustic Scene Classification (ASC)), sub-task A (ASC with multiple devices) and sub-task B (low-complexity ASC). The systems exploit the time-frequency representation of audio to obtain the scene labels. The system for sub-task A follows a simple pattern classification framework employing wavelet transform based mel-scaled features along with a support vector machine as classifier. Texture features, namely local binary patterns extracted from the log of mel-band energies, are used in a similar classification framework for sub-task B. The proposed systems outperform the deep-learning based baseline systems on the development datasets provided for the respective sub-tasks.
Digital Signal Processing, 2018
The rapidly increasing requirements from context-aware gadgets, like smartphones and intelligent wearable devices, along with applications such as audio archiving, have given a fillip to the research in the field of Acoustic Scene Classification (ASC). The Detection and Classification of Acoustic Scenes and Events (DCASE) challenges have seen systems addressing the problem of ASC from different directions. Some of them could achieve better results than the Mel Frequency Cepstral Coefficients-Gaussian Mixture Model (MFCC-GMM) baseline system. However, a collective decision from all participating systems was found to surpass the accuracy obtained by each system. The simultaneous use of various approaches can better exploit the discriminating information in audio collected from different environments covering the audible-frequency range in varying degrees. In this work, we show that the frame-level statistics of some well-known spectral features, when fed individually to a Support Vector Machine (SVM) classifier, are able to outperform the baseline system of the DCASE challenges. Furthermore, we analyzed different methods of combining these features, and also of combining information from two channels when the data is in binaural format. The proposed approach resulted in around 17% and 9% relative improvement in accuracy with respect to the baseline system on the development and evaluation datasets, respectively, from the DCASE 2016 ASC task.
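Frame-level statistics of common spectral features reduce to a short routine, sketched below assuming librosa: each per-frame feature track is summarized by its mean and standard deviation over the clip. The specific feature list and the simple left/right averaging for binaural audio are illustrative assumptions, not the paper's exact combination method.

```python
# A minimal sketch of frame-level spectral-feature statistics for ASC:
# per-frame tracks are pooled to clip-level mean/std vectors, and binaural
# input is handled by averaging the two channels' feature vectors.
import numpy as np
import librosa

def spectral_stats(y: np.ndarray, sr: int) -> np.ndarray:
    tracks = [
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),
    ]
    # Mean and standard deviation over frames, per feature dimension.
    stats = [np.concatenate([t.mean(axis=1), t.std(axis=1)]) for t in tracks]
    return np.concatenate(stats)

def binaural_stats(stereo: np.ndarray, sr: int) -> np.ndarray:
    # stereo: (2, n_samples); average the two per-channel feature vectors.
    return np.mean([spectral_stats(ch, sr) for ch in stereo], axis=0)

print(binaural_stats(np.random.randn(2, 22050), 22050).shape)
```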