R Kumara Swamy - Academia.edu
Papers by R Kumara Swamy
Speech Communication, 2015
While biometric authentication has advanced significantly in recent years, evidence shows the technology can be susceptible to malicious spoofing attacks. The research community has responded with dedicated countermeasures which aim to detect and deflect such attacks. Even if the literature shows that they can be effective, the problem is far from being solved; biometric systems remain vulnerable to spoofing. Despite a growing momentum to develop spoofing countermeasures for automatic speaker verification, now that the technology has matured sufficiently to support mass deployment in an array of diverse applications, greater effort will be needed in the future to ensure adequate protection against spoofing. This article provides a survey of past work and identifies priority research directions for the future. We summarise previous studies involving impersonation, replay, speech synthesis and voice conversion spoofing attacks and more recent efforts to develop dedicated countermeasures. The survey shows that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.
2017 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET)
This paper focuses on the acquisition and calibration of multimodal biometric data to aid the design of a biometric recognition system. The data acquired and calibrated are face images, speech, and a scanned copy of a handwritten disclaimer from each subject. Pose and illumination are the covariates considered for face images. Speech is collected at distances of around 15 cm and 75 cm from the subject. An evaluation protocol is specified based on the conditions and the chosen covariates for each trait. Calibration studies have been conducted on the calibration subset of the collected database. Results of unimodal systems are reported; they perform reasonably well on the data collected in both scenarios.
2016 2nd International Conference on Contemporary Computing and Informatics (IC3I), 2016
This paper presents an alternative representation of phase information in speech signals using the Hartley transform. The Hartley Group Delay Function (HGDF) is computed along the same lines as the Fourier group delay function. Cepstral smoothing is applied to reduce the spiky nature of the group delay functions. The smoothed HGDF (SHGDF) is reported to have better resolution in the group delay spectrum. A speaker verification system is designed as an application of the proposed signal representation, with the SHGDF presented as input to a feed-forward neural network. Performance curves using MFCCs, MODGFs and the proposed SHGDF as features for the neural network are compared. It is found that the SHGDF provides better average performance for the speaker recognition system.
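The abstract does not give the HGDF formula, so the following is only a minimal sketch of the two standard building blocks it names: the discrete Hartley transform and the Fourier group delay function. It assumes the textbook definitions H[k] = sum_n x[n] cas(2*pi*k*n/N) and tau[k] = (X_R Y_R + X_I Y_I) / |X|^2, where Y is the DFT of n*x[n]. The function names and the test signal are hypothetical, and the paper's cepstral smoothing step is not reproduced.

```python
import cmath
import math


def dht(x):
    """Discrete Hartley transform: H[k] = sum_n x[n] * cas(2*pi*k*n/N),
    where cas(t) = cos(t) + sin(t)."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        acc = 0.0
        for n, xn in enumerate(x):
            t = 2.0 * math.pi * k * n / n_pts
            acc += xn * (math.cos(t) + math.sin(t))
        out.append(acc)
    return out


def dft(x):
    """Plain O(N^2) discrete Fourier transform, used here for clarity only."""
    n_pts = len(x)
    return [
        sum(xn * cmath.exp(-2j * math.pi * k * n / n_pts) for n, xn in enumerate(x))
        for k in range(n_pts)
    ]


def fourier_group_delay(x, eps=1e-12):
    """Group delay in samples: (X_R*Y_R + X_I*Y_I) / |X|^2, with Y = DFT of n*x[n].
    eps guards against division by zero at spectral nulls."""
    big_x = dft(x)
    big_y = dft([n * xn for n, xn in enumerate(x)])
    return [
        (a.real * b.real + a.imag * b.imag) / max(abs(a) ** 2, eps)
        for a, b in zip(big_x, big_y)
    ]


# Hypothetical test signal: a unit impulse delayed by 3 samples.
signal = [0.0] * 16
signal[3] = 1.0

hartley = dht(signal)
# The DHT is its own inverse up to a factor 1/N.
recovered = [v / len(signal) for v in dht(hartley)]
# For a pure delay, the group delay equals the delay in every bin.
delays = fourier_group_delay(signal)
```

For a real signal the Hartley spectrum equals the real part of the DFT minus its imaginary part, which is one reason a Hartley-domain quantity can be built by analogy with the Fourier one.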
Lecture Notes in Computer Science, 2004
In this paper, we investigate the problem of classifying video into predefined genres. The approach adopted is based on spatial and temporal descriptors derived from short video sequences (20 seconds). Using support vector machines (SVMs), we propose an optimized multiclass classification method. Five popular TV broadcast genres, namely cartoon, commercials, cricket, football and tennis, are studied. We tested our scheme on more than 2 hours of video data and achieved an accuracy of 92.5%.
IEEE Transactions on Audio, Speech, and Language Processing, 2009
In this paper, we propose an approach for processing multispeaker speech signals collected simultaneously using a pair of spatially separated microphones in a real room environment. Spatial separation of the microphones results in a fixed time delay of arrival of speech signals from a given speaker at the pair of microphones. These time delays are estimated by exploiting the impulse-like characteristic of excitation during speech production. The differences in the time delays for different speakers are used to determine the number of speakers from the mixed multispeaker speech signals. The signal levels at the two microphones also differ because of the different distances between the speaker and each microphone, and these level differences dictate the values of the mixing parameters. Knowledge of speech production, especially the excitation source characteristics, is used to derive an approximate weight function for locating the regions specific to a given speaker. The scatter plots of the weighted and delay-compensated mixed speech signals are used to estimate the mixing parameters. The proposed method is applied to data collected in an actual laboratory environment for an underdetermined case, where the number of speakers is greater than the number of microphones. Enhancement of the speech of a given speaker is also examined using the time delays and the mixing parameters, and is evaluated using objective measures proposed in the literature.
IEEE Signal Processing Letters, 2007
In this letter, we address the issue of determining the number of speakers from multispeaker speech signals collected simultaneously using a pair of spatially separated microphones. The spatial separation of the microphones results in a time delay of arrival of speech signals from a given speaker. The differences in the time delays for different speakers are exploited to determine the number of speakers from the multispeaker signals. The key idea is that, for a given speaker, the relative spacings of the instants of significant excitation of the vocal tract system remain unchanged in the direct components of the speech signals at the two microphones. The time delays can be estimated from the cross-correlation of the Hilbert envelopes of the linear prediction residuals of the multispeaker signals collected at the two microphones.
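The key idea above can be illustrated with a toy sketch. It assumes the Hilbert envelopes of the linear prediction residuals have already been reduced to idealised unit-impulse trains at the excitation instants, so the linear prediction and Hilbert envelope steps are not implemented. All names, impulse positions, delays, and the threshold are hypothetical choices for illustration, not values from the letter.

```python
def cross_correlation(x, y, max_lag):
    """Return {lag: sum_i x[i] * y[i + lag]} for lags in [-max_lag, max_lag]."""
    cc = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, xi in enumerate(x):
            j = i + lag
            if 0 <= j < len(y):
                total += xi * y[j]
        cc[lag] = total
    return cc


def dominant_lags(cc, rel_threshold=0.5):
    """Lags whose correlation reaches rel_threshold times the largest value.
    A simple threshold stands in for proper peak picking."""
    peak = max(cc.values())
    return sorted(lag for lag, value in cc.items() if value >= rel_threshold * peak)


def impulse_train(length, positions):
    """Idealised excitation envelope: unit impulses at the given sample indices."""
    signal = [0.0] * length
    for p in positions:
        signal[p] = 1.0
    return signal


LENGTH = 200
speaker_a = [10, 37, 71, 102, 140, 169]   # excitation instants seen at microphone 1
speaker_b = [22, 55, 83, 121, 150, 184]
DELAY_A, DELAY_B = 3, -5                  # per-speaker delays at microphone 2, in samples

mic1 = impulse_train(LENGTH, speaker_a + speaker_b)
mic2 = impulse_train(
    LENGTH,
    [p + DELAY_A for p in speaker_a] + [p + DELAY_B for p in speaker_b],
)

cc = cross_correlation(mic1, mic2, max_lag=10)
estimated_delays = dominant_lags(cc)
num_speakers = len(estimated_delays)
```

Because each speaker's impulse spacings are preserved at both microphones, all of that speaker's impulses line up at a single lag, while impulses from different speakers coincide only by chance. In this toy case the two dominant lags are -5 and 3, giving an estimate of two speakers.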
International Journal of High Performance Computing and Networking, 2020
Speaker verification can be viewed as the process of verifying a person's identity from his or her utterance. The major challenge in deploying automatic speaker verification in security applications is spoofing attacks. Speaker verification systems can be spoofed using pre-recorded speech, synthetic speech, and voice-converted speech. Hence, a spoof detection system is needed to make voice biometrics viable for security applications. This paper explores time-frequency representations obtained using a gammatone filterbank and the constant Q transform for detecting presentation attacks on automatic speaker verification. The experiments are carried out on the ASVspoof 2017 database, and the results are compared with state-of-the-art replay speech detection systems based on cepstral features.
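As a rough illustration of what distinguishes the constant Q transform from a fixed-resolution spectrogram, the sketch below computes geometrically spaced centre frequencies and the matching analysis window lengths. It uses the standard relations f_k = f_min * 2^(k/B) and Q = 1 / (2^(1/B) - 1). The minimum frequency, bins per octave, bin count, and sample rate are assumed values, not the paper's settings, and the function name is hypothetical.

```python
import math


def cqt_bins(f_min, bins_per_octave, n_bins, sample_rate):
    """Return the quality factor Q and a list of (centre_frequency_hz, window_samples).

    Centre frequencies are geometrically spaced, so the ratio of centre
    frequency to bandwidth (Q) is the same for every bin.
    """
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bins = []
    for k in range(n_bins):
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        # Lower frequencies need longer windows to keep Q constant.
        window = math.ceil(q * sample_rate / f_k)
        bins.append((f_k, window))
    return q, bins


# Assumed settings for illustration only.
q_factor, bins = cqt_bins(f_min=32.0, bins_per_octave=12, n_bins=24, sample_rate=16000)
```

The long windows at low frequencies and short windows at high frequencies give fine frequency resolution in the low band and fine time resolution in the high band.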