Analyzing Noise Robustness of Cochleogram and Mel Spectrogram Features in Deep Learning Based Speaker Recognition
Related papers
Journal of Engineering and Applied Science, 2024
This paper studies the performance and reliability of deep learning-based speaker recognition under various recording conditions and in the presence of background noise. The study uses the Speaker Recognition Dataset available on Kaggle, which contains audio recordings from different speakers, and considers four scenarios with different combinations of speakers. In the first scenario, without added noise, the scheme achieves strong discriminating capability and high accuracy in identifying speakers, with an area under the ROC curve of approximately 1. In the second scenario, with background noise added to the recordings, accuracy decreases and misclassifications increase; the scheme nevertheless retains good discriminating power, with ROC areas ranging from 0.77 to 1.
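The abstract reports per-scenario ROC areas. As a minimal sketch of how such one-vs-rest, per-speaker AUCs might be computed from a classifier's outputs, assuming scikit-learn is available; `scores` and `labels` here are hypothetical stand-ins for a trained model's softmax scores on a test split, not the paper's data:

```python
# Minimal sketch: per-speaker one-vs-rest ROC/AUC from softmax scores.
# `scores` and `labels` are synthetic placeholders for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n_samples, n_speakers = 200, 4
scores = rng.random((n_samples, n_speakers))          # softmax-like scores
scores /= scores.sum(axis=1, keepdims=True)
labels = rng.integers(0, n_speakers, size=n_samples)  # true speaker ids

# One binary ROC area per speaker, the quantity the abstract reports
# (near 1.0 on clean speech, 0.77-1.0 with added noise).
y_bin = label_binarize(labels, classes=range(n_speakers))
for spk in range(n_speakers):
    auc = roc_auc_score(y_bin[:, spk], scores[:, spk])
    print(f"speaker {spk}: AUC = {auc:.3f}")
```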
Investigation of Deep Neural Network for Speaker Recognition
International Journal for Research in Applied Science and Engineering Technology (IJRASET), 2020
In this paper, deep neural networks (DNNs) are investigated for speaker recognition. DNNs have recently been proposed for this task; however, many of the architectural choices and training aspects involved in building such systems have not been studied carefully. We perform several experiments on datasets of 10, 100, and 300 speakers, with total training data of about 120 hours, to evaluate the effect of these choices. Models were evaluated on test data for the 10-, 100-, and 300-speaker sets, with 2.5 hours of utterances per speaker. In our results, we compare the accuracy of GMM, GMM-UBM, and i-vector systems, as well as the time taken by the various modelling techniques. The DNN outperforms these baseline models, indicating the effectiveness of the DNN approach.
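As a rough illustration of the kind of GMM-versus-DNN comparison the abstract describes (not the paper's exact setup), the following sketch fits one Gaussian mixture per speaker and a small fully connected classifier on the same features, using scikit-learn; the data here is synthetic, whereas the paper uses real speech features:

```python
# Illustrative sketch: GMM baseline vs. a small DNN on synthetic
# MFCC-like feature vectors, one class per speaker.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_speakers, dim = 10, 20
X_train = np.vstack([rng.normal(i, 1.0, (100, dim)) for i in range(n_speakers)])
y_train = np.repeat(np.arange(n_speakers), 100)
X_test = np.vstack([rng.normal(i, 1.0, (20, dim)) for i in range(n_speakers)])
y_test = np.repeat(np.arange(n_speakers), 20)

# GMM baseline: one mixture per speaker, classify by max log-likelihood.
gmms = [GaussianMixture(n_components=4, random_state=0).fit(X_train[y_train == s])
        for s in range(n_speakers)]
ll = np.column_stack([g.score_samples(X_test) for g in gmms])
gmm_acc = (ll.argmax(axis=1) == y_test).mean()

# DNN: a small fully connected classifier over the same features.
dnn = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
dnn.fit(X_train, y_train)
print(f"GMM acc: {gmm_acc:.3f}, DNN acc: {dnn.score(X_test, y_test):.3f}")
```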
The structural tuning of the convolutional neural network for speaker identification in mel frequency cepstrum coefficients space, 2023
A literature analysis identified the main methods for speaker identification from speech signals: statistical methods based on the Gaussian mixture model and a universal background model, and neural network methods, in particular convolutional or Siamese neural networks. The main characteristics of these methods are recognition performance, the number of parameters, and the training time. High recognition performance is achieved with convolutional neural networks, but their number of parameters is much higher than for statistical methods, although lower than for Siamese neural networks. A large number of parameters requires a large training set, which is not always available to the researcher. In addition, despite the effectiveness of convolutional neural networks, model size and inference efficiency remain important for devices with limited computing power, such as peripheral or mobile devices. Tuning the structure of existing convolutional neural networks is therefore a relevant research question. In this work, we performed a structural tuning of an existing convolutional neural network based on the VGGNet architecture for speaker identification in the space of mel frequency cepstrum coefficients. The aim was to reduce the number of network parameters and, as a result, the training time, provided that recognition performance remains sufficient (correct recognition above 95%). The neural network proposed as a result of this structural tuning has fewer layers than the base architecture. Instead of the ReLU activation function, the related Leaky ReLU function with a parameter of 0.1 is used. The number of filters and the kernel sizes in the convolutional layers are changed, and the kernel size of the max pooling layer is increased. It is proposed to average the output of each convolution before feeding the two-dimensional convolution results into a fully connected layer with the Softmax activation function. Experiments showed that the proposed network has 29% fewer parameters than the base network while achieving almost the same speaker recognition performance. In addition, the training time of the proposed and base networks was evaluated on five audio datasets corresponding to different numbers of speakers; the training time of the proposed network was reduced by 10-39% compared to the base network. These results show the advisability of structural tuning of convolutional neural networks for devices with limited computing power, namely peripheral or mobile devices.
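A rough PyTorch sketch of the kind of trimmed VGG-style network the abstract describes: fewer layers, Leaky ReLU with parameter 0.1, enlarged max-pooling kernels, and per-channel averaging (global average pooling) before the softmax classifier. Filter counts, kernel sizes, and input shape are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch of a trimmed VGG-style CNN over MFCC "images": LeakyReLU(0.1),
# large pooling kernels, and global average pooling before the classifier.
import torch
import torch.nn as nn

class TunedCNN(nn.Module):
    def __init__(self, n_speakers: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=4),          # enlarged pooling kernel
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=4),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)       # average each feature map
        self.classifier = nn.Linear(64, n_speakers)

    def forward(self, x):                         # x: (batch, 1, n_mfcc, frames)
        x = self.pool(self.features(x)).flatten(1)
        return self.classifier(x)                 # softmax applied in the loss

model = TunedCNN(n_speakers=20)
mfcc_batch = torch.randn(8, 1, 40, 128)           # dummy MFCC feature maps
print(model(mfcc_batch).shape)                    # torch.Size([8, 20])
```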
Automatic Speaker Recognition using Transfer Learning Approach of Deep Learning Models
2021 6th International Conference on Inventive Computation Technologies (ICICT), 2021
Speaker recognition has been one of the most interesting yet challenging problems in the fields of machine learning and artificial intelligence. It is used in human voice authentication for security purposes and for identifying a person from a group of speakers. Teaching a machine the differences between human voices is a demanding task when speakers vary in background, such as gender, language, and accent. In this paper, we use a deep learning approach to build and train two models, an ANN and a CNN, and compare their results. The former is a neural network fed with diverse features extracted from an audio collection; the latter is a convolutional neural network trained on spectrograms. Finally, we apply a transfer learning approach to both models to obtain viable output using less data.
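A hedged sketch of the transfer-learning idea: reuse a CNN pretrained on images, freeze its backbone, and retrain only the classification head on spectrogram "images". The abstract does not name a backbone; ResNet-18 and a 30-speaker head are assumptions for illustration, and a recent torchvision is assumed:

```python
# Sketch: transfer learning for speaker recognition on spectrograms.
# Backbone choice (ResNet-18) and head size are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():                      # freeze pretrained weights
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 30)    # new head: 30 speakers

# Only the new head's parameters are optimized on the small dataset.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
spectrograms = torch.randn(4, 3, 224, 224)        # dummy 3-channel spectrograms
logits = model(spectrograms)
print(logits.shape)                               # torch.Size([4, 30])
```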
IAES International Journal of Artificial Intelligence, 2024
Speaker identification is a biometric task that classifies or identifies a person among other speakers based on speech characteristics. Recently, deep learning models have outperformed conventional machine learning models in speaker identification. Spectrograms of speech have been used as input to deep learning-based speaker identification with clean speech; however, the performance of speaker identification systems degrades under noisy conditions. Cochleograms have shown better results than spectrograms in deep learning-based speaker recognition under noisy and mismatched conditions. Moreover, hybrid convolutional neural network (CNN) and recurrent neural network (RNN) variants have shown better performance than CNN or RNN variants alone in recent studies. However, no attempt has been made to use a hybrid of CNN and enhanced RNN variants for speaker identification with cochleogram input to improve performance under noisy and mismatched conditions. In this study, a speaker identification system using a hybrid CNN and gated recurrent unit (GRU) with cochleogram input is proposed for noisy conditions. The VoxCeleb1 audio dataset with real-world noises, white Gaussian noise (WGN), and without additive noise was used for the experiments. The experimental results and comparison with existing work show that the proposed model performs better than the other models in this study and existing works.
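An illustrative PyTorch sketch of a hybrid CNN-GRU identifier over a cochleogram input, in the spirit of the abstract. Layer sizes and the 64-band input shape are assumptions, and cochleogram extraction (e.g. via a gammatone filterbank) is outside the scope of this sketch:

```python
# Sketch: CNN front-end over a cochleogram, GRU over the resulting
# time sequence, linear classifier on the final hidden state.
import torch
import torch.nn as nn

class CNNGRU(nn.Module):
    def __init__(self, n_speakers: int, n_channels: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, n_channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.gru = nn.GRU(input_size=n_channels * 16, hidden_size=128,
                          batch_first=True)
        self.fc = nn.Linear(128, n_speakers)

    def forward(self, x):                     # x: (batch, 1, 64 bands, frames)
        f = self.cnn(x)                       # (batch, C, 16, frames / 4)
        f = f.permute(0, 3, 1, 2).flatten(2)  # time-major sequence for the GRU
        _, h = self.gru(f)
        return self.fc(h[-1])

model = CNNGRU(n_speakers=40)
cochleogram = torch.randn(2, 1, 64, 200)      # dummy 64-band cochleogram
print(model(cochleogram).shape)               # torch.Size([2, 40])
```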
Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network
IEEE Access
Speaker identification refers to the process of recognizing a human voice using artificial intelligence techniques. Speaker identification technologies are widely applied in voice authentication, security and surveillance, electronic voice eavesdropping, and identity verification. In the speaker identification process, extracting discriminative and salient features from speaker utterances is an important task for accurately identifying speakers. Various features for speaker identification have recently been proposed by researchers. Most studies on speaker identification have utilized short-time features, such as perceptual linear predictive (PLP) coefficients and Mel frequency cepstral coefficients (MFCC), due to their capability to capture the repetitive nature of signals efficiently. Various studies have shown the effectiveness of MFCC features in correctly identifying speakers. However, the performance of these features degrades on complex speech datasets, and therefore, these features fail to accurately identify speaker characteristics. To address this problem, this study proposes a novel fusion of MFCC and time-based features (MFCCT), which combines the effectiveness of MFCC and time-domain features to improve the accuracy of text-independent speaker identification (SI) systems. The extracted MFCCT features were fed as input to a deep neural network (DNN) to construct the speaker identification model. Results showed that the proposed MFCCT features coupled with the DNN outperformed existing baseline MFCC and time-domain features on the LibriSpeech dataset. In addition, the DNN obtained better classification results than five machine learning algorithms recently utilized in speaker recognition. Moreover, this study evaluated the effectiveness of one-level and two-level classification methods for speaker identification; the experimental results showed that two-level classification yields better results than one-level classification. The proposed features and classification model can be widely applied to different types of speaker datasets.
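A hedged sketch of feature fusion in the MFCCT spirit: concatenating frame-level MFCCs with simple time-domain descriptors using librosa. The exact time-domain features the paper fuses may differ; zero-crossing rate and RMS energy are common stand-ins, and the bundled example clip is a placeholder for real speech:

```python
# Sketch: fuse frame-level MFCCs with time-domain descriptors into one
# feature matrix, one fused vector per frame.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))              # placeholder audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # (13, frames)
zcr = librosa.feature.zero_crossing_rate(y)              # (1, frames)
rms = librosa.feature.rms(y=y)                           # (1, frames)

# Align frame counts and stack into one fused feature matrix.
n = min(mfcc.shape[1], zcr.shape[1], rms.shape[1])
fused = np.vstack([mfcc[:, :n], zcr[:, :n], rms[:, :n]])  # (15, frames)
print(fused.shape)   # per-frame vectors fed to the DNN classifier
```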
Speaker Recognition through Deep Learning Techniques
Periodica Polytechnica Electrical Engineering and Computer Science
Deep learning has now become an integral part of today's world, and the field has seen enormous advancement. Due to the extensive use and fast growth of deep learning, it has captured the attention of researchers in the field of speaker recognition. A detailed investigation of the process becomes essential and helpful for researchers designing robust applications in the field of speaker recognition, both in speaker verification and identification. This paper reviews the field of speaker recognition in light of the deep learning advances of the present era that have boosted this technology. The paper proceeds as a systematic review, first giving a basic idea of deep learning and its architectures with their fields of application, then entering the highlighted portion of the paper, i.e., speaker recognition, which is one of the important applications of deep learning. Here we describe its types, different proce...
Linear versus mel frequency cepstral coefficients for speaker recognition
2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011
Mel-frequency cepstral coefficients (MFCC) have been dominantly used in speaker recognition as well as in speech recognition. However, based on theories in speech production, some speaker characteristics associated with the structure of the vocal tract, particularly the vocal tract length, are reflected more in the high frequency range of speech. This insight suggests that a linear scale in frequency may provide some advantages in speaker recognition over the mel scale. Based on two state-of-the-art speaker recognition back-end systems (one Joint Factor Analysis system and one Probabilistic Linear Discriminant Analysis system), this study compares the performances between MFCC and LFCC (Linear frequency cepstral coefficients) in the NIST SRE (Speaker Recognition Evaluation) 2010 extended-core task. Our results in SRE10 show that, while they are complementary to each other, LFCC consistently outperforms MFCC, mainly due to its better performance in the female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech. LFCC benefits more in female speech by better capturing the spectral characteristics in the high frequency region. In addition, our results show some advantage of LFCC over MFCC in reverberant speech. LFCC is as robust as MFCC in babble noise, but not in white noise. It is concluded that LFCC should be more widely used, at least for the female trials, by the mainstream of the speaker recognition community.
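A minimal sketch contrasting the two feature extractions: the cepstral pipeline is identical, but LFCC replaces the mel filterbank with a linearly spaced triangular filterbank. Filterbank and cepstrum sizes are illustrative, librosa and scipy are assumed, and the example clip stands in for real speech:

```python
# Sketch: MFCC vs. LFCC from the same power spectrogram; the only
# difference is the frequency spacing of the triangular filterbank.
import numpy as np
import librosa
from scipy.fftpack import dct

y, sr = librosa.load(librosa.ex("trumpet"))
S = np.abs(librosa.stft(y, n_fft=512)) ** 2        # power spectrogram

# MFCC path: mel filterbank -> log -> DCT.
mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=40)
mfcc = dct(np.log(mel_fb @ S + 1e-10), axis=0, norm="ortho")[:20]

# LFCC path: hand-built linearly spaced triangular filterbank -> log -> DCT.
edges = np.linspace(0, sr / 2, 42)                 # 40 filters + 2 edge points
freqs = librosa.fft_frequencies(sr=sr, n_fft=512)
lin_fb = np.zeros((40, len(freqs)))
for i in range(40):
    lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
    lin_fb[i] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                   (hi - freqs) / (hi - mid)), 0, None)
lfcc = dct(np.log(lin_fb @ S + 1e-10), axis=0, norm="ortho")[:20]
print(mfcc.shape, lfcc.shape)                      # (20, frames) each
```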
The problem addressed in this paper is that the classical statistical approach to speaker recognition yields satisfactory results, but at the expense of long training and test utterances. Reducing the length of speaker samples is of great importance in the field of speaker recognition, since the statistical approach, due to this limitation, is usually precluded from use in real-time applications. A novel method of text-independent speaker recognition is proposed that uses only the correlations among MFCCs, computed over selected speech segments of very short length (approximately 120 ms). Three different neural networks, the Multi-Layer Perceptron (MLP), Steinbuch's Learnmatrix (SLM), and the Self-Organizing Feature Finder (SOFF), are evaluated in a speaker recognition task. The dimensionality-reduction ability of the SOFF paradigm is also discussed.
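A hedged sketch of the core feature idea: correlations among MFCCs over a very short segment. Frame and segment sizes are assumptions chosen so that one segment covers roughly 120 ms of speech, and the example clip is again a placeholder:

```python
# Sketch: MFCC correlation features over one ~120 ms segment.
import numpy as np
import librosa

y, sr = librosa.load(librosa.ex("trumpet"), duration=2.0)
# 25 ms windows with a 10 ms hop: ~12 frames span roughly 120 ms.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

seg = mfcc[:, :12]                      # one short segment of frames
corr = np.corrcoef(seg)                 # (13, 13) correlations among MFCCs
# The upper triangle (excluding the diagonal) gives a compact feature vector.
feature = corr[np.triu_indices(13, k=1)]
print(feature.shape)                    # (78,)
```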