Speaker Recognition from Spectrogram Images
Related papers
Speaker Identification Using a Convolutional Neural Network
Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi), 2022
Speech, a mode of communication between humans and machines, has various applications, including biometric systems for identifying people who have access to secure systems. Feature extraction is an important factor in speech recognition with high accuracy. We therefore used the spectrogram, a pictorial representation of speech in terms of raw features, to identify speakers. These features were input into a convolutional neural network (CNN), and a CNN-visual geometry group (CNN-VGG) architecture was used to recognize the speakers. We used 780 primary data samples from 78 speakers, each of whom uttered a number in Bahasa Indonesia. The proposed architecture, CNN-VGG-f, uses a learning rate of 0.001, a batch size of 256, and 100 epochs. The results indicate that this architecture can generate a suitable model for speaker identification. A spectrogram was used to determine the best features for identifying the speakers. The proposed method exhibited an accuracy of 98.78%, whic...
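A minimal sketch of the first step this paper describes — turning a waveform into a log-magnitude spectrogram image suitable as CNN input. This is an illustrative STFT pipeline, not the paper's actual implementation; frame length and hop size are assumed values.

```python
import numpy as np

def spectrogram(signal, frame_len=512, hop=256):
    """Log-magnitude STFT spectrogram of a 1-D signal."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (frames, freq bins)
    return np.log1p(spec).T                     # (freq bins, frames), image-like

# Example: 1 second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
img = spectrogram(np.sin(2 * np.pi * 440 * t))
print(img.shape)  # (257, 61)
```

The resulting 2-D array can be saved or resized as an image and fed to a CNN exactly like a photograph.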
IAES International Journal of Artificial Intelligence, 2024
Speaker identification is a biometric task that classifies or identifies a person among other speakers based on speech characteristics. Recently, deep learning models have outperformed conventional machine learning models in speaker identification. Spectrograms of speech have been used as input to deep learning-based speaker identification on clean speech. However, the performance of speaker identification systems degrades under noisy conditions. Cochleograms have shown better results than spectrograms in deep learning-based speaker recognition under noisy and mismatched conditions. Moreover, hybrid convolutional neural network (CNN) and recurrent neural network (RNN) variants have shown better performance than CNN or RNN variants alone in recent studies. However, no attempt has been made to use a hybrid of a CNN and enhanced RNN variants for speaker identification with cochleogram input to improve performance under noisy and mismatched conditions. In this study, a speaker identification model using a hybrid CNN and gated recurrent unit (GRU) is proposed for noisy conditions using cochleogram input. The VoxCeleb1 audio dataset with real-world noise, with white Gaussian noise (WGN), and without additive noise was employed for the experiments. The experimental results and a comparison with existing works show that the proposed model performs better than the other models in this study and existing works.
Automatic Speaker Recognition using Transfer Learning Approach of Deep Learning Models
2021 6th International Conference on Inventive Computation Technologies (ICICT), 2021
Speaker recognition has been one of the most interesting yet challenging problems in the fields of machine learning and artificial intelligence. It is used in human voice authentication for security purposes and in identifying a person from a group of speakers. It has been a grinding task to teach a machine the differences in human voices when people belong to different backgrounds of gender, language, and accent. In this paper, we use a deep learning approach to build and train two models, an ANN and a CNN, and compare their results. In the former, the neural network is fed diverse features extracted from an audio collection. The latter is a convolutional neural network trained on spectrograms. Finally, we apply a transfer learning approach to both to obtain viable output using less data.
Speaker Recognition with Recurrent Neural Networks
We report on the application of recurrent neural nets to an open-set, text-dependent speaker identification task. The motivation for applying recurrent neural nets to this domain is to find out whether their ability to take short-term spectral features yet respond to long-term temporal events is advantageous for speaker identification. We use a feedforward net architecture adapted from that introduced by Robinson et al. We introduce a fully connected hidden layer between the input and state nodes and the output, and show that this hidden layer makes the learning of complex classification tasks more efficient. Training uses backpropagation through time. There is one output unit per speaker, with the training targets corresponding to speaker identity. For 12 speakers (a mixture of male and female) we obtain a true acceptance rate of 100% with a false acceptance rate of 4%. For 16 speakers these figures are 94% and 7%, respectively. We also investigate the sensitivity of identification ...
An Approach for Identification of Speaker using Deep Learning
International Journal of Artificial Intelligence & Mathematical Sciences
The volume of audio data across the world is increasing daily with the growth of telephone conversations, video conferences, podcasts, and voice notes. This study presents a mechanism for identifying a speaker in an audio file based on biometric features of the human voice such as frequency, amplitude, and pitch. We propose an unsupervised learning model that uses wav2vec 2.0, in which the model learns speech representations from the provided dataset. We used the LibriSpeech dataset in our research and achieved an error rate of 1.8.
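Once a representation model such as wav2vec 2.0 yields per-utterance embeddings, identification can reduce to a nearest-speaker search over enrolled embeddings. The sketch below illustrates that final matching step only, with synthetic embeddings standing in for learned ones; names and dimensions are assumptions.

```python
import numpy as np

def identify(test_emb, enrolled):
    """Return the enrolled speaker whose mean embedding is most cosine-similar."""
    best, best_score = None, -np.inf
    for spk, embs in enrolled.items():
        centroid = np.mean(embs, axis=0)
        score = np.dot(test_emb, centroid) / (
            np.linalg.norm(test_emb) * np.linalg.norm(centroid))
        if score > best_score:
            best, best_score = spk, score
    return best

# Synthetic clusters: each speaker's embeddings scatter around a random direction
rng = np.random.default_rng(0)
dirs = {f"spk{i}": rng.normal(size=16) for i in range(3)}
enrolled = {s: d + rng.normal(0, 0.1, size=(5, 16)) for s, d in dirs.items()}
probe = dirs["spk2"] + rng.normal(0, 0.1, size=16)  # utterance near spk2's cluster
print(identify(probe, enrolled))
```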
Speaker Diarization using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings
In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to traditional approaches that build their speaker embeddings from manually hand-crafted spectral features, we propose to train for this purpose a recurrent convolutional neural network applied directly to magnitude spectrograms. To compare our approach with the state of the art, we collect and publicly release an additional dataset of over 6 hours of fully annotated broadcast material. The results of our evaluation on the new dataset and three other benchmark datasets show that our proposed method significantly outperforms the competitors and reduces the diarization error rate by a large margin of over 30% with respect to the baseline.
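After embeddings are learned, diarization assigns each audio segment to a speaker by clustering the segment embeddings. The sketch below uses a greedy online centroid scheme for brevity — a deliberate simplification of the spectral or agglomerative clustering real systems (including this paper's) typically use; the threshold is an assumed value.

```python
import numpy as np

def diarize(segments, threshold=0.5):
    """Greedily cluster segment embeddings by cosine similarity; return labels."""
    centroids, labels = [], []
    for emb in segments:
        emb = emb / np.linalg.norm(emb)
        sims = [np.dot(emb, c / np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) > threshold:
            k = int(np.argmax(sims))
            centroids[k] = centroids[k] + emb  # fold segment into its cluster
        else:
            k = len(centroids)                 # new speaker detected
            centroids.append(emb.copy())
        labels.append(k)
    return labels

# Two orthogonal toy embeddings stand in for two speakers' segments
a, b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
print(diarize([a, a, b, a, b]))  # [0, 0, 1, 0, 1]
```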
Text-Independent Speaker Identification Through Feature Fusion and Deep Neural Network
IEEE Access
Speaker identification refers to the process of recognizing human voice using artificial intelligence techniques. Speaker identification technologies are widely applied in voice authentication, security and surveillance, electronic voice eavesdropping, and identity verification. In the speaker identification process, extracting discriminative and salient features from speaker utterances is an important task to accurately identify speakers. Various features for speaker identification have been recently proposed by researchers. Most studies on speaker identification have utilized short-time features, such as perceptual linear predictive (PLP) coefficients and Mel frequency cepstral coefficients (MFCC), due to their capability to capture the repetitive nature and efficiency of signals. Various studies have shown the effectiveness of MFCC features in correctly identifying speakers. However, the performances of these features degrade on complex speech datasets, and therefore, these features fail to accurately identify speaker characteristics. To address this problem, this study proposes a novel fusion of MFCC and time-based features (MFCCT), which combines the effectiveness of MFCC and time-domain features to improve the accuracy of text-independent speaker identification (SI) systems. The extracted MFCCT features were fed as input to a deep neural network (DNN) to construct the speaker identification model. Results showed that the proposed MFCCT features coupled with DNN outperformed existing baseline MFCC and time-domain features on the LibriSpeech dataset. In addition, DNN obtained better classification results compared with five machine learning algorithms that were recently utilized in speaker recognition. Moreover, this study evaluated the effectiveness of one-level and two-level classification methods for speaker identification. The experimental results showed that two-level classification presented better results than one-level classification. 
The proposed features and classification model for identifying a speaker can be widely applied to different types of speaker datasets.
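The MFCCT idea above — concatenating MFCC spectral features with time-domain features per frame — can be sketched as follows. The mel filterbank and DCT here are simplified textbook constructions, not the paper's code, and RMS energy and zero-crossing rate are chosen as representative time-domain features.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filterbank (simplified)."""
    mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    inv = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    pts = inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        if r > c:
            fb[i, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    return fb

def mfcct(frame, sr=16000, n_filters=26, n_ceps=13):
    """13 MFCC-like coefficients fused with 2 time-domain features, one frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    fb = mel_filterbank(n_filters, len(frame), sr)
    logmel = np.log(fb @ spec + 1e-10)
    # DCT-II via an explicit cosine basis keeps the sketch dependency-free
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_filters)))
    ceps = dct @ logmel
    rms = np.sqrt(np.mean(frame ** 2))                   # time-domain: energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # time-domain: crossings
    return np.concatenate([ceps, [rms, zcr]])

frame = np.sin(2 * np.pi * 300 * np.arange(512) / 16000)
print(mfcct(frame).shape)  # (15,)
```

Stacking these 15-dimensional frame vectors over an utterance gives the fused feature matrix a DNN classifier would consume.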
Identifying Speakers Using Deep Learning: A review
International Journal of Science and Business, 2021
With the advancement of technology and the increasing demand for smart systems and smart applications that improve quality of life, there has been a surge in demand for more conscious applications. Machine Learning (ML) is considered one of the driving forces behind implementing these types of applications, and one of its implementations is Speaker Identification (SID). Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) are two main types of Deep Learning used in the implementation of such applications. Speaker identification is being utilized more and more on a daily basis and, as a result of this demand, is a focus of the research community. In this paper, we review some of the most recent research conducted in this area and compare results while discussing outcomes.
Computational Intelligence, 2020
Speaker recognition is a major challenge across languages for researchers. For automatic speaker recognition systems trained on ordinary speech, shouting creates a mismatch between enrollment and test conditions, reducing identification performance because extreme vocal effort is involved in shouting. Long classification times, accuracy optimization, and a low root-mean-square error rate are the major problems in speaker recognition. The objective of this work is to develop an efficient speaker recognition system. In this work, an improved Wiener filter algorithm is applied for better noise reduction. To obtain the essential feature vector values, the Mel-frequency cepstral coefficient feature extraction method is used on the noise-removed signals. Furthermore, input samples are created from these extracted features after their dimensions have been reduced using probabilistic principal component analysis. Finally, recurrent neu...
2024
In the fields of security systems, forensic investigations, and personalized services, speech as a fundamental human input outweighs text-based interaction in importance. This research delves deeply into the complex field of Speaker Identification (SID), examining its essential components and emphasising Mel spectrograms and Mel Frequency Cepstral Coefficients (MFCC) for feature extraction. Moreover, this study analyses six slightly distinct model architectures to evaluate their performance, with hyperparameter tuning applied to the best-performing model. This work also performs a linguistic analysis to verify accent and gender accuracy, in addition to a bias evaluation on the AB-1 Corpus dataset.