Alexandr Kozlov | Saint Petersburg State Electrotechnical University "LETI" (ETU)
Papers by Alexandr Kozlov
This paper describes the ITMO University (DI-IT team) speaker diarization systems submitted to DIHARD Challenge II. As with DIHARD I, this challenge focuses on the diarization task for microphone recordings in varied, difficult conditions. According to the results of the previous DIHARD I Challenge, state-of-the-art diarization systems are based on x-vector embeddings. Such embeddings are clustered with an agglomerative hierarchical clustering (AHC) algorithm by means of PLDA scoring. The current research continues the investigation of deep speaker embedding efficiency for the speaker diarization task. This paper explores new types of embedding extractors with different deep neural network architectures and training strategies. We also used AHC to perform embedding clustering. As an alternative to PLDA scoring in our AHC procedure, we used a discriminatively trained cosine similarity metric learning (CSML) model for scoring. Moreover, we focused on optimal AHC threshold tuning according to t...
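The AHC step described above is straightforward to prototype. Below is a minimal sketch, assuming per-segment speaker embeddings have already been extracted; it substitutes plain cosine distance for the trained CSML/PLDA scoring used in the paper, and the function name and threshold value are illustrative, not the tuned settings from the submission.

```python
# Agglomerative hierarchical clustering of segment embeddings, as in the
# diarization pipeline above, but scored with plain cosine distance rather
# than the trained CSML model; the 0.5 threshold is an assumption.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_segments(embeddings: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Assign a speaker-cluster label to each row of (n_segments x dim)."""
    # Average-linkage AHC on pairwise cosine distances.
    z = linkage(embeddings, method="average", metric="cosine")
    # Cut the dendrogram where cluster distance exceeds the threshold --
    # this is the AHC stopping threshold the paper tunes.
    return fcluster(z, t=threshold, criterion="distance")

# Usage: labels = cluster_segments(np.random.randn(20, 256))
```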
Interspeech 2019, 2019
This paper presents the Speech Technology Center (STC) speaker recognition (SR) systems submitted to the VOiCES From a Distance challenge 2019. The challenge's SR task is focused on the problem of speaker recognition in single-channel distant/far-field audio under noisy conditions. In this work we investigate different deep neural network architectures for speaker embedding extraction to solve the task. We show that deep networks with residual frame-level connections outperform shallower architectures. A simple energy-based speech activity detector (SAD) and an automatic speech recognition (ASR) based SAD are investigated in this work. We also address the problem of data preparation for robust embedding extractor training. Reverberation for data augmentation was performed using an automatic room impulse response generator. In our systems we used a discriminatively trained cosine similarity metric learning model as the embedding backend. A score normalization procedure was applied for each individual subsystem. Our final submitted systems were based on the fusion of the different subsystems. The results obtained on the VOiCES development and evaluation sets demonstrate the effectiveness and robustness of the proposed systems when dealing with distant/far-field audio under noisy conditions.
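Of the two speech activity detectors mentioned, the energy-based one is simple enough to sketch. The following is a minimal illustration under stated assumptions: frame the waveform, compute per-frame log energy, and keep frames within a fixed margin of the loudest frame; the function name, frame sizes, and 30 dB margin are hypothetical choices, not the submission's settings.

```python
# Minimal energy-based SAD sketch: mark frames whose log energy is within
# threshold_db of the loudest frame as speech. All parameter values here
# are illustrative assumptions.
import numpy as np

def energy_sad(signal: np.ndarray, sr: int = 16000, frame_ms: float = 25.0,
               hop_ms: float = 10.0, threshold_db: float = 30.0) -> np.ndarray:
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    # Overlapping frames: (n_frames, frame) view of the waveform.
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame)[::hop]
    # Per-frame log energy in dB, floored to avoid log(0).
    log_e = 10.0 * np.log10(np.maximum((frames ** 2).sum(axis=1), 1e-10))
    return log_e > log_e.max() - threshold_db  # boolean speech mask per frame

# Usage: mask = energy_sad(waveform); speech_ratio = mask.mean()
```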
Interspeech 2019, 2019
This paper describes the Speech Technology Center (STC) anti-spoofing systems submitted to the ASVspoof 2019 challenge. ASVspoof 2019 is an extended version of the previous challenges and includes two evaluation conditions: a logical access use-case scenario with speech synthesis and voice conversion attack types, and a physical access use-case scenario with replay attacks. During the challenge we developed anti-spoofing solutions for both scenarios. The proposed systems are implemented using a deep learning approach and are based on different types of acoustic features. We enhanced the Light CNN architecture previously considered by the authors for replay attack detection, which demonstrated high spoofing detection quality during the ASVspoof 2017 challenge. In particular, here we investigate the efficiency of angular-margin-based softmax activation for training a robust deep Light CNN classifier to solve the above-mentioned tasks. The submitted systems achieved an EER of 1.86% in the logical access scenario and 0.54% in the physical access scenario on the evaluation part of the challenge corpora. The high performance obtained for unknown types of spoofing attacks demonstrates the stability of the proposed approach in both evaluation conditions.
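The angular-margin softmax mentioned above can be made concrete with a short sketch. The variant below is additive-margin softmax, one common angular-margin formulation; the exact variant, scale s, and margin m used in the submitted systems are not specified here, so treat them as assumptions.

```python
# Sketch of an additive-margin softmax classification head, one common
# angular-margin variant; the formulation, scale s, and margin m are
# illustrative assumptions, not the submitted systems' settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxHead(nn.Module):
    def __init__(self, embed_dim: int, n_classes: int,
                 s: float = 30.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_classes, embed_dim))
        self.s, self.m = s, m

    def forward(self, x: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(x), F.normalize(self.weight))
        # Subtract the margin from the target-class cosine only.
        onehot = F.one_hot(labels, num_classes=cosine.size(1)).to(cosine.dtype)
        logits = self.s * (cosine - self.m * onehot)
        return F.cross_entropy(logits, labels)
```

During training this head replaces the plain softmax cross-entropy; at test time the classification layer is dropped and the embedding is taken from the preceding layer.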
Odyssey 2018 The Speaker and Language Recognition Workshop, 2018
We investigate deep neural network performance in the text-independent speaker recognition task. We demonstrate that using an angular softmax activation at the last classification layer of a classification neural network, instead of a simple softmax activation, allows training a more generalized, discriminative speaker embedding extractor. Cosine similarity is an effective metric for speaker verification in this embedding space. We also address the problem of choosing an architecture for the extractor. We found that deep networks with residual frame-level connections outperform wide but relatively shallow architectures. This paper also proposes several improvements to previous DNN-based extractor systems to increase speaker recognition accuracy. We show that the discriminatively trained similarity metric learning approach outperforms the standard LDA-PLDA method as an embedding backend. The results obtained on the Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate the robustness of the proposed systems when dealing with close-to-real-life conditions.
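Since verification in this embedding space reduces to a cosine comparison, a minimal scoring sketch follows; the optional projection matrix stands in for the learned metric-learning transform, and the function name and decision threshold are illustrative assumptions.

```python
# Cosine-similarity speaker verification in the embedding space. The optional
# projection stands in for a learned CSML-style linear transform; the 0.4
# decision threshold is an illustrative assumption.
from typing import Optional
import numpy as np

def verify(enroll: np.ndarray, test: np.ndarray,
           projection: Optional[np.ndarray] = None,
           threshold: float = 0.4) -> bool:
    if projection is not None:  # apply the learned linear transform, if any
        enroll, test = projection @ enroll, projection @ test
    score = float(enroll @ test /
                  (np.linalg.norm(enroll) * np.linalg.norm(test)))
    return score > threshold  # accept the same-speaker hypothesis
```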
Lecture Notes in Computer Science, 2015
Lecture Notes in Computer Science, 2015
2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016