Alexandros Lazaridis - Academia.edu
Papers by Alexandros Lazaridis
2018 IEEE Spoken Language Technology Workshop (SLT)
In this work, we study the use of attention mechanisms to enhance the performance of the state-of-the-art deep learning model in Speech Emotion Recognition (SER). We introduce a new Long Short-Term Memory (LSTM)-based neural network attention model which is able to take into account the temporal information in speech during the computation of the attention vector. The proposed LSTM-based model is evaluated on the IEMOCAP dataset using a 5-fold cross-validation scheme and achieves 68.8% weighted accuracy on 4 classes, outperforming the state-of-the-art models.
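A hedged PyTorch sketch of the general mechanism the abstract describes, attention pooling over LSTM outputs for 4-class SER; the layer sizes, feature dimensions and names are illustrative assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class LSTMAttentionSER(nn.Module):
    """Attention pooling over LSTM states for 4-class emotion recognition."""
    def __init__(self, n_features=40, hidden=128, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)           # scores each time step
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):                           # x: (batch, time, n_features)
        h, _ = self.lstm(x)                         # (batch, time, hidden)
        w = torch.softmax(self.attn(h), dim=1)      # attention weights over time
        context = (w * h).sum(dim=1)                # weighted sum keeps temporal info
        return self.out(context)                    # (batch, n_classes) logits

# e.g. a batch of 8 utterances, 300 frames of 40-dim features each
logits = LSTMAttentionSER()(torch.randn(8, 300, 40))
```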
Interspeech 2009
... Dimitrios P. Lyras, George Kokkinakis, Alexandros Lazaridis, Kyriakos Sgarbas, Fakotakis ... The CPA was later used as a basis for the creation of the SAMPA alphabet [19], together with the two other alphabets existing at the time, ALVEY and COST. ...
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
In sequence modeling tasks the token order matters, but this information can be partially lost due to the discretization of the sequence into data points. In this paper, we study the imbalance between the way certain token pairs are included in data points and others are not. We denote this a token order imbalance (TOI) and we link the partial sequence information loss to a diminished performance of the system as a whole, both in text and speech processing tasks. We then provide a mechanism, Alleviated TOI, to leverage the full token order information by iteratively overlapping the token composition of data points. For recurrent networks, we use prime numbers for the batch size to avoid redundancies when building batches from overlapped data points. The proposed method achieved state-of-the-art performance in both text- and speech-related tasks.
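A hedged sketch of the overlapping idea, as read from the abstract: instead of cutting a token stream into disjoint data points of length L (which leaves the token pair at each cut boundary outside every data point), start a new data point every s < L tokens. The function and parameter values below are illustrative; the paper's exact overlap schedule may differ.

```python
def overlapped_datapoints(tokens, length=8, shift=3):
    """Cut a token stream into fixed-length data points that overlap by
    (length - shift), so every adjacent token pair falls inside at least
    one data point instead of being lost at a cut boundary."""
    return [tokens[i:i + length]
            for i in range(0, len(tokens) - length + 1, shift)]

stream = list(range(20))
points = overlapped_datapoints(stream)
# The paper additionally uses prime batch sizes for recurrent networks so
# that batches built from these overlapped points avoid repeating offsets.
```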
Interspeech 2019
Speech Emotion Recognition (SER) has been shown to benefit from many of the recent advances in deep learning, including recurrent-based and attention-based neural network architectures. Nevertheless, performance still falls short of that of humans. In this work, we investigate whether SER could benefit from the self-attention and global windowing of the transformer model. We show on the IEMOCAP database that this is indeed the case. Finally, we investigate whether using the distribution of possibly conflicting annotations in the training data as soft targets could outperform majority voting. We show that the resulting performance gain increases with the agreement level of the annotators.
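A minimal sketch of training with annotator distributions as soft targets, assuming each utterance comes with per-class annotation counts; cross-entropy against the normalized vote distribution is one standard way to do this, not necessarily the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def soft_target_loss(logits, annotation_counts):
    """Cross-entropy against the normalized distribution of annotator votes.
    annotation_counts: (batch, n_classes) raw vote counts per utterance."""
    soft_targets = annotation_counts / annotation_counts.sum(dim=1, keepdim=True)
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

# e.g. annotators voted [2 angry, 1 sad, 0, 0] for the first utterance,
# unanimously for class 2 on the second
loss = soft_target_loss(torch.randn(2, 4),
                        torch.tensor([[2., 1., 0., 0.], [0., 0., 3., 0.]]))
```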
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics
We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction/augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams.
Journal of Computer Science
Problem statement: In this study we address the task of phone duration modeling for Greek emotional speech synthesis. Approach: Various well-established machine learning techniques are applied for this purpose to an emotional speech database consisting of five archetypal emotions. The constructed phone duration prediction models are built on phonetic, morphosyntactic and prosodic features that can be extracted from text alone. We employ model and regression trees, linear regression, lazy learning algorithms and meta-learning algorithms using regression trees as base classifiers, trained on a Modern Greek emotional database consisting of five emotional categories: anger, fear, joy, neutral and sadness. Results: Model trees based on the M5' algorithm, and meta-learning algorithms using M5'-based regression trees as base classifiers, proved to perform best. Conclusion: The emotional categories of the speech database with the most uniform distribution of phone durations yielded the most accurate models.
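M5' has no standard scikit-learn implementation; as a hedged stand-in, the sketch below fits a CART regression tree, and a bagging meta-learner over such trees, to text-derived features predicting phone durations. Feature contents and shapes are illustrative placeholders.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Illustrative text-derived features (phone class, stress flag, syllable
# position, POS, ... encoded numerically); durations in milliseconds.
X = np.random.rand(1000, 12)
y = 40 + 80 * np.random.rand(1000)

tree = DecisionTreeRegressor(min_samples_leaf=20).fit(X, y)
meta = BaggingRegressor(DecisionTreeRegressor(min_samples_leaf=20),
                        n_estimators=25).fit(X, y)  # meta-learning over trees
print(tree.predict(X[:3]), meta.predict(X[:3]))
```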
International Journal on Artificial Intelligence Tools
Aiming at an automatic sound recognizer for radio broadcasting events, this work investigates a methodology for clustering the audio feature space using the discrimination ability of the audio descriptors as a criterion. From a given, closed set of audio events commonly found in broadcast news transmissions, a large set of audio descriptors is extracted and their data-driven relevance ranking is clustered, providing a more robust feature selection. The clusters of the feature space feed machine learning algorithms implemented as classification models during the experimental evaluation. This methodology showed that support vector machines provide significantly good results in terms of achieved accuracy, owing to their ability to cope well with high-dimensionality experimental conditions.
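One plausible reading of the pipeline, sketched under assumptions: rank descriptors by a relevance score (mutual information here), cluster the ranking to pick a feature subset, then feed an SVM. The paper's actual relevance criterion and clustering step may differ.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.cluster import KMeans
from sklearn.svm import SVC

X = np.random.rand(500, 60)              # 60 audio descriptors (placeholder)
y = np.random.randint(0, 4, 500)         # 4 broadcast event classes

relevance = mutual_info_classif(X, y)    # data-driven relevance ranking
labels = KMeans(n_clusters=3, n_init=10).fit_predict(relevance.reshape(-1, 1))
best = labels == labels[np.argmax(relevance)]   # keep the most relevant cluster
clf = SVC(kernel="rbf").fit(X[:, best], y)      # SVMs cope well with high dimensionality
```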
Interspeech 2016
We describe here a collection of speech data of bilingual and trilingual speakers of English, French, German and Italian. In the context of speech-to-speech translation (S2ST), this database is designed for several purposes and studies: training CLSA systems (cross-language speaker adaptation), conveying emphasis through S2ST systems, and evaluating TTS systems. More precisely, 36 speakers judged as accentless (22 bilingual and 14 trilingual speakers) were recorded for a set of 171 prompts in two or three languages, amounting to a total of 24 hours of speech. These sets of prompts include 100 sentences from news, 25 sentences from Europarl, the same 25 sentences with one acted emphasised word, 20 semantically unpredictable sentences, and finally a 240-word-long text. All in all, this yielded 64 bilingual session pairs over the six possible combinations of the four languages. The database is freely available for non-commercial use and scientific research purposes.
9th ISCA Speech Synthesis Workshop, 2016
In our recent work, a novel speech synthesis with enhanced prosody (SSEP) system using probabilistic amplitude demodulation (PAD) features was introduced. These features were used to improve prosody in speech synthesis. The PAD was applied iteratively, generating syllable and stress amplitude modulations in a cascade manner. The PAD features were used as a secondary input scheme along with the standard text-based input features in deep neural network (DNN) speech synthesis. Objective and subjective evaluations validated the improvement in the quality of the synthesized speech. In this paper, a spectral amplitude modulation phase hierarchy (S-AMPH) technique is used in a way similar to the PAD speech synthesis scheme. Instead of the two modulations used in the PAD case, three modulations, i.e., stress-, syllable- and phoneme-level ones (2, 5 and 20 Hz respectively), are implemented with the S-AMPH model. The objective evaluation has shown that the proposed system using the S-AMPH features improves synthetic speech quality with respect to the system using the PAD features: a relative reduction in mel-cepstral distortion (MCD) of approximately 9%, and a relative reduction in root mean square error (RMSE) of the fundamental frequency (F0) of approximately 25%. Multi-task training is also investigated in this work, giving no statistically significant improvements.
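To make the three modulation rates concrete, here is a hedged scipy sketch that band-passes the speech amplitude envelope around the stress (~2 Hz), syllable (~5 Hz) and phoneme (~20 Hz) rates; the actual S-AMPH filterbank design is more elaborate, and the band edges below are assumptions.

```python
import numpy as np
from scipy.signal import hilbert, resample_poly, butter, sosfiltfilt

def modulation_band(signal, fs, lo, hi, env_fs=100):
    """Band-pass the Hilbert amplitude envelope to isolate one modulation rate."""
    envelope = np.abs(hilbert(signal))
    envelope = resample_poly(envelope, env_fs, fs)   # envelope at 100 Hz
    sos = butter(2, [lo, hi], btype="band", fs=env_fs, output="sos")
    return sosfiltfilt(sos, envelope)

fs = 16000
speech = np.random.randn(fs * 2)                     # placeholder for 2 s of speech
stress   = modulation_band(speech, fs, 1.0, 3.0)     # ~2 Hz stress-level modulation
syllable = modulation_band(speech, fs, 3.0, 8.0)     # ~5 Hz syllable-level modulation
phoneme  = modulation_band(speech, fs, 12.0, 30.0)   # ~20 Hz phoneme-level modulation
```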
Computer Speech & Language, 2017
In this paper, we propose a platform based on phonological speech vocoding for examining relations between phonology and speech processing and, in broader terms, between the abstract and physical structures of the speech signal. The goal of this paper is to go a step further towards bridging both fields and contributing to the program of Laboratory Phonology. We show two application examples for laboratory phonology: a comparison of phonological systems and an experimental phonological parametric text-to-speech (TTS) system. The following three phonological systems are considered in this work: (i) Government Phonology (GP) features, (ii) the Sound Pattern of English (SPE) features, and (iii) the extended SPE (eSPE) features. Comparing GP- and eSPE-based vocoded speech, we conclude that eSPE achieves slightly better results than the other two systems. However, GP, the most compact phonological speech representation, performs comparably to the systems with a higher number of phonological classes. The parametric TTS based on phonological speech representation, trained from an unlabelled audio book in an unsupervised manner, achieves promising results. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. Laboratory phonologists might test the applied concepts of their theoretical models, and the speech processing community may suggest tests with the potential of advancing the performance of current state-of-the-art applications using the concepts developed for the theoretical phonological models.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016
Most current very low bit rate (VLBR) speech coding systems use hidden Markov model (HMM) based speech recognition/synthesis techniques. This allows transmission of information (such as phonemes) segment by segment, which decreases the bit rate. However, an encoder based on phoneme speech recognition may create bursts of segmental errors. Segmental errors are further propagated to optional suprasegmental (such as syllable) information coding. Together with the errors of voicing detection in pitch parametrization, HMM-based speech coding creates speech discontinuities and unnatural speech sound artefacts. In this paper, we propose a novel VLBR speech coding framework based on neural networks (NNs) for end-to-end speech analysis and synthesis without HMMs. The speech coding framework relies on a phonological (sub-phonetic) representation of speech, and it is designed as a composition of deep and spiking NNs: a bank of phonological analysers at the transmitter and a phonological synthesizer at the receiver, both realised as deep NNs, and a spiking NN as an incremental and robust encoder of syllable boundaries for coding of the continuous fundamental frequency (F0). A combination of phonological features defines many more sound patterns than the phonetic features used by HMM-based speech coders, and the finer analysis/synthesis code contributes to smoother encoded speech. Listeners significantly prefer the NN-based approach due to fewer discontinuities and speech artefacts in the encoded speech. A single forward pass is required during speech encoding and decoding. The proposed VLBR speech coding operates at a bit rate of about 360 bits/sec.
2015 IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS), 2015
Automatic speaker verification (ASV) systems are subject to various kinds of malicious attacks. Replay, voice conversion and speech synthesis attacks drastically degrade the performance of a standard ASV system by increasing its false acceptance rates. This issue has raised a high level of interest in the speech research community, where possible voice spoofing attacks and their related countermeasures have been investigated. However, much less effort has been devoted to creating realistic and diverse spoofing attack databases that allow researchers to properly evaluate their countermeasures against attacks. The existing studies are not complete in terms of types of attacks, and are often difficult to reproduce because of the unavailability of public databases. In this paper we introduce the voice spoofing dataset of AVspoof, a public audiovisual spoofing database. AVspoof includes ten realistic spoofing threats generated using replay, speech synthesis and voice conversion. In addition, we provide a set of experimental results that show the effect of such attacks on current state-of-the-art ASV systems.
Lecture Notes in Computer Science, 2015
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015
Current very low bit rate speech coders are, due to complexity limitations, designed to work off-line. This paper investigates incremental speech coding that operates in real time and incrementally (i.e., encoded speech depends only on already-uttered speech, without the need for future speech information). Since human speech communication is asynchronous (i.e., different information flows are processed simultaneously), we hypothesised that such an incremental speech coder should also operate asynchronously. To accomplish this task, we describe speech coding that reflects human cortical temporal sampling, which packages information into units of different temporal granularity, such as phonemes and syllables, in parallel. More specifically, a phonetic vocoder (cascaded speech recognition and synthesis systems) extended with syllable-based information transmission mechanisms is investigated. Two main aspects are evaluated in this work: synchronous and asynchronous coding. Synchronous coding refers to the case when the phonetic vocoder and the speech generation process depend on the syllable boundaries during encoding and decoding, respectively. Asynchronous coding, on the other hand, refers to the case when the phonetic encoding and speech generation processes are done independently of the syllable boundaries. Our experiments confirmed that asynchronous incremental speech coding performs better in terms of intelligibility and overall speech quality, mainly due to better alignment of the segmental and prosodic information. The proposed vocoding operates at an uncompressed bit rate of 213 bits/sec and achieves an average communication delay of 243 ms.
In this paper we investigate external phone duration models (PDMs) for improving the quality of synthetic speech in hidden Markov model (HMM)-based speech synthesis. Support Vector Regression (SVR) and Multilayer Perceptron (MLP) models were used for this task. The SVR and MLP PDMs were compared with the explicit duration modelling of hidden semi-Markov models (HSMMs). Experiments on an American English database showed the SVR outperforming the MLP and HSMM duration modelling in both objective and subjective evaluations. In the objective test, SVR outperformed the MLP and HSMM models, achieving 15.3% and 25.09% relative improvement in terms of root mean square error (RMSE), respectively. Moreover, in the subjective evaluation test on synthesized speech, the SVR model was preferred over the MLP and HSMM models, achieving preference scores of 35.93% and 56.30%, respectively.
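A small sklearn sketch of the comparison's SVR and MLP sides, assuming text-derived feature vectors and phone durations in milliseconds; hyperparameters and data shapes are illustrative, and the HSMM baseline is omitted.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

X = np.random.rand(2000, 15)             # illustrative text-derived features
y = 50 + 60 * np.random.rand(2000)       # phone durations in ms
X_tr, X_te, y_tr, y_te = X[:1600], X[1600:], y[:1600], y[1600:]

svr = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr)
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500).fit(X_tr, y_tr)
for name, model in [("SVR", svr), ("MLP", mlp)]:
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(name, f"RMSE = {rmse:.2f} ms")
```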
In this paper an attempt is made to automatically recognize a speaker's accent among regional Swiss French accents from four regions of Switzerland. To achieve this goal, a syllable-based classification framework is implemented using prosodic features extracted from the speech signal. Since, among these regional accents, the variations in speech mainly originate from the speaking style, i.e., different rhythm and pitch variations, rather than from the pronunciation of the words, we focus mainly on features related to variations in pitch, intensity and rhythm. For the classification task, a well-known and widely used machine learning algorithm, support vector machines (SVMs), was used.
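A sketch, under assumptions, of the kind of per-syllable prosodic feature vector the abstract implies (pitch, intensity and duration statistics) feeding an SVM; real features would come from an F0 tracker and energy contour, not the random placeholders here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def syllable_features(f0, energy, duration):
    """Prosodic descriptors for one syllable: pitch/intensity statistics plus rhythm."""
    return [np.mean(f0), np.std(f0), f0[-1] - f0[0],   # pitch level, variation, slope
            np.mean(energy), np.std(energy),           # intensity
            duration]                                  # rhythm proxy

# Placeholder corpus: 400 syllables from 4 accent regions
X = np.array([syllable_features(np.random.rand(20) * 100 + 120,
                                np.random.rand(20), np.random.rand())
              for _ in range(400)])
y = np.random.randint(0, 4, 400)                       # 4 region labels
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
```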
It is very common for a language to have different dialects or accents. Different pronunciations of the same words are one of the sources of different accents within the same language. Swiss French accents have a pronunciation similar to standard French, but noticeable differences in prosody. In this paper we investigate the use of standard French synthetic acoustic parameters combined with Swiss French prosody in order to evaluate the importance of prosody in modelling Swiss French accents. We use speech synthesis techniques to produce standard French pronunciation with Swiss French duration and intonation. A subjective evaluation rating the degree of Swiss accent was conducted; it showed that prosody modification alone reduces the perceived difference between original Swiss-accented speech and standard French coupled with the original duration and intonation by 29%.
In this paper an attempt is made to automatically recognize the speaker's accent among regional Swiss French accents from four different regions of Switzerland, i.e. Geneva (GE), Martigny (MA), Neuchâtel (NE) and Nyon (NY). To achieve this goal, we rely on a generative probabilistic framework for classification based on Gaussian mixture modelling (GMM). Two different GMM-based algorithms are investigated: (1) the baseline technique of universal background modelling (UBM) followed by maximum-a-posteriori (MAP) adaptation, and (2) total variability (i-vector) modelling. Both systems perform well, with the i-vector-based system outperforming the baseline system, achieving a relative improvement of 17.1% in the overall regional accent identification accuracy.
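A hedged sketch of the baseline GMM-UBM system: fit a universal background model on pooled data, then MAP-adapt the means per accent class and score test utterances by adapted-model likelihood. The relevance factor, mixture size and feature dimension are illustrative, and the i-vector branch is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, X, r=16.0):
    """Relevance-MAP adaptation of UBM means towards class data X."""
    post = ubm.predict_proba(X)                        # (frames, components)
    n = post.sum(axis=0)                               # soft counts per component
    ex = post.T @ X / np.maximum(n, 1e-8)[:, None]     # soft means per component
    alpha = (n / (n + r))[:, None]                     # adaptation weight
    return alpha * ex + (1 - alpha) * ubm.means_

pool = np.random.randn(5000, 13)                       # e.g. MFCC frames, all accents pooled
ubm = GaussianMixture(n_components=64, covariance_type="diag").fit(pool)

accent_models = {}
for accent in ["GE", "MA", "NE", "NY"]:
    X_acc = np.random.randn(800, 13)                   # frames from that accent (placeholder)
    m = GaussianMixture(n_components=64, covariance_type="diag")
    m.weights_, m.covariances_ = ubm.weights_, ubm.covariances_
    m.means_ = map_adapt_means(ubm, X_acc)
    m.precisions_cholesky_ = 1.0 / np.sqrt(m.covariances_)
    accent_models[accent] = m
# Classify a test utterance by the highest average log-likelihood,
# e.g. max over accent_models of m.score(test_frames).
```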