Germán Bordel | Universidad del País Vasco UPV/EHU
Papers by Germán Bordel
Proceedings of the 2nd International Conference on Agents and Artificial Intelligence, 2010
Detecting and tracking people in real time in complicated and crowded scenes is a challenging problem. This paper presents a multi-cue methodology to detect and track pedestrians in real time at entrance gates using stationary CCD cameras. The proposed approach combines two main algorithms: detection and tracking for solitary situations, and an estimation process for overcrowded scenes. In the former, the detection component finds local maxima in the foreground mask of a Gaussian mixture model and Ω-shaped objects in the edge map using a trained PCA, while the tracking engine employs a dynamic VCM with automated criteria based on the shape and size of the detected human-shaped entities. This new approach has several advantages. First, it uses a well-defined and robust feature space which includes polar and angular data. Furthermore, due to its fast method for finding human-shaped objects in the scene, it is intrinsically suitable for real-time purposes. In addition, the approach verifies human-shaped objects with a PCA algorithm, which makes it robust in reducing false positives. This novel approach has been deployed at a sacred place, and the experimental results demonstrate the system's robustness in many difficult situations, such as partial or full occlusion of pedestrians.
Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002.
Morphological information is traditionally used to develop high-quality text-to-speech (TTS) and automatic speech recognition (ASR) systems. The use of this information improves the naturalness and intelligibility of TTS synthesis and provides an appropriate way to select lexical units (LUs) for ASR. Basque is an agglutinative language with a complex word-internal structure, and morphological information is essential in both TTS and ASR. In this work, an automatic morphological segmentation tool oriented to TTS and ASR tasks is presented.
2007 IEEE Workshop on Automatic Identification Advanced Technologies, 2007
The Mel-Frequency Cepstral Coefficients (MFCC) and their derivatives are commonly used as acoustic features for speaker recognition. The issue arises of whether some of those features are redundant or dependent on other features; probably, not all of them are equally relevant for speaker recognition. Reduced feature sets allow more robust estimates of the model parameters. Also, fewer computational resources are required, which is crucial for real-time speaker recognition applications on low-resource devices. In this paper, we use feature weighting as an intermediate step towards feature selection. Genetic algorithms are used to find the optimal set of weights for a 38-dimensional feature set, consisting of 12 MFCC, their first and second derivatives, energy and its first derivative. To evaluate each set of weights, speaker recognition errors are counted over a validation dataset. Speaker models are based on empirical distributions of acoustic labels, obtained through vector quantization. On average, weighting acoustic features yields between 15% and 25% error reduction in speaker recognition tests. Finally, features are sorted according to their weights, and the K features with the greatest average ranks are retained and evaluated. We conclude that combining feature weighting and feature selection allows costs to be reduced without degrading performance.
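As a rough illustration of the modelling pipeline the abstract describes (weighted features, vector quantization, empirical label distributions), the sketch below applies a candidate weight vector when assigning VQ labels and builds a label histogram as a speaker model. All function names, dimensions and data here are invented for illustration; the GA search and error counting are not reproduced.

```python
import numpy as np

def weighted_vq_labels(frames, codebook, weights):
    """Assign each frame the index of its nearest codeword under a
    feature-weighted Euclidean distance."""
    # frames: (T, D), codebook: (K, D), weights: (D,)
    diffs = frames[:, None, :] - codebook[None, :, :]      # (T, K, D)
    dists = (weights * diffs ** 2).sum(axis=-1)            # (T, K)
    return dists.argmin(axis=1)

def label_histogram(labels, n_codewords):
    """Empirical distribution of acoustic labels, used as a naive model."""
    hist = np.bincount(labels, minlength=n_codewords).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 38))    # 38-dim features, as in the paper
codebook = rng.normal(size=(16, 38))   # toy 16-word codebook
weights = np.ones(38)                  # uniform weights = plain Euclidean

labels = weighted_vq_labels(frames, codebook, weights)
model = label_histogram(labels, 16)
```

A GA fitness function would evaluate each candidate `weights` vector by relabelling the validation data this way and counting recognition errors against the resulting speaker models.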
The development of an automatic indexing system for broadcast news requires appropriate video and language resources (LRs) to design all the components of the system. Nowadays, large and well-defined resources can be found for the most widely used languages (e.g. Informedia), but there is a lot of work to do with respect to minority languages. The main goal of this work is the design of resources in Basque and Spanish for the transcription of broadcast news. These two languages were chosen because both are official in the Basque Autonomous Community and both are used by the Basque public radio and television broadcaster EITB.
This paper briefly describes the system presented by the Working Group on Software Technologies (GTTS) of the University of the Basque Country (UPV/EHU) for the Spoken Web Search task at MediaEval 2012. The GTTS system applies state-of-the-art phone decoders to search for approximate matches of the N-best decodings of a spoken query in the phone lattice of the target audio document.
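The core idea of matching N-best query decodings approximately against a decoded document can be illustrated with plain edit distance over phone strings. This is only a toy sketch: the paper searches phone lattices, not flat strings, and all strings below are invented.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[-1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

query_decodings = ["kasa", "gasa"]   # hypothetical N-best phone decodings
document = "lakasablanka"            # hypothetical document phone string

# Best match = minimum distance over all decodings and document substrings.
best = min(edit_distance(q, document[i:i + len(q)])
           for q in query_decodings
           for i in range(len(document) - len(q) + 1))
print(best)  # → 0 ("kasa" occurs exactly in the document)
```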
This paper briefly describes the language recognition systems developed for the 2011 NIST Language Recognition Evaluation (LRE) by the BLZ (Bilbao-Lisboa-Zaragoza) team, a three-site joint effort including GTTS from the University of the Basque Country (Spain), L2F (Spoken Language Systems Lab) from INESC-ID Lisboa (Portugal) and I3A from the University of Zaragoza (Spain). The primary system fuses 8 (3 acoustic + 5 phonotactic) subsystems: a Linearized Eigenchannel GMM (LE-GMM) subsystem, a JFA subsystem, an iVector subsystem, three Phone-SVM subsystems using the Brno University of Technology phone decoders for Czech, Hungarian and Russian, and two Phone-SVM subsystems using the L2F phone decoders for European Portuguese and Brazilian Portuguese. Gaussian backends and multiclass fusion were applied to obtain the final scores. Three contrastive systems were also submitted, featuring: (1) the fusion of the whole set of 13 (6 acoustic + 7 phonotactic) subsystems; (2) the fusion of 3 subsystems, one per site, chosen as the combination yielding the best performance on development data; and (3) the fusion of the same 8 subsystems used in the primary system under a different configuration.
This paper briefly describes the dot-scoring speaker recognition system developed by the Software Technology Working Group (http://gtts.ehu.es) at the University of the Basque Country (EHU) for the NIST 2010 Speaker Recognition Evaluation. The system performs eigenchannel compensation in the sufficient-statistics space, and scoring is performed by a simple dot product. An optimized Matlab implementation of the eigenchannel estimation, the channel compensation and the normalized mean vector computation is provided.
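The appeal of dot scoring is that, once the statistics are compensated and normalized, a trial costs a single inner product. The sketch below shows only that final step on synthetic vectors; the eigenchannel compensation and the actual supervector construction from sufficient statistics are deliberately omitted, and all dimensions are assumptions.

```python
import numpy as np

def dot_score(a, b):
    """Score two supervectors; length-normalizing makes the dot product
    behave like a cosine similarity."""
    return np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

rng = np.random.default_rng(1)
dim = 64                                            # toy supervector size

model = rng.normal(size=dim)                        # enrolled speaker
same = model + 0.1 * rng.normal(size=dim)           # same-speaker test
impostor = rng.normal(size=dim)                     # different speaker

target_score = dot_score(model, same)
impostor_score = dot_score(model, impostor)
```

With well-compensated statistics, target trials should score markedly higher than impostor trials, and a single threshold on the dot product yields the accept/reject decision.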
In this paper, an XML resource definition is presented that fits the architecture of a multilingual (Spanish, English, Basque) spoken document retrieval system. The XML resource not only stores all the information extracted from the audio signal, but also adds the structure required to create an index database and retrieve information according to various criteria. The XML resource is based on the concept of a segment and provides generic but powerful mechanisms to characterize segments and group segments into sections. Audio and video files described through this XML resource can be easily exploited in other tasks, such as topic tracking, speaker diarization, etc.
This paper presents an overview of the Albayzin 2010 Language Recognition Evaluation, carried out from June to October 2010, organized by the Spanish Thematic Network on Speech Technology and coordinated by the Speech Technology Working Group of the University of the Basque Country. The evaluation was designed according to the test procedures, protocols and performance measures used in the most recent NIST Language Recognition Evaluations. Development and evaluation data were extracted from KALAKA-2, a database including clean and noisy speech in various languages, recorded from TV broadcasts and stored in single-channel 16-bit 16 kHz audio files. The task consisted of deciding whether or not a target language was spoken in a test utterance. Four conditions were defined: closed-set/clean-speech, closed-set/noisy-speech, open-set/clean-speech and open-set/noisy-speech. Evaluation was performed on three subsets of test segments, with nominal durations of 30, 10 and 3 seconds, respectively. The task involved 6 target languages: English, Portuguese and the four official languages spoken in Spain (Basque, Catalan, Galician and Spanish); other (unknown) languages were also recorded to allow open-set verification tests. Four teams (two from Spanish universities, one from a Portuguese research center and one from a Finnish university) presented their systems to this evaluation. The best primary system in the closed-set/clean-speech condition on the subset of 30-second segments yielded Cavg = 0.0184 (around 2% EER).
A speech database, named KALAKA, was created to support the Albayzin 2008 Evaluation of Language Recognition Systems, organized by the Spanish Network on Speech Technologies from May to November 2008. This evaluation, designed according to the criteria and methodology applied in the NIST Language Recognition Evaluations, involved four target languages: Basque, Catalan, Galician and Spanish (official languages in Spain), and included speech signals in other (unknown) languages to allow open-set verification trials. In this paper, the process of designing, collecting data and building the train, development and evaluation datasets of KALAKA is described. Results attained in the Albayzin 2008 LRE are presented as a means of evaluating the database. The performance of a state-of-the-art language recognition system on a closed-set evaluation task is also presented for reference. Future work includes extending KALAKA by adding Portuguese and English as target languages and renewing the set of unknown languages needed to carry out open-set evaluations.
Proceedings of Odyssey, 2008
Cohort of background speakers: a few speakers intended to cover all the potential impostors. Universal Background Model (UBM): provides universal acoustic coverage, using a pool of speakers to train a single speaker-independent model.
IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.
In this paper, we address the problem of computing a consensus translation given the outputs from a set of Machine Translation (MT) systems. The translations from the MT systems are aligned with a multiple string alignment algorithm and the consensus translation is then computed. We describe the multiple string alignment algorithm and the consensus MT hypothesis computation. We report on the subjective and objective performance of the multilingual acquisition approach on a limited-domain spoken language application. We evaluate five domain-independent off-the-shelf MT systems and show that the consensus-based translation performs as well as or better than any of the given MT systems in terms of both objective and subjective measures.
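Once the MT outputs have been aligned into columns, the consensus step reduces to position-wise voting. The sketch below shows only that voting step over hand-aligned toy hypotheses (padded with '-' as an epsilon symbol); the multiple string alignment itself, which is the hard part, is not reproduced.

```python
from collections import Counter

# Three hypothetical MT outputs, already aligned to equal length.
aligned = [
    ["the", "cat", "sat", "-",  "mat"],
    ["the", "cat", "sat", "on", "mat"],
    ["a",   "cat", "sat", "on", "mat"],
]

def consensus(hypotheses):
    """Majority vote per alignment column, dropping epsilon slots."""
    columns = zip(*hypotheses)
    voted = [Counter(col).most_common(1)[0][0] for col in columns]
    return [w for w in voted if w != "-"]

print(consensus(aligned))  # → ['the', 'cat', 'sat', 'on', 'mat']
```

Note that the consensus can differ from every individual hypothesis, which is precisely why it can outperform each contributing system.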
6th European Conference on Speech Communication and Technology (Eurospeech 1999)
This paper presents a new system for the continuous speech recognition of Spanish, integrating previous work in the fields of acoustic-phonetic decoding and language modelling. Acoustic and language models (separately trained with speech and text samples, respectively) are integrated into one single automaton, and their probabilities are combined according to a standard beam-search procedure. Two key issues were to adequately adjust the beam parameter and the weight applied to the language-model probabilities. For the implementation, a client-server architecture was selected, due to the desirable working scenario in which one or more simple machines on the client side perform the speech-analysis task, while a more powerful workstation on the server side searches for the best sentence hypotheses. Preliminary experiments give promising results, with around 90% word recognition rates on a medium-vocabulary speech recognition task.
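The two tuning knobs the abstract mentions, the language-model weight and the beam parameter, can be sketched in a few lines. The hypotheses and log-probabilities below are invented toy values, and real decoders prune partial hypotheses during the search rather than complete sentences as done here.

```python
lm_weight = 0.7   # scales LM log-probs against acoustic log-probs
beam = 5.0        # prune hypotheses more than `beam` below the best

hypotheses = [
    # (text, acoustic log-prob, LM log-prob) — all values illustrative
    ("buenos dias", -12.0, -3.0),
    ("buenas dia",  -13.5, -6.0),
    ("vamos ya",    -20.0, -2.5),
]

# Log-linear combination of the two knowledge sources.
scores = {text: ac + lm_weight * lm for text, ac, lm in hypotheses}

best = max(scores.values())
survivors = [t for t, s in scores.items() if s >= best - beam]
print(sorted(survivors))  # → ['buenas dia', 'buenos dias']
```

A wider beam keeps more hypotheses alive (more accurate, slower); a larger LM weight trusts the text statistics more relative to the acoustics. Both are typically tuned on held-out data, as the paper does.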
Interspeech 2006, 2006
The Mel-Frequency Cepstral Coefficients (MFCC) are widely accepted as a suitable representation for speaker recognition applications. MFCC are usually augmented with dynamic features, leading to high-dimensional representations. The issue arises of whether some of those features are redundant or dependent on other features; probably, not all of them are equally relevant for speaker recognition. In this work, we explore the potential benefit of weighting acoustic features to improve speaker recognition accuracy. Genetic algorithms (GAs) are used to find the optimal set of weights for a 38-dimensional feature set. To evaluate each set of weights, recognition error is measured over a validation dataset. Naive speaker models are used, based on empirical distributions of vector quantizer labels. Weighting acoustic features yields 24.58% and 14.68% relative error reductions in two series of speaker recognition tests. These results provide evidence that further improvements in speaker recognition performance can be attained by weighting acoustic features. They also validate the use of GAs to search for an optimal set of feature weights.
IberSPEECH 2018, 2018
This paper describes the systems developed by GTTS-EHU for the QbE-STD and STD tasks of the Albayzin 2018 Search on Speech Evaluation. Stacked bottleneck features (sBNF) are used as the frame-level acoustic representation for both audio documents and spoken queries. In QbE-STD, a flavour of segmental DTW (originally developed for MediaEval 2013) is used to perform the search, which iteratively finds the match that minimizes the average distance between two test-normalized sBNF vectors, until either a maximum number of hits is obtained or the score does not attain a given threshold. The STD task is performed by synthesizing spoken queries (using publicly available TTS APIs), then averaging their sBNF representations and using the average query for QbE-STD. A publicly available toolkit (developed by BUT/Phonexia) has been used to extract three sBNF sets, trained for English monophone and triphone state posteriors (contrastive systems 3 and 4) and for multilingual triphone posteriors (contrastive system 2), respectively. The concatenation of the three sBNF sets has also been tested (contrastive system 1). The primary system consists of a discriminative fusion of the four contrastive systems. Detection scores are normalized on a query-by-query basis (qnorm), calibrated and, if two or more systems are considered, fused with other scores. Calibration and fusion parameters are discriminatively estimated using the ground truth of the development data. Finally, due to a lack of robustness in calibration, Yes/No decisions are made by applying the MTWV thresholds obtained for the development sets, except for the COREMAH test set. In this case, calibration is based on the MAVIR corpus, and the 15% highest scores are taken as positive (Yes) detections.
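The matching idea behind query-by-example search can be illustrated with plain DTW between feature sequences: a query aligned against a segment containing the same speech should accumulate a much lower average cost than against unrelated audio. This is only a sketch on synthetic data; the segmental variant, test normalization and hit iteration described in the paper are not reproduced.

```python
import numpy as np

def dtw_cost(q, d):
    """Average-per-step DTW alignment cost between two feature sequences
    q (Nq x D) and d (Nd x D), with Euclidean local distances."""
    nq, nd = len(q), len(d)
    cost = np.full((nq + 1, nd + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, nq + 1):
        for j in range(1, nd + 1):
            local = np.linalg.norm(q[i - 1] - d[j - 1])
            cost[i, j] = local + min(cost[i - 1, j],       # skip doc frame
                                     cost[i, j - 1],       # skip query frame
                                     cost[i - 1, j - 1])   # match
    return cost[nq, nd] / (nq + nd)

rng = np.random.default_rng(2)
query = rng.normal(size=(10, 8))                    # toy 8-dim features
match = query + 0.05 * rng.normal(size=(10, 8))     # near-identical segment
nonmatch = rng.normal(size=(10, 8))                 # unrelated segment
```

A detector built on this would slide the query over candidate document regions, keep the lowest-cost alignments as hits, and threshold the (calibrated) scores for the Yes/No decisions.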
ISSPA '99. Proceedings of the Fifth International Symposium on Signal Processing and its Applications (IEEE Cat. No.99EX359)
Speech understanding applications in which a word-based output of the uttered sentence is not needed can benefit from the use of alternative lexical units. Experimental results from these systems show that the use of non-word lexical units brings a new degree of freedom for improving system performance (better recognition rates and smaller models can be obtained in comparison to word-based models). However, if the aim of the system is speech-to-text translation, a post-processing stage must be included to convert the non-word sequences into word sentences. In this paper, a technique to perform this conversion, as well as an experimental test carried out over a task-oriented Spanish corpus, is reported. As a conclusion, we see that the whole speech-to-text system neatly outperforms the word-constrained baseline system.
Lecture Notes in Computer Science, 2005
Automatic indexing of broadcast news is a developing research area of great recent interest [1]. This paper describes the development steps for designing an automatic indexing system of broadcast news for both Basque and Spanish. This application requires appropriate language resources to design all the components of the system. Nowadays, large and well-defined resources can be found for the most widely used languages, but there is a lot of work to do with respect to minority languages. Even though Spanish has many more resources than Basque, this work makes parallel efforts for both languages. These two languages were chosen because both are official in the Basque Autonomous Community and they are used by many mass media of the Community, including the Basque Public Radio and Television EITB [2].
This paper describes the structure of, and the various problems addressed in, the development of a software tool for the Basque Parliament that generates subtitles for videos from the available textual transcriptions. Most of the difficulties were found in the pre-processing of the text and in the synchronization of text and audio. The tool is able to handle multilingual resources, which was also a significant source of difficulty.
1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), 1999
If the objective of a continuous automatic speech understanding system is not a speech-to-text translation, words are not strictly needed, and the use of alternative lexical units (LUs) then brings a new degree of freedom to improve the system performance. Consequently, we experimentally explore some methods to automatically extract a set of LUs from a Spanish training corpus and verify that the system can be improved in two ways: reducing the computational costs and increasing the recognition rates. Moreover, preliminary results point out that, even if the system target is a speech-to-text translation, using non-word units and post-processing the output to produce the corresponding word chain outperforms the word-based system. This work has been partially supported by the Spanish CICYT under grant TIC-94-0210-E and by the UPV/EHU under grant UPV-224.310-EA036/97.
Proceedings of the 2nd International Conference on Agents and Artificial Intelligence, 2010
Detecting and tracking people in real-time in complicated and crowded scenes is a challenging pro... more Detecting and tracking people in real-time in complicated and crowded scenes is a challenging problem. This paper presents a multi-cue methodology to detect and track pedestrians in real-time in the entrance gates using stationary CCD cameras. The proposed approach is the combination of two main algorithms, the detecting and tracking for solitude situations and an estimation process for overcrowded scenes. In the former method, the detection component includes finding local maximums in foreground mask of Gaussian-Mixture and Ω-shaped objects in the edge map by trained PCA. And the tracking engine employs a Dynamic VCM with automated criteria based on the shape and size of detected human shaped entities. This new approach has several advantages. First, it uses a well-defined and robust feature space which includes polar and angular data. Furthermore due to its fast method to find human shaped objects in the scene, it's intrinsically suitable for real-time purposes. In addition, this approach verifies human formed objects based on PCA algorithm, which makes it robust in decreasing false positive cases. This novel approach has been implemented in a sacred place and the experimental results demonstrated the system's robustness under many difficult situations such as partial or full occlusions of pedestrians.
Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002.
ABSTRACT Morphological information is traditionally used to develop high quality text to speech (... more ABSTRACT Morphological information is traditionally used to develop high quality text to speech (TTS) and automatic speech recognition (ASR) systems. The use of this information improves the naturalness and intelligibility of the TTS synthesis and provides an appropriated way to select lexical units (LU) for ASR. Basque is an agglutinative language with a complex structure inside the words and the morphological information is essential both in TTS and ASR. In this work an automatic morphological segmentation tool oriented to TTS and ASR tasks is presented.
2007 IEEE Workshop on Automatic Identification Advanced Technologies, 2007
The Mel-Frequency Cepstral Coefficients (MFCC) and their derivatives are commonly used as acousti... more The Mel-Frequency Cepstral Coefficients (MFCC) and their derivatives are commonly used as acoustic features for speaker recognition. The issue arises of whether some of those features are redundant or dependent on other features. Probably, not all of them are equally relevant for speaker recognition. Reduced feature sets allow more robust estimates of the model parameters. Also, less computational resources are required, which is crucial for real-time speaker recognition applications using low-resource devices. In this paper, we use feature weighting as an intermediate step towards feature selection. Genetic algorithms are used to find the optimal set of weights for a 38-dimensional feature set, consisting of 12 MFCC, their first and second derivatives, energy and its first derivative. To evaluate each set of weights, speaker recognition errors are counted over a validation dataset. Speaker models are based on empirical distributions of acoustic labels, obtained through vector quantization. On average, weighting acoustic features yields between 15% and 25% error reduction in speaker recognition tests. Finally, features are sorted according to their weights, and the K features with greatest average ranks are retained and evaluated. We conclude that combining feature weighting and feature selection allows to reduce costs without degrading performance. 1
The development of an automatic index system of broadcast news requires appropriate Video and Lan... more The development of an automatic index system of broadcast news requires appropriate Video and Language Resources (LR) to design all the components of the system. Nowadays, large and well-defined resources can be found in most widely used languages (Informedia), but there is a lot of work to do with respect to minority languages. The main goal of this work is the design of resources in Basque and Spanish for the transcription of broadcast news. These two languages have been chosen because they are both official in the Basque Autonomous Community and they are used in the Basque Public Radio and Television EITB (EITB).
This paper briefly describes the system presented by the Working Group on Software Technologies (... more This paper briefly describes the system presented by the Working Group on Software Technologies (GTTS) 1 of the University of the Basque Country (UPV/EHU) to the Spoken Web Search task at MediaEval 2012. The GTTS system apply state-of-the-art phone decoders to search for approximate matchings of the N-best decodings of a spoken query in the phone lattice of the target audio document.
This paper briefly describes the language recognition systems developed for the 2011 NIST Languag... more This paper briefly describes the language recognition systems developed for the 2011 NIST Language Recognition Evaluation (LRE) by the BLZ (Bilbao-Lisboa-Zaragoza) team, a threesite joint including GTTS from the University of the Basque Country (Spain), L 2 F (Spoken Language Systems Lab) from INESC-ID Lisboa (Portugal) and I3A from the University of Zaragoza (Spain). The primary system fuses 8 (3 acoustic + 5 phonotactic) subsystems: a Linearized Eigenchannel GMM (LE-GMM) subsystem, a JFA subsystem, an iVector subsystem, three Phone-SVM subsystems using the Brno University of Technology phone decoders for Czech, Hungarian and Russian, and two Phone-SVM subsystems using the L 2 F phone decoders for European Portuguese and Brazilian Portuguese. Gaussian backends and multiclass fusion have been applied to get the final scores. Three contrastive systems have been also submitted, featuring: (1) the fusion of the whole set of 13 (6 acoustic + 7 phonotactic) subsystems; (2) the fusion of 3 subsystems, for the combination of one subsystem per site yielding the best performance on development data; and (3) the fusion of the same 8 subsystems used in the primary system under a different configuration.
This paper briefly describes the dot-scoring speaker recognition system developed by the Software... more This paper briefly describes the dot-scoring speaker recognition system developed by the Software Technology Working Group (http://gtts.ehu.es) at the University of the Basque Country (EHU), for the NIST 2010 Speaker Recognition Evaluation. The system does eigenchannel compensation in the sufficient statistics space and scoring is performed by a simple dot product. An optimized Matlab implementation of of the eigenchannels estimation, the channel compensation and the normalized mean vector computation is provided.
In this paper, an XML resource definition is presented fitting in with the architecture of a mult... more In this paper, an XML resource definition is presented fitting in with the architecture of a multilingual (Spanish, English, Basque) spoken document retrieval system. The XML resource not only stores all the information extracted from the audio signal, but also adds the structure required to create an index database and retrieve information according to various criteria. The XML resource is based on the concept of segment and provides generic but powerful mechanisms to characterize segments and group segments into sections. Audio and video files described through this XML resource can be easily exploited in other tasks, such as topic tracking, speaker diarization, etc.
This paper presents an overview of the Albayzin 2010 Language Recognition Evaluation, carried out... more This paper presents an overview of the Albayzin 2010 Language Recognition Evaluation, carried out from June to October 2010, organized by the Spanish Thematic Network on Speech Technology and coordinated by the Speech Technology Working Group of the University of the Basque Country. The evaluation was designed according to the test procedures, protocols and performance measures used in the last NIST Language Recognition Evaluations. Development and evaluation data were extracted from KALAKA-2, a database including clean and noisy speech in various languages, recorded from TV broadcasts and stored in single-channel 16-bit 16 kHz audio files. The task consisted in deciding whether or not a target language was spoken in a test utterance. Four different conditions were defined: closed-set/clean-speech, closed-set/noisy speech, open-set/clean-speech and open-set/noisy speech. Evaluation was performed on three subsets of test segments, with nominal durations of 30, 10 and 3 seconds, respectively. The task involved 6 target languages: English, Portuguese and the four official languages spoken in Spain (Basque, Catalan, Galician and Spanish), other (unknown) languages being also recorded to allow open-set verification tests. Four teams (2 from Spanish universities, one from a Portuguese research center and one from a Finnish university) presented their systems to this evaluation. The best primary system in the closed-set/cleanspeech condition on the subset of 30-second segments yielded Cavg = 0.0184 (around 2% EER).
A speech database, named KALAKA, was created to support the Albayzin 2008 Evaluation of Language ... more A speech database, named KALAKA, was created to support the Albayzin 2008 Evaluation of Language Recognition Systems, organized by the Spanish Network on Speech Technologies from May to November 2008. This evaluation, designed according to the criteria and methodology applied in the NIST Language Recognition Evaluations, involved four target languages: Basque, Catalan, Galician and Spanish (official languages in Spain), and included speech signals in other (unknown) languages to allow open-set verification trials. In this paper, the process of designing, collecting data and building the train, development and evaluation datasets of KALAKA is described. Results attained in the Albayzin 2008 LRE are presented as a means of evaluating the database. The performance of a state-of-the-art language recognition system on a closed-set evaluation task is also presented for reference. Future work includes extending KALAKA by adding Portuguese and English as target languages and renewing the set of unknown languages needed to carry out open-set evaluations.
Proceedings of Odyssey, 2008
Cohort of background speakers Few speakers to cover all the potential impostors Universal Backgro... more Cohort of background speakers Few speakers to cover all the potential impostors Universal Background Model (UBM) Provides universal acoustic coverage Uses a pool of speakers to train a single speaker-independent model
IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.
In this paper, we address the problem of computing a consensus translation given the outputs from... more In this paper, we address the problem of computing a consensus translation given the outputs from a set of Machine Translation (MT) systems. The translations from the MT systems are aligned with a multiple string alignment algorithm and the consensus translation is then computed. We describe the multiple string alignment algorithm and the consensus MT hypothesis computation. We report on the subjective and objective performance of the multilingual acquisition approach on a limited domain spoken language application. We evaluate five domain-independent off-theshelf MT systems and show that the consensus-based translation performs equal or better than any of the given MT systems both in terms of objective and subjective measures.
6th European Conference on Speech Communication and Technology (Eurospeech 1999)
This paper presents a new system for the continuous speech recognition of Spanish, integrating pr... more This paper presents a new system for the continuous speech recognition of Spanish, integrating previous works in the fields of acoustic-phonetic decoding and language modelling. Acoustic and language modelsseparately trained with speech and text samples, respectively-are integrated into one single automaton, and their probabilities combined according to a standard beam search procedure. Two key issues were to adequately adjust the beam parameter and the weight affecting the language model probabilities. For the implementation, a client-server arquitecture was selected, due to the desirable working scene where one or more simple machines in the client side make the speech analysis task, and a more powerful workstation in the server side looks for the best sentence hypotheses. Preliminary experimentation gives promising results with around 90% word recognition rates in a medium size word speech recognition task 1 .
Interspeech 2006, 2006
The Mel-Frequency Cepstral Coefficients (MFCC) are widely accepted as a suitable representation for speaker recognition applications. MFCC are usually augmented with dynamic features, leading to high-dimensional representations. The issue arises of whether some of those features are redundant or dependent on other features; probably, not all of them are equally relevant for speaker recognition. In this work, we explore the potential benefit of weighting acoustic features to improve speaker recognition accuracy. Genetic algorithms (GAs) are used to find the optimal set of weights for a 38-dimensional feature set. To evaluate each set of weights, recognition error is measured over a validation dataset. Naive speaker models are used, based on empirical distributions of vector quantizer labels. Weighting acoustic features yields 24.58% and 14.68% relative error reductions in two series of speaker recognition tests. These results provide evidence that further improvements in speaker recognition performance can be attained by weighting acoustic features. They also validate the use of GAs to search for an optimal set of feature weights.
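As a rough illustration of the search, the sketch below runs a tiny GA over feature weights, scoring each candidate with leave-one-out 1-NN accuracy under a weighted Euclidean distance. Everything here is a hypothetical toy: the real system evaluated weights on 38-dimensional vectors with VQ-label speaker models.

```python
import random

def ga_feature_weights(X, y, dim, pop=20, gens=30, seed=0):
    """Toy GA searching feature weights that maximize 1-NN accuracy."""
    rng = random.Random(seed)

    def dist(a, b, w):
        # weighted squared Euclidean distance
        return sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b))

    def fitness(w):
        # leave-one-out nearest-neighbour classification accuracy
        correct = 0
        for i, xi in enumerate(X):
            j = min((j for j in range(len(X)) if j != i),
                    key=lambda j: dist(xi, X[j], w))
            correct += y[j] == y[i]
        return correct / len(X)

    population = [[rng.random() for _ in range(dim)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        parents = population[:pop // 2]          # truncation selection
        children = []
        for _ in range(pop - len(parents)):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(dim)             # one-point crossover
            child = a[:cut] + b[cut:]
            k = rng.randrange(dim)               # point mutation, clamped
            child[k] = min(1.0, max(0.0, child[k] + rng.gauss(0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [(0.0, 5.0), (0.1, 1.0), (1.0, 4.0), (1.1, 2.0)]
y = [0, 0, 1, 1]
w = ga_feature_weights(X, y, dim=2)
print([round(wi, 2) for wi in w])
```

In the paper the fitness function is instead the recognition error on a validation set, which is the expensive part of the search and the reason simple speaker models were used.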
IberSPEECH 2018, 2018
This paper describes the systems developed by GTTS-EHU for the QbE-STD and STD tasks of the Albayzin 2018 Search on Speech Evaluation. Stacked bottleneck features (sBNF) are used as the frame-level acoustic representation for both audio documents and spoken queries. In QbE-STD, a flavour of segmental DTW (originally developed for MediaEval 2013) is used to perform the search, which iteratively finds the match that minimizes the average distance between two test-normalized sBNF vectors, until either a maximum number of hits is obtained or the score does not attain a given threshold. The STD task is performed by synthesizing spoken queries (using publicly available TTS APIs), then averaging their sBNF representations and using the average query for QbE-STD. A publicly available toolkit (developed by BUT/Phonexia) has been used to extract three sBNF sets, trained for English monophone and triphone state posteriors (contrastive systems 3 and 4) and for multilingual triphone posteriors (contrastive system 2), respectively. The concatenation of the three sBNF sets has also been tested (contrastive system 1). The primary system consists of a discriminative fusion of the four contrastive systems. Detection scores are normalized on a query-by-query basis (qnorm), calibrated and, if two or more systems are considered, fused with other scores. Calibration and fusion parameters are discriminatively estimated using the ground truth of the development data. Finally, due to a lack of robustness in calibration, Yes/No decisions are made by applying the MTWV thresholds obtained for the development sets, except for the COREMAH test set. In this case, calibration is based on the MAVIR corpus, and the 15% highest scores are taken as positive (Yes) detections.
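A crude illustration of the query-by-example search: full DTW with average-cost normalization, slid over fixed-width document segments. The actual system uses a more elaborate segmental DTW over sBNF vectors; the code below is a simplified assumption working on scalar frames.

```python
def dtw_cost(query, segment, dist):
    """Average-cost DTW between a query and a document segment."""
    n, m = len(query), len(segment)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]      # accumulated cost
    steps = [[0] * (m + 1) for _ in range(n + 1)]    # path length
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(query[i - 1], segment[j - 1])
            prev = min((D[i - 1][j - 1], steps[i - 1][j - 1]),
                       (D[i - 1][j],     steps[i - 1][j]),
                       (D[i][j - 1],     steps[i][j - 1]))
            D[i][j] = prev[0] + d
            steps[i][j] = prev[1] + 1
    return D[n][m] / steps[n][m]   # normalize by path length

def search(query, document, dist, width):
    """Return the best-matching segment start and its normalized cost."""
    best = min(range(len(document) - width + 1),
               key=lambda s: dtw_cost(query, document[s:s + width], dist))
    return best, dtw_cost(query, document[best:best + width], dist)

doc = [0.0, 0.0, 5.0, 6.0, 5.0, 0.0, 0.0]
q = [5.0, 6.0, 5.0]
print(search(q, doc, lambda a, b: abs(a - b), width=3))  # → (2, 0.0)
```

The segmental variant additionally lets the end point float and extracts several non-overlapping hits per document, which is what allows the iterative "find match, then continue" behaviour described above.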
ISSPA '99. Proceedings of the Fifth International Symposium on Signal Processing and its Applications (IEEE Cat. No.99EX359)
Speech understanding applications where a word-based output of the uttered sentence is not needed can benefit from the use of alternative lexical units. Experimental results from these systems show that the use of non-word lexical units brings a new degree of freedom to improve system performance (better recognition rates and smaller model sizes can be obtained in comparison to word-based models). However, if the aim of the system is a speech-to-text translation, a post-processing stage must be included in order to convert the non-word sequences into word sentences. In this paper, a technique to perform this conversion, as well as an experimental test carried out over a task-oriented Spanish corpus, are reported. As a conclusion, we see that the whole speech-to-text system neatly outperforms the word-constrained baseline system.
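The conversion stage can be illustrated with a toy dynamic-programming segmentation of the recognized unit stream against a word lexicon. The function and lexicon are hypothetical, not the technique reported in the paper.

```python
def units_to_words(units, lexicon):
    """Convert a recognized sequence of sub-word units into words.

    Toy sketch: the unit stream is concatenated and a DP table finds
    a segmentation of the full stream into lexicon words, if one
    exists (returns None otherwise).
    """
    s = "".join(units)
    n = len(s)
    best = [None] * (n + 1)   # best[i] = word list covering s[:i]
    best[0] = []
    for i in range(n):
        if best[i] is None:
            continue
        for w in lexicon:
            if s.startswith(w, i) and best[i + len(w)] is None:
                best[i + len(w)] = best[i] + [w]
    return best[n]

print(units_to_words(["bue", "nos", "di", "as"], {"buenos", "dias", "di"}))
# → ['buenos', 'dias']
```

A realistic converter would score competing segmentations with a word language model rather than accept the first full cover, but the table-filling structure is the same.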
Lecture Notes in Computer Science, 2005
Automatic indexing of broadcast news is a developing research area of great recent interest [1]. This paper describes the development steps for designing an automatic indexing system of broadcast news for both Basque and Spanish. This application requires appropriate language resources to design all the components of the system. Nowadays, large and well-defined resources can be found for most widely used languages, but there is a lot of work to do with respect to minority languages. Even though Spanish has many more resources than Basque, this work devotes parallel effort to both languages. These two languages have been chosen because they are both official in the Basque Autonomous Community and are used in many mass media of the Community, including the Basque public radio and television broadcaster EITB [2].
This article describes the structure of, and the various problems addressed in, the development of a software tool for the Basque Parliament that generates subtitles for videos from the available textual transcriptions. Most of the difficulties were found in the pre-processing of the text and in the synchronization of text and audio. The tool is able to deal with multilingual resources, which was also a significant source of difficulty.
1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258), 1999
If the objective of a Continuous Automatic Speech Understanding system is not a speech-to-text translation, words are not strictly needed, and the use of alternative lexical units (LUs) brings a new degree of freedom to improve system performance. Consequently, we experimentally explore some methods to automatically extract a set of LUs from a Spanish training corpus and verify that the system can be improved in two ways: reducing the computational costs and increasing the recognition rates. Moreover, preliminary results point out that, even if the system target is a speech-to-text translation, using non-word units and post-processing the output to produce the corresponding word chain outperforms the word-based system. This work has been partially supported by the Spanish CICYT under grant TIC-94-0210-E and by the UPV/EHU under grant UPV-224.310-EA036/97.