Gilles Boulianne | Centre de Recherche Informatique de Montréal
Papers by Gilles Boulianne
This paper describes the ASR system proposed by the SODA consortium to participate in the ASR task of the French REPERE evaluation campaign. The official REPERE test corpus is composed of TV shows. The complete ASR system was produced by combining two ASR systems built by two members of the consortium. Each system has its own distinguishing features: one uses i-vector-based speaker adaptation of deep neural networks for acoustic modeling, while the other rescores word lattices with continuous-space language models. The combined system won the REPERE evaluation campaign on the ASR task, reaching a word error rate of 13.5% on the REPERE test corpus.
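The i-vector speaker adaptation mentioned above is commonly implemented by appending a fixed per-speaker i-vector to the acoustic features at every frame before they enter the DNN. A minimal sketch of that input augmentation, assuming features and i-vectors have already been extracted (array shapes and the function name are illustrative, not from the paper):

```python
import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Append the same speaker i-vector to every acoustic frame.

    frames:  (num_frames, feat_dim) filterbank or MFCC features
    ivector: (ivec_dim,) speaker identity vector
    Returns: (num_frames, feat_dim + ivec_dim) adapted DNN input
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))  # repeat i-vector per frame
    return np.hstack([frames, tiled])

# Example: 300 frames of 40-dim features plus a 100-dim i-vector
feats = np.random.randn(300, 40)
ivec = np.random.randn(100)
dnn_input = append_ivector(feats, ivec)  # shape (300, 140)
```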
IEEE Transactions on Audio, Speech and Language Processing, 2000
We compare two approaches to the problem of session variability in GMM-based speaker verification, eigenchannels and joint factor analysis, on the NIST 2005 speaker recognition evaluation data. We show how the two approaches can be implemented using essentially the same software at all stages except for the enrollment of target speakers. We demonstrate the effectiveness of zt-norm score normalization and a new decision criterion for speaker recognition which can handle large numbers of t-norm speakers and large numbers of speaker factors at little computational cost. We found that joint factor analysis was far more effective than eigenchannel modeling. The best result we obtained was a detection cost of 0.016 on the core condition (all trials) of the evaluation.
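For reference, the joint factor analysis model discussed here is usually written as a decomposition of the speaker- and session-dependent GMM supervector; a sketch in the notation common in the JFA literature (not necessarily this paper's exact symbols):

```latex
% Speaker- and channel-dependent supervector in joint factor analysis:
%   m : speaker-independent UBM supervector
%   V : eigenvoice matrix,    y : speaker factors
%   U : eigenchannel matrix,  x : channel (session) factors
%   D : diagonal residual,    z : speaker-specific residual factors
M = m + V\,y + U\,x + D\,z
% The eigenchannel approach is the special case with no eigenvoice term:
M = m + U\,x + D\,z
```

Eigenchannel modeling corresponds to dropping the Vy term, which is why the two approaches can share most of their software outside of speaker enrollment.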
IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU '01), 2001
OUT-OF-VOCABULARY WORD MODELING USING MULTIPLE LEXICAL FILLERS. Gilles Boulianne, Pierre Dumouchel ... We describe a lexical filler model that can be used in a single-pass recognition system to detect out-of-vocabulary words and reduce the error rate. ...
2008 IEEE Spoken Language Technology Workshop, 2008
Real-time speech recognition captioning has not progressed much beyond television broadcast to other tasks such as meetings in the workplace. A number of obstacles prevent this transition, such as the lack of proper means to receive and display captions, or the cost of on-site shadow speakers. More problematic is the insufficient performance of speech recognition for less formal and one-time events. We describe how we developed a mobile platform for remote captioning during trials in several conferences and meetings. We also show that sentence selection based on relative entropy allows training of adequate language models with small amounts of in-domain data, making real-time captioning of an event possible with only a few hours of preparation.
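Relative-entropy-based sentence selection is typically done by scoring each candidate sentence under an in-domain language model and a general background model, and keeping sentences the in-domain model prefers; the Moore-Lewis cross-entropy difference is one well-known variant, used here as a stand-in since the paper's exact criterion is not given. A minimal sketch under that assumption, with smoothed unigram models standing in for the real LMs:

```python
import math
from collections import Counter

def unigram_logprob(sentence, counts, total, vocab_size):
    """Add-one-smoothed unigram log-probability, averaged per word."""
    words = sentence.split()
    lp = sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)
    return lp / max(1, len(words))

def select_sentences(candidates, in_domain_text, background_text, threshold=0.0):
    """Keep candidates whose per-word log-prob is higher under the
    in-domain model than under the background model (Moore-Lewis style)."""
    in_c = Counter(in_domain_text.split())
    bg_c = Counter(background_text.split())
    vocab = len(set(in_domain_text.split()) | set(background_text.split()))
    in_tot, bg_tot = sum(in_c.values()), sum(bg_c.values())
    return [s for s in candidates
            if unigram_logprob(s, in_c, in_tot, vocab)
             - unigram_logprob(s, bg_c, bg_tot, vocab) > threshold]
```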
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 2012
The speed of modern processors has remained constant over the last few years, but integration capacity continues to follow Moore's law; thus, to be scalable, applications must be parallelized. This paper presents results on using the A* search algorithm in a parallel large-vocabulary speech recognition system. This algorithm allows better parallelization than the Viterbi algorithm. First experiments with a “unigram approximation” heuristic resulted in approximately 8.7 times fewer states being explored compared to our classical Viterbi decoder. The multi-threaded implementation of the A* decoder led to a speed-up factor of 3 over its sequential counterpart.
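The core of A* decoding is a best-first search over partial hypotheses ranked by the cost accumulated so far plus a heuristic estimate of the remaining cost; a unigram approximation uses cheap unigram word costs as that estimate. A toy sketch of the priority-queue loop (the graph representation and cost functions are illustrative, not the paper's implementation):

```python
import heapq
from itertools import count

def a_star_decode(start, is_final, successors, heuristic):
    """Best-first search over partial hypotheses, ranked by
    f = g (cost so far) + h (heuristic estimate of remaining cost).

    successors(state) yields (next_state, arc_cost) pairs.
    With an admissible heuristic (one that never overestimates,
    e.g. a unigram lower bound on remaining word costs), far fewer
    states are explored than with exhaustive Viterbi expansion.
    """
    tie = count()  # break priority ties without comparing states
    frontier = [(heuristic(start), next(tie), 0.0, start, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if is_final(state):
            return path, g
        for nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + heuristic(nxt), next(tie),
                                          g2, nxt, path + [nxt]))
    return None, float("inf")
```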
2012 11th International Conference on Information Science, Signal Processing and their Applications, ISSPA 2012, 2012
The speed of modern processors has remained constant over the last few years, but integration capacity continues to follow Moore's law; thus, to be scalable, applications must be parallelized. In addition to the main CPU, almost every computer is equipped with a Graphics Processing Unit (GPU), which is in essence a specialized parallel processor. This paper explores how the performance of large-vocabulary speech recognition systems can be enhanced by using the A* algorithm, which allows better parallelization than the Viterbi algorithm, together with a GPU for the acoustic computations. First experiments with a “unigram approximation” heuristic resulted in approximately 8.7 times fewer states being explored compared to our classical Viterbi decoder. The multi-threaded implementation of the A* decoder combined with a GPU for acoustic computation led to a speed-up factor of 5.2 over its sequential counterpart and a 5% absolute improvement in accuracy over the sequential Viterbi search at real time.
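Offloading the acoustic computations to a GPU works because per-frame scores for all acoustic states can be computed as one large batched matrix operation. A minimal sketch with PyTorch standing in for the paper's GPU code (PyTorch is our substitution; the paper predates it and presumably used CUDA directly, and the linear scorer here is illustrative):

```python
import torch

def batch_acoustic_scores(frames: torch.Tensor, weights: torch.Tensor,
                          biases: torch.Tensor) -> torch.Tensor:
    """Score every frame against every acoustic state in one GPU call.

    frames:  (num_frames, feat_dim) acoustic features
    weights: (num_states, feat_dim) one linear scorer per state
    biases:  (num_states,)
    Returns: (num_frames, num_states) matrix of state scores.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    frames, weights, biases = (t.to(device) for t in (frames, weights, biases))
    return frames @ weights.T + biases  # single batched matmul

scores = batch_acoustic_scores(torch.randn(500, 40),
                               torch.randn(2000, 40),
                               torch.randn(2000))  # shape (500, 2000)
```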
Proceedings of the 19th International Conference on Computational Linguistics, 2002
The objective of this work is to disambiguate transducers of the form T = R • D so that the determinization algorithm described in prior work can be applied to them. Our approach to disambiguating T = R • D consists of first computing the composition T and thereafter disambiguating the transducer T. We give an important consequence of this result that allows us to compose any number of transducers R with the transducer D, in contrast to the previous approach, which consisted in first disambiguating the transducers D and R to produce D′ and R′ respectively, then computing T′ = R′ • D′, where T′ is unambiguous. We present results in the case of a transducer D representing a dictionary and R representing phonological rules.
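Composition of weighted transducers, written T = R • D above, has a standard definition; a sketch in the usual semiring notation (standard textbook notation, not necessarily the paper's):

```latex
% Composition of transducers R : X -> Y and D : Y -> Z.
% The weight T assigns to an input/output pair (x, z) sums, over all
% intermediate strings y, the product of the weight R assigns to (x, y)
% and the weight D assigns to (y, z):
T(x, z) \;=\; (R \circ D)(x, z) \;=\; \bigoplus_{y}\, R(x, y) \otimes D(y, z)
% T is ambiguous when some pair (x, z) is produced by several distinct
% successful paths; disambiguation keeps a single path per pair.
```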
IEEE Transactions on Audio, Speech, and Language Processing, 2000
The speed of modern processors has remained constant over the last few years, but integration capacity continues to follow Moore's law; thus, to be scalable, applications must be parallelized. The parallelization of the classical Viterbi beam search has been shown to be very difficult on multi-core processor architectures or massively threaded architectures such as Graphics Processing Units (GPUs). The problem with this approach is that active states are scattered in memory and thus cannot be efficiently transferred to the processor memory. This problem can be circumvented by using the A* search, which uses a heuristic to significantly reduce the number of explored hypotheses. The main advantage of this algorithm is that the processing time is moved from the search in the recognition network to the computation of heuristic costs, which can be designed to take advantage of parallel architectures. Our parallel implementation of the A* decoder on a 4-core processor with a GPU led to a speed-up factor of 6.13 compared to the Viterbi beam search at its maximum capacity and a 4% absolute improvement in accuracy at real time.
This paper describes methods for integrating source language and target language information for machine-aided human translation (MAHT) of text documents. These methods are applied to a language translation task involving a human translator dictating a first-draft translation of a source language document. A method is presented which integrates target language automatic speech recognition (ASR) models with source language statistical machine translation (SMT) and named entity recognition (NER) information at the phonetic level. Information extracted from a source language document, including translation model probabilities and translated named entities, is combined with acoustic-phonetic information obtained from phone lattices produced by the ASR system. Phone-level integration allows the combined MAHT system to correctly decode words that are either not in the ASR vocabulary or would have been incorrectly decoded by the ASR system. It is shown that the combined MAHT system r...
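One simple way to picture the phone-level integration described above: candidate target words proposed by the translation model are mapped to phone sequences and matched against the ASR phone lattice, so a word absent from the ASR word vocabulary can still be recovered. A toy sketch under that reading (the data structures, scoring, and example words are illustrative, not the paper's actual method):

```python
def rescore_with_translation(phone_hyps, lexicon, tm_logprob, weight=1.0):
    """Combine phone-lattice acoustic scores with translation model scores.

    phone_hyps: list of (phone_tuple, acoustic_logprob) from the ASR lattice
    lexicon:    maps a target word to its phone tuple (may include words
                outside the ASR vocabulary, e.g. translated named entities)
    tm_logprob: maps a target word to its translation model log-probability
    Returns the best-scoring (word, combined_score), or None.
    """
    acoustic = dict(phone_hyps)
    best = None
    for word, phones in lexicon.items():
        if phones in acoustic:  # word's pronunciation appears in the lattice
            score = acoustic[phones] + weight * tm_logprob.get(word, -1e9)
            if best is None or score > best[1]:
                best = (word, score)
    return best

# Example: "Gagnon" is out of the ASR word vocabulary but its phones were decoded
hyps = [(("g", "a", "n", "o", "n"), -12.0), (("k", "a", "n", "o", "n"), -15.5)]
lex = {"Gagnon": ("g", "a", "n", "o", "n"), "canon": ("k", "a", "n", "o", "n")}
tm = {"Gagnon": -2.0, "canon": -6.0}
print(rescore_with_translation(hyps, lex, tm))  # -> ('Gagnon', -14.0)
```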
Live closed captions for deaf and hard-of-hearing audiences are currently produced by stenographers, or by voice writers using speech recognition. Both techniques can produce captions with errors. We are currently developing a correction module that allows a user to intercept the real-time caption stream and correct it before it is broadcast. We report results of preliminary experiments on correction rate and actual user performance using a prototype correction module connected to the output of a speech recognition captioning system.
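A correction module of this kind typically holds each caption in a short delay window during which an operator can amend it before release. A minimal sketch of that intercept-and-correct buffer (the delay value and interface are assumptions for illustration, not details from the paper):

```python
import time
from collections import deque

class CaptionCorrector:
    """Delay captions briefly so an operator can fix them before broadcast."""

    def __init__(self, delay_seconds: float = 3.0):
        self.delay = delay_seconds
        self.pending = deque()  # (release_time, caption_id, text)
        self.edits = {}         # caption_id -> corrected text

    def ingest(self, caption_id: int, text: str) -> None:
        """Receive a caption from the recognizer; hold it for `delay` seconds."""
        self.pending.append((time.monotonic() + self.delay, caption_id, text))

    def correct(self, caption_id: int, new_text: str) -> None:
        """Operator replaces a caption's text while it is still pending."""
        self.edits[caption_id] = new_text

    def release_due(self) -> list:
        """Return captions whose delay has expired, applying any edits."""
        now, out = time.monotonic(), []
        while self.pending and self.pending[0][0] <= now:
            _, cid, text = self.pending.popleft()
            out.append(self.edits.pop(cid, text))
        return out
```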
In this paper, we present the approach we used to produce a training database from a set of recorded newscasts for which we had inaccurate transcriptions. These transcribed segments correspond to a set of prepared anchor texts and journalist stories, not necessarily in the chronological order of their actual presentation. No segmental time boundary information is provided. Our main concern is thus to establish time marks that delimit the audio segments corresponding to the texts. To resolve this problem, we developed a time-marking procedure using our speech recognition engine. We obtain a segmentation accuracy of 80%.
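A common way to recover such time marks is to run the recognizer over the audio to get time-stamped words, then align that hypothesis against each prepared text and read segment boundaries off the matching regions. A minimal sketch of the alignment step, assuming time-stamped recognizer output is available (difflib stands in for whatever matcher the paper's engine used):

```python
from difflib import SequenceMatcher

def mark_segment(reference_words, hyp_words, hyp_times):
    """Locate a prepared text inside time-stamped recognizer output.

    reference_words: words of one prepared story or anchor text
    hyp_words:       words output by the recognizer over the newscast
    hyp_times:       (start, end) in seconds for each hypothesis word
    Returns (start_time, end_time) spanning the matched region, or None.
    """
    matcher = SequenceMatcher(a=reference_words, b=hyp_words, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return None
    first, last = blocks[0], blocks[-1]
    return hyp_times[first.b][0], hyp_times[last.b + last.size - 1][1]

ref = "the prime minister announced new funding today".split()
hyp = "uh the prime minister announced new funding today and".split()
times = [(i * 0.4, i * 0.4 + 0.35) for i in range(len(hyp))]
print(mark_segment(ref, hyp, times))  # -> (0.4, ~3.15)
```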