Fergus McInnes | University of Edinburgh
Papers by Fergus McInnes
The recognition performances of two front ends are compared for two continuous speech recognition tasks. The first front end was a neural network model (NNM), with frame labeling performed by a radial basis function network and segmentation by a Viterbi algorithm. The second front end was a discrete hidden Markov model (HMM), featuring explicit state duration probability distributions. Two experiments were performed. The first used a speaker-dependent database, with a lexicon of 571 words. Using a low-perplexity grammar, the NNM front end produced a word accuracy of 94% and a sentence accuracy of 86%. This was slightly inferior to the HMM front end, which produced a word accuracy of 96% and a sentence accuracy of 88%. Without a grammar, word accuracies of 58% (NNM) and 49% (HMM) were recorded. The second set of experiments used the MIT portion of the TIMIT database (415 speakers and 2072 sentences in total). Results were poor for both front ends, with the NNM producing marginally better results.
The authors describe the results of an experiment to study the effectiveness of using acoustic stress to improve automatic speech recognition. The CSTR speech recognition system uses hidden semi-Markov models (HSMM) with a separate lexical search component. A hybrid prosodic component has been included which determines the sentence-level stress and marks the vowels of stressed syllables as stressed in the phoneme lattice. Lexical stress is marked on all content words in the lexicon. Adding stress information to the system in this way resulted in a 65% reduction in word error rate and a 45% reduction in sentence error rate, relative to a baseline system without prosody.
Journal of The Acoustical Society of America, 2002
This study compared acoustic and electroglottographic (EGG) jitter from [a] vowels of 103 dysphonic speakers. The EGG recordings were chosen according to their intensity, signal-to-noise ratio, and percentage of unvoiced intervals, while acoustic signals were selected based on voicing detection and the reliability of jitter extraction. The agreement between jitter measures was expressed numerically as a normalized difference. In 63.1% (65/103) of the cases the differences fell within ±22.5%. Positive differences above +22.5% were associated with increased acoustic jitter and occurred in 12.6% (13/103) of the speakers. These were, typically, cases of small nodular lesions without problems in the posterior larynx. On the other hand, substantial rises in EGG jitter leading to differences below -22.5% took place in 24.3% (25/103) of the speakers and were related to hyperfunctional voices, creaky-like voices, small laryngeal asymmetries affecting the arytenoids, or small-to-moderate glottal chinks. A clinically relevant outcome of the study was the possibility of detecting gentle laryngeal asymmetries among cases of large unilateral increase in EGG jitter. These asymmetries can be linked with vocal problems that are often overlooked in endoscopic examinations.
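The proportions quoted in this abstract can be checked with simple arithmetic. The sketch below only recomputes the three percentages from the counts given above; the labels and function name are illustrative and do not come from the paper, and the normalized difference itself is not reproduced because the abstract does not give its formula.

```python
# Counts taken from the abstract above; the dictionary keys are illustrative labels.
TOTAL_SPEAKERS = 103
counts = {
    "within_band": 65,      # differences within +/-22.5%
    "acoustic_higher": 13,  # differences above +22.5%
    "egg_higher": 25,       # differences below -22.5%
}


def percentage(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


percentages = {label: percentage(n, TOTAL_SPEAKERS) for label, n in counts.items()}

# The three groups should account for every speaker in the study.
all_speakers_covered = sum(counts.values()) == TOTAL_SPEAKERS
```

The three groups sum to 103, so every speaker falls into exactly one of the bands.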
IEEE Transactions on Speech and Audio Processing, 2001
This paper addresses the problem of temporal constraints in the Viterbi algorithm using conditional transition probabilities. The results presented here suggest that, in a speaker-dependent small-vocabulary task, the statistical modelling of state durations is not relevant if maximum and minimum state duration restrictions are imposed, and that truncated probability densities give better results than a previously proposed metric [1]. Finally, context-dependent and context-independent temporal restrictions are compared in a connected word speech recognition task, and it is shown that the former lead to better results with the same computational load.
This paper addresses the problem of speech recognition with signals corrupted by additive noise at moderate SNR. A technique based on spectral subtraction and noise cancellation reliability weighting in acoustic pattern matching algorithms is studied. A model for additive noise is proposed and used to compute the variance of the hidden clean signal information and the reliability of the spectral subtraction process. The results presented show that a proper weight on the information provided by static parameters can substantially reduce the error rate.
Speech Communication, 2004
Surname capture via automatic speech recognition over the telephone has many commercial applications, including automated directory assistance and travel reservation services. This paper presents a usability evaluation of three different dialogue designs for automated surname capture, within the context of a flight reservation service. The three designs explored were: a Speak Only strategy, in which callers simply say the surname; a One Stage Speak and Spell strategy in which callers speak and spell the surname in a single utterance; and a Two Stage Speak and Spell strategy in which callers speak and spell the surname in two separate dialogue stages. The methodology employed in the research provides both quantitative user attitude data and performance results for each of the strategies, based on an empirical study with a cohort of 95 participants. The results show a clear distinction between strategies. User attitude towards the dialogues that involve both speaking and spelling the name is high. User attitude towards the Speak Only strategy is significantly less positive. Task completion rates are also significantly higher in the two strategies that involve spelling the name, at around 80% compared to just over 50% in the Speak Only strategy. The data underline the importance of user testing, demonstrating the value of the evaluation methodology used, and provide encouraging results for the strategies that involve both speaking and spelling the name.
Pattern Recognition, 1996
Some fast clustering algorithms for vector quantization (VQ) based on the LBG recursive algorithm are presented and compared. Experimental results with speech data, in comparison to the conventional VQ clustering algorithm, demonstrate that the best approach saves more than 99% of the multiplications, as well as a considerable number of additions. The increase in the number of comparisons is moderate. An improved absolute error inequality (AEI) criterion for the Euclidean distortion measure is also proposed and utilized in the VQ clustering algorithm.
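To illustrate why an inequality-based rejection test can save multiplications in a nearest-codeword search, here is a minimal sketch. It uses a basic per-dimension test: if any single coordinate differs by at least the square root of the best squared distance found so far, the squared Euclidean distance cannot be smaller, so the codeword is skipped without computing it. This is an assumed, generic form of the idea and not the improved AEI criterion proposed in the paper, which the abstract does not spell out; the function name and return values are hypothetical.

```python
import math


def nearest_codeword(x, codebook):
    """Find the nearest codeword under squared Euclidean distortion.

    Codewords are skipped (no multiplications) when a per-dimension absolute
    difference already rules them out. Returns (index, squared_distance, n_rejected).
    """
    best_index = -1
    best_dist = math.inf
    rejected = 0
    for index, codeword in enumerate(codebook):
        # Infinite until a first candidate has been measured, so nothing is skipped before then.
        threshold = math.sqrt(best_dist)
        if any(abs(a - b) >= threshold for a, b in zip(x, codeword)):
            rejected += 1
            continue
        dist = sum((a - b) ** 2 for a, b in zip(x, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist, rejected


# Illustrative data only.
example_codebook = [(0.0, 0.0), (10.0, 10.0), (1.0, 1.0), (5.0, 0.0)]
example_result = nearest_codeword((1.2, 0.9), example_codebook)
```

The test costs only comparisons and subtractions, which matches the trade-off described above: fewer multiplications in exchange for a moderate number of extra comparisons.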
Electronics Letters, 1996
Implementation: If we compare the eight-state trellis from Fig. 1 to the EPR4 trellis we note that they are structurally identical. This means that they have the same number of states, the same number of branches, and the same connectivity. This suggests that the eight-state Viterbi ...
Computer-based acoustical analysis of speech has developed significantly in recent years. The subjective evaluation of a clinician is complemented with an objective measure of relevant parameters of voice. Praat, MDVP and SAV are some examples of software for speech analysis. In this paper we describe an algorithm for the estimation of the fundamental frequency that considers the non-periodic nature of the speech signal under analysis. The experiments show that the use of these estimated f0 values reduces the errors in perturbation measures of f0, compared to the errors of other state-of-the-art speech analysis software, such as Praat and MDVP.
Multiple mutation rates, multiple crossover rates and a reinitialization scheme are applied to parallel genetic algorithms for assigning the codevector indices for noisy channels, with the purpose of minimizing the distortion caused by bit errors. Experimental results based on the memoryless binary symmetric channel for any bit error demonstrate the robustness of this new approach compared with the authors' previous work. The property of multiple global optima is also emphasized.
IEEE Transactions on Speech and Audio Processing, 1998
Addresses the problem of speech recognition with signals corrupted by additive noise at moderate signal-to-noise ratio (SNR). A model for additive noise is presented and used to compute the uncertainty about the hidden clean signal so as to weight the estimation provided by spectral subtraction. Weighted dynamic time warping (DTW) and Viterbi (HMM) algorithms are tested, and the results show that weighting the information along the signal can substantially increase the performance of spectral subtraction, an easily implemented technique, even with a poor estimation for noise and without using any information about the speaker. It is also shown that the weighting procedure can reduce the error rate when cepstral mean normalization is also used to cancel the convolutional noise.
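As a reminder of what plain spectral subtraction does, the sketch below subtracts a noise magnitude estimate from a noisy magnitude spectrum, bin by bin, and clamps the result to a small fraction of the noisy value. The function name, the `floor` parameter and the sample values are assumptions made for illustration; the reliability weighting studied in the paper is not reproduced here, since the abstract does not specify it.

```python
def spectral_subtract(noisy_mag, noise_mag, floor=0.01):
    """Basic magnitude spectral subtraction with a spectral floor.

    noisy_mag and noise_mag are equal-length sequences of non-negative magnitudes.
    Each output bin is max(noisy - noise, floor * noisy), so no bin goes negative.
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(y - n, floor * y) for y, n in zip(noisy_mag, noise_mag)]


# Illustrative values only: the noise estimate exceeds the signal in the last bin.
cleaned = spectral_subtract([1.0, 0.5, 0.2], [0.25, 0.5, 0.5])
```

Bins where the noise estimate is close to or above the observed magnitude end up at the floor, which is exactly where the estimate of the clean signal is least trustworthy and where a reliability weight would matter most.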
Electronics Letters, 1989
Electronics Letters, 1996
IEE Proceedings - Vision, Image and Signal Processing, 1996
A bound for a Minkowski metric based on the Lp distortion measure is proposed and evaluated as a means to reduce the computation in vector quantisation. This bound provides a better criterion than the absolute error inequality (AEI) elimination rule on the Euclidean distortion measure. For the Minkowski metric of order n, this bound contributes the elimination criterion from the L1 metric to the Ln metric. This bound can also be extended to a quadratic metric which can be applied to the hidden Markov model with Gaussian mixture probability density function.
... Spain, September 18-21, 1995 METHODOLOGICAL ASPECTS IN A MULTIMEDIA DATABASE OF VOCAL FOLD PATHOLOGIES Maurilio Nunes Vieira**, Fergus McInnes, Mervyn Jack, CCIR ... [10] HF Robinson, "Assessment of Voice Problems" in J. R. Beech, L. Harding with D ...