Jim Hieronymus - Academia.edu
Papers by Jim Hieronymus
Journal of the Acoustical Society of America, Oct 1, 1984
This paper investigates a weighted finite state transducer (WFST) based syllable decoding and transduction method for keyword search (KWS), and compares it with sub-word search and phone confusion methods in detail. Acoustic context dependent phone models are trained from word forced alignments and then used for syllable decoding and lattice generation. Out-of-vocabulary (OOV) keyword pronunciations are produced using a grapheme-to-syllable (G2S) system and then used to construct a lexical transducer. The lexical transducer is then composed with a keyword-boosted language model (LM) to transduce the syllable lattices to word lattices for final KWS. Word Error Rates (WER) and KWS results are reported for five different languages. It is shown that the syllable transduction method gives comparable KWS results to the syllable search and phone confusion methods. Combination of these three methods further improves OOV KWS performance.
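The lexicon-compose-LM step described above can be illustrated with a toy weighted transducer composition in the tropical semiring (weights add along a path). This is a minimal sketch, not the paper's implementation: the arc format, symbols, and weights below are all invented for illustration, and real WFST toolkits handle epsilon transitions and determinization far more carefully.

```python
# Toy weighted transducer composition in the tropical semiring.
# A transducer is a list of arcs: (src, in_label, out_label, dst, weight).

def compose(t1, t2):
    """Compose T1 with T2: match each T1 output label to a T2 input label.
    Resulting states are pairs; path weights add (tropical semiring)."""
    arcs = []
    for (s1, i1, o1, d1, w1) in t1:
        for (s2, i2, o2, d2, w2) in t2:
            if o1 == i2:
                arcs.append(((s1, s2), i1, o2, (d1, d2), w1 + w2))
    return arcs

# Hypothetical lexical transducer: syllable sequence -> word.
lex = [(0, "hel", "<eps>", 1, 0.5),
       (1, "lo", "hello", 2, 0.3)]
# Hypothetical word-level LM acceptor (identity labels carry an LM score),
# with an epsilon self-loop so mid-word lexical arcs can pass through.
lm = [("A", "<eps>", "<eps>", "A", 0.0),
      ("A", "hello", "hello", "B", 1.2)]

composed = compose(lex, lm)
for arc in composed:
    print(arc)
```

Composing the two yields a machine that maps the syllable sequence "hel lo" to the word "hello" with the lexical and LM weights summed, which is the same shape of operation as composing syllable lattices with the lexical transducer and keyword-boosted LM in the abstract.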
Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2003
We present a demonstration of a prototype system aimed at providing support with procedural tasks for astronauts on board the International Space Station. Current functionality includes navigation within the procedure, previewing steps, requesting a list of images or a particular image, recording voice notes and spoken alarms, and setting parameters such as audio volume. Dialogue capabilities include handling spoken corrections for an entire dialogue move, reestablishing context in response to a user request, responding to user barge-in, and help on demand. The current system has been partially reimplemented for better efficiency and in response to feedback from astronauts and astronaut training personnel. Added features include visual and spoken step previewing, and spoken correction of dialogue moves. The intention is to introduce the system into astronaut training as a prelude to flight on board the International Space Station.
iaf, 2002
New missions of space exploration will require unprecedented levels of autonomy to successfully accomplish their objectives. Both inherent complexity and communication distances will preclude levels of human involvement common to current and previous space flight missions. With exponentially increasing capabilities of computer hardware and software, including networks and communication systems, a new balance of work is being developed between humans
Seven English monophthong vowels were studied in continuous sentences. The purpose of the study was to determine what methods are likely to be successful in compensating for coarticulation in all vowel and consonantal contexts. A method by Kuwahara has been examined in detail. The Kuwahara compensation improves the separation of the vowel regions in a space composed of the first and second formants. Important issues are where to measure the formant "target" frequencies, how to obtain good formant tracks, how to measure speaking rate accurately, and how to label vowels accurately.
Speech is the most natural modality for humans to use to communicate with other people, agents and complex systems. A spoken dialogue system must be robust to noise and able to mimic human conversational behavior, like correcting misunderstandings, answering simple questions about the task and understanding most well formed inquiries or commands. The system aims to understand the meaning of the human utterance, and if it does not, then it discards the utterance as being meant for someone else. The first operational system is Clarissa, a conversational procedure reader and navigator, which will be used in a System Development Test Objective (SDTO) on the International Space Station (ISS) during Expedition 10. In the present environment one astronaut reads the procedure on a Manual Procedure Viewer (MPV) or paper, and has to stop to read or turn pages, shifting focus from the task. Clarissa is designed to read and navigate ISS procedures entirely with speech, while the astronaut has his eyes and hands engaged in performing the task. The system also provides an MPV-like graphical interface so the procedure can be read visually. A demo of the system will be given.
International Conference on Acoustics, Speech, and Signal Processing, Jan 13, 2003
An ongoing study is reported of all sixteen of the American English vowels using subsets of the DARPA acoustic-phonetic database. Formants are obtained and normalized for each talker's formant range based on one sentence. The resulting formant tracks are smoothed using splines and sampled at nine equally spaced points in time within vowel-centered triphone regions. Triphones with semivowels in them are clustered separately. These formant values are k-means clustered using subsets of the sampled formant values. Additional supervised training is done using other parameters, including duration. The resulting clusters are used as a classifier on the basis of the modified Euclidean distance from the cluster centers. This results in approximately 80% first-choice vowel recognition at the outer edges of the vowel quadrilateral. Stressed vowels were found to have spectra which were statistically no more stable than those of unstressed vowels.
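The cluster-then-classify scheme above can be sketched as a nearest-centroid classifier in (F1, F2) space. This is only an illustration: the centroid positions and the per-dimension weights of the "modified Euclidean distance" below are invented numbers, not values from the study, which also used more dimensions (sampled formant tracks and duration).

```python
# Toy nearest-centroid vowel classifier in (F1, F2) space.
import math

# Hypothetical (F1, F2) cluster centers in Hz for three corner vowels.
centroids = {
    "iy": (280.0, 2250.0),
    "aa": (710.0, 1100.0),
    "uw": (310.0, 870.0),
}

# Invented weights for a "modified" (weighted) Euclidean distance,
# de-emphasizing F2 so both dimensions contribute comparably.
weights = (1.0, 0.25)

def classify(f1, f2):
    def dist(label):
        c1, c2 = centroids[label]
        return math.sqrt(weights[0] * (f1 - c1) ** 2 +
                         weights[1] * (f2 - c2) ** 2)
    return min(centroids, key=dist)

print(classify(300.0, 2100.0))  # a high front vowel lands nearest "iy"
```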
Journal of the Acoustical Society of America, May 1, 1994
A language identification (LID) system that uses phonemotactic models in addition to phoneme models to identify languages is described. The proposed LID system is trained and tested using the OGI multilanguage telephone speech database. Continuous density second-order ergodic variable-duration hidden Markov phonemic models are trained for each language using a high accuracy phoneme recognition system developed at Bell Laboratories. The phonemotactic models for each language are trained using text corpora of about ten million words and grapheme-to-phoneme converters. The language L_i of an incoming speech signal x is hypothesized as the one that produces the highest likelihood f(x|λ_i) f(λ_i|L_i) over all the phonemic models λ_i of a given set with the phonemotactic constraint. Initially, this LID system was trained and evaluated for English/Spanish language identification and the language identification was 83% correct (79% on English and 88% on Spanish). Results for four languages will be presented. The discriminative power of this LID system can be improved by mapping the phoneme lattice onto a syllable or a word sequence using a lexical analyzer and a trigram syllable or word language model. The language identification results with and without interfacing the lexical analyzer will be presented.
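The decision rule above — pick the language whose models maximize f(x|λ_i) f(λ_i|L_i) — is an argmax over per-language scores, conveniently done in the log domain where the product becomes a sum. A minimal sketch, with made-up stand-in scores rather than real acoustic or phonemotactic log-likelihoods:

```python
# Sketch of the LID decision rule: argmax over languages of
# log f(x|lambda_i) + log f(lambda_i|L_i).

def identify(acoustic_ll, phonemotactic_ll):
    """Both arguments: dict mapping language -> log-likelihood."""
    return max(acoustic_ll,
               key=lambda lang: acoustic_ll[lang] + phonemotactic_ll[lang])

# Hypothetical scores for one utterance.
acoustic = {"English": -410.2, "Spanish": -402.9}
phonemotactic = {"English": -55.0, "Spanish": -61.3}

print(identify(acoustic, phonemotactic))  # prints "Spanish"
```

Here the phonemotactic term acts as the prior-like constraint from the abstract: it can overturn or reinforce the acoustic evidence depending on how well the phoneme sequence fits each language's sequence model.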
Speech Communication, Dec 1, 1991
Vowel formant target frequencies from different talkers depend on the details of the vocal tract, sex, regional accent, speaking habits and other factors. Good vowel recognition and studies of vowels from different talkers require an accurate method for compensating for speaker differences in these frequencies. The major variance seen in the data is between males and females. However, even within the same sex class, there are large variations in the formant target frequencies for the same vowel in the same phonetic context. Various methods of compensating for speaker variation in formants were studied. Bark-scaled formants and subtraction of the Bark fundamental frequency from the first formant were tried first. In spite of recent published papers on the efficacy of this technique, it was found inadequate: the transformations were incapable of improving the clusters of the cardinal vowels, for example. A modification of the Gerstman technique, determining the speaker's formant range and then transforming into an "ideal" talker's range, was found to account for most of the variance due to different talkers given a small amount of training data. This technique was applied to vowel-in-context studies on American English. Formant ranges were studied for 125 talkers of General American English. Plots of formant ranges for males and females showed interesting patterns. The lower limit of the second formant was not very different, while the lower limit of the first formant was lower for males. Both the first and second formant maxima were larger for females. The modified Gerstman transformation was able to superimpose the formant targets for the same vowel in the same context from different talkers into the same region of F1, F2 space. There remained some residual variance between male and female, even after the transformation.
These trends are shown in a series of plots of vowel target frequency data.
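The range-normalization idea behind the modified Gerstman transform can be sketched as a linear remapping: measure a talker's formant range, then map each formant frequency into an "ideal" talker's range. The ranges below are illustrative placeholders, not the paper's measured values.

```python
# Sketch of a Gerstman-style formant range normalization.

# Hypothetical "ideal" talker's ranges in Hz, keyed by formant number.
IDEAL = {1: (250.0, 850.0), 2: (800.0, 2400.0)}

def gerstman(freq, fmin, fmax, formant):
    """Linearly map freq from this talker's [fmin, fmax]
    into the ideal talker's range for the given formant."""
    lo, hi = IDEAL[formant]
    return lo + (freq - fmin) / (fmax - fmin) * (hi - lo)

# A talker whose F1 spans 300-900 Hz: an F1 of 600 Hz (mid-range)
# maps to the midpoint of the ideal F1 range.
print(gerstman(600.0, 300.0, 900.0, formant=1))  # 550.0
```

The appeal of the method, as the abstract notes, is that only the talker's formant range must be estimated, so a small amount of speech suffices to calibrate a new speaker.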
4th International Conference on Spoken Language Processing (ICSLP 1996), Oct 3, 1996
A task-independent spoken Language Identification (LID) system which uses a Large Vocabulary Automatic Speech Recognition (LVASR) module for each language to choose the most likely language spoken is described in detail. The system has been trained on five languages: English, German, Japanese, Mandarin Chinese and Spanish. In this paper it is demonstrated that a LID system based on LVASR gives very good performance when trained and tested on a five-language subset (English, German, Spanish, Japanese, and Mandarin Chinese) of the Oregon Graduate Institute 11-language database. The performance advantage is shown for both long (50 second) and short (10 second) test utterances. The five-language results show 88% correct recognition for 50 second utterances without confidence measures and 98% correct with confidence measures. The recognition rate is 81% correct for 10 second utterances without confidence measures and 93% correct with confidence measures. The best performance has been obtained for systems trained on phonetically hand labeled speech. [Figure: block diagram in which the speech signal x(t) feeds parallel English, German, Japanese, Mandarin and Spanish phoneme recognition systems, whose outputs enter a Bayes classifier.]
The Chinese language is based on characters which are syllabic in nature. Since languages have syllabotactic rules which govern the construction of syllables and their allowed sequences, Chinese character sequence models can be used as a first-level approximation of allowed syllable sequences. N-gram character sequence models were trained on 4.3 billion characters. Characters are used as a first-level recognition unit with multiple pronunciations per character. For comparison, the CU-HTK Mandarin word-based system was used to recognize words which were then converted to character sequences. The character-only system error rates for one-best recognition were slightly worse than word-based character recognition. However, combining the two systems using log-linear combination gives better results than either system separately. An equally weighted combination gave consistent CER gains of 0.1-0.2% absolute over the word-based standard system.
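The log-linear combination step above amounts to a weighted sum of the two systems' log scores for each candidate character sequence, with equal weights in the reported configuration. A minimal sketch, with invented log-probabilities standing in for real system scores:

```python
# Sketch of equally weighted log-linear score combination.

def log_linear(scores, weights):
    """Combine per-system log scores: sum_i w_i * log P_i."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical log-probabilities from the character-only and
# word-based systems for one candidate character sequence.
char_sys, word_sys = -12.4, -11.8

combined = log_linear([char_sys, word_sys], [0.5, 0.5])
print(combined)
```

In practice the combined score would be computed for every candidate in both systems' hypothesis lists and the best-scoring character sequence selected; the equal weighting reflects the abstract's finding that a 0.5/0.5 split already yields consistent CER gains.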
There is general agreement that sentential syllable vowel stress (called prominence by some authors) in American English is marked by pitch rise-falls, energy, and duration. None of these cues by itself is sufficient; instead, combinations of these cues are used by talkers to signal stress in continuous speech. After studying the stress marking strategies of 15 talkers of American English, an algorithm was devised which labels vowels with three levels of stress. The algorithm is based on combinations of pitch rise-falls, relative energy and duration. The pitch is determined automatically in all voiced regions in the sentence. Then the regions are characterised as having rising pitch, falling pitch or steady pitch. Sequences of three regions are examined to find the pitch rise-fall patterns which signal stress. The energy in the band 0-2500 Hz is determined throughout the utterance. All the energy measurements are made relative to the maximum energy in the sentence. If the energy of the vowel is within 11 dB of the maximum it is considered energy stressed. The duration is determined from hand labels in the present implementation. Duration is corrected for prepausal effects. If two out of three cues are present, then the vowel is labelled stressed. If the vowel has the highest energy, longest duration, and highest pitch then it is labeled as highly stressed. If the vowel has very low energy relative to the loudest sound in the sentence, then it is labeled unstressed no matter what the other two cues indicate. The algorithm was tested on 125 sentences of American English and found to perform very well. The pitch stress was the most difficult cue. Detailed analysis of the results shows that approximately 85% of the syllables are correctly stress labelled.
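The decision rules above can be sketched directly in code: a vowel is stressed if at least two of the three cues fire, highly stressed if it tops all three, and unstressed if its energy is very low regardless of the other cues. The 11 dB energy window comes from the abstract; the -25 dB "very low energy" floor is an invented threshold for illustration, since the abstract does not give one.

```python
# Sketch of the two-out-of-three stress labelling rule.

def stress_level(pitch_cue, rel_energy_db, long_duration,
                 is_max_all_three=False):
    """pitch_cue: pitch rise-fall pattern detected (bool).
    rel_energy_db: vowel energy relative to sentence maximum (<= 0).
    long_duration: duration cue present after prepausal correction (bool).
    is_max_all_three: vowel has highest energy, longest duration
    and highest pitch in the sentence (bool)."""
    if rel_energy_db < -25.0:      # very low energy overrides other cues
        return "unstressed"
    if is_max_all_three:
        return "highly stressed"
    cues = [pitch_cue,
            rel_energy_db >= -11.0,  # within 11 dB of sentence maximum
            long_duration]
    return "stressed" if sum(cues) >= 2 else "unstressed"

print(stress_level(True, -8.0, False))   # pitch + energy cues fire
print(stress_level(False, -30.0, True))  # energy floor forces unstressed
```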
Journal of the Acoustical Society of America, Dec 1, 1986
We will demonstrate a spoken dialogue interface to a Geologist's Field Assistant that is being developed as part of NASA's Mobile Agents project. The assistant consists of a robot and an agent system which help an astronaut wearing a planetary space suit while conducting a geological exploration. The primary technical challenges relating to spoken dialogue systems that arise in this project are speech recognition in noise, open-microphone operation, and recording voice annotations. The system is capable of discriminating between speech intended for the system and speech intended for other purposes.