Stephanie Seneff | Massachusetts Institute of Technology (MIT) (original) (raw)

Papers by Stephanie Seneff

This paper describes our initial efforts at porting the VOY-AGER spoken language system to Japane... more This paper describes our initial efforts at porting the VOY-AGER spoken language system to Japanese. In the process we have reorganized the structure of the system so that language dependent information is separated from the core engine as much as possible. For example, this information is encoded in tabular or rule-based form for the natural language understanding and generation components. The internal system manager, discourse and dialogue component, and database are all maintained in language transparent form. Once the generation component was ported, data were collected from 40 native speakers of Japanese using a wizard collection paradigm. A portion of these data was used to train the natural language and segment-based speech recognition components. The system obtained an overall understanding accuracy of 52~0 on the test data, which is similar to our earlier reported results for English [i].

This paper describes a phase vocoder based technique for voice transformation. This method provid... more This paper describes a phase vocoder based technique for voice transformation. This method provides a flexible way to manipulate various aspects of the input signal, e.g., fundamental frequency of voicing, duration, energy, and formant positions, without explicit £ ¥ ¤ extraction. The modifications to the signal can be specific to any feature dimensions, and can vary dynamically over time. There are many potential applications for this technique. In concatenative speech synthesis, the method can be applied to transform the speech corpus to different voice characteristics, or to smooth any pitch or formant discontinuities between concatenation boundaries. The method can also be used as a tool for language learning. We can modify the prosody of the student's own speech to match that from a native speaker, and use the result as guidance for improvements. The technique can also be used to convert other biological signals, such as killer whale vocalizations, to a signal that is more appropriate for human auditory perception. Our initial experiments show encouraging results for all of these applications.

Prosodic cues (namely, fundamental frequency, energy and duration) provide important information ... more Prosodic cues (namely, fundamental frequency, energy and duration) provide important information for speech. For a tonal language such as Chinese, fundamental frequency (£ ¥ ¤) plays a critical role in characterizing tone as well, which is an essential phonemic feature. In this paper, we describe our work on duration and tone modeling for telephone-quality continuous Mandarin digits, and the application of these models to improve recognition. The duration modeling includes a speaking-rate normalization scheme. A novel £ ¤ extraction algorithm is developed, and parameters based on orthonormal decomposition of the £ ¤ contour are extracted for tone recognition. Context dependency is expressed by "tri-tone" models clustered into broad classes. A 20.0% error rate is achieved for four-tone classification. Over a baseline recognition performance of 5.1% word error rate, we achieve 31.4% error reduction with duration models, 23.5% error reduction with tone models, and 39.2% error reduction with duration and tone models combined.

Computer Speech & Language, 2007

This paper addresses the problem of recognizing a vocabulary of over 50,000 city names in a telep... more This paper addresses the problem of recognizing a vocabulary of over 50,000 city names in a telephone access spoken dialogue system. We adopt a two-stage framework in which only major cities are represented in the first stage lexicon. We rely on an unknown word model encoded as a phone loop to detect OOV city names (referred to as 'rare city' names). We use SpeM, a tool that can extract words and word-initial cohorts from phone graphs from a large fallback lexicon, to provide an N-best list of promising city name hypotheses on the basis of the phone graph corresponding to the OOV. This N-best list is then inserted into the second stage lexicon for a subsequent recognition pass. Experiments were conducted on a set of spontaneous telephone-quality utterances; each containing one rare city name. It appeared that SpeM was able to include nearly 75% of the correct city names in an N-best hypothesis list of 3000 city names. With the names found by SpeM to extend the lexicon of the second stage recognizer, a word accuracy of 77.3% could be obtained. The best one-stage system yielded a word accuracy of 72.6%. The absolute number of correctly recognized rare city names almost doubled, from 62 for the best one-stage system to 102 for the best two-stage system. However, even the best two-stage system recognized only about one-third of the rare city names retrieved by SpeM. The paper discusses ways for improving the overall performance in the context of an application.

ACM Transactions on Speech and Language Processing, Jul 1, 2006

This article describes our research on spoken language translation aimed toward the application o... more This article describes our research on spoken language translation aimed toward the application of computer aids for second language acquisition. The translation framework is incorporated into a multilingual dialogue system in which a student is able to engage in natural spoken interaction with the system in the foreign language, while speaking a query in their native tongue at any time to obtain a spoken translation for language assistance. Thus the quality of the translation must be extremely high, but the domain is restricted. Experiments were conducted in the weather information domain with the scenario of a native English speaker learning Mandarin Chinese. We were able to utilize a large corpus of English weather-domain queries to explore and compare a variety of translation strategies: formal, example-based, and statistical. Translation quality was manually evaluated on a test set of 695 spontaneous utterances. The best speech translation performance (89.9% correct, 6.1% incorrect, and 4.0% rejected), is achieved by a system which combines the formal and example-based methods, using parsability by a domain-specific Chinese grammar as a rejection criterion.

This paper addresses the issue of large-vocabulary recognition in a specific word class. We propo... more This paper addresses the issue of large-vocabulary recognition in a specific word class. We propose a two-pass strategy in which only major cities are explicitly represented in the first stage lexicon. An unknown word model encoded as a phone loop is used to detect OOV city names (referred to as rare city names). After which SpeM, a tool that can extract words and word-initial cohorts from phone graphs on the basis of a large fallback lexicon, provides an N-best list of promising city names on the basis of the phone sequences generated in the first stage. This N-best list is then inserted into the second stage lexicon for a subsequent recognition pass. Experiments were conducted on a set of spontaneous telephone-quality utterances each containing one rare city name. We tested the size of the N-best list and three types of language models (LMs). The experiments showed that SpeM was able to include nearly 85% of the correct city names into an N-best list of 3000 city names when a unigram LM, which also boosted the unigram scores of a city name in a given state, was used.

Learning a foreign language requires much practice outside of the classroom. Computerassisted lan... more Learning a foreign language requires much practice outside of the classroom. Computerassisted language learning systems can help fill this need, and one desirable capability of such systems is the automatic correction of grammatical errors in texts written by non-native speakers. This dissertation concerns the correction of non-native grammatical errors in English text, and the closely related task of generating test items for language learning, using a combination of statistical and linguistic methods. We show that syntactic analysis enables extraction of more salient features. We address issues concerning robustness in feature extraction from non-native texts; and also design a framework for simultaneous correction of multiple error types. Our proposed methods are applied on some of the most common usage errors, including prepositions, verb forms, and articles. The methods are evaluated on sentences with synthetic and real errors, and in both restricted and open domains. A secondary theme of this dissertation is that of user customization. We perform a detailed analysis on a non-native corpus, illustrating the utility of an error model based on the mother tongue. We study the benefits of adjusting the correction models based on the quality of the input text; and also present novel methods to generate high-quality multiple-choice items that are tailored to the interests of the user.

North American Chapter of the Association for Computational Linguistics, 2004

This paper describes our research on both the detection and subsequent resolution of recognition ... more This paper describes our research on both the detection and subsequent resolution of recognition errors in spoken dialogue systems. The paper consists of two major components. The first half concerns the design of the error detection mechanism for resolving city names in our MERCURY flight reservation system, and an investigation of the behavioral patterns of users in subsequent subdialogues involving keypad entry for disambiguation. An important observation is that, upon a request for keypad entry, users are frequently unresponsive to the extent of waiting for a time-out or hanging up the phone. The second half concerns a pilot experiment investigating the feasibility of replacing the solicitation of a keypad entry with that of a "speak-and-spell" entry. A novelty of our work is the introduction of a speech synthesizer to simulate the user, which facilitates development and evaluation of our proposed strategy. We have found that the speak-and-spell strategy is quite effective in simulation mode, but it remains to be tested in real user dialogues.

Language Resources and Evaluation, May 1, 2000

The GALAXY-II architecture, comprised of a centralized hub mediating the interaction among a suit... more The GALAXY-II architecture, comprised of a centralized hub mediating the interaction among a suite of human language technology servers, provides both a useful tool for implementing systems and also a streamlined way of configuring the evaluation of these systems. In this paper, we discuss our ongoing efforts in evaluation of spoken dialogue systems, with particular attention to the way in which the architecture facilitates the development of a variety of evaluation configurations. We furthermore propose two new metrics for automatic evaluation of the discourse and dialogue components of a spoken dialogue system, which we call "user frustration" and "information bit rate." 1 GALAXY-II has been designated as the initial common architecture for the multi-site DARPA Communicator project in the United States.

7th European Conference on Speech Communication and Technology (Eurospeech 2001)

5th International Conference on Spoken Language Processing (ICSLP 1998)

Prosodic cues (namely, fundamental frequency, energy and duration) provide important information ... more Prosodic cues (namely, fundamental frequency, energy and duration) provide important information for speech. For a tonal language such as Chinese, fundamental frequency (F 0) plays a critical role in characterizing tone as well, which is an essential phonemic feature. In this paper, we describe our work on duration and tone modeling for telephone-quality continuous Mandarin digits, and the application of these models to improve recognition. The duration modeling includes a speaking-rate normalization scheme. A novel F 0 extraction algorithm is developed, and parameters based on orthonormal decomposition of the F 0 contour are extracted for tone recognition. Context dependency is expressed by "tri-tone" models clustered into broader classes. A 20.0% error rate is achieved for four-tone classification. Over a baseline recognition performance of 5.1% word error rate, we achieve 31.4% error reduction with duration models, 23.5% error reduction with tone models, and 39.2% error reduction with duration and tone models combined.

We propose an induction algorithm to semi-automate grammar authoring in an interlingua-based mach... more We propose an induction algorithm to semi-automate grammar authoring in an interlingua-based machine translation framework. This algorithm uses a pre-existing one-way translation system from some other language to the target language as prior information to infer a grammar for the target language. We demonstrate the system's effectiveness by automatically inducing a Chinese grammar for a weather domain from its English counterpart, and showing that it can produce high-quality translation from Chinese back to English.

Proceedings of the workshop on Human Language Technology - HLT '94, 1994

In this paper, we describe a reversible letter-to-sound/soundto-letter generation system based on... more In this paper, we describe a reversible letter-to-sound/soundto-letter generation system based on an approach which combines a rule-based formalism with data-driven techniques. We adopt a probabilistic parsing strategy to provide a hierarchical lexical analysis of a word, including information such as morphology, stress, syllabification, phonemics and graphemics. Long-distance constraints are propagated by enforcing local constraints throughout the hierarchy. Our training and testing corpora are derived from the high-frequency portion of the Brown Corpus (10,000 words), augmented with markers indicating stress and word morphology. We evaluated our performance based on an unseen test set. The percentage of nonparsable words for letter-to-sound and sound-to-letter generation were 6% and 5% respectively. Of the remaining words our system achieved a word accuracy of 71.8~0 and a phoneme accuracy of 92.5% for letter-to-sound generation, and a word accuracy of 55.8% and letter accuracy of 89.4% for sound-to-letter generation. We also compared our hierarchical approach with an alternative, single-layer approach to demonstrate how the hierarchy provides a parsimonious description for English orthographic-phonological regularities, while simultaneously attaining competitive generation accuracy.

Proceedings of the 16th conference on Computational linguistics -, 1996

This paper describes our work-in-progress in automatic English-to-Korean text; translation. This ... more This paper describes our work-in-progress in automatic English-to-Korean text; translation. This work is an initial step toward the ultimate goal of text and speech translation for enhanced nmltilingual and multinational operations. For riffs puipose, we have adopted an interlintlua approach with natural language understmlding (TINA) and generation (GENESIS) modules at the core. We tackle the ambiguity problem t)y incorporating syntactic and semantic categories in |;he analysis grammar. Our system is capable of producing accurate translation of comt)lex sentences (38 words) and sentence fragments as well as average le.ngth (12 words) grammatical sentences. Two types of sysl, em ewJuatiou have 1)een carried out: one for grammar coverage and the other for overall performance. ],br system robustness, integration of two subsystems is under way: (i) a rule-based part-of-speech tagger to handle tinknown words/constructions, and (if) a word-forword translator to handle other system failures.

Proceedings of the 35th annual meeting on Association for Computational Linguistics -, 1997

Telegraphic messages with numerous instances of omission pose a new challenge to parsing in that ... more Telegraphic messages with numerous instances of omission pose a new challenge to parsing in that a sentence with omission causes a higher degree of ambi6uity than a sentence without omission. Misparsing reduced by omissions has a far-reaching consequence in machine translation. Namely, a misparse of the input often leads to a translation into the target language which has incoherent meaning in the given context. This is more frequently the case if the structures of the source and target languages are quite different, as in English and Korean. Thus, the question of how we parse telegraphic messages accurately and efficiently becomes a critical issue in machine translation. In this paper we describe a technical solution for the issue, and reSent the performance evaluation of a machine transtion system on telegraphic messages before and after adopting the proposed solution. The solution lies in a grammar design in which lexicalized grammar rules defined in terms of semantic categories and syntactic rules defined in terms of part-of-speech are utilized toether. The proposed grammar achieves a higher parsg coverage without increasing the amount of ambiguity/misparsing when compared with a purely lexicalized semantic grammar, and achieves a lower degree of. ambiguity/misparses without, decreasing the parsmg coverage when compared with a purely syntactic grammar.

Proceedings of the workshop on Speech and Natural Language - HLT '91, 1991

As part of our development of a spoken language system in the ATIS domain, we have begun a smail-... more As part of our development of a spoken language system in the ATIS domain, we have begun a smail-scale effort in collecting spontaneous speech data. Our procedure differs from the one used at 'i~xas Instruments (TI) in many respects, the most important being the reliance on an existing system, rather than a wizard, to participate in data collection. Over the past few months, we have collected over 3,600 spontaneously generated sentences from 100 subjects. This paper documents our data collection process, and makes some comparative anaiyses of our data with those collected at TI. The advantages as well as disadvantages of this method of data collection will be discussed.

International Journal of Mobile Human Computer Interaction, 2011

This paper proposes a paradigm for using speech to interact with computers, one that complements ... more This paper proposes a paradigm for using speech to interact with computers, one that complements and extends traditional spoken dialogue systems: speech for content creation. The literature in automatic speech recognition (ASR), natural language processing (NLP), sentiment detection, and opinion mining is surveyed to argue that the time has come to use mobile devices to create content on-the-fly. Recent work in user modelling and recommender systems is examined to support the claim that using speech in this way can result in a useful interface to uniquely personalizable data. A data collection effort recently undertaken to help build a prototype system for spoken restaurant reviews is discussed. This vision critically depends on mobile technology, for enabling the creation of the content and for providing ancillary data to make its processing more relevant to individual users. This type of system can be of use where only limited speech processing is possible.

Proceedings of the first international conference on Human language technology research - HLT '01, 2001

At MIT Lincoln Laboratory, we have been developing a Koreanto-English machine translation system ... more At MIT Lincoln Laboratory, we have been developing a Koreanto-English machine translation system CCLINC (Common Coalition Language System at Lincoln Laboratory). The CCLINC Korean-to-English translation system consists of two core modules, language understanding and generation modules mediated by a language neutral meaning representation called a semantic frame. The key features of the system include: (i) Robust efficient parsing of Korean (a verb final language with overt case markers, relatively free word order, and frequent omissions of arguments). (ii) High quality translation via word sense disambiguation and accurate word order generation of the target language. (iii) Rapid system development and porting to new domains via knowledge-based automated acquisition of grammars. Having been trained on Korean newspaper articles on "missiles" and "chemical biological warfare," the system produces the translation output sufficient for content understanding of the original document.

Proceedings of the workshop on Speech and Natural Language - HLT '91, 1992

As the DARPA spoken language community moves towards developing useful systems for interactive pr... more As the DARPA spoken language community moves towards developing useful systems for interactive problem solving, we must explore alternative evaluation procedures that measure whether these systems aid people in solving problems within the task domain. In this paper, we describe several experiments exploring new evaluation procedures. To look at end-to-end evaluation, we modified our data collection procedure slightly in order to experiment with several objective task completion measures. We found that the task completion time is well correlated with the number of queries used. We also explored log file evaluation, where evaluators were asked to judge the clarity of the query and the correctness of the response based on examination of the log file. Our results show that seven evaluators were unanimous on more than 80% of the queries, and that at least 6 out of 7 evaluators agreed over 90% of the time. Finally, we applied these new procedures to compare two systems, one system requiring a complete parse and the other using the more flexible robust parsing mechanism. We found that these metrics could distinguish between these systems: there were significant differences in ability to complete the task, number of queries required to complete the task, and score (as computed through a log file evaluation) between the robust and the non-robust modes.

2008 6th International Symposium on Chinese Spoken Language Processing, 2008

In this paper we introduce Mandarin language understanding methods developed for spoken language ... more In this paper we introduce Mandarin language understanding methods developed for spoken language applications. We describe a set of strategies to improve the parsing performance for Mandarin. We also discuss two context resolution techniques adopted to handle Chinese ellipsis in a practical Mandarin spoken dialogue system. Experimental evaluation verifies the effectiveness and efficiency of our proposed parsing enhancements, in terms of both parse coverage and speed. System evaluation with human subjects also verifies the effectiveness of our proposed approaches to speech understanding and context resolution in practical conversational systems.