Juan Lizaola Huerta - Academia.edu
Papers by Juan Lizaola Huerta
7th International Conference on Spoken Language Processing (ICSLP 2002)
In this paper we describe some recent improvements to the performance of the Aurora 2 noisy digits speech recognition system for the matched training and test condition. The algorithms that we used pertain to discriminant acoustic modeling based on the Maximum Mutual Information (MMI) criterion and non-linear speaker/channel adaptation through probability distribution function matching. In addition, we revisited our last year's baseline system and improved its performance through crossword context-dependent modeling and Gaussian mixture component selection using the Bayesian Information Criterion (BIC). The aggregated result is 93.3% word accuracy for the multi-condition training data scenario.
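The BIC-based component selection mentioned above trades model fit against model size: a richer mixture is kept only if its likelihood gain outweighs the penalty on its extra parameters. The following is a minimal illustrative sketch on synthetic 1-D data, not the paper's implementation; the moment-based fit, the fixed two-component comparison model, and the penalty weight of 1.0 are all assumptions made for the example.

```python
import math
import random

def gaussian_logpdf(x, mu, var):
    # log density of a univariate Gaussian
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def bic(log_likelihood, n_params, n_points, penalty=1.0):
    # BIC score = log L - (penalty/2) * k * log N; higher is better here
    return log_likelihood - 0.5 * penalty * n_params * math.log(n_points)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)]

# Model A: one Gaussian fit by moments (2 parameters: mean, variance)
mu = sum(data) / len(data)
var = sum((x - mu) ** 2 for x in data) / len(data)
ll_a = sum(gaussian_logpdf(x, mu, var) for x in data)
bic_a = bic(ll_a, 2, len(data))

# Model B: a two-component mixture with slightly shifted means
# (5 parameters: 2 means, 2 variances, 1 weight). On unimodal data the
# extra component barely changes the likelihood, so the BIC penalty
# makes the simpler model win.
ll_b = sum(
    math.log(0.5 * math.exp(gaussian_logpdf(x, mu - 0.1, var))
             + 0.5 * math.exp(gaussian_logpdf(x, mu + 0.1, var)))
    for x in data
)
bic_b = bic(ll_b, 5, len(data))

print(bic_a > bic_b)  # simpler model preferred on unimodal data
```

In a recognizer, the same comparison would be applied per HMM state to decide how many Gaussians each mixture retains.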
7th European Conference on Speech Communication and Technology (Eurospeech 2001)
In this paper we describe some experiments on the Aurora 2 noisy digits database. The algorithms that we used can be broadly classified into noise robustness techniques based on a linear-channel model of the acoustic environment, such as CDCN (1) and its novel variant termed Alignment-based CDCN (proposed here), and techniques which do not assume any particular knowledge about the structure of the environment or noise conditions affecting the speech signal, such as discriminant feature space transformations and speaker/channel adaptation. We present recognition experiments for both the clean training data and the multi-condition training data scenarios.
6th International Conference on Spoken Language Processing (ICSLP 2000)
An FAQ (frequently asked question) pattern consists of a question and a text document that answers the question and contains some additional remarks. When a query is similar to the FAQ's question, the FAQ's answer gives a possible answer, or part of the answer, to the query. On the other hand, an FAQ's answer may also contain information not related to the corresponding FAQ's question but embedding answers to other questions. For a given query, therefore, the answer can be obtained from both the FAQ question and the FAQ answer. In this paper, we propose a framework for Internet FAQ retrieval using spoken-language queries. We aim at two points: (1) extraction of the main intention embedded in a query sentence and (2) semantic comparison between a query sentence and an FAQ pattern. To evaluate the system performance, a collection of 1022 FAQ patterns and a set of 185 query sentences were collected for the experiments. In intention extraction, 91.9% of intention segments can be extracted correctly. Compared to the keyword-based approach, an improvement from 78.06% to 95.28% in recall rate for the top 10 candidates is obtained.
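The idea of matching a query against both the FAQ question and the FAQ answer can be illustrated with a plain bag-of-words cosine match. This is a hedged sketch, not the paper's intention-extraction or semantic-comparison method, and the toy FAQ entries are invented for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    # bag-of-words cosine similarity between two token lists
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# hypothetical FAQ patterns: (question, answer)
faqs = [
    ("how do I reset my password", "visit the account page and choose reset"),
    ("how do I change my email address", "open settings and edit the email field"),
    ("what are the support hours", "support is available 9am to 5pm weekdays"),
]

def retrieve(query, faqs, top_n=2):
    # score each FAQ by its best match against either its question OR its
    # answer, since (per the abstract) answers may also embed answers to
    # other questions
    q = query.split()
    scored = [
        (max(cosine(q, question.split()), cosine(q, answer.split())), question)
        for question, answer in faqs
    ]
    scored.sort(reverse=True)
    return [question for _, question in scored[:top_n]]

top = retrieve("reset my password", faqs)
print(top[0])
```

A real system would replace the raw token overlap with the paper's intention segments and semantic comparison.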
Interspeech 2004, 2004
We present a rapid compensation technique aimed at reducing the detrimental effect of environmental noise and channel on server-based mobile speech recognition. It solves two key problems for such systems: first, how to accurately separate non-speech events (or background noise) from noise introduced by network artifacts; second, how to reduce the latency created by the extra computation required for a codebook-based linear channel compensation technique. We address the first problem by modifying an existing energy-based endpoint-detection algorithm to provide segment-type information to the compensation module. We tackle the latency issue of the codebook-based scheme by employing a tree-structured vector quantization technique with dynamic thresholds to avoid the computation of all codewords. Our technique is evaluated using a speech-in-car database at 3 different speeds. Our results show that our method leads to an 8.7% reduction in error rate and a 35% reduction in computational cost.
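The tree-structured vector quantization idea, scoring only two child centroids per level instead of the entire codebook, can be sketched as follows. The toy codebook and the greedy descent are illustrative assumptions; the paper's dynamic-threshold pruning is omitted for brevity.

```python
import math

class VQNode:
    def __init__(self, centroid, left=None, right=None):
        self.centroid = centroid  # representative vector for this subtree
        self.left = left
        self.right = right

def dist(a, b):
    # Euclidean distance between two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def tree_vq_search(node, vector):
    # Descend the codebook tree, comparing only the two child centroids at
    # each level instead of scoring every leaf codeword. For a balanced
    # codebook of K leaves this costs O(log K) distance computations.
    while node.left is not None and node.right is not None:
        if dist(vector, node.left.centroid) <= dist(vector, node.right.centroid):
            node = node.left
        else:
            node = node.right
    return node.centroid

# a toy 2-level codebook over 1-D "features": 4 leaf codewords
leaves = [VQNode((c,)) for c in (-3.0, -1.0, 1.0, 3.0)]
root = VQNode((0.0,),
              VQNode((-2.0,), leaves[0], leaves[1]),
              VQNode((2.0,), leaves[2], leaves[3]))

print(tree_vq_search(root, (0.8,)))  # nearest leaf codeword
```

The greedy descent is not guaranteed to find the globally nearest codeword, which is why schemes like the paper's add thresholds to decide when the extra search effort is worthwhile.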
Proceedings of the 1st International Workshop on Natural Language Understanding and Cognitive Science, 2004
This paper describes an approach for building conversational applications that dynamically adjust to the user's level of expertise based on the user's responses. In our method, the Dialog Manager interacts with the application user through a mechanism that adjusts the prompts presented to the user based on a hierarchical model of the domain, the recent interaction history, and the known complexity of the domain itself. The goal is to present a conversational modality for experienced or confident users and a simpler Directed Dialog experience for more inexperienced users, and to dynamically identify these levels of expertise from the user's utterances. Our method uses a task hierarchy as a representation of the domain and follows a feedback control system framework to traverse this tree. We illustrate these mechanisms with a simple sample domain based on a car rental application.
Interspeech 2009, 2009
In this paper we describe the RTTS system for enterprise-level real-time speech recognition and translation. RTTS follows a Web Service-based approach which allows the encapsulation of ASR and MT technology components, thus hiding the configuration and ...
2006 IEEE International Conference on Multimedia and Expo, 2006
Systems designed to extract time-critical information from large volumes of unstructured data must include the ability, both from an architectural and an algorithmic point of view, to filter out unimportant data that might otherwise overwhelm the available resources. This paper presents an approach for data filtering to reduce computation in the context of a distributed speech processing architecture designed to detect or identify speakers. Here, filtering means either dropping and ignoring data or passing it on for further processing. The goal of the paper is to show that when the filter is designed to select and pass on the subset of the input data that best preserves the ability to recognize a specific desired speaker, or group of speakers, a large percentage of the data can be ignored while preserving most of the accuracy.
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, 2010
We present a new efficient algorithm for top-N match retrieval of sequential patterns. Our approach is based on an incremental approximation of the string edit distance (SED) using index information and a stack-based search. Our approach produces hypotheses with an average edit error of about 0.29 edits from the optimal SED result while using only about 5% of the CPU computation.
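A rough sketch of top-N retrieval under edit distance follows, using the current worst of the N best candidates as an early-abandon bound so that clearly worse strings are discarded before their full distance is computed. This stands in for the paper's index-based incremental approximation and stack search, which are not reproduced here.

```python
import heapq

def edit_distance(a, b, bound=None):
    # classic dynamic-programming Levenshtein distance with an optional
    # early-abandon bound: stop once every cell in a row exceeds the bound
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        if bound is not None and min(cur) > bound:
            return bound + 1  # distance definitely exceeds the bound
        prev = cur
    return prev[-1]

def top_n_matches(query, corpus, n=2):
    # keep the n smallest distances in a max-heap (negated values);
    # the current worst serves as the abandon bound for later candidates
    best = []
    for s in corpus:
        bound = -best[0][0] if len(best) == n else None
        d = edit_distance(query, s, bound)
        if len(best) < n:
            heapq.heappush(best, (-d, s))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, s))
    return sorted((-d, s) for d, s in best)

print(top_n_matches("kitten", ["sitting", "mitten", "kitchen", "banana"]))
```

Even this naive bound skips most of the dynamic-programming table for distant strings; the paper's indexing pushes the same idea much further.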
CHI '06 Extended Abstracts on Human Factors in Computing Systems, 2006
2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100)
We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge-based and automatic methods to map phonetic units from the source languages to the target language. We employed HMM adaptation techniques and Discriminative Model Combination to combine acoustic models from the individual source languages for recognition of speech in the target language. Experiments are described in which Czech Broadcast News is transcribed using acoustic models trained from small amounts of Czech read speech augmented by English, Spanish, Russian, and Mandarin acoustic models.
This paper describes our approach to developing novel vector-based measures of semantic similarity between a pair of sentences or utterances. Measures of this nature are useful not only in evaluating machine translation output, but also in other language understanding and information retrieval applications. We first describe the general family of existing vector-based approaches to evaluating semantic similarity and their general properties. We illustrate how this family can be extended by means of discriminatively trained semantic feature weights. Finally, we explore the problem of rephrasing (i.e., addressing the question: is sentence X a rephrase of sentence Y?) and present a new measure of the semantic linear equivalence between two sentences by means of a modified LSI approach based on the Generalized Singular Value Decomposition. (In this paper, for the sake of conciseness, we use the terms document, utterance, and sentence interchangeably; typically the nature of the task defines the specific type: for example, voice automation systems use utterances.)
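The extension by semantic feature weights can be pictured as a cosine similarity in which each term's contribution is scaled by a learned weight. The sketch below uses hand-set weights that simply down-weight function words; the paper trains these weights discriminatively, which is not reproduced here.

```python
import math
from collections import Counter

def weighted_cosine(a, b, weights):
    # cosine similarity where each term is scaled by a semantic feature
    # weight (hand-set here; discriminatively trained in the paper)
    ca, cb = Counter(a.split()), Counter(b.split())
    terms = set(ca) | set(cb)
    dot = sum(weights.get(t, 1.0) ** 2 * ca[t] * cb[t] for t in terms)
    na = math.sqrt(sum((weights.get(t, 1.0) * v) ** 2 for t, v in ca.items()))
    nb = math.sqrt(sum((weights.get(t, 1.0) * v) ** 2 for t, v in cb.items()))
    return dot / (na * nb) if na and nb else 0.0

# down-weighting function words sharpens the semantic comparison
weights = {"the": 0.1, "a": 0.1, "is": 0.1}

s1 = "the flight is delayed"
s2 = "the flight is late"   # near-rephrase of s1
s3 = "the food is delicious"  # shares only function words with s1

print(weighted_cosine(s1, s2, weights) > weighted_cosine(s1, s3, weights))
```

With uniform weights the shared function words would inflate the s1/s3 score; the weighting concentrates the measure on content terms, which is the effect the trained weights aim for.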
Machine translation (MT) technology is becoming more and more pervasive, yet the quality of MT output is still not ideal. Thus, human corrections are used to edit the output for further studies. However, judging the human corrections can be tricky when the annotators are not experts. We present a novel way that uses cross-validation to automatically judge the human corrections, where each MT output is corrected by more than one annotator. Cross-validation among corrections for the same machine translation and among corrections from the same annotator are both applied. We obtain a correlation of around 40% in sentence quality for Chinese-English and Spanish-English. We also evaluate annotator quality. Finally, we rank the quality of human corrections from good to bad, which enables us to set a quality threshold to make a trade-off between the scope and the quality of the corrections.
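The cross-validation idea of scoring each annotator's correction by its agreement with the other annotators' corrections of the same MT output can be sketched as below. The token-F1 agreement measure and the toy corrections are assumptions for illustration, not the paper's actual scoring function.

```python
from collections import Counter

def overlap(a, b):
    # token F1 between two corrections (a symmetric agreement measure)
    ca, cb = Counter(a.split()), Counter(b.split())
    common = sum((ca & cb).values())
    if not common:
        return 0.0
    p = common / sum(cb.values())
    r = common / sum(ca.values())
    return 2 * p * r / (p + r)

def score_corrections(corrections):
    # Cross-validate: each annotator's correction of one MT output is
    # scored by its mean agreement with the other annotators' corrections.
    scores = {}
    for name, text in corrections.items():
        others = [t for n, t in corrections.items() if n != name]
        scores[name] = sum(overlap(text, o) for o in others) / len(others)
    return scores

# hypothetical corrections of the same MT output by three annotators
corrections = {
    "ann1": "the meeting starts at noon",
    "ann2": "the meeting begins at noon",
    "ann3": "cat sat on the mat",  # an outlier correction
}
scores = score_corrections(corrections)
print(min(scores, key=scores.get))  # the outlier gets the lowest score
```

Aggregating such scores per annotator across many sentences would give the annotator-level quality estimate the abstract mentions.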
Proceedings of the ACM SIGKDD Workshop on Human Computation, 2009
In this paper, we describe the design principles used for implementing crowdsourcing within the enterprise. This is based on our distinction between two kinds of crowdsourcing: enterprise (inside a firewall) versus the public domain. Whereas public domain crowdsourcing offers ...
Proceedings of the Conference on Empirical Methods in Natural Language Processing - EMNLP '08, 2008
We introduce the relative rank differential statistic, which is a non-parametric approach to document and dialog analysis based on word-frequency rank statistics. We also present a simple method to establish semantic saliency in dialogs, documents, and dialog segments using these word-frequency rank statistics. Applications of our technique include the dynamic tracking of topic and semantic evolution in a dialog, topic detection, automatic generation of document tags, and news story or event detection in conversational speech and text. Our approach benefits from the robustness, simplicity, and efficiency of non-parametric and rank-based approaches and consistently outperformed term-frequency and TF-IDF cosine distance approaches in several experiments.
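A saliency measure in this spirit can be sketched by comparing a word's frequency rank in a document against its rank in a background reference: words that rank much higher in the document than in the reference are treated as salient. This is an illustrative reconstruction under assumed conventions (rank 1 = most frequent, unseen reference words rank last), not the paper's exact statistic.

```python
from collections import Counter

def rank_map(tokens):
    # map each word to its frequency rank (1 = most frequent);
    # ties are broken alphabetically for determinism
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: r for r, w in enumerate(ordered, 1)}

def relative_rank_differential(doc, reference):
    # saliency of word w = (rank of w in reference) - (rank of w in doc);
    # large positive values mean w is unusually prominent in the document
    rd, rr = rank_map(doc.split()), rank_map(reference.split())
    default = len(rr) + 1  # words unseen in the reference rank last
    return {w: rr.get(w, default) - r for w, r in rd.items()}

# hypothetical background text dominated by function words
reference = "the of and to in the of and the of the"
doc = "the fusion reactor design the reactor the reactor"

saliency = relative_rank_differential(doc, reference)
top = max(saliency, key=saliency.get)
print(top)
```

Because only ranks are used, the measure needs no distributional assumptions and no IDF statistics, which is the robustness argument the abstract makes.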
Text, Speech and Language Technology
Three o'clock in the afternoon, Duke University Chapel, Durham, North Carolina. Where Do We Go From Here? Overcoming Inequality and Building Community. Duke University's 2010 Martin Luther King, Jr. Commemoration Committee finds insight into Dr. King's vision for an equitable society in his last published book, Where Do We Go From Here: Chaos or Community? In the book's last chapter, he writes: "The contemporary tendency in our society is to base our distribution on scarcity, which has vanished, and to compress our abundance into the overfed mouths of the middle and upper classes until they gag with superfluity. If democracy is to have breadth of meaning, it is necessary to adjust this inequity. It is not only moral, but it is also intelligent. We are wasting and degrading human life by clinging to archaic thinking. ... The curse of poverty has no justification in our age." Duke University's 2010 Service of Celebration: The Reverend Dr. Martin Luther King, Jr.
2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011
In this work we present the Subsequence Similarity Language Model (S2-LM), which is a new approach to language modeling based on string similarity. As a language model, S2-LM generates scores based on the closest matching string in a very large corpus. In this paper we describe the properties and advantages of our approach and describe efficient methods to carry out its computation. We describe an n-best rescoring experiment intended to show that S2-LM can be adjusted to behave as an n-gram SLM.
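The core scoring idea, rating a hypothesis by its distance to the closest matching corpus string, can be sketched with a word-level edit distance. This is an illustrative stand-in for S2-LM (whose efficient computation the paper describes and which is not reproduced here), with an invented toy corpus; in n-best rescoring, hypotheses with lower scores would be preferred.

```python
def edit_distance(a, b):
    # standard Levenshtein distance over word tokens
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]

def s2lm_score(sentence, corpus):
    # score = distance to the closest matching corpus string; lower means
    # the sentence looks more like something the corpus has seen
    tokens = sentence.split()
    return min(edit_distance(tokens, c.split()) for c in corpus)

# hypothetical training corpus
corpus = [
    "please call the helpdesk tomorrow",
    "the flight was delayed by fog",
    "turn the volume down",
]

hyp_good = "the flight was delayed by rain"   # one word off a corpus string
hyp_bad = "flight rain volume helpdesk please"  # scrambled word salad

print(s2lm_score(hyp_good, corpus) < s2lm_score(hyp_bad, corpus))
```

A brute-force scan like this is quadratic per corpus string, which is exactly why the paper's efficient computation methods matter at scale.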
Proceedings of the ACM SIGKDD Workshop on Human Computation, 2010
In large-scale online multiuser communities, the phenomenon of 'participation inequality' has been described as generally following a more or less 90-9-1 rule [9]. In this paper, we examine crowdsourcing participation levels inside the enterprise (within a company's firewall) and show that it is possible to achieve a more equitable distribution of 33-66-1. Accordingly, we propose a SCOUT ((S)uper Contributor, (C)ontributor, and (OUT)lier) model for describing user participation based on quantifiable effort-level metrics. In support of this framework, we present an analysis that measures the quantity of contributions correlated with responses to motivation and incentives. In conclusion, SCOUT provides the task-based categories to characterize the participation inequality that is evident in online communities and, crucially, also demonstrates the inequality curve (and associated characteristics) in the enterprise domain.
2007 Winter Simulation Conference, 2007
In this paper we start from a set of simple assumptions regarding the behavior of a pool of customers associated with an enterprise's contact center. We assume that the pool of customers can access the contact center through an array of communication modalities (e.g., email, chat, web, voice). Based on these assumptions we develop a model that describes the volume of demand likely to be observed in such an environment as a function of time. Under the simple initial assumptions, the model we develop corresponds to a mean-reverting process of the type frequently used in energy options pricing. When the independence assumptions are relaxed and correlations between user behaviors are included, a jump-diffusion component appears in the model. The resulting model constitutes a potential foundation for key simulation-based analyses of the contact center, such as capacity modeling and risk analysis.
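A mean-reverting process with a jump component can be simulated with a simple Euler discretization of an Ornstein-Uhlenbeck process plus Poisson-driven jumps. All parameter values below are invented for illustration; the paper's calibrated model is not reproduced here.

```python
import random

def simulate_demand(theta=0.5, mu=100.0, sigma=2.0,
                    jump_rate=0.05, jump_size=30.0,
                    x0=100.0, steps=200, seed=1):
    # Euler discretization of a mean-reverting (Ornstein-Uhlenbeck)
    # demand process with an added compound-Poisson jump component:
    #   dX = theta * (mu - X) * dt + sigma * dW + J * dN
    random.seed(seed)
    dt = 1.0
    x = x0
    path = [x]
    for _ in range(steps):
        drift = theta * (mu - x) * dt               # pull back toward mu
        diffusion = sigma * random.gauss(0.0, 1.0)  # ordinary volatility
        # jumps model correlated bursts of demand across users
        jump = jump_size if random.random() < jump_rate * dt else 0.0
        x = x + drift + diffusion + jump
        path.append(x)
    return path

path = simulate_demand()
mean_level = sum(path) / len(path)
print(mean_level)  # hovers near mu plus a small jump premium
```

Running many such paths and reading off peak demand quantiles is the kind of capacity and risk analysis the model is meant to support.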