Seung-Hoon Na | Chonbuk National University

Papers by Seung-Hoon Na

KLE at TREC 2008 Blog Track: Blog Post and Feed Retrieval

Abstract: This paper describes our participation in the TREC 2008 Blog Track. For the opinion task, we built an opinion retrieval model consisting of preprocessing, topic retrieval, opinion finding, and sentiment classification components. For topic retrieval, our system is based on a passage-based retrieval model and feedback. For the opinion analysis, we created a PSEUDO OPINIONATED WORD (POW), O, which is representative of all opinion words, and expanded the original query with O. For the blog distillation task, we integrated ...

Two-Stage Document Length Normalization for Information Retrieval

ACM Transactions on Information Systems, 2015

The standard approach to term frequency normalization is based only on the document length. However, it does not distinguish verbosity from scope, the two main factors that determine document length. Because verbosity and scope have largely different effects on the increase in term frequency, the standard approach can easily suffer from insufficient or excessive penalization depending on the specific type of long document. To overcome these problems, this paper proposes two-stage normalization, performing verbosity and scope normalization separately and employing different penalization functions. In verbosity normalization, each document is pre-normalized by dividing the term frequency by the verbosity of the document. In scope normalization, an existing retrieval model is applied in a straightforward manner to the pre-normalized document, leading to our proposed verbosity-normalized (VN) retrieval model. Experimental results on standard TREC collections demonstrate that the VN model leads to marginal but statistically significant improvements over standard retrieval models.
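In code, the two stages can be sketched as follows. This is a minimal sketch, not the paper's implementation: the abstract does not define the verbosity measure, so average occurrences per distinct term is assumed here, and BM25-style saturation stands in for the "existing retrieval model" of the scope stage.

```python
from collections import Counter

def verbosity(doc_tokens):
    # Assumed verbosity measure: average occurrences per distinct term.
    tf = Counter(doc_tokens)
    return sum(tf.values()) / len(tf)

def vn_term_frequency(term, doc_tokens):
    # Stage 1: pre-normalize the raw term frequency by the document's verbosity.
    return Counter(doc_tokens)[term] / verbosity(doc_tokens)

def vn_bm25_weight(term, doc_tokens, avg_scope, k1=1.2, b=0.75):
    # Stage 2: apply a standard model (here BM25 saturation) to the
    # pre-normalized frequency; "scope" is taken as the distinct-term count.
    tf_vn = vn_term_frequency(term, doc_tokens)
    scope = len(set(doc_tokens))
    norm = 1 - b + b * scope / avg_scope
    return tf_vn * (k1 + 1) / (tf_vn + k1 * norm)
```

A verbose document (many repeats per distinct term) thus has its term frequencies shrunk before the length penalty is ever applied.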

RankSVR: can preference data help regression?

In some regression applications (e.g., an automatic movie scoring system), a large amount of ranking data is available in addition to the original regression data. This paper studies whether and how such ranking data can improve the accuracy of a regression task. In particular, this paper first proposes RankSVR, an extension of SVR (Support Vector Regression) that incorporates ranking constraints into the learning of the regression function. Second, this paper proposes novel sampling methods for RankSVR, which selectively choose samples of ranking data for training so as to maximize RankSVR's performance. While ranking data is relatively easier to acquire than regression data, incorporating all of the ranking data into the learning of regression does not always produce the best output. Moreover, adding too many ranking constraints to the regression problem substantially lengthens the training time. Our proposed sampling methods find the ranking samples that maximize regression performance. Experimental results on synthetic and real data sets show that, when ranking data is additionally available, RankSVR performs significantly better than SVR by utilizing ranking constraints in the learning of regression, and that our sampling methods improve RankSVR's performance more than random sampling.
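The idea of mixing regression and ranking losses can be sketched with a linear model trained by subgradient descent (the paper formulates RankSVR as a full SVR optimization problem; the subgradient version, the margin of 1, and all names below are illustrative assumptions):

```python
import numpy as np

def ranksvr_fit(X, y, pairs, lam=0.01, c_rank=1.0, eps=0.1, lr=0.01, steps=2000):
    """Linear RankSVR sketch: epsilon-insensitive regression loss on (X, y)
    plus hinge-style losses on index pairs (i, j), each meaning f(x_i)
    should exceed f(x_j) by a margin of 1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        g = lam * w
        r = X @ w - y
        # epsilon-insensitive subgradient on the regression points
        g += X.T @ (np.sign(r) * (np.abs(r) > eps)) / n
        # hinge subgradient on the active ranking constraints
        for i, j in pairs:
            if (X[i] - X[j]) @ w < 1.0:
                g -= c_rank * (X[i] - X[j]) / max(len(pairs), 1)
        w -= lr * g
    return w
```

Dropping the `pairs` loop recovers plain linear SVR; the sampling methods in the paper amount to choosing which pairs enter that loop.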

A 2-Poisson model for probabilistic coreference of named entities for improved text retrieval

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval - SIGIR '09, 2009

Text retrieval queries frequently contain named entities. The standard approach of term frequency weighting does not work well when estimating the term frequency of a named entity, since anaphoric expressions (e.g., he, she, the movie) are frequently used to refer to named entities in a document, causing the term frequency of named entities to be underestimated. In this paper, we propose a novel 2-Poisson model to estimate the frequency of anaphoric expressions of a named entity, without explicitly resolving the anaphoric expressions. Our key assumption is that the frequency of anaphoric expressions is distributed over the named entities in a document according to the probabilities that the document is elite for each named entity. This assumption leads us to formulate our proposed Co-referentially Enhanced Entity Frequency (CEEF). Experimental results on the text collection of the TREC Blog Track show that CEEF achieves significant and consistent improvements over state-of-the-art retrieval methods using standard term frequency estimation. In particular, we achieve a 3% increase in MAP over the best performing run of the TREC 2008 Blog Track.
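The eliteness-weighted redistribution can be sketched as below. The Poisson rates, the prior, and the proportional sharing rule are illustrative assumptions; only the overall scheme (2-Poisson eliteness deciding how the document's anaphor count is shared among its entities) follows the abstract.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def elite_prob(tf, lam_elite=5.0, lam_non=0.5, prior=0.3):
    # 2-Poisson mixture: probability the document is "elite" for (about)
    # an entity, given the entity's observed term frequency.
    e = prior * poisson_pmf(tf, lam_elite)
    ne = (1 - prior) * poisson_pmf(tf, lam_non)
    return e / (e + ne)

def ceef(entity_tf, anaphora_count):
    # Distribute the document's anaphor occurrences over its entities in
    # proportion to eliteness probability, then add to the raw tf.
    probs = {e: elite_prob(tf) for e, tf in entity_tf.items()}
    z = sum(probs.values()) or 1.0
    return {e: entity_tf[e] + anaphora_count * probs[e] / z
            for e in entity_tf}
```

An entity the document is clearly about absorbs most of the pronoun mass, so its effective frequency rises without resolving a single anaphor.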

Enriching document representation via translation for improved monolingual information retrieval

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval - SIGIR '11, 2011

Word ambiguity and vocabulary mismatch are critical problems in information retrieval. To deal with these problems, this paper proposes the use of translated words to enrich document representation, going beyond the words in the original source language. In our approach, each original document is automatically translated into an auxiliary language, and the resulting translated document serves as a semantically enhanced representation supplementing the original bag of words. The core of our translation representation is the expected term frequency of a word in a translated document, calculated by averaging term frequencies over all possible translations rather than focusing on the 1-best translation only. For better translation efficiency, we do not rely on full-fledged machine translation but instead use monotonic translation, removing the time-consuming reordering component. Experiments carried out on standard TREC test collections show that our proposed translation representation leads to statistically significant improvements over using only the original language of the document collection.
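Under a word-level translation table, the expected term frequency reduces to summing translation probabilities over the document's tokens. A sketch, assuming a simple independent word-by-word table rather than the paper's monotonic translation system:

```python
def expected_tf(doc_tokens, trans_table):
    """Expected term frequency in the translated document: every candidate
    translation of every token contributes its translation probability,
    instead of counting only the 1-best translation."""
    etf = {}
    for tok in doc_tokens:
        for target, p in trans_table.get(tok, {}).items():
            etf[target] = etf.get(target, 0.0) + p
    return etf
```

An ambiguous word like "bank" thus spreads fractional counts over both of its translations, which is exactly what lets the enriched representation disambiguate it against the rest of the document.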

DiffPost: Filtering Non-relevant Content Based on Content Difference between Two Consecutive Blog Posts

Lecture Notes in Computer Science, 2009

One of the important issues for blog search engines is extracting clean text from blog posts. In practice, this extraction process is confronted with much non-relevant content in the original blog post, such as menus, banners, and site descriptions, which makes ranking less effective. The problem is that this non-relevant content is not encoded in a unified way but in many different ways across blog sites. Thus, a commercial blog search vendor would have to perform tuning work, such as writing hand-crafted rules for eliminating this non-relevant content, for every blog site. However, such tuning is a very inefficient process. Rather than this labor-intensive approach, this paper first observes that much of this non-relevant content does not change across consecutive blog posts, and then proposes a simple and effective DiffPost algorithm that eliminates it based on the content difference between two consecutive blog posts in the same blog site. Experimental results on the TREC Blog Track are remarkable, showing that a retrieval system using DiffPost obtains an important performance improvement of about 10% in MAP (Mean Average Precision) over the same system without DiffPost.
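Taking lines as the unit of comparison, the core of DiffPost can be sketched in a few lines (the actual text unit used by the paper may differ; this line-level simplification is an assumption):

```python
def diffpost(post, prev_post):
    """DiffPost sketch: text units (here lines) shared verbatim with the
    previous post from the same blog are treated as template noise
    (menus, banners, site description) and removed; only the differing
    content is kept."""
    prev_lines = set(prev_post.splitlines())
    return "\n".join(l for l in post.splitlines() if l not in prev_lines)
```

No per-site rules are needed: whatever boilerplate a site emits, it repeats from post to post and cancels out in the difference.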

Improving Opinion Retrieval Based on Query-Specific Sentiment Lexicon

Lecture Notes in Computer Science, 2009

Lexicon-based approaches have been widely used for opinion retrieval due to their simplicity. However, no previous work has focused on the domain-dependency problem in opinion lexicon construction. This paper proposes simple feedback-style learning of a query-specific opinion lexicon using the set of top-retrieved documents in response to a query. The proposed learning starts from an initial domain-independent general lexicon and creates a query-specific lexicon by re-estimating the opinion probabilities of the initial lexicon based on the top-retrieved documents. Experimental results on recent TREC test sets show that the query-specific lexicon provides a significant improvement over previous approaches, especially on BLOG-06 topics.

Exploiting proximity feature in bigram language model for information retrieval

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '08, 2008

Language modeling approaches have effectively dealt with the dependency among query terms based on N-grams, such as bigram or trigram models. However, bigram language models suffer from the adjacency-sparseness problem: dependent terms are not always adjacent in documents, but can be far from each other, sometimes separated by a few sentences. To resolve the adjacency-sparseness problem, this paper proposes a new type of bigram language model that explicitly incorporates a proximity feature between two adjacent terms in a query. Experimental results on three test collections show that the proposed bigram language model significantly improves on the previous bigram model as well as on Tao's approach, the state-of-the-art proximity-based method.
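One way to realize a soft, proximity-aware bigram count is distance-decayed weighting, sketched below. The window size and the 1/distance decay are illustrative assumptions, not the paper's exact estimator:

```python
def proximity_bigram_count(doc_tokens, w1, w2, max_dist=5):
    """Soft bigram count: each ordered co-occurrence of (w1, w2) within
    max_dist positions contributes weight 1/distance, so adjacent pairs
    count fully while nearby non-adjacent pairs still count partially."""
    pos1 = [i for i, t in enumerate(doc_tokens) if t == w1]
    pos2 = [j for j, t in enumerate(doc_tokens) if t == w2]
    return sum(1.0 / (j - i)
               for i in pos1 for j in pos2
               if 0 < j - i <= max_dist)
```

Because only ordered pairs (w1 before w2) contribute, the count stays directional like a true bigram, unlike symmetric window co-occurrence.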

Disambiguating Author Names Using Automatic Relevance Feedback

Communications in Computer and Information Science, 2011

A Comparison of Classifiers for Detecting Hedges

Communications in Computer and Information Science, 2011

Lightweight Natural Language Database Interfaces

Lecture Notes in Computer Science, 2004

Most natural language database interfaces suffer from the translation knowledge portability problem, and are vulnerable to ill-formed questions because of their deep analysis. To alleviate these problems, this paper proposes a lightweight approach to natural language interfaces, in which translation knowledge is semi-automatically acquired and user questions are only syntactically analyzed. For the acquisition of translation knowledge, first, a target database is reverse-engineered into a physical database schema, on which domain experts annotate linguistic descriptions to produce a pER (physically-derived Entity-Relationship) schema. Next, initial translation knowledge is automatically extracted from the pER schema. Then, it is extended with synonyms from lexical databases. In the question answering stage, this semi-automatically constructed translation knowledge is used to resolve translation ambiguities.

Influence of WSD on Cross-Language Information Retrieval

Lecture Notes in Computer Science, 2005

Translation ambiguity is a major problem in dictionary-based cross-language information retrieval. This paper proposes a statistical word sense disambiguation (WSD) approach for translation ambiguity resolution. Then, with respect to CLIR effectiveness, the pure effect of the disambiguation module is explored on the following issues: the contribution of the disambiguation weight to target term weighting, and the influence of WSD performance on CLIR retrieval effectiveness. In our investigation, we do not use pre-translation or post-translation methods, so as to exclude any mixing effects on CLIR.

Effective Query Model Estimation Using Parsimonious Translation Model in Language Modeling Approach

Lecture Notes in Computer Science, 2005

The KL divergence framework, an extension of the language modeling approach, has a critical problem with the estimation of the query model, the probabilistic model that encodes the user's information need. For initial retrieval, estimating the query model with a translation model involving term co-occurrence statistics had been proposed. However, the translation model is difficult to apply, because term co-occurrence statistics must …

Improving Relevance Feedback in Language Modeling Approach: Maximum a Posteriori Probability Criterion and Three-Component Mixture Model

Lecture Notes in Computer Science, 2005

We demonstrate that regularization can improve feedback in a language modeling framework.

Estimation of Query Model from Parsimonious Translation Model

Lecture Notes in Computer Science, 2005

The KL divergence framework, an extension of the language modeling approach, has a critical problem with the estimation of the query model, the probabilistic model that encodes the user's information need. At initial retrieval, however, it is difficult to expand the query model using co-occurrence, because two-dimensional matrix information such as term co-occurrences must be constructed offline. Especially for large collections, constructing such a large matrix of term co-occurrences prohibitively increases time and space complexity. This paper proposes an effective method for constructing co-occurrence statistics by employing a parsimonious translation model. The parsimonious translation model is a compact version of the translation model, containing only a small number of parameters with non-zero probabilities. It lets us drastically reduce the number of remaining terms per document, so that co-occurrence statistics can be calculated in tractable time. Experimental results show that the query model derived from the parsimonious translation model significantly improves baseline language modeling performance.
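The pruning step behind a parsimonious model can be sketched as thresholding and renormalizing a translation distribution (the paper estimates parsimonious models statistically; the fixed threshold below is an assumed simplification):

```python
def parsimonize(trans_probs, threshold=0.05):
    """Parsimonious translation model sketch: drop translation
    probabilities below a threshold and renormalize, leaving a compact
    distribution with few non-zero parameters."""
    kept = {w: p for w, p in trans_probs.items() if p >= threshold}
    z = sum(kept.values())
    return {w: p / z for w, p in kept.items()}
```

Applied per source term, this shrinks the co-occurrence matrix to the surviving entries, which is what makes offline construction tractable for large collections.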

Query Model Estimations for Relevance Feedback in Language Modeling Approach

Lecture Notes in Computer Science, 2005

Recently, researchers have successfully augmented the language modeling approach with a well-founded framework in order to incorporate relevance feedback. A critical problem in this framework is estimating a query language model that encodes detailed knowledge about a user's information need. This paper explores several methods for query model estimation, motivated by Zhai's generative model. The generative model is an …

English-Chinese Transliteration Word Pair Extraction from Parallel Corpora

International Journal of Computer Processing of Languages, 2008

… Na, Dong-Il Na and Jong-Hyeok Lee written in Chinese, and TU represents transliteration units. So P(C|E), P(克林顿|Clinton), can be transformed to P(KeLinDun|Clinton). In this paper we define an English TU as a unigram, bigram, or trigram; a Chinese TU is a pinyin initial, pinyin final …

Applying Completely-Arbitrary Passage for Pseudo-Relevance Feedback in Language Modeling Approach

Lecture Notes in Computer Science, 2008

Unlike traditional document-level feedback, passage-level feedback restricts the context for selecting relevant terms to a passage in a document rather than the entire document. It can thus avoid selecting non-relevant terms from non-relevant parts of a document. The most recent work on passage-level feedback has investigated it from the viewpoint of the fixed-window type of passage. However, the fixed-window type of passage is limited in optimizing passage-level feedback, since it includes a query-independent portion. To minimize the query-independence of the passage, this paper proposes a new type of passage, called the completely-arbitrary passage. Based on this, we devise a novel two-stage passage feedback, which consists of passage retrieval and passage extension as sub-steps, unlike previous single-stage passage feedback relying only on passage retrieval. Experimental results show that the proposed two-stage passage-level feedback improves significantly over document-level feedback and over single-stage passage feedback using the fixed-window type of passage.

Completely-Arbitrary Passage Retrieval in Language Modeling Approach

Lecture Notes in Computer Science, 2008

Passage retrieval has been expected to be an alternative method for resolving the length-normalization problem, since passages have more uniform lengths and topics than documents. An important issue in passage retrieval is determining the type of passage. Among several different passage types, the arbitrary passage type, which varies dynamically according to the query, has shown the best performance. However, …
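If each position is scored positively for query-term matches and negatively otherwise, the best completely-arbitrary passage is simply a maximum-scoring contiguous span, computable with a Kadane-style scan. The scoring scheme below is an assumption for illustration, not the paper's model:

```python
def best_arbitrary_passage(doc_tokens, query_terms, gap_penalty=0.5):
    """Completely-arbitrary passage sketch: the passage is whatever
    contiguous span maximizes (query matches - gap penalty), found with a
    maximum-subarray scan; its boundaries vary freely with the query."""
    best = (0.0, 0, 0)  # (score, start, end) with end exclusive
    cur, start = 0.0, 0
    for i, tok in enumerate(doc_tokens):
        cur += 1.0 if tok in query_terms else -gap_penalty
        if cur <= 0.0:
            cur, start = 0.0, i + 1   # restart the candidate span
        elif cur > best[0]:
            best = (cur, start, i + 1)
    return best
```

Unlike a fixed window, the returned span never pads itself with query-independent text beyond what the gap penalty tolerates.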

Query-Based Inter-document Similarity Using Probabilistic Co-relevance Model

Lecture Notes in Computer Science, 2008

Inter-document similarity is the critical information determining whether or not cluster-based retrieval improves over the baseline. However, theoretical work on inter-document similarity has not been undertaken, even though such work could provide a principle for defining a better similarity in a well-motivated direction. To this end, this paper starts by pursuing an ideal inter-document similarity that optimally satisfies the cluster hypothesis. We propose a probabilistic principle of inter-document similarity: the optimal similarity of two documents should be proportional to the probability that they are co-relevant to an arbitrary query. Based on this principle, the study of inter-document similarity is formulated as the estimation problem of the co-relevance model of documents. Furthermore, we find that the optimal inter-document similarity should be defined using queries, not terms, as its basic unit, namely a query-based similarity. We strictly derive a novel query-based similarity from the co-relevance model, without any heuristics. Experimental results show that the new query-based inter-document similarity significantly improves on the previously-used term-based similarity in terms of Voorhees's evaluation measure.
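With per-query relevance scores treated as approximate relevance probabilities over a sampled query set, the co-relevance principle yields a similarity sketch like the following (the query sample and the score normalization to [0, 1] are assumptions):

```python
def query_based_similarity(d1_scores, d2_scores):
    """Query-based inter-document similarity sketch: approximate the
    probability that two documents are co-relevant to a random query by
    summing, over a sampled query set, the product of their per-query
    relevance scores (assumed independent given the query)."""
    return sum(d1_scores[q] * d2_scores.get(q, 0.0) for q in d1_scores)
```

Two documents are similar here only when the same queries would retrieve both, which is exactly the behavior the cluster hypothesis asks of a similarity measure.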

Research paper thumbnail of Kle at trec 2008 blog track: Blog post and feed retrieval

Abstract: This paper describes our participation in the TREC 2008 Blog Track. For the opinion tas... more Abstract: This paper describes our participation in the TREC 2008 Blog Track. For the opinion task, we made an opinion retrieval model that consists of preprocessing, topic retrieval, opinion finding, and sentiment classification parts. For topic retrieval, our system is based on the passage-based retrieval model and feedback. For the opinion analysis, we created a PSEUDO OPINIONATED WORD (POW), O, which is representative of all opinion words, and expanded the original query with O. For the blog distillation task, we integrated ...

Research paper thumbnail of Two-Stage Document Length Normalization for Information Retrieval

ACM Transactions on Information Systems, 2015

The standard approach for term frequency normalization is based only on the document length. Howe... more The standard approach for term frequency normalization is based only on the document length. However, it does not distinguish the verbosity from the scope, these being the two main factors determining the document length. Because the verbosity and scope have largely different effects on the increase in term frequency, the standard approach can easily suffer from insufficient or excessive penalization depending on the specific type of long document. To overcome these problems, this paper proposes two-stage normalization by performing verbosity and scope normalization separately, and by employing different penalization functions. In verbosity normalization, each document is pre-normalized by dividing the term frequency by the verbosity of the document. In scope normalization, an existing retrieval model is applied in a straightforward manner to the pre-normalized document, finally leading us to formulate our proposed verbosity normalized (VN) retrieval model. Experimental results carried out on standard TREC collections demonstrate that the VN model leads to marginal but statistically significant improvements over standard retrieval models.

Research paper thumbnail of RankSVR: can preference data help regression?

In some regression applications (e.g., an automatic movie scoring system), a large number of rank... more In some regression applications (e.g., an automatic movie scoring system), a large number of ranking data is available in addition to the original regression data. This paper studies whether and how the ranking data can improve the accuracy of regression task. In particular, this paper first proposes an extension of SVR (Support Vector Regression), RankSVR, which incorporates ranking constraints in the learning of regression function. Second, this paper proposes novel sampling methods for RankSVR, which selectively choose samples of ranking data for training of regression functions in order to maximize the performance of RankSVR. While it is relatively easier to acquire ranking data than regression data, incorporating all the ranking data in the learning of regression doest not always generate the best output. Moreoever, adding too many ranking constraints into the regression problem substantially lengthens the training time. Our proposed sampling methods find the ranking samples that maximize the regression performance. Experimental results on synthetic and real data sets show that, when the ranking data is additionally available, RankSVR significantly performs better than SVR by utilizing ranking constraints in the learning of regression, and also show that our sampling methods improve the RankSVR performance better than the random sampling.

Research paper thumbnail of A 2-poisson model for probabilistic coreference of named entities for improved text retrieval

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval - SIGIR '09, 2009

Text retrieval queries frequently contain named entities. The standard approach of term frequency... more Text retrieval queries frequently contain named entities. The standard approach of term frequency weighting does not work well when estimating the term frequency of a named entity, since anaphoric expressions (like he, she, the movie, etc) are frequently used to refer to named entities in a document, and the use of anaphoric expressions causes the term frequency of named entities to be underestimated. In this paper, we propose a novel 2-Poisson model to estimate the frequency of anaphoric expressions of a named entity, without explicitly resolving the anaphoric expressions. Our key assumption is that the frequency of anaphoric expressions is distributed over named entities in a document according to the probabilities of whether the document is elite for the named entities. This assumption leads us to formulate our proposed Co-referentially Enhanced Entity Frequency (CEEF ). Experimental results on the text collection of TREC Blog Track show that CEEF achieves significant and consistent improvements over state-of-the-art retrieval methods using standard term frequency estimation. In particular, we achieve a 3% increase of MAP over the best performing run of TREC 2008 Blog Track.

Research paper thumbnail of Enriching document representation via translation for improved monolingual information retrieval

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information - SIGIR '11, 2011

Word ambiguity and vocabulary mismatch are critical problems in information retrieval. To deal wi... more Word ambiguity and vocabulary mismatch are critical problems in information retrieval. To deal with these problems, this paper proposes the use of translated words to enrich document representation, going beyond the words in the original source language to represent a document. In our approach, each original document is automatically translated into an auxiliary language, and the resulting translated document serves as a semantically enhanced representation for supplementing the original bag of words. The core of our translation representation is the expected term frequency of a word in a translated document, which is calculated by averaging the term frequencies over all possible translations, rather than focusing on the 1-best translation only. To achieve better efficiency of translation, we do not rely on full-fledged machine translation, but instead use monotonic translation by removing the time-consuming reordering component. Experiments carried out on standard TREC test collections show that our proposed translation representation leads to statistically significant improvements over using only the original language of the document collection.

Research paper thumbnail of DiffPost: Filtering Non-relevant Content Based on Content Difference between Two Consecutive Blog Posts

Lecture Notes in Computer Science, 2009

One of the important issues in blog search engines is to extract the cleaned text from blog post.... more One of the important issues in blog search engines is to extract the cleaned text from blog post. In practice, this extraction process is confronted with many non-relevant contents in the original blog post, such as menu, banner, site description, etc, causing the ranking be less-effective. The problem is that these non-relevant contents are not encoded in a unified way but encoded in many different ways between blog sites. Thus, the commercial vendor of blog sites should consider tuning works such as making human-driven rules for eliminating these non-relevant contents for all blog sites. However, such tuning is a very inefficient process. Rather than this labor-intensive method, this paper first recognizes that many of these non-relevant contents are not changed between several consequent blog posts, and then proposes a simple and effective DiffPost algorithm to eliminate them based on content difference between two consequent blog posts in the same blog site. Experimental result in TREC blog track is remarkable, showing that the retrieval system using DiffPost makes an important performance improvement of about 10% MAP (Mean Average Precision) increase over that without DiffPost. 1

Research paper thumbnail of Improving Opinion Retrieval Based on Query-Specific Sentiment Lexicon

Lecture Notes in Computer Science, 2009

Lexicon-based approaches have been widely used for opinion retrieval due to their simplicity. How... more Lexicon-based approaches have been widely used for opinion retrieval due to their simplicity. However, no previous work has focused on the domain-dependency problem in opinion lexicon construction. This paper proposes simple feedback-style learning for query-specific opinion lexicon using the set of top-retrieved documents in response to a query. The proposed learning starts from the initial domain-independent general lexicon and creates a query-specific lexicon by re-updating the opinion probability of the initial lexicon based on top-retrieved documents. Experimental results on recent TREC test sets show that the query-specific lexicon provides a significant improvement over previous approaches, especially in BLOG-06 topics 1 .

Research paper thumbnail of Exploiting proximity feature in bigram language model for information retrieval

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '08, 2008

Language modeling approaches have been effectively dealing with the dependency among query terms ... more Language modeling approaches have been effectively dealing with the dependency among query terms based on N-gram such as bigram or trigram models. However, bigram language models suffer from adjacency-sparseness problem which means that dependent terms are not always adjacent in documents, but can be far from each other, sometimes with distance of a few sentences in a document. To resolve the adjacency-sparseness problem, this paper proposes a new type of bigram language model by explicitly incorporating the proximity feature between two adjacent terms in a query. Experimental results on three test collections show that the proposed bigram language model significantly improves previous bigram model as well as Tao's approach, the state-of-art method for proximity-based method.

Research paper thumbnail of Disambiguating Author Names Using Automatic Relevance Feedback

Communications in Computer and Information Science, 2011

Research paper thumbnail of A Comparison of Classifiers for Detecting Hedges

Communications in Computer and Information Science, 2011

Research paper thumbnail of Lightweight Natural Language Database Interfaces

Lecture Notes in Computer Science, 2004

Most natural language database interfaces suffer from the translation knowledge portability probl... more Most natural language database interfaces suffer from the translation knowledge portability problem, and are vulnerable to ill-formed questions because of their deep analysis. To alleviate those problems, this paper proposes a lightweight approach to natural language interfaces, where translation knowledge is semi-automatically acquired and user questions are only syntactically analyzed. For the acquisition of translation knowledge, first, a target database is reverse-engineered into a physical database schema on which domain experts annotate linguistic descriptions to produce a pER (physically-derived Entity-Relationship) schema. Next, from the pER schema, initial translation knowledge is automatically extracted. Then, it is extended with synonyms from lexical databases. In the stage of question answering, this semi-automatically constructed translation knowledge is then used to resolve translation ambiguities.

Research paper thumbnail of Influence of WSD on Cross-Language Information Retrieval

Lecture Notes in Computer Science, 2005

ABSTRACT Translation ambiguity is a major problem in dictionary-based cross-language information ... more ABSTRACT Translation ambiguity is a major problem in dictionary-based cross-language information retrieval. This paper proposes a statistical word sense disambiguation (WSD) approach for translation ambiguity resolution. Then, with respect to CLIR effectiveness, the pure effect of a disambiguation module will be explored on the following issues: contribution of disambiguation weight to target term weighting, influences of WSD performance on CLIR retrieval effectiveness. In our investigation, we do not use pre-translation or post-translation methods to exclude any mixing effects on CLIR.

Research paper thumbnail of Effective Query Model Estimation Using Parsimonious Translation Model in Language Modeling Approach

Lecture Notes in Computer Science, 2005

The KL divergence framework, an extension of the language modeling approach, has a critical problem with the estimation of the query model, the probabilistic model that encodes the user's information need. For initial retrieval, estimation of the query model by a translation model involving term co-occurrence statistics had been proposed. However, the translation model is difficult to apply, because term co-occurrence statistics must

Research paper thumbnail of Improving Relevance Feedback in Language Modeling Approach: Maximum a Posteriori Probability Criterion and Three-Component Mixture Model

Lecture Notes in Computer Science, 2005

We demonstrate that regularization can improve feedback in a language modeling framework.

Research paper thumbnail of Estimation of Query Model from Parsimonious Translation Model

Lecture Notes in Computer Science, 2005

The KL divergence framework, an extension of the language modeling approach, has a critical problem with the estimation of the query model, the probabilistic model that encodes the user's information need. At initial retrieval, however, it is difficult to expand the query model using co-occurrence, because two-dimensional matrix information such as term co-occurrence must be constructed offline. Especially for large collections, constructing such a large matrix of term co-occurrences prohibitively increases time and space complexity. This paper proposes an effective method to construct co-occurrence statistics by employing a parsimonious translation model. The parsimonious translation model is a compact version of the translation model, containing only a very small number of parameters with non-zero probabilities. It enables us to enormously reduce the number of remaining terms in a document, so that co-occurrence statistics can be calculated in tractable time. In our experiments, the results show that the query model derived from the parsimonious translation model significantly improves baseline language modeling performance.
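As a toy illustration of how a pruned translation table yields an expanded query model, the sketch below computes p(w|Q) = Σ_q t(w|q) · p(q|Q) with a uniform p(q|Q). The translation probabilities are invented for illustration; the paper's parsimonious model would estimate and prune them from the collection itself.

```python
from collections import defaultdict

# Pruned, parsimonious-style translation table t(w | q): only a handful of
# non-zero entries per query term are kept. Probabilities are made up.
t = {
    "car":   {"car": 0.6, "auto": 0.25, "vehicle": 0.15},
    "price": {"price": 0.7, "cost": 0.3},
}

def expand_query_model(query_terms):
    """Estimate p(w | Q) = sum_q t(w | q) * p(q | Q), with uniform p(q | Q)."""
    p_q = 1.0 / len(query_terms)
    model = defaultdict(float)
    for q in query_terms:
        # fall back to a self-translation for terms missing from the table
        for w, prob in t.get(q, {q: 1.0}).items():
            model[w] += prob * p_q
    return dict(model)
```

Because each per-term distribution t(·|q) sums to one, the expanded model is itself a proper distribution over terms; the sparsity of the table is what keeps the computation tractable, mirroring the abstract's argument.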

Research paper thumbnail of Query Model Estimations for Relevance Feedback in Language Modeling Approach

Lecture Notes in Computer Science, 2005

Recently, researchers have successfully augmented the language modeling approach with a well-founded framework in order to incorporate relevance feedback. A critical problem in this framework is to estimate a query language model that encodes detailed knowledge about a user's information need. This paper explores several methods for query model estimation, motivated by Zhai's generative model. The generative model is an

Research paper thumbnail of English-Chinese Transliteration Word Pair Extraction from Parallel Corpora

International Journal of Computer Processing of Languages, 2008

... Na, Dong-Il Na and Jong-Hyeok Lee written in Chinese, and TU represents transliteration units. So P(C|E), P(克林顿|Clinton), can be transformed to P(KeLinDun|Clinton). In this paper we define an English TU as a unigram, bigram, or trigram; a Chinese TU is a pinyin initial, pinyin final ...

Research paper thumbnail of Applying Completely-Arbitrary Passage for Pseudo-Relevance Feedback in Language Modeling Approach

Lecture Notes in Computer Science, 2008

Unlike traditional document-level feedback, passage-level feedback restricts the context for selecting relevant terms to a passage in a document, rather than the entire document. It can thus avoid selecting non-relevant terms from non-relevant parts of a document. The most recent work on passage-level feedback has investigated the fixed-window type of passage. However, the fixed-window passage has limitations in optimizing passage-level feedback, since it includes a query-independent portion. To minimize the query-independence of the passage, this paper proposes a new type of passage, called the completely-arbitrary passage. Based on this, we devise a novel two-stage passage feedback, which consists of passage retrieval and passage extension as sub-steps, unlike previous single-stage passage feedback relying only on passage retrieval. Experimental results show that the proposed two-stage passage-level feedback significantly improves over both document-level feedback and single-stage passage feedback using the fixed-window type of passage.
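A minimal sketch of the two-stage idea (passage retrieval, then passage extension before term selection) on tokenized text is given below. The scoring function, the extension rule, and all parameter values are simplified stand-ins, not the models used in the paper.

```python
def score(passage_tokens, query_terms):
    # toy relevance score: count of query-term occurrences in the passage
    return sum(passage_tokens.count(q) for q in query_terms)

def two_stage_feedback(doc_tokens, query_terms, window=3, extend=2, n_terms=3):
    # Stage 1: passage retrieval -- score every start position, so the
    # passage boundary is query-dependent rather than a fixed grid
    best = max(range(len(doc_tokens) - window + 1),
               key=lambda i: score(doc_tokens[i:i + window], query_terms))
    # Stage 2: passage extension -- widen the best span before term selection
    lo = max(0, best - extend)
    hi = min(len(doc_tokens), best + window + extend)
    passage = doc_tokens[lo:hi]
    # select expansion terms from the extended passage only,
    # ranked by frequency within that passage
    candidates = [w for w in passage if w not in query_terms]
    ranked = sorted(dict.fromkeys(candidates),
                    key=lambda w: -passage.count(w))
    return ranked[:n_terms]
```

The point of the two stages is visible in the code: expansion terms are drawn only from the extended, query-anchored span, never from the rest of the document.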

Research paper thumbnail of Completely-Arbitrary Passage Retrieval in Language Modeling Approach

Lecture Notes in Computer Science, 2008

Passage retrieval has been expected to be an alternative method for resolving the length-normalization problem, since passages have more uniform lengths and topics than documents. An important issue in passage retrieval is to determine the type of passage. Among several different passage types, the arbitrary passage type, which varies dynamically according to the query, has shown the best performance. However,

Research paper thumbnail of Query-Based Inter-document Similarity Using Probabilistic Co-relevance Model

Lecture Notes in Computer Science, 2008

Inter-document similarity is the critical information that determines whether or not cluster-based retrieval improves over the baseline. However, theoretical work on inter-document similarity has not been investigated, even though such work could provide a principle for defining an improved similarity in a well-motivated direction. To this end, this paper starts by pursuing an ideal inter-document similarity that optimally satisfies the cluster hypothesis. We propose a probabilistic principle of inter-document similarity: the optimal similarity of two documents should be proportional to the probability that they are co-relevant to an arbitrary query. Based on this principle, the study of inter-document similarity is formulated as the estimation problem of the co-relevance model of documents. Furthermore, we find that the optimal inter-document similarity should be defined using queries, not terms, as its basic unit, namely a query-based similarity. We strictly derive a novel query-based similarity from the co-relevance model, without any heuristics. Experimental results show that the new query-based inter-document similarity significantly improves over the previously-used term-based similarity in the context of Voorhees's evaluation measure.
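The probabilistic principle in this abstract can be illustrated with a toy computation: similarity as an expectation over queries of the product of the two documents' relevance probabilities. The documents, the query sample, and the crude estimate of P(rel | d, q) below are all invented for illustration; the paper derives the similarity rigorously from the co-relevance model.

```python
# Toy sketch of sim(d1, d2) ~ sum_q P(q) * P(rel | d1, q) * P(rel | d2, q),
# with a uniform query prior and a crude relevance estimate. All data is made up.
docs = {
    "d1": "language model query retrieval".split(),
    "d2": "query retrieval feedback model".split(),
    "d3": "protein folding structure".split(),
}
queries = [["query", "retrieval"], ["protein", "structure"]]

def p_rel(doc_tokens, query):
    # crude stand-in for P(rel | d, q): fraction of query terms in the doc
    return sum(t in doc_tokens for t in query) / len(query)

def co_relevance_sim(a, b):
    p_q = 1.0 / len(queries)  # uniform query prior P(q)
    return sum(p_q * p_rel(docs[a], q) * p_rel(docs[b], q) for q in queries)
```

Even with this crude estimate, the query-based view behaves as the principle predicts: two documents relevant to the same queries score high, while documents with no co-relevant query score zero regardless of any term overlap.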