Exploiting proximity feature in bigram language model for information retrieval

Word Pairs in Language Modeling for Information Retrieval

Recherche d'Information Assistée par Ordinateur, 2004

Previous language modeling approaches to information retrieval have focused primarily on single terms. The use of bigram models has been studied, but the restriction on word order and adjacency may not be justified for information retrieval. We propose a new language modeling approach to information retrieval that incorporates lexical affinities, or pairs of words that occur near each other, without imposing restrictions on word order or adjacency.
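
As a rough illustration of the lexical-affinity idea, the sketch below extracts unordered word pairs that co-occur within a small window; the window size and function name are illustrative choices, not taken from the paper.

```python
def lexical_affinities(tokens, window=5):
    """Collect unordered word pairs that co-occur within `window` positions.

    Order and adjacency are deliberately ignored: (a, b) and (b, a)
    count as the same pair, in the spirit of lexical affinities.
    """
    pairs = {}
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair = tuple(sorted((w, tokens[j])))
            pairs[pair] = pairs.get(pair, 0) + 1
    return pairs

print(lexical_affinities("pairs of words that occur near each other".split()))
```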

Query expansion using term relationships in language models for information retrieval

2005

BAI, J., SONG, D., BRUZA, P. D., NIE, J. Y. and CAO, G., 2005. Query expansion using term relationships in language models for information retrieval. Available from OpenAIR@RGU: http://openair.rgu.ac.uk

Dependency Structure Applied to Language Modeling for Information Retrieval

ETRI Journal, 2006

In this paper, we propose a new language model, namely, a dependency structure language model, for information retrieval to compensate for the weaknesses of unigram and bigram language models. The dependency structure language model is based on the first-order dependency model and the dependency parse tree generated by a linguistic parser, so long-distance dependencies can be captured naturally. We carried out extensive experiments to verify the proposed model; the dependency structure model gives better performance than recently proposed language models and the Okapi BM25 method, and dependency structure is more effective than unigrams and bigrams in language modeling for information retrieval.
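
To make the factorization concrete, here is a hedged sketch of scoring a query under a first-order dependency model: the query's parse edges, rather than adjacent token pairs, drive the bigram-style estimates, which is how long-distance pairs enter the score. The data structures and the smoothing weight are assumptions for illustration, not the paper's estimator.

```python
import math

def dependency_lm_score(root, query_edges, doc_unigram, doc_bigram, lam=0.5):
    """Score a query under a first-order dependency LM (illustrative).

    root: the head word of the query's dependency parse tree.
    query_edges: (head, dependent) pairs from the parse, so word pairs
    contribute even when the words are not adjacent in the query.
    doc_unigram / doc_bigram: probability estimates from the document.
    """
    score = math.log(doc_unigram.get(root, 1e-9))
    for head, dep in query_edges:
        p = lam * doc_bigram.get((head, dep), 0.0) \
            + (1 - lam) * doc_unigram.get(dep, 1e-9)
        score += math.log(p)
    return score
```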

Dependency structure language model for information retrieval

ETRI Journal, 2006

In this paper, we propose a new language model, namely, a dependency structure language model, for information retrieval to compensate for the weaknesses of bigram and trigram language models. The dependency structure language model is based on the Chow expansion ...

Choosing the Right Bigrams for Information Retrieval

Classification, Clustering, and Data Mining Applications, 2004

After more than 30 years of research in information retrieval, the dominant paradigm remains the "bag of words", in which query terms are considered independent of their co-occurrences with each other. Although there has been some work on incorporating phrases or other syntactic information into IR, such attempts have given modest and inconsistent improvements at best. This paper is a first step toward investigating more deeply the question of using bigrams for information retrieval. Our results indicate that only certain kinds of bigrams are likely to aid retrieval. We used linear regression methods on data from TREC 6, 7, and 8 to identify which bigrams are able to help retrieval at all. Our characterization was then tested through retrieval experiments using our information retrieval engine, AIRE, which implements many standard ranking functions and retrieval utilities.
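
The regression step could look something like the following sketch. The features (component IDFs, bigram frequency) and all the numbers are invented for illustration; they are not the paper's feature set or its TREC data.

```python
import numpy as np

# Hypothetical per-bigram features and the observed change in average
# precision when that bigram is added to the query representation.
X = np.array([[4.2, 3.8, 17.0],
              [1.1, 0.9, 250.0],
              [5.0, 4.6, 9.0],
              [2.3, 2.1, 80.0]])
y = np.array([0.031, -0.004, 0.027, 0.002])

X1 = np.hstack([X, np.ones((len(X), 1))])      # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares fit

def predicted_gain(features):
    """Estimated retrieval gain for a candidate bigram's feature vector."""
    return float(np.append(features, 1.0) @ coef)

# Keep a candidate bigram only if its predicted gain is positive.
print(predicted_gain([4.0, 3.5, 20.0]))
```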

Variations on language modeling for information retrieval

2005

Search engine technology builds on theoretical and empirical research results in the area of information retrieval (IR). This dissertation contributes to the field of language modeling (LM) for IR, which views both queries and documents as instances of a unigram language model and defines the matching function between a query and each document as the probability that the query terms are generated by the document language model. The work described is concerned with three research issues.
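
The matching function described here is the standard query-likelihood score. A minimal sketch with Jelinek-Mercer smoothing follows; the smoothing choice and parameter are assumptions for illustration, since the dissertation may use other schemes.

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.1):
    """Jelinek-Mercer smoothed query-likelihood score (illustrative).

    Ranks a document by the log-probability that its unigram language
    model, interpolated with the collection model, generates the query.
    """
    d, c = Counter(doc), Counter(collection)
    dlen, clen = len(doc), len(collection)
    score = 0.0
    for w in query:
        p_doc = d[w] / dlen if dlen else 0.0
        p_col = c[w] / clen if clen else 0.0
        score += math.log((1 - lam) * p_doc + lam * p_col + 1e-12)
    return score

doc = "language models rank documents by query likelihood".split()
print(query_likelihood(["query", "models"], doc, doc))
```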

Effective use of phrases in language modeling to improve information retrieval

2004

Traditional information retrieval models treat the query as a bag of words, assuming that the occurrence of each query term is independent of the positions and occurrences of others. Several of these traditional models have been extended to incorporate positional information, most often through the inclusion of phrases, and this has shown improvements in effectiveness on large, modern test collections. The language modeling approach to information retrieval is attractive because it provides a well-studied theoretical framework that has been successful in other fields. Incorporating positional information into language models is intuitive and has shown significant improvements in several language modeling applications. However, attempts to integrate positional information into the language modeling approach to IR have not shown consistent significant improvements. This paper provides a broader exploration of this problem. We apply the backoff technique to incorporate a bigram phrase language model ...
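
A minimal sketch of the backoff idea for a bigram phrase model: phrases seen in the document get the bigram estimate, while unseen ones fall back to a scaled unigram probability. The discount constant and this simple scheme are assumptions for illustration, not the paper's exact estimator.

```python
def backoff_phrase_prob(w_prev, w, doc_bigram, doc_unigram, alpha=0.4):
    """Back off from a bigram phrase estimate to the unigram model.

    If the phrase (w_prev, w) was seen in the document, use its bigram
    probability; otherwise fall back to a scaled unigram probability so
    that unseen phrases do not zero out the document score.
    """
    p_bigram = doc_bigram.get((w_prev, w), 0.0)
    if p_bigram > 0.0:
        return p_bigram
    return alpha * doc_unigram.get(w, 1e-9)

# Toy estimates for a single document:
print(backoff_phrase_prob("information", "retrieval",
                          {("information", "retrieval"): 0.02},
                          {"retrieval": 0.03}))
```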

Proximity relevance model for query expansion

Query expansion (QE) aims at improving information retrieval effectiveness by enhancing the query formulation. Because users' queries are generally short, and because of language ambiguity, some information needs are difficult to satisfy. Query reformulation and QE methods have been developed to address this issue. Pseudo-relevance feedback (PRF) considers the top retrieved documents as relevant and uses their content to expand the initial query. Rather than considering feedback documents as a bag of words, it is possible to exploit term proximity information. Although there is some research in this direction, the majority of it is empirical. The lack of theoretical work in this area motivated us to introduce a novel method, integrated into the language model formalism, that takes advantage of the remoteness of candidate expansion terms from query terms within feedback documents. In contrast to previous work, our approach captures proximity directly and in terms of sentences rather than tokens. We show that the method significantly improves retrieval performance on TREC collections, especially for difficult queries.
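
To illustrate the sentence-level proximity idea, here is a minimal sketch that weights candidate expansion terms by their sentence distance to the nearest query-term occurrence in a feedback document. The 1/(1 + distance) decay and the function name are assumed for illustration; they are not the paper's actual model.

```python
def proximity_weights(sentences, query_terms):
    """Weight candidate expansion terms by sentence-level proximity.

    Each candidate term accumulates weight that decays with the number
    of sentences separating it from the nearest query-term occurrence,
    mirroring remoteness measured in sentences rather than tokens.
    """
    q_sents = [i for i, s in enumerate(sentences)
               if any(t in s for t in query_terms)]
    weights = {}
    if not q_sents:
        return weights
    for i, sent in enumerate(sentences):
        dist = min(abs(i - j) for j in q_sents)
        for term in sent:
            if term not in query_terms:
                weights[term] = weights.get(term, 0.0) + 1.0 / (1 + dist)
    return weights

doc = [["feedback", "documents", "mention", "retrieval"],
       ["candidate", "terms", "appear", "one", "sentence", "away"]]
print(proximity_weights(doc, {"retrieval"}))
```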

Text Retrieval by using k-word Proximity Search

1999

When we search a huge collection of documents, we often specify several keywords and use conjunctive queries to narrow the result of the search. Although the retrieved documents contain all the keywords, the positions of the keywords are usually not considered. As a result, the search result contains some meaningless documents. It is therefore effective to rank documents according to the proximity of keywords in the documents. This ranking can be regarded as a kind of text data mining. In this paper, we propose two algorithms for finding documents in which all given keywords appear in neighboring places. One is based on a plane-sweep algorithm and the other on a divide-and-conquer approach. Both algorithms run in O(n log n) time, where n is the number of occurrences of the given keywords. We ran the plane-sweep algorithm on a large collection of HTML files and verified its effectiveness.
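
A sketch of the plane-sweep variant, under the assumption that we have one sorted position list per keyword: all occurrences are merged by position, then a window is swept across them, reporting each covering interval that cannot be shrunk from the left. The merge dominates, giving the O(n log n) bound cited above; names and the example are illustrative.

```python
from heapq import merge

def covering_windows(occ_lists):
    """Plane-sweep search for keyword-covering windows (illustrative).

    occ_lists holds one sorted position list per keyword. The n
    occurrences are merged by position, then swept with a window;
    a window is reported whenever shrinking it from the left would
    lose a keyword, so each reported interval is left-minimal.
    """
    k = len(occ_lists)
    events = list(merge(*[[(p, kw) for p in lst]
                          for kw, lst in enumerate(occ_lists)]))
    counts, windows, left, covered = [0] * k, [], 0, 0
    for pos, kw in events:
        if counts[kw] == 0:
            covered += 1
        counts[kw] += 1
        while covered == k:
            lpos, lkw = events[left]
            counts[lkw] -= 1
            if counts[lkw] == 0:
                covered -= 1
                windows.append((lpos, pos))  # left-minimal covering window
            left += 1
    return windows

# Positions of three keywords in one document; (40, 42) is the tightest.
print(covering_windows([[1, 40], [5, 42], [20, 41]]))
```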