Chinese information retrieval using Lemur: NTCIR-5 CIR experiments at UNT (original) (raw)
Related papers
Applying Multiple Characteristics and Techniques in the NICT Information Retrieval System at NTCIR-6
2004
Our information retrieval system takes advantage of numerous characteristics of information and uses numerous sophisticated techniques. It uses Robertson's 2-Poisson model and Rocchio's formula, both of which are known to be effective. Characteristics of newspapers such as locational information are used. We present our application of Fujita's method, where longer terms are used in retrieval by the system but de-emphasized relative to the emphasis on the shortest terms. This allows us to use both compound and single-word terms. The statistical test used in expanding queries through an automatic feedback process is described. The method gives us terms that have been statistically shown to be related to the top-ranked documents obtained in the first retrieval. We also use a numerical term, QIDF, which is an IDF term for queries. QIDF decreases the scores for stop words that occur in many queries. It can be very useful for foreign languages for which we cannot determine stop words. We also use web-based unknown word translation for bilingual information retrieval. We participated in two monolingual information retrieval tasks (Korean and Japanese) and five bilingual information retrieval tasks (Chinese-Japanese, English-Japanese, Japanese-Korean, Korean-Japanese, and English-Korean) at NTCIR-6. We obtained good results in all the tasks.
A Hybrid Chinese Information Retrieval Model
Lecture Notes in Computer Science, 2010
A distinctive feature of Chinese test is that a Chinese document is a sequence of Chinese with no space or boundary between Chinese words. This feature makes Chinese information retrieval more difficult since a retrieved document which contains the query term as a sequence of Chinese characters may not be really relevant to the query since the query term (as a sequence Chinese characters) may not be a valid Chinese word in that documents. On the other hand, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose a hybrid Chinese information retrieval model by incorporating wordbased techniques with the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents based on the relevancy to the query calculated by combining character-based ranking and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant if it incorporates only Chinese segmentation with the traditional character-based approach.
Abstract: Our information retrieval system takes advantageof numerous characteristics of the information and appliesnumerous sophisticated techniques. Robertson's2-Poisson model and Rocchio's formula, both of whichare known to be effective, have been applied in the system.Characteristics of newspapers such as locationalinformation were applied. We present our applicationof Fujita's method, where longer terms are used in retrievalby the system but de-emphasized relative to theemphasis on the...
Technical issues in building an information retrieval system for chinese
1996
Information retrieval in a foreign language requires modification to text and user interfaces. Stemming, word boundary identification, punctuation and stopword identificdation must all be modified; appropriate input and presentation methods must be provided. But once these interface issues are resolved the retrieval model and enhancement techniques operate equally effectively in all the languages we have worked with.
Applying multiple characteristics and techniques in the NICT information retrieval system in NTCIR-5
Our information retrieval system takes advantage of numerous characteristics of information and uses numerous sophisticated techniques. It uses Robert-son's 2-Poisson model and Rocchio's formula, both of which are known to be effective. Characteristics of newspapers such as locational information are used. We present our application of Fujita's method, where longer terms are used in retrieval by the system but de-emphasized relative to the emphasis on the short-est terms. This allows us to use both compound and single-word terms. The statistical test used in expand-ing queries through an automatic feedback process is described. The method gives us terms that have been statistically shown to be related to the top-ranked doc-uments obtained in the first retrieval. We also use a numerical term, QIDF, which is an IDF term for queries. QIDF decreases the scores for stop words that occur in many queries. It can be very useful for foreign languages for which we cannot determine...
2004
Our information retrieval system takes advantage of numerous characteristics of information and uses numerous sophisticated techniques. It uses Robertson's 2-Poisson model and Rocchio's formula, both of which are known to be effective. Characteristics of newspapers such as locational information are used. We present our application of Fujita's method, where longer terms are used in retrieval by the system but de-emphasized relative to the emphasis on the shortest terms. This allows us to use both compound and single-word terms. The statistical test used in expanding queries through an automatic feedback process is described. The method gives us terms that have been statistically shown to be related to the top-ranked documents obtained in the first retrieval. We also use a numerical term, QIDF, which is an IDF term for queries. QIDF decreases the scores for stop words that occur in many queries. It can be very useful for foreign languages for which we cannot determine stop words. We also use web-based unknown word translation for bilingual information retrieval. We participated in two monolingual information retrieval tasks (Korean and Japanese) and five bilingual information retrieval tasks (Chinese-Japanese, English-Japanese, Japanese-Korean, Korean-Japanese, and English-Korean) at NTCIR-6. We obtained good results in all the tasks.
At the NTCIR-4 workshop, Justsystem Corporation (JSC) and Clairvoyance Corporation (CC) collaborated in the cross-language retrieval task (CLIR). Our goal was to evaluate the performance and robustness of our recently developed commercial-grade CLIR systems for English and Asian languages. The main contribution of this article is the investigation of different strategies, their interactions in both monolingual and bilingual retrieval tasks, and their respective contributions to operational retrieval systems in the context of NTCIR-4. We report results of Japanese and English monolingual retrieval and results of Japanese-to-English bilingual retrieval. In monolingual retrieval analysis, we examine two special properties of the NTCIR experimental design (two levels of relevance and identical queries in multiple languages) and explore how they interact with strategies of our retrieval system, including pseudo-relevance feedback, multi-word term down-weighting, and term weight merging strategies. Our analysis shows that the choice of language (English or Japanese) does not have a significant impact on retrieval performance. Query expansion is slightly more effective with relaxed judgments than with rigid judgments. For better retrieval performance, weights of multi-word terms should be lowered. In the bilingual retrieval analysis, we aim to identify robust strategies that are effective when used alone and when used in combination with other strategies. We examine cross-lingual specific strategies such as translation disambiguation and translation structuring, as well as general strategies such as pseudo-relevance feedback and multi-word term down-weighting. For shorter title topics, pseudo-relevance feedback is a major performance enhancer, but translation structuring affects retrieval performance negatively when used alone or in combination with other strategies. All experimented strategies improve retrieval performance for the longer description topics, with pseudo-relevance feedback and translation structuring as the major contributors.
Evaluation of Information Retrieval Systems: Test Collections and Evaluation Workshops
The crucial role of the evaluation in the development of the information retrieval tools is useful evidence to improve the performance of these tools and the quality of results that they return. However, the classic evaluation approaches have limitations and shortcomings especially regarding to the user consideration, the measure of the adequacy between the query and the returned documents and the consideration of characteristics, specifications and behaviors of the search tool. Therefore, we believe that the exploitation of contextual elements could be a very good way to evaluate the search tools. So, this paper presents a new approach that takes into account the context during the evaluation process at three complementary levels. The experiments gives at the end of this article has shown the applicability of the proposed approach to real research tools.