Technical issues in building an information retrieval system for Chinese
Related papers
Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '96, 1996
In previous studies, Chinese text retrieval has often been dealt with on a character basis. This approach is not suited to complex queries. We suggest that Chinese text retrieval should work with words instead of characters. The crucial problem is to segment originally continuous Chinese texts into words. In this paper, we first propose a hybrid segmentation approach which unifies the commonly used approaches. The system SMART is then adapted to index the segmented Chinese texts. Finally, we suggest that Chinese text retrieval should move further to include a thesaurus in order to cope with the rich vocabulary of Chinese.
Chinese information extraction and retrieval
Proceedings of a workshop on held at Vienna, Virginia May 6-8, 1996 -, 1996
This paper provides a summary of the following topics: 1. what was learned from porting the INQUERY information retrieval engine and the INFINDER term finder to Chinese; 2. experiments at the University of Massachusetts evaluating INQUERY performance on Chinese newswire (Xinhua); 3. what was learned from porting selected components of PLUM to Chinese; 4. experiments evaluating the POST part-of-speech tagger and named entity recognition on Chinese; 5. program issues in technology development.
A Hybrid Chinese Information Retrieval Model
Lecture Notes in Computer Science, 2010
A distinctive feature of Chinese text is that a Chinese document is a sequence of Chinese characters with no space or boundary between Chinese words. This feature makes Chinese information retrieval more difficult: a retrieved document that contains the query term as a sequence of Chinese characters may not be relevant to the query, because the query term (as a sequence of Chinese characters) may not be a valid Chinese word in that document. On the other hand, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose a hybrid Chinese information retrieval model that incorporates word-based techniques with the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents by their relevance to the query, calculated by combining character-based ranking and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant if only Chinese segmentation is incorporated with the traditional character-based approach.
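The abstract does not say how the character-based and word-based rankings are combined. As a rough illustration only, the sketch below assumes a simple linear interpolation of two per-document score tables; the function name `combine_scores`, the weight `alpha`, and the sample scores are all hypothetical and are not taken from the paper.

```python
def combine_scores(char_scores, word_scores, alpha=0.5):
    """Rank documents by a weighted sum of two score tables.

    char_scores / word_scores map document ids to scores from a
    character-based and a word-based retrieval run (assumed inputs).
    alpha is an assumed interpolation weight for the character-based score.
    """
    doc_ids = set(char_scores) | set(word_scores)
    combined = {
        doc_id: alpha * char_scores.get(doc_id, 0.0)
        + (1.0 - alpha) * word_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first; ties are broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores for three documents.
ranking = combine_scores({"d1": 0.8, "d2": 0.4}, {"d2": 1.0, "d3": 0.6})
```

A document missing from one run simply contributes a score of zero for that run, so documents found by only one of the two methods are still ranked.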
Chinese text retrieval without using a dictionary
ACM SIGIR Forum, 1997
It is generally believed that words, rather than characters, should be the smallest indexing unit for Chinese text retrieval systems, and that it is essential to have a comprehensive Chinese dictionary or lexicon for Chinese text retrieval systems to do well. Chinese text has no delimiters to mark word boundaries. As a result, any text retrieval system that builds word-based indexes needs to segment text into words. We implemented several statistical and dictionary-based word segmentation methods to study the effect on retrieval effectiveness of different segmentation methods using the TREC-5 Chinese test collection and topics. The results show that, for all three sets of queries, the simple bigram indexing and the purely statistical word segmentation perform better than the popular dictionary-based maximum matching method with a dictionary of 138,955 entries.
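To make the contrast concrete, here is a minimal sketch of two of the indexing strategies named above: overlapping character bigrams, which need no dictionary, and greedy forward maximum matching, which does. The function names, the `max_len` limit, and the four-entry toy dictionary are assumptions for illustration; the paper's own implementation and its 138,955-entry dictionary are not reproduced here.

```python
def overlapping_bigrams(text):
    """Return every pair of adjacent characters as an index term."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def forward_maximum_match(text, dictionary, max_len=4):
    """Segment left to right, taking the longest dictionary entry each time.

    max_len is an assumed cap on candidate word length. Characters that
    start no dictionary entry are emitted as single-character words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


toy_dictionary = {"中国", "人民", "银行", "中国人民银行"}  # hypothetical entries
bigrams = overlapping_bigrams("中国人民")
short_match = forward_maximum_match("中国人民银行", toy_dictionary, max_len=4)
long_match = forward_maximum_match("中国人民银行", toy_dictionary, max_len=6)
```

With `max_len=4` the six-character entry is never tried, so the text splits into three two-character words; with `max_len=6` the whole string matches as one word. This sensitivity to the dictionary and to the length limit is one reason dictionary-free bigram indexing is a natural baseline.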
Purpose: The article discusses, at a general methodological level, the different methods that have been used to manage variation in the forms of single keywords over the history of textual information retrieval. The paper offers the reader a practical overall guide for choosing between methods for different types of European languages. The methods compared include stemming, lemmatization, truncation, syllabification, unsupervised morphological methods, character n-gramming and generation of inflected word forms. Methodology/Approach: Based on empirical findings and results achieved by other researchers, the paper discusses the pros and cons of different keyword variation management methods in a broader context than is usual in IR, where normally only the achieved effectiveness results are considered. The study proposes a list of five criteria for comparing conflation methods in general and offers a heuristic for choosing a suitable conflation method for a specific language. Findings: Simpler character-based methods could be preferred in IR over very sophisticated linguistic methods. It is also suggested that for morphologically simple languages, such as English, any kind of keyword variation management may be futile, as the increase in IR effectiveness achieved may be very low. Morphologically more complex languages can be conflated quite effectively with the simple methods for present IR search engines.
Applying Multiple Characteristics and Techniques in the NICT Information Retrieval System at NTCIR-6
2004
Our information retrieval system takes advantage of numerous characteristics of information and uses numerous sophisticated techniques. It uses Robertson's 2-Poisson model and Rocchio's formula, both of which are known to be effective. Characteristics of newspapers such as locational information are used. We present our application of Fujita's method, where longer terms are used in retrieval by the system but de-emphasized relative to the emphasis on the shortest terms. This allows us to use both compound and single-word terms. The statistical test used in expanding queries through an automatic feedback process is described. The method gives us terms that have been statistically shown to be related to the top-ranked documents obtained in the first retrieval. We also use a numerical term, QIDF, which is an IDF term for queries. QIDF decreases the scores for stop words that occur in many queries. It can be very useful for foreign languages for which we cannot determine stop words. We also use web-based unknown word translation for bilingual information retrieval. We participated in two monolingual information retrieval tasks (Korean and Japanese) and five bilingual information retrieval tasks (Chinese-Japanese, English-Japanese, Japanese-Korean, Korean-Japanese, and English-Korean) at NTCIR-6. We obtained good results in all the tasks.
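The abstract names Rocchio's formula but does not give the variant or the weights used. The sketch below shows the textbook form, q' = alpha*q + beta*mean(relevant) - gamma*mean(nonrelevant), over sparse term-weight dictionaries; the function name `rocchio` and the default coefficients are assumptions, and in automatic feedback the "relevant" set is simply the top-ranked documents from the first retrieval.

```python
from collections import defaultdict


def rocchio(query, relevant_docs, nonrelevant_docs=(),
            alpha=1.0, beta=0.75, gamma=0.15):
    """Return an expanded query as a dict of term weights.

    query and each document are dicts mapping terms to weights.
    alpha, beta and gamma are assumed default coefficients.
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the centroid of the relevant set, subtract that of the nonrelevant set.
    for docs, coeff in ((relevant_docs, beta), (nonrelevant_docs, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += coeff * weight / len(docs)
    # Terms whose weight ends up non-positive are dropped from the query.
    return {term: weight for term, weight in new_query.items() if weight > 0}


# Hypothetical one-term query expanded from two top-ranked documents.
expanded = rocchio({"a": 1.0}, [{"a": 1.0, "b": 2.0}, {"b": 2.0}],
                   alpha=1.0, beta=0.5, gamma=0.0)
```

Here the term "b" does not occur in the original query but enters the expanded query because it is prominent in the feedback documents.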
Probability-Based Chinese Text Processing and Retrieval
Computational Intelligence, 2000
We discuss the use of probability-based natural language processing for Chinese text retrieval. We focus on comparing different text extraction methods and probabilistic weighting methods. Several document processing methods and probabilistic weighting functions are presented. A number of experiments have been conducted on large standard text collections. We present the experimental results that compare a word-based text processing method with a character-based method. The experimental results also compare a number of term-weighting functions including both single-unit weighting and compound-unit weighting functions.