A Hybrid Chinese Information Retrieval Model
Related papers
Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '96, 1996
In previous studies, Chinese text retrieval has often been dealt with on a character basis. This approach is not suited to complex queries. We suggest that Chinese text retrieval should work with words instead of characters. The crucial problem is to segment originally continuous Chinese texts into words. In this paper, we first propose a hybrid segmentation approach which unifies the commonly used approaches. The system SMART is then adapted to index the segmented Chinese texts. Finally, we suggest that Chinese text retrieval should move further to include a thesaurus in order to cope with the rich vocabulary of Chinese.
Chinese text retrieval without using a dictionary
ACM SIGIR Forum, 1997
It is generally believed that words, rather than characters, should be the smallest indexing unit for Chinese text retrieval systems, and that it is essential to have a comprehensive Chinese dictionary or lexicon for Chinese text retrieval systems to do well. Chinese text has no delimiters to mark word boundaries. As a result, any text retrieval system that builds word-based indexes needs to segment text into words. We implemented several statistical and dictionary-based word segmentation methods to study the effect of different segmentation methods on retrieval effectiveness, using the TREC-5 Chinese test collection and topics. The results show that, for all three sets of queries, simple bigram indexing and purely statistical word segmentation perform better than the popular dictionary-based maximum matching method with a dictionary of 138,955 entries.
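As a rough illustration of two of the indexing strategies compared in this abstract, the sketch below contrasts overlapping character bigrams with greedy forward maximum matching against a dictionary. It is not the authors' implementation: the function names, the `max_len` parameter, and the four-entry toy lexicon are assumptions made for this example, and the purely statistical segmenter mentioned in the abstract is not shown.

```python
def char_bigrams(text):
    """Return overlapping character bigrams (no dictionary needed)."""
    if len(text) < 2:
        # A lone character is kept as its own indexing unit.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest-match segmentation.

    `lexicon` is a set of known words; `max_len` (an assumed cap)
    bounds the longest candidate tried at each position.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, far smaller than a real dictionary.
toy_lexicon = {"中文", "信息", "检索", "信息检索"}
sample = "中文信息检索"

print(char_bigrams(sample))                    # ['中文', '文信', '信息', '息检', '检索']
print(forward_max_match(sample, toy_lexicon))  # ['中文', '信息检索']
```

Bigram indexing produces some units that are not words (such as 文信), but it never misses a two-character word, whereas maximum matching depends entirely on what the dictionary contains: with an empty lexicon it degrades to single characters.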
Effects of Term Segmentation on Chinese/English Cross-Language Information Retrieval
1999
The majority of recent Cross-Language Information Retrieval (CLIR) research has focused on European languages. CLIR problems that involve East Asian languages such as Chinese introduce additional challenges, because written Chinese texts lack boundaries between terms. The paper examines three Chinese segmentation techniques in combination with two variants of dictionary-based Chinese-to-English query translation. The results indicate that failure to segment terms, particularly technical terms and names, can have a cascading effect that reduces retrieval effectiveness. Task-tuned segmentation algorithms and alternative term weighting strategies are suggested as productive directions for future work.
Technical issues in building an information retrieval system for Chinese
1996
Information retrieval in a foreign language requires modifications to text processing and user interfaces. Stemming, word boundary identification, punctuation, and stopword identification must all be modified; appropriate input and presentation methods must be provided. But once these interface issues are resolved, the retrieval model and enhancement techniques operate equally effectively in all the languages we have worked with.
Using self-supervised word segmentation in Chinese information retrieval
Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '02, 2002
We propose a self-supervised word-segmentation technique for Chinese information retrieval. This method combines the advantages of traditional dictionary-based approaches with those of character-based approaches, while overcoming many of their shortcomings. Experiments on TREC data show comparable performance to both the dictionary-based and the character-based approaches. However, our method is language independent and unsupervised, which provides a promising avenue for constructing accurate multilingual information retrieval systems that are flexible and adaptive.
Chinese information extraction and retrieval
Proceedings of a workshop held at Vienna, Virginia, May 6-8, 1996
This paper provides a summary of the following topics: 1. what was learned from porting the INQUERY information retrieval engine and the INFINDER term finder to Chinese; 2. experiments at the University of Massachusetts evaluating INQUERY performance on Chinese newswire (Xinhua); 3. what was learned from porting selected components of PLUM to Chinese; 4. experiments evaluating the POST part-of-speech tagger and named entity recognition on Chinese; 5. program issues in technology development.
Chinese information retrieval based on related term group
2005
This paper describes our work at the fifth NTCIR workshop on the subtasks of monolingual information retrieval (IR). Query expansion using automatically acquired related term groups was explored. Unlike traditional query expansion methods, our method combines related term groups extracted from web-based corpora with related terms extracted from the document set to improve the effectiveness of query expansion. Experiments show that our method achieves an average improvement of 13.1% compared to the traditional relevance feedback technique.
Probability-Based Chinese Text Processing and Retrieval
Computational Intelligence, 2000
We discuss the use of probability-based natural language processing for Chinese text retrieval. We focus on comparing different text extraction methods and probabilistic weighting methods. Several document processing methods and probabilistic weighting functions are presented. A number of experiments have been conducted on large standard text collections. We present the experimental results that compare a word-based text processing method with a character-based method. The experimental results also compare a number of term-weighting functions including both single-unit weighting and compound-unit weighting functions.