Technical issues in building an information retrieval system for Chinese

On Chinese text retrieval

Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '96, 1996

In previous studies, Chinese text retrieval has often been dealt with on a character basis. This approach is not suited to complex queries. We suggest that Chinese text retrieval should work with words instead of characters. The crucial problem is to segment originally continuous Chinese texts into words. In this paper, we first propose a hybrid segmentation approach which unifies the commonly used approaches. The system SMART is then adapted to index the segmented Chinese texts. Finally, we suggest that Chinese text retrieval should move further to include a thesaurus in order to cope with the rich vocabulary of Chinese.

Chinese information extraction and retrieval

Proceedings of a workshop on held at Vienna, Virginia May 6-8, 1996 -, 1996

This paper provides a summary of the following topics: 1. what was learned from porting the INQUERY information retrieval engine and the INFINDER term finder to Chinese; 2. experiments at the University of Massachusetts evaluating INQUERY performance on Chinese newswire (Xinhua); 3. what was learned from porting selected components of PLUM to Chinese; 4. experiments evaluating the POST part-of-speech tagger and named entity recognition on Chinese; 5. program issues in technology development.

A Hybrid Chinese Information Retrieval Model

Lecture Notes in Computer Science, 2010

A distinctive feature of Chinese text is that a Chinese document is a sequence of Chinese characters with no space or boundary between Chinese words. This feature makes Chinese information retrieval more difficult: a retrieved document that contains the query term as a sequence of Chinese characters may not really be relevant to the query, because the query term (as a sequence of Chinese characters) may not be a valid Chinese word in that document. On the other hand, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose a hybrid Chinese information retrieval model by incorporating word-based techniques with the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents based on their relevance to the query, calculated by combining character-based ranking and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant if it incorporates only Chinese segmentation with the traditional character-based approach.
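The abstract does not say how the two rankings are combined. As a purely illustrative sketch, one common way to merge a character-based score and a word-based score is linear interpolation. The function name `combine_scores`, the `weight` parameter, and the toy scores below are assumptions made for this example and are not taken from the paper.

```python
def combine_scores(char_scores, word_scores, weight=0.5):
    """Merge two score maps by linear interpolation (an assumed scheme).

    weight is the share given to the word-based score; the remainder goes
    to the character-based score. Documents missing from one map score 0 there.
    """
    combined = {}
    for doc_id in set(char_scores) | set(word_scores):
        c = char_scores.get(doc_id, 0.0)
        w = word_scores.get(doc_id, 0.0)
        combined[doc_id] = weight * w + (1.0 - weight) * c
    # Highest combined score first; ties broken by document id for determinism.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Toy scores, invented for illustration only.
char_scores = {"d1": 0.9, "d2": 0.4, "d3": 0.2}
word_scores = {"d2": 0.8, "d3": 0.1}
ranking = combine_scores(char_scores, word_scores, weight=0.5)
```

With these toy inputs, a document that matches only at the character level (`d1`) drops below one that matches at both levels (`d2`), which is the kind of effect a hybrid ranking is meant to produce.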

Chinese text retrieval without using a dictionary

ACM SIGIR Forum, 1997

It is generally believed that words, rather than characters, should be the smallest indexing unit for Chinese text retrieval systems, and that it is essential to have a comprehensive Chinese dictionary or lexicon for Chinese text retrieval systems to do well. Chinese text has no delimiters to mark word boundaries. As a result, any text retrieval system that builds word-based indexes needs to segment text into words. We implemented several statistical and dictionary-based word segmentation methods to study the effect on retrieval effectiveness of different segmentation methods using the TREC-5 Chinese test collection and topics. The results show that, for all three sets of queries, the simple bigram indexing and the purely statistical word segmentation perform better than the popular dictionary-based maximum matching method with a dictionary of 138,955 entries.
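To make the two indexing strategies concrete, the following minimal sketch contrasts overlapping bigram indexing with forward maximum matching against a dictionary. The four-entry dictionary, the `max_len` limit, and the function names are assumptions for illustration; the paper's own implementations and its 138,955-entry dictionary are not reproduced here.

```python
def overlapping_bigrams(text):
    """Return every pair of adjacent characters; a lone character is kept as is."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def forward_maximum_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary entry
    starting at each position, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary and sentence, invented for illustration only.
toy_dictionary = {"中文", "信息", "检索", "信息检索"}
sentence = "中文信息检索"
bigrams = overlapping_bigrams(sentence)
words = forward_maximum_match(sentence, toy_dictionary)
```

Bigram indexing needs no lexicon at all, while maximum matching depends entirely on dictionary coverage: any word missing from the dictionary is broken into single characters.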

Managing word form variation of text retrieval in practice Why language technology is not the only cure for better IR performance?

Purpose: The article discusses, on a general methodological level, different methods that have been used to manage single keyword form variation in information retrieval during the history of textual information retrieval. The paper offers the reader an overall practical guide for choosing between different methods to be used for different types of European languages. Methods compared in the paper include stemming, lemmatization, truncation, syllabification, unsupervised morphological methods, character n-gramming and generation of inflected word forms. Methodology/Approach: Based on the empirical findings and results achieved by other researchers, the paper discusses several pros and cons of different keyword variation management methods in a broader context than is usual in IR, where only achieved effectiveness results are normally considered. The study proposes a list of five criteria for comparison of the conflation methods in general and offers a heuristic for choosing a suitable method for conflation of a specific language. Findings: Simpler character-based methods could be preferred in IR instead of very sophisticated linguistic methods. It is also suggested that for morphologically simple languages, such as English, any kind of keyword variation management may be futile, as the increase in IR effectiveness achieved may be very low. Morphologically more complex languages can be conflated with the simple methods quite effectively for present IR search engines.

Applying Multiple Characteristics and Techniques in the NICT Information Retrieval System at NTCIR-6

2004

Our information retrieval system takes advantage of numerous characteristics of information and uses numerous sophisticated techniques. It uses Robertson's 2-Poisson model and Rocchio's formula, both of which are known to be effective. Characteristics of newspapers such as locational information are used. We present our application of Fujita's method, where longer terms are used in retrieval by the system but de-emphasized relative to the emphasis on the shortest terms. This allows us to use both compound and single-word terms. The statistical test used in expanding queries through an automatic feedback process is described. The method gives us terms that have been statistically shown to be related to the top-ranked documents obtained in the first retrieval. We also use a numerical term, QIDF, which is an IDF term for queries. QIDF decreases the scores for stop words that occur in many queries. It can be very useful for foreign languages for which we cannot determine stop words. We also use web-based unknown word translation for bilingual information retrieval. We participated in two monolingual information retrieval tasks (Korean and Japanese) and five bilingual information retrieval tasks (Chinese-Japanese, English-Japanese, Japanese-Korean, Korean-Japanese, and English-Korean) at NTCIR-6. We obtained good results in all the tasks.
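For orientation, Rocchio's formula in its common textbook form is given below. The abstract does not state which variant or which parameter values the system uses, so the symbols here are the standard ones and should be read as an assumption: $\vec{q}_0$ is the original query vector, $D_r$ and $D_{nr}$ are the sets of documents treated as relevant and non-relevant, and $\alpha$, $\beta$, $\gamma$ are tunable weights.

```latex
\vec{q}_m = \alpha\,\vec{q}_0
          + \beta\,\frac{1}{|D_r|}\sum_{\vec{d}\in D_r}\vec{d}
          - \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec{d}\in D_{nr}}\vec{d}
```

In automatic (pseudo-relevance) feedback, $D_r$ is typically taken to be the top-ranked documents from the first retrieval, and $\gamma$ is often set to zero because no non-relevant set is known.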

Applying multiple characteristics and techniques to obtain high levels of performance in information retrieval

2004

Probability-Based Chinese Text Processing and Retrieval

Computational Intelligence, 2000

We discuss the use of probability-based natural language processing for Chinese text retrieval. We focus on comparing different text extraction methods and probabilistic weighting methods. Several document processing methods and probabilistic weighting functions are presented. A number of experiments have been conducted on large standard text collections. We present the experimental results that compare a word-based text processing method with a character-based method. The experimental results also compare a number of term-weighting functions including both single-unit weighting and compound-unit weighting functions.

Improving Information Retrieval Systems' Efficiency

International Journal of Engineering Research & Technology (IJERT) , 2022

This paper proposes a new stemming algorithm for rooting Arabic words and attempts to solve the polymorphism problem of the word itself by returning it to its root. The proposed algorithm is based on introducing new pattern rules that increase the efficiency of word identification. The algorithm also contributes to enhancing the efficiency and speed of information retrieval in search engines. Using these rules, it can determine whether a sequence of suffixes is part of the real word or not and remove it. In this research, a new tool was also developed that allows the user to take any dataset written in Arabic and apply the derivation to it to check the new stem. To verify the effectiveness of the proposed algorithm, a derivation accuracy test was carried out by applying the algorithm to various texts, and the results were compared with Khoja's algorithm and a previous algorithm applied to the same data. The results of this research indicated a good improvement in the accuracy of stemming.

Design and Evaluation of Approaches for Automatic Chinese Text

2000

In this paper, we propose and evaluate approaches to categorizing Chinese texts, which consist of term extraction, term selection, term clustering and text classification. We propose a scalable approach which uses frequency counts to identify left and right boundaries of possibly significant terms. We used the combination of term selection and term clustering to reduce the dimension of the vector space to a practical level. While the huge number of possible Chinese terms makes most machine learning algorithms impractical, results obtained in an experiment on a CAN news collection show that the dimension could be dramatically reduced to 1200 while approximately the same level of classification accuracy was maintained using our approach. We also studied and compared the performance of three well-known classifiers, the Rocchio linear classifier, the naive Bayes probabilistic classifier and the k-nearest neighbors (kNN) classifier, when they were applied to categorize Chinese texts. Overall, kNN achieved the best accuracy, about 78.3%, but required large amounts of computation time and memory when used to classify new texts. Rocchio was very time and memory efficient, and achieved a high level of accuracy, about 75.4%. In a practical implementation, Rocchio may be a good choice.
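The efficiency gap between Rocchio and kNN follows from what each stores: a Rocchio-style classifier keeps one prototype vector per class, whereas kNN compares a new text against every training document. The sketch below is a simplified centroid-only variant (no negative-example term) with raw term counts and cosine similarity; the function names and the toy training data are assumptions for illustration, not the paper's setup.

```python
import math
from collections import Counter, defaultdict


def train_centroids(training):
    """Average the term-count vectors of each class into one centroid."""
    sums = defaultdict(Counter)
    counts = Counter()
    for label, terms in training:
        sums[label].update(terms)
        counts[label] += 1
    return {
        label: {term: total / counts[label] for term, total in vec.items()}
        for label, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(centroids, terms):
    """Assign the class whose centroid is most similar to the text."""
    vec = Counter(terms)
    return max(sorted(centroids), key=lambda label: cosine(vec, centroids[label]))


# Toy, already-segmented training texts, invented for illustration only.
training = [
    ("sports", ["比赛", "球队", "比赛"]),
    ("sports", ["球队", "冠军"]),
    ("finance", ["股票", "市场"]),
    ("finance", ["市场", "银行", "股票"]),
]
centroids = train_centroids(training)
predicted = classify(centroids, ["球队", "比赛"])
```

Classifying a new text here costs one similarity computation per class, independent of the number of training documents.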

Using self-supervised word segmentation in Chinese information retrieval

Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '02, 2002

We propose a self-supervised word-segmentation technique for Chinese information retrieval. This method combines the advantages of traditional dictionary based approaches with character based approaches, while overcoming many of their shortcomings. Experiments on TREC data show comparable performance to both the dictionary based and the character based approaches. However, our method is language independent and unsupervised, which provides a promising avenue for constructing accurate multilingual information retrieval systems that are flexible and adaptive.

Chinese-Japanese Cross Language Information Retrieval: A Han Character Based Approach

Proceedings of the Acl 2000 Workshop on Word Senses and Multi Linguality Volume 8, 2000

In this paper, we investigate cross language information retrieval (CLIR) for Chinese and Japanese texts utilizing the Han characters, the common ideographs used in writing the Chinese, Japanese and Korean (CJK) languages. The Unicode encoding scheme, which encodes the superset of Han characters, is used as a common encoding platform to deal with the multilingual collection in a uniform manner. We discuss the importance of Han character semantics in document indexing and retrieval of the ideographic languages. We also analyse the baseline results of cross language information retrieval using the common Han characters that appear in both Chinese and Japanese texts.

Japanese-Chinese Cross-Language Information Retrieval: An Interlingua Approach

Int. J. Comput. Linguistics Chin. Lang. Process., 2000

Electronically available multilingual information can be divided into two major categories: (1) alphabetic language information (English-like alphabetic languages) and (2) ideographic language information (Chinese-like ideographic languages). The information available in non-English alphabetic languages as well as in ideographic languages (especially, in Japanese and Chinese) is growing at an incredibly high rate in recent years. Due to the ideographic nature of Japanese and Chinese, complicated with the existence of several encoding standards in use, efficient processing (representation, indexing, retrieval, etc.) of such information became a tedious task. In this paper, we propose a Han Character (Kanji) oriented Interlingua model of indexing and retrieving Japanese and Chinese information. We report the results of mono- and cross- language information retrieval on a Kanji space where documents and queries are represented in terms of Kanji oriented vectors. We also employ a dimens...

Large-Vocabulary Chinese Text/speech Information Retrieval Using Mandarin Speech Queries

Network technology and the Internet are creating a completely new information era. It is believed that in the near future numerous digital libraries and a great variety of multimedia databases, which consist of heterogeneous types of information including text, audio, image, video and so on, will be available worldwide via the Internet. This paper deals with the problem of Chinese text and Mandarin speech information retrieval with Mandarin speech queries. Instead of using the syllable-based information alone, the word-based information was also successfully incorporated to further improve the retrieval performance. A prototype system with an interface supporting some user-friendly functions was successfully implemented and the initial test results verified the feasibility of our approaches.

Applying multiple characteristics and techniques in the NICT information retrieval system in NTCIR-5

Our information retrieval system takes advantage of numerous characteristics of information and uses numerous sophisticated techniques. It uses Robertson's 2-Poisson model and Rocchio's formula, both of which are known to be effective. Characteristics of newspapers such as locational information are used. We present our application of Fujita's method, where longer terms are used in retrieval by the system but de-emphasized relative to the emphasis on the shortest terms. This allows us to use both compound and single-word terms. The statistical test used in expanding queries through an automatic feedback process is described. The method gives us terms that have been statistically shown to be related to the top-ranked documents obtained in the first retrieval. We also use a numerical term, QIDF, which is an IDF term for queries. QIDF decreases the scores for stop words that occur in many queries. It can be very useful for foreign languages for which we cannot determine...

Word Stemming Algorithms and Retrieval Effectiveness in Malay and Arabic Documents Retrieval Systems

2007

Document retrieval in Information Retrieval Systems (IRS) is generally about understanding the information in the documents concerned. The better the system is able to understand the contents of documents, the more effective the retrieval outcomes will be. But understanding the contents is a very complex task. Conventional IRS apply algorithms that can only approximate the meaning of document contents through a keyword approach using the vector space model.

Effects of Term Segmentation on Chinese/English Cross-Language Information Retrieval

1999

The majority of recent Cross-Language Information Retrieval (CLIR) research has focused on European languages. CLIR problems that involve East Asian languages such as Chinese introduce additional challenges, because written Chinese texts lack boundaries between terms. The paper examines three Chinese segmentation techniques in combination with two variants of dictionary-based Chinese to English query translation. The results indicate that failure to segment terms, particularly technical terms and names, can have a cascading effect that reduces retrieval effectiveness. Task-tuned segmentation algorithms and alternative term weighting strategies are suggested as productive directions for future work.