Vietnamese Text Retrieval: Test Collection and First Experimentations

Construction and Analysis of a Large Vietnamese Text Corpus

2016

This paper presents a new Vietnamese text corpus containing around 4.05 billion words, collected from Wikipedia texts, newspaper articles, and random web texts. The paper describes the process of collecting, cleaning, and creating the corpus. Processing Vietnamese text poses several challenges: unlike many Latin-script languages, Vietnamese does not use blanks to separate words, so common tokenizers that treat blanks as word boundaries do not work. A short review of different approaches to Vietnamese tokenization is presented, together with how the corpus was processed and created. Some statistical analyses of the data are then reported, including the number of syllables, average word length, sentence length, and topic analysis. The corpus is integrated into a framework that allows searching and browsing. Using this web interface, users can find out how many times a particular word appears in the corpus, sample sentences…
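The tokenization problem described above can be illustrated with a minimal sketch: whitespace splitting yields syllables, and a greedy longest-match against a word list is one simple way to merge them back into words. The dictionary entries and the example sentence below are toy assumptions, not from the paper.

```python
# Minimal illustration of why whitespace tokenization fails for Vietnamese:
# spaces separate syllables, not words. A greedy longest-match against a
# toy (hypothetical) word list merges syllables back into words.
TOY_DICT = {"học sinh", "giỏi", "là"}  # "học sinh" (student) is one word, two syllables

def segment(sentence, dictionary, max_syllables=3):
    syllables = sentence.split()          # whitespace splitting gives syllables only
    words, i = [], 0
    while i < len(syllables):
        # try the longest syllable span first, fall back to a single syllable
        for n in range(min(max_syllables, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in dictionary:
                words.append(cand)
                i += n
                break
    return words

print(segment("học sinh giỏi", TOY_DICT))  # ['học sinh', 'giỏi'], not three syllables
```

Greedy longest-match is only the simplest baseline; the corpus paper reviews more sophisticated tokenizers.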

Full-Text Search for Thai Information Retrieval Systems

The 5th International …, 2000

While many efficient full-text search algorithms have been developed for English documents, these algorithms can in principle be used for other languages, e.g. Chinese, Japanese, and Thai. However, due to the idiosyncrasies of each individual language, directly applying such algorithms may not be suitable for the language considered. This paper proposes a simplification of the Boyer-Moore algorithm, called BMT, that reduces computation and makes it appropriate for Thai full-text search. To investigate its efficiency, BMT is compared with other search algorithms. Moreover, we apply a syllable-like segmentation, called Thai character clusters (TCCs), to improve search efficiency in Thai documents by grouping Thai characters into inseparable units. TCCs are based on the spelling features of the Thai language. Compared with traditional full-text search methods, this approach improves not only search time and memory consumption but also search accuracy. The experimental results show that search methods using TCCs outperform the traditional methods.
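A rough sketch of the character-cluster idea: marks that can never stand alone are attached to the preceding base character, producing inseparable units. This approximation uses only the Unicode nonspacing-mark category; the published TCC rules are richer (e.g. leading vowels), so treat this as an illustration, not the paper's algorithm.

```python
import unicodedata

# Approximate Thai character clusters: attach each nonspacing mark
# (category 'Mn': above/below vowel signs, tone marks) to the preceding
# base character, yielding units that cannot be split further.
def clusters(text):
    out = []
    for ch in text:
        if out and unicodedata.category(ch) == "Mn":
            out[-1] += ch          # combining mark joins the previous cluster
        else:
            out.append(ch)         # base character starts a new cluster
    return out

print(clusters("กิน"))  # ['กิ', 'น']: the vowel sign joins its consonant
```

Grouping at cluster rather than character level is what lets a matcher skip positions that could never start a valid unit.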

Automatic Searching for English-Vietnamese Documents on the Internet

2012

Bilingual corpora together with machine learning technology can be used to solve problems in natural language processing. In addition, bilingual corpora are useful for mapping linguistic tags of less popular languages, such as Vietnamese, and for studying comparative linguistics. However, Vietnamese corpora still have shortcomings, especially English-Vietnamese bilingual corpora. This paper focuses on a search method for bilingual Internet materials to support building an English-Vietnamese bilingual corpus. Building on existing natural language processing toolkits, the system uses them to search for any English-Vietnamese bilingual document on the Internet without hand-crafted rules. We propose a method for extracting the main content of webpages without prior knowledge of a website's frame or source. Several other natural language processing tools included in our system are English-Vietnamese machine translation, Vietnamese keyword extraction, search engines, and comparison of similar documents. Our experiments show several valuable automatic search results for the US Embassy and Australian Embassy websites.

Character cluster based Thai information retrieval

Proceedings of the fifth …, 2000

Some languages, including Thai, Japanese, and Chinese, do not have explicit word boundaries. This causes word boundary ambiguity, which decreases the accuracy of information retrieval. This paper proposes a new technique, called character clustering, to reduce word boundary ambiguity in Thai documents and hence improve search efficiency. To investigate its efficiency, a set of experiments on Thai newspapers is conducted with both non-indexed and indexed search approaches. The experimental results show that our method outperforms the traditional methods in both approaches on all measures.

Recognizing and Tagging Vietnamese Words Based on Statistics and Word Order Patterns

In Vietnamese sentences, function words and word order patterns (WOPs) identify the semantic meaning and the grammatical word classes. We study the most popular WOPs and find candidates for new Vietnamese words (NVWs) based on the phrase and word segmentation algorithm of [7]. The best WOPs, used for recognizing and tagging NVWs, are chosen based on the support and confidence concepts; these concepts are also used to examine whether a word belongs to a word class. Our experiments were run over a large corpus containing more than 50 million sentences. Four sets of WOPs are studied for recognizing and tagging nouns, verbs, adjectives, and pronouns. Our new dictionary contains 6,385 NVWs, including 2,791 new noun taggings, 1,436 new verb taggings, 682 new adjective taggings, and 1,476 new pronoun taggings.
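Support and confidence here are meant in the association-rule sense: how often a pattern occurs at all, and how often the pattern's slot word really carries the predicted class. A minimal sketch with invented toy counts (the pattern, counts, and threshold are assumptions, not the paper's data):

```python
# Illustrative support/confidence for a word-order pattern (WOP),
# in the association-rule sense. All counts below are toy numbers.
def support(pattern_count, total_sentences):
    # fraction of sentences in which the pattern occurs at all
    return pattern_count / total_sentences

def confidence(pattern_with_class_count, pattern_count):
    # fraction of pattern occurrences where the slot word has the class
    return pattern_with_class_count / pattern_count

# Hypothetical pattern "những + X" suggesting X is a noun:
total, with_pattern, pattern_and_noun = 1_000_000, 40_000, 36_000
print(support(with_pattern, total))                # 0.04
print(confidence(pattern_and_noun, with_pattern))  # 0.9
```

A pattern is kept for tagging only if both values clear chosen thresholds, which filters out rare or unreliable WOPs.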

A testbed for Indonesian text retrieval

Proceedings of the 9th …, 2004

Indonesia is the fourth most populous country and a close neighbour of Australia. However, despite media and intelligence interest in Indonesia, little work has been done on evaluating Information Retrieval techniques for Indonesian, and no standard testbed exists for such a purpose. An effective testbed should include a collection of documents, realistic queries, and relevance judgements. The TREC and TDT testbeds have provided such an environment for the evaluation of English, Mandarin, and Arabic text retrieval techniques. The NTCIR testbed provides a similar environment for Chinese, Korean, Japanese, and English. This paper describes an Indonesian TREC-like testbed we have constructed and made available for the evaluation of ad hoc retrieval techniques. To illustrate how the test collection is used, we briefly report the effect of stemming for Indonesian text retrieval, showing, similarly to English, that it has little effect on accuracy.

Using search engine to construct a scalable corpus for Vietnamese lexical development for word segmentation

Proceedings of the 7th Workshop on Asian Language Resources - ALR7, 2009

As web content becomes more accessible to the Vietnamese community across the globe, there is a need to process Vietnamese query texts properly to find relevant information. The recent deployment of a Vietnamese translation tool on a well-known search engine underlines the language's growing presence on the World Wide Web. There are still problems in the translation and retrieval of Vietnamese, as word recognition for the language is not fully solved. In this paper we introduce a semi-supervised approach to building a general, scalable web corpus for Vietnamese using a search engine to facilitate the word segmentation process. Moreover, we propose a segmentation algorithm that effectively recognizes Out-Of-Vocabulary (OOV) words. The results indicate that our solution is scalable and can be applied to real-time translation programs and other linguistic applications. This work is a continuation of the work of Nguyen D. (2008).
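One simple way to surface OOV candidates, sketched under assumptions (the paper's actual algorithm and vocabulary are not reproduced here): after dictionary segmentation, runs of consecutive unknown syllables are merged into a single candidate that a search engine could then validate, e.g. by hit counts.

```python
# Sketch of the OOV idea: runs of syllables missing from the vocabulary
# are merged into one Out-Of-Vocabulary candidate for later validation
# (e.g. against search-engine hit counts, not performed here).
VOCAB = {"tôi", "thích", "ăn"}  # toy vocabulary, hypothetical entries

def oov_candidates(tokens, vocab):
    cands, run = [], []
    for tok in tokens + [None]:       # sentinel flushes the final run
        if tok is not None and tok not in vocab:
            run.append(tok)
        else:
            if run:
                cands.append(" ".join(run))
                run = []
    return cands

print(oov_candidates("tôi thích ăn phở bò".split(), VOCAB))  # ['phở bò']
```

Merging adjacent unknowns matters because a new Vietnamese word typically spans several syllables that are individually absent from the lexicon.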

Characteristics and retrieval effectiveness of n-gram string similarity matching on Malay documents

Proceedings of the 10th WSEAS international …, 2011

There have been very few studies of conflation algorithms for indexing and retrieving Malay documents, compared with English. The two main classes of conflation algorithms are string-similarity algorithms and stemming algorithms. There is only one existing Malay stemming algorithm, and it provides a benchmark for the following experiments using n-gram string-similarity algorithms, in particular bigrams and trigrams, on the same Malay queries and documents. Inherent characteristics of n-grams and several variations of experiments on the queries and documents are discussed. The variations are: non-stemmed queries and documents; stemmed queries with non-stemmed documents; and stemmed queries and documents. Further experiments are then carried out by removing the most frequently occurring n-grams. The Dice coefficient is used as a threshold and weight in ranking the retrieved documents. Besides Dice coefficients, inverse document frequency (idf) weights are also used to rank documents. Interpolation and standard recall-precision functions are used to calculate recall-precision values, which are then compared with the available recall-precision values of the only Malay stemming algorithm.
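The Dice coefficient over character n-grams can be sketched in a few lines; bigrams (n=2) are shown, with set semantics so repeated n-grams count once. The example strings are the standard "night"/"nacht" illustration, not from the paper's Malay data.

```python
# Dice similarity over character n-grams, as used in string-similarity
# retrieval: dice(A, B) = 2|A ∩ B| / (|A| + |B|) over the n-gram sets.
def ngrams(s, n=2):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def dice(a, b, n=2):
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(dice("night", "nacht"))  # 0.25: only 'ht' is shared, 2*1 / (4+4)
```

Because no stemmer is needed, this kind of matching is attractive for languages like Malay where stemming resources are scarce.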