A co-classification approach to learning from multilingual corpora

Cost-effective Cross-lingual Document Classification

2004

This article addresses how to perform text categorization when the documents to be classified are written in different languages. The figures we provide demonstrate that cross-lingual classification, where a classifier is trained on one language and tested on another, is feasible provided we translate a small number of words: the most relevant terms for class profiling. The experiments we report demonstrate that translating these most relevant words is a cost-effective approach to cross-lingual classification.
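The idea of translating only the most relevant class-profile terms can be illustrated with a minimal Python sketch. The abstract does not say how terms are ranked or how documents are scored, so the frequency-ratio score, the overlap-count classifier, the toy corpus, and the small `en_es` dictionary below are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter

def class_profiles(train_docs, k=3):
    """Return {label: set of the k most class-specific terms} from (tokens, label) pairs."""
    per_class = {}
    for tokens, label in train_docs:
        per_class.setdefault(label, Counter()).update(tokens)
    profiles = {}
    for label, counts in per_class.items():
        def score(term):
            # Assumed relevance score: frequency in this class relative to the other classes.
            others = sum(c[term] for l, c in per_class.items() if l != label)
            return counts[term] / (1 + others)
        ranked = sorted(counts, key=lambda t: (-score(t), t))
        profiles[label] = set(ranked[:k])
    return profiles

def translate_profiles(profiles, dictionary):
    """Translate only the profile terms; everything else stays untranslated."""
    return {label: {dictionary[t] for t in terms if t in dictionary}
            for label, terms in profiles.items()}

def classify(tokens, profiles):
    """Assign the label whose profile shares the most terms with the document."""
    overlap = {label: sum(1 for t in tokens if t in terms)
               for label, terms in profiles.items()}
    return max(sorted(overlap), key=overlap.get)

# Toy English training data (hypothetical).
train = [
    (["match", "goal", "team", "the"], "sport"),
    (["team", "goal", "coach", "the"], "sport"),
    (["market", "stock", "bank", "the"], "finance"),
    (["bank", "stock", "price", "the"], "finance"),
]
# Hypothetical bilingual dictionary covering only the six profile terms.
en_es = {"goal": "gol", "team": "equipo", "coach": "entrenador",
         "bank": "banco", "stock": "acciones", "market": "mercado"}

es_profiles = translate_profiles(class_profiles(train, k=3), en_es)
print(classify(["el", "equipo", "marcó", "un", "gol"], es_profiles))  # sport
print(classify(["el", "banco", "vende", "acciones"], es_profiles))    # finance
```

Only six words are translated here, which is the point of the sketch: the translation cost scales with the size of the class profiles rather than with the size of the corpus.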

Cross language text categorization by acquiring multilingual domain models from comparable corpora

Proceedings of the ACL Workshop on Building and Using Parallel Texts - ParaText '05, 2005

In a multilingual scenario, the classical monolingual text categorization (TC) problem can be reformulated as a cross-language TC task, in which we have to cope with two or more languages (e.g. English and Italian). In this setting, the system is trained using labeled examples in a source language (e.g. English), and it classifies documents in a different target language (e.g. Italian).

Using Information from the Target Language to Improve Crosslingual Text Classification

Lecture Notes in Computer Science, 2010

Crosslingual text classification consists of exploiting labeled documents in a source language to classify documents in a different target language. In addition to the evident translation problem, this task also faces difficulties caused by cultural discrepancies between the two languages, which manifest as different topic distributions. Such discrepancies make the classifier unreliable for the categorization task. To tackle this problem, we propose to improve classification performance by using information embedded in the target dataset itself. The central idea of the proposed approach is that similar documents must belong to the same category. Therefore, it classifies documents by considering not only their own content but also the categories assigned to other similar documents from the same target dataset. Experimental results on three different languages support the appropriateness of the proposed approach.
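The "similar documents must belong to the same category" idea can be sketched as a relabeling pass over the target dataset. This is only an illustration under stated assumptions: the abstract does not specify the similarity measure or how neighbor information is combined, so the bag-of-words cosine, the majority vote, the function name `refine`, and the toy Spanish documents are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def refine(docs, initial_labels, k=2):
    """Relabel each target document by majority vote over its own initial
    label and the initial labels of its k most similar target documents."""
    vecs = [Counter(d) for d in docs]
    refined = []
    for i, v in enumerate(vecs):
        neighbours = sorted((j for j in range(len(vecs)) if j != i),
                            key=lambda j: -cosine(v, vecs[j]))[:k]
        votes = Counter([initial_labels[i]] + [initial_labels[j] for j in neighbours])
        refined.append(votes.most_common(1)[0][0])
    return refined

# Toy target-language documents (hypothetical). The initial labels stand in for
# the output of a classifier trained on the source language; document 2 is wrong.
docs = [
    ["gol", "equipo", "partido"],
    ["gol", "equipo", "entrenador"],
    ["gol", "partido", "entrenador"],
    ["banco", "acciones", "mercado"],
    ["banco", "mercado", "precio"],
    ["acciones", "precio", "banco"],
]
initial = ["sport", "sport", "finance", "finance", "finance", "finance"]

print(refine(docs, initial))
# ['sport', 'sport', 'sport', 'finance', 'finance', 'finance']
```

Document 2 shares two terms with each of the sport documents and none with the finance documents, so its two nearest neighbors outvote its initial label.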

Cross-lingual text categorization

2003

This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both in the case that a sufficient number of training examples is available for each new language and in the case that no training examples are available for some language. Experimental results for the bilingual classification of the ILO corpus (with documents in English and Spanish) are obtained using bilingual training, terminology translation and profile-based translation.

Multilingual Text Classification through Combination of Monolingual Classifiers

With the trend toward globalization, there is a large number of documents written in different languages. If these polylingual documents are already organized into existing categories, one can build a learning model to classify newly arrived polylingual documents. Although a simple approach is to treat the problem as multiple independent monolingual text classification problems, this approach fails to use the opportunity offered by polylingual training documents to improve the effectiveness of the classifier.

Text categorization on a multi-lingual corpus

2008

This paper presents experiments with a hierarchical text categorizer on a multilingual (English, French) corpus. The results obtained are very similar for both languages. In the near future, these results will allow us to apply cross-language text categorization, which can be used to support automatic translation in order to create a multilingual topic glossary.

Multilingual approaches to text categorisation

… Journal for the …, 2005

In this article we examine three different approaches to categorising documents from multilingual corpora using machine learning algorithms. These approaches satisfy two main conditions: there may be an unlimited number of different languages in the corpus and it is unnecessary to previously identify each document's language. The approaches differ in two main aspects: how documents are pre-processed (using either language-neutral or language-specific techniques) and how many classifiers are employed (either one global or one for each existing language). These approaches were tested on a bilingual corpus provided by a Spanish newspaper that contains articles written in Spanish and Basque. The empirical findings were studied from the point of view of classification accuracy and system performance including execution time and memory usage.

An index-based joint multilingual/cross-lingual text categorization using topic expansion via BabelNet

Turkish Journal of Electrical Engineering & Computer Sciences, 2020

The majority of the state-of-the-art text categorization algorithms are supervised and therefore require prior training. Besides the rigor involved in developing training datasets and the requirement for repetition of training for different texts, working with multilingual texts poses additional unique challenges. One of these challenges is that the developer is required to have many different languages involved. Term expansion such as query expansion has been applied in numerous applications; however, a major drawback of most of these applications is that the actual meaning of terms is not usually taken into consideration. Considering the semantics of terms is necessary because of the polysemous nature of most natural language words. In this paper, as a specific contribution to the document index approach for text categorization, we present a joint multilingual/cross-lingual text categorization algorithm (JointMC) based on semantic term expansion of class topic terms through an optimized knowledge-based word sense disambiguation. The lexical knowledge in BabelNet is used for the word sense disambiguation and expansion of the topics' terms. The categorization algorithm computes the distributed semantic similarity between the expanded class topics and the text documents in the test corpus. We evaluate our categorization algorithm using a multilabel text categorization problem. The multilabel categorization task uses the JRC-Acquis dataset. The JRC-Acquis dataset is based on subject domain classification of the European Commission's EuroVoc microthesaurus. We compare the performance of the classifier with a model of it using the original class topics. Furthermore, we compare the performance of our classifier with two state-of-the-art supervised algorithms (each for multilingual and cross-lingual tasks) using the same dataset. 
Empirical results obtained on five experimental languages show that categorization with expanded topics outperforms categorization with the original topics by a very wide margin. Our algorithm outperforms the existing supervised technique that used the same dataset. Surprisingly, cross-language categorization shows similar performance and is marginally better for some of the languages.
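The effect of topic expansion on an index-based, training-free categorizer can be shown with a small Python sketch. It is not JointMC: the hand-written `EXPANSIONS` table is a hypothetical stand-in for BabelNet lookups with word sense disambiguation, plain bag-of-words cosine replaces the distributed semantic similarity described above, and the threshold value and function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical expansion table standing in for BabelNet-based term expansion.
EXPANSIONS = {
    "agriculture": ["farming", "crop", "agricultura", "cultivo"],
    "finance": ["bank", "budget", "finanzas", "banco"],
}

def expand(topic_terms):
    """Add related and cross-lingual terms to a class topic."""
    expanded = list(topic_terms)
    for t in topic_terms:
        expanded.extend(EXPANSIONS.get(t, []))
    return expanded

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def categorize(tokens, topics, threshold=0.15, expanded=True):
    """Multilabel assignment: return every class whose topic vector is
    similar enough to the document. No training step is involved."""
    doc = Counter(tokens)
    labels = []
    for label, terms in topics.items():
        topic_vec = Counter(expand(terms) if expanded else terms)
        if cosine(doc, topic_vec) >= threshold:
            labels.append(label)
    return sorted(labels)

topics = {"agriculture": ["agriculture"], "finance": ["finance"]}
doc_es = ["el", "cultivo", "necesita", "un", "banco"]

print(categorize(doc_es, topics))                  # ['agriculture', 'finance']
print(categorize(doc_es, topics, expanded=False))  # []
```

With the original one-word topics the Spanish document matches nothing, while the expanded topics let the same English class definitions label it, which mirrors the gap between original and expanded topics reported in the abstract.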