Cross language text categorization by acquiring multilingual domain models from comparable corpora

Using Information from the Target Language to Improve Crosslingual Text Classification

Lecture Notes in Computer Science, 2010

Crosslingual text classification consists of exploiting labeled documents in a source language to classify documents in a different target language. In addition to the evident translation problem, this task also faces difficulties caused by cultural discrepancies between the two languages, which manifest as different topic distributions. Such discrepancies make the classifier unreliable for the categorization task. To tackle this problem, we propose to improve classification performance by using information embedded in the target dataset itself. The central idea of the proposed approach is that similar documents must belong to the same category. Therefore, it classifies each document by considering not only its own content but also the categories assigned to other similar documents from the same target dataset. Experimental results using three different languages demonstrate the appropriateness of the proposed approach.
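
The idea that similar target documents should share a category can be illustrated with a small sketch. It is not the authors' algorithm: the voting rule, the cosine similarity over raw term counts, and the names `refine_labels`, `k` and `own_weight` are assumptions made for illustration only. Each document keeps one vote for its initial label (as produced by some source-language classifier) and receives similarity-weighted votes from its nearest neighbours in the same target set.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def refine_labels(docs, initial_labels, k=2, own_weight=1.0):
    """Re-label each document by a vote of itself and its k most similar
    documents from the same target set (hypothetical voting scheme)."""
    vectors = [Counter(d.lower().split()) for d in docs]
    refined = []
    for i, vec in enumerate(vectors):
        # Rank all other target documents by similarity to document i.
        neighbours = sorted(
            ((cosine(vec, other), j) for j, other in enumerate(vectors) if j != i),
            reverse=True,
        )[:k]
        votes = Counter({initial_labels[i]: own_weight})
        for sim, j in neighbours:
            votes[initial_labels[j]] += sim
        refined.append(votes.most_common(1)[0][0])
    return refined


# Toy Spanish target set; the third document was initially misclassified.
target_docs = [
    "gol partido liga equipo",
    "partido liga equipo entrenador",
    "gol liga equipo entrenador",
    "gobierno ley parlamento voto",
    "ley parlamento voto ministro",
]
initial = ["sports", "sports", "politics", "politics", "politics"]

print(refine_labels(target_docs, initial))
```

In this toy run the third document is pulled back to "sports" because its two nearest neighbours both carry that label and together outweigh its own initial vote. A real system would likely iterate this step and combine classifier scores rather than hard labels.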

Cross-lingual text categorization

2003

This article deals with the problem of Cross-Lingual Text Categorization (CLTC), which arises when documents in different languages must be classified according to the same classification tree. We describe practical and cost-effective solutions for automatic Cross-Lingual Text Categorization, both when a sufficient number of training examples is available for each new language and when no training examples are available for some language. Experimental results of the bi-lingual classification of the ILO corpus (with documents in English and Spanish) are obtained using bi-lingual training, terminology translation and profile-based translation.

Construction of supervised and unsupervised learning systems for multilingual text categorization

Expert Systems with Applications, 2009

Due to the availability of a huge amount of textual data from a variety of sources, users of internationally distributed information regions need effective methods and tools that enable them to discover, retrieve and categorize relevant information, in whatever language and form it may have been stored. This drives a convergence of interests from diverse research communities focusing on issues related to multilingual text categorization. In this work, we implemented and measured the performance of the leading supervised and unsupervised approaches for multilingual text categorization. We selected support vector machines (SVM) as representative of supervised techniques, and latent semantic indexing (LSI) and self-organizing maps (SOM) as our selected unsupervised methods for system implementation. The preliminary results show that our platform models, including both supervised and unsupervised learning methods, have potential for multilingual text categorization.

Cost-effective cross-lingual document classification

This article addresses the question of how to deal with text categorization when the documents to be classified belong to different languages. The figures we provide demonstrate that cross-lingual classification, where a classifier is trained using one language and tested against another, is possible and feasible provided we translate a small number of words: the most relevant terms for class profiling. The experiments we report demonstrate that translating these most relevant words is a cost-effective approach to cross-lingual classification.
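
Translating only the most relevant terms of each class profile can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the article: the relevance score (in-class count minus out-of-class count), the six-entry English-to-Spanish dictionary, and the names `class_profiles`, `translate_profiles` and `classify` are all hypothetical.

```python
from collections import Counter


def class_profiles(labeled_docs, top_n=3):
    """Select, for each class, the top_n terms whose in-class frequency
    most exceeds their frequency in the other classes (assumed score)."""
    per_class, total = {}, Counter()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        per_class.setdefault(label, Counter()).update(tokens)
        total.update(tokens)
    profiles = {}
    for label, counts in per_class.items():
        scores = {t: c - (total[t] - c) for t, c in counts.items()}
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        profiles[label] = ranked[:top_n]
    return profiles


def translate_profiles(profiles, dictionary):
    """Translate only the profile terms; terms missing from the dictionary are dropped."""
    return {
        label: {dictionary[t] for t in terms if t in dictionary}
        for label, terms in profiles.items()
    }


def classify(text, translated_profiles):
    """Assign the class whose translated profile shares the most terms with the text."""
    tokens = set(text.lower().split())
    return max(
        sorted(translated_profiles),
        key=lambda label: len(tokens & translated_profiles[label]),
    )


# Toy English training data and a tiny hypothetical bilingual dictionary.
training = [
    ("the strike and the union wages", "labour"),
    ("union wages and the contract", "labour"),
    ("the vaccine and the hospital", "health"),
    ("hospital vaccine and the clinic", "health"),
]
en_to_es = {
    "union": "sindicato", "wages": "salarios", "contract": "contrato",
    "hospital": "hospital", "vaccine": "vacuna", "clinic": "clínica",
}

profiles_es = translate_profiles(class_profiles(training), en_to_es)
print(classify("el sindicato pide mejores salarios", profiles_es))
print(classify("la vacuna llega al hospital", profiles_es))
```

Only six words are translated here, yet both Spanish test sentences can be matched to a class, which is the cost argument the abstract makes. Function words such as "the" and "and" score zero and never enter the profiles, so they need no translation.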

Text categorization on a multi-lingual corpus

2008

This paper presents experiments with a hierarchical text categorizer on a multi-lingual (English, French) corpus. The results obtained are very similar for both languages. These results will allow us, in the near future, to apply cross-language text categorization, which can be used to support automatic translation for creating a multi-lingual topic glossary.

MULTILINGUAL TEXT CATEGORIZATION

Multilingual text categorization classifies documents from different languages into a single language format. This approach depends on the semantic representation of data extracted from different languages and is not restricted to only some domains, but also helps in carrying out by internal system manager. In the testing phase, WordNet and a Java web translator were used to translate the content into a single language and to identify the similarity percentage of matched data using the classification approaches, i.e., TF-IDF and K-Nearest Neighbour. This helps in categorizing the profiles and attaining the synset relation between test and train documents. The KNN and TF-IDF techniques were compared based on monolingual and multilingual similarity measures, which provide better results compared to the existing techniques.

Automatic Generation of Language-Independent Features for Cross-Lingual Classification

arXiv (Cornell University), 2018

Many applications require categorization of text documents using predefined categories. The main approach to performing text categorization is learning from labeled examples. For many tasks, it may be difficult to find examples in one language but easy in others. The problem of learning from examples in one or more languages and classifying (categorizing) in another is called cross-lingual learning. In this work, we present a novel approach that solves the general cross-lingual text categorization problem. Our method generates, for each training document, a set of language-independent features. Using these features for training yields a language-independent classifier. At the classification stage, we generate language-independent features for the unlabeled document, and apply the classifier on the new representation. To build the feature generator, we utilize a hierarchical language-independent ontology, where each concept has a set of support documents for each language involved. In the preprocessing stage, we use the support documents to build a set of language-independent feature generators, one for each language. The collection of these generators is used to map any document into the language-independent feature space. Our methodology works on the most general cross-lingual text categorization problems, being able to learn from any mix of languages and classify documents in any other language. We also present a method for exploiting the hierarchical structure of the ontology to create virtual supporting documents for languages that do not have them. We tested our method, using Wikipedia as our ontology, on the most commonly used test collections in cross-lingual text categorization, and found that it outperforms existing methods.
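
The pipeline described above (per-language feature generators built from concept support documents, then a single classifier in the shared feature space) can be sketched on a toy scale. Everything concrete here is an assumption for illustration: the two-concept `ONTOLOGY`, the use of cosine similarity to a concept's support text as the feature value, and the nearest-centroid classifier are stand-ins, not the method or data of the paper, and the hierarchy and virtual support documents are omitted.

```python
import math
from collections import Counter

# Hypothetical flat ontology: each concept has support text in each language.
ONTOLOGY = {
    "Sport": {"en": "football match team goal league",
              "es": "fútbol partido equipo gol liga"},
    "Medicine": {"en": "doctor hospital vaccine patient",
                 "es": "médico hospital vacuna paciente"},
}
CONCEPTS = sorted(ONTOLOGY)  # fixed feature order shared by all languages


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def concept_features(text, lang):
    """Map a document to one score per concept, using the support
    documents of the document's own language."""
    vec = Counter(text.lower().split())
    return [cosine(vec, Counter(ONTOLOGY[c][lang].split())) for c in CONCEPTS]


def train_centroids(examples):
    """examples: (text, lang, label) triples. Returns label -> mean feature vector."""
    sums, counts = {}, Counter()
    for text, lang, label in examples:
        feats = concept_features(text, lang)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(text, lang, centroids):
    """Nearest-centroid prediction in the language-independent feature space."""
    feats = concept_features(text, lang)
    return min(
        sorted(centroids),
        key=lambda label: sum((x - y) ** 2 for x, y in zip(feats, centroids[label])),
    )


# Train on English only, then classify Spanish documents.
centroids = train_centroids([
    ("the team scored a goal in the match", "en", "sports"),
    ("the doctor gave the patient a vaccine", "en", "health"),
])
print(predict("el equipo marcó un gol en el partido", "es", centroids))
print(predict("el médico vacuna al paciente", "es", centroids))
```

The classifier never sees a Spanish word during training. Spanish documents reach it only through the concept scores, which is what makes the representation language-independent in this sketch.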

Multilingual approaches to text categorisation

… Journal for the …, 2005

In this article we examine three different approaches to categorising documents from multilingual corpora using machine learning algorithms. These approaches satisfy two main conditions: there may be an unlimited number of different languages in the corpus and it is unnecessary to previously identify each document's language. The approaches differ in two main aspects: how documents are pre-processed (using either language-neutral or language-specific techniques) and how many classifiers are employed (either one global or one for each existing language). These approaches were tested on a bilingual corpus provided by a Spanish newspaper that contains articles written in Spanish and Basque. The empirical findings were studied from the point of view of classification accuracy and system performance including execution time and memory usage.