Pseudo-aligned multilingual corpora

Massively Multilingual Document Alignment with Cross-lingual Sentence-Mover’s Distance

2020

Document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Such aligned data can be used for a variety of NLP tasks, from training cross-lingual representations to mining parallel data for machine translation. In this paper we develop an unsupervised scoring function that leverages cross-lingual sentence embeddings to compute the semantic distance between documents in different languages. These semantic distances are then used to guide a document alignment algorithm to properly pair cross-lingual web documents across a variety of low-, mid-, and high-resource language pairs. Recognizing that our proposed scoring function and other state-of-the-art methods are computationally intractable for long web documents, we utilize a more tractable greedy algorithm that performs comparably. We experimentally demonstrate that our distance metric performs better alignment than current baselines, outperforming them by...
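The abstract says that semantic distances guide the pairing of documents but does not say how. As a minimal sketch of one possibility, the Python below assumes a precomputed distance matrix and pairs documents one-to-one, closest first. The function name `greedy_align` and the one-to-one constraint are assumptions made for illustration, not the authors' algorithm.

```python
from typing import List, Tuple


def greedy_align(dist: List[List[float]]) -> List[Tuple[int, int]]:
    """Pair source and target documents one-to-one, closest pairs first.

    dist[i][j] is an assumed, precomputed distance between source
    document i and target document j (smaller means more similar).
    """
    # Flatten the matrix into (distance, source, target) and sort ascending.
    candidates = sorted(
        (d, i, j)
        for i, row in enumerate(dist)
        for j, d in enumerate(row)
    )
    used_src, used_tgt = set(), set()
    pairs = []
    for _, i, j in candidates:
        # Skip candidates whose source or target is already paired.
        if i in used_src or j in used_tgt:
            continue
        pairs.append((i, j))
        used_src.add(i)
        used_tgt.add(j)
    return pairs
```

Sorting all candidate pairs costs O(nm log nm) for n source and m target documents, which is cheaper than solving an optimal assignment but can miss the globally best matching.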

Multilingual Document Alignment - A Study with Chinese and Japanese

2002

The natural language processing (NLP) community is increasingly using parallel and comparable corpora for cross-linguistic research. The knowledge extracted from such corpora helps in cross-language information retrieval, topic detection and tracking, machine translation, and many other NLP tasks. Parallel or comparable corpora for the Japanese-Chinese language pair are rare. We investigate an automatic approach to building bilingual corpora from a collection of unaligned bilingual documents using linguistic and statistical processing. The similarities between two documents across the languages are calculated using mutual information (MI) and residual inverse document frequency (RIDF) of Kanji. We explain the document alignment algorithm in detail and evaluate its effectiveness using a collection of potentially relevant but unaligned bilingual documents.
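The abstract names residual inverse document frequency (RIDF) without defining it. The sketch below uses the standard definition from the information retrieval literature: the observed IDF of a term minus the IDF expected if its occurrences followed a Poisson distribution. How the paper applies this to Kanji is not stated in the abstract, so the function name `ridf` and its parameters are illustrative assumptions.

```python
import math


def ridf(doc_freq: int, coll_freq: int, n_docs: int) -> float:
    """Residual IDF: observed IDF minus Poisson-expected IDF.

    doc_freq  -- number of documents containing the term
    coll_freq -- total occurrences of the term in the collection
    n_docs    -- number of documents in the collection
    """
    if not (0 < doc_freq <= n_docs) or coll_freq < doc_freq:
        raise ValueError("need 0 < doc_freq <= n_docs and coll_freq >= doc_freq")
    observed_idf = -math.log2(doc_freq / n_docs)
    # Under a Poisson model with rate coll_freq / n_docs, the probability
    # that a document contains the term at least once is 1 - exp(-rate).
    expected_idf = -math.log2(1.0 - math.exp(-coll_freq / n_docs))
    return observed_idf - expected_idf
```

A term whose occurrences cluster in a few documents has a higher RIDF than one spread evenly: with 10 occurrences in a 1000-document collection, `ridf(2, 10, 1000)` is larger than `ridf(10, 10, 1000)`.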

CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs

Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. In this paper, we exploit the signals embedded in URLs to label web documents at scale with an average precision of 94.5% across different language pairs. We mine sixty-eight snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of over 392 million URL pairs from Common Crawl covering documents in 8144 language pairs, of which 137 pairs include English. In addition to curating this massive dataset, we introduce baseline methods that leverage cross-lingual representations to identify aligned documents based on their textual content. Finally, we demonstrate the value of this parallel document dataset through a downstream task of mining parallel sentences and measuring the quality of machine translations from models trained on this mined data. Our objective in releasing this dataset is to foster new research in cross-lingual NLP across a variety of low-, medium-, and high-resource languages.
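The abstract does not describe which URL signals are used. One plausible reading, sketched below in Python, is that pages on the same site whose URLs differ only in a language-code path segment are candidate translations. The `LANG_CODES` set, the function names, and the restriction to path segments are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny set of language codes for the sketch.
LANG_CODES = {"en", "fr", "de", "es", "ja", "zh"}


def split_language(url):
    """Return (language-neutral key, language code or None) for a URL."""
    parts = urlsplit(url)
    lang = None
    kept = []
    for seg in parts.path.split("/"):
        # Treat the first path segment that is a known code as the language.
        if lang is None and seg.lower() in LANG_CODES:
            lang = seg.lower()
        else:
            kept.append(seg)
    return parts.netloc + "/".join(kept), lang


def pair_by_url(urls):
    """Pair URLs that share a key but carry different language codes."""
    groups = defaultdict(dict)
    for url in urls:
        key, lang = split_language(url)
        if lang is not None:
            groups[key].setdefault(lang, url)
    pairs = []
    for by_lang in groups.values():
        langs = sorted(by_lang)
        for a in range(len(langs)):
            for b in range(a + 1, len(langs)):
                pairs.append((by_lang[langs[a]], by_lang[langs[b]]))
    return pairs
```

For example, `https://example.com/en/about` and `https://example.com/fr/about` both reduce to the key `example.com/about` and are paired, while a URL with no language segment is ignored. Real sites also encode language in subdomains and query parameters, which this sketch does not handle.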

The ADAPT Bilingual Document Alignment system at WMT16

Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016

Comparable corpora have been shown to be useful in several multilingual natural language processing (NLP) tasks. Many previous papers have focused on how to improve the extraction of parallel data from this kind of corpus on different levels. In this paper, we are interested in improving the quality of bilingual comparable corpora according to increased document alignment score. We describe our participation in the bilingual document alignment shared task of the First Conference on Machine Translation (WMT16). We propose a technique based on source-to-target sentence- and word-based scores and the fraction of matched source named entities. We performed our experiments on English-to-French document alignments for this bilingual task.

Graph Algorithms for Multiparallel Word Alignment

ArXiv, 2021

With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently. Alignments are useful for typological research, transferring formatting like markup to translated texts, and can be used in the decoding of machine translation systems. At the same time, massively multilingual processing is becoming an important NLP scenario, and pretrained language and machine translation models that are truly multilingual are proposed. However, most alignment algorithms rely on bitexts only and do not leverage the fact that many parallel corpora are multiparallel. In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph. We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction. Our exper...

Improving Automated Alignment in Multilingual Corpora

1996

We report on methods of improving multilingual text alignments that have been produced in a simple dynamic-programming scheme, by automated detection of possible misalignments. Details of methods involving cognates, specially-identified words, and propositional contents of sentences are given, together with notable features of their performance on parallel corpora in a number of different types of European languages.

A Massive Collection of Cross-Lingual Web-Document Pairs

2019

Cross-lingual document alignment aims to identify pairs of documents in two distinct languages that are of comparable content or translations of each other. Small-scale efforts have been made to collect aligned document-level data on a limited set of language pairs such as English-German or on limited comparable collections such as Wikipedia. In this paper, we mine twelve snapshots of the Common Crawl corpus and identify web document pairs that are translations of each other. We release a new web dataset consisting of 54 million URL pairs from Common Crawl covering documents in 92 languages paired with English. We evaluate the quality of the dataset by measuring the quality of machine translations from models that have been trained on mined parallel sentence pairs from this aligned corpus and introduce a simple yet effective baseline for identifying these aligned documents. The objective of this dataset and paper is to foster new research in cross-lingual NLP across a variety of lo...

Detecting highly confident word translations from comparable corpora without any prior knowledge

2012

In this paper, we extend the work on using latent cross-language topic models for identifying word translations across comparable corpora. We present a novel precision-oriented algorithm that relies on per-topic word distributions obtained by the bilingual LDA (BiLDA) latent topic model. The algorithm aims at harvesting only the most probable word translations across languages in a greedy fashion, without any prior knowledge about the language pair, relying on a symmetrization process and the one-to-one constraint. We report our results for Italian-English and Dutch-English language pairs that outperform the current state-of-the-art results by a significant margin. In addition, we show how to use the algorithm for the construction of high-quality initial seed lexicons of translations.
