Bilingual dictionaries for all EU languages

Bilingual dictionaries

2015

Bilingual dictionaries are key resources in several fields such as translation, language learning or various NLP tasks. However, only major languages have such resources. Automatically built dictionaries using pivot languages could be a useful resource in these circumstances. Pivot-based bilingual dictionary building merges two bilingual dictionaries which share a common language (e.g. LA-LB, LB-LC) in order to create a dictionary for a new language pair (e.g. LA-LC). This process may include wrong translations due to the polysemy of words. We built Basque-Chinese (Mandarin) dictionaries automatically from Basque-English and Chinese-English dictionaries. In order to prune wrong translations we used different methods suited to less-resourced languages. Inverse Consultation and Distributional Similarity methods were chosen because they depend only on easily available resources. Finally, we evaluated manually the quality of the built dictionaries and the adequacy of the methods.
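
A minimal sketch of the pivot-merging step described above, under the assumption that each source dictionary is a simple mapping from a headword to a list of English translations; the function name, data structures and toy entries are illustrative, not taken from the paper:

    # Sketch of pivot-based dictionary merging: every Basque word that shares an
    # English pivot translation with a Chinese word becomes a candidate pair.
    from collections import defaultdict

    def merge_via_pivot(src_to_pivot, tgt_to_pivot):
        """Build source->target candidate translations through a shared pivot."""
        # Invert the target dictionary: pivot word -> set of target words.
        pivot_to_tgt = defaultdict(set)
        for tgt_word, pivot_words in tgt_to_pivot.items():
            for p in pivot_words:
                pivot_to_tgt[p].add(tgt_word)
        # Chain source -> pivot -> target; polysemous pivots create spurious
        # candidates, which is what the pruning methods try to remove.
        candidates = defaultdict(set)
        for src_word, pivot_words in src_to_pivot.items():
            for p in pivot_words:
                candidates[src_word] |= pivot_to_tgt[p]
        return candidates

    # Toy example: the polysemous English pivot "bank" lets the river sense
    # leak into the financial sense, producing wrong candidates on both sides.
    basque_en = {"ibai-ertz": ["bank", "riverside"], "banku": ["bank"]}
    zh_en = {"银行": ["bank"], "河岸": ["bank", "riverside"]}
    print(dict(merge_via_pivot(basque_en, zh_en)))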

Automatic generation of bilingual dictionaries using intermediary languages and comparable corpora

2010

This paper outlines a strategy to build new bilingual dictionaries from existing resources. The method is based on two main tasks: first, a new set of bilingual correspondences is generated from two available bilingual dictionaries; second, the generated correspondences are validated by making use of a bilingual lexicon automatically extracted from non-parallel, comparable corpora. The quality of the entries of the derived dictionary is very high, similar to that of hand-crafted dictionaries. We report a case study where a new, noise-free English-Galician dictionary with about 12,000 correct bilingual correspondences was automatically generated.
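
As a rough illustration of the validation step, the sketch below keeps a pivot-generated candidate only if a lexicon extracted from comparable corpora assigns it a score above a threshold; the scores, threshold and toy English-Galician entries are assumptions made for the example, not the paper's actual data or parameters:

    # Sketch: filter pivot-generated candidate pairs against a bilingual lexicon
    # extracted from comparable corpora. The scores would in practice come from
    # context-vector comparison over the corpora; here they are made up.
    def validate_candidates(candidates, corpus_lexicon, threshold=0.3):
        """Keep a (source, target) pair only if the corpus lexicon supports it."""
        validated = {}
        for src_word, tgt_words in candidates.items():
            kept = [t for t in tgt_words
                    if corpus_lexicon.get((src_word, t), 0.0) >= threshold]
            if kept:
                validated[src_word] = kept
        return validated

    candidates = {"can": ["poder", "lata", "cadela"]}        # "cadela" is noise
    corpus_lexicon = {("can", "poder"): 0.8, ("can", "lata"): 0.5,
                      ("can", "cadela"): 0.1}
    print(validate_candidates(candidates, corpus_lexicon))   # drops "cadela"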

Analyzing methods for improving precision of pivot based bilingual dictionaries

2011

An A-C bilingual dictionary can be inferred by merging A-B and B-C dictionaries using B as pivot. However, polysemous pivot words often produce wrong translation candidates. This paper analyzes two methods for pruning wrong candidates: one based on exploiting the structure of the source dictionaries, and the other based on distributional similarity computed from comparable corpora. As both methods depend exclusively on easily available resources, they are well suited to less-resourced languages. We studied whether these two techniques complement each other, given that they are based on different paradigms. We also investigated how to combine them, seeking the best fit for various application scenarios.
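
As a rough sketch of the first kind of method, Inverse Consultation in its commonly cited form (not necessarily the exact scoring used in this paper) keeps a candidate A-C pair only when the two words share enough pivot-language translations:

    # Sketch of Inverse Consultation (IC): score a candidate A-C pair by how
    # many pivot-language (B) translations the two words share. Scoring details
    # vary across papers; only the common core idea is shown here.
    def ic_score(src_word, tgt_word, src_to_pivot, tgt_to_pivot):
        src_pivots = set(src_to_pivot.get(src_word, []))
        tgt_pivots = set(tgt_to_pivot.get(tgt_word, []))
        return len(src_pivots & tgt_pivots)

    def prune_by_ic(candidates, src_to_pivot, tgt_to_pivot, min_shared=2):
        """Keep only candidates backed by at least `min_shared` pivot words."""
        return {src: [t for t in tgts
                      if ic_score(src, t, src_to_pivot, tgt_to_pivot) >= min_shared]
                for src, tgts in candidates.items()}

    # Toy example: "etxe" shares two pivots with 家 but only one with 议院.
    basque_en = {"etxe": ["house", "home", "household"]}
    zh_en = {"家": ["home", "family", "household"], "议院": ["house", "chamber"]}
    print(prune_by_ic({"etxe": ["家", "议院"]}, basque_en, zh_en))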

Building Domain Specific Bilingual Dictionaries

2014

This paper proposes a method to build bilingual dictionaries for specific domains defined by a parallel corpus. The proposed method is based on an original method that is not domain specific. Both the original and the proposed methods are constructed with previously available natural language processing tools; therefore, this paper's contribution resides in the choice and parametrization of those tools. To illustrate the benefits of the proposed method we conduct an experiment over technical manuals in English and Portuguese. The results were analyzed by human specialists and indicate significant increases in precision for unigrams and multi-grams: numerically, the precision increase is as large as 15% according to our evaluation.

Bilingual Dictionaries: From Theory to Computerization

This paper suggests a computationally enhanced model of an English-Arabic dictionary based on a systematically empirical linguistic analysis of the source language and target language systems rather than the introspective intuitions of bilingual lexicographers. In this model, computerized text corpora and bilingual semantic concordances play a key role in turning out a reliable bilingual dictionary that not only serves the purposes of all types of bilingual dictionary users but will also be a robust bilingual repertoire in bilingual Natural Language Processing systems such as Machine Translation. Notwithstanding the great advances in the fields of lexical semantics and computational lexicology, bilingual lexicography (BL) is still a far cry from being a scientific discipline per se. Bilingual comparative analysis of the source language and the target language has not yet built itself into the toolkit of the bilingual lexicographer. Computerization, as far as bilingual lexicography is concerned, is still restricted to such surface-level automation as is sufficient to transform a book dictionary into computerized form. This attitude is oblivious to the potentialities that artificial intelligence and smart computation hold for updating the linguistic content of bilingual dictionaries beyond what mere CD-ROM churning can. On the other hand, linguistic theories of bilingual lexicography have been governed, somewhat unconsciously, by commercial considerations. Still, in the literature on bilingual dictionaries we can read about the "purpose" of the dictionary and whether it is targeted at production of the TL by SL users or comprehension of the SL by certain TL users, depending on the direction of the SL-TL pair. This view has always governed such critical issues as sense discrimination in both the source language and the target language, rendering the need for semantic disambiguation in a bilingual dictionary (BD) subject to the predetermined purpose of the dictionary. This paper tries to expose the shortcomings of this view, adopting a different theoretical position which sees unity of purpose as the basis for building the architecture of the bilingual dictionary so that it becomes suited to the needs of all users, be they average users, specialized ones, language learners or translators, and be they native speakers of the source language or the target language. At the same time, it would be fair to argue that this eclectic view of the bilingual dictionary can be attributed to the limited space available in paper dictionaries. However, such an argument, one can contend, is no longer valid once we have adopted full-fledged computerization, with its immense potential for storage.

Generation of Bilingual Dictionaries using Structural Properties

Computacion Y Sistemas, 2013

Building bilingual dictionaries from Wikipedia has been extensively studied in the area of computational linguistics. These dictionaries play a crucial role in Natural Language Processing (NLP) applications like Cross-Lingual Information Retrieval, Machine Translation and Named Entity Recognition. To build these dictionaries, most of the existing approaches use information present in Wikipedia titles, info-boxes and categories. Interestingly, not many use the structural properties of a document such as sections, subsections, etc. In this work we exploit the structural properties of documents to build a bilingual English-Hindi dictionary. The main intuition behind this approach is that documents in different languages discussing the same topic are likely to have similar structural elements. Though we present our experiments only for Hindi, our approach is language independent and can be easily extended to other languages. The major contribution of our work is that the dictionary contains translations and transliterations of words, which include Named Entities to a large extent. We evaluate our dictionary using manually computed precision. We generated a list of 72k tokens using our approach with 0.75 precision.
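
A minimal sketch of the structural intuition, assuming we already have the section headings of two inter-language-linked Wikipedia articles and simply pair them by position; the article fetching, the matching heuristic and the toy headings are assumptions for illustration, not the paper's actual pipeline:

    # Sketch: exploit document structure by pairing section headings of the
    # English and Hindi versions of the same Wikipedia article. A real system
    # would fetch the articles via the Wikipedia API and use a stronger
    # alignment model; here the headings are hard-coded and matched by position.
    def align_sections(en_sections, hi_sections):
        """Pair headings of linked articles position by position."""
        return list(zip(en_sections, hi_sections))

    en_doc = ["History", "Geography", "Economy"]
    hi_doc = ["इतिहास", "भूगोल", "अर्थव्यवस्था"]
    for en, hi in align_sections(en_doc, hi_doc):
        print(en, "->", hi)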

Computational bilingual lexicography: automatic extraction of translation dictionaries

2001

The paper describes a simple but very effective approach to extracting translation equivalents from parallel corpora. We briefly present the multilingual parallel corpus used in our experiments and then describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation of the two algorithms is presented in some detail in terms of precision, recall and processing time. The baseline algorithm was used to extract 6 bilingual lexicons and was evaluated on four of them. The second algorithm was evaluated only on the Romanian-English noun lexicon. An analysis of the missed or wrong translation equivalents revealed various factors, both intrinsic (due to the method) and extrinsic (due to the working data: accuracy of the pre-processing, quality of translation, bitext language relatedness). We conclude by discussing the merits and drawbacks of our method in comparison with other work and comment on further developments.
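
The paper's own algorithms are not reproduced here; the sketch below only illustrates the general class of technique, extracting translation equivalents from a sentence-aligned corpus with a simple association score (the Dice coefficient), a common baseline for this task:

    # Generic association-based extraction of translation equivalents from a
    # sentence-aligned parallel corpus using the Dice coefficient. This is a
    # standard baseline, not the specific algorithm evaluated in the paper.
    from collections import Counter
    from itertools import product

    def dice_lexicon(aligned_pairs, min_dice=0.7):
        src_freq, tgt_freq, cooc = Counter(), Counter(), Counter()
        for src_sent, tgt_sent in aligned_pairs:
            src_words, tgt_words = set(src_sent.split()), set(tgt_sent.split())
            src_freq.update(src_words)
            tgt_freq.update(tgt_words)
            cooc.update(product(src_words, tgt_words))
        return {(s, t): 2.0 * c / (src_freq[s] + tgt_freq[t])
                for (s, t), c in cooc.items()
                if 2.0 * c / (src_freq[s] + tgt_freq[t]) >= min_dice}

    corpus = [("the cat sleeps", "pisica doarme"),
              ("the cat eats", "pisica mananca")]
    print(dice_lexicon(corpus))  # frequent words like "the" still score high,
                                 # one reason such lexicons need post-filtering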

Flexible statistical construction of bilingual dictionaries

2007

Abstract: Most previous systems for building a bilingual dictionary from a parallel corpus rely on an iterative algorithm, using word translation probabilities to align words in the corpus and the resulting alignments to estimate translation probabilities, repeating until convergence. While this approach produces reasonable results, it is computationally slow, which limits the size of the corpus that can be analyzed and of the dictionary produced. We propose a non-iterative approach to producing a unidirectional bilingual dictionary which, while less accurate than iterative approaches, is much faster, allowing larger corpora to be processed in a reasonable time. It also allows real-time estimation of the translation probability of a pair of terms, which means that a translation dictionary can be obtained for the n most frequent terms, and that translation probabilities for infrequent terms can be computed when they are encountered in real documents.
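
A hedged sketch of the non-iterative idea: instead of EM-style realignment, a single pass over the aligned sentences collects co-occurrence counts, and the translation probability of any term pair is estimated on demand from those raw counts. The exact statistic used in the paper may differ; the class below (names and estimator are assumptions) only shows the one-pass, on-demand flavour:

    # One-pass, non-iterative bilingual dictionary builder: count sentence-level
    # co-occurrences once, then estimate P(target | source) on demand.
    # The estimator is illustrative; the paper's statistic may differ.
    from collections import Counter
    from itertools import product

    class OnePassDictionary:
        def __init__(self, aligned_pairs):
            self.src_freq, self.cooc = Counter(), Counter()
            for src_sent, tgt_sent in aligned_pairs:   # single pass, no EM loop
                src_words = set(src_sent.split())
                tgt_words = set(tgt_sent.split())
                self.src_freq.update(src_words)
                self.cooc.update(product(src_words, tgt_words))

        def translation_prob(self, src_word, tgt_word):
            """Real-time estimate of P(tgt_word | src_word) from raw counts."""
            n = self.src_freq[src_word]
            return self.cooc[(src_word, tgt_word)] / n if n else 0.0

        def top_translations(self, src_word, n=3):
            cands = [(t, self.translation_prob(src_word, t))
                     for (s, t) in self.cooc if s == src_word]
            return sorted(cands, key=lambda x: -x[1])[:n]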

Bilingual Dictionaries Using Comparable and Quasi Comparable Corpora

2016

Cross-lingual term associations are very important for many interlingual applications. Machine Translation is a sub-field of computational linguistics which aims at translating text from one language to another, and all MT systems depend at their core on bilingual dictionaries. Bilingual dictionaries, whose entries are word translations, are important resources for NLP applications such as statistical machine translation and cross-language information extraction systems; they can also serve to enhance existing dictionaries and to support second-language teaching and learning. Manually created resources are usually more accurate and do not contain noisy information, in contrast to automatically learned dictionaries. In this book, we try to address the problem of generating bilingual dictionaries automatically. We propose two different approaches to generating bilingual dictionaries for the English-Hindi pair. Both the approaches proposed i...

Building a Basque-Chinese Dictionary by Using English as Pivot

Bilingual dictionaries are key resources in several fields such as translation, language learning or various NLP tasks. However, only major languages have such resources. Automatically built dictionaries using pivot languages could be a useful resource in these circumstances. Pivot-based bilingual dictionary building merges two bilingual dictionaries which share a common language (e.g. LA-LB, LB-LC) in order to create a dictionary for a new language pair (e.g. LA-LC). This process may include wrong translations due to the polysemy of words. We built Basque-Chinese (Mandarin) dictionaries automatically from Basque-English and Chinese-English dictionaries. In order to prune wrong translations we used different methods suited to less-resourced languages. Inverse Consultation and Distributional Similarity methods were chosen because they depend only on easily available resources. Finally, we evaluated manually the quality of the built dictionaries and the adequacy of the methods. Both Inverse Consultation and Distributional Similarity provide good precision of translations, but recall is seriously damaged. Distributional Similarity prunes rare translations more accurately than other methods.
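
A minimal sketch of the distributional-similarity pruning mentioned above, assuming context vectors for source and target words have already been built from comparable corpora and mapped into a shared space via a seed dictionary; the vectors, threshold and names below are toy assumptions, not the paper's setup:

    # Distributional-similarity pruning: a candidate pair survives only if the
    # context vectors of the two words (built from comparable corpora and mapped
    # into a common space) are close enough. Vectors here are toy sparse dicts.
    import math

    def cosine(u, v):
        dot = sum(u[k] * v.get(k, 0.0) for k in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def prune_by_similarity(candidates, src_vectors, tgt_vectors, threshold=0.4):
        """Drop candidate translations whose context vectors diverge too much."""
        pruned = {}
        for src, tgts in candidates.items():
            kept = [t for t in tgts
                    if cosine(src_vectors.get(src, {}),
                              tgt_vectors.get(t, {})) >= threshold]
            if kept:
                pruned[src] = kept
        return pruned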