Cross-Lingual Word Embedding Refinement by ℓ1 Norm Optimisation

Training vs. Post-training Cross-lingual Word Embedding Approaches: A Comparative Study

International Journal of Information Science and Management, 2023

This paper provides a comparative analysis of cross-lingual word embeddings by studying the impact of different variables on the quality of the embedding models within the distributional semantics framework. Distributional semantics is a method for the semantic representation of words, phrases, sentences, and documents. It aims to capture as much contextual information as possible in a vector space. Early studies in this domain focused on monolingual word embeddings. Later work used cross-lingual data to capture contextual semantic information across different languages. The main contribution of this research is a comparative study of how the learning method (supervised or unsupervised, in training and post-training approaches) affects how well different embedding algorithms capture the semantic properties of words in cross-lingual embedding models, for use in multilingual tasks such as question retrieval. To this end, we study the cross-lingual embedding models created by the BilBOWA, VecMap, and MUSE embedding algorithms, along with the variables that affect the models' quality, namely the size of the training data and the window size of the local context. We use the unsupervised monolingual Word2Vec embedding model as the baseline and evaluate the quality of the embeddings on three data sets: the Google analogy set and monolingual and cross-lingual word similarity lists. We further investigate the impact of the embedding models on the question retrieval task.

Leveraging Vector Space Similarity for Learning Cross-Lingual Word Embeddings: A Systematic Review

Digital, 2021

This article presents a systematic literature review on quantifying the proximity between independently trained monolingual word embedding spaces. A search was carried out in the broader context of inducing bilingual lexicons from cross-lingual word embeddings, especially for low-resource languages. The returned articles were then classified. Cross-lingual word embeddings have drawn the attention of researchers in the field of natural language processing (NLP). Although existing methods have yielded satisfactory results for resource-rich languages and languages related to them, some researchers have pointed out that the same is not true for low-resource and distant languages. In this paper, we report the research on methods proposed to provide better representation for low-resource and distant languages in the cross-lingual word embedding space.

Injecting Word Embeddings with Another Language’s Resource: An Application of Bilingual Embeddings

2017

Word embeddings learned from a text corpus can be improved by injecting knowledge from external resources, while at the same time specializing them for similarity or relatedness. These knowledge resources (such as WordNet or the Paraphrase Database) may not exist for all languages. In this work, we introduce a method to inject the word embeddings of one language with a knowledge resource of another language by leveraging bilingual embeddings. First, we improve the word embeddings of German, Italian, French, and Spanish using English resources and test them on a variety of word similarity tasks. Then we demonstrate the utility of our method by creating improved embeddings for Urdu and Telugu using the Hindi WordNet, beating the previously established baseline for Urdu.

Learning Cross-Lingual Word Embeddings with Universal Concepts

International Journal on Web Service Computing (IJWSC), 2019

Recent advances in generating monolingual word embeddings based on word co-occurrence for universal languages inspired new efforts to extend the model to support diversified languages. State-of-the-art methods for learning cross-lingual word embeddings rely on the alignment of monolingual word embedding spaces. Our goal is to implement word co-occurrence across languages with the universal concepts method. Such concepts are notions that are fundamental to humankind and are thus persistent across languages, e.g., man or woman, war or peace. Given bilingual lexicons, we built universal concepts as undirected graphs of connected nodes and then replaced the words belonging to the same graph with a unique graph ID. This intuitive design makes use of universal concepts in monolingual corpora, which helps generate meaningful word embeddings across languages via the word co-occurrence concept. Standardized benchmarks demonstrate how this underutilized approach competes SOTA...
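
The graph-ID replacement described in this abstract can be sketched as a connected-components computation over lexicon entries. This is only an illustration under stated assumptions, not the authors' implementation: the function names `build_concept_ids` and `replace_with_concepts`, the `(language, word)` node format, and the `CONCEPT_n` ID scheme are all hypothetical.

```python
def build_concept_ids(lexicon):
    """Group translation-linked words into connected components.

    `lexicon` is an iterable of pairs of (language, word) nodes
    (an assumed input format). Returns a dict mapping each node to
    the ID of the component (the "universal concept") it belongs to.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in lexicon:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    roots = {}
    concept_of = {}
    for node in list(parent):
        root = find(node)
        # Assign a fresh ID the first time a component root is seen.
        concept_of[node] = roots.setdefault(root, f"CONCEPT_{len(roots)}")
    return concept_of


def replace_with_concepts(tokens, lang, concept_of):
    """Replace every token that belongs to a concept with its concept ID."""
    return [concept_of.get((lang, tok), tok) for tok in tokens]


lexicon = [
    (("en", "man"), ("fr", "homme")),
    (("en", "man"), ("es", "hombre")),
    (("en", "war"), ("fr", "guerre")),
]
concept_of = build_concept_ids(lexicon)
english = replace_with_concepts(["the", "man", "saw", "war"], "en", concept_of)
french = replace_with_concepts(["homme", "guerre"], "fr", concept_of)
```

After this replacement, translation-linked words in the monolingual corpora share one token, so any co-occurrence-based embedding trainer run on the combined corpora would see them as the same vocabulary item.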

Cross-Lingual Contextual Word Embeddings Mapping With Multi-Sense Words In Mind

ArXiv, 2019

Recent work in cross-lingual contextual word embedding learning cannot handle multi-sense words well. In this work, we explore the characteristics of contextual word embeddings and show the link between contextual word embeddings and word senses. We propose two solutions: treating contextual multi-sense word embeddings as noise (removal), and generating cluster-level average anchor embeddings for contextual multi-sense word embeddings (replacement). Experiments show that our solutions can improve supervised contextual word embedding alignment for multi-sense words from a microscopic perspective without hurting macroscopic performance on the bilingual lexicon induction task. For unsupervised alignment, our methods significantly improve performance on the bilingual lexicon induction task, by more than 10 points.

The Limitations of Cross-language Word Embeddings Evaluation

Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, 2018

The aim of this work is to explore the possible limitations of existing methods of cross-language word embedding evaluation, addressing the lack of correlation between intrinsic and extrinsic cross-language evaluation methods. To test this hypothesis, we construct English-Russian datasets for extrinsic and intrinsic evaluation tasks and compare the performance of 5 different cross-language models on them. The results show that scores do not correlate with each other even across different intrinsic benchmarks. We conclude that using human references as ground truth for cross-language word embeddings is not appropriate unless one understands how native speakers process semantics in their cognition.

Isomorphic Cross-lingual Embeddings for Low-Resource Languages

ArXiv, 2022

Cross-Lingual Word Embeddings (CLWEs) are a key component for transferring linguistic information learnt from higher-resource settings into lower-resource ones. Recent research in cross-lingual representation learning has focused on offline mapping approaches due to their simplicity, computational efficacy, and ability to work with minimal parallel resources. However, they crucially depend on the assumption that the embedding spaces are approximately isomorphic, i.e., share a similar geometric structure, which does not hold in practice, leading to poorer performance on low-resource and distant language pairs. In this paper, we introduce a framework to learn CLWEs, without assuming isometry, for low-resource pairs via joint exploitation of a related higher-resource language. In our work, we first pre-align the low-resource and related language embedding spaces using offline methods to mitigate the assumption of isometry. Following this, we use joint training methods to develop CLWEs for the relat...
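
For readers unfamiliar with offline mapping, one common instance is the orthogonal Procrustes solution, sketched below with NumPy. This is an illustration of the general technique only, not the framework proposed in this paper; the function name `procrustes_map` and the synthetic data are assumptions. It also shows why isometry matters: an orthogonal map can only rotate or reflect the source space, so it fits perfectly only when the two spaces share the same geometry.

```python
import numpy as np


def procrustes_map(X, Y):
    """Return the orthogonal W minimising ||X @ W - Y||_F.

    X and Y hold source and target embeddings of dictionary pairs,
    one pair per row. The closed form is W = U @ Vt, where
    U, S, Vt is the SVD of X.T @ Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic check: the target space is an exact rotation of the source
# space (perfectly isometric), so the rotation is recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
Y = X @ Q
W = procrustes_map(X, Y)
```

When the target space is not a rotation of the source space, the residual `X @ W - Y` stays non-zero however `W` is chosen, which is the failure mode the abstract attributes to low-resource and distant language pairs.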

A Comparison of Architectures and Pretraining Methods for Contextualized Multilingual Word Embeddings

Proceedings of the AAAI Conference on Artificial Intelligence

The lack of annotated data in many languages is a well-known challenge within the field of multilingual natural language processing (NLP). Therefore, many recent studies focus on zero-shot transfer learning and joint training across languages to overcome data scarcity for low-resource languages. In this work we (i) perform a comprehensive comparison of state-of-the-art multilingual word and sentence encoders on the tasks of named entity recognition (NER) and part of speech (POS) tagging; and (ii) propose a new method for creating multilingual contextualized word embeddings, compare it to multiple baselines and show that it performs at or above state-of-the-art level in zero-shot transfer settings. Finally, we show that our method allows for better knowledge sharing across languages in a joint training setting.

Multi-Adversarial Learning for Cross-Lingual Word Embeddings

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021

Generative adversarial networks (GANs) have succeeded in inducing cross-lingual word embeddings (maps of matching words across languages) without supervision. Despite these successes, GANs' performance for the difficult case of distant languages is still not satisfactory. These limitations have been explained by GANs' incorrect assumption that source and target embedding spaces are related by a single linear mapping and are approximately isomorphic. We assume instead that, especially across distant languages, the mapping is only piece-wise linear, and propose a multi-adversarial learning method. This novel method induces the seed cross-lingual dictionary through multiple mappings, each induced to fit the mapping for one subspace. Our experiments on unsupervised bilingual lexicon induction and cross-lingual document classification show that this method improves performance over previous single-mapping methods, especially for distant languages.

A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings: Making the Method Robustly Reproducible as Well

ArXiv, 2020

In this paper, we reproduce the experiments of Artetxe et al. (2018b) regarding the robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. We show that the reproduction of their method is indeed feasible with some minor assumptions. We further investigate the robustness of their model by introducing four new languages that are less similar to English than the ones proposed by the original paper. In order to assess the stability of their model, we also conduct a grid search over sensible hyperparameters. We then propose key recommendations that apply to any research project in order to deliver fully reproducible research.