Word Sense Induction Methods: Which One Is Better for Russian

Word sense induction for Russian: deep study and comparison with dictionaries

The assumption that senses are mutually disjoint and have clear boundaries has been called into doubt by several linguists and psychologists. The problem of word sense granularity is widely discussed in both lexicographic and NLP studies. We aim to study word senses in the wild (in raw corpora) by performing word sense induction (WSI). WSI is the task of automatically inducing the different senses of a given word, framed as an unsupervised learning task in which senses are represented as clusters of token instances. In this paper, we compared four WSI techniques: Adaptive Skip-gram (AdaGram), Latent Dirichlet Allocation (LDA), clustering of contexts, and clustering of synonyms. We evaluated them quantitatively and qualitatively and performed a deep study of the AdaGram method, comparing AdaGram clusters for 126 words (nouns, adjectives, and verbs) with their senses in published dictionaries. We found that AdaGram is quite good at distinguishing homonyms and metaphoric meanings. It ignores disappearing and obsolete senses but induces new and domain-specific senses that are sometimes absent from dictionaries. However, it works better for nouns than for verbs, ignoring structural differences (e.g. causative meanings or different government patterns). The AdaGram database is available online: http://adagram.ll-cl.org/. This research was supported by RSF (project No. 16-18-02054: Semantic, statistic and psycholinguistic analysis of lexical polysemy as a component of Russian linguistic worldview). The authors would also like to thank students of the Higher School of Economics and Yandex School of Data Analysis for their help in annotating dictionary senses.

RUSSE'2018: A Shared Task on Word Sense Induction for the Russian Language

2018

The paper describes the results of the first shared task on word sense induction (WSI) for the Russian language. While similar shared tasks were conducted in the past for some Romance and Germanic languages, we explore the performance of sense induction and disambiguation methods for a Slavic language that shares many features with other Slavic languages, such as rich morphology and virtually free word order. The participants were asked to group contexts of a given word in accordance with its senses, which were not provided beforehand. For instance, given the word "bank" and a set of contexts for this word, e.g. "bank is a financial institution that accepts deposits" and "river bank is a slope beside a body of water", a participant was asked to group such contexts into clusters, the number of which is not known in advance, corresponding in this case to the "company" and the "area" senses of the word "bank". For the purpose of this evaluat...
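The grouping task described above can be illustrated with a minimal sketch. It is not any participant's system: the bag-of-words vectors, the stopword list, the average-link merging rule, the `threshold` value and the four English toy contexts are all assumptions chosen for illustration. The point is only that the number of clusters is not fixed in advance; merging stops when no two clusters are similar enough.

```python
import math
from collections import Counter

# Hypothetical stopword list; the target word "bank" is excluded as well,
# since it occurs in every context and carries no sense information.
STOPWORDS = {"a", "is", "the", "of", "that", "to", "and", "at", "bank"}

# Toy contexts (illustrative, not shared-task data).
CONTEXTS = [
    "bank is a financial institution that accepts deposits",
    "she keeps her deposits in a savings account at the bank",
    "river bank is a slope beside a body of water",
    "we walked along the bank of the river near the water",
]

def vectorize(context):
    """Bag-of-words vector of a context, without stopwords."""
    tokens = (t.strip(".,").lower() for t in context.split())
    return Counter(t for t in tokens if t and t not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def cluster_contexts(contexts, threshold=0.1):
    """Average-link agglomerative clustering of context indices.

    Merging stops once the most similar pair of clusters falls below
    `threshold`, so the number of clusters is decided by the data.
    """
    vectors = [vectorize(c) for c in contexts]
    clusters = [[i] for i in range(len(contexts))]

    def similarity(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, (x, y) = max(
            (similarity(a, b), (x, y))
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if best < threshold:
            break
        clusters[x] += clusters[y]
        del clusters[y]
    return clusters

print(cluster_contexts(CONTEXTS))
```

On these four contexts the two financial sentences end up in one cluster and the two river sentences in another; with real Russian data, lemmatization and denser context representations would likely be needed.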

Automated Word Sense Frequency Estimation for Russian Nouns

According to Zipf's observation there is a strong correlation between word frequency and polysemy, and yet word sense frequency distribution is a neglected area in computational linguistics. Furthermore, the study of sense frequency has theoretical interest and practical applications for lexicography and word sense disambiguation. Though WordNet and SemCor contain some information about sense frequency in English, it is not enough for either practical or research purposes. For Russian, even this information is lacking. To fill this gap, we develop and test an automated system based on semantic vectors that deals with the problem of sense frequency for Russian nouns. The model is first trained unsupervised on large corpora and then supplied with contexts and collocations from the Active Dictionary of Russian. Dictionary examples are used either for supervised post-training, or for automatic labeling of clusters that are learnt unsupervised. This allows us to reach a frequency estimation error of 11–15% on different corpora without any additional labeled data. Word sense frequency distributions for 440 nouns are available online.
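The general idea of estimating sense frequencies from dictionary examples can be sketched as follows. This is a simplified stand-in, not the authors' system: it assumes that each context is already represented as a dense vector, builds one centroid per sense from the dictionary examples, assigns every corpus context to the nearest centroid by cosine similarity, and reports the share of contexts per sense. The function names, the two-dimensional vectors and the sense labels are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def estimate_sense_frequencies(sense_examples, corpus_vectors):
    """Assign each corpus context to the nearest sense centroid
    and return the relative frequency of every sense."""
    centroids = {sense: centroid(vs) for sense, vs in sense_examples.items()}
    counts = Counter(
        max(centroids, key=lambda s: cosine(centroids[s], v))
        for v in corpus_vectors
    )
    return {sense: counts[sense] / len(corpus_vectors) for sense in centroids}

# Toy vectors standing in for dictionary examples of two senses (assumed data).
sense_examples = {
    "financial": [[1.0, 0.1], [0.9, 0.2]],
    "river": [[0.1, 1.0], [0.2, 0.8]],
}
# Toy vectors standing in for contexts sampled from a corpus (assumed data).
corpus_vectors = [[0.8, 0.1], [0.9, 0.3], [0.7, 0.2], [0.1, 0.9]]

print(estimate_sense_frequencies(sense_examples, corpus_vectors))
```

In this toy setup three of the four contexts fall closest to the "financial" centroid. How the paper measures the reported estimation error is not stated in the abstract, so no error metric is sketched here.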

Towards clustering-based word sense discrimination

This paper describes a series of experiments conducted to group similar words using context features derived from a corpus. The goal is to find an approach suitable for cleaning the fuzzy WordNet synsets obtained by automatic translation of Serbian synsets into Slovene. Similar techniques have already been used successfully by a number of researchers, and they are particularly attractive because they are knowledge-lean and based on evidence found in simple raw text. A selection of features and settings is tested on sample test sets with an unsupervised machine learning method called hierarchical clustering. In the final part of the paper, the obtained results are analyzed and the optimal set of features is selected, followed by a discussion of the results and some further research plans. Slovene title: Poskus uporabe hierarhičnega razvrščanja v skupine za določanje pomena besed (An attempt to use hierarchical clustering to determine word meaning). Slovene abstract, translated: The paper describes a series of experiments in which, on the basis of word contexts extracted from a corpus...

Word Sense Disambiguation for Russian Verbs Using Semantic Vectors and Dictionary Entries

Word sense disambiguation (WSD) methods are useful for many NLP tasks that require semantic interpretation of input. Furthermore, such methods can help estimate word sense frequencies in different corpora, which is important for lexicographic studies and language learning resources. Although previous research on the disambiguation of polysemous Russian verbs established some important and interesting results, it mostly focused on reducing ambiguity or determining the most frequent sense rather than on evaluating WSD accuracy. To the best of our knowledge, there is no comprehensively evaluated method that can perform semi-supervised word sense disambiguation for Russian verbs. In this paper we present a WSD method for verbs that reaches an average disambiguation accuracy of 75% using only available linguistic resources: examples and collocations from the Active Dictionary of Russian and large unlabeled corpora. We evaluate the method on contexts sampled from the web-based corpus RuTenTen11 for 10 verbs, with 100 contexts for each verb. We compare different variations of the method and analyze its limitations. The method's implementation and the labeled contexts are available online.

Word Sense Frequency Estimation for Russian: Verbs, Adjectives and Different Dictionaries

In this paper we investigate several extensions to our prior work on sense frequency estimation for Russian. Our method is based on semantic vectors and achieves good accuracy for sense frequency estimation when trained on dictionary entries from the Active Dictionary of Russian and unannotated corpora. We apply our method to verbs and adjectives to obtain sense frequencies for 329 verbs and 256 adjectives in an academic corpus and a web-based corpus. We compare frequency distributions against dictionary sense ordering and between the two corpora, and find that the first dictionary sense is not the most frequent one for almost half of the words we studied. Evaluation on verbs and adjectives shows that the frequency estimation error is lower than 15%. We investigate the effect of sense granularity by evaluating how the accuracy of our method changes when it is applied to more coarse-grained senses. We also investigate whether our method can be applied to other dictionaries with less elaborate sense descriptions by evaluating its accuracy when it is trained on entries from two other dictionaries.

Clustering WordNet Word Senses

Eneko Agirre and Oier Lopez de Lacalle

This paper presents the results of a set of methods to cluster WordNet word senses. The methods rely on different information sources: confusion matrices from Senseval-2 Word Sense Disambiguation systems, translation similarities, hand-tagged examples of the target word senses, and examples obtained automatically from the web for the target word senses. The clustering results have been evaluated using the coarse-grained word senses provided for the lexical sample in Senseval-2.

Word Sense Disambiguation in Monolingual Dictionaries for Building Russian WordNet

2016

The Russian language is currently poorly supported by WordNet-like resources. One of the new efforts to build a Russian WordNet involves mining monolingual dictionaries. While most steps of the building process are straightforward, word sense disambiguation (WSD) is a source of problems. Because word context is limited, a specific WSD mechanism is required for each kind of relation mined. This paper describes the WSD method used for mining hypernym relations. The first part of the paper explains the main reasons for choosing monolingual dictionaries as the primary source of information for a Russian-language WordNet and states some problems faced during information extraction. The second part defines the algorithm used to extract hyponym-hypernym pairs. The third part describes the algorithm used for WSD.

KSU KDD: Word sense induction by clustering in topic space

2010

We describe our language-independent unsupervised word sense induction system. The system uses only topic features to cluster different word senses in their global-context topic space. Using unlabeled data, the system trains a latent Dirichlet allocation (LDA) topic model and then uses it to infer the topic distributions of the test instances. By clustering these topic distributions in topic space, we group the instances into different senses. Our hypothesis is that closeness in topic space reflects similarity between different word senses. The system participated in the SemEval-2 word sense induction and disambiguation task and achieved the second highest V-measure score among all systems.
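The V-measure used for ranking above is the harmonic mean of homogeneity (each induced cluster contains instances of a single gold sense) and completeness (all instances of a gold sense land in the same cluster). The following is a minimal sketch of the metric itself, not of the described system; the function names and the toy labels are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )

def v_measure(gold, predicted):
    """Harmonic mean of homogeneity and completeness."""
    h_gold, h_pred = entropy(gold), entropy(predicted)
    # By convention a score is 1.0 when the corresponding entropy is zero.
    homogeneity = 1.0 if h_gold == 0 else \
        1.0 - conditional_entropy(gold, predicted) / h_gold
    completeness = 1.0 if h_pred == 0 else \
        1.0 - conditional_entropy(predicted, gold) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Gold senses of four instances and two hypothetical system outputs.
gold = ["company", "company", "area", "area"]
print(v_measure(gold, [1, 1, 0, 0]))  # clusters match senses up to renaming
print(v_measure(gold, [0, 0, 0, 0]))  # everything in a single cluster
```

Cluster identifiers do not need to match sense names, because the metric depends only on how instances are grouped. Putting all instances into one cluster is perfectly complete but not homogeneous at all, so its V-measure is zero.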

Word Sense Induction for Better Lexical Choice

Research in Computing Science

Most words in natural languages are polysemous, that is, they have multiple possible meanings or senses. The sense in which a word is used determines its translation. We show that incorporating a sense-based translation model into a statistical machine translation model consistently improves translation quality across all test sets of five language pairs, according to all eight of the most commonly used evaluation metrics. This paper investigates how to initiate research in word sense disambiguation and statistical machine translation for under-resourced languages by applying word sense induction.