Word Sense Induction Methods: Which One Is Better for Russian

The topic of this study is word sense induction (WSI), that is, the automatic discovery of the possible senses of a word from text corpora. WSI remains a challenging task, and there are few examples of it being successfully deployed in end-user applications. Our aim is to apply WSI to Russian lexicography as a supporting tool for linguists. For this purpose, we compared methods previously applied to English: Adaptive Skip-gram (Adagram), Latent Dirichlet Allocation (LDA), and several clustering techniques based on word2vec: clustering of contexts, clustering of context words, and clustering of synonyms. We evaluated these WSI methods for Russian nouns and verbs both quantitatively and qualitatively. For the quantitative evaluation, we measured the similarity of the induced clustering to the existing dictionary senses with the Adjusted Rand Index (ARI) and V-measure, using labeled contexts. For the qualitative evaluation, we assessed the interpretability of the derived senses, the number of duplicate senses, the number of mixed senses, and the derivation of rare senses. The study was performed on 15 nouns using the RuWac Internet corpus.
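The quantitative evaluation described above can be illustrated with a minimal sketch of the two scores, written in plain Python from their standard definitions. It is not the code used in the study: the function names, the toy word, and the sense labels below are assumptions made for illustration only.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(gold, pred):
    """Chance-corrected agreement between two labelings of the same contexts."""
    n = len(gold)
    if n < 2:
        return 1.0  # assumption: treat trivial input as perfect agreement
    # Count context pairs placed together in both, in gold, and in pred.
    same_both = sum(comb(c, 2) for c in Counter(zip(gold, pred)).values())
    same_gold = sum(comb(c, 2) for c in Counter(gold).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = same_gold * same_pred / comb(n, 2)
    maximum = (same_gold + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivially identical
    return (same_both - expected) / (maximum - expected)


def _entropy(labels):
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given), estimated from label co-occurrence counts."""
    n = len(target)
    given_counts = Counter(given)
    joint = Counter(zip(target, given))
    return -sum(c / n * log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(gold, pred):
    """Harmonic mean of homogeneity and completeness."""
    h_gold, h_pred = _entropy(gold), _entropy(pred)
    homogeneity = 1.0 if h_gold == 0 else 1 - _conditional_entropy(gold, pred) / h_gold
    completeness = 1.0 if h_pred == 0 else 1 - _conditional_entropy(pred, gold) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical example: six labeled contexts of one ambiguous noun with two
# dictionary senses, and the cluster ids an induction method assigned to them.
gold_senses = ["castle", "castle", "castle", "lock", "lock", "lock"]
induced_clusters = [0, 0, 1, 1, 1, 1]

ari = adjusted_rand_index(gold_senses, induced_clusters)
v = v_measure(gold_senses, induced_clusters)
print(f"ARI = {ari:.3f}, V-measure = {v:.3f}")
```

Both scores depend only on how contexts are grouped, not on the cluster names, so induced clusters do not need to be mapped to dictionary senses before scoring. In this sketch, putting every context into a single cluster gives an ARI of 0, which is one reason to report a chance-corrected score alongside V-measure.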