The nature of collocations in the Russian language. The experience of automatic extraction and classification of the material of news texts

Analysis of Collocations in Russian: Corpus vs Dictionary


The paper discusses the results of an experiment in collocation extraction from a corpus of Russian texts. The data obtained are compared with the data given for set expressions in modern Russian dictionaries in order to analyze, from the standpoint of traditional lexicography, what kinds of phrases such an approach yields. The paper also explores the role of statistical measures in extracting collocations in Russian.

Natural Language Processing and Social Network Analysis Studying Special Text Russian Corpora by the Lexico-Syntactic Models


The paper presents the results of automatic term extraction from a special text corpus (a collection of papers on corpus linguistics) by means of statistical methods (association measures) combined with certain syntactic models. The approach is based on lexico-syntactic models that can be viewed as models of phrases for the Russian language. Sketch Engine is a corpus tool that takes as input a corpus in any language together with the corresponding grammar patterns. The system provides information about a word’s collocability for specific dependency models and generates lists of the most frequent phrases for a given word based on the appropriate models. The extracted terms belong to various clusters and represent the lexical structure of the texts in question. The applied method includes statistical analysis that makes it possible to estimate paradigmatic and syntagmatic relations between lexemes based on their distribution.
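As a rough illustration of combining a syntactic model with an association measure, the following minimal sketch keeps only adjacent lemma pairs that match a part-of-speech pattern and ranks them by logDice. The abstract does not say which measure or patterns were used, so the choice of logDice, the ADJ + NOUN pattern, the tag names, the function names, and the toy data are all assumptions made for illustration. Frequencies are counted over plain lemmas, which is a simplification.

```python
import math
from collections import Counter

# Toy lemmatized, POS-tagged sentences (invented data, hypothetical tag set).
SENTENCES = [
    [("корпусный", "ADJ"), ("лингвистика", "NOUN")],
    [("корпусный", "ADJ"), ("лингвистика", "NOUN"),
     ("изучать", "VERB"), ("текст", "NOUN")],
    [("новый", "ADJ"), ("текст", "NOUN")],
]


def extract_candidates(sentences, pattern=("ADJ", "NOUN")):
    """Count lemma frequencies and adjacent pairs whose tags match `pattern`."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        for lemma, _tag in sent:
            unigrams[lemma] += 1
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            if (t1, t2) == pattern:
                bigrams[(w1, w2)] += 1
    return unigrams, bigrams


def log_dice(f_xy, f_x, f_y):
    """logDice association score: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def rank_terms(sentences, pattern=("ADJ", "NOUN")):
    """Return pattern-matching pairs sorted by descending logDice."""
    unigrams, bigrams = extract_candidates(sentences, pattern)
    scored = [
        (pair, log_dice(freq, unigrams[pair[0]], unigrams[pair[1]]))
        for pair, freq in bigrams.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for pair, score in rank_terms(SENTENCES):
        print(" ".join(pair), round(score, 3))
```

Filtering by pattern before scoring keeps the candidate list limited to phrase types that can plausibly be terms, so the association measure only has to rank them.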

Collocations in Russian Lexicography and Russian Collocations Database


The paper presents the issue of collocability and collocations in Russian and surveys a wide range of dictionaries, both printed and online, that describe collocations. Our project deals with building a database that will include dictionary and statistical collocations. The former are described in various lexicographic resources, whereas the latter can be extracted automatically from corpora. Dictionaries differ from one another and present information in various ways, which makes it hard for language learners and researchers to acquire data. A number of dictionaries were analyzed and processed to retrieve verified collocations; however, the overlap between the lists of collocations extracted from them is still rather small. This indicates a need for a unified resource that takes collocability into account and provides more examples. The proposed resource will also be useful for linguists and for studying Russian as a foreign language. The obtained results can...

Syntagmatic Relations in Russian Corpora and Dictionaries

Pragmantax II. Zum aktuellen Stand der Linguistik und ihren Teildisziplinen. The Present State of Linguistics and its Sub-Disciplines. Frankfurt a.M.: Peter Lang, 2014. S. 333-344.

The paper describes the notions of collocability and collocation, the statistical background for collocation extraction, and experiments in applying statistical tools to extract collocations from Russian texts.

Extracting collocations in Russian: Statistics vs. Dictionary


The notion of collocation is quite ambiguous. The paper offers a concise survey of different approaches to it (British contextualism, the lexicographical approach, and the approach of the “Meaning-Text” theory). It then discusses the results of retrieving collocations from a corpus of Russian texts. The data obtained are compared with the data given for set expressions in modern Russian dictionaries. The paper also explores the role of statistical measures in extracting collocations in Russian and the issue of their applicability to the Russian language.
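The abstract does not name the statistical measures it examines. As a hedged illustration, the sketch below computes two measures that are widely used for collocation extraction, pointwise mutual information (MI) and the t-score, from raw corpus counts. The function names and the counts in the usage note are invented for illustration.

```python
import math


def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2 of observed over expected co-occurrence.

    f_xy: frequency of the pair, f_x and f_y: frequencies of each word,
    n: corpus size in tokens.
    """
    return math.log2(f_xy * n / (f_x * f_y))


def t_score(f_xy, f_x, f_y, n):
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)
```

With hypothetical counts in a corpus of 1,000,000 tokens, a pair seen twice whose words each occur only twice (`pmi(2, 2, 2, 1_000_000)`) scores higher on MI than a pair seen 500 times whose words occur 2,000 and 1,000 times (`pmi(500, 2000, 1000, 1_000_000)`), while the t-score ranks the two pairs the other way round. MI tends to favor rare, exclusive combinations and the t-score tends to favor frequent ones, which is one reason the choice of measure matters for a given language and corpus.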

Analyzing Bulgarian and English Collocations *

* This research has been partially supported by project NI-I-608-96 "Modelling of structural text characteristics by knowledge-based schemata" of the National Science Foundation and project 010037 "Methods for knowledge representation and processing in the modern information technologies" of IIT BAS.

Representation of Dictionaries in the Russian Collocations Database


The study of collocability is an important task and remains highly relevant in linguistics. The present paper discusses the issue of collocability and collocations in a number of Russian dictionaries (the Dictionary of the Russian Language, the Dictionary of Set Verb-Noun Phrases, the Dictionary of Russian Idiomatics and the Dictionary of Collocations). These dictionaries were analyzed and used to create a Russian collocations database which includes both verified (dictionary) collocations and data from text corpora. At present, the database includes about 18,500 collocations.

Building a Gold Standard for a Russian Collocations Database

Proceedings of the XVIII EURALEX International Congress Lexicography in Global Contexts, 2018

In the last decade, linguists have become increasingly interested in corpus material, which allows for a fresh approach to phenomena that have already been extensively described in academic works. The dual nature of the co-occurrence phenomenon lies, on the one hand, in its linguistic component and, on the other, in its probabilistic (combinatorial) characteristics. The former has been described in numerous papers and explicitly defined in dictionaries, while the latter can be identified by a statistical approach. The present paper focuses on the process of building a gold standard that will include data from Russian dictionaries and corpora. The standard is being prepared for a Russian Collocations Database that already includes information on words’ collocability extracted from text corpora by means of statistical measures and linguistic filters. The gold standard will also be used for evaluating the extracted collocations and for marking them as “true” collocations with references to the dictionaries.
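One way such an evaluation could work is sketched below: automatically extracted collocations are compared with a gold standard, precision and recall are computed, and each extracted item is marked with the dictionaries that attest it. This is a minimal sketch, not the procedure used in the paper. The data layout (a mapping from collocation to a list of dictionary names), the dictionary labels "DictA" and "DictB", and the function name are hypothetical.

```python
def evaluate(extracted, gold):
    """Compare extracted collocations with a gold standard.

    extracted: iterable of collocation strings produced automatically.
    gold: dict mapping each gold collocation to the list of dictionaries
          that attest it (hypothetical layout).
    Returns (precision, recall, marked), where `marked` maps every extracted
    collocation to its dictionary references (empty list if unattested).
    """
    extracted_set = set(extracted)
    gold_set = set(gold)
    true_positives = extracted_set & gold_set
    precision = len(true_positives) / len(extracted_set) if extracted_set else 0.0
    recall = len(true_positives) / len(gold_set) if gold_set else 0.0
    marked = {item: gold.get(item, []) for item in extracted_set}
    return precision, recall, marked


if __name__ == "__main__":
    # Invented example data; the dictionary labels are placeholders.
    gold = {
        "принимать решение": ["DictA"],
        "играть роль": ["DictA", "DictB"],
        "оказывать влияние": ["DictB"],
    }
    extracted = ["принимать решение", "играть роль",
                 "новый год", "в том числе"]
    precision, recall, marked = evaluate(extracted, gold)
    print(precision, recall)
    print(marked["играть роль"])
```

Keeping the dictionary references alongside the true/false decision preserves the link back to the lexicographic sources, which the abstract names as one purpose of the gold standard.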

Linguistic analysis method of Ukrainian commercial textual content for data mining


This article deals with the scientific and practical task of automatically detecting significant keywords and rubricating Ukrainian content in Internet systems based on the method of linguistic analysis of text information. The article presents theoretical and experimental substantiation of the method of linguistic analysis of Ukrainian content using Porter's stemming. The method aims to automatically detect significant keywords in Ukrainian content on the basis of the proposed formalization of the components of analysis: grammatical (grapheme), morphological, syntactic, semantic, referential, and structural.

Identification of the usage of collocations in business texts

Automatic Documentation and Mathematical Linguistics, 2017

This paper considers an algorithm for revealing the multiword combinations and derivative prepositions that are characteristic of Russian business and official speech. The construction of such an algorithm can be considered a necessary stage in studying business and official speech. The described mechanism can be useful in compiling textbooks, training, and controlling services for both foreigners and native speakers of the Russian language. A word list in the form of an Excel file is created based on the core of collocations that have already been included in existing tutorials. The orders of public authorities published in the public domain on the Internet are used as the texts in which the search is performed.