Corpora with Special Markup for Studying Concept Statistics
Related papers
CORPORA-BASED EXTENSION OF A SPECIALIZED DICTIONARY
When translating a special-domain text, one needs to use the lexical equivalents and linguistic structures that are widely used within that domain, for example in official translations of legal documents. Corpora of parallel texts contain a number of legislative acts of the Russian Federation with standardized parallel translations into English. The sentence, phrase, and word equivalents extracted from these parallel texts should be used when translating new legal documents from Russian into English and vice versa. The problem is therefore one of automated (or semi-automated) matching of fragments of parallel texts at the level of sentences, phrases, collocations, word entries, etc., before recording them in a specialized dictionary. The task is complicated by several issues commonly observed in translation:
- different word order in the paired sentences;
- elements that are generally omitted in the translated sentence;
- words with several potential equivalents in the translated sentence (polysemy);
- phrases translated as a single word;
- untranslated words, owing to the incompleteness of bilingual dictionaries.
We propose the following fragment-matching process: lemmatize each word; look up in the dictionary all pairs of single-word or multi-word equivalents; construct an adjacency matrix for a bipartite graph whose vertices are the translated words of each sentence; find the critical path in the bilingual space, taking into account the weights of the equivalent pairs and the distance between them (the number of words separating them). A fragment is defined as a substring of the sentence bounded by points on the critical path. The report contains examples from the processed text of the Housing Code of the RF, with candidate equivalents, their evaluation, and the results of a preliminary analysis.
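The matching steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy `DICTIONARY`, its weights, the `gap_penalty` value, and the reading of "critical path" as the highest-scoring monotone chain of dictionary links are all assumptions, and lemmatization is assumed to have been done already.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: (Russian lemma, English lemma) -> link weight.
DICTIONARY: Dict[Tuple[str, str], float] = {
    ("жилищный", "housing"): 1.0,
    ("кодекс", "code"): 1.0,
    ("регулировать", "regulate"): 0.9,
    ("отношение", "relation"): 0.8,
}

def adjacency_matrix(src: List[str], tgt: List[str]) -> List[List[float]]:
    """Weight of the dictionary link for every source/target lemma pair."""
    return [[DICTIONARY.get((s, t), 0.0) for t in tgt] for s in src]

def best_chain(matrix: List[List[float]],
               gap_penalty: float = 0.1) -> List[Tuple[int, int]]:
    """Highest-scoring chain of linked cells that moves forward in both
    sentences; each skipped word between two links costs gap_penalty."""
    cells = sorted((i, j, w) for i, row in enumerate(matrix)
                   for j, w in enumerate(row) if w > 0)
    if not cells:
        return []
    score = [w for _, _, w in cells]
    prev = [-1] * len(cells)
    for b, (i2, j2, w2) in enumerate(cells):
        for a in range(b):
            i1, j1, _ = cells[a]
            if i1 < i2 and j1 < j2:
                gap = (i2 - i1 - 1) + (j2 - j1 - 1)
                candidate = score[a] + w2 - gap_penalty * gap
                if candidate > score[b]:
                    score[b], prev[b] = candidate, a
    end = max(range(len(cells)), key=score.__getitem__)
    chain = []
    while end != -1:
        chain.append(cells[end][:2])
        end = prev[end]
    return chain[::-1]

def fragments(src: List[str], tgt: List[str], chain: List[Tuple[int, int]]):
    """Cut both sentences at the chain points to obtain paired fragments."""
    out = []
    for (i1, j1), (i2, j2) in zip(chain, chain[1:]):
        out.append((src[i1:i2], tgt[j1:j2]))
    if chain:
        i, j = chain[-1]
        out.append((src[i:], tgt[j:]))
    return out

# Invented example sentences, already lemmatized.
SRC = ["жилищный", "кодекс", "российский", "федерация",
       "регулировать", "отношение"]
TGT = ["the", "housing", "code", "of", "the", "russian", "federation",
       "regulate", "relation"]

chain = best_chain(adjacency_matrix(SRC, TGT))
for src_part, tgt_part in fragments(SRC, TGT, chain):
    print(" ".join(src_part), "<->", " ".join(tgt_part))
```

In this sketch, words missing from the dictionary ("российский", "федерация") end up inside the fragment bounded by the neighbouring links, which is one way candidate equivalents could be proposed for later evaluation. A strictly monotone chain does not handle reordered words; that limitation belongs to the sketch, not necessarily to the method described.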
Linguistics of Big Corpora
Computational Linguistics and Computational Ontologies: collected papers. Proceedings of the XVIII Joint Conference "Internet and Modern Society" (IMS-2015), St. Petersburg, June 23-25, 2015. St. Petersburg: ITMO University, 2015, pp. 82-93.
Quantitative evaluation of linguistic data and mathematical methods for processing them are of great interest to linguists. Text corpora are a rich source of statistical information. The paper describes experimental studies of the stability of word combinations and methods for estimating it quantitatively, both synchronically and diachronically. It shows how text corpora and corpus methods and tools, such as association measures, word sketches, and concordances, can be used to expand the entries in existing dictionaries, and how set phrases can be evaluated quantitatively. Corpus linguistics understands set phrases as statistically determined units. This approach underlies various automatic methods for extracting idioms and collocations. There are few studies of the productivity of set phrases over time because historical corpora are small. In this research, the usage of set phrases was studied diachronically on the basis of the large Russian corpus of the Google Books Ngram Viewer, which contains billions of tokens. The study argues that diachronic productivity is best evaluated by studying contexts, which the corpus tools used make possible. Ultimately, it is shown that corpus linguistics methods and tools make it possible to create dictionaries of a new type that include more set phrases and collocations than before. However, there is the problem of verifying the data obtained from corpora and establishing their reliability. To realize their potential, corpora must satisfy such requirements as representativeness (volume) and balance (quality).
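As a rough illustration of what an association measure computes, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). PMI is only one common measure and is chosen here as an assumption; the abstract does not say which measures the study used. The function name `bigram_pmi` and the toy token list are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def bigram_pmi(tokens: List[str],
               min_count: int = 1) -> Dict[Tuple[str, str], float]:
    """PMI = log2(P(x, y) / (P(x) * P(y))) for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI values
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy corpus, invented for illustration.
tokens = "set phrase and set phrase or free phrase".split()
for pair, pmi in sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(pmi, 2))
```

PMI is known to overrate rare pairs, so a frequency threshold such as `min_count` is usually raised on real corpora, and the scores are more informative when compared across time slices than when read in isolation.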
Electronic Corpora as a Research Tool-Possibilities and Prospects
2014
Automatic excerption of language material from electronic corpora provides great opportunities for research. It facilitates excerption, speeds up the finding of examples, and refines them. In this report we present the search options in the Bulgarian National Corpus, as it is freely available, morphologically annotated, and balanced in terms of genre and theme, and contains texts from different periods over a span of 100 years. We limited the search to examples of the small clause syntactic structure, since its specificity illustrates well both the advantages and the limitations of automatic search. After a series of experiments with different search queries for specific types of small clauses, depending on the expression and its syntax, we concluded that automatic excerption of linguistic material offers a number of advantages and opportunities, but also has its limits, difficulties, and prospects.
Paradigm of Corpus Typological Characteristics by the Type of Text Data
Naukovì zapiski Nacìonalʹnogo unìversitetu «Ostrozʹka akademìâ», 2018
The article attempts to analyze the typological characteristics of text corpora. Corpora are classified according to the text data taken into account when compiling them, in particular by degree of specialization, formal nature, and the language parameter. It is found that the paradigm for the parameter "degree of specialization of text data" comprises general-language and specialized corpora. In turn, within the group of specialized corpora, the type of text data that gives a corpus its name and serves as the selection parameter may be determined by the genre, stylistic, temporal, anthropocentric, professional, communicative, geographical, or social nature of linguistic variety. Examples of these corpus types are also presented. The article provides terminological equivalents of corpus names by type of language data in Ukrainian and English. Keywords: general-language corpus, special corpus, text data, corpus type, typological characteristics.
Development of students’ collocational competence based on corpora
Perspectives of Science & Education, 2022
Introduction. The development of students’ collocational competence is one of the main goals of foreign language teaching in a linguistic university. This goal can be achieved through the use of corpora. However, implementing methods for developing students’ collocational competence based on corpora requires selecting the teaching content and designing learning stages that combine classroom and distance, online and offline forms of learning and allow students to work independently with data from linguistic corpora. The purpose of the study is to develop methods for the development of students’ collocational competence based on corpora. Materials and methods. The study involved 2nd-year students (N=44) in the “Linguistics” (profile: “Theory and Methods of Teaching Foreign Languages and Cultures”) and “Pedagogical Education” (profile: “English Language”) programmes at Derzhavin Tambov State University (Russian Federation). To test the effectiveness of the methods, a selection of collocations was made for the experimental group and a hybrid teaching method was developed, which includes project activities (classroom and distance) and consists of three stages. Participants in the control group studied a practical English course using traditional printed teaching aids. The subject of control was students’ collocational competence. Student’s t-test was used to process the results. Research results. The study shows that both teaching methods are quite effective for developing the collocational competence of linguistic university students (control group: t = 12.82, p ≤ 0.005; experimental group: t = 15.64, p ≤ 0.005). At the same time, statistical comparison of the post-test results of the control and experimental groups proved the effectiveness of the author’s methods of teaching collocations based on corpora (t = 2.54, p ≤ 0.005). Conclusion. The novelty of the study lies in the formulation of methods for the development of students’ collocational competence based on corpora, which consist of three stages and include classroom and distance, online and offline forms of education. The results obtained can be used in developing methods for teaching a foreign language in general and lexical competence in particular on the basis of corpora, as well as in developing methods for foreign language teaching based on other modern ICTs.
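For readers unfamiliar with the statistic reported here, a between-group Student's t value can be computed along the following lines. This is a sketch under assumptions: the abstract does not say whether the pooled-variance two-sample form was used, and the score lists below are invented for illustration, not data from the study.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple

def student_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Two-sample Student's t with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, df

# Hypothetical post-test scores for two groups (not from the study).
experimental = [18, 20, 17, 22, 19, 21]
control = [15, 17, 16, 18, 14, 17]

t, df = student_t(experimental, control)
print(f"t = {t:.2f}, df = {df}")
```

The resulting t is then compared with the critical value of the t distribution for the given degrees of freedom and significance level, either from a table or with a statistics package.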
Text corpora in the study of the English compound adjectives
CURRENT TRENDS AND FIELDS OF PHILOLOGICAL STUDIES IN THE CHALLENGING REALITY, 2022
[...] of a person is unacceptable. This substitution of values also affected the content and structure of the work. Alexander Volkov's adaptation resembles the original less and less, and now fully conforms to the model of canonical Soviet literature.