Linguistic Corpora: A View from Turkish

Construction of the Turkish National Corpus (TNC)

Abstract: This paper addresses theoretical and practical issues encountered in the construction of the Turkish National Corpus (TNC). The TNC is designed to be a balanced, large-scale (50 million words), general-purpose corpus of contemporary Turkish. It has benefited from previous practices and efforts in corpus construction. In this sense, the TNC generally follows the framework of the British National Corpus, with adjustments to the corpus design made where needed.

Corpus linguistics theory and design and application of a Turkish corpus

2011

Abstract: Corpus linguistics is a branch of linguistics that has come to the fore in recent years. This study briefly touches on corpus linguistics theory and the basic concepts of corpus linguistics. Its main aim is to build a corpus that is accessible over the internet, lets authorized users add texts to and remove texts from the database, can display keywords in context, is platform independent, and displays Turkish characters without problems. The code of the sample corpus that was developed …

Challenges Encountered in Turkish Natural Language Processing Studies

Natural and Engineering Sciences, 2020

Natural language processing is a branch of computer science that combines artificial intelligence with linguistics. It aims to analyze language input, such as written text or speech, with software and convert it into information. Because each language has its own grammatical rules and vocabulary, studies in this field are inherently complex. Turkish is interesting in many respects: agglutinative word structure, consonant and vowel harmony, a large number of productive derivational morphemes (a practically infinite vocabulary), derivation and syntactic relations, and complex word stress and phonological rules. This study describes the features of Turkish that are of interest for natural language processing. It also summarizes the natural language processing techniques, systems and resources developed for Turkish.
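As an illustration of the agglutination and harmony rules mentioned above, the following minimal Python sketch attaches the plural suffix (-lar/-ler) and the locative suffix (-da/-de/-ta/-te) to a stem. The function names are hypothetical, the input is assumed to be lowercase, and well-known exceptions (for example some loanwords) are ignored; it is not taken from any of the systems surveyed.

```python
# Minimal sketch of Turkish suffix harmony; hypothetical helper names.
# Assumes lowercase input and ignores lexical exceptions.
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS_CONSONANTS = set("fstkçşhp")


def last_vowel_is_back(word: str) -> bool:
    """Return True if the last vowel in the word is a back vowel."""
    for ch in reversed(word):
        if ch in BACK_VOWELS:
            return True
        if ch in FRONT_VOWELS:
            return False
    raise ValueError(f"no vowel found in {word!r}")


def plural(stem: str) -> str:
    """Attach -lar after a back vowel, -ler after a front vowel."""
    return stem + ("lar" if last_vowel_is_back(stem) else "ler")


def locative(stem: str) -> str:
    """Attach -da/-de, devoiced to -ta/-te after a voiceless consonant."""
    consonant = "t" if stem[-1] in VOICELESS_CONSONANTS else "d"
    vowel = "a" if last_vowel_is_back(stem) else "e"
    return stem + consonant + vowel


if __name__ == "__main__":
    for stem in ["ev", "kitap", "okul", "göz"]:
        print(stem, plural(stem), locative(plural(stem)))
```

Because suffixes can be chained, a single stem such as "ev" (house) yields "evler" (houses) and "evlerde" (in the houses), which is one reason the surface vocabulary of Turkish is practically unbounded.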

The Turkish Lexicography Corpus TLC: An Overview

The 12th International Conference of The Asian Association for Lexicography “Lexicography in the Digital World”, 2018

Lexicography in Turkey lacks an established terminology: although lexicography studies are increasing, no lexicography terminology has been created and made available to researchers. To meet this need and create a Turkish lexicography terminology, the Eskisehir Osmangazi University Center for Lexicography decided to start a project. This study aims to create a specialised corpus of Turkish lexicography studies and to set out Turkish lexicography terminology using this corpus. A specialised corpus will help researchers decide which terminology to use in their studies and can also help to standardize terminology use. There is no platform on which researchers can discuss and see how terminology was used in previous studies. Terminology choice is mainly made intuitively and usually depends on discussions within small academic groups. A corpus can ease terminology use for researchers in the field. Based on these needs, a corpus for Turkish lexicography was created and is accessible to researchers on a website, www.tsd.ogu.edu.tr. The stages of the study were as follows: synchronization and desynchronization of the corpus, determining the corpus content, digitizing the sources, external tagging of the texts, text type tagging, and lemmatizing. The website also has a platform through which researchers can send new terms they have encountered in related studies. Keywords: Lexicography, terminology, term, corpus, TLC

Patterns and frequency: Evidence from the Turkish National Corpus (TNC)

Introduction: Counting the frequencies of lexical items is a traditional undertaking. Previously, the primary motivation in these studies was practical rather than theoretical, in the sense that quantitative information was expected to provide a better description of individual items as well as of their combinations. Recently, however, research on frequencies has concluded that statistical regularities and distributional aspects of lexical structures have theoretical significance, bringing new insights into the role of lexis in grammar and in patterning. Advances in corpus software development and corpus analytic tools have provided additional empirical evidence for a renewed understanding of lexical structures. Concordance data have identified various patterns in ordinary language use, alongside formulaic expressions and various other forms of fixed expressions. In the general framework of the British linguistic tradition (Stubbs, 1993, 2013), work on corpus data has argued that lexical structures encode properties that cannot be captured within the confines of an individual word or lexeme. Sinclair (1998) thus proposes the term lexical item to account for recurrent and regular patterns that extend beyond the size of a single item. The present study presents frequent and recurrent patterns extracted from the Turkish National Corpus (TNC) (http://www.tnc.org.tr). The patterns reviewed here cover (i) sequences of lexical items (i.e., multiword units), (ii) the regular frequent patterns formed by inflectional categories (i.e., multimorpheme units), and (iii) patterns found among adjacent lexical items (i.e., interlexical units). In sum, the data here represent an initial typology of such structures and their observed frequencies. This paper is organized as follows. The first part reviews the basics of fixed expressions and corpus-based analysis of frequent patterns.
In the second part of the paper, the patterns of lexemes and morphemes are given with their distributional frequencies. The frequencies of these recurrent patterns are important for a proper understanding of the Turkish lexicon in general. (To cite: Aksan, Mustafa & Yeşim Aksan (2018). Patterns and frequency: Evidence from the Turkish National Corpus (TNC). In M. A. Akıncı & K. Yağmur (Eds.), The Rouen Meeting: Studies on Turkic structures and language contacts (Turcologica; Vol. 114). Wiesbaden: Harrassowitz Verlag, 107-118.)
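To make the idea of counting recurrent multiword sequences concrete, here is a minimal Python sketch that tallies contiguous n-grams in a tokenized text. The function name and the toy sentence are illustrative assumptions; the sketch does not reproduce the paper's method or any TNC data.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous sequence of n tokens."""
    # zip over n shifted views of the token list yields the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


if __name__ == "__main__":
    # Toy input invented for illustration; not corpus data.
    sample = "ne var ne yok ne var".split()
    for gram, freq in ngram_frequencies(sample, 2).most_common():
        print(freq, gram)
```

On a real corpus the same counts would normally be filtered by a frequency threshold or an association measure before any sequence is treated as a candidate multiword unit.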

A corpus-based analysis of Fakat, Yoksa, Ayrıca

Constraints in Discourse III, 2012

Zeyrek, D., Demirşahin, I., Turan, Ü. D., Çakıcı, R. (2012). A corpus-based analysis of Fakat, Yoksa, Ayrıca. In Anton Benz, Peter Kühnlein, Manfred Stede (eds.), Constraints in Discourse III. Amsterdam, The Netherlands: John Benjamins. This paper presents a corpus-based quantitative investigation of three connectives in Turkish. As in D-LTAG, we take discourse connectives to be lexical anchors that select their two arguments in discourse. Connectives can be grouped into two classes: structural connectives and discourse adverbials. While structural connectives select both of their arguments structurally and through adjacency, adverbials retrieve their first argument from the preceding discourse, like anaphoric expressions. The three connectives selected in the study appear to have structural, anaphoric, and both structural and anaphoric properties, respectively. We delineate how these connectives behave differently in terms of the adjacency of their arguments, the span of their first arguments and their position within the clause. Our corpus-based quantitative analysis shows that there is a statistically significant difference between the structural connectives and the discourse adverbial in our data in terms of the variables selected.

TS Corpus Project: An online Turkish Dictionary and TS DIY Corpus

TS Corpus is a free and independent project that aims to build Turkish corpora, NLP tools and linguistic datasets. Since 2011, 10 corpora, various NLP tools, a large dataset and an online dictionary have been released. This paper focuses on the "online dictionary" and the "TS do-it-yourself corpus" released by the project. The dictionary data is based on the TDK (Turkish Language Society) Contemporary Dictionary, but the published dictionary offers many enhanced functions at the user-interface level. The main contribution of the study, however, lies in the results presented to users in response to their queries: the collocations and trigrams of the keyword searched for. The collocations are harvested from a large Turkish corpus of more than 760 million tokens, and the trigrams were generated from Turkish Wikipedia pages. The do-it-yourself corpus (TS DIY Corpus) allows users to build their own corpora, modify or delete the uploaded texts, and run queries. Users may run queries in different modes, such as "as is", "starting/ending with" or "including"; in addition, an advanced query option allows users to run queries with part-of-speech tags and lemmas. The results are given in KWIC (keyword in context) format. Various text classification options such as publication date, author, domain and genre can be selected during corpus creation. As the number of available Turkish corpora is limited, TS DIY Corpus is a candidate to become a useful, well-known and widely used tool for scholars and researchers who want to use a Turkish corpus or study Turkish texts of their own.
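The query modes and KWIC output described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mode names, the function names and the whitespace tokenization are hypothetical and do not describe the actual TS DIY Corpus implementation.

```python
def token_matches(token, query, mode):
    """Compare one token with the query under a hypothetical query mode."""
    if mode == "as_is":
        return token == query
    if mode == "starts_with":
        return token.startswith(query)
    if mode == "ends_with":
        return token.endswith(query)
    if mode == "including":
        return query in token
    raise ValueError(f"unknown mode: {mode}")


def kwic(tokens, query, mode="as_is", window=2):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token_matches(token, query, mode):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


if __name__ == "__main__":
    # Toy sentence invented for illustration.
    tokens = "bu kitap çok güzel bir kitaptır".split()
    for left, keyword, right in kwic(tokens, "kitap", mode="starts_with"):
        print(f"{left:>12} [{keyword}] {right}")
```

Queries over part-of-speech tags or lemmas would need annotated tokens, for example (form, lemma, tag) triples, instead of plain strings.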

Corpus-Based Research on Terminology of Turkish Lexicography (CBRT-TURKLEX)

Lexikos, 2018

In this paper, we introduce an ongoing lexicographic corpus project. The Center for Lexicography, abbreviated as SÖZMER, was established under the aegis of Eskisehir Osmangazi University to support lexicographical projects. SÖZMER decided to initiate a corpus-based Turkish lexicography project. This project will be the first stage of the endeavour aimed at preparing a specialized dictionary for Turkish lexicography. The primary aim of the project is to prepare an electronic corpus for researchers of Turkish lexicography. The secondary aim of the project is to obtain a word list of Turkish lexicographic terms. This paper presents a description of the process of data collection and the methodology employed for building a specialized corpus. The study contains an outline of the project background, needs, problems, and the phases of corpus building.

Corpus-Based Research on Terminology of Turkish Lexicography

In this paper, we introduce an ongoing lexicographic corpus project. The Center for Lexicography, abbreviated as SÖZMER, was established under the aegis of Eskisehir Osmangazi University to support lexicographical projects. SÖZMER decided to initiate a corpus-based Turkish lexicography project. This project will be the first stage of the endeavour aimed at preparing a specialized dictionary for Turkish lexicography. The primary aim of the project is to prepare an electronic corpus for researchers of Turkish lexicography. The secondary aim of the project is to obtain a word list of Turkish lexicographic terms. This paper presents a description of the process of data collection and the methodology employed for building a specialized corpus. The study contains an outline of the project background, needs, problems, and the phases of corpus building. Summary (translated from the Afrikaans): Corpus-based research on terminology of Turkish lexicography (CBRT-TURKLEX). In this article an ongoing lexicographic project is introduced. The Centre for Lexicography, abbreviated to SÖZMER, was established under the banner of Eskisehir Osmangazi University to support lexicographic projects. SÖZMER decided to initiate a corpus-based Turkish lexicography project. This project …

The Turkish National Corpus (TNC): Comparing the Architectures of v1 and v2

The Turkish National Corpus (TNC), first released in 2012, is the first large-scale (50 million words), web-based, publicly and freely available resource of contemporary Turkish. It is designed to be a well-balanced and representative reference corpus for Turkish. With 48 million words coming from its written part, the untagged TNC v1 represents 4438 different data sources over 9 domains and 34 different genres. The morphologically annotated, 50-million-word TNC v2, with 5412 different documents compiled from written and spoken Turkish, is planned for release in 2016 and offers new query options for linguistic analyses. This paper aims to compare the architectures of TNC v1 and v2 on the basis of a set of queries made on both versions. Standard, restricted and wildcard lexical searches are performed. Then, the speed of the two versions in retrieving the query results as concordance lines is compared. Finally, it is argued that TNC v2 performs better and faster than TNC v1 due to its in-memory inverted index structure. Since building language corpora is a very recent undertaking for Turkish, the architecture of TNC v2 could serve as a model for similar corpus construction projects.
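For readers unfamiliar with the data structure, the following minimal Python sketch shows what an in-memory inverted index with standard and wildcard lexical search can look like. It is an illustration only: the function names, the whitespace tokenization and the toy documents are assumptions, and it is not the TNC v2 implementation.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def build_index(documents):
    """Map each token to the sorted list of (doc_id, position) where it occurs."""
    index = defaultdict(list)
    for doc_id, text in sorted(documents.items()):
        for position, token in enumerate(text.split()):
            index[token].append((doc_id, position))
    return index


def search(index, pattern):
    """Exact lookup, or shell-style wildcard lookup when the pattern has * or ?."""
    if "*" not in pattern and "?" not in pattern:
        return list(index.get(pattern, []))
    hits = []
    # A wildcard query scans the vocabulary, not the running text.
    for token, postings in index.items():
        if fnmatchcase(token, pattern):
            hits.extend(postings)
    return sorted(hits)


if __name__ == "__main__":
    # Toy documents invented for illustration.
    documents = {"d1": "ev evler evde", "d2": "okul ev"}
    index = build_index(documents)
    print(search(index, "ev"))
    print(search(index, "ev*"))
```

The speed advantage of such a structure comes from looking up postings per token rather than scanning the text for each query, so the cost of a search depends on the vocabulary and the number of hits instead of the corpus size.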