Construction of the Turkish National Corpus (TNC)

The Turkish National Corpus (TNC): Comparing the Architectures of v1 and v2

The Turkish National Corpus (TNC), whose first version was released in 2012, is the first large-scale (50 million words), web-based, publicly available and free resource of contemporary Turkish. It is designed to be a well-balanced and representative reference corpus for Turkish. With 48 million words coming from its written part, the untagged TNC v1 represents 4438 different data sources over 9 domains and 34 different genres. The morphologically annotated, 50-million-word TNC v2, with 5412 different documents compiled from written and spoken Turkish, is planned for release in 2016 and offers new query options for linguistic analyses. This paper compares the architectures of TNC v1 and v2 on the basis of a set of queries run on both versions. Standard, restricted and wildcard lexical searches are performed. Then, the speed of the two versions in retrieving the query results as concordance lines is compared. Finally, it is argued that TNC v2 performs better and faster than TNC v1 owing to its in-memory inverted index structure. Since building language corpora is a very recent undertaking for Turkish, the architecture of TNC v2 could serve as a model for similar corpus construction projects. (to cite :
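
To make the idea of an in-memory inverted index concrete, the following is a minimal sketch and not the TNC v2 implementation. The class name, the `domain` metadata field, and the toy sentences are all assumptions made for illustration. It shows how standard, restricted (here: filtered by a document's domain) and wildcard lexical searches can be answered from a token-to-position map, and how hits can be rendered as concordance lines.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase


def tokenize(text):
    # Plain str.lower() is used; Turkish dotted/dotless I casing is NOT handled here.
    return re.findall(r"\w+", text.lower())


class InvertedIndex:
    """Hypothetical in-memory index: token -> list of (doc_id, position)."""

    def __init__(self):
        self.docs = {}      # doc_id -> list of tokens
        self.domains = {}   # doc_id -> domain label (assumed metadata)
        self.postings = defaultdict(list)

    def add(self, doc_id, text, domain=None):
        tokens = tokenize(text)
        self.docs[doc_id] = tokens
        self.domains[doc_id] = domain
        for pos, tok in enumerate(tokens):
            self.postings[tok].append((doc_id, pos))

    def search(self, pattern, domain=None):
        if "*" in pattern or "?" in pattern:
            # Wildcard search: scan the vocabulary, not the running text.
            hits = []
            for term in self.postings:
                if fnmatchcase(term, pattern):
                    hits.extend(self.postings[term])
        else:
            # Standard search: a single dictionary lookup.
            hits = list(self.postings.get(pattern, []))
        if domain is not None:
            # Restricted search: keep only hits from documents in the given domain.
            hits = [h for h in hits if self.domains[h[0]] == domain]
        return sorted(hits)

    def concordance(self, pattern, width=2, domain=None):
        lines = []
        for doc_id, pos in self.search(pattern, domain):
            toks = self.docs[doc_id]
            left = " ".join(toks[max(0, pos - width):pos])
            right = " ".join(toks[pos + 1:pos + 1 + width])
            lines.append((doc_id, left, toks[pos], right))
        return lines


index = InvertedIndex()
index.add("d1", "bu kitap çok güzel", domain="fiction")
index.add("d2", "kitaplar masada duruyor", domain="news")
index.add("d3", "ev kitaplarla dolu", domain="fiction")
for line in index.concordance("kitap*", domain="fiction"):
    print(line)
```

In this sketch a wildcard query scans only the vocabulary keys rather than the running text, which is one reason an inverted index can answer such queries quickly; the actual speed gains reported for TNC v2 depend on its own design.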

Corpus linguistics theory and design and application of a Turkish corpus

2011

Abstract: Corpus linguistics is a branch of linguistics that has come to prominence in recent years. This study briefly touches on corpus linguistics theory and the basic concepts of corpus linguistics. The main aim of this study is to build a corpus that is accessible over the internet, that lets authorized users add texts to and remove texts from the database, that can display keywords in context, that is platform independent, and that displays Turkish characters without problems. The code of the sample corpus that was developed …

Linguistic Corpora: A View from Turkish

Usage-based linguistic studies have gained new insights as corpus-based and corpus-driven analyses have advanced in recent years. Linguists working in different domains have turned to corpora as a major source in their study of language at all levels of representation. Currently, corpus linguistics is evolving into a sophisticated methodology for extracting and analyzing data. Building and using corpora in Turkish linguistics is a recent undertaking, initially motivated by research on natural language processing (NLP). The number of available corpora is increasing, and linguistic research has come to test hypotheses on attested data, or to uncover more lexical and grammatical patterns of use that had gone unnoticed in the absence of corpus data. Advances in NLP research and the tools provided for corpus building and annotation further contribute to corpus studies in Turkish linguistics. (to cite: Aksan, Mustafa & Yeşim Aksan (2018). Linguistic corpora: A view from Turkish (pp. 301-327). Kemal Oflazer & Murat Saraçlar (Eds.) Studies in Turkish Language Processing. Springer Verlag.)

The Turkish Lexicography Corpus TLC : An Overview

The 12th International Conference of The Asian Association for Lexicography “Lexicography in the Digital World”, 2018

The field of lexicography in Turkey lacks consistent terminology: although lexicography studies are increasing, no lexicographic terminology has been created and made available to researchers. To meet this need and create a Turkish lexicography terminology, the Eskisehir Osmangazi University Center for Lexicography decided to start a project. This study aims to create a specialised corpus of Turkish lexicography studies and to set forth Turkish lexicography terminology by using this corpus. A specialised corpus will help researchers decide what terminology to use in their studies, and it can also help to standardize terminology use. There is no platform on which researchers can discuss and see how terminology was used in previous studies. Terminology choice is mainly made intuitively and usually depends on discussions within small academic groups. A corpus can ease terminology use for researchers in the field. Based on these needs, a corpus for Turkish lexicography was created and is accessible to researchers on a website, www.tsd.ogu.edu.tr. The stages of the study were as follows: synchronization and desynchronization of the corpus, determining the corpus content, digitizing the sources, external tagging of the texts, text type tagging, and lemmatizing. The website also hosts a platform through which researchers can submit new terms they have encountered in related studies. Keywords: Lexicography, terminology, term, corpus, TLC

TS Corpus Project: An online Turkish Dictionary and TS DIY Corpus

TS Corpus is a free and independent project that aims to build Turkish corpora, NLP tools and linguistic datasets. Since 2011, 10 corpora, various NLP tools, a large dataset and an online dictionary have been released. This paper focuses on the "online dictionary" and the "TS do-it-yourself corpus" released by the project. The dictionary data is based on the TDK (Turkish Language Society) Contemporary Dictionary, but the published dictionary offers many enhanced functions at the user interface level. The main contribution of the study lies in the results presented to users in response to their queries: the collocations and tri-grams of the keyword searched for. The collocations are harvested from a large Turkish corpus of more than 760 million tokens, and the tri-grams were generated from Turkish Wikipedia pages. The do-it-yourself corpus (TS DIY Corpus) allows users to build their own corpora, modify or delete the uploaded texts, and run queries. Users may run queries in different modes, such as "as is", "starting/ending with" or "including"; in addition, an advanced query option allows users to run queries with part-of-speech tags and lemmas. The results are given in KWIC (keyword in context) format. Various text classification options, such as publication date, author, domain and genre, can be selected during corpus creation. As the number of available Turkish corpora is limited, TS DIY Corpus is a candidate to become a useful, well-known and widely used tool for scholars and researchers who want to use a Turkish corpus or study Turkish texts of their own.
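
The query modes described above can be illustrated with a small sketch. It is not the TS DIY Corpus code: the mode names, function names and sample sentence are assumptions chosen for this example, and tokens are obtained by simple whitespace splitting.

```python
def matches(token, query, mode="as_is"):
    """Return True if token satisfies the query under the given (hypothetical) mode."""
    if mode == "as_is":
        return token == query
    if mode == "starts_with":
        return token.startswith(query)
    if mode == "ends_with":
        return token.endswith(query)
    if mode == "including":
        return query in token
    raise ValueError(f"unknown mode: {mode}")


def kwic(tokens, query, mode="as_is", width=2):
    """Build keyword-in-context rows with `width` tokens on each side."""
    rows = []
    for i, tok in enumerate(tokens):
        if matches(tok, query, mode):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append(f"{left} [{tok}] {right}".strip())
    return rows


tokens = "evde kitap okudum sonra kitaplar aldım".split()
for row in kwic(tokens, "kitap", mode="starts_with"):
    print(row)
```

Because Turkish is agglutinative, the "starting with" mode also retrieves inflected forms such as kitaplar, which an exact "as is" match would miss.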

Sustaining a Corpus for Spoken Turkish Discourse: Accessibility and Corpus Management Issues

… : From Storyboard to …, 2010

This paper addresses the issues of the long-term availability of language resources and the financing of resource maintenance in the context of the web-based corpus management system employed in the Spoken Turkish Corpus (STC), which operates with EXMARaLDA. Section 2 overviews the capacities of the corpus management system with respect to its software infrastructure, online presentation, metadata management, and interoperability. Section 3 describes the plan foreseen in STC for sustaining the resource, and dwells on the ethical issues surrounding the conflicting demands of free resources for non-commercial research and resource maintenance.

Corpus-Based Research on Terminology of Turkish Lexicography(CBRT-TURKLEX)

Lexikos, 2018

In this paper, we introduce an ongoing lexicographic corpus project. The Center for Lexicography, abbreviated as SOZMER, was established under the aegis of Eskisehir Osmangazi University to support lexicographical projects. SOZMER decided to initiate a corpus-based Turkish lexicography project. This project will be the first stage of the endeavour aimed at preparing a specialized dictionary for Turkish lexicography. The primary aim of the project is to prepare an electronic corpus for researchers of Turkish lexicography. The secondary aim of the project is to obtain a word list of Turkish lexicographic terms. This paper presents a description of the process of data collection and the methodology employed for building a specialized corpus. The study contains an outline of the project background, needs, problems, and the phases of corpus building.

Challenges Encountered in Turkish Natural Language Processing Studies

Natural and Engineering Sciences, 2020

Natural language processing is a branch of computer science that combines artificial intelligence with linguistics. It aims to analyze language input such as writing or speech with software and convert it into information. Considering that each language has its own grammatical rules and vocabulary diversity, the complexity of the studies in this field is understandable. Turkish, for instance, is an interesting language in many ways: it has an agglutinative word structure, consonant/vowel harmony, a large number of productive derivational morphemes (yielding a practically infinite vocabulary), derivation and syntactic relations, a complex emphasis on vocabulary, and phonological rules. This study discusses the features of Turkish that are interesting from the perspective of natural language processing. In addition, summary information is given about the natural language processing techniques, systems and various resources developed for Turkish.
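
Vowel harmony, one of the features listed above, can be illustrated with the Turkish plural suffix, which surfaces as -lar after a back vowel (a, ı, o, u) and as -ler after a front vowel (e, i, ö, ü). The sketch below is a simplified illustration: the function names are hypothetical, input is assumed to be lowercase, and loanwords that violate harmony (for example saat, plural saatler) are not handled.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")


def last_vowel(word):
    """Return the last vowel of a lowercase word, or None if it has no vowel."""
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    return None


def pluralize(word):
    """Attach -lar or -ler according to the backness of the last vowel."""
    vowel = last_vowel(word)
    if vowel is None:
        raise ValueError(f"no vowel found in {word!r}")
    return word + ("lar" if vowel in BACK_VOWELS else "ler")


for noun in ["kitap", "ev", "göz", "okul", "kız"]:
    print(noun, "->", pluralize(noun))
```

A full morphological analyzer has to chain many such harmony-sensitive suffixes, which is part of what makes Turkish challenging for NLP tools.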

Corpus-Based Research on Terminology of Turkish Lexicography

In this paper, we introduce an ongoing lexicographic corpus project. The Center for Lexicography, abbreviated as SÖZMER, was established under the aegis of Eskisehir Osmangazi University to support lexicographical projects. SÖZMER decided to initiate a corpus-based Turkish lexicography project. This project will be the first stage of the endeavour aimed at preparing a specialized dictionary for Turkish lexicography. The primary aim of the project is to prepare an electronic corpus for researchers of Turkish lexicography. The secondary aim of the project is to obtain a word list of Turkish lexicographic terms. This paper presents a description of the process of data collection and the methodology employed for building a specialized corpus. The study contains an outline of the project background, needs, problems, and the phases of corpus building.

Problems of Creation of the All-Turkic National Corpus

The paper presents the results of research on the theoretical and practical issues of creating a national corpus of the Turkic world. It consists of four sections and a conclusion. The first section is devoted to a theoretical analysis of the problem; the second gives a brief history of the ideas behind creating the International machine fund of Turkic languages in the former USSR; the third considers how the Turkic countries have carried out concerted reforms towards creating an all-Turkic terminological fund; and the fourth analyzes the significance of future corpus linguistics projects of the Turkic peoples.

Patterns and frequency: Evidence from the Turkish National Corpus (TNC)

Introduction. The count of the frequencies of lexical items is a traditional undertaking. Previously, the primary motivation in these studies was practical rather than theoretical, in the sense that quantitative information was expected to provide a better description of individual items as well as of their combinations. Recently, however, research on frequencies has concluded that statistical regularities and distributional aspects of lexical structures have theoretical significance, bringing new insights into the role of lexis in grammar and in patterning. Advances in corpus software development and corpus analytic tools have provided additional empirical evidence for a renewed understanding of lexical structures. Concordance data have identified various patterns in ordinary language use, alongside formulaic expressions and various other forms of fixed expressions. In the general framework of the British linguistic tradition (Stubbs, 1993, 2013), work on corpus data has argued that lexical structures encode properties that cannot be captured within the confines of an individual word or lexeme. Sinclair (1998) thus proposes the term lexical item to account for recurrent and regular patterns that extend beyond the size of a single item. The present study presents data on frequent and recurrent patterns extracted from the Turkish National Corpus (TNC) (http://www.tnc.org.tr). The patterns reviewed here cover (i) sequences of lexical items (i.e., multiword units), (ii) regular frequent patterns formed by inflectional categories (i.e., multimorpheme units), and (iii) patterns found among adjacent lexical items (i.e., interlexical units). In sum, the data here represent an initial typology of such structures and their observed frequencies. This paper is organized as follows. The first part reviews the basics of fixed expressions and the corpus-based analysis of frequent patterns.
In the second part of the paper, the data on patterns of lexemes and morphemes are given with their distributional frequencies. The frequencies of these recurrent patterns are indicative for a proper understanding of the Turkish lexicon in general. (to cite: Aksan, Mustafa & Yeşim Aksan (2018). Patterns and frequency: Evidence from the Turkish National Corpus (TNC). In M. A. Akıncı & K. Yağmur (Eds.), The Rouen Meeting: Studies on Turkic structures and language contacts (Turcologica; Vol. 114). Wiesbaden: Harrassowitz Verlag, 107-118.)
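
As an illustration of how recurrent multiword sequences can be counted, the sketch below tallies word n-grams across sentences and keeps those above a frequency threshold. It is not the procedure used in the study: the function names, the threshold and the three toy sentences are assumptions, and tokens are obtained by whitespace splitting.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(sentences, n=2, min_count=2):
    """Count n-grams within sentence boundaries and keep the recurrent ones."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return [(gram, c) for gram, c in counts.most_common() if c >= min_count]


sentences = ["ne var ne yok", "ne var ki", "bir şey yok"]
for gram, count in frequent_ngrams(sentences, n=2):
    print(" ".join(gram), count)
```

Counting within sentence boundaries avoids spurious sequences that straddle two sentences; multimorpheme patterns would additionally require morphologically segmented input, which this sketch does not attempt.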

A corpus-based analysis of Fakat, Yoksa, Ayrıca

Constraints in Discourse III , 2012

Zeyrek, D., Demirşahin, I., Turan, Ü. D., Çakıcı, R. (2012). A corpus-based analysis of Fakat, Yoksa, Ayrıca. In Anton Benz, Peter Kühnlein, Manfred Stede (eds.), Constraints in Discourse III. Amsterdam, The Netherlands: John Benjamins. This paper presents a corpus-based quantitative investigation of three connectives in Turkish. As in D-LTAG, we take discourse connectives as lexical anchors which select their two arguments in discourse. Connectives can be grouped into two classes: structural connectives and discourse adverbials. While structural connectives select both of their arguments structurally and through adjacency, adverbials retrieve their first argument in the preceding discourse like anaphoric expressions. The three different connectives selected in the study appeared to have structural, anaphoric, and both structural and anaphoric properties. We delineate how these connectives behave differently in terms of the adjacency of their arguments, the span of their first arguments and their position within the clause. Our corpus-based quantitative analysis shows that there is a statistically significant difference between the structural connectives and the discourse adverbial in our data in terms of the variables selected.

Turkish Natural Language Processing Initiative: An Overview

1994

This paper presents an overview of a research project aimed at establishing a computational infrastructure for building NLP applications for Turkish. The project specification includes design and implementation of re-usable software tools and advanced NLP applications.

Building a Wordnet for Turkish

This paper summarizes the development process of a wordnet for Turkish as part of the Balkanet project. After discussing the basic methodological issues that had to be resolved during the course of the project, the paper presents the basic steps of the construction process in chronological order. Two applications using Turkish wordnet are summarized and links to resources for wordnet builders are provided at the end of the paper.

Features for an internet accessible corpus of spoken Turkish discourse

2009

Abstract: In this paper we survey features of spoken corpora for a number of languages and focus on the design criteria, technological requirements and priorities in annotation for a spoken corpus for Turkish that aims to reflect its discursive and pragmatic features. The paper continues with a description of possible coding schemas for the annotation of the discursive and pragmatic features of Turkish.

A Corpus Based NooJ Module for Turkish

Proceedings of the NooJ 2010 International Conference and Workshop, 2010

This paper presents the design, implementation and testing processes of a corpus-driven NooJ module for morphological tagging of Turkish. It also underlines the morphological challenges specific to Turkish. The modeling and tagging processes involve both inflectional and derivational paradigms of present-day Turkish. Inflection of multi-word units and syntactic disambiguation are beyond the scope of the study.

The Status of English in Turkey

2009

INTERNATIONAL CORPUS OF LEARNER ENGLISH
3.2.4. Location of the data
4. Selecting and querying the corpus
4.1. The REQUEST window
4.1.1. Navigating the request window
4.1.2. Selecting the corpus
4.1.3. Keying in a linguistic query
  a. Single words
  b. Several consecutive words
  c. Series of words
  d. Multiword units (MWUs) or compound lexical entries
  e. Abbreviations
  f. Part-of-speech tags and combinations of lexical units and POS-tags
  g. Morphological filters
4.1.4. Note on the composition of the request screens
4.1.4.1. Alphanumerical fields
4.1.4.2. Numerical fields
4.1.4.3. Alphabetical fields
4.1.4.4. Variables with multiple options
4.1.5. Resetting the selection
4.1.6. Functions available on the main command bar
4.2. The ZOOM-LIST window
4.3. Submitting the query
5. The RESULT windows
5.1. Two types of result windows
5.2. Deselecting profiles
5.3. Key terms: Sub-Corpus, Selected Corpus and Result Selected Corpus
5.4. Functions available on the result window
5.4.1. 'Grid view' and 'form view' of the selected profiles
5.4.2. Sorting the data in the Result window (grid view only)
5.4.3. Viewing a text and merging texts into a corpus
5.