Construction of the Turkish National Corpus (TNC)

The Turkish National Corpus (TNC): Comparing the Architectures of v1 and v2

The Turkish National Corpus (TNC), whose first version was released in 2012, is the first large-scale (50 million words), web-based, publicly and freely available resource of contemporary Turkish. It is designed to be a well-balanced and representative reference corpus for Turkish. With 48 million words coming from its written part, the untagged TNC v1 represents 4438 different data sources over 9 domains and 34 different genres. The morphologically annotated, 50-million-word TNC v2, with 5412 different documents compiled from written and spoken Turkish, is planned for release in 2016 and offers new query options for linguistic analyses. This paper compares the architectures of TNC v1 and v2 on the basis of a set of queries run on both versions. Standard, restricted and wildcard lexical searches are performed. Then, the speed of the two versions in retrieving query results as concordance lines is compared. Finally, it is argued that TNC v2 performs better and faster than TNC v1 due to its in-memory inverted index structure. Since building language corpora is a very recent undertaking for Turkish, the architecture of TNC v2 could serve as a model for similar corpus construction projects. (to cite :
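To make the terms in this abstract concrete, the following is a minimal sketch of an in-memory inverted index that supports exact and wildcard lexical searches and returns concordance lines. It is not the TNC v2 implementation, which the abstract does not describe in detail; the class name `InvertedIndex`, its methods, and the toy documents are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase


class InvertedIndex:
    """Toy in-memory inverted index (illustrative, not the TNC design)."""

    def __init__(self):
        self.docs = []                     # doc_id -> list of tokens
        self.postings = defaultdict(list)  # token -> [(doc_id, position), ...]

    def add(self, text):
        """Tokenize a document and record every token position."""
        tokens = re.findall(r"\w+", text)
        doc_id = len(self.docs)
        self.docs.append(tokens)
        for pos, tok in enumerate(tokens):
            self.postings[tok].append((doc_id, pos))
        return doc_id

    def lookup(self, pattern):
        """Exact lookup, or wildcard lookup if the pattern has * or ?."""
        if any(c in pattern for c in "*?"):
            hits = []
            # Wildcards are resolved against the vocabulary, not the raw text.
            for tok in self.postings:
                if fnmatchcase(tok, pattern):
                    hits.extend(self.postings[tok])
            return sorted(hits)
        return list(self.postings.get(pattern, []))

    def concordance(self, pattern, width=2):
        """Return (left context, keyword, right context) for each hit."""
        lines = []
        for doc_id, pos in self.lookup(pattern):
            toks = self.docs[doc_id]
            left = " ".join(toks[max(0, pos - width):pos])
            right = " ".join(toks[pos + 1:pos + 1 + width])
            lines.append((left, toks[pos], right))
        return lines


idx = InvertedIndex()
idx.add("kitap okumak güzel")
idx.add("kitaplar masada duruyor")

for line in idx.concordance("kitap*", width=2):
    print(line)
```

Because the postings store token positions, a query only touches the documents that contain a match, which is the usual reason an inverted index answers lexical queries faster than a scan over the full text.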

Corpus linguistics theory and design and application of a Turkish corpus

2011

Abstract: Corpus linguistics is a branch of linguistics that has come to prominence in recent years. This study briefly touches on the theory of corpus linguistics and its basic concepts. The main aim of the study is to build a corpus that is accessible over the internet, lets authorized users add texts to and remove texts from its database, displays keywords in context, is platform independent, and displays Turkish characters without problems. The code of the sample corpus that was developed ...

Linguistic Corpora: A View from Turkish

Usage-based linguistic studies have gained new insights as corpus-based and corpus-driven analyses have advanced in recent years. Linguists working in different domains have turned to corpora as a major source in their study of language at all levels of representation. Currently, corpus linguistics is evolving into a sophisticated methodology for extracting and analyzing data. Building and using corpora in Turkish linguistics is a recent undertaking, initially motivated by research on natural language processing (NLP). The number of available corpora is increasing, and linguistic research has come to test hypotheses on attested data and to uncover lexical and grammatical patterns of use that had gone unnoticed in the absence of corpus data. Advances in NLP research and the tools provided for corpus building and annotation further contribute to corpus studies in Turkish linguistics. (to cite: Aksan, Mustafa & Yeşim Aksan (2018). Linguistic corpora: A view from Turkish (pp. 301-327). Kemal Oflazer & Murat Saraçlar (Eds.) Studies in Turkish Language Processing. Springer Verlag.)

The Turkish Lexicography Corpus (TLC): An Overview

The 12th International Conference of The Asian Association for Lexicography “Lexicography in the Digital World”, 2018

Although lexicography studies are increasing in Turkey, no lexicography terminology has been created and made available to researchers, which leaves a gap in terminology use in the field. To meet this need and create a Turkish lexicography terminology, the Eskisehir Osmangazi University Center for Lexicography decided to start a project. This study aims to create a specialised corpus of Turkish lexicography studies and to set forth Turkish lexicography terminology by using this corpus. A specialised corpus will help researchers decide which terminology to use in their studies, and it can also help to standardize terminology use. There is currently no platform on which researchers can discuss and review the terminology used in previous studies. Terminology choice is mainly made intuitively and usually depends on discussions within small academic groups. A corpus can ease terminology use for researchers in the field. Based on these needs, a corpus for Turkish lexicography was created and is accessible to researchers on a website, www.tsd.ogu.edu.tr. The stages of the study were as follows: synchronization and desynchronization of the corpus, determining the corpus content, digitizing the sources, external tagging of the texts, text type tagging, and lemmatizing. The website also provides a platform through which researchers can submit new terms they have encountered in related studies. Keywords: Lexicography, terminology, term, corpus, TLC

TS Corpus Project: An online Turkish Dictionary and TS DIY Corpus

TS Corpus is a free and independent project that aims to build Turkish corpora, NLP tools and linguistic datasets. Since 2011, 10 corpora, various NLP tools, a large dataset and an online dictionary have been released. This paper focuses on the "online dictionary" and the "TS do-it-yourself corpus" released by the project. The dictionary data is based on the TDK (Turkish Language Society) Contemporary Dictionary, but the published dictionary offers many enhanced functions at the user interface level. The main contribution of the study lies in the results presented to users in response to their queries: the collocations and tri-grams of the keyword searched for. The collocations are harvested from a large Turkish corpus of more than 760 million tokens, and the tri-grams were generated from Turkish Wikipedia pages. The do-it-yourself corpus (TS DIY Corpus) allows users to build their own corpora, modify or delete the uploaded texts, and run queries. Users may run queries in different modes, such as "as is", "starting/ending with" or "including"; in addition, the advanced query option allows users to run queries with part-of-speech tags and lemmas. The results are given in KWIC (keyword in context) format. Various text classification options, such as publication date, author, domain and genre, can be selected during corpus creation. As the number of available Turkish corpora is limited, TS DIY Corpus is a candidate to become a useful, well-known and widely used tool for scholars and researchers who want to use a Turkish corpus or to study Turkish texts of their own.
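The query modes and tri-gram generation mentioned in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the TS Corpus code: the function names `matches`, `filter_tokens` and `trigram_counts`, the mode labels, and the sample sentence are hypothetical.

```python
from collections import Counter


def matches(token, query, mode="as_is"):
    """Check one token against a query in one of four hypothetical modes."""
    if mode == "as_is":
        return token == query
    if mode == "starting_with":
        return token.startswith(query)
    if mode == "ending_with":
        return token.endswith(query)
    if mode == "including":
        return query in token
    raise ValueError(f"unknown query mode: {mode}")


def filter_tokens(tokens, query, mode="as_is"):
    """Return every token that satisfies the query in the given mode."""
    return [tok for tok in tokens if matches(tok, query, mode)]


def trigram_counts(tokens):
    """Count each run of three consecutive tokens."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


tokens = "bu kitap çok güzel bu kitap çok eski".split()

print(filter_tokens(tokens, "kit", "starting_with"))
print(trigram_counts(tokens).most_common(1))
```

A production system would run these checks against an index rather than a token list, and would add the part-of-speech and lemma filters the abstract describes; both are left out here to keep the sketch short.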

Sustaining a Corpus for Spoken Turkish Discourse: Accessibility and Corpus Management Issues

… : From Storyboard to …, 2010

This paper addresses the issues of the long-term availability of language resources and the financing of resource maintenance in the context of the web-based corpus management system employed in the Spoken Turkish Corpus (STC), which operates with EXMARaLDA. Section 2 overviews the capacities of the corpus management system with respect to its software infrastructure, online presentation, metadata management, and interoperability. Section 3 describes the plan foreseen in STC for sustaining the resource, and dwells on the ethical issues surrounding the conflicting demands of free resources for non-commercial research and resource maintenance.

Corpus-Based Research on Terminology of Turkish Lexicography (CBRT-TURKLEX)

Lexikos, 2018

In this paper, we introduce an ongoing lexicographic corpus project. The Center for Lexicography, abbreviated as SÖZMER, was established under the aegis of Eskisehir Osmangazi University to support lexicographical projects. SÖZMER decided to initiate a corpus-based Turkish lexicography project. This project will be the first stage of the endeavour aimed at preparing a specialized dictionary for Turkish lexicography. The primary aim of the project is to prepare an electronic corpus for researchers of Turkish lexicography. The secondary aim of the project is to obtain a word list of Turkish lexicographic terms. This paper presents a description of the process of data collection and the methodology employed for building a specialized corpus. The study contains an outline of the project background, needs, problems, and the phases of corpus building.

Challenges Encountered in Turkish Natural Language Processing Studies

Natural and Engineering Sciences, 2020

Natural language processing is a branch of computer science that combines artificial intelligence with linguistics. It aims to analyze language material, written or spoken, with software and convert it into information. Because each language has its own grammatical rules and its own vocabulary, studies in this field are inherently complex. Turkish is an interesting case in many ways: it has an agglutinative word structure, consonant and vowel harmony, a large number of productive derivational morphemes (and thus a practically infinite vocabulary), derivation and syntactic relations, and a complex emphasis on vocabulary and phonological rules. This study describes the features of Turkish that are of interest for natural language processing. It also gives a summary of the natural language processing techniques, systems and resources developed for Turkish.
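As a small illustration of vowel harmony and agglutination, the sketch below attaches the Turkish plural suffix, which surfaces as -lar after a back vowel and -ler after a front vowel. It is a simplified example, not a morphological analyzer: the function name `plural` is hypothetical, the input is assumed to be lowercase, and exceptions (for example some loanwords) are not handled.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")


def plural(noun):
    """Attach -lar or -ler according to the last vowel of the noun.

    Simplified two-way vowel harmony; irregular words are not covered.
    """
    for ch in reversed(noun):
        if ch in BACK_VOWELS:
            return noun + "lar"
        if ch in FRONT_VOWELS:
            return noun + "ler"
    raise ValueError(f"no vowel found in {noun!r}")


for word in ["kitap", "ev", "göz", "okul"]:
    print(word, "->", plural(word))
```

Even this single suffix shows why surface forms multiply quickly in an agglutinative language: each further suffix has its own harmony-dependent variants, so tools for Turkish usually work with morphological analysis rather than a fixed word list.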

Corpus-Based Research on Terminology of Turkish Lexicography

In this paper, we introduce an ongoing lexicographic corpus project. The Center for Lexicography, abbreviated as SÖZMER, was established under the aegis of Eskisehir Osmangazi University to support lexicographical projects. SÖZMER decided to initiate a corpus-based Turkish lexicography project. This project will be the first stage of the endeavour aimed at preparing a specialized dictionary for Turkish lexicography. The primary aim of the project is to prepare an electronic corpus for researchers of Turkish lexicography. The secondary aim of the project is to obtain a word list of Turkish lexicographic terms. This paper presents a description of the process of data collection and the methodology employed for building a specialized corpus. The study contains an outline of the project background, needs, problems, and the phases of corpus building. Summary (translated from the Afrikaans "Opsomming"): Corpus-based research on terminology of Turkish lexicography (CBRT-TURKLEX). In this article, an ongoing lexicographic project is introduced. The Center for Lexicography, abbreviated as SÖZMER, was established under the banner of Eskisehir Osmangazi University to support lexicographic projects. SÖZMER decided to initiate a corpus-based Turkish lexicography project. This project ...