ATILF's computerized linguistic resources involved in cooperative projects

Computerized linguistic resources of the research laboratory ATILF for lexical and textual analysis: Frantext, TLFi, and the software Stella

Language Resources and Evaluation, 2002

This paper presents some of the computerized linguistic resources of the research laboratory ATILF (Analyse et Traitement Informatique de la Langue Française) available via the Web, and serves as a supporting document for the demonstrations planned within the framework of LREC 2002. ATILF is a new UMR (Unité Mixte de Recherche) created jointly by the CNRS and the University of Nancy 2 on 2 January 2001, succeeding the local Nancy component of the INaLF. This considerable body of French-language resources consists of more than 3,400 literary works grouped together in Frantext, plus a number of dictionaries, lexicons, and other databases. These Web-accessible resources are operated through the capabilities of Stella, a search engine specially dedicated to textual databases and relying on a new theory of textual objects.
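The abstract does not detail Stella's query model, but a keyword-in-context (KWIC) concordance is the kind of basic lookup a textual-database engine of this sort supports. The following Python sketch is purely illustrative; the function name and the miniature corpus are invented for the example and do not reflect Stella's actual interface:

```python
import re

def kwic(text, query, width=30):
    """Return keyword-in-context lines for every match of `query` in `text`."""
    hits = []
    for m in re.finditer(r"\b%s\b" % re.escape(query), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return hits

# A two-sentence stand-in for a Frantext-style corpus.
corpus = ("La langue est un système de signes. "
          "Chaque langue évolue avec ses locuteurs.")
for line in kwic(corpus, "langue"):
    print(line)
```

A real engine over 3,400 works would of course rely on a prebuilt index rather than a linear scan, but the input/output shape of the query is the same.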

Presented at LREC 2002, Las Palmas, Spain (27 May – 2 June 2002).

Semantic Annotation in the Project "Open Access Database 'Adjective-Adverb Interfaces in Romance'"

This paper describes the creation, the annotation process, and the model of the Open Access Database 'Adjective-Adverb Interfaces in Romance' (AAIF) project, with its approach to the creation of a domain-specific ontology. In order to make the research data accessible, interoperable, extensible, and transferable, the data are annotated in TEI/XML, formalized and enriched with RDF, and their conceptual data model is stored in and published via the GAMS digital repository. This produces semantically enriched, annotated multilingual research data that allow retrieval across heterogeneous corpora. The annotation model expressed in the ontology is offered for further reuse.
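As a rough illustration of the two-layer pipeline the abstract describes (TEI/XML annotation plus RDF enrichment), the Python sketch below builds a minimal TEI-style word element and prints hand-rolled N-Triples for the same token. The attribute values, the `hasLemma`/`hasFunction` predicates, and the example.org namespace are assumptions made for the example, not the AAIF project's actual schema or ontology:

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

# Minimal TEI-style <w> (word) element for one annotated token.
w = ET.Element(f"{{{TEI_NS}}}w", attrib={
    "lemma": "rapido",
    "pos": "ADJ",  # hypothetical tag set
    "{http://www.w3.org/XML/1998/namespace}id": "w1",
})
w.text = "rapido"

xml_fragment = ET.tostring(w, encoding="unicode")
print(xml_fragment)

# RDF-style enrichment of the same token, printed as N-Triples.
BASE = "https://example.org/aaif/"  # hypothetical namespace
triples = [
    (BASE + "w1", BASE + "hasLemma", '"rapido"'),
    (BASE + "w1", BASE + "hasFunction", '"adverbial"'),
]
for s, p, o in triples:
    print(f"<{s}> <{p}> {o} .")
```

The point of the split mirrors the abstract: the TEI layer keeps the annotation close to the text, while the RDF layer makes the same facts queryable across heterogeneous corpora.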

Review of Understanding Corpus Linguistics (ZfS)

Zeitschrift für Sprachwissenschaft (ZfS), 2023

Corpus linguistics is a rapidly growing discipline within linguistics that broadens our understanding of human language by focusing on actual language use in different contexts. Understanding Corpus Linguistics is an introduction to the goals, methods, and achievements of corpus linguistics, written mainly as a textbook for undergraduate and graduate students, although advanced scholars can also benefit from reading it. The authors, for whom corpus linguistics is the main research interest, set out to show what corpus linguistics is and how it helps us to understand language better.

After introducing corpus linguistics in the first chapter, the remaining ten chapters deal with different issues in this approach. In the first chapter, the authors discuss the basic idea of corpus linguistics, its relation to other disciplines, and its importance for usage-oriented linguistics. The second chapter is devoted to defining the basic concepts of corpus linguistics. A corpus is defined as a collection of texts that must be machine-readable, i.e., collated and analyzed by computers; in order to be analyzable, non-written texts are transcribed into written form. Corpus linguists try to find patterns of variation in language use and their relation to contextual factors. The authors describe the distinction between word forms, which are directly observable items, and lexemes, the abstractions underlying inflectionally related groups of word forms that share a lexical meaning. Tokens are the individual occurrences of word forms in a corpus, while all occurrences of the same word form count together as one type. Accordingly, tokenization, i.e., identifying and marking token boundaries, is an essential part of any corpus compilation. Finally, textual and contextual properties of texts are defined, and the differences between the linguistic or text-internal context, the language-external context, and the situational context are clarified. Collocation, colligation, collostruction, metadata, etc. are among the context-related concepts discussed in this chapter.

The third chapter is about corpus composition and corpus types. Among the important characteristics of a corpus, the authors first discuss corpus size, emphasizing that it depends directly on resources and practical considerations, as with the corpora compiled during the documentation of lesser-known languages.
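The token/type distinction and the tokenization step described in chapter two can be made concrete with a small Python sketch; the regex-based tokenizer and the sample sentence are simplifications invented for this example:

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into word-form tokens (lowercased; punctuation discarded)."""
    return re.findall(r"[a-z'-]+", text.lower())

text = "The corpus contains texts; the texts are machine-readable."
tokens = tokenize(text)
types_ = Counter(tokens)  # each distinct word form is one type

print("tokens:", len(tokens))  # 8 running word forms
print("types:", len(types_))   # 6 distinct word forms ("the" and "texts" repeat)
print("type/token ratio:", round(len(types_) / len(tokens), 2))  # 0.75
```

Even this toy tokenizer shows why the book treats tokenization as a design decision rather than a triviality: whether "machine-readable" counts as one token or two depends entirely on the boundary rules chosen.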

Defining formats and corpus-based examples in the General Ndebele Dictionary, Isichazamazwi SesiNdebele: lexiconotes

Lexikos, 2002

In this article the writer evaluates the defining formats that were used in defining headwords in the first monolingual General Ndebele Dictionary, Isichazamazwi SesiNdebele (ISN). The emphasis in the ISN was on the concept of user-friendliness. The article establishes that the defining formats in the ISN are a judicious mixture mainly of the defining formats of the Collins Birmingham University International Language Database (COBUILD) and of what have been referred to as traditional formats. The first part of this article is an analysis of the decisions taken by the ISN editors in formulating their defining formats. It assesses the COBUILD defining principle vis-à-vis its application in defining headwords in the ISN and the impact of this principle on the user-friendliness of the dictionary. It further discusses other formats, including the decision to retain traditional defining formats for defining headwords. One of the traditional defining styles agreed upon was that the editors were to give the hypernym in the case of semantic sets, and then to identify the concept being defined by specifying aspects that distinguish it from others of its type. The second part of the article evaluates the importance and use of the corpus in providing both definitions and examples for the ISN. However, it is further argued that since a corpus has to be "representative" in terms of size in order to be appropriately used as a basis for such corpus-based dictionaries, the ISN editors, whose corpus was relatively small, could not avoid relying on intuitive knowledge in constructing some examples.

Creating Lexical Resources in TEI P5

Journal of the Text Encoding Initiative, 2012

The author is full professor of terminology studies and translation technologies at the Centre for Translation Studies of the University of Vienna, director of the Institute for Corpus Linguistics and Text Technology of the Austrian Academy of Sciences, corresponding member (kM) of the Austrian Academy of Sciences, and holder of the UNESCO Chair for Multilingual, Transcultural Communication in the Digital Age. He also serves as vice-president of the International Institute for Terminology Research and as chair of a technical subcommittee of the International Organization for Standardization (ISO) focusing on terminology and language resources (ISO/TC 37/SC 2, 2001-2009; SC 1, 2009-present). His main research interests are language technologies, corpus linguistics and knowledge engineering, e-learning technologies, and collaborative work systems.