A TEI Schema for the Representation of Computer-mediated Communication

DeRiK: A German reference corpus of computer-mediated communication

Literary and Linguistic Computing, 2013

The paper describes an ongoing project that aims at building a reference corpus of German computer-mediated communication (CMC) as a new component of an already existing reference corpus of written contemporary German. The 'Deutsches Referenzkorpus zur internetbasierten Kommunikation' (DeRiK) shall include data from the most prominent CMC genres amongst German Internet users and, thus, close a gap in the coverage of the corpus resources in the project "Digitales Wörterbuch der deutschen Sprache" (DWDS) which are maintained and provided by the Berlin-Brandenburg Academy of Sciences and the Humanities (BBAW). The focus of the paper is on the role of the DeRiK component within the DWDS framework, on sampling issues, and on CMC-specific issues of corpus annotation.

Paper 2: Expanding the TEI encoding framework to genres of computer-mediated communication: considerations and suggestions

2013

Computer-mediated communication in TEI: What lies ahead

2013

The social web has brought forth various genres of interpersonal communication (computer-mediated communication, henceforth CMC) such as chats, discussion forums, wiki talk pages, Twitter, and comment and discussion threads on weblogs and social network sites. These genres display linguistic and structural peculiarities that differ from both speech and written text. Projects that want to build and exchange CMC corpora would greatly benefit from a standard that allows the user to annotate these peculiarities in TEI. From the perspective of several corpus projects that aim at building and annotating CMC corpora for several European languages, this panel will discuss how the models provided by the TEI encoding framework may be adapted to the special requirements of CMC genres. The basis of the discussion is a customized TEI schema presented at the TEI conference held in Würzburg in 2011 (Beißwenger et al. 2012). The panel papers will elaborate on basic features that a TEI standard fo...

Building and analysing corpora of computer-mediated communication

This article addresses problems encountered during the construction and analysis of a synchronic corpus of computer-mediated discourse. The purpose of creating this corpus was to conduct a study of conversational interaction, particularly in the synchronous online chat medium (i.e. real-time typed conversation). This corpus is not primarily intended for examining the linguistic idiosyncrasies of the online chatting medium; rather, it is to be used for corpus-based sociolinguistic inquiry into gendered and sexualised discourses. The corpus data therefore needed considerable adaptation during compilation and analysis to prevent those idiosyncrasies from acting as noise in the data. Adaptations include responses to spam (in the form of ‘adbots’), cyber-orthography, the ubiquity of names, overlapping conversations, and challenges of annotation. Difficulties with gaining participant permissions and demographic information also required significant attention. Attempted solutions to these corpus construction and analysis challenges, which are closely bound to the fields of both cyber-research and corpus linguistics, are outlined.

Investigating Computer-Mediated Communication: Corpus-based Approaches to Language in the Digital World

2018

In this paper, we investigate the spelling conventions on the Twitter microblogging platform. In order to gain insight into the universalities and specificities of communication on social media, we perform a comparative analysis of three closely related languages: Slovene, Croatian and Serbian. The data collection and annotation protocols were developed jointly for all three languages, allowing for maximum interoperability and comparability of results. The analysis reveals differences in the amount of deviation from the norm in the three languages, with Slovene twitterese being the most inclined to use non-standard spelling, and Serbian the least. Overall, closed word classes, especially interjections and abbreviations, are found to be more non-standard than the open classes. In terms of types of standard > non-standard transformations, character deletions are more frequent than insertions or replacements, and transformations mostly occur in word-final positions. The discrepancies between languages are largely due to the pronounced tendency of Slovene and Croatian to use spoken-like, regional and dialectal forms characterised by vowel omissions, especially at the end of words. This analysis and the resulting datasets can be used to further study the properties of non-standard Slovene, Croatian and Serbian, as well as to develop language technologies for non-standard data in these languages.
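
The transformation types named in this abstract (character deletions, insertions and replacements, and their position in the word) can be illustrated with a minimal Python sketch. It is not the authors' annotation procedure: the function name `classify_transformations`, the three position labels, and the use of the standard library's `difflib.SequenceMatcher` for alignment are all assumptions made for illustration, and the word pairs in the usage note are illustrative rather than taken from the datasets.

```python
from difflib import SequenceMatcher

# Map difflib's opcode tags onto the terminology used in the abstract.
_OPERATION_NAMES = {
    "delete": "deletion",
    "insert": "insertion",
    "replace": "replacement",
}


def classify_transformations(standard: str, nonstandard: str) -> list[tuple[str, str]]:
    """Return (operation, position) pairs describing how `nonstandard`
    deviates from `standard` at the character level.

    Position is a hypothetical three-way label relative to the standard
    form: "initial", "medial" or "final".
    """
    matcher = SequenceMatcher(None, standard, nonstandard, autojunk=False)
    results = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged stretch, nothing to report
        if i2 == len(standard):
            position = "final"    # edit touches the end of the standard form
        elif i1 == 0:
            position = "initial"  # edit touches the start of the standard form
        else:
            position = "medial"
        results.append((_OPERATION_NAMES[tag], position))
    return results


if __name__ == "__main__":
    # Illustrative pairs only (standard form, non-standard form).
    for std, nonstd in [("tako", "tak"), ("abc", "abxc"), ("cat", "kat")]:
        print(std, "->", nonstd, classify_transformations(std, nonstd))
```

Counting the returned pairs over a list of aligned tokens would give frequency tables of the kind the abstract summarises, although a real study would need a linguistically informed alignment rather than a generic string matcher.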

COMPUTER-MEDIATED DISCOURSE: Issues and Challenges at the Interface of Corpus

2015

The CoMeRe project aims to build a kernel corpus of different Computer-Mediated Communication (CMC) genres with interactions in French as the main language, by assembling interactions stemming from networks such as the Internet or telecommunication, as well as mono- and multimodal, synchronous and asynchronous communications. Corpora are assembled in a standard way, using the TEI (Text Encoding Initiative) format. This implies extending, through a European endeavor, the TEI model of text, in order to encompass the richest and most complex CMC genres. This paper presents the Interaction Space model. We explain how this model has been encoded within the TEI corpus header and body, by means of a new post element applied to textual messages and turns. The model is then instantiated through four corpora we have processed: three corpora where interactions occurred in single-modality environments (text chat, or SMS systems) and a fourth corpus where text chat, email and forum modalities were used simultaneously. The CoMeRe project has two main research perspectives: Discourse Analysis, only alluded to in this paper, and the linguistic study of idiolects occurring in different CMC genres. As NLP algorithms are an indispensable prerequisite for such research, we present our motivations for applying an automatic annotation process to the CoMeRe corpora. The wish to guarantee generic annotations led us not to consider any processing beyond morphosyntactic labelling, while prioritizing the automatic annotation of any degraded elements within the corpora. We then turn to decisions made concerning which annotations to make for which units and describe the processing pipeline for adding these. All CoMeRe corpora, as well as annotations, are verified through a staged quality control process, designed to allow corpora to move from one project phase to the next.
Public release of the CoMeRe corpora is a short-term goal: corpora will be integrated into the forthcoming French National Reference Corpus, and disseminated through the national linguistic infrastructure ORTOLANG. We therefore highlight issues and decisions made concerning the acknowledgement of individual researchers' work in both the metadata and corpus reference, as well as appropriate licenses compliant with the OpenData perspective. The conclusion refers to short-term challenges with respect to NLP annotations and new collections of Wikipedia controversial talk pages and Tweets that will be added to the CoMeRe databank.
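
To give a rough idea of what a post element applied to a textual message might look like, the following Python sketch builds one with the standard library's `xml.etree.ElementTree`. Only the element name `post` comes from the abstract. The attributes `who` and `when`, the nested `p` element, the helper names `make_post` and `serialize`, and the sample values are assumptions for illustration and should not be read as the CoMeRe schema.

```python
import xml.etree.ElementTree as ET


def make_post(author_ref: str, timestamp: str, text: str) -> ET.Element:
    """Build a hypothetical <post> element for one textual message.

    `who` and `when` are assumed attribute names, chosen by analogy with
    attributes TEI uses elsewhere; the actual schema may differ.
    """
    post = ET.Element("post", {"who": author_ref, "when": timestamp})
    paragraph = ET.SubElement(post, "p")  # message body as a single paragraph
    paragraph.text = text
    return post


def serialize(post: ET.Element) -> str:
    """Serialize the element to a Unicode XML string."""
    return ET.tostring(post, encoding="unicode")


if __name__ == "__main__":
    # Sample values are invented for the example.
    example = make_post("#user1", "2013-01-01T12:00:00", "Bonjour à tous")
    print(serialize(example))
```

Keeping author and time as attributes on the message container, rather than inside the text, is one common way to keep interaction metadata queryable separately from the message content.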

Corpus Linguistics for Online Communication: A Guide for Research

2019

Corpus Linguistics for Online Communication provides an instructive and practical guide to conducting research using methods in corpus linguistics in studies of various forms of online communication. Offering practical exercises and drawing on original data taken from online interactions, this book:
• introduces the basics of corpus linguistics, including what is involved in designing and building a corpus;
• reviews cutting-edge studies of online communication using corpus linguistics, foregrounding different analytical components to facilitate studies in professional discourse, online learning, public understanding of health issues and dating apps;
• showcases both freely-available corpora and the innovative tools that students and researchers can access to carry out their own research.
Corpus Linguistics for Online Communication supports researchers and students in generating high quality, applied research and is essential reading for those studying and researching in this area.

Building a corpus of Italian Web forums: standard encoding issues and linguistic features

This paper describes the creation of a reference corpus of nearly 1200 Web forum posts in Italian. The corpus was created by evaluating and customizing a previous proposal for standard XML encoding; a revised version of the relevant DTD is now proposed as a reference for the structural features of Web forum posts, and a set of correspondences, with little loss of information, is given for the TEI P5 encoding system. Preliminary results on syntactic features of the language of the posts are also included to sample the linguistic variability of this textual genre.

1 Overview

Web forums are, arguably, the most popular interactive textual genre on the web. Current Eurostat surveys show that up to 50% of the citizens of some states of the European Union have posted at least one Web message in the year preceding the interview: this posting is undoubtedly often a forum posting. However, few studies (such as Light and Rogers 1999) have dealt with the linguistic or textual features of the forums or even with more basic facts such as their diffusion. Moreover, there is wide variety even in the name of the genre. In describing Web forums, the secondary literature calls them message boards, discussion boards, discussion groups, conversations, chatgroups, newsgroups and so on (on classification issues for Web texts, see also Rehm et al. 2008). This variation also seems to encourage the literature to group together many textual genres that we find scarcely related from any point of view: Usenet newsgroups of the 1990s lack many of the features now common in Web forum interfaces, and so on. Typical of this attitude are syntheses such as Crystal (2006, pp. 134-177), where a single chapter devoted to "The language of chatgroups" describes both "asynchronous" and "synchronous groups", with a few lines of specification. In this paper we will instead deal with a very specific textual genre: asynchronous conversations collected in threads and managed by a particular kind of Web site.
Many Web forums allow more than a single way of interaction, and messages can be posted on them by various means (Web interface, e-mail and so on). We will see in §§ 4-5 that this seems to have linguistic consequences more relevant than those connected to, say, the topic of the forum, hinting at the need to make subtler distinctions between subgenres. We therefore did not take into account traditional newsgroups or mailing lists if they do not have a web interface allowing users not only to read past conversations but also to write new messages. On the other hand, we did consider as possible sample material every kind of forum where:

A Tentative Typology of Net Mediated Communication

2005

This work is part of a wider project aimed at collecting and publishing a considerable number of texts written for the Internet – especially NewsGroups – in Italian, German, Spanish, French, and English: about 600,000,000 words per language were collected (some tagged Italian NewsGroups and some raw Spanish NewsGroups are now available at [4]). Such a wide-ranging project required a variety of preliminary studies on the vocabulary, grammar, and textual varieties of Italian. One of the several case studies under way gave rise to an abstract model for the description of the textual features peculiar to Computer Mediated Communication (CMC). The present analysis will show the main characteristics of the model, which quantifies the parameters of space, time, and accessibility of selected texts and defines indexes of attention for competition, interactionality, and connexity. Finally, the values obtained from the analysis are compared with text-message, forum and NewsGroup data.

A Faceted Classification Scheme for Computer-Mediated Discourse

This article describes a classification scheme for computer-mediated discourse that classifies samples in terms of clusters of features, or "facets". The goal of the scheme is to synthesize and articulate aspects of technical and social context that influence discourse usage in CMC environments. The classification scheme is motivated, presented in detail with support from existing literature, and illustrated through a comparison of two types of weblog (blog) data. In concluding, the advantages and limitations of the scheme are weighed.

Introduction

It is by now a truism that computer-mediated communication (CMC) – defined here as predominantly text-based human-human interaction mediated by networked computers or mobile telephony – provides an abundance of data on human behavior and language use. Confronted with such abundance, researchers and practitioners have naturally sought to group, label, or otherwise organize CMC into categories that would facilitate its analysis and uses. However, there has been neither systematic discussion of how this should be done nor consensus regarding individual attempts to do so, many of which have been implicit and ad hoc. As a consequence, how to classify CMC remains a significant unaddressed problem of information organization. This article is concerned with the classification of CMC for research purposes, with a focus on online language and language use, hereafter referred to as computer-mediated discourse (CMD; Herring 1996, 2001). Specifically, it proposes an approach to the classification of CMD based on multiple categories or "facets", a concept borrowed from classification theory in the field of library and information science.
In contrast to applications in that field, however, which are primarily concerned with information storage and retrieval, the goal of the CMD scheme is to articulate aspects of context – both technical and social – that potentially influence discourse usage in CMC environments, and thereby to bring them to the conscious attention of the researcher. In this, it is akin in spirit to Hymes' (1974) etic grid, also known as the SPEAKING mnemonic, which is treated here as an early example of faceted classification in a research context. The organization of this article reflects its goal to motivate, articulate, and illustrate a model. The next section identifies the basic problem that gave rise to the need for a CMD classification scheme. Following a review of research on discourse classification, I then present an overview of the proposed faceted classification scheme for CMD and describe its dimensions and categories. This is followed by an illustration in which the scheme is applied to characterize contrasting computer-mediated (weblog) data samples. In concluding, the advantages and limitations of the faceted classification approach to online communication are weighed.

The Problem

Various attempts have been made by linguists to classify CMD, starting in the 1980s and early 1990s. Accustomed to dealing with two basic modalities of language – speech and writing –
