Automatically Extracting Typical Syntactic Differences from Corpora

Automatically Extracting Significant Syntax Differences from Corpora

Literary and Linguistic Computing (Oxford Journals).

We develop an aggregate measure of syntactic difference for automatically finding common syntactic differences between collections of text. With this measure, it is possible to mine for differences between, for example, the English of learners and that of native speakers, or between related dialects. If hypotheses are formulated in advance, they can also be tested for statistical significance. The measure enables us to detect not only the absence or presence, but also the under- and overuse of specific constructions. We applied it to the English of Finnish immigrants in Australia to look for traces of Finnish grammar in their English. The outcomes of this detection process were analysed and found to be insightful; a report is included in this article. Besides explaining our method, we also go into the theory behind it, including permutation statistics and the custom normalizations required for applying these tests to syntactic data. We also explain how to use the software we developed to apply this method to new corpora, and give some suggestions for further research.

A Measure of Aggregate Syntactic Distance

… of the Workshop on Linguistic Distances, 2006

We compare vectors containing counts of trigrams of part-of-speech (POS) tags in order to obtain an aggregate measure of syntactic difference. Since lexical syntactic categories also reflect more abstract syntax, we argue that this procedure captures more than just the basic syntactic categories. We tag the material automatically and analyse the frequency vectors of POS trigrams using a permutation test. A test analysis of a 305,000-word corpus containing the English of Finnish emigrants to Australia is promising: the proposed procedure works well in distinguishing two different groups (adult vs. child emigrants) and also in highlighting syntactic deviations between the two groups.
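The core of the procedure described above — building POS-trigram frequency profiles per group and testing the observed distance against random regroupings of the documents — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the distance function (summed absolute differences of relative trigram frequencies) and the toy tag sequences are assumptions for the sake of the example.

```python
# Sketch: aggregate syntactic distance via POS-trigram profiles,
# with significance estimated by a permutation test over documents.
import random
from collections import Counter

def trigram_counts(pos_tags):
    """Count POS trigrams in one document's tag sequence."""
    return Counter(zip(pos_tags, pos_tags[1:], pos_tags[2:]))

def group_profile(docs):
    """Sum trigram counts over all documents in a group."""
    total = Counter()
    for d in docs:
        total.update(trigram_counts(d))
    return total

def distance(p, q):
    """Aggregate distance: summed absolute differences of
    relative trigram frequencies (one simple choice of measure)."""
    n_p, n_q = sum(p.values()) or 1, sum(q.values()) or 1
    keys = set(p) | set(q)
    return sum(abs(p[k] / n_p - q[k] / n_q) for k in keys)

def permutation_test(group_a, group_b, n_perm=1000, seed=0):
    """Estimate a p-value: how often does a random regrouping of the
    pooled documents yield a distance at least as large as observed?"""
    rng = random.Random(seed)
    observed = distance(group_profile(group_a), group_profile(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = distance(group_profile(pooled[:n_a]),
                     group_profile(pooled[n_a:]))
        if d >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

A low p-value indicates that the syntactic difference between the groups is unlikely under random regrouping; inspecting the trigrams that contribute most to the distance then points at specific over- or underused constructions.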

Measuring syntactical variation in Germanic texts

Digital Scholarship in the Humanities

We present two new measures of syntactic distance between languages. First, the 'movement measure' quantifies the average number of words that have moved in sentences of one language compared to the corresponding sentences in another language. Second, the 'indel measure' quantifies the average number of words inserted or deleted in sentences of one language compared to the corresponding sentences in another language. We compared both to the 'trigram measure' introduced in earlier work. Correlating the results of the three measures, we found a low correlation between the movement and indel measures, indicating that the two represent different kinds of linguistic variation, and a high correlation between the movement measure and the trigram measure. The results of all three measures suggest that English is syntactically a Scandinavian language. Because of our unique database design, we were able to detect asymmetric relationships between the languages. All three measures suggest that asymmetric syntactic distances could be part of the explanation of why native speakers of Dutch understand German texts more easily than native speakers of German understand Dutch texts (Swarte 2016).
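For a single pair of aligned (parallel) sentences, the two measures can be approximated as below. This is my own simplification, not the authors' exact algorithm: indel is approximated here as a multiset difference of word forms, and movement as the number of pairwise order inversions among words shared by both versions.

```python
# Sketch: per-sentence-pair approximations of the indel and
# movement measures on parallel sentences (token lists).
from collections import Counter

def indel(sent_a, sent_b):
    """Words inserted or deleted between the two versions,
    approximated via multiset difference."""
    ca, cb = Counter(sent_a), Counter(sent_b)
    inserted = sum((cb - ca).values())
    deleted = sum((ca - cb).values())
    return inserted + deleted

def movement(sent_a, sent_b):
    """Order inversions among words occurring exactly once in
    both sentences, as a proxy for how many words have moved."""
    shared = [w for w in sent_a
              if sent_a.count(w) == 1 and sent_b.count(w) == 1]
    order_b = [w for w in sent_b if w in shared]
    inversions = 0
    for i in range(len(shared)):
        for j in range(i + 1, len(shared)):
            if order_b.index(shared[i]) > order_b.index(shared[j]):
                inversions += 1
    return inversions
```

Averaging these per-pair scores over a parallel corpus gives language-pair distances; because the word-alignment direction matters in the full method, the averages need not be symmetric, which is what allows asymmetric relationships between languages to surface.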

Measuring Syntactic Distances between Dialects: A Web Application for Annotating Dialectal Data

Procedia Computer Science, 2014

Research in dialectal variation allows linguists to understand the fundamental principles underlying language systems and grammatical change in time and space. Since different dialectal variants are not randomly distributed across a territory, and geographical patterns of variation are recognizable for individual syntactic forms, we believe a systematic approach for studying these variations is required. In this paper, we present a Web application for annotating dialectal data, in particular with the aim of measuring the degree of syntactic difference between dialects.

Needles in Haystacks: Semi-Automatic Identification of Regional Grammatical Variation in Standard German

Heidelberg University Publishing eBooks, 2018

This paper lays out a semi-automatic approach to identifying regional variation in the grammar of Standard German. Our approach takes as input manually defined templates of grammatical constructions that are automatically instantiated over a corpus collected from regional newspapers. These instantiations are automatically ranked by a metric that quantifies how specific an instantiation is to a region. Ranked lists of region-specific instantiations are compiled and scanned manually by linguists to identify those that denote grammatical variants of Standard German. This approach enabled us to discover variants that had not previously been documented. With respect to research on variation within standard languages more generally, we aim to contribute towards research strategies that rely on empiricism rather than on intuition or bias.
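A ranking metric of the kind described — scoring each instantiation by how much more frequent it is in one region's corpus than elsewhere — might look like the sketch below. The specific metric here (smoothed log-odds of relative frequencies, region vs. all other regions) is an illustrative choice on my part, not necessarily the one used in the paper, and the counts are toy data.

```python
# Sketch: rank construction instantiations by region-specificity
# using smoothed log-odds of relative frequencies.
import math

def region_specificity(counts_by_region, region, alpha=0.5):
    """counts_by_region: {region: {instantiation: count}}.
    Returns instantiations ranked by descending specificity
    for the given region."""
    inside = counts_by_region[region]
    outside = {}
    for r, counts in counts_by_region.items():
        if r == region:
            continue
        for inst, c in counts.items():
            outside[inst] = outside.get(inst, 0) + c
    n_in = sum(inside.values())
    n_out = sum(outside.values())
    vocab = set(inside) | set(outside)
    scores = {}
    for inst in vocab:
        # alpha-smoothing avoids division by zero for unseen items
        p_in = (inside.get(inst, 0) + alpha) / (n_in + alpha * len(vocab))
        p_out = (outside.get(inst, 0) + alpha) / (n_out + alpha * len(vocab))
        scores[inst] = math.log(p_in / p_out)
    return sorted(scores, key=scores.get, reverse=True)
```

The top of the ranked list then goes to the linguists, who separate genuine regional grammatical variants from regionally skewed but uninteresting instantiations (e.g. topical vocabulary effects).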

Corpus linguistics and the automatic analysis of English

In a recent paper advocating a corpus-based and probabilistic approach to grammar development, Black et al. argue that "the current state of the art is far from being able to produce a robust parser of general English" and advocate "steady and quantifiable," empirically corpus-driven grammar development and testing. Black et al. are addressing a community in which armchair introspection has been, and in many quarters still is, the dominant methodology; but in some parts of Europe, corpus linguistics never died. For nearly two decades, the Nijmegen group led by Jan Aarts has been undertaking corpus analyses that, although motivated primarily by the desire to study language variation using corpus data, are particularly relevant to the issue of broad-coverage grammar development. In contrast to other groups undertaking corpus-based work (e.g., Garside, Leech, and Sampson 1987), the Nijmegen group has consistently adopted the position that it is possible and desirable to develop a formal, generative grammar that characterizes the syntactic properties of a given corpus and can be used to assign appropriate analyses to each of its sentences.

Applying automatically parsed corpora to the study of language variation

2020

In this work, we discuss the benefits of using automatically parsed corpora to study language variation. The study of language variation is an area of linguistics in which quantitative methods have been particularly successful. We argue that the large datasets that can be obtained using automatic annotation can help drive further research in this direction, providing sufficient data for the increasingly complex models used to describe variation. We demonstrate this by replicating and extending a previous quantitative variation study that used manually and semi-automatically annotated data. We show that while the study cannot be replicated completely due to limitations of the existing automatic annotation, we can draw at least the same conclusions as the original study. In addition, we demonstrate the flexibility of this method by extending the findings to related linguistic constructions and to another domain of text, using additional data.

Detecting Syntactic Contamination In Emigrants: The English of Finnish Australians

SKY Journal of Linguistics, 2007

The paper discusses an application of a technique to automatically tag a corpus containing the English of Finnish Australians and to analyse the frequency vectors of part-of-speech (POS) trigrams using a permutation test. Our goal is to detect the linguistic sources of the syntactic variation between two groups: the ‘Adults’, who had received their school education in Finland, and the ‘Juveniles’, who were educated in Australia. The idea of the technique is to utilise frequency profiles of trigrams of POS categories as indicators of syntactic distance between the groups and then examine potential effects of language contact and language (‘vernacular’) universals in SLA. The results show that some features we describe as ‘contaminating’ the interlanguage of the Adults can best be attributed to Finnish substratum transfer. However, there are other features in our data that may also be ascribed to more ‘universal’ primitives or universal properties of the language faculty. As we have no evidence of potential contamination at the early stages of the Juveniles’ L2 acquisition, we cannot yet prove or refute our hypothesis about the strength of contact influence as opposed to that of the other factors.

Modeling Global Syntactic Variation in English Using Dialect Classification

Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, 2019

This paper evaluates global-scale dialect identification for 14 national varieties of English as a means for studying syntactic variation. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set of syntactic features using grammar induction rather than focusing on a few hand-selected features such as function words; and (iii) comparing models across both web corpora and social media corpora in order to measure the robustness of syntactic variation across registers.