Dagmar Divjak | The University of Birmingham
Books by Dagmar Divjak
Each venture a new beginning: studies in honour of Laura A. Janda. Bloomington, IN: Slavica Publishers, 2017
The vast majority of linguistic theories are built on a peculiar type of data: acceptability or grammaticality ratings. Traditionally these ratings were obtained through introspection by the analyst, an approach that is problematic in many (if not most) respects. Linguists addressed (part of) the issue by starting to elicit ratings from largish numbers of native speakers. Yet, this caused a new problem: due to the unpopularity of ordinal data in disciplines that drive the development of statistical analysis, few techniques are available that handle this type of data with grace. In our contribution, we explain how Generalized Additive Mixed Models can be used to explore ordinal data in all its complexity using the mgcv package in R.
Cognitive Linguistics is an approach to language study based on the assumptions that our linguistic abilities are firmly rooted in our cognitive abilities, that meaning is essentially conceptualization, and that grammar is shaped by usage. The Handbook of Cognitive Linguistics provides state-of-the-art overviews of the numerous subfields of cognitive linguistics written by leading international experts which will be useful for established researchers and novices alike. It is an interdisciplinary project with contributions from linguists, psycholinguists, psychologists, and computer scientists which will emphasise the most recent developments in the field, in particular, the shift towards more empirically-based research. In this way, it will, we hope, help to shape the field, encouraging methodologically more rigorous research which incorporates insights from all the cognitive sciences.
"Given that we lack sensory-motor experience for abstract concepts, how do we find out what they mean? How far can we get by tracking frequency distributions in input? The volume tackles the question of what language has to offer the language learner in his/her quest for meaning, and explicitly addresses how semantic knowledge may be distributed along the continuum from "grammar" to "lexicon". Focus is on the synonymy of constructions and lexemes, a meaning relation that has been largely ignored in Western linguistics.
Frequency in all its guises plays a major part in this book. Approaching meaning from a usage-based perspective, a radically distributional approach to quantifying meaning is proposed that encompasses both the constructional and lexical level. Statistical data analysis, relatively new in the field of linguistics, yields a cognitively realistic, clustered model that encourages re-evaluating existing accounts of near-synonymy. Theoretical concepts spanning a range of cognitive linguistic frameworks, i.e. Cognitive Grammar, Radical Construction Grammar and Prototype Theory, account for the complexity of the data and lead to a re-appraisal of traditional semantic theory.
Built on a solid empirical foundation, this network account of synonymy at the constructional and lexical level enriches our understanding of established aspects of the cognitive model of language, serving as catalyst for their further development and refinement. The theoretically informed combination of descriptive accuracy and methodological innovation makes the book a worthwhile read for cognitive linguists and psycholinguists alike."
...constructional alternations, lexical contrasts and extensions and multi-word expressions) in a variety of languages (Dutch, English, Russian and Spanish) and their representation in cognition as mediated by frequency counts in both text and experiment. The state-of-the-art data collection (ranging from questionnaires to eye-tracking) and analysis (from simple chi-squared to random effects regression) techniques make it possible to draw theoretical conclusions from (mis)matches between different types of empirical data. The sister volume focuses on language learning and processing.
The volume contains a collection of studies on how the analysis of corpus and psycholinguistic data reveals how linguistic knowledge is affected by the frequency of linguistic elements/stimuli. The studies explore a wide range of phenomena, from phonological reduction processes and palatalization to morphological productivity, diachronic change, adjective preposition constructions, auxiliary omission, and multi-word units. The languages studied are Spanish and artificial languages, Russian, Dutch, and English. The sister volume focuses on language representation.
"The volume presents an overview of recent cognitive linguistic research on Slavic languages. Slavic languages, with their rich inflectional morphology in both the nominal and the verbal system, provide an important testing ground for a linguistic theory that seeks to motivate linguistic structure.
Therefore, the volume touches upon a wide range of phenomena: it addresses issues related to the semantics of grammatical case, tense, aspect, voice and word order, it looks into grammaticalization and language change and discusses sound symbolism. At the same time, the analyses presented address a variety of theoretically important issues. Take for example the role of virtual entities in language or that of iconic motivation in grammar, the importance of metaphor for grammaticalization or that of subjectification for motivating synchronic polysemy and diachronic language change, as well as the myriad of patterns available to encode events in a non-canonical way or to convey the speaker's epistemic stance with respect to the communicated content. In addition, the analyses are couched in a variety of cognitive linguistic frameworks, such as cognitive grammar, mental space theory, construction grammar, frame semantics, grammaticalization theory, as well as prototype semantics.
All in all, the analyses presented in this volume enrich the understanding of established aspects of the cognitive model of language and may serve as catalysts for their further development and refinement, making the volume a worthwhile read for Slavic and cognitive linguists alike."
Papers by Dagmar Divjak
Journal of Experimental Psychology, 2017
Milin, P., D. Divjak, and R. H. Baayen
The goal of the present study is to understand the role orthographic and semantic information play in the behaviour of skilled readers. Reading latencies from a self-paced sentence reading experiment in which Russian near-synonymous verbs were manipulated appear well-predicted by a combination of bottom-up sub-lexical letter triplets (trigraphs) and top-down semantic generalizations, modelled using the Naive Discrimination Learner. The results reveal a complex interplay of bottom-up and top-down support from orthography and semantics to the target verbs, whereby activations from orthography only are modulated by individual differences. Using performance on a serial reaction time task for a novel operationalization of the mental speed hypothesis, we explain the observed individual differences in reading behaviour in terms of the exploration/exploitation hypothesis from Reinforcement Learning, where initially slower and more variable behaviour leads to better performance overall.
Author Note: The experiments received ethical approval from the University of Sheffield (UK), School of Languages & Cultures. The PsychoPy scripts were written by Lily FitzGibbon; participants were recruited and scheduled by Daria Satyukova. The Institute for Linguistic Studies of the St Petersburg branch of the Russian Academy of Sciences kindly made facilities available for testing. The financial support of the British Academy, the Prokhorov Foundation and the Alexander von Humboldt Foundation is gratefully acknowledged. We would like to thank Neil Bermel, Michael Ramscar, and Tom Stafford for helpful discussions and comments on initial versions of the manuscript. We are very grateful to Victor Kuperman and two further anonymous reviewers for their comments on first submission of this paper. Correspondence concerning this article should be addressed to Petar Milin,
Cognitive Linguistics, 2016
Milin, P., D. Divjak, S. Dimitrijevic, and R. H. Baayen
Over the past 10 years, Cognitive Linguistics has taken a Quantitative Turn. Yet, concerns have been raised that this preoccupation with quantification and modelling may not bring us any closer to understanding how language works. We show that this objection is unfounded, especially if we rely on modelling techniques based on biologically and psychologically plausible learning algorithms. These make it possible to take a quantitative approach, while generating and testing specific hypotheses that will advance our understanding of how knowledge of language emerges from exposure to usage.
Acknowledgments: The financial support of the Alexander von Humboldt Foundation (to Harald Baayen and Petar Milin) and the British Academy (to Dagmar Divjak) is gratefully acknowledged. We wish to thank Emmanuel Keuleers for providing generous help in implementing TiMBL, and Svetlana Borojević who provided access to the experimental hardware and software.
In this paper, we focus on corpus-linguistic studies that address theoretical questions and on computational linguistic work on corpus annotation that makes corpora useful for linguistic work. First, we discuss why the corpus linguistic approach was discredited by generative linguists in the second half of the 20th century, how it made a comeback through advances in computing and was adopted by usage-based linguistics at the beginning of the 21st century. Then, we move on to an overview of necessary and common annotation layers and the issues that are encountered when performing automatic annotation, with special emphasis on Slavic languages. Finally, we survey the types of research requiring corpora that Slavic linguists are involved in worldwide, and the resources they have at their disposal.
In this paper, I pursue the distributional hypothesis that the meaning of a word is derived from the linguistic contexts in which it occurs and apply it to verbs of perception. Differently from NLP implementations of the distributional hypothesis, I explicitly limit the range of variables to the grammatical domain and chart the way in which verbs of Vision, Hearing and Touch are used, morphologically and syntactically, in a representative sample of corpus data. Some aspects of experience are so central and pervasive that reference to them has grammaticalized (Divjak 2010; see also Newman 2008; Janda & Lyashevskaya 2011). The aim is, firstly, to determine to what extent a verb's grammatical context alone allows us to classify utterances according to the perception type, and, secondly, to chart the similarities and differences in the verbs' preference for morphological markers and syntactic constructions. If contexts are highly specialized, language structure, as it is witnessed in use, could assist sensory impaired speakers in building up viable representations of concepts, even if sensory experience is lacking. If, in addition, similarities between certain sensory perception verbs are high, sensory impaired speakers could use these similarities to perform analogical mapping across senses and ground concepts relating to the impaired sense in a cognate sensory experience. The findings are relevant for concept acquisition and representation in general and for concept acquisition and representation in sensory impaired populations, such as the blind, in particular.
Since its conception, Cognitive Linguistics as a theory of language has been enjoying ever increasing success worldwide. With quantitative growth has come qualitative diversification, and within a now heterogeneous field, different – and at times opposing – views on theoretical and methodological matters have emerged. The historical "prototype" of Cognitive Linguistics may be described as predominantly of mentalist persuasion, based on introspection, specialized in analysing language from a synchronic point of view, focused on West-European data (English in particular), and showing limited interest in the social and multimodal aspects of communication. Over the past years, many promising extensions from this prototype have emerged. The contributions selected for the Special Issue take stock of these extensions along the cognitive, social and methodological axes that expand the cognitive linguistic object of inquiry across time, space and modality.
Usage-based linguistics abounds with studies that use statistical classification models to analyse either textual corpus data or behavioral experimental data. Yet, before we can draw conclusions from statistical models of empirical data that we can feed back into cognitive linguistic theory, we need to assess whether the text-based models are cognitively plausible and whether the behavior-based models are linguistically accurate. In this paper, we review four case studies that evaluate statistical classification models of richly annotated linguistic data by explicitly comparing the performance of a corpus-based model to the behavior of native speakers. The data come from four different languages (Arabic, English, Estonian, and Russian) and pertain to both lexical as well as syntactic near-synonymy. We show that behavioral evidence is needed in order to fine-tune and improve statistical models built on data from a corpus. We argue that methodological pluralism and triangulation are the keys for a cognitively realistic linguistic theory.
A number of studies report that frequency is a poor predictor of acceptability, in particular at the lower end of the frequency spectrum. Because acceptability judgments provide a substantial part of the empirical foundation of dominant linguistic traditions, understanding how acceptability relates to frequency, one of the most robust predictors of human performance, is crucial. The relation between low frequency and acceptability is investigated using corpus- and behavioral data on the distribution of infinitival and finite that-complements in Polish. Polish verbs exhibit substantial subordination variation and for the majority of verbs taking an infinitival complement, the that-complement occurs with low frequency (<0.66 ipm). These low-frequency that-clauses, in turn, exhibit large differences in how acceptable they are to native speakers. It is argued that acceptability judgments are based on configurations of internally structured exemplars, the acceptability of which cannot reliably be assessed until sufficient evidence about the core component has accumulated.
Linguistic convention allows speakers various options. Evidence is accumulating that the various options are preferred in different contexts yet the criteria governing the selection of the appropriate form are often far from obvious. Most researchers who attempt to discover the factors determining a preference rely on the linguistic analysis and statistical modeling of data extracted from large corpora.
In this paper, we address the question of how to evaluate such models and explicitly compare the performance of a statistical model derived from a corpus with that of native speakers in selecting one of six Russian TRY verbs. Building on earlier work by Divjak (2003, 2004, 2010) and Divjak & Arppe (2013), we trained a polytomous logistic regression model to predict verb choice given the context. We compare the predictions the model makes for 60 unseen sentences to the choices adult native speakers make in those same sentences. We then look in more detail at the interplay of the contextual properties and model computationally how individual differences in assessing the importance of contextual properties may impact the linguistic knowledge of native speakers. Finally, we compare the probability the model assigns to encountering each of the 6 verbs in the 60 test sentences to the acceptability ratings the adult native speakers give to those sentences. We discuss the implications of our findings for both usage-based theory and empirical linguistic methodology.
Over the past four decades, two distinct alternatives have emerged to rule-based models of how linguistic categories are stored and represented as cognitive structures, namely the prototype and exemplar theories. Although these models were initially thought to be mutually exclusive, shifts from one mechanism to the other have been observed in category learning experiments, bringing the models closer together. In this paper we implement a technique akin to varying abstraction modelling that assumes intermediate abstraction processes to underlie category representations and categorization decisions; we do so using statistical techniques such as regression and clustering that linguists are familiar with. Using this model we simulate, on the basis of actual usage of Russian TRY verbs and Finnish THINK verbs as observed in corpora, how prototypes for near-synonymous verbs could be formed from concrete exemplars at different levels of abstraction using statistical techniques that track frequency distributions in input.
In so doing, we take a closer look at the cognitive linguistic flirtation with multiple categorization theories, suggesting three improvements anchored in the fact that cognitive linguistics is a usage-based theory of language. Firstly, we show that language provides support for considering single prototype and full exemplar models as opposite ends along a continuum of abstraction. Secondly, we present a methodology that simulates how prototypes can be obtained from exemplars at more than one level of abstraction in a systematic and verifiable way. And thirdly, we illustrate our claims on the basis of work on verbs, denoting intangible events that are neither stable in nor independent of time and express relational concepts; this implies that verbs are more susceptible to their meanings being influenced by the concepts they relate.
In this paper we present the results of an empirical study into the cognitive reality of existing classifications of modality using Polish data.
We analyzed random samples of 250 independent observations for the 7 most frequent modal words (móc, można, musieć, należy, powinien, trzeba, wolno), extracted from the Polish national corpus. Observations were annotated for modal type according to a number of classifications, including van der Auwera and Plungian (1998), as well as for morphological, syntactic and semantic properties using the Behavioral Profiling approach (Divjak and Gries 2006). Multiple correspondence analysis and (polytomous) regression models were used to determine how well modal type and usage align. These corpus-based findings were validated experimentally. In a forced choice task, naive native speakers were exposed to definitions and prototypical examples of modal types or functions, then labeled a number of authentic corpus sentences accordingly. In the sorting task, naive native speakers sorted authentic corpus sentences into semantically coherent groups.
We discuss the results of our empirical study as well as the issues involved in building usage-based accounts on traditional linguistic classifications.
Each venture a new beginning: studies in honour of Laura A. Janda.Bloomington, IN: Slavica Publishers, 2017
The vast majority of linguistic theories are built on a peculiar type of data: acceptability or g... more The vast majority of linguistic theories are built on a peculiar type of data: acceptability or grammaticality ratings. Traditionally these ratings were obtained through introspec-tion by the analyst, an approach that is problematic in many (if not most) respects. Linguists addressed (part of) the issue by starting to elicit ratings from largish numbers of native speakers. Yet, this caused a new problem: due to the unpopularity of ordinal data in disciplines that drive the development of statistical analysis, few techniques are available that handle this type of date with grace. In our contribution, we explain how Generalized Additive Mixed Models can be used to explore ordinal data in all its complexity using the mgcv package in R.
Cognitive Linguistics is an approach to language study based on the assumptions that our linguist... more Cognitive Linguistics is an approach to language study based on the assumptions that our linguistic abilities are firmly rooted in our cognitive abilities, that meaning is essentially conceptualization, and that grammar is shaped by usage. The Handbook of Cognitive Linguistics provides state-of-the-art overviews of the numerous subfields of cognitive linguistics written by leading international experts which will be useful for established researchers and novices alike. It is an interdisciplinary project with contributions from linguists, psycholinguists, psychologists, and computer scientists which will emphasise the most recent developments in the field, in particular, the shift towards more empirically-based research. In this way, it will, we hope, help to shape the field, encouraging methodologically more rigorous research which incorporates insights from all the cognitive sciences.
"Given that we lack sensory-motor experience for abstract concepts, how do we find out what they ... more "Given that we lack sensory-motor experience for abstract concepts, how do we find out what they mean? How far can we get by tracking frequency distributions in input? The volume tackles the question of what language has to offer the language learner in his/her quest for meaning, and explicitly addresses how semantic knowledge may be distributed along the continuum from "grammar" to "lexicon". Focus is on the synonymy of constructions and lexemes, a meaning relation that has been largely ignored in Western linguistics.
Frequency in all its guises plays a major part in this book. Approaching meaning from a usage-based perspective, a radically distributional approach to quantifying meaning is proposed that encompasses both the constructional and lexical level. Statistical data analysis, relatively new in the field of linguistics, yields a cognitively realistic, clustered model that encourages re-evaluating existing accounts of near-synonymy. Theoretical concepts spanning a range of cognitive linguistic frameworks, i.e. Cognitive Grammar, Radical Construction Grammar and Prototype Theory, account for the complexity of the data and lead to a re-appraisal of traditional semantic theory.
Built on a solid empirical foundation, this network account of synonymy at the constructional and lexical level enriches our understanding of established aspects of the cognitive model of language, serving as catalyst for their further development and refinement. The theoretically informed combination of descriptive accuracy and methodological innovation makes the book a worthwhile read for cognitive linguists and psycholinguists alike."
nstructional alternations, lexical contrasts and extensions and multi-word expressions) in a vari... more nstructional alternations, lexical contrasts and extensions and multi-word expressions) in a variety of languages (Dutch, English, Russian and Spanish) and their representation in cognition as mediated by frequency counts in both text and experiment. The state-of-the-art data collection (ranging from questionnaires to eye-tracking) and analysis (from simple chi-squared to random effects regression) techniques allow to draw theoretical conclusions from (mis)matches between different types of empirical data. The sister volume focuses on language learning and processing.
The volume contains a collection of studies on how the analysis of corpus and psycholinguistic da... more The volume contains a collection of studies on how the analysis of corpus and psycholinguistic data reveal how linguistic knowledge is affected by the frequency of linguistic elements/stimuli. The studies explore a wide range of phenomena , from phonological reduction processes and palatalization to morphological productivity, diachronic change, adjective preposition constructions, auxiliary omission, and multi-word units. The languages studied are Spanish and artificial languages, Russian, Dutch, and English. The sister volume focuses on language representation.
"The volume presents an overview of recent cognitive linguistic research on Slavic languages. Sla... more "The volume presents an overview of recent cognitive linguistic research on Slavic languages. Slavic languages, with their rich inflectional morphology in both the nominal and the verbal system, provide an important testing ground for a linguistic theory that seeks to motivate linguistic structure.
Therefore, the volume touches upon a wide range of phenomena: it addresses issues related to the semantics of grammatical case, tense, aspect, voice and word order, it looks into grammaticalization and language change and discusses sound symbolism. At the same time, the analyses presented address a variety of theoretically important issues. Take for example the role of virtual entities in language or that of iconic motivation in grammar, the importance of metaphor for grammaticalization or that of subjectification for motivating synchronic polysemy and diachronic language change, as well as the myriad of patterns available to encode events in a non-canonical way or to convey the speaker's epistemic stance with respect to the communicated content. In addition, the analyses are couched in a variety of cognitive linguistic frameworks, such as cognitive grammar, mental space theory, construction grammar, frame semantics, grammaticalization theory, as well as prototype semantics.
All in all, the analyses presented in this volume enrich the understanding of established aspects of the cognitive model of language and may serve as catalysts for their further development and refinement, making the volume a worthwhile read for Slavic and cognitive linguists alike."
Journal of Experimental Psychology, 2017
Milin, P., D. Divjak, and R. H. Baayen The goal of the present study is to understand the role o... more Milin, P., D. Divjak, and R. H. Baayen
The goal of the present study is to understand the role orthographic and semantic information play in the behaviour of skilled readers. Reading latencies from a self-paced sentence reading experiment in which Russian near-synonymous verbs were manipulated appear well-predicted by a combination of bottom-up sub-lexical letter triplets (trigraphs) and top-down semantic generalizations, modelled using the Naive Discrimination Learner. The results reveal a complex interplay of bottom-up and top-down support from orthography and semantics to the target verbs, whereby activations from orthography only are modulated by individual differences. Using performance on a serial reaction time task for a novel operationalization of the mental speed hypothesis, we explain the observed individual differences in reading behaviour in terms of the exploration/exploitation hypothesis from Reinforcement Learning, where initially slower and more variable behaviour leads to better performance overall. Author Note The experiments received ethical approval from the University of Sheffield (UK), School of Languages & Cultures. The PsychoPy scripts were written by Lily FitzGibbon; participants were recruited and scheduled by Daria Satyukova. The Institute for Linguistic Studies of the St Petersburg branch of the Russian Academy of Sciences kindly made facilities available for testing. The financial support of the British Academy, the Prokhorov Foundation and the Alexander von Humboldt Foundation is gratefully acknowledged. We would like to thank Neil Bermel, Michael Ramscar, and Tom Stafford for helpful discussions and comments on initial versions of the manuscript. We are very grateful to Victor Kuperman and two further anonymous reviewers for their comments on first submission of this paper. Correspondence concerning this article should be addressed to Petar Milin,
Cognitive Linguistics, 2016
Milin, P., D. Divjak, S. Dimitrijevic, and R. H. Baayen
Over the past 10 years, Cognitive Linguistics has taken a Quantitative Turn. Yet, concerns have been raised that this preoccupation with quantification and modelling may not bring us any closer to understanding how language works. We show that this objection is unfounded, especially if we rely on modelling techniques based on biologically and psychologically plausible learning algorithms. These make it possible to take a quantitative approach, while generating and testing specific hypotheses that will advance our understanding of how knowledge of language emerges from exposure to usage. Acknowledgments The financial support of the Alexander von Humboldt Foundation (to Harald Baayen and Petar Milin) and the British Academy (to Dagmar Divjak) is gratefully acknowledged. We wish to thank Emmanuel Keuleers for providing generous help in implementing TiMBL, and Svetlana Borojević who provided access to the experimental hardware and software.
In this paper, we focus on corpus-linguistic studies that address theoretical questions and on computational linguistic work on corpus annotation that makes corpora useful for linguistic work. First, we discuss why the corpus linguistic approach was discredited by generative linguists in the second half of the 20th century, how it made a comeback through advances in computing, and how it was adopted by usage-based linguistics at the beginning of the 21st century. Then, we move on to an overview of necessary and common annotation layers and the issues that are encountered when performing automatic annotation, with special emphasis on Slavic languages. Finally, we survey the types of research requiring corpora that Slavic linguists are involved in worldwide, and the resources they have at their disposal.
In this paper, I pursue the distributional hypothesis that the meaning of a word is derived from the linguistic contexts in which it occurs and apply it to verbs of perception. Differently from NLP implementations of the distributional hypothesis, I explicitly limit the range of variables to the grammatical domain and chart the way in which verbs of Vision, Hearing and Touch are used, morphologically and syntactically, in a representative sample of corpus data. Some aspects of experience are so central and pervasive that reference to them has grammaticalized (Divjak 2010; see also Newman 2008; Janda & Lyashevskaya 2011). The aim is, firstly, to determine to what extent a verb's grammatical context alone allows us to classify utterances according to the perception type, and, secondly, to chart the similarities and differences in the verbs' preference for morphological markers and syntactic constructions. If contexts are highly specialized, language structure, as it is witnessed in use, could assist sensory impaired speakers in building up viable representations of concepts, even if sensory experience is lacking. If, in addition, similarities between certain sensory perception verbs are high, sensory impaired speakers could use these similarities to perform analogical mapping across senses and ground concepts relating to the impaired sense in a cognate sensory experience. The findings are relevant for concept acquisition and representation in general and for concept acquisition and representation in sensory impaired populations, such as the blind, in particular.
Since its conception, Cognitive Linguistics as a theory of language has been enjoying ever increasing success worldwide. With quantitative growth has come qualitative diversification, and within a now heterogeneous field, different (and at times opposing) views on theoretical and methodological matters have emerged. The historical "prototype" of Cognitive Linguistics may be described as predominantly of mentalist persuasion, based on introspection, specialized in analysing language from a synchronic point of view, focused on West-European data (English in particular), and showing limited interest in the social and multimodal aspects of communication. Over the past years, many promising extensions from this prototype have emerged. The contributions selected for the Special Issue take stock of these extensions along the cognitive, social and methodological axes that expand the cognitive linguistic object of inquiry across time, space and modality.
Usage-based linguistics abounds with studies that use statistical classification models to analyse either textual corpus data or behavioral experimental data. Yet, before we can draw conclusions from statistical models of empirical data that we can feed back into cognitive linguistic theory, we need to assess whether the text-based models are cognitively plausible and whether the behavior-based models are linguistically accurate. In this paper, we review four case studies that evaluate statistical classification models of richly annotated linguistic data by explicitly comparing the performance of a corpus-based model to the behavior of native speakers. The data come from four different languages (Arabic, English, Estonian, and Russian) and pertain to both lexical and syntactic near-synonymy. We show that behavioral evidence is needed in order to fine-tune and improve statistical models built on data from a corpus. We argue that methodological pluralism and triangulation are the keys to a cognitively realistic linguistic theory.
A number of studies report that frequency is a poor predictor of acceptability, in particular at the lower end of the frequency spectrum. Because acceptability judgments provide a substantial part of the empirical foundation of dominant linguistic traditions, understanding how acceptability relates to frequency, one of the most robust predictors of human performance, is crucial. The relation between low frequency and acceptability is investigated using corpus and behavioral data on the distribution of infinitival and finite that-complements in Polish. Polish verbs exhibit substantial subordination variation and for the majority of verbs taking an infinitival complement, the that-complement occurs with low frequency (<0.66 ipm). These low-frequency that-clauses, in turn, exhibit large differences in how acceptable they are to native speakers. It is argued that acceptability judgments are based on configurations of internally structured exemplars, the acceptability of which cannot reliably be assessed until sufficient evidence about the core component has accumulated.
Linguistic convention allows speakers various options. Evidence is accumulating that the various options are preferred in different contexts, yet the criteria governing the selection of the appropriate form are often far from obvious. Most researchers who attempt to discover the factors determining a preference rely on the linguistic analysis and statistical modeling of data extracted from large corpora.
In this paper, we address the question of how to evaluate such models and explicitly compare the performance of a statistical model derived from a corpus with that of native speakers in selecting one of six Russian TRY verbs. Building on earlier work by Divjak (2003, 2004, 2010) and Divjak & Arppe (2013), we trained a polytomous logistic regression model to predict verb choice given the context. We compare the predictions the model makes for 60 unseen sentences to the choices adult native speakers make in those same sentences. We then look in more detail at the interplay of the contextual properties and model computationally how individual differences in assessing the importance of contextual properties may impact the linguistic knowledge of native speakers. Finally, we compare the probability the model assigns to encountering each of the 6 verbs in the 60 test sentences to the acceptability ratings the adult native speakers give to those sentences. We discuss the implications of our findings for both usage-based theory and empirical linguistic methodology.
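The comparison described above can be sketched roughly as follows. A polytomous (multinomial) logistic model turns per-verb scores into a probability for each verb, and the most probable verb is then compared with the verb most speakers chose. The coefficient values, feature names and verb labels below are hypothetical placeholders, not values from the study.

```python
import math


def softmax(scores):
    """Convert per-verb scores into probabilities that sum to 1."""
    top = max(scores.values())
    exps = {verb: math.exp(s - top) for verb, s in scores.items()}
    total = sum(exps.values())
    return {verb: e / total for verb, e in exps.items()}


def predict(weights, features):
    """Probability of each verb given the set of contextual properties present.

    weights maps verb -> {property: coefficient}; absent properties count as 0.
    """
    scores = {
        verb: sum(coefs.get(f, 0.0) for f in features)
        for verb, coefs in weights.items()
    }
    return softmax(scores)


def agreement(model_probs, speaker_choices):
    """Share of sentences where the model's top verb equals the speakers' most frequent choice."""
    hits = 0
    for probs, choices in zip(model_probs, speaker_choices):
        model_pick = max(probs, key=probs.get)
        speaker_pick = max(set(choices), key=choices.count)
        hits += model_pick == speaker_pick
    return hits / len(model_probs)


# Hypothetical coefficients for two of the verbs and two contextual properties.
weights = {
    "verb_a": {"past_tense": 1.0},
    "verb_b": {"infinitive": 1.0},
}
probs = predict(weights, {"past_tense"})
score = agreement([probs], [["verb_a", "verb_a", "verb_b"]])
```

The same per-verb probabilities could also be set against acceptability ratings rather than forced choices, which is the second comparison the abstract mentions.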
Over the past four decades, two distinct alternatives have emerged to rule-based models of how linguistic categories are stored and represented as cognitive structures, namely the prototype and exemplar theories. Although these models were initially thought to be mutually exclusive, shifts from one mechanism to the other have been observed in category learning experiments, bringing the models closer together. In this paper we implement a technique akin to varying abstraction modelling that assumes intermediate abstraction processes to underlie category representations and categorization decisions; we do so using statistical techniques such as regression and clustering that linguists are familiar with. Using this model we simulate, on the basis of actual usage of Russian TRY verbs and Finnish THINK verbs as observed in corpora, how prototypes for near-synonymous verbs could be formed from concrete exemplars at different levels of abstraction using statistical techniques that track frequency distributions in input.
In so doing, we take a closer look at the cognitive linguistic flirtation with multiple categorization theories, suggesting three improvements anchored in the fact that cognitive linguistics is a usage-based theory of language. Firstly, we show that language provides support for considering single prototype and full exemplar models as opposite ends along a continuum of abstraction. Secondly, we present a methodology that simulates how prototypes can be obtained from exemplars at more than one level of abstraction in a systematic and verifiable way. And thirdly, we illustrate our claims on the basis of work on verbs, denoting intangible events that are neither stable in nor independent of time and express relational concepts; this implies that verbs are more susceptible to their meanings being influenced by the concepts they relate.
In this paper we present the results of an empirical study into the cognitive reality of existing classifications of modality using Polish data.
We analyzed random samples of 250 independent observations for the 7 most frequent modal words (móc, można, musieć, należy, powinien, trzeba, wolno), extracted from the Polish national corpus. Observations were annotated for modal type according to a number of classifications, including van der Auwera and Plungian (1998), as well as for morphological, syntactic and semantic properties using the Behavioral Profiling approach (Divjak and Gries 2006). Multiple correspondence analysis and (polytomous) regression models were used to determine how well modal type and usage align. These corpus-based findings were validated experimentally. In a forced choice task, naive native speakers were exposed to definitions and prototypical examples of modal types or functions, then labeled a number of authentic corpus sentences accordingly. In the sorting task, naive native speakers sorted authentic corpus sentences into semantically coherent groups.
We discuss the results of our empirical study as well as the issues involved in building usage-based accounts on traditional linguistic classifications.
Russian Linguistics, 2009
This paper deals with the assignment of aspect in Russian modal constructions with adverbial or adjectival predicatives and impersonal verbs that combine with an infinitive. Unlike previous accounts, this paper takes a strictly corpus-based, quantitative approach within which corpus data on the relationship between aspect and modality are modeled using mixed effects logistic regression. Moreover, the results are cognitively motivated.
Cognitive Foundations of Language Structure and Use, 2009
One of the areas which most strongly supported the emergence of cognitive linguistics as a new research paradigm is that of lexical semantics. Early work, in particular on prepositions, introduced the notions of prototypes, network representations and radial categories into linguistics. These innovations of cognitive-linguistic lexical semantic analysis were later used for analyzing constructional elements. While this work has provided a wealth of insights, the approach (in particular the then widely used network representations of word senses) was criticized for a variety of methodological and conceptual shortcomings; in 1998 the main journal in the field saw a lively debate concerning the question of what contribution, if any, such approaches to, for instance, polysemy can make to issues of linguistic representation. It is probably fair to say that, in spite of a growing recognition of such shortcomings, the field of cognitive linguistics is still far from having resolved all of its issues.
Behavioral profiles: A corpus-based approach to cognitive semantic analysis. Stefan Th. Gries and Dagmar Divjak. 1. Introduction. In this paper we will look into questions that concern what may be considered two of the central meaning relations in semantics, i.e. polysemy or the ...
This article proposes a methodology for addressing three long-standing problems of near synonym research. First, we show how the internal structure of a group of near synonyms can be revealed. Second, we deal with the problem of distinguishing the subclusters and the words in those subclusters from each other. Finally, we illustrate how these results identify the semantic properties that should be mentioned in lexicographic entries. We illustrate our methodology with a case study on nine near synonymous Russian verbs that, in combination with an infinitive, express TRY.
Transactions of the Philological Society, 2008
This article focuses on grammatical constructions that attenuate or eliminate the expression of agency in Russian, using the frameworks of Radical Construction Grammar and Cognitive Grammar. Emphasis is on the organization of these constructions in larger networks of related personal and impersonal constructions, with impersonal constructions as peripheral members of the system. More specifically, we compare the role of the dative case in impersonal constructions containing a finite verb and an infinitive and demonstrate that there are two such constructions, which has implications for the concepts of main verb-hood and agentivity. This type of nuanced analysis takes into account factors such as case semantics and relationships among constructions in assessing how agency is assigned or avoided in Russian impersonal constructions, and hence makes it possible to tease apart the differences between two impersonal constructions that appear identical in structure.
Cognitive Linguistics Research, 2007
Degrees of event integration. A binding scale for [VFIN VINF] structures in Russian. Dagmar Divjak. Abstract: In this paper I merge insights from cognitive and functional approaches to complementation to present a comprehensive model, a binding scale, for the 293 verbs that combine with ...
Cognitive Linguistics Research, 2007
Why cognitive linguists should care about the Slavic languages and vice versa. Dagmar Divjak, Laura A. Janda and Agata Kochanska. 1. The cognitive paradigm and Slavic linguistic research. From its early days, cognitive linguistics has attracted the attention of linguists ...
Constructional Approaches to Language, 2015
In this paper we will present a corpus-based cognitive-semantic analysis of five verbs that express 'begin' in English and Russian, i.e. begin, start, načinat'/načat', načinat'sja/načat'sja and stat'. On the basis of a quantitative analysis of data extracted from the ICE-GB and the Uppsala Corpus we conclude that the prototype for each verb and each set of verbs in each language
In this paper, we assess objections formulated against (quantitative) corpus-linguistic methods in cognitive linguistics. We present claims critical of both corpus linguistics in general and specific corpus-linguistic analyses in particular and discuss a variety of theoretical as well as empirical shortcomings of these claims. In addition, we summarily discuss our recent corpus-based Behavioral Profile approach to cognitive semantics and
Cognitive Foundations of Language Structure and Use, 2014
In this paper we survey the verbs speakers of Dutch use most frequently to encode the horizontal movement of a non-liquid Figure in or on a liquid Ground. To our knowledge, there are no previous studies on this subject. Our results are mainly based on non-elicited data from electronically available corpora. Occasionally, data from dictionaries and internet examples have been taken into account. We will argue that Dutch lexicalizes the Manner of motion, i.e. it encodes the source of propulsion in the verb, leaving the interpretation of directionality to optional satellites or to contextual inference.
Dutch speakers who study Russian are inevitably told, in their first lesson on verbs of motion, that Russian is much more specific than Dutch in rendering motion. But is that correct? Does Russian really require speakers to convey more information about motion than Dutch does? Is it not rather the case that Russian and Dutch encode different components of motion, and use two different linguistic means to do so, namely grammar and lexicon? To investigate this hypothesis, I extracted 1222 sentences with plavat'/plyt' and zwemmen/varen/drijven from Dutch and Russian corpora. After a brief general description of the components of motion that are typically put into linguistic form, I will illustrate in more detail how Dutch and Russian proceed. The contrastive sketch of both the main literal uses and the figurative extensions of each of these verbs is illustrated with corpus material.
Studies in Language Companion Series, 2010
The paper contrasts the verbs plyt'/plavat' in Russian and płynąć/pływać in Polish with their correspondences in Dutch, English and Swedish against a broader typological background. The three Germanic languages use several verbs for what is covered by a pair of derivationally related verbs in each of the two Slavic languages. The Germanic languages lexicalize the activity/passivity of motion, but vary considerably as to how they carve up the conceptual space. Russian and Polish, on the other hand, use plavat'/plyt' independently of the activity/passivity of motion and focus on the uni- or non-unidirectionality of the motion.
Quantitative Methods in Cognitive Semantics: Corpus-Driven Approaches, 2010
Corpus-based evidence for an idiosyncratic aspect-modality relation in Russian. Dagmar Divjak. Abstract: There is an abundance of literature suggesting a relationship between aspect and modality; typically, perfective aspect is related to objective or factive information, ...
Dirk Geeraerts has played a key role in launching Cognitive Linguistics as a full-fledged theory of linguistics and in expanding its sphere of influence in Western Europe. Dirk is furthermore one of the first and strongest advocates for the incorporation of empirical methods (and quantitative, corpus-based methods in particular) into cognitive linguistic research. The “Quantitative Turn” (Janda 2013) is in large part due to his relentless insistence on methodological rigour. In this chapter, I want to take a closer look at what is currently methodological “good practice” in the field and draw attention to some of the assumptions that underlie our methodology and thereby shape our findings yet have gone unquestioned. Four challenges are highlighted (data annotation, statistical analysis, model validation and experimental design) and their theoretical foundations and implications discussed.
We report on a self-paced reading experiment that was run to ascertain whether the effect of differential tense, aspect and mood (henceforth TAM) marking on verbs would affect processing. TAM properties were identified as the strongest predictors for the choice between 6 near synonyms meaning TRY in Russian on the basis of regression models fit to manually annotated corpus data (Divjak 2010, Divjak & Arppe 2013). We will discuss how we used a Generalized Linear Mixed Model to account for the fact that we deviated from the traditional set-up for self-paced reading in two ways: we used an imbalanced design and ran the task with actually attested sentences rather than artificially created ones. These deviations were motivated by the need to accommodate the natural restrictions on TAM combinations and to respect the lack of a strict word order, which are both typical for Russian. We will also describe how we used a Generalized Additive Model to handle the non-linearities that we encountered in the reading times data.
Slavic Languages in Psycholinguistics, 2016
Divjak, D., A. Arppe, and R. H. Baayen
We report on a self-paced reading experiment that was run to ascertain whether the effect of differential tense, aspect and mood (henceforth TAM) marking on verbs would affect processing. TAM properties were identified as the strongest predictors for the choice between 6 near synonyms meaning TRY in Russian on the basis of regression models fit to manually annotated corpus data (Divjak 2010, Divjak & Arppe 2013). We will discuss how we used a Generalized Linear Mixed Model to account for the fact that we deviated from the traditional setup for self-paced reading in two ways: we used an imbalanced design and ran the task with actually attested sentences rather than artificially created ones. These deviations were motivated by the need to accommodate the natural restrictions on TAM combinations and to respect the lack of a strict word order, which are both typical for Russian. We will also describe how we used a Generalized Additive Model to handle the non-linearities that we encountered in the reading times data. Author contributions: DD and AA conceived and designed the self-paced reading experiment; DD ran the experiment; DD, AA and HB analyzed the data with comments from Petar Milin; DD wrote the paper using comments and suggestions from AA and HB. The PsychoPy script for self-paced reading was written by Lily FitzGibbon; participants were recruited and scheduled by Daria Satyukova. The experiment received ethical approval from the University of Sheffield, School of Languages & Cultures. The financial support of the Prokhorov Foundation and the logistic support of the Saint Petersburg branch of the Russian Academy of Sciences are gratefully acknowledged.
Each Venture a New Beginning. Studies in Honor of Laura A. Janda, 2017
Baayen, R. H., and D. Divjak
The vast majority of linguistic theories are built on a peculiar type of data: acceptability or grammaticality ratings. Traditionally these ratings were obtained through introspection by the analyst, an approach that is problematic in many (if not most) respects. Linguists addressed (part of) the issue by starting to elicit ratings from largish numbers of native speakers. Yet, this caused a new problem: due to the unpopularity of ordinal data in disciplines that drive the development of statistical analysis, few techniques are available that handle this type of data with grace. In our contribution, we explain how Generalized Additive Mixed Models can be used to explore ordinal data in all its complexity using the mgcv package in R.
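To give a sense of why ordinal ratings need dedicated handling, the sketch below shows one common way of modelling them: a cumulative logit model, in which a single linear predictor is cut into ordered categories by increasing thresholds. This is a generic illustration in Python, not the mgcv code from the chapter; the threshold values and the predictor values are assumptions chosen for the example.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probs(eta, thresholds):
    """Probability of each ordered rating category under a cumulative logit model.

    eta is the value of the predictor for one observation; thresholds must be
    strictly increasing, and K - 1 thresholds yield K rating categories.
    """
    cumulative = [logistic(t - eta) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)  # P(rating = k) = P(rating <= k) - P(rating <= k - 1)
        previous = c
    return probs


# Hypothetical thresholds for a 4-point rating scale.
thresholds = [-1.0, 0.0, 1.0]
neutral = category_probs(0.0, thresholds)
favourable = category_probs(2.0, thresholds)
```

A larger predictor value shifts probability mass towards the higher ratings without assuming that the distances between rating points are equal, which is the property that treating the ratings as plain numbers would lose.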