Sabine Stoll - Academia.edu
Papers by Sabine Stoll
Cognitive Science, 2022
Inflectional affixes expressing the same grammatical category (e.g., subject agreement) tend to appear in the same morphological position in the word. We hypothesize that this cross‐linguistic tendency toward category clustering is at least partly the result of a learning bias, which facilitates the transmission of morphology from one generation to the next if each inflectional category has a consistent morphological position. We test this in an online artificial language experiment, teaching adult English speakers a miniature language consisting of noun stems representing shapes and suffixes representing the color and number features of each shape. In one experimental condition, each suffix category has a fixed position, with color in the first position and number in the second position. In a second condition, each specific combination of suffixes has a fixed order, but some combinations have color in the first position, and some have number in the first position. In a third condit...
Oxford Handbooks Online, 2017
In first language acquisition research, little is known so far about the affordances involved in children's acquisition of morphologies of different complexities. This chapter discusses the acquisition of Chintang verbal morphology. Chintang is a Sino-Tibetan (Kiranti) polysynthetic language spoken in a small village in Eastern Nepal by approximately 6,000 speakers. The most complex part of Chintang morphology is verbal inflection. A large number of affixes, verb compounding, and freedom in prefix ordering result in over 1,800 verb forms of single-stem verbs and more than 4,000 forms if a secondary stem is involved. In this chapter we assess the challenges of learning such a complex system, and we describe in detail what this acquisition process looks like. For this we analyze a large longitudinal acquisition corpus of Chintang.
Human communication is strikingly multi-modal, relying on vocal utterances combined with visual gestures, facial expressions and more. Recent efforts to describe multi-modal signal production in our ape relatives have shed important light on the evolutionary trajectory of this core hallmark of human language. However, whilst promising, a systematic quantification of primate signal production which filters out random combinations produced across modalities is currently lacking. Here, through recording the communicative behaviour of wild chimpanzees from the Kibale forest, Uganda, we address this issue and generate the first repertoire of non-random combined vocal and visual components. Using collocation analysis, we identify more than 100 vocal-visual combinations which occur more frequently than expected by chance. We also probe how multi-modal production varies in the population, finding no differences between individuals as a function of age, sex or rank. The number of visual compo...
Cognitive Science
Becoming productive with grammatical categories is a gradual process in children's language development. Here, we investigated this transition process by focusing on Turkish causatives. Previous research examining spontaneous and elicited production of Turkish causatives with familiar verbs attested the onset and early stages of productivity at ages 2 to 3 (Aksu‐Koç & Slobin, 1985; Nakipoğlu, Uzundag, & Sarıgül, 2021). So far, however, we know very little about children's understanding of causatives with novel verbs. In the present study, we asked: (a) When does the generalization of causative morphology in a novel context emerge? and (b) What role does child‐directed input play in this development? To answer the first question, we conducted comprehension‐judgment experiments with children aged 2;6–6;11 using pseudo‐verbs (Study 1 & 2). Results showed that children preferred the Turkish causative suffix ‐DIr over an unrelated or no suffix to denote caused events earliest at ...
PLOS Biology
Humans communicate with small children in unusual and highly conspicuous ways (child-directed communication (CDC)), which enhance social bonding and facilitate language acquisition. CDC-like inputs are also reported for some vocally learning animals, suggesting similar functions in facilitating communicative competence. However, adult great apes, our closest living relatives, rarely signal to their infants, implicating communication surrounding the infant as the main input for infant great apes and early humans. Given cross-cultural variation in the amount and structure of CDC, we suggest that child-surrounding communication (CSC) provides essential compensatory input when CDC is less prevalent, a paramount topic for future studies.
This study investigated whether crosslinguistic differences in the expression of causality influence causal conceptualization of observed events in 3- to 4-year-old Swiss-German-learners and Turkish-learners. In Swiss-German, causality is mainly expressed lexically (e.g., schniidä “to cut”). In Turkish, causality is expressed both lexically, and morphologically with a verbal suffix (e.g., yemek “to eat” vs. yeDIRmek “to feed”). Moreover, unlike Swiss-German, Turkish allows argument ellipsis (e.g. “Mary pushed”). We used pseudo-verbs to test if and how well Swiss children inferred a causal meaning from lexical constructions compared to Turkish children tested in three conditions: lexical, morphological, and morphological constructions with object ellipsis. Swiss children and Turkish children in all three conditions reliably inferred causal meanings, and did so to a similar extent. The findings suggest that children as young as age 3 make use of the specific features of how their nati...
Children acquire their first language while interacting with adults in a highly adaptive manner. While adaptation occurs at many linguistic levels such as syntax and speech complexity, semantic adaptation remains unclear due to the difficulty of efficient meaning extraction. In this study, we examine the adaptation of semantics with a computational approach based on distributional information. We show that adults, in their speech addressed to children, adapt their distributional semantics to that in the speech children produce. By analyzing semantic representations modeled from the Manchester corpus, a large longitudinal acquisition corpus of English, we find striking similarity of semantic development between child and child-directed speech, with a slight time lag in the latter. These findings provide strong evidence for semantic adaptation in first language acquisition and suggest an important role for child-directed speech in semantic learning.
We present the ACQDIV corpus database and aggregation pipeline, a tool developed as part of the European Research Council (ERC) funded project ACQDIV, which aims to identify the universal cognitive processes that allow children to acquire any language. The corpus database represents 15 corpora from 14 typologically maximally diverse languages. Here we give an overview of the project, database, and our extensible software package for adding more corpora to the current language sample. Lastly, we discuss how we use the corpus database to mine for universal patterns in child language acquisition corpora and we describe avenues for future research.
How can infants detect where words or morphemes start and end in the continuous stream of speech? Previous computational studies have investigated this question mainly for English, where morpheme and word boundaries are often isomorphic. Yet in many languages, words are often multimorphemic, such that word and morpheme boundaries do not align. Our study employed corpora of two languages that differ in the complexity of inflectional morphology, Chintang (Sino-Tibetan) and Japanese (in Experiment 1), as well as corpora of artificial languages ranging in morphological complexity, as measured by the ratio and distribution of morphemes per word (in Experiments 2 and 3). We used two baselines and three conceptually diverse word segmentation algorithms, two of which rely purely on sublexical information using distributional cues, and one that builds a lexicon. The algorithms’ performance was evaluated on both word- and morpheme-level representations of the corpora. Segmentation results were...
One of the earliest and most important challenges for language learners is to find out what role arguments play in their language, i.e. how does their language express ‘who is doing what to whom?’. The key problem is how thematic roles (agents, patients, themes etc.) are linked to syntactic elements (noun phrases) and morphological markers (case and agreement markers). While there is disagreement as to the actual point in time when these linking patterns become established and when children have acquired the corresponding syntactic abstractions (Gertner et al. 2006, Dittmar et al. 2008), there is a general consensus that these are all relatively early achievements, at least in languages like English and German. These findings are surprising given the fact that the expression of arguments is an exceedingly complex phenomenon: there are often intricate role constellations in experiencer verbs (cf. e.g. I am afraid of this vs. I fear this), and there is substantial cross-linguistic diversit...
The way infants manage to extract meaning from the speech stream when learning their first language is a highly complex adaptive behavior. This behavior chiefly relies on the ability to extract information from speech they hear and combine it with the external environment they encounter. However, little is known about the underlying distribution of information in speech that conditions this ability. Here we examine properties of this distribution that support meaning extraction in three different types of speech: child-directed speech, adult conversation, and, as a control, written language. We find that verb meanings in child-directed speech can already be successfully extracted from simple co-occurrences of neighboring words, whereas meaning extraction in the other types of speech fundamentally requires access to more complex structural relations between neighboring words. These results suggest that child-directed speech is ideally shaped for a learner who has not yet mastered the...
The aim of this 3-year project (2004-2006) is to provide a rich linguistic and ethnographic documentation of two highly endangered but almost totally undocumented languages in eastern Nepal, Chintang and Puma. These languages belong to the Kiranti family of Tibeto-Burman. The Kiranti groups are known to have a rich and in many areas still highly active oral tradition, which has only sporadically been documented so far (Gaenszle 2002) and not at all for Chintang and Puma. According to the Central Bureau of Statistics of Nepal (2001), 98 languages are spoken in Nepal, but more realistic estimates go well beyond 100. The majority of languages spoken in Nepal are "tribal" languages belonging to Tibeto-Burman. The Kiranti subgroup has more than twenty, perhaps as many as thirty different languages and many more dialects (van Driem 2001, Ebert 2003). Chintang belongs to Eastern Kiranti (such as Limbu and Yakkha) and is spoken in Chintang VDC (Village Development Committee) of Dhankuta district. Puma, which can perhaps be classified as part of Central Kiranti (along with Bantawa Rai, Camling Rai and others), is spoken in the area in and around the Ruwa Khola, to the south of the Khotang bazar in Khotang district. Both languages are highly endangered and are being supplanted by Bantawa, one of the major Kiranti languages (Rai 1985). In a rapidly increasing number of cases, however, speakers entirely give up their native language or Bantawa and switch to Nepali, the national lingua franca. It is likely that Chintang and Puma will no longer be spoken within one or two generations. The constitution of Nepal guarantees the right of its citizens to receive their primary education in their mother tongues, and some of the indigenous languages (e.g. Tamang, Limbu, Bantawa) have been introduced in schools, while some (e.g. Camling, Gurung) are soon to be introduced in primary education as optional subjects in some regions. It has been possible
A rich literature explores unsupervised segmentation algorithms infants could use to parse their input, mainly focusing on English, an analytic language where word, morpheme, and syllable boundaries often coincide. Synthetic languages, where words are multi-morphemic, may present unique difficulties for segmentation. Our study tests corpora of two languages selected to differ in the extent of complexity of their morphological structure, Chintang and Japanese. We use three conceptually diverse word segmentation algorithms and we evaluate them on both word- and morpheme-level representations. As predicted, results for the simpler Japanese are better than those for the more complex Chintang. However, the difference is small compared to the effect of the algorithm (with the lexical algorithm outperforming sub-lexical ones) and the level (scores were lower when evaluating on words versus morphemes). There are also important interactions between language, model, and evaluation level, whic...
One of the most pressing questions in cognitive science remains unanswered: what cognitive mechanisms enable children to learn any of the world’s 7000 or so languages? Much has been discovered about specific learning mechanisms in specific languages. However, given the remarkable diversity of language structures (Evans and Levinson, 2009; Bickel, 2014), the burning question remains: what are the underlying processes that make language acquisition possible, despite substantial cross-linguistic variation in phonology, morphology, syntax, etc.? To investigate these questions, a comprehensive cross-linguistic database of longitudinal child language acquisition corpora from maximally diverse languages has been built.
Data concerning the different responses of dwarf mongooses to paired playbacks of natural and artificial alarm calls (Aerial & T3.1; Terrestrial & T3.2; T3 & T3 artificial)
Emerging data in a range of non-human animal species have highlighted a latent ability to combine certain pre-existing calls together into larger structures. Currently, however, there exists no objective quantification of call combinations. This is problematic because animal calls can co-occur with one another simply by chance. One common approach applied in language sciences to identify recurrent word combinations is collocation analysis. By comparing the co-occurrence of two words with how each word combines with other words within a corpus, collocation analysis can highlight above-chance two-word combinations. Here, we demonstrate how this approach can also be applied to non-human animal communication systems by implementing it on a pseudo dataset. We argue collocation analysis represents a promising tool for identifying non-random, communicatively relevant call combinations in animals.
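As a rough illustration of the idea behind collocation analysis described in this abstract, the sketch below compares how often each two-call combination is observed with how often it would be expected if the first and second calls combined independently. It is not the authors' implementation: the function name `collocation_table`, the call labels, and the toy counts are all hypothetical, and a real analysis would add a significance test on top of the observed/expected ratio.

```python
from collections import Counter

def collocation_table(bigrams):
    """Map each ordered pair (a, b) to (observed, expected, ratio).

    `expected` is the count predicted if first and second elements
    combined independently, given their overall frequencies.
    """
    total = len(bigrams)
    pair_counts = Counter(bigrams)
    first_counts = Counter(a for a, _ in bigrams)
    second_counts = Counter(b for _, b in bigrams)
    table = {}
    for (a, b), observed in pair_counts.items():
        expected = first_counts[a] * second_counts[b] / total
        table[(a, b)] = (observed, expected, observed / expected)
    return table

# Hypothetical pseudo dataset of two-call sequences (labels are invented).
bigrams = (
    [("A", "B")] * 8 + [("A", "C")] * 2 +
    [("D", "B")] * 2 + [("D", "C")] * 8
)
table = collocation_table(bigrams)

# Pairs with ratio > 1 occur more often than independence predicts.
for pair, (obs, exp, ratio) in sorted(table.items()):
    print(pair, obs, exp, round(ratio, 2))
```

In this toy data, ("A", "B") and ("D", "C") come out above chance while the other two pairs fall below it.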
Causation is one of the main features of human cognition and language. An important step in language acquisition is to understand causation and its linguistic expressions. A prerequisite for this is the extraction of causatives from the input. Languages vary in how they express causatives but the main three types are lexical, periphrastic and morphological causatives. While periphrastic and morphological causative constructions can usually be easily traced by detecting periphrastic verbs (e.g. “make” in English) and affixes (e.g. -(s)ase in Japanese), lexical causatives have no explicit marker and are therefore much more difficult to generalize. Essentially, verbs can imply different levels of causality, which might form a continuum rather than exhibiting strict cutting points for lexical causatives. Here we propose a computational method to simulate the extraction and generalization of lexical causatives based on distributional learning. To test whether child-directed speech exhibi...
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
Embedding a clause inside another ("the girl [who likes cars [that run fast]] has arrived") is a fundamental resource that has been argued to be a key driver of linguistic expressiveness. As such, it plays a central role in fundamental debates on what makes human language unique, and how it might have evolved. Empirical evidence on the prevalence and the limits of embeddings has however been based on either laboratory setups or corpus data of relatively limited size. We introduce here a collection of large, dependency-parsed written corpora in 17 languages, that allow us, for the first time, to capture clausal embedding through dependency graphs and assess their distribution. Our results indicate that there is no evidence for hard constraints on embedding depth: the tail of depth distributions is heavy. Moreover, although deeply embedded clauses tend to be shorter, suggesting processing load issues, complex sentences with many embeddings do not display a bias towards less deep embeddings. Taken together, the results suggest that deep embeddings are not disfavored in written language. More generally, our study illustrates how resources and methods from latest-generation big-data NLP can provide new perspectives on fundamental questions in theoretical linguistics.
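To make the notion of embedding depth in a dependency graph concrete, here is a minimal sketch that counts, for each token, how many clause-introducing relations lie on its path to the root. Everything in it is an assumption for illustration rather than the paper's method: the set `CLAUSAL` uses Universal Dependencies-style labels chosen here, the function name `embedding_depths` is hypothetical, and the parse of the abstract's example sentence was written by hand.

```python
# Assumed set of clause-introducing relations (UD-style labels).
CLAUSAL = {"acl", "acl:relcl", "advcl", "ccomp", "xcomp", "csubj"}

def embedding_depths(heads, rels):
    """Return the clausal embedding depth of every token.

    heads[i] is the 1-based index of the head of token i+1 (0 = root);
    rels[i] is the relation of token i+1 to its head.
    Assumes the parse is a well-formed tree (no cycles).
    """
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i + 1
        while node != 0:
            if rels[node - 1] in CLAUSAL:
                depth += 1
            node = heads[node - 1]
        depths.append(depth)
    return depths

# Hand-built parse: "the girl who likes cars that run fast has arrived"
tokens = ["the", "girl", "who", "likes", "cars",
          "that", "run", "fast", "has", "arrived"]
heads = [2, 10, 4, 2, 4, 7, 5, 7, 10, 0]
rels = ["det", "nsubj", "nsubj", "acl:relcl", "obj",
        "nsubj", "acl:relcl", "advmod", "aux", "root"]

depths = embedding_depths(heads, rels)
for token, depth in zip(tokens, depths):
    print(token, depth)
print("maximum depth:", max(depths))
```

Under these assumptions the tokens of "that run fast" sit at depth 2, matching the two nested brackets in the example sentence.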
Cognitive Science, 2022
Inflectional affixes expressing the same grammatical category (e.g., subject agreement) tend to a... more Inflectional affixes expressing the same grammatical category (e.g., subject agreement) tend to appear in the same morphological position in the word. We hypothesize that this cross‐linguistic tendency toward category clustering is at least partly the result of a learning bias, which facilitates the transmission of morphology from one generation to the next if each inflectional category has a consistent morphological position. We test this in an online artificial language experiment, teaching adult English speakers a miniature language consisting of noun stems representing shapes and suffixes representing the color and number features of each shape. In one experimental condition, each suffix category has a fixed position, with color in the first position and number in the second position. In a second condition, each specific combination of suffixes has a fixed order, but some combinations have color in the first position, and some have number in the first position. In a third condit...
Oxford Handbooks Online, 2017
In first language acquisition research so far little is known about the affordances involved in c... more In first language acquisition research so far little is known about the affordances involved in children's acquisition of morphologies of different complexities. This chapter discusses the acquisition of Chintang verbal morphology. Chintang is a Sino-Tibetan (Kiranti) polysynthetic language spoken in a small village in Eastern Nepal by approximately 6,000 speakers. The most complex part of Chintang morphology is verbal inflection. A large number of affixes, verb compounding, and freedom in prefix ordering results in over 1,800 verb forms of single stem verbs and more than 4,000 forms if a secondary stem is involved. In this chapter we assess the challenges of learning such a complex system, and we describe in detail what this acquisition process looks like. For this we analyze a large longitudinal acquisition corpus of Chintang.
Human communication is strikingly multi-modal, relying on vocal utterances combined with visual g... more Human communication is strikingly multi-modal, relying on vocal utterances combined with visual gestures, facial expressions and more. Recent efforts to describe multi-modal signal production in our ape relatives have shed important light on the evolutionary trajectory of this core hallmark of human language. However, whilst promising, a systematic quantification of primate signal production which filters out random combinations produced across modalities is currently lacking. Here, through recording the communicative behaviour of wild chimpanzees from the Kibale forest, Uganda we address this issue and generate the first repertoire of non-random combined vocal and visual components. Using collocation analysis, we identify more than 100 vocal-visual combinations which occur more frequently than expected by chance. We also probe how multi-modal production varies in the population, finding no differences between individuals as a function of age, sex or rank. The number of visual compo...
Cognitive Science
Becoming productive with grammatical categories is a gradual process in children's language d... more Becoming productive with grammatical categories is a gradual process in children's language development. Here, we investigated this transition process by focusing on Turkish causatives. Previous research examining spontaneous and elicited production of Turkish causatives with familiar verbs attested the onset and early stages of productivity at ages 2 to 3 (Aksu‐Koç & Slobin, 1985; Nakipoğlu, Uzundag, & Sarıgül, 2021). So far, however, we know very little about children's understanding of causatives with novel verbs. In the present study, we asked: (a) When does the generalization of causative morphology in a novel context emerge? and (b) What role does child‐directed input play in this development? To answer the first question, we conducted comprehension‐judgment experiments with children aged 2;6–6;11 using pseudo‐verbs (Study 1 & 2). Results showed that children preferred the Turkish causative suffix ‐DIr over an unrelated or no suffix to denote caused events earliest at ...
PLOS Biology
Humans communicate with small children in unusual and highly conspicuous ways (child-directed com... more Humans communicate with small children in unusual and highly conspicuous ways (child-directed communication (CDC)), which enhance social bonding and facilitate language acquisition. CDC-like inputs are also reported for some vocally learning animals, suggesting similar functions in facilitating communicative competence. However, adult great apes, our closest living relatives, rarely signal to their infants, implicating communication surrounding the infant as the main input for infant great apes and early humans. Given cross-cultural variation in the amount and structure of CDC, we suggest that child-surrounding communication (CSC) provides essential compensatory input when CDC is less prevalent—a paramount topic for future studies.
This study investigated whether crosslinguistic differences in the expression of causality influe... more This study investigated whether crosslinguistic differences in the expression of causality influence causal conceptualization of observed events in 3- to 4-year-old Swiss-German-learners and Turkish-learners. In Swiss-German, causality is mainly expressed lexically (e.g., schniidä “to cut”). In Turkish, causality is expressed both lexically, and morphologically with a verbal suffix (e.g., yemek “to eat” vs. yeDIRmek “to feed”). Moreover, unlike Swiss-German, Turkish allows argument ellipsis (e.g. “Mary pushed”). We used pseudo-verbs to test if and how well Swiss children inferred a causal meaning from lexical constructions compared to Turkish children tested in three conditions: lexical, morphological, and morphological constructions with object ellipsis. Swiss children and Turkish children in all three conditions reliably inferred causal meanings, and did so to a similar extent. The findings suggest that children as young as age 3 make use of the specific features of how their nati...
Children acquire their first language while interacting with adults in a highly adaptive manner. ... more Children acquire their first language while interacting with adults in a highly adaptive manner. While adaptation occurs at many linguistic levels such as syntax and speech complexity, semantic adaptation remains unclear due to the difficulty of efficient meaning extraction. In this study, we examine the adaptation of semantics with a computational approach based on distributional information. We show that adults, in their speech addressed to children, adapt their distributional semantics to that in the speech children produce. By analyzing semantic representations modeled from the Manchester corpus, a large longitudinal acquisition corpus of English, we find striking similarity of semantic development between child and child-directed speech, with a slight time lag in the latter. These findings provide strong evidence for the semantic adaptation in first language acquisition and suggest the important role of child-directed speech in semantic learning.
We present the ACQDIV corpus database and aggregation pipeline, a tool developed as part of the E... more We present the ACQDIV corpus database and aggregation pipeline, a tool developed as part of the European Research Council (ERC) funded project ACQDIV, which aims to identify the universal cognitive processes that allow children to acquire any language. The corpus database represents 15 corpora from 14 typologically maximally diverse languages. Here we give an overview of the project, database, and our extensible software package for adding more corpora to the current language sample. Lastly, we discuss how we use the corpus database to mine for universal patterns in child language acquisition corpora and we describe avenues for future research.
How can infants detect where words or morphemes start and end in the continuous stream of speech?... more How can infants detect where words or morphemes start and end in the continuous stream of speech? Previous computational studies have investigated this question mainly for English, where morpheme and word boundaries are often isomorphic. Yet in many languages, words are often multimorphemic, such that word and morpheme boundaries do not align. Our study employed corpora of two languages that differ in the complexity of inflectional morphology, Chintang (Sino-Tibetan) and Japanese (in Experiment 1), as well as corpora of artificial languages ranging in morphological complexity, as measured by the ratio and distribution of morphemes per word (in Experiments 2 and 3). We used two baselines and three conceptually diverse word segmentation algorithms, two of which rely purely on sublexical information using distributional cues, and one that builds a lexicon. The algorithms’ performance was evaluated on both word- and morpheme-level representations of the corpora.Segmentation results were...
One of the earliest and most important challenges for language learners is to find out what role ... more One of the earliest and most important challenges for language learners is to find out what role arguments play in their language, i.e. how does their language express ‘who is doing what to whom?’. e key problem is how thematic roles (agents, patients, themes etc.) are linked to syntactic elements (noun phrases) and morphological markers (case and agreement markers). While there is disagreement as to the actual point in time when these linking paerns become established and when children have acquired the corresponding syntactic abstractions (Gertner et al. 2006, Dimar et al. 2008), there is a general consensus that these are all relatively early achievements, at least in languages like English and German. ese findings are surprising given the fact that the expression of arguments is an exceedingly complex phenomenon: there are oen intricate role constellations in experiencer verbs (cf. e.g. I am afraid of this vs. I fear this), and there is substantial cross-linguistic diversit...
The way infants manage to extract meaning from the speech stream when learning their first langua... more The way infants manage to extract meaning from the speech stream when learning their first language is a highly complex adaptive behavior. This behavior chiefly relies on the ability to extract information from speech they hear and combine it with the external environment they encounter. However, little is known about the underlying distribution of information in speech that conditions this ability. Here we examine properties of this distribution that support meaning extraction in three different types of speech: child-directed speech, adult conversation, and, as a control, written language. We find that verb meanings in child-directed speech can already be successfully extracted from simple co-occurrences of neighboring words, whereas meaning extraction in the other types of speech fundamentally requires access to more complex structural relations between neighboring words. These results suggest that child-directed speech is ideally shaped for a learner who has not yet mastered the...
The aim of this 3-year project (2004-2006) is to provide a rich linguistic and ethnographic documentation of two highly endangered but almost totally undocumented languages in eastern Nepal, Chintang and Puma. These languages belong to the Kiranti family of Tibeto-Burman. The Kiranti groups are known to have a rich and in many areas still highly active oral tradition, which has only sporadically been documented so far (Gaenszle 2002) and not at all for Chintang and Puma. According to the Central Bureau of Statistics of Nepal (2001), 98 languages are spoken in Nepal, but more realistic estimates go well beyond 100. The majority of languages spoken in Nepal are "tribal" languages belonging to Tibeto-Burman. The Kiranti subgroup has more than twenty, perhaps as many as thirty different languages and many more dialects (van Driem 2001, Ebert 2003). Chintang belongs to Eastern Kiranti (such as Limbu and Yakkha) and is spoken in Chintang VDC (Village Development Committee) of Dhankuta district. Puma, which can perhaps be classified as part of Central Kiranti (along with Bantawa Rai, Camling Rai and others), is spoken in the area in and around the Ruwa Khola, to the south of the Khotang bazar in Khotang district. Both languages are highly endangered and are being supplanted by Bantawa, one of the major Kiranti languages (Rai 1985). In a rapidly increasing number of cases, however, speakers entirely give up their native language or Bantawa and switch to Nepali, the national lingua franca. It is likely that Chintang and Puma will no longer be spoken within one or two generations. The constitution of Nepal guarantees the right of its citizens to receive their primary education in their mother tongues, and some of the indigenous languages (e.g. Tamang, Limbu, Bantawa) have been introduced in schools, while some (e.g. Camling, Gurung) are soon to be introduced in primary education as optional subjects in some regions. It has been possible
A rich literature explores unsupervised segmentation algorithms infants could use to parse their input, mainly focusing on English, an analytic language where word, morpheme, and syllable boundaries often coincide. Synthetic languages, where words are multi-morphemic, may present unique difficulties for segmentation. Our study tests corpora of two languages selected to differ in the extent of complexity of their morphological structure, Chintang and Japanese. We use three conceptually diverse word segmentation algorithms and we evaluate them on both word- and morpheme-level representations. As predicted, results for the simpler Japanese are better than those for the more complex Chintang. However, the difference is small compared to the effect of the algorithm (with the lexical algorithm outperforming sub-lexical ones) and the level (scores were lower when evaluating on words versus morphemes). There are also important interactions between language, model, and evaluation level, whic...
One of the most pressing questions in cognitive science remains unanswered: what cognitive mechanisms enable children to learn any of the world’s 7000 or so languages? Much has been discovered about specific learning mechanisms in specific languages. However, given the remarkable diversity of language structures (Evans and Levinson, 2009; Bickel, 2014), the burning question remains: what are the underlying processes that make language acquisition possible, despite substantial cross-linguistic variation in phonology, morphology, syntax, etc.? To investigate these questions, a comprehensive cross-linguistic database of longitudinal child language acquisition corpora from maximally diverse languages has been built.
Data concerning the different responses of dwarf mongooses to paired playbacks of natural and artificial alarm calls (Aerial & T3.1; Terrestrial & T3.2; T3 & T3 artificial)
Emerging data in a range of non-human animal species have highlighted a latent ability to combine certain pre-existing calls together into larger structures. Currently, however, there exists no objective quantification of call combinations. This is problematic because animal calls can co-occur with one another simply through chance alone. One common approach applied in language sciences to identify recurrent word combinations is collocation analysis. By comparing the co-occurrence of two words with how each word combines with other words within a corpus, collocation analysis can highlight above-chance two-word combinations. Here, we demonstrate how this approach can also be applied to non-human animal communication systems by implementing it on a pseudo dataset. We argue collocation analysis represents a promising tool for identifying non-random, communicatively relevant call combinations in animals.
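The idea in the abstract above, comparing how often two calls co-occur with how often each occurs overall, can be illustrated with a small sketch. It uses pointwise mutual information (PMI) as a simple stand-in association measure; this is an assumption made for illustration and not necessarily the measure used in the paper. The function name `bigram_pmi` and the call labels in the toy bouts are hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(sequences):
    """Return {(a, b): PMI} for every adjacent pair of calls in the sequences.

    PMI = log2( P(a, b) / (P(a) * P(b)) ); values above 0 mean the pair
    occurs more often than expected if the two calls were independent.
    """
    unigrams = Counter()
    bigrams = Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))  # adjacent pairs within one bout

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Invented toy data: each inner list is one bout of calls (labels are made up).
bouts = [["A", "B", "C"], ["A", "B"], ["C", "A", "B"], ["C", "C"], ["B", "C"]]
scores = bigram_pmi(bouts)
best_pair = max(scores, key=scores.get)
```

PMI is known to overrate rare pairs, so on a real dataset one would likely add a minimum-frequency threshold or use a significance-based association measure instead.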
Causation is one of the main features of human cognition and language. An important step in language acquisition is to understand causation and its linguistic expressions. A prerequisite for this is the extraction of causatives from the input. Languages vary in how they express causatives but the main three types are lexical, periphrastic and morphological causatives. While periphrastic and morphological causative constructions can usually be easily traced by detecting periphrastic verbs (e.g. “make” in English) and affixes (e.g. -(s)ase in Japanese), lexical causatives have no explicit marker and are therefore much more difficult to generalize. Essentially, verbs can imply different levels of causality, which might form a continuum rather than exhibiting strict cutting points for lexical causatives. Here we propose a computational method to simulate the extraction and generalization of lexical causatives based on distributional learning. To test whether child-directed speech exhibi...
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
Embedding a clause inside another ("the girl [who likes cars [that run fast]] has arrived") is a fundamental resource that has been argued to be a key driver of linguistic expressiveness. As such, it plays a central role in fundamental debates on what makes human language unique, and how they might have evolved. Empirical evidence on the prevalence and the limits of embeddings has however been based on either laboratory setups or corpus data of relatively limited size. We introduce here a collection of large, dependency-parsed written corpora in 17 languages, that allow us, for the first time, to capture clausal embedding through dependency graphs and assess their distribution. Our results indicate that there is no evidence for hard constraints on embedding depth: the tail of depth distributions is heavy. Moreover, although deeply embedded clauses tend to be shorter, suggesting processing load issues, complex sentences with many embeddings do not display a bias towards less deep embeddings. Taken together, the results suggest that deep embeddings are not disfavored in written language. More generally, our study illustrates how resources and methods from latest-generation big-data NLP can provide new perspectives on fundamental questions in theoretical linguistics.
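As a rough illustration of how clausal embedding depth might be read off a dependency graph, the sketch below counts clause-introducing relations on each token's path to the root. It assumes Universal Dependencies-style relation labels and a hand-written parse of the abstract's example sentence; the label set `CLAUSAL`, the function name `embedding_depths`, and the parse itself are assumptions for illustration, not the authors' pipeline.

```python
# Relation labels treated as introducing an embedded clause (assumed, UD-style).
CLAUSAL = {"acl", "advcl", "ccomp", "xcomp", "csubj"}


def embedding_depths(heads, rels):
    """Return the clausal embedding depth of every token.

    heads[i] is the 1-based index of the head of token i+1 (0 marks the root);
    rels[i] is that token's relation label. The input must be a valid tree.
    """
    depths = []
    for i in range(len(heads)):
        depth, node = 0, i + 1
        while node != 0:  # walk up to the root
            # Strip subtypes such as "acl:relcl" before checking the label.
            if rels[node - 1].split(":")[0] in CLAUSAL:
                depth += 1
            node = heads[node - 1]
        depths.append(depth)
    return depths


# Hand-made parse of: the girl who likes cars that run fast has arrived
tokens = ["the", "girl", "who", "likes", "cars", "that", "run", "fast", "has", "arrived"]
heads = [2, 10, 4, 2, 4, 7, 5, 7, 10, 0]
rels = ["det", "nsubj", "nsubj", "acl:relcl", "obj",
        "nsubj", "acl:relcl", "advmod", "aux", "root"]

depths = embedding_depths(heads, rels)
max_depth = max(depths)
```

Here the tokens of the inner relative clause ("that run fast") sit at depth 2, matching the two nested brackets in the example sentence.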