Yair Lakretz | Tel Aviv University
Papers by Yair Lakretz
arXiv (Cornell University), Jun 9, 2022
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 444 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
The BIG-bench github code infrastructure and documentation was developed by Guy Gur-Ari,
Sequence processing in humans is thought to rely on two distinct mechanisms: the computation of transition probabilities between adjacent elements and the extraction of larger hierarchical structures. Previous studies indicate that both mechanisms contribute to auditory sequence processing, but whether language processing involves one or the other remains debated. To address this issue, we designed a linguistic version of the local-global auditory test, which contrasts sequential versus hierarchical violations of expectations in sentences, and we searched for violation responses in both human magnetoencephalography and computational models. We found that in models, both mechanisms coexist, whereas humans only show hierarchical structure effects. Our results suggest that human sentence processing is dominated by structure-based computations and robust to sequential effects. They point to major differences between language processing in humans versus neural models and, within humans, ...
arXiv (Cornell University), Jan 6, 2021
One of the fundamental principles of contemporary linguistics states that language processing requires the ability to extract recursively nested tree structures. However, it remains unclear whether and how this code could be implemented in neural circuits. Recent advances in Recurrent Neural Networks (RNNs), which achieve near-human performance in some language tasks, provide a compelling model to address such questions. Here, we present a new framework to study recursive processing in RNNs, using subject-verb agreement as a probe into the representations of the neural network. We trained six distinct types of RNNs on a simplified probabilistic context-free grammar designed to independently manipulate the length of a sentence and the depth of its syntactic tree. All RNNs generalized to subject-verb dependencies longer than those seen during training. However, none systematically generalized to deeper tree structures, even those with a structural bias towards learning nested trees (i.e., stack-RNNs). In addition, our analyses revealed primacy and recency effects in the generalization patterns of LSTM-based models, showing that these models tend to perform well on the outermost and innermost parts of a center-embedded tree structure, but poorly on its middle levels. Finally, probing the internal states of the model during the processing of sentences with nested tree structures, we found a complex encoding of grammatical agreement information (e.g., grammatical number), in which all the information for multiple nouns was carried by a single unit. Taken together, these results indicate how neural networks may extract bounded nested tree structures, without learning a systematic recursive rule.
ArXiv, 2020
Recursive processing in sentence comprehension is considered a hallmark of human linguistic abilities. However, its underlying neural mechanisms remain largely unknown. We studied whether a recurrent neural network with Long Short-Term Memory units can mimic a central aspect of human sentence processing, namely the handling of long-distance agreement dependencies. Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of a small set of specialized units that successfully handled local and long-distance syntactic agreement for grammatical number. However, simulations showed that this mechanism does not support full recursion and fails with some long-range embedded dependencies. We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns, with or without embedding. Human and model er...
Proceedings of the 2019 Conference of the North, 2019
Recent work has shown that LSTMs trained on a generic language modeling objective capture syntax-sensitive generalizations such as long-distance number agreement. However, we have no mechanistic understanding of how they accomplish this remarkable feat. Some have conjectured it depends on heuristics that do not truly take hierarchical structure into account. We present here a detailed study of the inner mechanics of number tracking in LSTMs at the single neuron level. We discover that long-distance number information is largely managed by two "number units". Importantly, the behaviour of these units is partially controlled by other units independently shown to track syntactic structure. We conclude that LSTMs are, to some extent, implementing genuinely syntactic processing mechanisms, paving the way to a more general understanding of grammatical encoding in LSTMs.
Findings of the Association for Computational Linguistics: ACL 2023
arXiv (Cornell University), Feb 28, 2023
A sentence is more than the sum of its words: its meaning depends on how they combine with one another. The brain mechanisms underlying such semantic composition remain poorly understood. To shed light on the neural vector code underlying semantic composition, we introduce two hypotheses: First, the intrinsic dimensionality of the space of neural representations should increase as a sentence unfolds, paralleling the growing complexity of its semantic representation, and second, this progressive integration should be reflected in ramping and sentence-final signals. To test these predictions, we designed a dataset of closely matched normal and Jabberwocky sentences (composed of meaningless pseudo words) and displayed them to deep language models and to 11 human participants (5 men and 6 women) monitored with simultaneous magneto-encephalography and intracranial electro-encephalography. In both deep language models and electrophysiological data, we found that representational dimension...
Trends in Cognitive Sciences
Recursive processing is considered a hallmark of human linguistic abilities. A recent study evaluated recursive processing in recurrent neural language models (RNN-LMs) and showed that such models perform below chance level on embedded dependencies within nested constructions – a prototypical example of recursion in natural language. Here, we study if state-of-the-art Transformer LMs do any better. We test four different Transformer LMs on two different types of nested constructions, which differ in whether the embedded (inner) dependency is short or long range. We find that Transformers achieve near-perfect performance on short-range embedded dependencies, significantly better than previous results reported for RNN-LMs and humans. However, on long-range embedded dependencies, Transformers’ performance sharply drops below chance level. Remarkably, the addition of only three words to the embedded dependency caused Transformers to fall from near-perfect to below-chance performance. Ta...
ArXiv, 2021
One of the fundamental principles of contemporary linguistics states that language processing requires the ability to extract recursively nested tree structures. However, it remains unclear whether and how this code could be implemented in neural circuits. Recent advances in Recurrent Neural Networks (RNNs), which achieve near-human performance in some language tasks, provide a compelling model to address such questions. Here, we present a new framework to study recursive processing in RNNs, using subject-verb agreement as a probe into the representations of the neural network. We trained six distinct types of RNNs on a simplified probabilistic context-free grammar designed to independently manipulate the length of a sentence and the depth of its syntactic tree. All RNNs generalized to subject-verb dependencies longer than those seen during training. However, none systematically generalized to deeper tree structures, even those with a structural bias towards learning nested trees (i....
Ferrigno et al. [2020] introduced an ingenious task to investigate recursion in human and non-human primates. American adults, Tsimane adults, and 3-5 year-old children successfully performed the task. Macaque monkeys required additional training, but two out of three eventually showed good generalization and scored above many Tsimane and child participants. Moreover, when tested on sequences composed of new bracket signs, the monkeys still showed good performance. The authors thus concluded that recursive nesting is not unique to humans. Here, we dispute the claim by showing that at least two alternative interpretations remain tenable. We first examine this conclusion in light of recent findings from modern artificial recurrent neural networks (RNNs), regarding how these networks encode sequences. We show that although RNNs, like monkeys, succeed on demanding generalization tasks as in Ferrigno et al., the underlying neural mechanisms are not recursive. Moreover, we show that when ...
Entropy
Sentence comprehension requires inferring, from a sequence of words, the structure of syntactic relationships that bind these words into a semantic representation. Our limited ability to build some specific syntactic structures, such as nested center-embedded clauses (e.g., “The dog that the cat that the mouse bit chased ran away”), suggests a striking capacity limitation of sentence processing, and thus offers a window to understand how the human brain processes sentences. Here, we review the main hypotheses proposed in psycholinguistics to explain such capacity limitation. We then introduce an alternative approach, derived from our recent work on artificial neural networks optimized for language modeling, and predict that capacity limitation derives from the emergence of sparse and feature-specific syntactic units. Unlike psycholinguistic theories, our neural network-based framework provides precise capacity-limit predictions without making any a priori assumptions about the form ...
BMC Neuroscience
Recent studies have demonstrated the capacity of hippocampal sequences associated with theta oscillation, to encode P160 The effect of progressive degradation of connectivity between brain areas on the brain network structure
Reading is a rapid, distributed process that engages multiple components of the ventral visual stream. However, the neural constituents and their interactions that allow us to identify written words are not well understood. Using direct intracranial recordings in a large cohort of humans, we comprehensively isolated the spatiotemporal dynamics of visual word recognition across the entire left ventral occipitotemporal cortex. The mid-fusiform cortex is the first region that is sensitive to word identity and to both sub-lexical and lexical frequencies. Its activation, response latency and amplitude, are highly dependent on the statistics of natural language. Information about lexicality and word frequency propagates posteriorly from this region to traditional visual word form regions and to earlier visual cortex. This unique sensitivity of mid-fusiform cortex to the lexical characteristics of written words points to its central role as an orthographic lexicon, which accesses the long-...
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '15, 2015