Alex Warstadt

I am a computational linguist and an Assistant Professor at UC San Diego with appointments in Linguistics and the Halıcıoğlu Data Science Institute.

I am the PI of UC San Diego’s Learning, Meaning, and Natural language lab (LeM🍋N Lab). The group focuses on interdisciplinary research in linguistics, computational cognitive modeling, and natural language processing. We use advances in machine learning to understand why human language is the way it is, how children come to acquire it, and how information is conveyed across multiple channels. We use insights from linguistics and cognitive science to advance compute- and data-efficient learning in LMs and to evaluate and interpret how LMs learn and represent grammatical structures and meaning.

Prior to coming to UC San Diego:

  1. Book Chapter
    What artificial neural networks can tell us about human language acquisition
    In Algebraic Structures in Natural Language, 2022
    Rapid progress in machine learning for natural language processing has the potential to transform debates about how humans learn language. However, the learning environments and biases of current artificial learners and humans diverge in ways that weaken the impact of the evidence obtained from learning simulations. For example, today’s most effective neural language models are trained on roughly one thousand times the amount of linguistic data available to a typical child. To increase the relevance of learnability results from computational models, we need to train model learners without significant advantages over humans. If an appropriate model successfully acquires some target linguistic knowledge, it can provide a proof of concept that the target is learnable in a hypothesized human learning scenario. Plausible model learners will enable us to carry out experimental manipulations to make causal inferences about variables in the learning environment, and to rigorously test poverty-of-the-stimulus-style claims arguing for innate linguistic knowledge in humans. Comparable experiments will never be possible with human subjects due to practical and ethical considerations. So far, attempts to deprive current models of unfair advantages fail to achieve human-level grammatical knowledge. But before we can justifiably conclude that language learning requires more prior domain-specific knowledge than current models possess, we must first explore other training regimes as ways to make computational learners more efficient at learning from limited linguistic input.
  2. TACL (to appear)
    Investigating Critical Period Effects in Language Acquisition through Neural Language Models
    Ionut Constantinescu, Tiago Pimentel, Ryan Cotterell, and 1 more author
    Transactions of the Association for Computational Linguistics, 2024
Humans appear to have a critical period (CP) for language acquisition: Second language (L2) acquisition becomes harder after early childhood, and ceasing exposure to a first language (L1) after this period (but not before) typically does not lead to substantial loss of L1 proficiency. It is unknown whether these CP effects result from innately determined brain maturation or from a stabilization of neural connections naturally induced by experience. In this study, we use language models (LMs) to test the extent to which these phenomena are peculiar to humans, or shared by a broader class of language learners. We vary the age of exposure by training LMs on language pairs in various experimental conditions, and find that LMs, which lack any direct analog to innate maturational stages, do not show CP effects when the age of exposure to L2 is delayed. Our results contradict the claim that CP effects are an inevitable result of statistical learning, and they are consistent with an innate mechanism for CP effects. We show that we can reverse-engineer the CP by introducing a regularizer partway through training to simulate a maturational decrease in plasticity. All in all, our results suggest that L1 learning on its own may not be enough to induce a CP, and additional engineering is necessary to make language models more cognitively plausible.
  3. EMNLP
    Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually)
    Alex Warstadt, Yian Zhang, Xiaocheng Li, and 2 more authors
    In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), Nov 2020
One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-tuning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), which consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during finetuning. We pretrain RoBERTa from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa_BASE. We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones. Eventually, with about 30B words of pretraining data, RoBERTa_BASE does demonstrate a linguistic bias with some regularity. We conclude that while self-supervised pretraining is an effective way to learn helpful inductive biases, there is likely room to improve the rate at which models learn which features matter.
  4. Sinn und Bedeutung
    Non-resolving responses to polar questions: A revision to the QUD theory of relevance
    Omar Agha, and Alex Warstadt
    In Proceedings of Sinn und Bedeutung, Sep 2020
The influential Question Under Discussion (QUD) theory of discourse (Roberts, 2012) formalizes Grice’s notion of relevance. In this paper, we identify a class of relevant discourse moves where Roberts’s account undergenerates, and propose a more inclusive definition of relevance. For example, if asked Should we cancel the picnic?, one can reply If it rains without fully resolving the question. However, in Roberts’s theory, all relevant responses to polar questions are predicted to fully resolve the question because a relevant answer must eliminate at least one alternative in the QUD. We propose that a non-resolving response to a polar question is relevant if it eliminates a set of worlds that overlaps with only some alternatives in the QUD. The new account turns out to make good predictions in the domain of polar questions, and beyond.
  5. TACL
    BLiMP: The Benchmark of Linguistic Minimal Pairs for English
    Alex Warstadt, Alicia Parrish, Haokun Liu, and 4 more authors
    Transactions of the Association for Computational Linguistics, Sep 2020
We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP), a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs—that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate a specific phenomenon in syntax, morphology, or semantics. We generate the data according to linguist-crafted grammar templates, and human aggregate agreement with the labels is 96.4%. We evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs by observing whether they assign a higher probability to the acceptable sentence in each minimal pair. We find that state-of-the-art models reliably identify morphological contrasts related to agreement, but they struggle with some subtle semantic and syntactic phenomena, such as negative polarity items and extraction islands.
  6. EMNLP
    Quantifying the redundancy between prosody and text
    Lukas Wolf, Tiago Pimentel, Evelina Fedorenko, and 4 more authors
    In Proceedings of the 2023 conference on empirical methods in natural language processing (EMNLP), Dec 2023
Prosody—the suprasegmental component of speech, including pitch, loudness, and tempo—carries critical aspects of meaning. However, the relationship between the information conveyed by prosody vs. by the words themselves remains poorly understood. We use large language models (LLMs) to estimate how much information is redundant between prosody and the words themselves. Using a large spoken corpus of English audiobooks, we extract prosodic features aligned to individual words and test how well they can be predicted from LLM embeddings, compared to non-contextual word embeddings. We find a high degree of redundancy between the information carried by the words and prosodic information across several prosodic features, including intensity, duration, pauses, and pitch contours. Furthermore, a word’s prosodic information is redundant with both the word itself and the context preceding as well as following it. Still, we observe that prosodic features cannot be fully predicted from text, suggesting that prosody carries information above and beyond the words. Along with this paper, we release a general-purpose data processing pipeline for quantifying the relationship between linguistic information and extra-linguistic features.
  7. Sinn und Bedeutung
    "Just" don’t ask: Exclusives and potential questions
    Alex Warstadt
    In Proceedings of Sinn und Bedeutung, Sep 2020
    The English exclusive just is not synonymous with other exclusives such as only in sentences like Sometimes, bad things just/only happen. I give a new analysis of just which explains this and other puzzling readings of just observed in earlier work (e.g. Wiegand, 2016; Beltrama, 2018). I argue that just excludes alternatives derived from a potential question, or possible future QUD, in the sense of Onea (2016). This new perspective makes it possible to give the first unified account of these non-canonical exclusive readings of just, and provides evidence that the semantics of lexical items can be sensitive to possible futures of the discourse.
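The forced-choice evaluation at the heart of BLiMP (item 5 above) is simple to state: an LM "passes" a minimal pair if it assigns higher total probability to the acceptable sentence. The sketch below illustrates that protocol with an add-alpha smoothed bigram model as a toy stand-in for the n-gram, LSTM, and Transformer LMs evaluated in the paper; the training corpus and the minimal pair are invented for this example and are not BLiMP data.

```python
import math
from collections import Counter

def train_bigram(corpus, alpha=1.0):
    """Fit an add-alpha smoothed bigram LM over whitespace tokens.

    Returns a function mapping a sentence to its total log-probability.
    """
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        words = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(words)
        for a, b in zip(words, words[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    V = len(vocab)

    def logprob(sent):
        words = ["<s>"] + sent.split() + ["</s>"]
        return sum(
            math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * V))
            for a, b in zip(words, words[1:])
        )

    return logprob

def prefers_acceptable(logprob, good, bad):
    """BLiMP-style forced choice: True if the LM scores the acceptable
    sentence above its minimally different unacceptable counterpart."""
    return logprob(good) > logprob(bad)

# Tiny training corpus, invented for illustration.
lm = train_bigram(["the dog barks", "a dog barks", "the dogs bark", "dogs bark"])

# A subject-verb agreement minimal pair, in the spirit of BLiMP's templates.
print(prefers_acceptable(lm, "the dog barks", "the dog bark"))  # True
```

Because both sentences differ in a single word, any difference in total probability reflects the model's sensitivity to that one contrast, which is what makes accuracy over many such pairs an interpretable measure of grammatical knowledge.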