Robert Berwick | Massachusetts Institute of Technology (MIT)
Papers by Robert Berwick
Biolinguistics
Several theoretical proposals for the evolution of language have sparked a renewed search for comparative data on human and non-human animal computational capacities. However, conceptual confusions still hinder the field, leading to experimental evidence that fails to test for comparable human competences. Here we focus on two conceptual and methodological challenges that affect the field generally: 1) properly characterizing the computational features of the faculty of language in the narrow sense; 2) defining and probing for human language-like computations via artificial language learning experiments in non-human animals. Our intent is to be critical in the service of clarity, in what we agree is an important approach to understanding how language evolved.
Artificial Intelligence at MIT, 1990
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics, 2021
Accurate recovery of predicate-argument structure from a Universal Dependency (UD) parse is central to downstream tasks such as extraction of semantic roles or event representations. This study introduces compchains, a categorization of the hierarchy of predicate dependency relations present within a UD parse. Accuracy of compchain classification serves as a proxy for measuring accurate recovery of predicate-argument structure from sentences with embedding. We analyzed the distribution of compchains in three UD English treebanks, EWT, GUM and LinES, revealing that these treebanks are sparse with respect to sentences with predicate-argument structure that includes predicate-argument embedding. We evaluated the CoNLL 2018 Shared Task UDPipe (v1.2) baseline (dependency parsing) models as compchain classifiers for the EWT, GUM and LinES UD treebanks. Our results indicate that these three baseline models exhibit poorer performance on sentences with predicate-argument structure with more than one level of embedding; we used compchains to characterize the errors made by these parsers and present examples of erroneous parses that were identified using compchains. We also analyzed the distribution of compchains in 58 non-English UD treebanks and then used compchains to evaluate the CoNLL 2018 Shared Task baseline model for each of these treebanks. Our analysis shows that performance with respect to compchain classification is only weakly correlated with the official evaluation metrics (LAS, MLAS and BLEX). We identify gaps in the distribution of compchains in several of the UD treebanks, thus providing a roadmap for how these treebanks may be supplemented. We conclude by discussing how compchains provide a new perspective on the sparsity of training data for UD parsers, as well as the accuracy of the resulting UD parses.
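The abstract does not reproduce the paper's formal definition of a compchain, so the following is only a rough sketch of the underlying idea: walk a CoNLL-U parse from the root predicate and collect the chain of clausal-embedding relations. The relation set, the `compchain` helper, and the example sentence are illustrative assumptions, and the `conllu` package is assumed to be installed.

```python
# Rough sketch only: a "compchain" is approximated here as the longest chain
# of clausal-embedding relations (ccomp, xcomp, csubj, advcl) reachable from
# the root predicate. The paper's formal definition may differ.
from conllu import parse  # pip install conllu

CLAUSAL = {"ccomp", "xcomp", "csubj", "advcl"}

def compchain(sentence):
    children, root = {}, None
    for tok in sentence:
        if not isinstance(tok["id"], int):
            continue  # skip multiword-token ranges
        children.setdefault(tok["head"], []).append(tok)
        if tok["deprel"] == "root":
            root = tok
    def walk(tok):
        chains = [[]]
        for child in children.get(tok["id"], []):
            if child["deprel"] in CLAUSAL:
                chains += [[child["deprel"]] + c for c in walk(child)]
        return chains
    return max(walk(root), key=len) if root else []

conllu_text = """1\tMary\tMary\tPROPN\t_\t_\t2\tnsubj\t_\t_
2\tsaid\tsay\tVERB\t_\t_\t0\troot\t_\t_
3\tshe\tshe\tPRON\t_\t_\t4\tnsubj\t_\t_
4\twanted\twant\tVERB\t_\t_\t2\tccomp\t_\t_
5\tto\tto\tPART\t_\t_\t6\tmark\t_\t_
6\tleave\tleave\tVERB\t_\t_\t4\txcomp\t_\t_
"""
print(compchain(parse(conllu_text)[0]))  # ['ccomp', 'xcomp']
```

On this approximation, "Mary said she wanted to leave" has a two-level chain, which is exactly the kind of multiply embedded predicate-argument structure the paper reports as sparse in the treebanks.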
PLOS Biology, 2019
In their Essay on the evolution of human language, Martins and Boeckx seek to refute what they call the "half-Merge fallacy": the conclusion that the most elementary computational operation for human language syntax, binary set formation, or "Merge," evolved in a single step. We show that their argument collapses. It is based on a serious misunderstanding of binary set formation as well as formal language theory. Furthermore, their specific evolutionary scenario counterproposal for a "two-step" evolution of Merge does not work. Although we agree with their Essay on several points, including that there must have been many steps in the evolution of human language and that it is important to understand how language and language syntax are implemented in the brain, we disagree that there is any justification, empirical or conceptual, for the decomposition of binary set formation into separate steps.
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018
While long short-term memory (LSTM) neural net architectures are designed to capture sequence information, human language is generally composed of hierarchical structures. This raises the question of whether LSTMs can learn hierarchical structures. We explore this question with a well-formed bracket prediction task using two types of brackets modeled by an LSTM. Demonstrating that such a system is learnable by an LSTM is the first step in demonstrating that the entire class of context-free languages (CFLs) is also learnable. We observe that the model requires memory that grows exponentially with the number of characters and the embedding depth, where sub-linear memory should suffice. Still, the model does more than memorize the training input: it learns how to distinguish between relevant and irrelevant information. On the other hand, we also observe that the model does not generalize well. We conclude that LSTMs do not learn the relevant underlying context-free rules, and that their good overall performance is instead attained by efficiently evaluating nuisance variables. LSTMs are a way to quickly reach good results for many natural language tasks, but to understand and generate natural language one has to investigate other concepts that can make more direct use of natural language's structural nature.
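The abstract does not spell out the task setup, so the sketch below shows one plausible form of the two-bracket-type (Dyck-2) prediction problem it describes: generate well-formed strings and compute the legal next characters at each prefix. The generator parameters are arbitrary assumptions; the point is that a correct predictor needs only the stack of currently open brackets, i.e., memory linear in depth rather than in string length.

```python
# Sketch of a Dyck-2 style bracket task (assumed setup, not the paper's exact
# data generator): produce well-formed strings over two bracket types and
# compute the set of grammatical next characters at each prefix.
import random

PAIRS = {"(": ")", "[": "]"}

def gen_dyck2(max_depth=4, p_close=0.5, rng=random.Random(0)):
    out, stack = [], []
    while not out or stack:
        if stack and (len(stack) >= max_depth or rng.random() < p_close):
            out.append(PAIRS[stack.pop()])   # close the innermost open bracket
        else:
            bracket = rng.choice("([")
            stack.append(bracket)
            out.append(bracket)
    return "".join(out)

def legal_next(prefix):
    stack = []
    for ch in prefix:
        if ch in PAIRS:
            stack.append(ch)
        else:
            stack.pop()
    options = set(PAIRS)                     # opening is always grammatical
    if stack:
        options.add(PAIRS[stack[-1]])        # plus only the matching closer
    return options

s = gen_dyck2()
print(s)
for i in (1, 2, 3):
    print(s[:i], "->", sorted(legal_next(s[:i])))
```

An LSTM that truly learned the context-free rule would implement something equivalent to `legal_next`'s stack; the paper's finding is that the trained models instead consume far more memory than this and fail to generalize.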
In Physical Review Letters 73(2), 5 Dec. 1994, Mantegna et al. conclude on the basis of Zipf rank-frequency data that noncoding DNA sequence regions are more like natural languages than coding regions. We argue on the contrary that an empirical fit to Zipf's "law" cannot be used as a criterion for similarity to natural languages. Although DNA is presumably an "organized system of signs" in Mandelbrot's (1961) sense, an observation of statistical features of the sort presented in the Mantegna et al. paper does not shed light on the similarity between DNA's "grammar" and natural language grammars, just as the observation of exact Zipf-like behavior cannot distinguish between the underlying processes of tossing an M-sided die or a finite-state branching process.
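The die-tossing point can be reproduced in a few lines. This is a sketch, not the authors' code, and the alphabet size and corpus length are arbitrary: "monkey text" drawn uniformly from a few letters plus a space already yields an approximately straight line in log-log rank-frequency coordinates, so a good Zipf fit cannot diagnose language-like structure.

```python
# Sketch: rank-frequency statistics of "monkey text" from an M-sided die
# (7 letters plus a space) come out Zipf-like despite being memoryless.
import collections, math, random

rng = random.Random(0)
alphabet = "abcdefg "                       # an arbitrary 8-sided "die"
text = "".join(rng.choice(alphabet) for _ in range(200_000))
counts = collections.Counter(text.split())  # "words" between spaces

freqs = [c for _, c in counts.most_common()]
xs = [math.log(r) for r in range(1, len(freqs) + 1)]
ys = [math.log(f) for f in freqs]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(f"{len(freqs)} distinct 'words'; log-log slope ~ {slope:.2f}")
```

The fitted slope comes out in the same general range Zipf plots of real text do, which is the paper's point: the statistic is insensitive to the presence or absence of grammar.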
In a recent seminal paper, Gibson and Wexler ([1], GW) take important steps toward formalizing the notion of language learning in a (finite) space whose grammars are characterized by a finite number of parameters. One of their aims is to characterize the complexity of learning in such spaces. For example, they demonstrate that even in finite spaces, convergence may be a problem, since it is possible under some single-step gradient ascent methods to remain at a local maximum. From the standpoint of learning theory, however, GW leave open several questions that can be addressed by a more precise formalization in terms of Markov structures (a possible formalization suggested but left unpursued in a footnote of GW). In this paper we explicitly formalize learning in a finite parameter space as a Markov structure whose states are parameter settings. Several important results follow directly from this characterization, including: (1) a corrected version of GW's central convergence proof; (2) an explicit formula for calculating the transition probabilities between hypotheses, and the existence of "problem states" in addition to local maxima; (3) an explicit calculation of the time needed to converge, in terms of the number of (positive) examples; (4) the convergence and comparison of several variants of the GW learning procedure, e.g., random walk; (5) batch- and PAC-style learning bounds for the model.
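The absorbing-Markov-chain machinery behind results (2) and (3) is standard and easy to sketch. The transition probabilities below are invented for illustration and are not the actual Gibson-Wexler three-parameter space: states are parameter settings, the target grammar is an absorbing state, and expected convergence time comes from the fundamental matrix N = (I - Q)^(-1).

```python
# Illustrative only: made-up transition probabilities, not the real GW space.
# A transient state with no path into the target would be a "problem state";
# here every state eventually reaches the target.
import numpy as np

P = np.array([
    [0.6, 0.2, 0.1, 0.1],   # hypothetical parameter setting 0
    [0.1, 0.7, 0.1, 0.1],   # hypothetical parameter setting 1
    [0.0, 0.1, 0.6, 0.3],   # hypothetical parameter setting 2
    [0.0, 0.0, 0.0, 1.0],   # target grammar: absorbing state
])

Q = P[:3, :3]                        # transient-to-transient block
N = np.linalg.inv(np.eye(3) - Q)     # fundamental matrix
expected = N @ np.ones(3)            # expected examples until absorption
for state, steps in enumerate(expected):
    print(f"from setting {state}: ~{steps:.1f} positive examples to converge")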
This paper describes work in progress on a computer program that uses syntactic constraints to derive the meanings of verbs from an analysis of simple English example stories. The central idea is an extension of Winston's (Winston 1975) program that learned the structural descriptions of blocks world scenes. In the new research, English verbs take the place of blocks world objects like ARCH and
This paper describes a LISP program that can learn English syntactic rules. The key idea is that the learning can be made easy, given the right initial computational structure: syntactic knowledge is separated into a fixed interpreter and a variable set of highly constrained pattern-action grammar rules. Only the grammar rules are learned, via induction from example sentences presented to the program. The interpreter is a simplified version of Marcus's parser for English [1], which parses sentences without backup. The currently implemented program acquires about 70% of a simplified core grammar of English. What seems to make the induction easy is that the rule structures and their actions are highly constrained: there are only four actions, and they manipulate only very local parts of the parse tree.
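The architecture, a fixed interpreter driving learned pattern-action rules, can be caricatured in a few lines. The rule patterns, the action inventory, and the grammar fragment below are all invented for illustration and are not Berwick's actual rule set; the point is only the shape of the design: the interpreter never changes, and learning would consist of adding entries to RULES.

```python
# Toy illustration, not Berwick's program. Patterns match (label of current
# node, tag of first buffer cell); the invented action inventory is
# create / attach / drop, each manipulating only local structure.

def attach(stack, buffer):
    stack[-1].append(buffer.pop(0))          # attach first buffer cell here

def create(label):
    def action(stack, buffer):
        stack.append([label])                # start a new node of this label
    return action

def drop(stack, buffer):
    stack[-2].append(stack.pop())            # completed node -> its parent

RULES = [                                    # invented grammar fragment
    ("S",  "Det", create("NP")),
    ("NP", "Det", attach),
    ("NP", "N",   attach),
    ("NP", "V",   drop),
    ("S",  "V",   attach),
]

def parse(tagged):                           # the fixed interpreter
    stack, buffer = [["S"]], list(tagged)
    while buffer:
        top, cat = stack[-1][0], buffer[0][1]
        for pat_top, pat_cat, action in RULES:
            if (pat_top, pat_cat) == (top, cat):
                action(stack, buffer)
                break
        else:
            raise ValueError(f"no rule for ({top}, {cat})")
    return stack[0]

print(parse([("the", "Det"), ("dog", "N"), ("barks", "V")]))
# ['S', ['NP', ('the', 'Det'), ('dog', 'N')], ('barks', 'V')]
```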
International Joint Conference on Artificial Intelligence, 1983
This paper is a progress report on a series of three significant extensions to the original parsing design of (Marcus 1980). The extensions are: the range of syntactic phenomena handled has been enlarged, encompassing sentences with Verb Phrase deletion, gapping, and rightward movement, and an additional output representation of anaphor-antecedent relationships has been added (including pronoun and quantifier interpretation). A complete analysis of the parsing design has been carried out, clarifying the parser's relationship to the extended LR(k,t) parsing method as originally defined by (Knuth 1965) and explored by (Szymanski and Williams 1976). The formal model has led directly to the design of a "stripped down" parser that uses standard LR(k) technology and to results about the class of languages that can be handled by Marcus-style parsers (briefly, the class of languages is defined by those that can be handled by a deterministic, two-stack push-down automaton with severe restrictions on the transfer of material between the two stacks, and includes some strictly context-sensitive languages).
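As a hedged illustration of the automaton class mentioned at the end (a sketch, not the paper's construction), a deterministic two-stack push-down automaton with a restricted one-way transfer between stacks can recognize the strictly context-sensitive language a^n b^n c^n:

```python
# Sketch: count a's on stack 1; on each b, transfer one symbol from stack 1
# to stack 2; on each c, pop stack 2. Accept iff both stacks end empty.
def accepts(s: str) -> bool:
    s1, s2 = [], []
    phase = "a"
    for ch in s:
        if ch == "a" and phase == "a":
            s1.append("A")                   # count a's on stack 1
        elif ch == "b" and phase in ("a", "b") and s1:
            phase = "b"
            s2.append(s1.pop())              # one restricted transfer per b
        elif ch == "c" and phase in ("b", "c") and s2:
            phase = "c"
            s2.pop()                         # match each c against stack 2
        else:
            return False
    return phase == "c" and not s1 and not s2

assert accepts("aaabbbccc")
assert not accepts("aabbbcc") and not accepts("abcabc")
```

The single pop-and-push per input symbol is the kind of "severe restriction on transfer" that keeps such a device far weaker than an unrestricted two-stack machine (which would be Turing-equivalent).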
Proceedings of the 15th conference on Computational linguistics, 1994
Introduction. Given the prominence of the lexicon in most current linguistic theories (LFG, HPSG, GB), the inventory of language-particular information left in the lexicon deserves special attention. Constructing large computerized lexicons remains a difficult problem, holding a large array of apparently arbitrary information. This paper shows that this arbitrariness can be constrained more than might have been previously thought. In particular, arbitrariness of argument structure, word sense, and paraphrasability will be shown not only to be constrained, but also to be integrally related. Our (radical) view is that variation of lexical behavior across languages is exactly like lexical variation within languages; specifically, the difference lies in the presence or absence of certain morphemes. For example, the fact that Japanese has richer possibilities in certain verbal patterns is derived solely from its morphological inventory. Put another way, language parameters simply are the presence or absence of lexical material in the morphological component. Observed language variation patterns reflect morphological systematicity. The generative machinery for producing argument structure positions is fixed across languages. Linguistic Motivation. A striking example underscoring the universality of argument structure is the familiar Spray/Load alternation, shown in example (1). Despite the many surface differences in these data across languages, they share several essential properties. (1) a. John loaded the hay on the wagon. b. John loaded the wagon with the hay. (Japanese) (2) a. taroo-wa teepu-o boo-ni maita. Taro-NOM tape-ACC stick-DAT wrap-PRF 'Taro wrapped the tape around the stick.' b. taroo-wa boo-o teepu-de maita. Taro-NOM stick-ACC tape-WITH wrap-PRF 'Taro wrapped the stick with the tape.' (Hindi) (3) a. shyam lathi-ko kagaz-se lapeta. Shyam stick-ACC paper-with wrap.PRF 'Shyam wrapped the stick with paper.' (See Miyagawa, Fukui, and Tenny (1985) for a discussion of this effect; also see Martin (1975, pp. 441-455) for 56 such morphemes. On these alternations see, e.g., Levin (1993) and sources cited there, for example Jackendoff (1990) and Emonds (1991).)
Language Down the Garden Path, 2013
Proceedings of the workshop on Speech and Natural Language - HLT '91, 1991
This paper describes an implemented program that takes a tagged text corpus and generates a partial list of the subcategorization frames in which each verb occurs. The completeness of the output list increases monotonically with the total occurrences of each verb in the training corpus. False positive rates are one to three percent. Five subcategorization frames are currently detected and we foresee no impediment to detecting many more. Ultimately, we expect to provide a large subcategorization dictionary to the NLP community and to train dictionaries for specific corpora.
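A minimal sketch of the idea, assuming Penn-Treebank-style tags; the detection cues below are crude stand-ins of my own, whereas the paper's actual triggers are chosen carefully enough to keep false positives at one to three percent.

```python
# Simplistic sketch: for each verb token, glance at the next tagged word and
# record a candidate subcategorization frame. Not the paper's triggers.
import collections

def detect_frames(tagged):
    frames = collections.defaultdict(collections.Counter)
    for i, (word, tag) in enumerate(tagged):
        if not tag.startswith("VB"):
            continue                          # only look at verbs
        nxt = tagged[i + 1] if i + 1 < len(tagged) else None
        if nxt is None:
            frames[word]["V (intrans?)"] += 1
        elif nxt[0].lower() == "that":
            frames[word]["V that-S"] += 1     # clausal complement cue
        elif nxt[1] == "TO":
            frames[word]["V to-INF"] += 1     # infinitival complement cue
        elif nxt[1] in ("PRP", "DT", "NNP"):
            frames[word]["V NP"] += 1         # direct object cue
        else:
            frames[word]["V (intrans?)"] += 1
    return frames

sent = [("I", "PRP"), ("expect", "VBP"), ("to", "TO"), ("win", "VB"),
        ("and", "CC"), ("she", "PRP"), ("knows", "VBZ"), ("that", "IN"),
        ("I", "PRP"), ("will", "MD"), ("leave", "VB")]
for verb, fs in detect_frames(sent).items():
    print(verb, dict(fs))
```

Counting frame occurrences per verb, as here, is what makes the output list's completeness grow monotonically with the verb's corpus frequency.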
Proceedings of the 23rd annual meeting on Association for Computational Linguistics, 1985
Conjunctions are particularly difficult to parse in traditional, phrase-based grammars. This paper shows how a different representation, not based on tree structures, markedly improves the parsing problem for conjunctions. It modifies the union-of-phrase-markers model proposed by Goodall [1981], where conjunction is considered as the linearization of a three-dimensional union of a non-tree-based phrase marker representation. A PROLOG grammar for conjunctions using this new approach is given. It is far simpler and more transparent than a recent phrase-based extraposition parser for conjunctions by Dahl and McCord [1984]. Unlike the Dahl and McCord or ATN SYSCONJ approach, no special trail machinery is needed for conjunction, beyond that required for analyzing simple sentences. While of comparable efficiency, the new approach unifies under a single analysis a host of related constructions: respectively sentences, right node raising, or gapping. Another advantage is that it is also completely reversible (without cuts), and therefore can be used to generate sentences. Examples: John and Mary went to the pictures (simple constituent coordination). The fox and the hound lived in the fox hole and kennel respectively (constituent coordination with the 'respectively' reading). John and I like to program in Prolog and Hope (simple constituent coordination, but can have a collective or 'respectively' reading). John likes but I hate bananas (non-constituent coordination). Bill designs cars and Jack aeroplanes (gapping with 'respectively' reading). The fox, the hound and the horse all went to market (multiple conjuncts). *John sang loudly and a carol (violation of coordination of likes). *Who did Peter see and the car? (violation of the coordinate structure constraint). *I will catch Peter and John might the car (gapping, but component sentences contain unlike auxiliary verbs). ?The president left before noon and at 2, Gorbachev.
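As a toy illustration of the 'respectively' readings the analysis unifies (this is my illustration, not the paper's reversible PROLOG grammar), pairing coordinated constituents position-by-position turns a gapped sentence into its component simple clauses:

```python
# Toy only: the union-of-phrase-markers analysis licenses 'respectively'
# readings by pairing coordinated constituents position-by-position.
def respectively(subjects, verb, objects):
    return [f"{s} {verb} {o}" for s, o in zip(subjects, objects)]

# Gapping: "Bill designs cars and Jack aeroplanes"
print(respectively(["Bill", "Jack"], "designs", ["cars", "aeroplanes"]))
# ['Bill designs cars', 'Jack designs aeroplanes']
```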
2010 10th International Conference on Intelligent Systems Design and Applications, 2010
Proceedings of the 21st annual meeting on Association for Computational Linguistics, 1983
A central goal of linguistic theory is to explain why natural languages are the way they are. It has often been supposed that computational considerations ought to play a role in this characterization, but rigorous arguments along these lines have been difficult to come by. In this paper we show how a key "axiom" of certain theories of grammar, Subjacency, can be explained by appealing to general restrictions on on-line parsing plus natural constraints on the rule-writing vocabulary of grammars. The explanation avoids the problems with Marcus' [1980] attempt to account for the same constraint. The argument is robust with respect to machine implementation, and thus avoids the problems that often arise when making detailed claims about parsing efficiency. It has the added virtue of unifying in the functional domain of parsing certain grammatically disparate phenomena, as well as making a strong claim about the way in which the grammar is actually embedded into an on-line sentence processor.
Proceedings of the 23rd annual meeting on Association for Computational Linguistics, 1985
In this paper we apply some recent work of Angluin (1982) to the induction of the English auxiliary verb system. In general, the induction of finite automata is computationally intractable. However, Angluin shows that restricted finite automata, the k-reversible automata, can be learned by efficient (polynomial time) algorithms. We present an explicit computer model demonstrating that the English auxiliary verb system can in fact be learned as a 1-reversible automaton, and hence in a computationally feasible amount of time. The entire system can be acquired by looking at only half the possible auxiliary verb sequences, and the pattern of generalization seems compatible with what is known about human acquisition of auxiliaries. We conclude that certain linguistic subsystems may well be learnable by inductive inference methods of this kind, and suggest an extension to context-free languages.
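A compact sketch of the flavor of Angluin-style inference (zero-reversible rather than the paper's 1-reversible case, to keep it short; the toy alphabet encoding auxiliary sequences is invented): build a prefix-tree acceptor over positive examples, then merge states until the machine is deterministic, reverse-deterministic, and has a single accepting state.

```python
# Sketch of zero-reversible induction. Toy encoding: c = modal, h = have,
# b = be, e = main-verb form, so "ce" ~ "could eat", "chbe" ~ "could have
# been eating". Merging generalizes beyond the presented sample.
import collections, itertools

def learn_zero_reversible(samples):
    trans, accept, fresh = {}, set(), itertools.count(1)
    for word in samples:                       # build prefix-tree acceptor
        state = 0
        for ch in word:
            state = trans.setdefault((state, ch), next(fresh))
        accept.add(state)

    parent = {}
    def find(x):                               # union-find over states
        while parent.get(x, x) != x:
            x = parent[x]
        return x

    changed = True
    while changed:                             # merge to a fixpoint
        changed = False
        fwd = collections.defaultdict(set)     # determinism violations
        rev = collections.defaultdict(set)     # reverse-determinism violations
        for (s, ch), d in trans.items():
            fwd[(find(s), ch)].add(find(d))
            rev[(find(d), ch)].add(find(s))
        groups = list(fwd.values()) + list(rev.values())
        groups.append({find(s) for s in accept})   # one accepting state
        for group in groups:
            first, *rest = group
            for other in rest:
                parent[find(other)] = find(first)
                changed = True

    dfa = {(find(s), ch): find(d) for (s, ch), d in trans.items()}
    return dfa, find(next(iter(accept)))

dfa, final = learn_zero_reversible(["ce", "che", "cbe", "chbe"])
print(sorted(dfa.items()), "accepting:", final)
```

On this tiny sample the learner already generalizes beyond the four presented strings, which mirrors the paper's observation that the full system is acquired from only half the possible auxiliary sequences.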
Proceedings of the 10th international conference on Computational linguistics, 1984
Room 820. MH" Artificial Intelligence I ~lb Cambridge. MA 02139 AIISTRACI" Natural langt~ages are... more Room 820. MH" Artificial Intelligence I ~lb Cambridge. MA 02139 AIISTRACI" Natural langt~ages are often assumed to be constrained so that they are either easily learnable or parsdble, but few studies have investigated the conrtcction between these two "'functional'" demands, Without a fonnal model of pamtbility or learnability, it is difficult to determine which is morc "dominant" in fixing the properties of natural languages. In this paper we show that if we adopt one precise model of "easy" parsability, namely, that of boumled context parsabilio,, and a precise model of "easy" learnability, namely, that of degree 2 learnabilio" then we can show that certain families of grammars that meet the bounded context parsability ct~ndition will also be degree 2 learnable. Some implications of this result for learning in other subsystems of linguistic knowledge are suggested. 1