CS674: Natural Language Processing (original) (raw)

CS 674: Natural Language Processing

Spring 2002

Instructor: Lillian Lee, 4152 Upson, . Office hours: Wednesdays 2-3:30, or by appointment
Time and location: MW 11:15-12:05, Upson 207

Announcements: See the cornell.class.cs674 newsgroup.
Reaction essay info and READING LIST: click here
Some online resources: click here
Handouts and lecture references:

Lecture index: Intro: Jan 21 FB-CFGs: Jan 23 TAGs: Jan 28, Jan 30, Feb 4 Stats & Zipf's law: Feb 6 HMMs: Feb 11, Feb 13 and 18 Good-Turing: Feb 20 Discourse: Feb 25,Feb 27 WSD: Mar 4, Mar 6, Mar 11 Segmentation: Mar 13 Semantics: Mar 25, Mar 27 Info. theory: Apr 1 Clustering: Apr 3, Apr 8 MT: Apr 10, Apr 15 EM: Apr 17 Summarization: Apr 22 "Ling. vs. Stats": Apr 24 1/21: Introduction Four handouts: course information sheet (includes syllabus), Student information sheet, Reaction essay readings, Introductory paper Introductory paper: ``I'm sorry Dave, I'm afraid I can't do that'': Linguistics, Statistics, and Natural Language Processing circa 2001 Bill Gates quote, Gartner symposium, 1997. (Hereis an alternate link.) About the word "undertoad" (see John Irving, The World According to Garp, 1976.) Mary Klages, 2001. Structuralism and Sausurre Roni Rosenfeld, the Universal Speech Interface Manifesto Alan M. Turing, 1950. Computing machinery and intelligence. Mind LIX(236), pp. 433-460. 1/23: Feature-based context-free grammars Handout: lecture examples See also ch. 4 of Allen or ch. 9 and 11 of Jurafsky and Martin. 1/28: Tree adjoining grammars Handout: lecture examples Christopher Culy, 1985. The complexity of the vocabulary of Bambara. Linguistics and Philosophy, 8:345-351. Aravind K. Joshi, Leon S. Levy, Masako Takahashi, 1975. Tree adjunct grammars. _Journal of Computer and System Sciences_10(1), pp. 136-163. Aravind K. Joshi and Yves Schabes, Tree-adjoining grammars. I don't know of a publication source for this.Aravind K. Joshi,K. Vijay-Shankar, and David Weir, 1991. The convergence of mildly context-sensitive grammar formalisms. In Peter Sells, Stuart Shieber, and Tom Wasow, editors, Foundational Issues in Natural Language Processing, pp. 31--81. MIT Press. Aravind K. Joshi, 1985. Tree adjoining grammars: how much context-sensitivity is required to provide reasonable structural description. In David R. Dowty, Lauri Karttunen, and Arnold M. Zwicky, eds, Natural Language Processing: psychological, computational, and theoretical perspectives, Cambridge. Note -- there may be an error in the parsing-time claim. Geoff Pullum, 1986. Footloose and context-free. Natural Language and Linguistic Theory 4, pp. 283--289. Reprinted in The Great Eskimo Vocabulary Hoax, U. of Chicago Press, 1991. Stuart Shieber, 1985. Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333-343. Stuart Shieber and Yves Schabes, 1990. Synchronous Tree-Adjoining Grammars. In Proceedings of the 13th International Conference on Computational Linguistics, volume 3, pp. 1--6.K. Vijay-Shankar and Aravind K. Joshi, 1985. Some computational properties of Tree Adjoining Grammars. Proc. of the 23rd ACL, pp. 82--93.K. Vijay-Shankar andDavid Weir, 1994. The equivalence of four extensions of context-free grammars. Mathematical Systems Theory, 27:511-545. The XTAG home page. TAG+6: The 6th International Workshop on Tree Adjoining Grammars and Related Frameworks . [back to lecture index | back to top] 1/30: TAGs with adjunction constraints Handout: Lecture examples Anne Abeillé and Yves Schabes, 1989. Parsing idioms in lexicalized TAGs. Fourth Conference of the European Chapter of the Association for Computational Linguistics (EACL '89).Tilman Becker, Aravind K. Joshi, and Owen Rambow, 1991. Long distance scrambling and tree adjoining grammars. EACL, pp. 21--26.Yves Schabes and Stuart Shieber, 1994. An alternative conception of Tree-adjoining derivation. Computational Linguistics 20(1), pp. 91--124. K. Vijay-Shankar and Aravind K. Joshi, 1988. Feature structure based tree adjoining grammars. Proceedings of the 12th International Conference on Computational Linguistics (COLING '88). [back to lecture index back to top]2/04: Feature-based TAGs Handout: Lecture examples Steven Abney, 1996. Statistical methods and linguistics. The Balancing Act, Judith Klavans and Philip Resnik, eds, MIT Press. Paper available online as psorpdf Sanguthevar Rajasekaran and Shibu Yooseph, 1995. TAL recognition in O(M(n^2)) time. Proc. of the 33rd ACL, pp. 166--173. (Link is to the journal version, in Journal of Computer and System Sciences, 56(1), pp. 83--89, 1998.)Giorgio Satta, 1994. Tree Adjoining Grammar parsing and Boolean matrix multiplication. Computational Linguistics, 20(2), pp. 173--191. [back to lecture index back to top]2/06: (Some) statistics of language Handout: Lecture handout J.B. Estoup, 1916. Gammes Stenographiques. Institut Stenographique de France.W. Nelson Francis and Henry Kucera, 1982. Frequency Analysis of English Usage. Houghton Mifflin. Also of potential interest: the Brown corpus manual, 1979; and a search interface.Tommi Jaakkolaand David Haussler, 1998. Exploiting generative models in discriminative classifiers. NIPS 11.Mark Johnson, 2001.Joint and conditional estimation of tagging and parsing models. Proc. of ACL, pp. 314--321.John Lafferty, Andrew McCallum, and Fernando Pereira, 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. of ICML. Benoit Mandelbrot, 1957. Th�orie math�matique de la d'Estoup-Zipf. Inst. de Statistique de l'Univ�rsit�.George A. Miller, 1957. Some effects of intermittent silence. American J. Psychology 70, pp. 311-313.Andrew Y. Ng andMichael Jordan, 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and Naive Bayes. Proc. of NIPS.Y. Dan Rubenstein and Trevor Hastie, 1997.Discriminative vs Informative Learning. Proc. of KDD. George Zipf, 1949. Human Behavior and the principle of least effort. Addison-Wesley Press. Zipf's Law (webpage by Wentian Li). Many Zipf's Law refs in many fields.See also section 1.4 of the Manning/Schütze text or chapter 4 of Timothy C. Bell, John G. Cleary, and Ian H. Witten, Text Compression, 1990. [back to lecture index back to top]2/11: Hidden Markov Models Handout: Lecture handout Lawrence R. Rabiner, 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), pp. 257--286. Errataby Ali Rahimi Lawrence R. Rabiner and B. H. Juang, 1986. An introduction to hidden Markov models. IEEE ASSP Magazine, pp. 4-15. See also Ch. 9 of Manning/Schütze, Ch. 2 of Jelinek, or Ch. 7 of Jurafsky/Martin. Ch. 3 of the Durbin/Eddy/Krogh/Mitchison introduces the same material from a computational-biology perspective. 2/13 and 2/18: Hidden Markov Models (cont.) Handout: Lecture handout Bernard Merialdo, 1994. Tagging English text with a probabilistic model. Computational Linguistics 20 (2), pp. 155-172. [back to lecture index back to top]2/20: Good-Turing smoothing Handout: Lecture handout Peter F. Brown and Vincent J. DellaPietra and Peter V. deSouza and Jennifer C. Lai and Robert L. Mercer, 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4), pp. 467--479.Stanley F. Chen and Joshua Goodman, 1996. An empirical study of smoothing techniques for language modeling Proceedings of the 34th Meeting of the Association for Computational Linguistics, pp 310--318. TR version: TR-10-98, Computer Science Group, Harvard University, 1998.Kenneth W. Church and William A. Gale, 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. _Computer Speech and Language_5, pp. 19--54.Michael Collins and James Brooks, 1995. Prepositional Attachment through a Backed-off Model. Third Workshop on Very Large Corpora, pp. 27--38. I. J. Good, 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40, pp. 237--264. Pierre Simon Laplace. Essai Philosophique sur les probabilities. There appear to be several editions; Ristad's A natural law of succession dates it to 1775. Arthur Nadas, 1985. On Turing's formula for word probabilities.IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-33 (6), pp. 1414--1416.Geoffrey Sampson's Good-Turingpage.See also section 6.2 of the Manning/Schütze text or chapter 15 of the Jelinek text. [back to lecture index back to top]2/25: Discourse phenomena Handout: Lecture handout Ralph Grishman, 1986. Computational Linguistics: An Introduction. Cambridge. Jerry R. Hobbs, 1978. Resolving Pronoun References. Reprinted in Grosz, Sparck Jones, and Webber, Readings in Natural Language Processing. Jerry R. Hobbs, 1979. Coherence and co-reference. Cognitive Science 3(1), pages 67--82. Ray Jackendoff, 1972. Semantic Interpretation in Generative Grammar. MIT Press. William C. Mann and Sandra A. Thompson, 1986. Relational propositions in discourse. Discourse Processes 9(1), pp. 57--90. Vladimir Nabokov, Lolita. 1955.Candace Sidner, 1979, Towards a Computational Theory of Definite Anaphora Comprehension in English Discourse. PhD Thesis, MIT. Remko Scha and Livia Polyani, 1988. An augmented context free grammar for discourse. Proceedings of COLING. Yorick Wilks, 1975. An intelligent analyzer and understander of English. Communications of the ACM 18(5), 264--274. Reprinted in Grosz et al, Readings in Natural Language Processing See also ch. 14 of Allen. or ch. 18 of Jurafsky/Martin. [back to lecture index back to top]2/27: The Grosz and Sidner discourse theory Handouts: Example discourse, Project guidelines Barbara Grosz andCandace Sidner, 1986. Attention, intentions, and the structure of discourse.Computational Linguistics 12(3), pp. 175--204.See also ch. 14 and 16 of Allen or ch. 19 of Jurafsky/Martin. [back to lecture index back to top]3/4: Word sense disambiguation Handouts: Lecture handout Kathleen G. Dahlgren, 1988. Naive Semantics for Natural Language Understanding. Kluwer. William Gale, Kenneth Church, and David Yarowsky, 1992. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. Proceedings of the ACL, pp. 249--256. Nancy Ide and Jean Veronis. Introduction to the special issue on word sense disambiguation: The state of the art. Computational Linguistics 24(1), pp. 1--40. Abraham Kaplan, 1950. An experimental study of ambiguity and context. Mechanical Translation 2(2): 39-46 (issue appeared in 1955). Jerrold J. Katz and Jerry A. Fodor, 1963, The structure of semantic theory. Language (39), pp. 170--210. Alpha Luk, 1995. Statistical sense disambiguation with relatively small corpora using dictionary definitions.Proceedings of the 33rd ACL. Ray Mooney, 1996. Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. Proceedings of the 1996 Conference on Empirical Methods in Natural Language Processing, pp. 82-91. Erwin Reifler, 1955. The mechanical determination of meaning. In William N. Locke and A. Donald Booth, eds., Machine Translation of Languages. John Wiley and Sons. Also included in the upcoming Readings in Machine Translation, MIT Press. Philip Resnik, 1995. Disambiguating noun groupings with respect to WordNet senses. Proc. of the 3rd Workshop on Very Large Corpora.Hinrich Sch�tze, 1992. Dimensions of meaning. Proceedings of Supercomputing, pp 787-796. (postscript)Yorick Wilks andMark Stevenson, 1998. The Grammar of Sense: Using part-of-speech tags as a first step in semantic disambiguation. Journal of Natural Language Engineering 4(2), pp. 135-144. (See also cmp-lg/9607028)David Yarowsky, 1992. Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. Proc. of COLING, pp. 454--460.See also ch 17.1-17.2 of Jurafsky and Martin [back to lecture index back to top]3/6: Word sense disambiguation methods Handouts: lecture handout Christiane Fellbaum, ed., 1998. WordNet: An Electronic Lexical Database See also ch. 7 of Manning and Schütze. [back to lecture index back to top]3/11: Supervised and bootstrapped WSD methods Handouts: lecture handout, sample project proposal (available by hardcopy only) David Yarowsky, 1994. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French.Proceedings, ACL-94, pp. 88--95.David Yarowsky, 1995. Unsupervised word sense disambiguation rivaling supervised methods, Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp 189--196.David Yarowsky, 1999.A comparison of corpus-based techniques for restoring accents in Spanish and French text. Natural Language Processing Using Very Large Corpora, pp. 99--120.SemCor(see also SemCor 1.7a) [back to lecture index back to top]3/13: Mostly-unsupervised Japanese segmentation Rie Kubota Ando and Lillian Lee, 2000. Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji. NAACL, pp. 241--248. [back to lecture index back to top]3/25: Representing semantics Handout: lecture handout Richard Montague, 1974. Formal Philosophy.Fernando C.N. Pereira and Stuart Shieber, Prolog and Natural-Language Analysis, 1987. See Section 4.1.Mark Steedman, 2000. The Syntactic Process.See also Allen ch. 11.2, Jurafsky/Martin ch 14. 15.5 [back to lecture index back to top]3/27: Quantifier scope ambiguity Handout: lecture handout Jerry Hobbs and Stuart Shieber, 1987. An Algorithm for Generating Quantifier Scopings. Computational Linguistics, 13(1-2), pages 47-63.Fernando C.N. Pereira and Stuart Shieber, Prolog and Natural-Language Analysis, 1987. See Section 4.1. [back to lecture index back to top]4/1: Introduction to information theory Handouts: lecture handout, sample literature survey (hardcopy only) Thomas M. Cover and Joy A. Thomas, 1991. Elements of Information Theory. Wiley. Ming Li and Paul Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, 1997 (2nd ed).Claude Shannon, 1948. A mathematical theory of communication. Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656. Republished as "The mathematical theory of communication" in Warren Weaver and Claude E. Shannon, eds., The Mathematical Theory of Communication, U. Illinois Press, 1949. See also ch 7 and 8 of the Jelinek text, or Ch 2.2 of the Manning/ Schütze text. Jelinek recommends Abramson's Information Theory and Coding, 1963. [back to lecture index back to top]4/3: Class-based language modeling with hard clusters Handout: Table 2 and 3 of the Brown paper (see link)Peter F. Brown and Vincent J. DellaPietra and Peter V. deSouza and Jennifer C. Lai and Robert L. Mercer, 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4), pages 467--479. Frederick Jelinek, Robert L. Mercer and Salim Roukos, 1992. Principles of Lexical Language Modeling for Speech Recognition. In Sadaoki Furui and M. Mohan Sondhi, eds., Advances in Speech Signal Processing, Mercer Dekker. [back to lecture index back to top]4/3: Class-based language modeling with soft clusters Handout: Lecture handout Slava M. Katz, 1987. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer.IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35 (3), pp 400--401.Lillian Lee and Fernando Pereira, 1999. Distributional similarity models: Clustering vs. nearest neighbors. Fernando Pereira,Naftali Tishby, and Lillian Lee, 1993. Proceedings of the 37th ACL, pp 33--40. Distributional Clustering of English Words. Proceedings of the 31st ACL, pp 183--190. See also Chapter 4 of Similarity-Based Approaches to Natural Language Processing (has proofs) [back to lecture index back to top]4/10: Introduction to machine translation Yehoshua Bar-Hillel, 1960. The present status of automatic translation of languages. In Franz Alt, A. Donald Booth, and R. E. Meagher, eds., Advances in Computers. Academic Press.John Hutchins' history of machine translation in a nutshell. Eduard H. Hovy and Kevin Knight, 1996. Machine Translation. Tutorial at ACL '96.Kishore Papineni and Salim Roukos and Todd Ward and Wei-Jing Zhu, 2001. Bleu: a Method for Automatic Evaluation of Machine Translation. IBM research report RC22176 (W0109-022).See also ch. 21 of Jurafsky/Martin [back to lecture index back to top]4/15: Statistical machine translation Handout: lecture handout Adam L. Berger, Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, John R. Gillett, John D. Lafferty, Robert L. Mercer, Harry Printz, and Lubos Ures, 1994. The Candide System for Machine Translation. Proceedings of the 1994 ARPA Workshop on Human Language Technology Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer, 1993.The Mathematics of Statistical Machine Translation". Computational Linguistics 19(2), pp. 263--311.Kevin Knight, 1999. A Statistical MT Tutorial Workbook. (html)Voltaire, 1759. Candide Warren Weaver. Translation. Memorandum. See MT News International discussion, July 1999. [back to lecture index back to top]4/17: The EM algorithm Michael Collins, 1997. The EM Algorithm. Written Preliminary Exam (WPE) paper.Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin, 1977. Maximum Likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B, 39(1), pp. 1-38. Geoffrey J. McLachlan and Thriyambakam Krishnan, 1997. The EM Algorithm and Extensions. Wiley.Ted Pedersen'sintroductory talkand paper See also chapter 9 of the Jelinek text, or chapter 11 of the Durbin, Eddy et al text, or the "When does EM work?" panel at EMNLP 2001 [back to lecture index back to top]4/22: Statistical Summarization Handouts: lecture handout, Presentation and Final Report Instructions Kevin Knight and Daniel Marcu, 2000. Statistics-Based Summarization -- Step One: Sentence Compression. Inderjeet Mani and Mark Maybury, eds., 1999. Advances in Automatic Text Summarization.Michael J. Witbrock and Vibhu Mittal, 1999. Ultra-summarization: a statistical approach to generating highly condensed non-extractive summaries Poster abstract, SIGIR 1999. (long version [back to lecture index back to top]4/2: Linguistics and statistics: the case of POS tagging Handouts: lecture handout Eric Brill and Grace Ngai, 1999. Man [and Woman] vs. Machine: A Case Study in Base Noun Phrase Learning Jean-Pierre Chanod and Pasi Tapanainen, 1995. Tagging French -- comparing a statistical and a constraint-based method. Proceedings of the EACL.Kenneth W. Church, 1992. Current practice in part of speech tagging and suggestions for the future. In Simmons, ed., Sbornik praci: In Honor of Henri Kucera.Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun, 1992. A Practical Part-of-Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing, pp. 133-140.Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19, pp. 313-330. Christer Samuelsson and Atro Voutilainen, 1997. Comparing a linguistic and a stochastic tagger, Proceedings of the 35th ACL/8th EACL. Also cmp-lg/97060 Pasi Tapanainen and Atro Voutilainen, 1994. Tagging accurately - Don't guess if you know, Proceedings of the Fourth ACL Conference on Applied Natural Language Processing. Also cmp-lg/9408009 [back to lecture index back to top]

Other online resources:

Other CS course websites

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

. .