Statistical Natural Language Processing: Models and Methods (CS775)
CS 775: Seminar in Natural Language Understanding, Spring 2001
"Statistical Natural Language Processing: Models and Methods"
Wednesdays, 1:25-2:15, Hollister 362
Natural language processing (NLP) has been considered one of the "holy grails" for artificial intelligence ever since Turing proposed his famed "imitation game" (the Turing Test). Statistical approaches have revolutionized the way NLP is done. Furthermore, some of these approaches can be employed in other applications, such as computational biology and data mining. This course will explore important classes of probabilistic models of language and survey some of the common general techniques.
Syllabus
- I. Models
- Zipf's law: introduction to the statistics of language
- Hidden Markov Models and related formalisms: the Viterbi and Baum-Welch algorithms, integration with semantic hierarchies, language modeling, smoothing, and Maximum Entropy Markov Models
- The noisy channel model (basics of information theory): entropy, the Kullback-Leibler divergence, and mutual information
- Parametric clustering
- II. Methods
- The Expectation-Maximization (EM) algorithm
- The Maximum Entropy (ME) principle
- The singular value decomposition (SVD) and latent semantic indexing (LSI)
Reference Texts
- Eugene Charniak, Statistical Language Learning, 1993
- Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, 1991
- Richard Durbin, Sean Eddy, Anders Krogh, and Graeme Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1998
- Frederick Jelinek, Statistical Methods for Speech Recognition, 1997
- Christopher D. Manning and Hinrich Schuetze, Foundations of Statistical Natural Language Processing, 1999
Notes and References from Lecture
Some references may be available only internally to Cornell.
Lecture list: Jan 24 | Jan 31 | Feb 7 | Feb 14 | Feb 21 | Feb 28 | Mar 7 | Mar 14 | Mar 28 | April 11 | April 18 | April 25
- Jan 24: Models and (vs.) methods. What is NLP - AI-completeness - Statistical NLP - generative models - benefits and tradeoffs.
References:
- Steven Abney, 1996. Statistical Methods and Linguistics. In The Balancing Act, Judith Klavans and Philip Resnik, eds., MIT Press.
- Bill Gates, 1997. Remarks at the Gartner Symposium: "Now we're betting the company on these natural interface technologies".
- Alan M. Turing, 1950. Computing machinery and intelligence. Mind LIX(236), pp. 433-460.
- Jan 31: Zipf's law and the model moral. Statistics of words - Zipf-like behavior - why log-log - least effort - Miller's monkeys. (A small code sketch follows the references below.)
References:
- Timothy C. Bell, John G. Cleary, and Ian H. Witten, 1990. Text Compression, Prentice-Hall.
- J.B. Estoup, 1916. Gammes Sténographiques. Institut Sténographique de France.
- Wentian Li, 1992. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), pp. 1842-1845.
- Zipf's Law (webpage by Wentian Li), collecting Zipf's Law references from many fields.
- Benoit Mandelbrot, 1957. Théorie mathématique de la loi d'Estoup-Zipf. Institut de Statistique de l'Université.
- The Manning/Schuetze text, section 1.4.3
- George A. Miller, 1957. Some effects of intermittent silence. American Journal of Psychology, 70:311-314.
- George Kingsley Zipf, 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley.
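A minimal sketch of the Zipf-like rank-frequency behavior discussed above, assuming only the Python standard library; the corpus file name is a placeholder, and any reasonably large plain-text file will do. Under Zipf's law, rank times frequency should stay roughly constant.

```python
# Rough check of Zipf's law: rank * frequency should be roughly constant.
# "corpus.txt" is a placeholder; substitute any large plain-text file.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

freqs = sorted(Counter(words).values(), reverse=True)  # frequencies by decreasing rank

for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1], rank * freqs[rank - 1])
```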
- Feb 7: Hidden Markov Models (I). Markov chains - decoupling the output from the mechanism - HMMs - calculating the probability of an observed sequence the hard way - the forward-backward algorithm to the rescue. (A forward-algorithm sketch follows the references below.)
References:
- L. R. Rabiner and B. H. Juang, 1986. An Introduction to Hidden Markov Models. IEEE ASSP Magazine 3(1), pp. 4-16.
- Lawrence R. Rabiner, 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), pp. 257-286. (Errata by Ali Rahimi.)
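As a companion to the Rabiner tutorials above, here is a minimal sketch of the forward recursion for a small discrete HMM; the transition, emission, and initial probabilities are made up for illustration. It computes the probability of an observation sequence in O(TN^2) time instead of summing over all N^T state sequences.

```python
# Forward algorithm for a toy discrete HMM (illustrative parameters only).
import numpy as np

pi = np.array([0.6, 0.4])              # initial state distribution
A = np.array([[0.7, 0.3],              # A[i, j] = P(state j at t+1 | state i at t)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],         # B[i, k] = P(symbol k | state i)
              [0.1, 0.3, 0.6]])

def forward(obs):
    """Return P(obs) by the forward recursion."""
    alpha = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # alpha_{t+1}(j) = sum_i alpha_t(i) a_ij b_j(o_{t+1})
    return float(alpha.sum())

print(forward([0, 1, 2, 1]))
```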
- Feb 14: Hidden Markov Models (II). Finding likely state sequences - finding likely paths with the Viterbi algorithm - revealing the hidden states - Baum-Welch when nobody in the lab knows the language. (A Viterbi sketch follows the references below.)
Further references:
- Doug Cutting, Julian Kupiec, Jan Pedersen, and Penelope Sibun, 1992. A Practical Part-of-Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing, pp. 133-140.
- Stephen J. DeRose, 1988. Grammatical category disambiguation by statistical optimization. Computational Linguistics 14(2), pp. 31-39.
- David Elworthy, 1994. Does Baum-Welch Re-estimation help taggers? Proceedings of the 4th Conference on Applied Natural Language Processing.
- Julian Kupiec, 1992. Robust Part-of-Speech Tagging Using a Hidden Markov Model. Computer Speech and Language 6, pp. 225-242.
- Bernard Merialdo, 1994. Tagging English Text with a Probabilistic Model. Computational Linguistics 20(2), pp. 155-171.
- Ralph Weischedel, Marie Meteer, Richard Schwartz, Lance A. Ramshaw, and Jeff Palmucci, 1993. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics 19(2), pp. 359-382.
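A minimal Viterbi sketch in the same toy setting as the forward-algorithm example above (made-up parameters, not a part-of-speech tagger); it recovers the single most probable state sequence, working in log space to avoid underflow.

```python
# Viterbi decoding for the same toy discrete HMM (illustrative parameters only).
import numpy as np

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

def viterbi(obs):
    """Return the most probable hidden state sequence for obs."""
    delta = np.log(pi) + np.log(B[:, obs[0]])
    backptr = []
    for o in obs[1:]:
        scores = delta[:, None] + np.log(A)      # scores[i, j]: best path ending in i, then i -> j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(B[:, o])
    state = int(delta.argmax())                  # best final state
    path = [state]
    for bp in reversed(backptr):                 # trace the back pointers
        state = int(bp[state])
        path.append(state)
    return path[::-1]

print(viterbi([0, 1, 2, 1]))
```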
- Feb 21: Abney and Light: a WordNet-based HMM. Selectional preferences - word sense disambiguation via HMMs - WordNet-induced HMM topology - fun with hand-simulation - alterations to Baum-Welch - problems with Baum-Welch, WordNet, or the approach?
References:
- Steven Abney and Marc Light, 1999. Hiding a semantic hierarchy in a Markov model. Proceedings of the ACL'99 Workshop on Unsupervised Learning in Natural Language Processing, pp. 1-8. (A longer version is also available.)
- Massimiliano Ciaramita and Mark Johnson, 2000. Explaining away ambiguity: Learning verb selectional preference with Bayesian networks. Proceedings of the 18th International Conference on Computational Linguistics, Volume 1.
- Stephen Clark and David Weir. An iterative approach to estimating frequencies over a semantic hierarchy. Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 258-265.
- Hang Li and Naoki Abe, 1998. Generalizing Case Frames Using a Thesaurus and the MDL Principle. Computational Linguistics, Vol. 24(2).
- Philip Resnik, 1997. Selectional preference and sense disambiguation. ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?
- WordNet. See also WordNet: An Electronic Lexical Database, Christiane Fellbaum, ed., 1998.
- Feb 28: Smoothing, the Good old-fashioned Turing way. The zero-frequency problem - the return of Zipf's law - the Good-Turing estimate - counting bunnies in Bletchley Park and cracking the Enigma - a Bayesian derivation - singleton rates and novel-item rates - Jelinek and Mercer's empirical derivation - leave-one-out - leave-high-counts-alone. (A Good-Turing sketch follows the references below.)
References:
- Stanley F. Chen and Joshua Goodman, 1996. An empirical study of smoothing techniques for language modeling. Proceedings of the 34th Meeting of the Association for Computational Linguistics, pp. 310-318. TR version: TR-10-98, Computer Science Group, Harvard University, 1998.
- Kenneth W. Church and William A. Gale, 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language 5, pp. 19-54.
- Irving J. Good, 1953. The population frequencies of species and the estimation of population parameters. Biometrika 40(3-4), pp. 237-264.
- Jelinek text: see last chapter.
- Slava M. Katz, 1987. Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3), pp. 400-401.
- Reinhard Kneser and Hermann Ney, 1995. Improved backing-off for m-gram language modeling. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 181-184.
- Arthur Nadas, 1985. On Turing's formula for word probabilities. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-33(6), pp. 1414-1416.
- Geoffrey Sampson's Good-Turing Frequency Estimation page
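A minimal sketch of the basic (unsmoothed) Good-Turing estimate discussed above: an item observed r times gets adjusted count r* = (r+1) N_{r+1} / N_r, and the N_1 singletons determine the total mass reserved for unseen items. Real implementations first smooth the N_r counts (see the Church and Gale paper or Sampson's page above); this sketch skips that step and simply leaves high counts alone when N_{r+1} is zero.

```python
# Basic (unsmoothed) Good-Turing adjusted counts: r* = (r + 1) * N_{r+1} / N_r.
from collections import Counter

def good_turing(counts):
    """counts: mapping item -> observed count. Returns (adjusted counts, P(unseen))."""
    N = sum(counts.values())
    Nr = Counter(counts.values())      # Nr[r] = number of distinct items seen exactly r times
    adjusted = {}
    for item, r in counts.items():
        if Nr.get(r + 1):
            adjusted[item] = (r + 1) * Nr[r + 1] / Nr[r]
        else:
            adjusted[item] = r         # leave high counts alone when N_{r+1} = 0
    p_unseen = Nr.get(1, 0) / N        # total mass reserved for novel items
    return adjusted, p_unseen

counts = Counter("to be or not to be that is the question".split())
print(good_turing(counts))
```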
- Mar 7: EM and Baum-Welch. Using hidden structure variables - maximizing expected log-joint-likelihood increases likelihood - deriving Baum-Welch as a special case. (A small EM sketch follows the references below.)
References:
- Michael Collins, 1997. The EM Algorithm. Written Preliminary Exam (WPE) paper.
- Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin, 1977. Maximum Likelihood From Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society Series B, 39(1), pp. 1-38.
- Geoffrey J. McLachlan and Thriyambakam Krishnan, 1997. The EM Algorithm and Extensions. Wiley.
- See also chapter 9 of the Jelinek text, or chapter 11 of the Durbin, Eddy et al text.
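A minimal EM sketch on a problem simpler than Baum-Welch: a mixture of two biased coins, where each trial's count of heads is observed but the identity of the coin that produced it is hidden. The data and starting values below are made up for illustration; the E-step computes posterior responsibilities, and the M-step re-estimates the mixing weights and biases from expected counts.

```python
# EM for a mixture of two binomial (coin) components; the coin identity is hidden.
import numpy as np

heads = np.array([8, 9, 2, 1, 7, 3])   # heads out of n = 10 flips per trial (toy data)
n = 10
p = np.array([0.6, 0.4])               # initial coin biases
w = np.array([0.5, 0.5])               # initial mixing weights

for _ in range(50):
    # E-step: posterior responsibility of each coin for each trial.
    like = w * (p ** heads[:, None]) * ((1 - p) ** (n - heads[:, None]))
    resp = like / like.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood.
    w = resp.mean(axis=0)
    p = (resp * heads[:, None]).sum(axis=0) / (resp.sum(axis=0) * n)

print(w, p)
```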
- Mar 14: Statistical Generation. Presented by Regina Barzilay. Note the use of generative models for ranking proposed utterances.
References:
- Aggeliki Dimitromanolaki's bibliography on machine learning/statistical approaches to generation
- FERGUS
- FUF/SURGE
- Nitrogen demo
- Penman
- Mar 28: Introduction to Information Theory. Relating information to communication, randomness, and surprise - semantics vs. engineering? - definitions and axioms for entropy - cross entropy when an unknown source is involved - relative entropy/Kullback-Leibler divergence to measure distributional similarity - mutual information as a special case - can the mutual information be negative? It depends on whom you ask. (A numerical sketch follows the references below.)
- Norman Abramson, 1963. Information Theory and Coding
- Claude Shannon, 1948. A mathematical theory of communication. Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656. Republished as "The mathematical theory of communication" in Warren Weaver and Claude E. Shannon, eds., The Mathematical Theory of Communication, U. Illinois Press, 1949.
- See also the Cover and Thomas text or chapters 7 and 8 of the Jelinek text.
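A small numerical sketch of the quantities above for a made-up joint distribution: entropy, KL divergence, and mutual information computed as I(X;Y) = D(P(X,Y) || P(X)P(Y)), which is always nonnegative; the quantity that can go negative is the pointwise log ratio log p(x,y)/(p(x)p(y)), which some authors also call "mutual information".

```python
# Entropy, KL divergence, and mutual information for a toy joint distribution.
import numpy as np

def entropy(p):
    """Shannon entropy in bits."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def kl(p, q):
    """Relative entropy D(p || q) in bits (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

pxy = np.array([[0.4, 0.1],            # joint distribution P(X, Y)
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

print("H(X) =", entropy(px))
print("I(X;Y) = D(P(X,Y) || P(X)P(Y)) =", kl(pxy.ravel(), np.outer(px, py).ravel()))
```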
- April 11: Maximum Entropy Models. The missing key, the principle of insufficient reason, and the probability the sun will rise tomorrow - constraints on the empirical and estimated expectations - deriving the maximum-entropy distribution (the return of Lagrange) - Candide appears that the Bank of Boston has almost completed its review of Shawmut (or, improving statistical machine translation). (The resulting log-linear form is written out after the references below.)
- Adam Berger's MaxEnt reading list
- Adam Berger, Stephen Della Pietra, and Vincent Della Pietra, 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22(1).
- P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin, 1990. A statistical approach to machine translation. Computational Linguistics 16, pp. 79-85.
- P. Brown, S. Della Pietra, V. Della Pietra, and R. Mercer, 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19(2), pp. 263-311.
- Albert Einstein. "Everything should be made as simple as possible, but not simpler".
- John Maynard Keynes, 1921. A Treatise on Probability
- Pierre Simon Laplace, 1775. Essai philosophique sur les probabilités.
- See also chapters 13-14 of the Jelinek text.
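For concreteness, here is the standard result the April 11 lecture derives (constrained entropy maximization via Lagrange multipliers, as in the Berger et al. paper above): subject to matching the empirical expectations of the feature functions f_i, the maximum-entropy conditional distribution is log-linear.

```latex
% Maximum-entropy solution under the constraints E_p[f_i] = E_{\tilde p}[f_i]:
p^*(y \mid x) \;=\; \frac{1}{Z_\lambda(x)} \exp\Bigl(\sum_i \lambda_i f_i(x,y)\Bigr),
\qquad
Z_\lambda(x) \;=\; \sum_{y'} \exp\Bigl(\sum_i \lambda_i f_i(x,y')\Bigr)
% where the \lambda_i are the Lagrange multipliers for the constraints,
% fit by (for example) iterative scaling.
```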
- April 18: Maximum Entropy Models: iterative scaling (references as above). Deriving the update equations - sanity checks - comparisons to EM. (A small iterative-scaling sketch follows below.)
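A minimal sketch of Generalized Iterative Scaling for a toy unconditional model, under the GIS assumption that every outcome's feature values sum to the same constant C (in practice a slack feature enforces this); the update is lambda_i += (1/C) log(empirical expectation / model expectation). The features, empirical distribution, and iteration count below are made up for illustration.

```python
# Generalized Iterative Scaling (GIS) for a toy unconditional maxent model.
import numpy as np

F = np.array([[1.0, 0.0, 1.0],          # F[y, i] = f_i(y); every row sums to C = 2
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])
empirical = np.array([0.5, 0.3, 0.2])   # toy empirical distribution over the 3 outcomes
emp_feats = empirical @ F               # empirical feature expectations
C = F.sum(axis=1)[0]

lam = np.zeros(F.shape[1])
for _ in range(200):
    p = np.exp(F @ lam)
    p /= p.sum()                                 # current model distribution
    model_feats = p @ F                          # model feature expectations
    lam += np.log(emp_feats / model_feats) / C   # GIS update

p = np.exp(F @ lam)
p /= p.sum()
print(p)   # with full-rank features, this should approach the empirical distribution
```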
- April 25: Latent Semantic Indexing and Iterative Residual Rescaling. Subspace projection, or, why hopefully no one will make a model of car named the Chomsky - a geometric view of the singular value decomposition - an analysis not based on a generative model (for once) - compensating for non-uniformity via the IRR algorithm. (An LSI sketch follows the references below.)
- Rie Kubota Ando, 2000. Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. SIGIR '00.
- Rie Kubota Ando and Lillian Lee, 2001. Semantic space: On the effectiveness of iterative residual rescaling. SIGIR 2001, pp. 154-162.
- Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer and Richard Harshman, 1990. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41(6), pp. 391-407.
- Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki and Santosh Vempala, 2000. Latent Semantic Indexing: A Probabilistic Analysis. Journal of Computer and System Sciences 61(2), pp. 217--235.
- Telcordia (formerly BellCore) LSI page
- Colorado LSA homepage
- Tennessee LSI web page
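A minimal LSI sketch: take the truncated SVD of a toy term-document matrix and compare documents in the reduced space. The matrix below is made up, and real systems typically apply a term-weighting scheme such as tf-idf first; the IRR algorithm of Ando and Lee (above) modifies this basic subspace computation to compensate for non-uniform topic distributions.

```python
# Latent semantic indexing via a truncated SVD of a toy term-document matrix.
import numpy as np

# Rows = terms, columns = documents (raw counts; real systems weight these first).
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T      # k-dimensional document representations

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarities are computed in the latent subspace rather than raw term space.
print(cosine(docs_k[0], docs_k[1]), cosine(docs_k[0], docs_k[3]))
```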
Administrative Information
2 credits, S/U only. Grade based on attendance/participation.
Prereqs: elementary probability, mathematical maturity. COMS674 not required.
URL: http://www.cs.cornell.edu/courses/cs775/2001sp/
CS775, Spring '01
Lillian Lee
Related Links: NLP at CUCS | COMS674 | Other COMS course web pages | CS775, Spring 2000