CS/INFO 630, Spring 2006: Human Language Technology

(or, IR, NLP, and special guests; or, Representing and accessing digital information)

Note: After the Spring 2006 semester, this course was renumbered; the current version is CS674/INFO 630.

Prof. Lillian Lee (follow link for contact information and office hours)
TR 2:55-4:10, Bard 140

Official description (from the course catalog):

Information retrieval has evolved from the problem of locating books in a library to a multitude of tasks ubiquitous in business, science, and personal life. Modern information systems automatically compose newspapers, extract facts from the web, and analyze usage patterns. This course covers the necessary techniques for representing, organizing, and accessing digital information that is in textual or semistructured form. Topics combine information retrieval, natural language processing, and machine learning, with links to work in databases and data mining.

Lecture schedule:

  1. 1/24/06: "Sense and Sensibility: Automatically Analyzing Subject and Sentiment in Human-Authored Texts" (preface)
    • Handouts: course description and policies [pdf]

  2. 1/26/06: Basics of information retrieval; the vector-space model (an illustrative sketch follows the references below)
    • Lecture guides: cover sheet, Haque/Shaparenko, Rabkin/Krafft
    • Handouts: Examples for lecture-guide preparation [pdf] and an example (for a different class) of the scribe-notes portion [ps]
    • Lecture references:
      * Amit Singhal. Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin 24(4), pp. 35–43, 2001. [ps]
      * TREC conference proceedings. [website]. See especially the overviews.
    • Additional references:
      * See the course resources page for other overviews.
      * David Dubin. The most influential paper Gerard Salton never wrote. Library Trends, Spring 2004. [html] An interesting sidelight on the origins of the vector space model, the assumptions underlying the VSM, and phantom citations.
      * Karen Spärck Jones. IDF term weighting and IR research lessons. Journal of Documentation 60(5):521–523 (2004). [pdf] Reviews the history of the introduction of the three "consensus" term-weighting quantities. Notes that Spärck Jones' data consisted of presence/absence wordlists, thus explaining the non-incorporation of term-frequency information.
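    • Illustrative sketch (ours, not from the course handouts; the toy corpus and identifiers such as tfidf and cosine are invented for this example): a minimal Python implementation of TF-IDF weighting and cosine ranking.

        import math
        from collections import Counter

        # Toy corpus: each document is a list of tokens.
        docs = [
            "the cat sat on the mat".split(),
            "the dog chased the cat".split(),
            "information retrieval ranks documents".split(),
        ]
        N = len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(t for d in docs for t in set(d))

        def tfidf(tokens):
            # TF-IDF vector: tf(t) * log(N / df(t)); terms unseen in the corpus are skipped.
            tf = Counter(tokens)
            return {t: tf[t] * math.log(N / df[t]) for t in tf if df[t]}

        def cosine(u, v):
            dot = sum(w * v.get(t, 0.0) for t, w in u.items())
            norm = lambda vec: math.sqrt(sum(w * w for w in vec.values()))
            return dot / (norm(u) * norm(v)) if u and v else 0.0

        q = tfidf("cat mat".split())
        ranked = sorted(range(N), key=lambda i: cosine(q, tfidf(docs[i])), reverse=True)
        print(ranked)  # document 0 ("the cat sat on the mat") ranks first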

  3. 1/31/06: An example of empirical IR research: length normalization
    • Lecture guides: cover sheet for Danis/Rogan (statement of problem corrected 2/21/06), Danis/Rogan scribe notes, Danis/Rogan problems, Leshed/Shami (cover sheet for Leshed/Shami superseded by updated cover sheet for Danis/Rogan)
    • Lecture references:
      * Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization. SIGIR 1996. [ps, pdf]
      * Justin Zobel. How reliable are the results of large-scale information retrieval experiments? SIGIR 1998. [pdf, pdf2]
    • Additional references:
      * Mark Sanderson and Hideo Joho. Forming test collections with no system pooling. SIGIR 2004. [pdf]
      * Ellen Voorhees and Donna Harman. Overview of the Eighth Text REtrieval Conference (TREC-8) [good ps, poor pdf]. Section 3 discusses the effects of pooling.

  4. 2/2/06: Completion of pivoted document-length normalization (an illustrative sketch follows the references below)
    • Lecture guides: cover sheet, Dejgosha/Hu scribe notes, Hamatake/Park
    • Additional references:
      * Abdur Chowdhury, M. Catherine McCabe, David Grossman, and Ophir Frieder. Document normalization revisited. SIGIR poster (2002). [pdf, pdf2]
      * Jaap Kamps, Maarten de Rijke, and Börkur Sigurbjörnsson. Length normalization in XML retrieval. SIGIR (2004). [abstract, pdf]
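    • Illustrative sketch (ours; the slope value and names are invented for this example): the core idea of pivoted length normalization. Instead of dividing a document's raw score by its own length, divide by a normalizer tilted around the average length, so long documents are penalized less and short ones slightly more.

        def pivoted_normalizer(doc_len, avg_len, slope=0.25):
            # (1 - slope) * avg_len + slope * doc_len; the slope is tuned empirically.
            return (1.0 - slope) * avg_len + slope * doc_len

        # A raw term-match score of 4.0 for a 500-word document in a collection
        # whose average document is 200 words long:
        print(4.0 / pivoted_normalizer(500, 200))  # vs. 4.0 / 500 under plain length normalization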

  5. 2/7/06: Introduction to probabilistic retrieval (classic case; an illustrative sketch follows the references below)
    • Lecture guides: cover sheet, Au/Gerner/Kot, Babinski/Lin
    • Lecture references:
      * Karen Spärck Jones, Steve Walker, and Stephen Robertson. A probabilistic model of information retrieval: Development and comparative experiments. Information Processing and Management, pp. 779–808, 809–840 (2000). [html of preface, pdf of Part 1, pdf of Part 2]
      * William S. Cooper. Some inconsistencies and misidentified modeling assumptions in probabilistic information retrieval. ACM Transactions on Information Systems (TOIS), pp. 100–111, 1995. [pdf]
    • Additional references:
      * Fabio Crestani, Mounia Lalmas, C. J. Van Rijsbergen, and Iain Campbell. "Is this document relevant? ...probably": A survey of probabilistic models in information retrieval. ACM Computing Surveys 30(4) (1998). [pdf, pdf2]
      * M. E. Maron. Probabilistic approaches to the document retrieval problem. SIGIR, pp. 98–107 (1982). [pdf] Discusses models for interpreting the probability of relevance, and also gives a personal historical perspective on information science (as Maron construes it).
      * Stephen Robertson. The probability ranking principle in IR. Journal of Documentation 33 (1977). Reprinted in Spärck Jones and Peter Willett, eds., Readings in Information Retrieval (1997). Appendix gives an example where ranking by probability of relevance is suboptimal.
      * Stephen Robertson and Karen Spärck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science 27, 129–146 (1976). [pdf]
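    • Illustrative sketch (ours, following the standard presentation of the Robertson/Spärck Jones relevance weight; the 0.5 terms are the usual smoothing corrections):

        import math

        def rsj_weight(N, n, R, r):
            # N documents total, n containing the term, R judged relevant,
            # r relevant documents containing the term.
            return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                            ((n - r + 0.5) * (R - r + 0.5)))

        print(round(rsj_weight(N=1000, n=50, R=10, r=8), 3))
        # With no relevance information (R = r = 0), this reduces to an
        # IDF-like quantity, log((N - n + 0.5) / (n + 0.5)):
        print(round(rsj_weight(N=1000, n=50, R=0, r=0), 3))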

  6. 2/9/06: Completion of a derivation of (a version of) the RSJ model; the case of binary attributes; an isolated-Poisson model for term frequencies
    • Lecture guides: cover sheet, Connolly/Mujeeb, Haque/Shaparenko
    • Lecture references:
      * W. Bruce Croft and David Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation 35: 285–295 (1979). Reprinted in Spärck Jones and Peter Willett, eds., Readings in Information Retrieval (1997).
    • Additional references, sampling the continuing examination of explanations for and derivations of IDF weighting:
      * Kishore Papineni. Why inverse document frequency? NAACL (2001). [pdf]
      * Stephen Robertson. Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation 60(5):503–520 (2004). [pdf]
      * Arjen P. de Vries and Thomas Roelleke. Relevance Information: A Loss of Entropy but a Gain for IDF? SIGIR (2005). [pdf, pdf2]

  7. 2/14/06: The two-Poisson model and approximations (complete classic probabilistic retrieval; an illustrative BM25 sketch follows the references below)
    • Lecture guides: cover sheet (3/9/06: minor update to note that a corrected version of the lecture handout has been posted), Krafft/Rabkin, Leshed/Shami
    • Handout: pdf (3/7/06: corrected lecture number and noted typo in Singhal's Okapi equation)
    • Lecture references:
      * Abraham Bookstein and D. R. Swanson. Probabilistic models for automatic indexing. Journal of the American Society for Information Science, 25:312–318 (1974).
      * Stephen P. Harter. A probabilistic approach to automatic keyword indexing, part I: On the distribution of specialty words in a technical literature. Journal of the American Society for Information Science 26(4) (1975). [pdf]
      * Stephen E. Robertson and Steve Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. SIGIR, pp. 232–241 (1994) [pdf, pdf2]. Reprinted in Spärck Jones and Peter Willett, eds., Readings in Information Retrieval (1997).
    • Additional references:
      * Eugene L. Margulis. N-Poisson document modelling. SIGIR, pp. 177–189 (1992). [pdf]
      * S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3 (1994). [pdf, ps.gz] Introduces BM25.
      * Origin of the name "Okapi". [html]
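    • Illustrative sketch (ours; k1 = 1.2 and b = 0.75 are commonly cited defaults, not values from the lecture): one term's contribution to a BM25 score.

        import math

        def bm25_term(tf, df, N, doc_len, avg_len, k1=1.2, b=0.75):
            # IDF times a saturating, length-normalized term-frequency factor.
            idf = math.log((N - df + 0.5) / (df + 0.5))
            tf_factor = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
            return idf * tf_factor

        # tf = 3 occurrences in a 120-word document; df = 40 of N = 1000 documents;
        # average document length 100 words:
        print(round(bm25_term(tf=3, df=40, N=1000, doc_len=120, avg_len=100), 3))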

  8. 2/16/06: Introduction to the language modeling approach (an illustrative sketch follows the references below)
    • Lecture guides: cover sheet, Danis/Rogan scribe notes, Danis/Rogan problems
    • Lecture references:
      * Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. SIGIR (1998). [pdf]
      * John Lafferty and ChengXiang Zhai. Probabilistic relevance models based on document and query generation. In Croft and Lafferty, eds., Language Modeling and Information Retrieval (2003). [pdf, ps]
      * ChengXiang Zhai and John Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. SIGIR (2001). [pdf, ps]
    • Additional references:
      * Language modeling and information retrieval. The Lemur project. [html]
      * Xiaoyong Liu and W. Bruce Croft. Statistical language modeling for information retrieval. Annual Review of Information Science and Technology 39 (2003). [pdf]
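    • Illustrative sketch (ours; the toy corpus and the interpolation weight lam are invented): ranking by query likelihood under smoothed unigram document models, with Jelinek-Mercer (linear-interpolation) smoothing against the corpus model.

        import math
        from collections import Counter

        tokenized = [d.split() for d in ["the cat sat", "the dog ran", "cats and dogs"]]
        corpus = Counter(t for d in tokenized for t in d)
        corpus_total = sum(corpus.values())

        def query_log_likelihood(query, doc, lam=0.5):
            # log P(query | doc): P(t | d) = lam * tf(t,d)/|d| + (1 - lam) * cf(t)/|C|.
            tf = Counter(doc)
            return sum(math.log(lam * tf[t] / len(doc) +
                                (1 - lam) * corpus[t] / corpus_total)
                       for t in query.split())

        ranked = sorted(range(len(tokenized)),
                        key=lambda i: query_log_likelihood("the cat", tokenized[i]),
                        reverse=True)
        print(ranked)  # "the cat sat" assigns the query the highest likelihood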

  9. 2/21/06: More on the LM approach (an illustrative sketch follows the references below)
    • Lecture guides: cover sheet for Babinski/Lin, Babinski/Lin (3/10/06: lecturer-induced typo in exponent for HMM LM incorporated into guide and hence notice removed from cover sheet), Hamatake/Park scribe notes (3/16/06: a few typos corrected), Hamatake/Park problems
    • Additional references:
      * David MacKay and Linda Peto. A hierarchical Dirichlet language model. Natural Language Engineering 1(3):289–307 (1995). [pdf]
      * ChengXiang Zhai. Notes on the KL-divergence retrieval formula and Dirichlet prior smoothing (2003). [pdf] Our discussion of the effects of using corpus-dependent Dirichlet smoothing is modeled on the derivation given in this paper, although the initial scoring function is a bit different.
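    • Illustrative sketch (ours; mu = 2000 is a commonly cited default, not a value from the lecture): Dirichlet-prior smoothing as in Zhai's note above, where the corpus model contributes mu pseudo-counts to each document's language model.

        import math
        from collections import Counter

        def dirichlet_log_prob(query_terms, doc_terms, corpus_counts, mu=2000):
            # log P(query | doc) with P(t | d) = (tf(t,d) + mu * P(t|C)) / (|d| + mu).
            tf = Counter(doc_terms)
            total = sum(corpus_counts.values())
            return sum(math.log((tf[t] + mu * corpus_counts[t] / total) /
                                (len(doc_terms) + mu))
                       for t in query_terms)

        corpus = Counter("the cat sat the dog ran".split())
        print(round(dirichlet_log_prob(["cat"], "the cat sat".split(), corpus, mu=10), 3))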

  10. 2/23/06: Completion of alternate LM formulation; introduction to relevance feedback

  11. 2/28/06: Relevance feedback methods
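    • Illustrative sketch (ours; Rocchio's method is the classic member of this family, and the alpha/beta/gamma values below are conventional, not the lecture's): move the query vector toward the centroid of judged-relevant documents and away from the centroid of judged-nonrelevant ones.

        def rocchio(query, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
            # query and documents are {term: weight} dicts.
            terms = set(query) | {t for d in rel + nonrel for t in d}
            def centroid(ds, t):
                return sum(d.get(t, 0.0) for d in ds) / len(ds) if ds else 0.0
            new_q = {t: alpha * query.get(t, 0.0)
                        + beta * centroid(rel, t)
                        - gamma * centroid(nonrel, t)
                     for t in terms}
            return {t: w for t, w in new_q.items() if w > 0}  # clip negative weights

        print(rocchio({"jaguar": 1.0},
                      rel=[{"jaguar": 1.0, "car": 2.0}],
                      nonrel=[{"jaguar": 1.0, "cat": 3.0}]))
        # "car" enters the expanded query; "cat" is suppressed.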

  12. 3/2/06: Relevance feedback: further explorations and evaluation issues

  13. 3/7/06: Completion of relevance feedback; implicit feedback sources

  14. 3/9/06: An LM-based approach to implicit feedback

  15. 3/14/06: Clickthrough data as implicit feedback: human validation

3/16/06: Midterm exam. See also the mid(-)term notes (3/28/06) and the Lecture 15 Krafft/Rabkin guide, which includes brief solution sketches for some midterm questions.

  16. 3/28/06: Clickthrough data as relative implicit feedback
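    • Illustrative sketch (ours; the "skip above" reading of click logs is one common formulation): a clicked result is interpreted as preferred to every higher-ranked result that was skipped, rather than as an absolute relevance judgment.

        def preference_pairs(ranking, clicked):
            # Yield (preferred, worse) pairs: each clicked document beats
            # every unclicked document ranked above it.
            clicked = set(clicked)
            for i, doc in enumerate(ranking):
                if doc in clicked:
                    for skipped in ranking[:i]:
                        if skipped not in clicked:
                            yield (doc, skipped)

        print(list(preference_pairs(["d1", "d2", "d3", "d4"], clicked=["d3"])))
        # [('d3', 'd1'), ('d3', 'd2')]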

  17. 3/30/06: Matrix-theoretic characterizations of corpus structure (introduction to the singular value decomposition (SVD))

  18. 4/4/06: The SVD

  19. 4/6/06: "Few-factor" representations (LSI, pLSI, and others)
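    • Illustrative sketch (ours; the toy term-document matrix is invented): LSI via the truncated SVD, using NumPy. A is approximated by U_k S_k V_k^T, and documents are compared in the k-dimensional latent space.

        import numpy as np

        A = np.array([[1, 1, 0, 0],   # rows = terms, columns = documents
                      [1, 1, 0, 0],
                      [0, 0, 1, 1],
                      [0, 0, 1, 1],
                      [0, 1, 1, 0]], dtype=float)

        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        k = 2
        doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # k-dimensional document representations
        print(np.round(doc_vecs, 3))            # documents 1-2 and 3-4 pair up in the latent space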

  20. 4/11/06: Modeling syntactic structure: context-free grammars

  21. 4/13/06: Feature-based context-free grammars; introduction to tree-adjoining grammars

  22. 4/18/06: TAGs, continued

  23. 4/20/06: TAGs for idiom analysis; adjunction constraints

  24. 4/25/06: Algorithms for grammar parsing and learning

  25. 4/27/06: The EM algorithm
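    • Illustrative sketch (ours, not the lecture's example): EM on a toy problem: fitting a mixture of two biased coins from (heads, total) flip counts when the coin identities are hidden.

        def em_two_coins(flip_counts, iters=100):
            # Returns (p, q, w): the two coins' head probabilities and the mixing weight.
            p, q, w = 0.6, 0.5, 0.5  # asymmetric initial guesses break the tie
            for _ in range(iters):
                # E-step: responsibility of coin A for each flip sequence.
                resp = []
                for h, n in flip_counts:
                    a = w * p**h * (1 - p)**(n - h)
                    b = (1 - w) * q**h * (1 - q)**(n - h)
                    resp.append(a / (a + b))
                # M-step: re-estimate parameters from expected counts.
                p = (sum(r * h for r, (h, n) in zip(resp, flip_counts)) /
                     sum(r * n for r, (h, n) in zip(resp, flip_counts)))
                q = (sum((1 - r) * h for r, (h, n) in zip(resp, flip_counts)) /
                     sum((1 - r) * n for r, (h, n) in zip(resp, flip_counts)))
                w = sum(resp) / len(resp)
            return p, q, w

        data = [(8, 10), (9, 10), (2, 10), (3, 10), (7, 10)]
        print(tuple(round(x, 2) for x in em_two_coins(data)))  # roughly (0.8, 0.25, 0.6)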

  26. 5/2/06: Conclusion of EM; introduction to maximum-entropy models

  27. 5/4/06: Conclusion of maximum entropy and of the class