Algorithms in the Real World: Indexing (original) (raw)
CS294-3: Algorithms in the Real World (Guy Blelloch, Fall 97)
Topic 9: Indexing and Searching
[ Topics | Scribe Notes | Readings | Text Books | Links ]
Topic Outline
- Introduction
- Types of queries (logical, proximity, keyword sets, wildcards, edit-distance)
- General techniques (case folding, stemming, stop words)
- Inverted indices
- Compressing posting lists
- Lexicon access
- Sorted list, front coding, B-trees, tries, perfect hashing
- Wildcards using n-grams or rotated lexicon
- Merging posting lists for queries
- Signature files
- Generating term and document signatures
- Boolean logic on signatures
- Efficient queries on signatures
- Vector space models
- Selecting weights (frequency and "information content")
- Similarity measures (dot-product, cosine)
- Efficient queries using augmented inverted indices
- Used with relevance feedback
- Clustering
- Latent semantic indexing
- The Singular value decomposition (SVD) and properties
- Indexing by taking first k columns/rows of the SVD
- Applications and performance
Scribe Notes
- Indexing 1 (draft) (Helen Wang)
- Indexing 2 (draft) (Ben Horowitz)
- Evolutionary Trees and Indexing 3 (draft) (Amar Chaudhary)
Readings
- Ian H. Witten, Alistair Moffat, and Timothy C. Bell.Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994.
- Chapter 3: Indexing
Text Books
- Christos Faloutsos._Searching Multimedia Data Bases by Content._Kluwer Academic, 1996.
- Frakes and Baeza-Yates (ed.).Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992
- Frants V. J., Shapiro J., Voiskunskii V. G. _Automated Information Retrieval Theory and Methods_Academic Press, Aug 1997.
- Lesk M._Practical Digital Libraries Books, Bytes & Bucks_Morgan Kaufman Publishers, 1997.
- Salton._Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer._Addison-Wesley, 1989.
- Sparck Jones K. and Willett P. (editors)._Readings in Information Retrieval._Morgan Kaufman Publishers, 1997.
- Ian H. Witten, Alistair Moffat, and Timothy C. Bell.Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, 1994.
Further Readings and Links
- Digital Libraries
- The digital libraries initiative. Includes projects at Carnegie Mellon U., Stanford U., UC Berkeley, UC Santa Barbara, U. Illinois, U. Michigan.
- Digital Library Related Information and Resources
- Systems for Indexing/Searching your own site
- A long list of indexing/search engines that can be used to search your own site.
- Web Site search engines. A nice overview of the various search engines available for indexing your own site
- Glimpse
- Information on Web search engines
- Lycos: Design choices in an Internet search service
- Other relevant info
- The Text Retrieval Conference (TREC). "The goal of the conference series is to encourage research in information retrieval from large text applications by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results." Sponsored by NIST and DARPA.
- Latent Semantic Indexing (LSI)
- J. Kleinberg. Authoritative sources in a hyperlinked environment. This is the paper that David Wagner discussed in class.
Back to the Algorithms in the Real World home page.