The Structure of Information Networks (CS 685, Jon Kleinberg) (original) (raw)
Phillips 203.
Overview
Information networks such as the World Wide Web are characterized by the interplay between heterogeneous content and a complex underlying link structure. This course covers recent research on algorithms for analyzing such networks, and models that capture their basic properties. Topics include methods for link analysis, centralized and decentralized search algorithms, probabilistic models for networks, and connections with work in the areas of social networks and citation analysis.
The course pre-requisites include background in algorithms and graphs at the level of CS 482, as well as some familiarity with probability and linear algebra.
The work for the course will consist of a mixture of reaction papers and a few problem sets, concluding with a project. The coursework is discussed in more detail here.
Course Outline
(1) Overview and Background.
Several things laid the groundwork for the material in this course, but two stand out in particular: the increasing availability of network data across technological, social, and biological domains; and the rise of the Web as a central object of study in computer science. We begin by surveying this background material, with a particular emphasis on the World Wide Web.
- V. Bush.As We May Think.Atlantic Monthly, July 1945.
- World Wide Web Consortium.A Little History of the World Wide Web, 1945-1995.
- A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener.Graph structure in the web.9th International World Wide Web Conference, May 2000.
(2) What Can Link Structure Tell Us About Content?
Link structure can be a powerful source of information about the underlying content in the network. In the context of the Web, we can try to identify high-quality information resources from the way in which other pages link to them; this idea has reflections in the analysis of citation data to find influential journals, and in the analysis of social networks to find important members of a community.
- The Hub/Authority and PageRank algorithms
- J. Kleinberg.Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46(1999). Also appears as IBM Research Report RJ 10076, May 1997.
- S. Brin and L. Page.The Anatomy of a Large-Scale Hypertextual Web Search Engine.Proc. 7th International World Wide Web Conference, 1998.
- J. Kleinberg.Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. Extended version in Journal of the ACM 46(1999). Also appears as IBM Research Report RJ 10076, May 1997.
- Some Methodological Issues in Web Link Analysis Algorithms
- Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian.The Connectivity Server: fast access to linkage information on the Web.Proc. 7th International World Wide Web Conference, 1998.
- Brian Amento, Loren Terveen, and Will Hill.Does "Authority" Mean Quality? Predicting Expert Quality Ratings of Web Documents.23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.
- B. D. Davison. Recognizing Nepotistic Links on the Web.AAAI Workshop on Artificial Intelligence for Web Search, 2000.
- Arvind Arasu, Jasmine Novak, Andrew Tomkins, John TomlinPageRank Computation and the Structure of the Web: Experiments and Algorithms.11th International World Wide Web Conference, 2002.
- B. D. Davison. Recognizing Nepotistic Links on the Web.AAAI Workshop on Artificial Intelligence for Web Search, 2000.
- Brian Amento, Loren Terveen, and Will Hill.Does "Authority" Mean Quality? Predicting Expert Quality Ratings of Web Documents.23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.
- Krishna Bharat, Andrei Broder, Monika Henzinger, Puneet Kumar, and Suresh Venkatasubramanian.The Connectivity Server: fast access to linkage information on the Web.Proc. 7th International World Wide Web Conference, 1998.
- Connections with Social Networks and Citation Analysis.
- G. Pinski, F. Narin. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing and Management, 12(1976), pp. 297--312.
- L. Katz. A new status index derived from sociometric analysis. Psychometrika 18(1953).
- C.H. Hubbell. An input-output approach to clique identification. Sociometry 28(1965).
- P. Bonacich. Power and Centrality: A family of measures. American Journal of Sociology 92(1987).
- C.H. Hubbell. An input-output approach to clique identification. Sociometry 28(1965).
- L. Katz. A new status index derived from sociometric analysis. Psychometrika 18(1953).
- G. Pinski, F. Narin. Citation influence for journal aggregates of scientific publications: Theory, with application to the literature of physics. Information Processing and Management, 12(1976), pp. 297--312.
- Probabilistic Models for Link Analysis
- David Cohn and Huan Chang Probabilistically Identifying Authoritative Documents.17th International Conference on Machine Learning, 2000.
- R. Lempel, S. Moran.The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect.9th International World Wide Web Conference, May 2000.
- A. Borodin, J. S. Rosenthal, G. O. Roberts, P. Tsaparas,Finding Authorities and Hubs From Link Structures on the World Wide Web.10th International World Wide Web Conference, May 2001.
- R. Lempel, S. Moran.The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect.9th International World Wide Web Conference, May 2000.
- David Cohn and Huan Chang Probabilistically Identifying Authoritative Documents.17th International Conference on Machine Learning, 2000.
- Relationships between Text and Hyperlinks
- B. D. Davison.Topical Locality in the Web.23rd Annual International Conference on Research and Development in Information Retrieval, 2000.
- F. Menczer.Links tell us about lexical and semantic Web content.arXiv cs.IR/0108004.
- M.E. Frisse. Searching for information in a hypertext medical handbook. Communications of the ACM 31(7).
- J. Boyan and D. Freitag and T. Joachims. A Machine Learning Architecture for Optimizing Web Search Engines.AAAI Workshop on Internet Based Information Systems, 1996.
- Massimo Marchiori.The Quest for Correct Information on the Web: Hyper Search Engines.Proc. 7th International World Wide Web Conference, 1998.
- J. Boyan and D. Freitag and T. Joachims. A Machine Learning Architecture for Optimizing Web Search Engines.AAAI Workshop on Internet Based Information Systems, 1996.
- M.E. Frisse. Searching for information in a hypertext medical handbook. Communications of the ACM 31(7).
- F. Menczer.Links tell us about lexical and semantic Web content.arXiv cs.IR/0108004.
- B. D. Davison.Topical Locality in the Web.23rd Annual International Conference on Research and Development in Information Retrieval, 2000.
- Incorporating Textual Content into Link Analysis
- Krishna Bharat and Monika R. Henzinger.Improved algorithms for topic distillation in a hyperlinked environment.21st International Conference on Research and Development in Information Retrieval (SIGIR 1998).
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins.Mining the link structure of the World Wide Web.IEEE Computer, August 1999.
- D. Cohn, T. Hofmann,The Missing Link -- A Probabilistic Model of Document Content and Hypertext Connectivity.Advances in Neural Information Processing Systems (NIPS) 13, 2000.
- Soumen Chakrabarti, Mukul M. Joshi and Vivek B. Tawde. Enhanced topic distillation using text, markup tags, and hyperlinks.24th Annual International Conference on Research and Development in Information Retrieval, 2001.
- Davood Rafiei, Alberto Mendelzon. What is this Page Known for? Computing Web Page Reputations. Proc. WWW9 Conference, Amsterdam, May 2000
- Pedro Domingos, Matt Richardson.The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank.Advances in Neural Information Processing Systems 14, 2002.
- Taher H. Haveliwala.Topic-Sensitive PageRank.11th International World Wide Web Conference, 2002.
- Pedro Domingos, Matt Richardson.The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank.Advances in Neural Information Processing Systems 14, 2002.
- Davood Rafiei, Alberto Mendelzon. What is this Page Known for? Computing Web Page Reputations. Proc. WWW9 Conference, Amsterdam, May 2000
- Soumen Chakrabarti, Mukul M. Joshi and Vivek B. Tawde. Enhanced topic distillation using text, markup tags, and hyperlinks.24th Annual International Conference on Research and Development in Information Retrieval, 2001.
- D. Cohn, T. Hofmann,The Missing Link -- A Probabilistic Model of Document Content and Hypertext Connectivity.Advances in Neural Information Processing Systems (NIPS) 13, 2000.
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins.Mining the link structure of the World Wide Web.IEEE Computer, August 1999.
- Krishna Bharat and Monika R. Henzinger.Improved algorithms for topic distillation in a hyperlinked environment.21st International Conference on Research and Development in Information Retrieval (SIGIR 1998).
- Further Applications of Link Analysis
- Jeffrey Dean and Monika R. Henzinger. Finding Related Web Pages in the World Wide Web.8th International World Wide Web, 1999.
- R. Lempel, A. Soffer.PicASHOW: Pictorial Authority Search by Hyperlinks on the Web.10th International World Wide Web Conference, May 2001.
- Jeffrey Dean and Monika R. Henzinger. Finding Related Web Pages in the World Wide Web.8th International World Wide Web, 1999.
(3) Inferring Communities from Link Topology
We began the discussion of link analysis with the goal of searching for and ranking high-quality content. A lot of the methods to do this, however, uncover richer structure in the network --- densely connected _communties_focused on a particular topic. The problem of uncovering such community structure differs in some interesting ways from the heavily studied problems of clustering and graph partitioning (though there are clear relationships): rather than trying to decompose the full network into a few pieces, the goal is to identify small, densely connected regions that -- even put together -- may not account for much of the whole graph.
- Eigenvector Analysis
- Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A.Indexing by latent semantic analysis.Journal of the Society for Information Science, 41(6), 391-407 (1990).
- Christos Papadimitriou, Prabhakar Raghavan Hisao Tamaki, Santosh Vempala.Latent Semantic Indexing: A Probabilistic Analysis.17th ACM Symposium on the Principles of Database Systems, 1998.
- D. Gibson, J. Kleinberg, P. Raghavan. Inferring Web communities from link topology.Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998.
- P. Drineas, Ravi Kannan, Alan Frieze, Santosh Vempala and V. Vinay"Clustering in large graphs and matrices."Proc. of the 10th ACM-SIAM Symposium on Discrete Algorithms, Baltimore, 1999.
- Yossi Azar, Amos Fiat, Anna Karlin, Frank McSherry and Jared Saia. Spectral Analysis of Data.33rd ACM Symposium on Theory of Computing, 2001.
- Dimitris Achlioptas, Amos Fiat, Anna Karlin, Frank McSherry,Web Search via Hub Synthesis.42nd IEEE Symposium on Foundations of Computer Science, 2001, p.611-618.
- A. Y. Ng, A. X. Zheng, and M. I. Jordan.Link analysis, eigenvectors, and stability.International Joint Conference on Artificial Intelligence (IJCAI), 2001.
- A. Y. Ng, A. X. Zheng, and M. I. Jordan.Stable algorithms for link analysis.24th International Conference on Research and Development in Information Retrieval (SIGIR 2001).
- D. Donoho.High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality.Notes to accompany lecture at AMS Conference on Mathematical Challenges of the 21st Century, August 2000.
- A. Y. Ng, A. X. Zheng, and M. I. Jordan.Stable algorithms for link analysis.24th International Conference on Research and Development in Information Retrieval (SIGIR 2001).
- A. Y. Ng, A. X. Zheng, and M. I. Jordan.Link analysis, eigenvectors, and stability.International Joint Conference on Artificial Intelligence (IJCAI), 2001.
- Dimitris Achlioptas, Amos Fiat, Anna Karlin, Frank McSherry,Web Search via Hub Synthesis.42nd IEEE Symposium on Foundations of Computer Science, 2001, p.611-618.
- Yossi Azar, Amos Fiat, Anna Karlin, Frank McSherry and Jared Saia. Spectral Analysis of Data.33rd ACM Symposium on Theory of Computing, 2001.
- P. Drineas, Ravi Kannan, Alan Frieze, Santosh Vempala and V. Vinay"Clustering in large graphs and matrices."Proc. of the 10th ACM-SIAM Symposium on Discrete Algorithms, Baltimore, 1999.
- D. Gibson, J. Kleinberg, P. Raghavan. Inferring Web communities from link topology.Proc. 9th ACM Conference on Hypertext and Hypermedia, 1998.
- Christos Papadimitriou, Prabhakar Raghavan Hisao Tamaki, Santosh Vempala.Latent Semantic Indexing: A Probabilistic Analysis.17th ACM Symposium on the Principles of Database Systems, 1998.
- Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A.Indexing by latent semantic analysis.Journal of the Society for Information Science, 41(6), 391-407 (1990).
- Combinatorial Methods
- R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins.Trawling the web for emerging cyber-communities.8th International World Wide Web Conference, May 1999.
- R. Agrawal, R. Srikant.Fast Algorithms for Mining Association Rules.20th Int'l Conference on Very Large Databases (VLDB), 1994.
- S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, A. Tomkins. Self-similarity in the Web.27th International Conference on Very Large Data Bases, 2001.
- Gary Flake, Steve Lawrence, C. Lee Giles, Frans Coetzee.Self-Organization and Identification of Web Communities.IEEE Computer, 35:3, March 2002.
- Gary Flake, K. Tsioutsiouliklis, R.E. Tarjan.Graph Clustering Techniques based on Minimum Cut Trees.Technical Report 2002-06, NEC, Princeton, NJ, 2002.
- M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 8271-8276 (2002).
- M. Granovetter. The strength of weak ties.American Journal of Sociology, 78(6):1360-1380, 1973.
- M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99, 8271-8276 (2002).
- Gary Flake, K. Tsioutsiouliklis, R.E. Tarjan.Graph Clustering Techniques based on Minimum Cut Trees.Technical Report 2002-06, NEC, Princeton, NJ, 2002.
- Gary Flake, Steve Lawrence, C. Lee Giles, Frans Coetzee.Self-Organization and Identification of Web Communities.IEEE Computer, 35:3, March 2002.
- S. Dill, R. Kumar, K. McCurley, S. Rajagopalan, D. Sivakumar, A. Tomkins. Self-similarity in the Web.27th International Conference on Very Large Data Bases, 2001.
- R. Agrawal, R. Srikant.Fast Algorithms for Mining Association Rules.20th Int'l Conference on Very Large Databases (VLDB), 1994.
- R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins.Trawling the web for emerging cyber-communities.8th International World Wide Web Conference, May 1999.
(4) Rank Aggregation and Meta-Search
We have now seen a number of different techniques for ranking Web pages according to various measures of quality. Are there principled methods for combining them, to get an even better `meta-ranking'? We begin this discussion with a celebrated result of Arrow from mathematical economics, suggesting some of the trade-offs inherent in such an approach.
- K. Arrow. Social Choice and Individual Values. Wiley, 1951.
- H.P. Young. Condorcet's Theory of Voting. American Political Science Review 82(1988), pp. 1231-1244.
- Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar.Rank Aggregation Methods for the Web.10th International World Wide Web Conference, May 2001.
- Erik Selberg, Oren Etzioni.The MetaCrawler Architecture for Resource Aggregation on the Web.IEEE Expert, 1997.
- Weiyi Meng, Clement Yu, King-Lup Liu.Building Efficient and Effective Metasearch Engines.ACM Computing Surveys 34(2002).
- William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. Journal of Artificial Intelligence Research, 10: 243--270, 1999.
- T. Joachims. Optimizing Search Engines Using Clickthrough Data.Eighth International Conference on Knowledge Discovery and Data Mining, KDD-2002.
(5) Power-Law Distributions
If we were to generate a random graph on n nodes by including each possible edge independently with some probability p, then the fraction of nodes with d neighbors would decrease exponentially in d. But for many large networks -- including the Web, the Internet, collaboration networks, and semantic networks -- it quickly became clear that the fraction of nodes with d neighbors decreases only polynomially in d; to put it differently, the distribution of degrees obeys a power law. What processes are capable of generating such power laws, and why should they be ubiquitous in large networks? The investigation of these questions suggests that power laws are just one reflection of the local and global processes driving the evolution of these networks.
- A.-L. Barabasi, Reka Albert, and Hawoong Jeong.Mean-field theory for scale-free random networks.Physica A 272 173-187 (1999).
- Bernardo A. Huberman, Lada A. Adamic.Growth dynamics of the World-Wide Web.Nature, 399 (1999) 130.
- Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos. On Power-Law Relationships of the Internet Topology.ACM SIGCOMM 1999.
- R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the Web graph.41th IEEE Symp. on Foundations of Computer Science, 2000, pp. 57-65.
- W. Aiello, F. Chung, L. Lu.Random evolution of massive graphs.Handbook of Massive Data Sets, (Eds. James Abello et al.), Kluwer, 2002, pages 97-122.
- M. Steyvers, J. B. Tenenbaum.The large-scale structure of semantic networks: statistical analyses and a model of semantic growth.2001.
- M. Mitzenmacher. A Brief History of Generative Models for Power Law and Lognormal Distributions.Allerton Conference 2001.
- R. Albert and A.-L.Barabasi.Statistical mechanics of complex networks.Reviews of Modern Physics 74, 47 (2002).
- Qian Chen, Hyunseok Chang. Ramesh Govindan, Sugih Jamin, Scott J. Shenker, Walter Willinger.The Origin of Power Laws in Internet Topologies Revisited.Proc. of IEEE Infocom 2002.
- J. Carlson and J. Doyle.Highly Optimized Tolerance: A Mechanism for Power Laws in Designed Systems.Physical Review E 60:2(1999).
- A. Fabrikant, E. Koutsoupias, C. Papadimitriou. Heuristically Optimized Trade-offs: A New Paradigm for Power Laws in the Internet.29th International Colloquium on Automata, Languages, and Programming (ICALP), 2002.
- M. Molloy and B. Reed. A Critical Point for Random Graphs with a Given Degree Sequence.Random Structures and Algorithms 6(1995) 161-180.
- M. E. J. Newman, S. H. Strogatz and D. J. Watts, Random graphs with arbitrary degree distributions and their applications.Phys. Rev. E 64, 026118 (2001).
(6) Decentralized Search and the Small-World Phenomenon
We now shift our focus to problems with a decentralized flavor: rather than assuming we have access to a central index of the network, we consider the issue of exploring a network `from the inside', without global knowledge. This could be for purposes of search or approximate measurement; it could also be for the purpose of constructing a complete index, in order to make centralized algorithms possible. We begin the discussion of this issue with one of the oldest results on decentralized search -- Stanley Milgram's `small-world' experiments from the 1960's, whose participants forwarded letters through chains of acquaintances to a designated target, and whose striking outcome established the `six degrees of separation' principle. What is a general framework for thinking about this type of result, and what are the general properties that make a network `searchable'?
- S. Milgram. The small world problem. Psychology Today 1(1967).
- J. Travers and S. Milgram. An experimental study of the small world problem. Sociometry 32(1969).
- P. Killworth and H. Bernard, Reverse small world experiment. Social Networks 1(1978).
- Watts, D. J. and S. H. Strogatz. Collective dynamics of 'small-world' networks.Nature 393:440-42(1998).
- J. Kleinberg. The small-world phenomenon: An algorithmic perspective.Proc. 32nd ACM Symposium on Theory of Computing, 2000.
- J. Kleinberg. Small-World Phenomena and the Dynamics of Information.Advances in Neural Information Processing Systems (NIPS) 14, 2001.
- D. J. Watts, P. S. Dodds, M. E. J. NewmanIdentity and Search in Social Networks.Science, 296, 1302-1305, 2002.
- F. Menczer. Growing and Navigating the Small World Web by Local Content.Proc. Natl. Acad. Sci. USA 99(22): 14014-14019, 2002
- J. Kleinfeld.Could it be a Big World After All? The `Six Degrees of Separation' Myth.Society, April 2002.
(7) Decentralized Search in Peer-to-Peer Networks
Recently, decentralized search has been applied to the problem of sharing files in a peer-to-peer network without a global index. Each host in the system holds a subset of the content, and requests must be routed to the appropriate host in a decentralized fashion. As in the case of the small-world problem, the goal is a network that is easily searchable; but how should a distributed protocol shape the network topology so as to attain this goal?
- Unstructured Approaches
- T. Hong. Performance.Peer-to-Peer: Harnessing the Power of Disruptive Technologies. (A. Oram, editor), O'Reilly and Associates, 2001.
- E. Cohen, S. Shenker.Replication Strategies in Unstructured Peer-to-Peer Networks. SIGCOMM 2002.
- Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani, Bernardo A. Huberman.Search in Power-Law Networks.Phys. Rev. E, 64 46135 (2001).
- A. Goel, H. Zhang, and R. Govindan. Using the Small-World Model to Improve Freenet Performance.IEEE Infocom, 2002.
- Lada A. Adamic, Rajan M. Lukose, Amit R. Puniyani, Bernardo A. Huberman.Search in Power-Law Networks.Phys. Rev. E, 64 46135 (2001).
- E. Cohen, S. Shenker.Replication Strategies in Unstructured Peer-to-Peer Networks. SIGCOMM 2002.
- T. Hong. Performance.Peer-to-Peer: Harnessing the Power of Disruptive Technologies. (A. Oram, editor), O'Reilly and Associates, 2001.
- Structured Approaches
- C. Greg Plaxton, Rajmohan Rajaraman, Andrea W. Richa.Accessing Nearby Copies of Replicated Objects in a Distributed Environment.ACM Symposium on Parallel Algorithms and Architectures, SPAA 1997.
- S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker.A Scalable Content-Addressable Network.ACM SIGCOMM, 2001
- A. Rowstron, P. Druschel.Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems.18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware 2001).
- I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan.Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications.ACM SIGCOMM, 2001.
- B. Y. Zhao, J. D. Kubiatowicz, A. D. Joseph,Tapestry: An Infrastructure for Fault-Tolerant Wide-Area Location and Routing.UC Berkeley Computer Science Division, Report No. UCB/CSD 01/1141, April 2001.
- Sylvia Ratnasamy, Scott Shenker and Ion Stoica.Routing Algorithms for DHTs: Some Open Questions.1st International Workshop on Peer-to-Peer Systems (IPTPS), 2002.
- Dalia Malkhi, Moni Naor, David Ratajczak. Viceroy: A Scalable and Dynamic Emulation of the Butterfly.ACM Symposium on Principles of Distributed Computing, 2002.
- Sylvia Ratnasamy, Scott Shenker and Ion Stoica.Routing Algorithms for DHTs: Some Open Questions.1st International Workshop on Peer-to-Peer Systems (IPTPS), 2002.
- B. Y. Zhao, J. D. Kubiatowicz, A. D. Joseph,Tapestry: An Infrastructure for Fault-Tolerant Wide-Area Location and Routing.UC Berkeley Computer Science Division, Report No. UCB/CSD 01/1141, April 2001.
- I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrishnan.Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications.ACM SIGCOMM, 2001.
- A. Rowstron, P. Druschel.Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems.18th IFIP/ACM International Conference on Distributed Systems Platforms (Middleware 2001).
- S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker.A Scalable Content-Addressable Network.ACM SIGCOMM, 2001
- C. Greg Plaxton, Rajmohan Rajaraman, Andrea W. Richa.Accessing Nearby Copies of Replicated Objects in a Distributed Environment.ACM Symposium on Parallel Algorithms and Architectures, SPAA 1997.
(8) Exploring the Web by Crawling, Focused Crawling, and Approximate Sampling
In the Web, crawlers are the bridge between the decentralized and centralized worlds -- they gather the content so it can be centrally indexed. We consider some of the issues in scaling a crawler up to the scope of the full Web, as well as the problem of_focused crawling_, in which one simply wants to crawl a particular subset of the Web -- for example, to gather pages only on a particular topic. Finally, we consider methods based on random sampling and random walks for determining aggregate properties of the network and its indices.
- Crawling Methodology
- Allan Heydon, Marc Najork.High-Performance Web Crawling.SRC Research Report 173, September 2001.
- Junghoo Cho and Hector Garcia-Molina.Parallel Crawlers.11th World Wide Web conference, May 2002.
- Junghoo Cho, Hector Garcia-Molina Synchronizing a database to Improve Freshness.ACM International Conference on Management of Data (SIGMOD), May 2000.
- Junghoo Cho and Hector Garcia-Molina.Parallel Crawlers.11th World Wide Web conference, May 2002.
- Allan Heydon, Marc Najork.High-Performance Web Crawling.SRC Research Report 173, September 2001.
- Focused Crawling
- Junghoo Cho, Hector Garcia-Molina, Lawrence Page.Efficient Crawling Through URL Ordering.7th World Wide Web Conference (WWW7), Brisbane, Australia, April 1998.
- S. Chakrabarti, M. van den Berg, and B. Dom.Focused crawling: A new approach to topic-specific Web resource discovery.8th International World Wide Web Conference, May 1999.
- M. Diligenti, F.M. Coetzee, S. Lawrence, C.L. Giles, M. GoriFocused Crawling Using Context Graphs.26th International Conference on Very Large Databases, VLDB 2000.
- D. Bergmark.Collection Synthesis.Joint Conference on Digital Libraries 2002.
- M. Diligenti, F.M. Coetzee, S. Lawrence, C.L. Giles, M. GoriFocused Crawling Using Context Graphs.26th International Conference on Very Large Databases, VLDB 2000.
- S. Chakrabarti, M. van den Berg, and B. Dom.Focused crawling: A new approach to topic-specific Web resource discovery.8th International World Wide Web Conference, May 1999.
- Junghoo Cho, Hector Garcia-Molina, Lawrence Page.Efficient Crawling Through URL Ordering.7th World Wide Web Conference (WWW7), Brisbane, Australia, April 1998.
- Approximate Sampling
- K. Bharat and A. Broder.A technique for measuring the relative size and overlap of public Web search engines.Proc. 7th International World Wide Web Conference, 1998.
- Steve Lawrence and C. Lee Giles. Accessibility and Distribution of Information on the Web. Nature 400(6740): 107-109, July 8, 1999.
- M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork.Measuring Index Quality using Random Walks on the Web.8th International World Wide Web Conference, May 1999.
- M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork.On Near-Uniform URL Sampling .9th International World Wide Web Conference, May 2000.
- Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz.Approximating Aggregate Queries about Web Pages via Random Walks.26th International Conference on Very Large Databases (VLDB), 2000, pages 535-544.
- Soumen Chakrabarti, Mukul Joshi, Kunal Punera, and David M. Pennock. The structure of broad topics on the Web. 11th World Wide Web conference, May 2002.
- Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz.Approximating Aggregate Queries about Web Pages via Random Walks.26th International Conference on Very Large Databases (VLDB), 2000, pages 535-544.
- M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork.On Near-Uniform URL Sampling .9th International World Wide Web Conference, May 2000.
- M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork.Measuring Index Quality using Random Walks on the Web.8th International World Wide Web Conference, May 1999.
- Steve Lawrence and C. Lee Giles. Accessibility and Distribution of Information on the Web. Nature 400(6740): 107-109, July 8, 1999.
- K. Bharat and A. Broder.A technique for measuring the relative size and overlap of public Web search engines.Proc. 7th International World Wide Web Conference, 1998.
(9) Link-Based Classification
Link analysis plays a role in a problem that is related to the ones we have been considering: the task of classifying Web pages into topic categories. Automated text analysis can give us an estimate of the topic of each page; but we also suspect that pages have some tendency to be similar to neighboring pages in the link structure. How should we combine these two sources of evidence? A number of probabilistic frameworks are useful for this task, including the formalism of Markov random fields, which -- for quite different applications -- has been extensively studied in computer vision.
- J. Besag. Spatial interaction and the statistical analysis of lattice systems. J. Royal Statistical Society B, 36(1974).
- Soumen Chakrabarti, Byron E. Dom, and Piotr Indyk. Enhanced hypertext categorization using hyperlinks.Proceedings of the ACM International Conference on Management of Data, SIGMOD 1998, pages 307-318.
- O. Veksler.Efficient Graph-Based Energy Minimization Methods in Computer Vision.Ph.D. Thesis, Cornell University, 1999.
- J. Kleinberg, E. Tardos. Approximation Algorithms for Classification Problems with Pairwise Relationships: Metric Labeling and Markov Random Fields.Proc. 40th IEEE Symposium on Foundations of Computer Science, 1999.
- A. Broder, R. Krauthgamer, and M. Mitzenmacher. Improved Classification via Connectivity Information.ACM-SIAM Symposium on Discrete Algorithms, 2000.
- Avrim Blum, Shuchi Chawla. Learning from Labeled and Unlabeled Data using Graph Mincuts.International Conference on Machine Learning (ICML), 2001.
- T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite Kernels for Hypertext Categorisation.International Conference on Machine Learning (ICML), 2001.
- B. Taskar, P. Abbeel and D. Koller. Discriminative Probabilistic Models for Relational Data.Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI 2002).
(10) Diffusion of Information Through Networks
We can think of a network as a large circulatory system, through which information continuously flows. This diffusion of information can happen rapidly or slowly; it can be disastrous -- as in a panic or cascading failure -- or beneficial -- as in the spread of an innovation. Work in several areas has proposed models for such processes, and investigated when a network is more or less susceptible to their spread.
- M. Granovetter. Threshold models of collective behavior.American Journal of Sociology 83(6):1420-1443, 1978.
- T. Schelling. Micromotives and Macrobehavior. Norton, 1978.
- S. Morris.Contagion.Review of Economic Studies 67 (2000), 57-78.
- H. Peyton Young.The Diffusion of Innovations in Social Networks.Santa Fe Institute Working Paper 02-04-018.
- E. Berger.Dynamic Monopolies of Constant Size.Journal of Combinatorial Theory Series B 83(2001), 191-200.
- Chalee Asavathiratham.The Influence Model: A Tractable Representation for the Dynamics of Networked Markov Chains.Ph.D. Thesis, MIT 2000.
- Pedro Domingos, Matt Richardson. Mining Knowledge-Sharing Sites for Viral Marketing.Eighth International Conference on Knowledge Discovery and Data Mining, KDD-2002.
(11) Epidemic Algorithms in Networks
The type of diffusion or cascade process we discussed in the previous part can also be used as a design principle for network protocols. This leads to the idea of epidemic algorithms, also called gossip-based algorithms, in which information is propagated through a collection of distributed computing hosts, typically using some form of randomization. The result is a means of information dissemination that is often more simpler and more robust than regimented, centralized schemes.
- Alan J. Demers, Daniel H. Greene, Carl Hauser, Wes Irish, John Larson, Scott Shenker, Howard E. Sturgis, Daniel C. Swinehart, Douglas B. Terry. Epidemic Algorithms for Replicated Database Maintenance. Operating Systems Review 22(1): 8-32 (1988)
- R. van Renesse.Scalable and secure resource location.Hawaii International Conference on System Sciences, 2000.
- R. van Renesse, K. Birman, W. Vogels. Astrolabe: A Robust and Scalable Technology For Distributed System Monitoring, Management, and Data Mining.to appear in ACM Transactions on Computer Systems, 2003.
- Mor Harchol-Balter, Tom Leighton, Daniel Lewin. Resource Discovery in Distributed Networks. ACM Symposium on Principles of Distributed Computing (PODC), 1999, pp. 229-238.
- Shay Kutten, David PelegDeterministic Distributed Resource Discovery.ACM Symposium on Principles of Distributed Computing (PODC), 2000.
- R. Karp, C. Schindelhauer, S. Shenker, B. Vocking.Randomized Rumor Spreading.41st IEEE Symposium on Foundations of Computer Science, 2000.
- D. Kempe, J. Kleinberg, A. Demers. Spatial gossip and resource location protocols.Proc. 33rd ACM Symposium on Theory of Computing, 2001.
Network Datasets
There are a number of interesting network datasets available on the Web; they form a valuable resource for trying out algorithms and models across a range of settings.
- Internet topology: The network structure of the Internet can be studied at several levels of resolution. Here is a dataset at the autonomous system (AS) level.
- Web subgraphs: These were constructed by expanding a 200-page response set to a search engine query, as in the hub/authority algorithm. This data was collected some time back, so a number of the links will be broken.
- Collaboration networks: Datasets of collaboration among scientists, movie actors, and business people have been used as proxies for social networks. Here are two examples.
- Protein interaction networks: These are constructed from the collection of proteins in a cell, with edges joining pairs that have been found to interact. Naturally, such networks are incomplete, since some interactions have not yet been discovered.
- Semantic networks: Free association datasets for words have been collected by cognitive scientists; these are constructed by compiling the free responses of test subjects when presented with cue words. (For example, a test subject presented with the cue word `ice' might react with the word `cold,' `cream,' or `water.')
Other Courses with Overlapping Content
- Algorithmic Aspects of Computer Networks (Boston Univ.): Byers.
- Internet Algorithmics (Brown): Goodrich.
- Algorithms for Indexing and Search (Carnegie Mellon): Blelloch, Lafferty, Miller.
- Networks and Complexity in Social Systems (Columbia): Watts.
- Scaling in Networks (Columbia): Lazar.
- Algorithms at the End of the Wire (Harvard): Mitzenmacher.
- Hypertext retrieval and mining (IIT Bombay): Chakrabarti.
- Complex Human Networks Reading Group (MIT): Pentland, Clarkson, Choudhury.
- Computational Issues in Ecommerce (Penn State): Cox, Giles, Pennock, Zha.
- Advanced Algorithms in Data Mining (Penn State): Zha.
- Web Protocols, Principles, and Applications (Polytechnic): Suel.
- Information Retrieval, Discovery, and Delivery (Princeton): LaPaugh.
- Data Mining (Stanford): Ullman.
- Information Retrieval and Distributed Databases (Stanford): Raghavan.
- Seminar in Data Mining and Search (Tel Aviv): Fiat.
- Recommender Systems (Virginia Tech): Ramakrishnan.
- Advanced Topics in Data Mining (UC Irvine): Smyth.
- Networks and Complexity (UC Irvine): White.
- Advanced Topics in Information Management Systems (UCLA): Cho.
- Algorithmic Problems Inspired by the Web (U. Chicago): Simon.
- Large Scale Networked Systems (U. Chicago): Foster.
- Advanced algorithms in data mining (U. Helsinki): Mannila.
- Information Access and Management on the Internet (UIUC): Chang.
- Graph Mining and Link Analysis Reading Group (U. Maryland): Getoor, Lu.
- Scaling, Power Laws, and Small World Phenomena in Networks (U. Mass.): Towsley.
- Peer-to-Peer and Application-Level Networking (U. Mass.): Kurose, Levine, Towsley.
- Practicum in Data Mining (U. Texas): Ghosh.
- Machine Learning for Text Analysis (U. Wisconsin): Craven.