Improving learning in networked data by combining explicit and mined links
This paper is about using multiple types of information for classification of networked data in a semi-supervised setting: given a fully described network (nodes and edges) with known labels for some of the nodes, predict the labels of the remaining nodes. One recently developed method for such inference is a guilt-by-association model. This method has been developed independently in two different settings: relational learning and semi-supervised learning. The relational learning setting assumes that the networked data has explicit links, such as hyperlinks between web pages or citations between research papers. The semi-supervised setting assumes a corpus of non-relational data and creates links based on similarity measures between the instances. Both settings use only the known labels in the network to predict the remaining labels, but they draw on very different information sources. The thesis of this paper is that if we combine these two types of links, the resulting network will carry more information than either type of link by itself. We test this thesis on six benchmark data sets using a within-network learning algorithm, and show that combining the links yields significant improvements in predictive performance. We describe a principled way of combining multiple types of edges with different edge weights and semantics using an objective graph measure called node-based assortativity. We investigate the use of this measure to combine text-mined links with explicit links and show that our approach significantly improves the performance of our classifier over naively combining these two types of links.

Motivation

Classification with networked data has received considerable attention in recent years across various domains and settings (e.g., Cortes, Pregibon, & Volinsky 2001; Blum et al. 2004; Macskassy & Provost forthcoming; Wang & Zhang 2006).
Networked data consists of entities, generally of the same type (such as web pages or text documents), that are connected via explicit relations such as one paper citing another, hyperlinks between web pages, or people calling each other. This paper concerns itself mainly with the problem of within-network classification: given a partially labeled network (some nodes have been labeled), label the rest of the nodes in the network. There have been two separate thrusts of work in this area. The first assumes that the data is already in the form of a network, such as a web site, a citation graph, or a calling graph (e.g., Taskar, Segal, & Koller 2001; Cortes, Pregibon, & Volinsky 2001; Macskassy & Provost 2003). The second area of work has not been cast as a network learning problem, but rather in the area of semi-supervised learning in a