Using ghost edges for classification in sparsely labeled networks

Semi-supervised classification and betweenness computation on large, sparse, directed graphs

Pattern Recognition, 2011

This work addresses graph-based semi-supervised classification and betweenness computation in large, sparse networks (several million nodes). The objective of semi-supervised classification is to assign a label to each unlabeled node using the whole topology of the graph and the labeling at our disposal. Two approaches are developed to avoid explicit computation of pairwise proximities between the nodes of the graph, which would be impractical for graphs containing millions of nodes. The first approach directly computes, for each class, the sum of the similarities between the nodes to classify and the labeled nodes of the class, as initially suggested in prior work. Along this approach, two algorithms exploiting different state-of-the-art kernels on a graph are developed. The same strategy can also be used to compute a betweenness measure. The second approach works on a trellis structure built from biased random walks on the graph, extending an idea introduced in [3]. These random walks allow the definition of a biased, bounded betweenness for the nodes of interest, defined separately for each class. All the proposed algorithms have a computing time linear in the number of edges and are therefore applicable to large sparse networks. They are empirically validated on medium-size standard data sets and shown to be competitive with state-of-the-art techniques. Finally, we processed a novel data set, made available for benchmarking multi-class classification in a large network: the U.S. patents citation network, containing 3M nodes (of six different classes) and 38M edges. The three proposed algorithms achieve competitive results (around an 85% classification rate) on this large network; they classify the unlabeled nodes within a few minutes on a standard workstation.
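The first approach, summing similarities between unlabeled nodes and each class's labeled nodes, can be approximated in time linear in the number of edges by applying a few steps of a random-walk transition matrix to per-class indicator vectors. A minimal pure-Python sketch under that interpretation (the toy graph, walk length, and clamping policy below are illustrative assumptions, not the paper's exact kernels or data):

```python
from collections import defaultdict

def classify_by_walk_sums(edges, labels, n_nodes, n_steps=3):
    """Score each node for each class by propagating class indicator
    vectors through a row-normalized adjacency (random-walk) matrix.
    Each step costs O(|E|), so the procedure is linear in the edges."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)  # treat the graph as undirected in this sketch
    classes = sorted(set(labels.values()))
    # One indicator vector per class, seeded at that class's labeled nodes.
    scores = {c: [1.0 if labels.get(i) == c else 0.0 for i in range(n_nodes)]
              for c in classes}
    for c in classes:
        vec = scores[c]
        for _ in range(n_steps):
            nxt = [0.0] * n_nodes
            for u in range(n_nodes):
                if vec[u] and adj[u]:
                    share = vec[u] / len(adj[u])
                    for v in adj[u]:
                        nxt[v] += share
            # Re-clamp labeled nodes to their indicator values.
            for i, lab in labels.items():
                nxt[i] = 1.0 if lab == c else 0.0
            vec = nxt
        scores[c] = vec
    # Assign each unlabeled node to the class with the largest score.
    return {u: max(classes, key=lambda c: scores[c][u])
            for u in range(n_nodes) if u not in labels}

# Two triangles joined by one edge; one labeled node in each.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
pred = classify_by_walk_sums(edges, {0: "A", 5: "B"}, n_nodes=6)
```

On this toy graph the nodes in each triangle inherit the label seeded in that triangle.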

Correcting evaluation bias of relational classifiers with network cross validation

Knowledge and Information Systems

Recently, a number of modeling techniques have been developed for data mining and machine learning in relational and network domains where the instances are not independent and identically distributed (i.i.d.). These methods specifically exploit the statistical dependencies among instances in order to improve classification accuracy. However, there has been little focus on how these same dependencies affect our ability to draw accurate conclusions about the performance of the models. More specifically, the complex link structure and attribute dependencies in relational data violate the assumptions of many conventional statistical tests and make it difficult to use these tests to assess the models in an unbiased manner. In this work, we examine the task of within-network classification and the question of whether two algorithms will learn models that will result in significantly different levels of performance. We show that the commonly used form of evaluation (paired t-test on overlapping network samples) can result in an unacceptable level of Type I error. Furthermore, we show that Type I error increases as (1) the correlation among instances increases and (2) the size of the evaluation set increases (i.e., the proportion of labeled nodes in the network decreases). We propose a method for network cross-validation that, combined with paired t-tests, produces more acceptable levels of Type I error while still providing reasonable levels of statistical power (i.e., 1 − Type II error).
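The core remedy described above, disjoint test folds plus a paired t-test on per-fold scores, can be sketched in a few lines. This is a simplified stand-in, not the paper's exact NCV procedure (which also controls how the training network is sampled); fold count, seed, and accuracy values are illustrative:

```python
import math
import random
import statistics

def network_cv_folds(nodes, k=5, seed=0):
    """Partition nodes into k disjoint test folds, so no node's
    performance is counted twice; the overlap that inflates Type I
    error in naive repeated sampling is avoided by construction."""
    rng = random.Random(seed)
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic over per-fold accuracy differences
    (assumes the differences are not all identical)."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return mean / (sd / math.sqrt(len(diffs)))

folds = network_cv_folds(range(10), k=5)
t = paired_t_statistic([0.80, 0.82, 0.78, 0.81, 0.79],
                       [0.75, 0.77, 0.74, 0.76, 0.73])
```

The t statistic would then be compared against a t distribution with k − 1 degrees of freedom.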

Improving within-network classification with local attributes

2007

This paper is about using multiple types of information for classification of networked data in the transductive setting: given a network with some nodes labeled, predict the labels of the remaining nodes. One method recently developed for doing such inference is a guilt-by-association model. This method has been independently developed in two different settings. One setting assumes that the networked data has explicit links such as hyperlinks between web pages or citations between research papers. The second setting assumes a corpus of non-relational data and creates links based on similarity measures between the instances. Both use only the known labels in the network to predict the remaining labels but use very different information sources. The thesis of this paper is that if we were to combine the two types of links, the resulting network would carry more information than either type of link by itself. This thesis is tested on six benchmark data sets, where we show that it is indeed correct. We further do a sensitivity study on how many links should be created, showing that the combined network gets most of its immediate gain using only a few extra links.

Simple models and classification in networked data

2004

When entities are linked by explicit relations, classification methods that take advantage of the network can perform substantially better than methods that ignore the network. This paper argues that studies of relational classification in networked data should include simple network-only methods as baselines for comparison, in addition to the non-relational baselines that generally are used. In particular, comparing more complex algorithms with algorithms that only consider the network (and not the features of the entities) allows one to factor out the contribution of the network structure itself to the predictive power of the model. We examine several simple methods for network-only classification on previously used relational data sets, and show that they can perform remarkably well. The results demonstrate that the inclusion of network-only classifiers can shed new light on studies of relational learners.
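The simplest such network-only baseline is a relational-neighbor classifier: an unlabeled node takes the majority class among its labeled neighbors, ignoring node features entirely. A minimal sketch with uniform edge weights (the toy graph and labels are illustrative, not a data set from the paper):

```python
from collections import Counter, defaultdict

def relational_neighbor(edges, labels):
    """Network-only classifier: each unlabeled node is assigned the
    majority class among its labeled neighbors (uniform edge weights).
    Nodes with no labeled neighbor are left unpredicted."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    pred = {}
    for node, nbrs in adj.items():
        if node in labels:
            continue
        votes = Counter(labels[n] for n in nbrs if n in labels)
        if votes:
            pred[node] = votes.most_common(1)[0][0]
    return pred

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(relational_neighbor(edges, {"a": 1, "e": 0}))  # → {'b': 1, 'd': 0}
```

Despite its simplicity, a baseline of this form makes the contribution of the link structure itself directly measurable.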

Improving learning in networked data by combining explicit and mined links

2007

This paper is about using multiple types of information for classification of networked data in a semi-supervised setting: given a fully described network (nodes and edges) with known labels for some of the nodes, predict the labels of the remaining nodes. One method recently developed for doing such inference is a guilt-by-association model. This method has been independently developed in two different settings: relational learning and semi-supervised learning. In relational learning, the setting assumes that the networked data has explicit links such as hyperlinks between web pages or citations between research papers. The semi-supervised setting assumes a corpus of non-relational data and creates links based on similarity measures between the instances. Both use only the known labels in the network to predict the remaining labels but use very different information sources. The thesis of this paper is that if we combine these two types of links, the resulting network will carry more information than either type of link by itself. We test this thesis on six benchmark data sets, using a within-network learning algorithm, where we show that we gain significant improvements in predictive performance by combining the links. We describe a principled way of combining multiple types of edges with different edge weights and semantics using an objective graph measure called node-based assortativity. We investigate the use of this measure to combine text-mined links with explicit links and show that using our approach significantly improves the performance of our classifier over naively combining these two types of links.

Motivation

Recent years have seen a lot of attention on classification with networked data in various domains and settings (e.g., Cortes, Pregibon, & Volinsky 2001; Blum et al. 2004; Macskassy & Provost forthcoming; Wang & Zhang 2006). Networked data is data, generally of the same type such as web pages or text documents, connected via various explicit relations such as one paper citing another, hyperlinks between web pages, or people calling each other. This paper concerns itself mainly with the problem of within-network classification: given a partially labeled network (some nodes have been labeled), label the rest of the nodes in the network. There have been two separate thrusts of work in this area; one assumes that the data is already in the form of a network such as a web site, a citation graph, or a calling graph (e.g., Taskar, Segal, & Koller 2001; Cortes, Pregibon, & Volinsky 2001; Macskassy & Provost 2003). The second area of work has not been cast as a network learning problem, but rather in the area of semi-supervised learning in a
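The idea of weighting edge types by how well their endpoints agree on class labels can be illustrated with a simple label-agreement fraction. This is a hypothetical simplified proxy, not the paper's exact node-based assortativity measure; the two edge sets and labels are toy data:

```python
def edge_type_weight(edges, labels):
    """Proxy for edge-type informativeness: the fraction of edges whose
    two endpoints are both labeled and share the same class. A stand-in
    for an assortativity-style measure, used here only to illustrate
    weighting one edge type (e.g., explicit) against another (e.g., mined)."""
    agree = total = 0
    for u, v in edges:
        if u in labels and v in labels:
            total += 1
            agree += labels[u] == labels[v]
    return agree / total if total else 0.0

explicit = [(0, 1), (1, 2), (2, 3)]       # e.g., citation links
mined = [(0, 3), (1, 2), (2, 3)]          # e.g., text-similarity links
labels = {0: "x", 1: "x", 2: "y", 3: "y"}
w_explicit = edge_type_weight(explicit, labels)  # 2 of 3 edges agree
w_mined = edge_type_weight(mined, labels)        # 1 of 3 edges agree
```

A combined graph could then scale each edge type's weights by its score before running a within-network classifier.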

Link prediction in graph construction for supervised and semi-supervised learning

2015 International Joint Conference on Neural Networks (IJCNN), 2015

Many real-world domains are relational in nature, consisting of a set of objects related to each other in complex ways. However, there are also flat data sets, and if we want to apply graph-based algorithms to them, it is necessary to construct a graph from the data. This paper aims to i) broaden the exploration of graph-based algorithms and ii) propose new techniques for graph construction from flat data. Our proposal focuses on constructing graphs using link-prediction measures to predict the existence of links between entities in an initial graph. Starting from a basic graph structure, such as a minimum spanning tree, we apply a link-prediction measure to add new edges to the graph. The link-prediction measures considered here are based on the structural similarity of the graph and improve its connectivity. We evaluate our proposal for graph construction in supervised and semi-supervised classification and confirm that the constructed graphs achieve better accuracy.
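The pipeline described above, an MST seed graph densified by a link-prediction score, can be sketched as follows. The choice of common neighbours as the score and the number of added edges are illustrative assumptions; the paper evaluates several structural measures:

```python
import math
from itertools import combinations

def build_graph(points, extra_edges=2):
    """Sketch of MST-plus-link-prediction graph construction: build a
    minimum spanning tree over the flat data (Prim's algorithm), then
    add the non-tree pairs with the highest common-neighbours score."""
    n = len(points)
    dist = lambda i, j: math.dist(points[i], points[j])
    # Prim's algorithm: grow the tree one cheapest crossing edge at a time.
    in_tree, edges = {0}, set()
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree for j in range(n)
                    if j not in in_tree), key=lambda p: dist(*p))
        edges.add((min(i, j), max(i, j)))
        in_tree.add(j)
    # Common-neighbours link-prediction score for absent pairs.
    nbrs = {i: set() for i in range(n)}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    candidates = [(len(nbrs[a] & nbrs[b]), (a, b))
                  for a, b in combinations(range(n), 2)
                  if (a, b) not in edges]
    for score, pair in sorted(candidates, reverse=True)[:extra_edges]:
        if score > 0:  # only add edges the predictor actually supports
            edges.add(pair)
    return edges

g = build_graph([(0, 0), (1, 0), (2, 0), (0, 1)])
```

On these four points the MST has three edges and two predicted edges are added, improving connectivity before any classifier runs on the graph.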

Classification in Networked Data: A Toolkit and a Univariate Case Study

2005

This paper presents NetKit, a modular toolkit for classification in networked data, and a case-study of its application to a collection of networked data sets used in prior machine learning research. Networked data are relational data where entities are interconnected, and this paper considers the common case where to-be-estimated entities are linked to entities for which the target is known. NetKit is based on a three-component framework, comprising a local classifier, a relational classifier, and a collective inference procedure. Various existing relational learning algorithms can be instantiated with appropriate choices for these three components and new relational learning algorithms can be composed by new combinations of components. The case study demonstrates how our toolkit facilitates comparison of different learning methods (which so far has been lacking in machine learning research). It also shows how the modular framework allows analysis of subcomponents, to assess which, whether, and when particular components contribute to superior performance. The case study focuses on the simple, but important, special case of univariate network classification, where the only information available is the structure of class linkage in the network (i.e., only links and class labels are available). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. Among other things, the results demonstrate clearly that simple network-classification models perform well enough that they should be used as baseline classifiers for studies of relational learning for networked data.
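The three-component framework can be illustrated as a short skeleton: a local classifier seeds estimates, a relational classifier re-estimates each node from its neighbours, and collective inference iterates. The concrete components below (a uniform prior and neighbour-mean scoring, akin to relaxation labelling on a binary problem) are one possible instantiation chosen for brevity, not NetKit's full component library:

```python
from collections import defaultdict

def collective_inference(edges, labels, prior=0.5, n_iter=20):
    """Skeleton of the three-component framework: a (trivial) local
    classifier seeds every unlabeled node with a prior score, a
    relational classifier re-estimates each node as the mean score of
    its neighbours, and the collective-inference loop iterates until
    the scores settle. Scores are P(class = 1) for a binary problem."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # Local classifier: known labels are clamped, others get the prior.
    score = {u: labels.get(u, prior) for u in adj}
    for _ in range(n_iter):
        for u in adj:
            if u not in labels:  # relational classifier step
                score[u] = sum(score[v] for v in adj[u]) / len(adj[u])
    return {u: int(s >= 0.5) for u, s in score.items() if u not in labels}

edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(collective_inference(edges, {0: 1, 4: 0}))  # → {1: 1, 2: 1, 3: 0}
```

Swapping in a different prior, neighbour-scoring rule, or update schedule yields a different relational learner, which is exactly the compositional point the toolkit makes.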

Evaluating Statistical Tests for Within-Network Classifiers of Relational Data

2009

Recently, a number of modeling techniques have been developed for data mining and machine learning in relational and network domains where the instances are not independent and identically distributed (i.i.d.). These methods specifically exploit the statistical dependencies among instances in order to improve classification accuracy. However, there has been little focus on how these same dependencies affect our ability to draw accurate conclusions about the performance of the models. More specifically, the complex link structure and attribute dependencies in network data violate the assumptions of many conventional statistical tests and make it difficult to use these tests to assess the models in an unbiased manner. In this work, we examine the task of within-network classification and the question of whether two algorithms will learn models which will result in significantly different levels of performance. We show that the commonly used form of evaluation (paired t-test on overlapping network samples) can result in an unacceptable level of Type I error. Furthermore, we show that Type I error increases as (1) the correlation among instances increases and (2) the size of the evaluation set increases (i.e., the proportion of labeled nodes in the network decreases). We propose a method for network cross-validation that, combined with paired t-tests, produces more acceptable levels of Type I error while still providing reasonable levels of statistical power (i.e., 1 − Type II error).

Graph construction based on labeled instances for semi-supervised learning

Semi-supervised learning (SSL) techniques have become very relevant since they require only a small set of labeled data. In this context, graph-based algorithms have gained prominence in the area due to their capacity to exploit, besides information about the data points, the relationships among them. Moreover, data represented as graphs allow the use of collective inference (vertices can affect each other), propagation of labels (autocorrelation among neighbors), and the neighborhood characteristics of a vertex. An important step in graph-based SSL methods is the conversion of tabular data into a weighted graph. Graph construction plays a key role in the quality of classification in graph-based methods. This paper explores a method for graph construction that uses the available labeled data. We provide extensive experiments showing that the proposed method has many advantages: good classification accuracy, quadratic time complexity, no sensitivity to the parameter k for k > 10, sparse graph formation with average degree around 2, and hub formation from the labeled points, which facilitates the propagation of labels.

Network Embedding With Completely-Imbalanced Labels

IEEE Transactions on Knowledge and Data Engineering, 2020

Network embedding, aiming to project a network into a low-dimensional space, is increasingly becoming a focus of network research. Semi-supervised network embedding takes advantage of labeled data and has shown promising performance. However, existing semi-supervised methods yield poor results in the completely-imbalanced label setting, where some classes have no labeled nodes at all. To alleviate this, we propose two novel semi-supervised network embedding methods. The first one is a shallow method named RSDNE. Specifically, to benefit from the completely-imbalanced labels, RSDNE guarantees both intra-class similarity and inter-class dissimilarity in an approximate way. The other method is RECT, a new class of graph neural networks. Different from RSDNE, to benefit from the completely-imbalanced labels, RECT explores class-semantic knowledge. This enables RECT to handle networks with node features and the multi-label setting. Experimental results on several real-world datasets demonstrate the superiority of the proposed methods.