Validating Collective Classification Using Cohorts

Collective classification in network data

2008

Many real-world applications produce networked data, such as the world-wide web (hypertext documents connected via hyperlinks), social networks (for example, people connected by friendship links), communication networks (computers connected via communication links), and biological networks (for example, protein interaction networks). A recent focus in machine learning research has been to extend traditional machine learning classification techniques to classify nodes in such networks.
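
The best-known technique of this kind is the iterative classification algorithm (ICA) covered in the survey. Below is a minimal, self-contained Python sketch of ICA on a synthetic homophilous network; the toy data, the neighbor-class-fraction aggregate, and the logistic-regression base learner are illustrative assumptions, not code from the paper.

```python
# Minimal ICA sketch: bootstrap with a content-only classifier, then iterate
# inference with a relational feature built from current label estimates.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy homophilous network: 100 nodes, 2 classes, 5 noisy content features.
n, d = 100, 5
labels = rng.integers(0, 2, size=n)
content = labels[:, None] + rng.normal(scale=1.5, size=(n, d))
adj = np.zeros((n, n), dtype=bool)
for i in range(n):
    for j in range(i + 1, n):
        if rng.random() < (0.08 if labels[i] == labels[j] else 0.01):
            adj[i, j] = adj[j, i] = True

train = rng.random(n) < 0.3            # 30% of nodes are labeled
A = adj.astype(float)
deg = A.sum(1).clip(min=1)

def relational_feature(pred):
    """Fraction of each node's neighbors currently assigned class 1."""
    return (A @ pred) / deg

# Bootstrap: a content-only classifier initializes the unknown labels.
clf = LogisticRegression().fit(content[train], labels[train])
pred = labels.copy()
pred[~train] = clf.predict(content[~train])

# Train the relational classifier once, then iterate inference.
feats = np.column_stack([content, relational_feature(pred)])
clf = LogisticRegression().fit(feats[train], labels[train])
for _ in range(10):
    feats = np.column_stack([content, relational_feature(pred)])
    pred[~train] = clf.predict(feats[~train])

print("accuracy on unlabeled nodes:", (pred[~train] == labels[~train]).mean())
```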

Cautious inference in collective classification

2007

Collective classification can significantly improve accuracy by exploiting relationships among instances. Although several collective inference procedures have been reported, they have not been thoroughly evaluated for their commonalities and differences. We introduce novel generalizations of three existing algorithms that allow such algorithmic and empirical comparisons. Our generalizations permit us to examine how cautiously or aggressively each algorithm exploits intermediate relational data, which can be noisy. We conjecture that cautious approaches that identify and preferentially exploit the more reliable intermediate data should outperform aggressive approaches. We explain why caution is useful and introduce three parameters to control the degree of caution. An empirical evaluation of collective classification algorithms, using two base classifiers on three data sets, supports our conjecture.
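
To make the notion of caution concrete, here is a hedged Python sketch of one way to throttle intermediate relational data: at iteration t, only the top t/iters fraction of the most confident test predictions is allowed to feed the relational feature of neighboring nodes. The linear commit schedule and the neighbor aggregate are assumptions for illustration; the paper's three caution parameters are not reproduced here.

```python
# "Cautious" ICA sketch: uncommitted predictions are invisible to neighbors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cautious_ica(content, adj, labels, train, iters=10):
    A = adj.astype(float)
    pred = np.where(train, labels, 0)    # test entries are placeholders
    known = train.copy()                 # which labels neighbors may "see"
    test_idx = np.flatnonzero(~train)

    for t in range(1, iters + 1):
        # Relational feature computed from committed labels only.
        rel = (A @ ((pred == 1) & known)) / (A @ known.astype(float)).clip(min=1)
        feats = np.column_stack([content, rel])
        clf = LogisticRegression().fit(feats[train], labels[train])
        proba = clf.predict_proba(feats[test_idx])
        pred[test_idx] = clf.classes_[proba.argmax(1)]

        # Caution: only the top t/iters fraction of confident predictions
        # becomes visible to neighbors in the next iteration.
        k = int(np.ceil(len(test_idx) * t / iters))
        order = test_idx[np.argsort(proba.max(1))]
        known[order[len(order) - k:]] = True
    return pred
```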

Collective Classification with Content and Link Noise

itu.edu.tr

Collective classification algorithms [Macskassy et al. 2007, Sen et al. 2008] can improve the classification of networked data when unlabeled test node features and links are available. In this study, we provide detailed results on the performance of collective classification algorithms when content or link noise is present. First, we show that collective classification algorithms are more robust to content noise than content-only classification. We also evaluate the performance of collective classification when additive link noise is present. We show that, especially when content and/or link noise is present, feature and/or node selection is essential for better collective classification.
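
As an illustration of this kind of experimental setup, the sketch below injects additive link noise (spurious random edges) and Gaussian content noise into a network before classification; the noise models, rates, and function names are illustrative assumptions, not the study's code.

```python
# Noise-injection helpers for robustness experiments (illustrative sketch).
import numpy as np

def add_link_noise(adj, rate, rng):
    """Add `rate` * |E| random spurious edges (additive link noise)."""
    n = adj.shape[0]
    n_new = int(adj.sum() // 2 * rate)   # adj is symmetric boolean
    noisy = adj.copy()
    while n_new > 0:
        i, j = rng.integers(0, n, size=2)
        if i != j and not noisy[i, j]:
            noisy[i, j] = noisy[j, i] = True
            n_new -= 1
    return noisy

def add_content_noise(content, sigma, rng):
    """Perturb node feature vectors with Gaussian content noise."""
    return content + rng.normal(scale=sigma, size=content.shape)
```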

Collective prediction with latent graphs

Proceedings of the 20th ACM international …, 2011

Collective classification in relational data has become an important and active research topic in the last decade. It exploits the dependencies of instances in a network to improve predictions. Related applications include hyperlinked document classification, social network analysis and collaboration network analysis. Most traditional collective classification models study the scenario in which a large number of labeled examples (labeled nodes) is available. However, in many real-world applications, labeled data are extremely difficult to obtain. For example, in network intrusion detection, there may be only a limited number of identified intrusions, whereas there is a huge set of unlabeled nodes. In this situation, most of the data have no connection to labeled nodes; hence, no supervision knowledge can be obtained from the local connections. In this paper, we propose to explore various latent linkages among the nodes and judiciously integrate them to generate a latent graph. This is achieved by finding a graph that maximizes the linkages among the training data with the same label and maximizes the separation among the data with different labels. The objective is cast into an optimization problem and solved with quadratic programming. Finally, we apply label propagation on the latent graph to make predictions. Experiments show that the proposed model, LNP (Latent Network Propagation), can improve learning accuracy significantly. For instance, when only 10% of examples are labeled, the accuracies of all the comparison models are below 63%, while that of the proposed model is 74%.
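
The final step of LNP is standard label propagation on the learned latent graph. A minimal sketch of that propagation step is below, in the style of the local-and-global-consistency update with labeled nodes clamped; the latent-graph construction via quadratic programming is omitted, and alpha and the stopping rule are illustrative choices, not the paper's settings.

```python
# Label propagation on an affinity matrix W (illustrative sketch).
import numpy as np

def label_propagation(W, y, labeled, alpha=0.9, tol=1e-6):
    """Propagate labels on graph W; y is a one-hot matrix for labeled nodes."""
    d = W.sum(1).clip(min=1e-12)
    S = W / np.sqrt(np.outer(d, d))      # symmetric normalization D^-1/2 W D^-1/2
    F = y.astype(float).copy()
    for _ in range(1000):
        F_new = alpha * (S @ F) + (1 - alpha) * y
        F_new[labeled] = y[labeled]      # clamp the known labels
        if np.abs(F_new - F).max() < tol:
            return F_new.argmax(1)
        F = F_new
    return F.argmax(1)
```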

An ensemble model for collective classification that reduces learning and inference variance

2012

Ensemble learning can improve classification of relational data. Previous attempts to do so include methods that have focused primarily on reducing learning or inference variance, but not both at the same time. We present an ensemble model that reduces error due to variance in both learning and collective inference. Our model uniquely combines two strategies tailored specifically for relational data and relational models to achieve a larger reduction in variance than using either method alone, which results in significant accuracy gains. In addition, we present the first theoretical analysis for ensembles of collective classifiers in relational domains, to show the reasons for the superior performance of our proposed method. We also use synthetic and real-world data to demonstrate the improvement empirically.
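
A simplified stand-in for the two strategies, to make the idea concrete: bootstrap resamples of the training nodes target learning variance, while averaging the models' soft predictions at every collective-inference iteration targets inference variance. This sketch assumes binary labels and is not the paper's exact method.

```python
# Ensemble of collective classifiers averaging soft predictions per iteration.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ensemble_ica(content, adj, labels, train, k=10, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    A = adj.astype(float)
    deg = A.sum(1).clip(min=1)
    train_idx = np.flatnonzero(train)
    samples = [rng.choice(train_idx, size=len(train_idx)) for _ in range(k)]

    p1 = np.full(len(labels), 0.5)       # shared soft estimate of P(class = 1)
    p1[train] = labels[train]

    for _ in range(iters):
        rel = (A @ p1) / deg             # soft relational feature
        feats = np.column_stack([content, rel])
        probs = []
        for s in samples:                # k models from bootstrap resamples
            clf = LogisticRegression().fit(feats[s], labels[s])
            probs.append(clf.predict_proba(feats)[:, 1])
        p1[~train] = np.mean(probs, axis=0)[~train]   # average the ensemble
    return (p1 > 0.5).astype(int)        # assumes classes are {0, 1}
```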

Exploiting network structure for active inference in collective classification

2007

Active inference seeks to maximize classification performance while minimizing the amount of data that must be labeled ex ante. This task is particularly relevant in the context of relational data, where statistical dependencies among instances can be exploited to improve classification accuracy. We show that efficient methods for indexing network structure can be exploited to select high-value nodes for labeling. This approach substantially outperforms random selection and selection based on simple measures of ...
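
A minimal sketch of structure-driven selection: rank unlabeled nodes by a network centrality index and acquire labels for the top k. The choice of betweenness centrality below is an illustrative stand-in, not necessarily the paper's indexing scheme.

```python
# Select high-value nodes for labeling by network centrality (sketch).
import networkx as nx

def select_nodes_to_label(G, unlabeled, k):
    """Pick the k unlabeled nodes with the highest betweenness centrality."""
    centrality = nx.betweenness_centrality(G)
    return sorted(unlabeled, key=lambda v: centrality[v], reverse=True)[:k]
```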

Why collective inference improves relational classification

Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '04, 2004

Procedures for collective inference make simultaneous statistical judgments about the same variables for a set of related data instances. For example, collective inference could be used to simultaneously classify a set of hyperlinked documents or infer the legitimacy of a set of related financial transactions. Several recent studies indicate that collective inference can significantly reduce classification error when compared with traditional inference techniques. We investigate the underlying mechanisms for this error reduction by reviewing past work on collective inference and characterizing different types of statistical models used for making inference in relational data. We show important differences among these models, and we characterize the necessary and sufficient conditions for reduced classification error based on experiments with real and simulated data.

Validation of Network Classifiers

This paper develops PAC (probably approximately correct) error bounds for network classifiers in the transductive setting, where the network node inputs and links are all known, the training nodes' class labels are known, and the goal is to classify a working set of nodes whose class labels are unknown. The bounds are valid for any model of network generation. They require working nodes to be selected independently, but not uniformly at random. For example, they allow different regions of the network to have different densities of unlabeled nodes.
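
For orientation, bounds of this kind typically take a Hoeffding-style form. The inequality below is the textbook shape for a single fixed classifier h evaluated on m independently selected working nodes; it is shown only to fix notation and is not the bound derived in the paper.

```latex
% Textbook Hoeffding-style form, for a fixed classifier h and m independently
% selected working nodes; shown for orientation, not the paper's bound.
\mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h)
  \;+\; \sqrt{\frac{\ln(1/\delta)}{2m}}
\qquad \text{with probability at least } 1 - \delta .
```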

Collective Classification Using Heterogeneous Classifiers

Lecture Notes in Computer Science, 2011

Collective classification algorithms have been used to improve classification performance when network training data with content, link and label information and test data with content and link information are available. Collective classification algorithms use a base classifier which is trained on training content and link data. The base classifier inputs usually consist of the content vector concatenated with an aggregation vector of neighborhood class information. In this paper, instead of using a single base classifier, we propose using different types of base classifiers for content and link. We then combine the content and link classifier outputs using different classifier combination methods. Our experiments show that using heterogeneous classifiers for link and content classification and combining their outputs gives accuracies as good as collective classification. Our method can also be extended to collective classification scenarios with multiple types of content and link.
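
A hedged sketch of the core idea: one classifier type for the content view, a different type for the link (neighborhood-aggregate) view, with outputs fused by averaging class probabilities. The specific classifier types and the sum-rule fusion below are illustrative; the paper evaluates several combination methods.

```python
# Heterogeneous content/link classifiers combined by probability averaging.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def heterogeneous_combine(content, link_feats, labels, train):
    content_clf = GaussianNB().fit(content[train], labels[train])
    link_clf = LogisticRegression().fit(link_feats[train], labels[train])
    # Both sklearn classifiers order classes_ identically, so columns align.
    p = (content_clf.predict_proba(content) +
         link_clf.predict_proba(link_feats)) / 2    # sum-rule fusion
    return content_clf.classes_[p.argmax(1)]
```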

Active inference for collective classification

2010

Labeling nodes in a network is an important problem that has seen growing interest. A number of methods that exploit both local and relational information have been developed for this task. Acquiring the labels for a few nodes at inference time can greatly improve accuracy; however, deciding which node labels to acquire is challenging. Previous approaches have been based on simple structural properties.
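
As a generic illustration of label acquisition at inference time, the sketch below queries the nodes whose current collective predictions are most uncertain (highest entropy) and would then re-run inference. This entropy criterion is a common baseline for such acquisition, not necessarily the method proposed in the paper.

```python
# Uncertainty-driven label acquisition (illustrative baseline sketch).
import numpy as np

def acquire_by_uncertainty(proba, candidates, budget):
    """Pick `budget` candidate nodes with maximum predictive entropy."""
    p = np.clip(proba[candidates], 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return np.asarray(candidates)[np.argsort(entropy)[::-1][:budget]]
```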