Network data classification using graph partition
Related papers
Classification of web documents using a graph model
2003
In this paper we describe work relating to classification of web documents using a graph-based model instead of the traditional vector-based model for document representation. We compare the classification accuracy of the vector model approach using the k-Nearest Neighbor (k-NN) algorithm to a novel approach which allows the use of graphs for document representation in the k-NN algorithm. The proposed method is evaluated on three different web document collections using the leave-one-out approach for measuring classification accuracy. The results show that the graph-based k-NN approach can outperform traditional vector-based k-NN methods in terms of both accuracy and execution time.
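The graph-based k-NN idea from this paper can be sketched in a few lines. The sketch below is an illustrative simplification, not the authors' implementation: documents become graphs of adjacent terms, and graph similarity is approximated by Jaccard overlap of edge sets (the paper uses graph-theoretic distances such as maximum common subgraph, and richer web-document graph models).

```python
from collections import Counter

def doc_to_graph(text, window=2):
    """Represent a document as a set of term-adjacency edges
    (a simplification of the paper's web-document graph model)."""
    words = text.lower().split()
    return {(words[i], words[j])
            for i in range(len(words))
            for j in range(i + 1, min(i + window, len(words)))}

def graph_distance(g1, g2):
    """Jaccard-style distance between edge sets; the paper instead uses
    a maximum-common-subgraph distance, which this only approximates."""
    if not g1 and not g2:
        return 0.0
    return 1.0 - len(g1 & g2) / len(g1 | g2)

def knn_classify(train, query, k=3):
    """train: list of (graph, label) pairs; query: a document graph."""
    neighbors = sorted(train, key=lambda gl: graph_distance(gl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Because the distance operates directly on graphs, k-NN needs no change beyond swapping the distance function, which is the paper's central point.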
Collective classification in network data
2008
Abstract Many real-world applications produce networked data such as the world-wide web (hypertext documents connected via hyperlinks), social networks (for example, people connected by friendship links), communication networks (computers connected via communication links) and biological networks (for example, protein interaction networks). A recent focus in machine learning research has been to extend traditional machine learning classification techniques to classify nodes in such networks.
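The core mechanism of collective classification can be sketched as follows. This is a deliberately minimal scheme (unlabeled nodes iteratively adopt the majority label of their labeled neighbors), standing in for the iterative-classification and collective-inference procedures the survey covers, not any one paper's exact algorithm:

```python
def collective_classify(adj, seed, iters=10):
    """adj: node -> list of neighbors; seed: node -> known label.
    Unlabeled nodes repeatedly take the majority label among their
    currently labeled neighbors; seed labels stay fixed."""
    labels = dict(seed)
    for _ in range(iters):
        updated = dict(labels)
        for node, nbrs in adj.items():
            if node in seed:
                continue
            votes = {}
            for n in nbrs:
                if n in labels:
                    votes[labels[n]] = votes.get(labels[n], 0) + 1
            if votes:
                # deterministic tie-break: alphabetically first label
                updated[node] = max(sorted(votes), key=votes.get)
        if updated == labels:
            break
        labels = updated
    return labels
```

The point of the collective setting is visible even at this scale: labels inferred for one node feed the inference for its neighbors on the next pass, something an i.i.d. classifier cannot exploit.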
Graph Classification using Machine Learning Algorithms
Graph Classification using Machine Learning Algorithms by Monica Golahalli Seenappa. In the graph classification problem, we are given a family of graphs and a set of categories, and the aim is to classify each graph in the family into one of the categories. Earlier approaches, such as graph kernels and graph embedding techniques, focused on extracting features by processing the entire graph. However, real-world graphs are complex and noisy, and these traditional approaches are computationally intensive. With the introduction of deep learning frameworks, there have been numerous attempts to create more efficient classification approaches. This project focuses on modifying an existing kernel graph convolutional neural network approach: subgraphs (patches) are extracted from the graph using a community detection algorithm, the patches are provided as input to a graph kernel, and max pooling is applied. We experiment with different community detection algorithms and graph kernels and compare their efficiency and performance. For the experiments, we use eight publicly available real-world datasets, ranging from biological to social networks. Additionally, for these datasets we provide results using a baseline algorithm and a spectral decomposition of the graph Laplacian for comparison purposes.
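The patch-then-kernel-then-pool pipeline described above can be sketched with toy components. Both components below are placeholder assumptions: a tiny synchronous label-propagation community detector stands in for the detectors the project compares, and a degree-histogram feature stands in for a real graph kernel.

```python
from collections import Counter

def communities(adj, iters=10):
    """Toy synchronous label-propagation community detector (a stand-in
    for the community detection algorithms compared in the project)."""
    label = {v: v for v in adj}
    for _ in range(iters):
        new = {}
        for v, nbrs in adj.items():
            counts = Counter(label[u] for u in nbrs) if nbrs else Counter({label[v]: 1})
            top = max(counts.values())
            # deterministic tie-break: smallest label among the most common
            new[v] = min(l for l, c in counts.items() if c == top)
        if new == label:
            break
        label = new
    groups = {}
    for v, l in label.items():
        groups.setdefault(l, set()).add(v)
    return list(groups.values())

def degree_histogram_kernel(patch, adj, bins=4):
    """Toy graph-kernel feature: histogram of node degrees inside the patch."""
    hist = [0] * bins
    for v in patch:
        d = sum(1 for u in adj[v] if u in patch)
        hist[min(d, bins - 1)] += 1
    return hist

def graph_feature(adj):
    """Max-pool the per-patch kernel features into one graph-level vector."""
    feats = [degree_histogram_kernel(p, adj) for p in communities(adj)]
    return [max(col) for col in zip(*feats)]
```

The resulting fixed-length vector per graph is what a downstream classifier would consume; in the actual project, the kernel and pooling sit inside a convolutional architecture.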
Model-Based Classification of Web Documents Represented by Graphs
Proc. of WebKDD, 2006
Most web content classification methods are based on the vector-space model of information retrieval. One important advantage of this representation model is that it can be used by both instance-based and model-based classifiers for categorization. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the mark-up information that can easily be extracted from the HTML tags of a web document.
Approaches to Network Classification
Cornell University - arXiv, 2002
We introduce a novel approach to the description of networks/graphs. It is based on an analogue physical model that is dynamically evolved. This evolution depends on the connectivity matrix and readily brings out many qualitative features of the graph.
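The abstract does not specify the physical model, so the sketch below is only a generic illustration of the idea of connectivity-driven dynamics: repeatedly applying the adjacency matrix to a state vector (damped with the identity to avoid oscillation on bipartite graphs) converges to the dominant eigenvector, whose entries highlight qualitative features such as well-connected nodes.

```python
def evolve(adj_matrix, steps=50):
    """Evolve a state vector under x <- (A + I) x with max-normalisation.
    The stationary pattern (the dominant eigenvector of A + I) reflects
    qualitative connectivity structure. Generic illustration only; not
    the paper's specific analogue physical model."""
    n = len(adj_matrix)
    x = [1.0] * n
    for _ in range(steps):
        y = [x[i] + sum(adj_matrix[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = max(abs(v) for v in y)
        x = [v / norm for v in y]
    return x
```

On a star graph, for instance, the dynamics settle with the hub at the maximum value and every leaf at the same lower value, exposing the hub-and-spoke structure without inspecting the matrix directly.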
Identification of clusters in the Web graph based on link topology
Seventh International Database Engineering and Applications Symposium, 2003. Proceedings., 2003
The web graph has recently been used to model the link structure of the Web. The studies of such graphs can yield valuable insights into web algorithms for crawling, searching and discovery of web communities. This paper proposes a new approach to clustering the Web graph. The proposed algorithm identifies a small subset of the graph as "core" members of clusters, and then incrementally constructs the clusters by a selection criterion. Two qualitative criteria are proposed to measure the quality of graph clustering. We have implemented our algorithm and tested a set of arbitrary graphs with good results. Applications of our approach include graph drawing and web visualization.
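The core-then-grow strategy can be sketched as follows. This is a much-reduced stand-in, assuming degree as the core criterion and shared-link count as the selection criterion; the paper's actual core identification and quality measures are more elaborate.

```python
def core_cluster(adj, num_cores=2):
    """Seed clusters with the highest-degree nodes ("cores"), then let
    each remaining node, taken in degree order, join the cluster to
    which it has the most links."""
    by_degree = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
    clusters = [{v} for v in by_degree[:num_cores]]
    for v in by_degree[num_cores:]:
        best = max(clusters, key=lambda c: len(c & set(adj[v])))
        best.add(v)
    return clusters
```

Processing nodes in degree order means well-connected nodes are placed early, so later, sparsely linked nodes see mostly settled clusters, which is the incremental flavour the abstract describes.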
A Clustering Algorithm for Classification of Network Traffic using Semi Supervised Data
In traditional text classification, a classifier is built using labeled training documents from every class. This paper studies a different problem. Given a set P of documents of a particular class (called the positive class) and a set U of unlabeled documents that contains documents from class P as well as other types of documents (called the negative class), we want to build a classifier that separates the documents in U into those from P and those not from P. The key feature of this problem is that there are no labeled negative documents, which makes traditional text classification techniques inappropriate. In this paper, we propose an effective technique to solve the problem. It combines the Rocchio method and the K-means technique to build a classifier for network data. Experimental results show that the new method significantly outperforms existing methods.
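The Rocchio-plus-K-means combination can be sketched as below. This is a minimal sketch of the general positive/unlabeled scheme, not the paper's algorithm: Rocchio-style prototypes (centroid of P versus centroid of U, with U initially treated as negative) give a first split of U, and K-means-style reassignment then refines the two groups. Plain lists stand in for the TF-IDF document vectors a real system would use.

```python
def rocchio_kmeans_pu(P, U, iters=10):
    """P: positive vectors; U: unlabeled vectors.
    Returns (pos, neg): the split of U into positive-like and
    negative-like documents."""
    def centroid(vs):
        return [sum(col) / len(vs) for col in zip(*vs)]

    def dist(a, b):                      # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Rocchio-style prototypes: U as a whole plays the negative class
    pos_proto, neg_proto = centroid(P), centroid(U)
    for _ in range(iters):
        pos = [u for u in U if dist(u, pos_proto) <= dist(u, neg_proto)]
        neg = [u for u in U if dist(u, pos_proto) > dist(u, neg_proto)]
        if not pos or not neg:
            break
        pos_proto = centroid(P + pos)    # keep positives anchored to P
        neg_proto = centroid(neg)
    return pos, neg
```

Anchoring the positive prototype to P on every pass is what keeps the iteration from drifting even though no negative document was ever labeled.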
Classification in Networked Data: A Toolkit and a Univariate Case Study
2005
This paper presents NetKit, a modular toolkit for classification in networked data, and a case-study of its application to a collection of networked data sets used in prior machine learning research. Networked data are relational data where entities are interconnected, and this paper considers the common case where to-be-estimated entities are linked to entities for which the target is known. NetKit is based on a three-component framework, comprising a local classifier, a relational classifier, and a collective inference procedure. Various existing relational learning algorithms can be instantiated with appropriate choices for these three components and new relational learning algorithms can be composed by new combinations of components. The case study demonstrates how our toolkit facilitates comparison of different learning methods (which so far has been lacking in machine learning research). It also shows how the modular framework allows analysis of subcomponents, to assess which, whether, and when particular components contribute to superior performance. The case study focuses on the simple, but important, special case of univariate network classification, where the only information available is the structure of class linkage in the network (i.e., only links and class labels are available). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. Among other things, the results demonstrate clearly that simple network-classification models perform well enough that they should be used as baseline classifiers for studies of relational learning for networked data.
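The univariate special case the case study focuses on can be sketched concretely. The weighted-vote relational neighbour idea, in simplified form: a node's class probability is the mean of its neighbours' current probabilities, with known nodes clamped, iterated to a fixed point (relaxation labelling). Only links and class labels are used, exactly the setting the abstract describes; the sketch omits NetKit's local classifier and edge weights.

```python
def wvrn(adj, seed_probs, iters=50):
    """adj: node -> list of neighbours; seed_probs: node -> known
    probability of the positive class. Unknown nodes start at 0.5 and
    repeatedly average their neighbours; known nodes stay clamped."""
    probs = {v: seed_probs.get(v, 0.5) for v in adj}
    for _ in range(iters):
        new = {}
        for v, nbrs in adj.items():
            if v in seed_probs:
                new[v] = seed_probs[v]      # labelled node: keep its value
            elif nbrs:
                new[v] = sum(probs[u] for u in nbrs) / len(nbrs)
            else:
                new[v] = probs[v]           # isolated node: unchanged
        probs = new
    return probs
```

On a chain with the two endpoints labelled, the fixed point interpolates linearly between them, which illustrates why class linkage alone can carry substantial predictive power.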
Log Classification using K-Means Clustering for Identify Internet User Behaviors
International Journal of Computer Applications, 2016
The Internet has become a necessity in today's society; almost any information is accessible via a web browser. However, these activities can affect users, one such effect being a change in behavior. This study focuses on the activities of Internet users based on network log data at an educational institution. The data used in this study resulted from a one-week observation at one of the universities in Yogyakarta. Network activity logs are one type of big data, so data mining with the K-Means algorithm is used to determine the behavior of Internet users. The K-Means algorithm is used for clustering based on the number of visitors. The visitors are divided into three clusters: low, with 1479 records; medium, with 126 records; and high, with 33 records. Categorization is also performed by access time and by the website content present in the data, in order to compare against the results of the K-Means clustering. The results for this educational institution show that each of these clusters produces websites that are frequented in the sequence: search, social media, news, and information websites. The study also reveals that the cyber-profiling performed is strongly influenced by environmental factors and daily
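The clustering step this study applies is ordinary K-means on visit counts, which a minimal one-dimensional implementation makes concrete. The data below are made up for illustration; the study's actual log features and counts are not reproduced.

```python
def kmeans_1d(values, k=3, iters=100):
    """Minimal 1-D K-means, of the kind used to split websites into
    low/medium/high visitor-count clusters."""
    vs = sorted(set(values))
    # spread the initial centres across the sorted value range
    centers = [vs[i * (len(vs) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        new = [sum(g) / len(g) if g else centers[i]
               for i, g in enumerate(groups)]
        if new == centers:               # converged
            break
        centers = new
    return centers, groups
```

With k=3 the three resulting groups map directly onto the study's low/medium/high visitor categories.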
Partitioning-based clustering for Web document categorization
Decision Support Systems, 1999
Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents, even in the presence of a very high dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomeration clustering, and Bayesian classification methods, such as AutoClass.
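The flavour of partitioning-based clustering without a pre-specified distance function can be sketched as follows. This is a much-simplified stand-in for the paper's methods (association-rule hypergraph partitioning and principal-direction partitioning are far more sophisticated): connect two documents when they share enough terms, then take connected components of that graph as clusters.

```python
def partition_clusters(docs, min_shared=2):
    """Cluster documents via the components of a shared-term graph:
    an edge joins docs i and j when they share >= min_shared terms."""
    terms = [set(d.lower().split()) for d in docs]
    n = len(docs)
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if len(terms[i] & terms[j]) >= min_shared:
                adj[i].append(j)
                adj[j].append(i)
    # connected components by depth-first search
    seen, clusters = set(), []
    for i in range(n):
        if i in seen:
            continue
        stack, comp = [i], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v])
        seen |= comp
        clusters.append(sorted(comp))
    return clusters
```

Note that no distance function over the high-dimensional term space is ever defined; similarity emerges from the graph structure itself, which is the property the abstract emphasizes.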