Graft: An Efficient Graphlet Counting Method for Large Graph Analysis
Related papers
RAGE – A rapid graphlet enumerator for large networks
Computer Networks, 2012
Counting network graphlets (and motifs) was shown to have an important role in studying a wide range of complex networks. However, when the network size is large, as in the case of the Internet topology and WWW graphs, counting the number of graphlets becomes prohibitive for graphlets of size 4 and above. Devising efficient graphlet counting algorithms thus becomes an important goal.
E-CLoG: Counting edge-centric local graphlets
2017 IEEE International Conference on Big Data (Big Data), 2017
In recent years, graphlet counting has emerged as an important task in topological graph analysis. However, existing works on graphlet counting obtain the graphlet counts for the entire network as a whole. These works capture the key graphical patterns that prevail in a given network, but they fail to meet the demand of the majority of real-life graph-related prediction tasks, such as link prediction and edge/node classification, which require building features for an edge (or a vertex) of a network. To meet the demand of such applications, efficient algorithms are needed for counting local graphlets within the context of an edge (or a vertex). In this work, we propose an efficient method, titled E-CLOG, for counting all local graphlets of size 3, 4 and 5 within the context of a given edge, for all of its different edge orbits. We also provide a shared-memory, multi-core implementation of E-CLOG, which makes it even more scalable for very large real-world networks. In particular, we obtain strong scaling on a variety of graphs (14x-20x on 36 cores). We provide extensive experimental results to demonstrate the efficiency and effectiveness of the proposed method. For instance, we show that E-CLOG is faster than existing work by multiple orders of magnitude; for the Wordnet graph, E-CLOG counts all local graphlets of size 3, 4 and 5 in 1.5 hours using a single thread and in only a few minutes using the parallel implementation, whereas the baseline method does not finish in more than 4 days. We also show that local graphlet counts around an edge are much better features for link prediction than well-known topological features; our experiments show that the former yield an improvement of between 10% and 45% in the AUC value for predicting future links in three real-life social and collaboration networks.
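The edge-centric idea in the abstract above can be illustrated for the smallest case, 3-node graphlets. The following is a minimal sketch, not the E-CLOG algorithm itself: it assumes an undirected simple graph stored as adjacency sets, and the helper names `build_adjacency` and `edge_local_3_graphlets` are hypothetical. For an existing edge (u, v), every common neighbor closes a triangle, and every vertex adjacent to exactly one endpoint forms an induced 3-node path through the edge.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build adjacency sets for an undirected simple graph (self-loops dropped)."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def edge_local_3_graphlets(adj, u, v):
    """Return (triangles, open_paths) among 3-node graphlets containing edge (u, v).

    Assumes (u, v) is an edge of the graph (hypothetical helper, for illustration).
    """
    # Common neighbors of u and v each close one triangle on the edge.
    triangles = len(adj[u] & adj[v])
    # Neighbors of exactly one endpoint give an induced 3-node path.
    only_u = len(adj[u]) - 1 - triangles  # exclude v and the common neighbors
    only_v = len(adj[v]) - 1 - triangles  # exclude u and the common neighbors
    return triangles, only_u + only_v


edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]
adj = build_adjacency(edges)
print(edge_local_3_graphlets(adj, 1, 2))  # edge shared by two triangles
print(edge_local_3_graphlets(adj, 3, 4))  # pendant edge, no triangles
```

Such per-edge counts could then serve as a small feature vector for an edge; the 4- and 5-node cases described in the abstract need considerably more case analysis than this sketch shows.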
GUISE: Uniform Sampling of Graphlets for Large Graph Analysis
2012
Graphlet frequency distribution (GFD) has recently become popular for characterizing large networks. However, the computation of GFD for a network requires the exact count of embedded graphlets in that network, which is a computationally expensive task. As a result, it is practically infeasible to compute the GFD for even a moderately large network. In this paper, we propose GUISE, which uses a Markov Chain Monte Carlo (MCMC) sampling method for constructing the approximate GFD of a large network. Our experiments on networks with millions of nodes show that GUISE obtains the GFD within a few minutes, whereas the exhaustive counting based approach takes several days.
Cit 92-94 (V=4,340, E=12,917); Cit 92-96 (V=9,186, E=53,183); Cit 92-98 (V=14,572, E=125,346); Cit 92-00 (V=8,000, E=20,523); Cit 92-03 (V=27,770, E=352,807)
Efficient Batch Dynamic Graphlet Counting
arXiv (Cornell University), 2023
Graphlet counting is an important problem as it has numerous applications in several fields, including social network analysis, biological network analysis, transaction network analysis, etc. Most of the practical networks are dynamic. A graphlet is a subgraph with a fixed number of vertices and can be induced or non-induced. There are several works for counting graphlets in a static network where graph topology never changes. Surprisingly, there have been no scalable and practical algorithms for maintaining all fixed-sized graphlets in a dynamic network where the graph topology changes over time. We are the first to propose an efficient algorithm for maintaining graphlets in a fully dynamic network. Our algorithm is efficient because (1) we consider only the region of changes in the graph for updating the graphlet count, and (2) we use an efficient algorithm for counting graphlets in the region of change. We show by experimental evaluation that our technique is more than 10x faster than the baseline approach.
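The "region of change" idea in the abstract above can be sketched for the simplest graphlet, the triangle. This is an illustrative simplification and not the paper's algorithm, which maintains all fixed-size graphlets; the class name `DynamicTriangleCounter` and its methods are hypothetical. When an edge (u, v) is inserted or deleted, only triangles through that edge change, and their number equals the number of common neighbors of u and v.

```python
from collections import defaultdict


class DynamicTriangleCounter:
    """Maintain the global triangle count of an undirected simple graph
    under edge insertions and deletions (illustrative sketch only)."""

    def __init__(self):
        self.adj = defaultdict(set)
        self.triangles = 0

    def insert_edge(self, u, v):
        # Ignore self-loops and edges that already exist.
        if u == v or v in self.adj[u]:
            return
        # Each common neighbor of u and v closes one new triangle.
        self.triangles += len(self.adj[u] & self.adj[v])
        self.adj[u].add(v)
        self.adj[v].add(u)

    def delete_edge(self, u, v):
        # Ignore edges that are not present.
        if v not in self.adj[u]:
            return
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Each common neighbor of u and v loses one triangle.
        self.triangles -= len(self.adj[u] & self.adj[v])


counter = DynamicTriangleCounter()
for edge in [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3), (0, 3)]:
    counter.insert_edge(*edge)
print(counter.triangles)  # triangle count of the complete graph on 4 vertices
counter.delete_edge(0, 1)
print(counter.triangles)  # count after removing one edge
```

Each update touches only the two endpoint neighborhoods, which is why restricting work to the changed region can be much cheaper than recounting the whole graph.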
Network Global Testing by Counting Graphlets
2018
Consider a large social network with possibly severe degree heterogeneity and mixed memberships. We are interested in testing whether the network has only one community or more than one community. The problem is known to be non-trivial, partially due to the presence of severe degree heterogeneity. We construct a class of test statistics using the numbers of short paths and short cycles, and the key to our approach is a general framework for canceling the effects of degree heterogeneity. The tests compare favorably with existing methods. We support our methods with careful analysis and numerical study with simulated data and a real data example.
Exploiting graphlet decomposition to explain the structure of complex networks: the GHuST framework
Scientific Reports
The characterization of topology is crucial in understanding network evolution and behavior. This paper presents an innovative approach, the GHuST framework, to describe complex-network topology from graphlet decomposition. This new framework exploits the local information provided by graphlets to give a global explanation of network topology. The GHuST framework comprises 12 metrics that analyze how 2- and 3-node graphlets shape the structure of networks. The main strengths of the GHuST framework are enhanced topological description, size independence, and computational simplicity. It allows for direct comparison among different networks regardless of their size. It also reduces the complexity of graphlet counting, since it does not use 4- and 5-node graphlets. The application of the novel framework to a large set of networks shows that it can classify networks of distinct nature based on their topological properties. To ease network classification and enhance the graphical r...
Graphlet characteristics in directed networks
Graphlet analysis is a part of network theory that does not depend on the choice of the network null model and can provide a comprehensive description of the local network structure. Here, we propose a novel method for graphlet-based analysis of directed networks by first computing the signature vector for every vertex in the network and then the graphlet correlation matrix of the network. This analysis has been applied to brain effective connectivity networks by considering both the direction and the sign (inhibitory or excitatory) of the underlying directed (effective) connectivity. In particular, the signature vectors for brain regions and the graphlet correlation matrices of the brain effective network are computed for 40 healthy subjects, and common dependencies are revealed. We found that the signature vectors (node, wedge, and triangle degrees) are dominant for the excitatory effective brain networks. Moreover, by considering only those correlations (or anti-correlations) in the correlation matrix that are significant (>0.7 or <−0.7) and are present in more than 60% of the subjects, we found that excitatory effective brain networks show stronger causal (measured with Granger causality) patterns (G-causes and G-effects) than inhibitory effective brain networks.
The complexity of systems is frequently the result of non-trivial local connectivity and interaction of their constituent parts. A number of network structural characteristics have recently been the subject of particularly intense research, including degree distributions [1], community structure [2,3], and various measures of vertex centrality [4,5], to mention only a few. Vertices may have attributes associated with them; for example, properties of proteins in protein-protein interaction networks [6], users' social network profiles [7], or authors' publication histories in co-authorship networks [8]. Two approaches that focus on the local connectivity of subgraphs within a network are Motifs and Graphlets.
Motifs are defined as sub-graphs that repeat frequently in the networks, i.e., they repeat at a frequency higher than in random graphs [9,10], and they depend on the choice of the network's null model. In contrast, graphlets are induced sub-graphs of a network that appear at any frequency and hence are independent of a null model. They have been introduced recently [11] and they have found numerous applications as building blocks of network analysis in various disciplines ranging from social science [12,13] to biology [14,15]. In social science, graphlet analysis (known as sub-graph census) is widely adopted in sociometric studies [12]. Much of the work in this vein focused on analyzing triadic tendencies as important structural features of social networks (e.g., transitivity or triadic closure) as well as analyzing triadic configurations as the basis for various social network theories (e.g., social balance, strength of weak ties, stability of ties, or trust [16]). In biology, graphlets were used to infer protein structure [17], to compare biological networks [14,15], and to characterize the relationship between disease and structure of networks [18]. Many real-world networks are directed, but until now no graphlet-based method has been proposed that can provide information about the local structure of directed networks. Here, we offer a graphlet-based approach for analysis of the local structure of a directed network. In the method proposed in this manuscript, we compute, for each vertex, a vector of structural features, called the signature vector, based on the number of graphlets associated with the vertex, and, for the network, its graphlet correlation matrix, which measures graphlet dependencies that reveal unknown organizational principles of the network. We applied the technique to brain effective networks of 40 healthy subjects, and we found that many of the subjects share similar patterns in their network's local structure.
In brain networks a node is associated with different types of elements, depending on the level of interest in the brain, and an edge represents the connection or interaction between two elements [19]. If the brain is studied on
Efficient Enumeration of Four Node Graphlets at Trillion-Scale
2020
Graphlet enumeration is known to be a challenging task in graph analysis. This is because the cost is exponential in the order of the graphlet. The triangle is a graphlet of order three that has received special attention because it is relatively small but non-trivial, and can still be enumerated quite fast even for massive graphs of millions of nodes and edges. In this paper, we propose an efficient algorithm for enumerating four-node graphlets, such as 4-cycles, 4-cliques, diamonds, etc., by leveraging the most efficient algorithm for triangle enumeration. We show that despite the belief that any such enumeration algorithm cannot terminate in reasonable time, our method can handle large graphs containing trillions of such graphlets, using a single commodity machine, within a reasonable amount of time.
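One way to see how triangle information can drive four-node counting, as the abstract above describes, is the following minimal sketch. It is not the paper's algorithm and makes no claim to trillion-scale efficiency; the function name `count_diamonds_and_4cliques` is hypothetical, and vertex labels are assumed to be mutually comparable (for example, integers). The common neighbors of an edge (u, v) are exactly the apexes of the triangles on that edge, so each pair of such apexes forms either an induced diamond (if the pair is non-adjacent) or a 4-clique (if adjacent).

```python
from collections import defaultdict
from itertools import combinations


def count_diamonds_and_4cliques(edges):
    """Count induced diamonds and 4-cliques in an undirected simple graph.

    Illustrative sketch: pairs up the triangles that share an edge.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    diamonds = 0
    clique_hits = 0
    for u in list(adj):
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                # Apexes of the triangles that contain edge (u, v).
                common = sorted(adj[u] & adj[v])
                for a, b in combinations(common, 2):
                    if b in adj[a]:
                        clique_hits += 1  # {u, v, a, b} is a 4-clique
                    else:
                        diamonds += 1  # induced diamond with chord (u, v)

    # An induced diamond has a unique chord, so it is found exactly once;
    # a 4-clique is found once per each of its 6 edges.
    return diamonds, clique_hits // 6


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_diamonds_and_4cliques(k4))       # complete graph on 4 vertices
print(count_diamonds_and_4cliques(k4[:-1]))  # same graph without edge (2, 3)
```

The sketch covers only two of the four-node graphlet types named in the abstract; 4-cycles, for instance, contain no triangle and would need a different enumeration step.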
Efficient estimation of graphlet frequency distributions in protein–protein interaction networks
Bioinformatics, 2006
Motivation: Algorithmic and modeling advances in the area of protein–protein interaction (PPI) network analysis could contribute to the understanding of biological processes. Local structure of networks can be measured by the frequency distribution of graphlets, small connected non-isomorphic induced subgraphs. This measure of local structure has been used to show that high-confidence PPI networks have the local structure of geometric random graphs. Finding graphlets exhaustively in a large network is computationally intensive. More complete PPI networks, as well as PPI networks of higher organisms, will thus require efficient heuristic approaches. Results: We propose two efficient and scalable heuristics for finding graphlets in high-confidence PPI networks. We show that both PPI networks and their model geometric random networks have defined boundaries that are sparser than the ‘inner parts’ of the networks. In addition, these networks exhibit ‘uniformity’ of local structure inside the networ...