Network Sampling: Methods and Applications

Biological network comparison using graphlet degree distribution

Bioinformatics, 2007

Motivation: Analogous to biological sequence comparison, comparing cellular networks is an important problem that could provide insight into biological understanding and therapeutics. For technical reasons, comparing large networks is computationally infeasible, and thus heuristics, such as the degree distribution, clustering coefficient, diameter, and relative graphlet frequency distribution, have been sought. It is easy to demonstrate that two networks are different by simply showing a short list of properties in which they differ. It is much harder to show that two networks are similar, as it requires demonstrating their similarity in all of their exponentially many properties. Clearly, it is computationally prohibitive to analyze all network properties, but the larger the number of constraints we impose in determining network similarity, the more likely it is that the networks will truly be similar. Results: We introduce a new systematic measure of a network's local structure ...

Graphlet characteristics in directed networks

Graphlet analysis is a part of network theory that does not depend on the choice of a network null model and can provide a comprehensive description of local network structure. Here, we propose a novel method for graphlet-based analysis of directed networks that first computes the signature vector of every vertex in the network and then the graphlet correlation matrix of the network. This analysis has been applied to brain effective connectivity networks by considering both the direction and the sign (inhibitory or excitatory) of the underlying directed (effective) connectivity. In particular, the signature vectors for brain regions and the graphlet correlation matrices of the brain effective network are computed for 40 healthy subjects, and common dependencies are revealed. We found that the signature vectors (node, wedge, and triangle degrees) are dominant for the excitatory effective brain networks. Moreover, by considering only those correlations (or anti-correlations) in the correlation matrix that are significant (>0.7 or <−0.7) and are present in more than 60% of the subjects, we found that excitatory effective brain networks show stronger causal patterns (G-causes and G-effects, measured with Granger causality) than inhibitory effective brain networks.
The complexity of systems is frequently the result of non-trivial local connectivity and interaction of their constituent parts. A number of network structural characteristics have recently been the subject of particularly intense research, including degree distributions [1], community structure [2,3], and various measures of vertex centrality [4,5], to mention only a few. Vertices may have attributes associated with them; for example, properties of proteins in protein-protein interaction networks [6], users' social network profiles [7], or authors' publication histories in co-authorship networks [8]. Two approaches that focus on the local connectivity of subgraphs within a network are motifs and graphlets.
Motifs are defined as subgraphs that repeat frequently in a network, i.e., they occur at a higher frequency than in random graphs [9,10], and they therefore depend on the choice of the network's null model. In contrast, graphlets are induced subgraphs of a network that may appear at any frequency and hence are independent of a null model. They were introduced recently [11] and have found numerous applications as building blocks of network analysis in disciplines ranging from social science [12,13] to biology [14,15]. In social science, graphlet analysis (known as subgraph census) is widely adopted in sociometric studies [12]. Much of the work in this vein has focused on analyzing triadic tendencies as important structural features of social networks (e.g., transitivity or triadic closure) as well as analyzing triadic configurations as the basis for various social network theories (e.g., social balance, strength of weak ties, stability of ties, or trust [16]). In biology, graphlets have been used to infer protein structure [17], to compare biological networks [14,15], and to characterize the relationship between disease and the structure of networks [18]. Many real-world networks are directed, but until now no graphlet-based method has been proposed that can provide information about the local structure of directed networks. Here, we offer a graphlet-based approach for analysis of the local structure of a directed network. In the method proposed in this manuscript, we compute for each vertex a vector of structural features, called the signature vector, based on the number of graphlets associated with the vertex, and for the network its graphlet correlation matrix, which measures graphlet dependencies that reveal unknown organizational principles of the network. We applied the technique to brain effective networks of 40 healthy subjects, and we found that many of the subjects share similar patterns in their network's local structure.
In brain networks a node is associated with different types of elements, depending on the level of interest in the brain, and an edge represents the connection or interaction between two elements [19]. If the brain is studied on
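The two-step idea described above (a per-vertex signature vector, then a correlation matrix across signature components) can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the paper: the graph is undirected and unsigned, the signature is just (degree, wedges centered at the vertex, triangles), and Pearson correlation is used. The function names `signature_vectors`, `pearson`, and `correlation_matrix` are hypothetical.

```python
from itertools import combinations
from math import sqrt

def signature_vectors(edges):
    """Map each vertex to (degree, wedges centered at it, triangles through it).

    Simplified undirected stand-in for a signature vector (an assumption).
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    sig = {}
    for v, nbrs in adj.items():
        # A neighbor pair that is itself connected closes a triangle at v.
        tri = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        pairs = len(nbrs) * (len(nbrs) - 1) // 2
        sig[v] = (len(nbrs), pairs - tri, tri)
    return sig

def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)

def correlation_matrix(sig):
    """Correlate signature components with each other across all vertices."""
    cols = list(zip(*sig.values()))
    return [[pearson(a, b) for b in cols] for a in cols]

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]  # a triangle with one pendant vertex
sig = signature_vectors(edges)
gcm = correlation_matrix(sig)
```

Entries of `gcm` above a chosen threshold (the abstract uses 0.7 in absolute value) would be the dependencies of interest; the directed, signed variant needs a richer set of signature components than this sketch provides.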

Efficient estimation of graphlet frequency distributions in protein–protein interaction networks

Bioinformatics, 2006

Motivation: Algorithmic and modeling advances in the area of protein–protein interaction (PPI) network analysis could contribute to the understanding of biological processes. Local structure of networks can be measured by the frequency distribution of graphlets, small connected non-isomorphic induced subgraphs. This measure of local structure has been used to show that high-confidence PPI networks have local structure of geometric random graphs. Finding graphlets exhaustively in a large network is computationally intensive. More complete PPI networks, as well as PPI networks of higher organisms, will thus require efficient heuristic approaches. Results: We propose two efficient and scalable heuristics for finding graphlets in high-confidence PPI networks. We show that both PPI networks and their model geometric random networks have defined boundaries that are sparser than the ‘inner parts’ of the networks. In addition, these networks exhibit ‘uniformity’ of local structure inside the networ...
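To make "frequency distribution of graphlets" concrete, here is a minimal sketch of exhaustive counting restricted to the two connected 3-node graphlets (the induced path and the triangle). It is a baseline illustration only, not either of the heuristics proposed in the paper, and the function names are hypothetical.

```python
from itertools import combinations

def count_3node_graphlets(edges):
    """Exactly count induced 3-node paths and triangles in an undirected graph."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    paths = 0
    triangle_corners = 0
    for center, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                triangle_corners += 1  # each triangle is seen once per corner
            else:
                paths += 1             # induced path a - center - b
    return {"path": paths, "triangle": triangle_corners // 3}

def relative_frequencies(counts):
    """Normalize raw graphlet counts so that they sum to 1."""
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]  # a triangle with one pendant vertex
counts = count_3node_graphlets(edges)
freqs = relative_frequencies(counts)
```

The inner loop visits every pair of neighbors of every vertex, so the cost grows with the sum of squared degrees. Larger graphlets make exhaustive enumeration far more expensive, which is the motivation for heuristics.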

Comparison of tissue/disease specific integrated networks using directed graphlet signatures

Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - BCB '16, 2016

Background: Analysis of integrated genome-scale networks is a challenging problem due to heterogeneity of high-throughput data. There are several topological measures, such as graphlet counts, for characterization of biological networks. Results: In this paper, we present methods for counting small sub-graph patterns in integrated genome-scale networks which are modeled as labeled multidigraphs. We have obtained physical, regulatory, and metabolic interactions between H. sapiens proteins from the Pathway Commons database. The integrated network is filtered for tissue/disease specific proteins by using a large-scale human transcriptional profiling study, resulting in several tissue and disease specific sub-networks. We have applied and extended the idea of graphlet counting in undirected protein-protein interaction (PPI) networks to directed multi-labeled networks and represented each network as a vector of graphlet counts. Graphlet counts are assessed for statistical significance by comparison against a set of randomized networks. We present our results on analysis of differential graphlets between different conditions and on the utility of graphlet count vectors for clustering multiple condition specific networks. Conclusions: Our results show that there are numerous statistically significant graphlets in integrated biological networks and the graphlet signature vector can be used as an effective representation of a multi-labeled network for clustering and systems level analysis of tissue/disease specific networks.
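The significance step mentioned above (comparing an observed graphlet count against counts from randomized networks) might be sketched as follows. The abstract does not say which statistic or randomization scheme is used, so the z-score, the one-sided empirical p-value, and the function name `graphlet_significance` are all assumptions. The null counts are taken as given input.

```python
from statistics import mean, pstdev

def graphlet_significance(observed, null_counts):
    """Return (z_score, empirical_p) for one graphlet type.

    observed    : count of the graphlet in the real network
    null_counts : counts of the same graphlet in randomized networks
    """
    mu = mean(null_counts)
    sigma = pstdev(null_counts)  # population std. dev. of the null sample (assumption)
    if sigma > 0:
        z = (observed - mu) / sigma
    else:
        z = 0.0 if observed == mu else float("inf") * (1 if observed > mu else -1)
    # One-sided empirical p-value with the usual +1 correction.
    exceed = sum(1 for c in null_counts if c >= observed)
    p = (1 + exceed) / (1 + len(null_counts))
    return z, p

z, p = graphlet_significance(10, [2, 4, 4, 4, 5, 5, 7, 9])
```

With only a handful of randomized networks the empirical p-value cannot fall below 1/(N+1), so the number of randomizations bounds the significance level that can be reported.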

Exploiting graphlet decomposition to explain the structure of complex networks: the GHuST framework

Scientific Reports

The characterization of topology is crucial in understanding network evolution and behavior. This paper presents an innovative approach, the GHuST framework to describe complex-network topology from graphlet decomposition. This new framework exploits the local information provided by graphlets to give a global explanation of network topology. The GHuST framework is comprised of 12 metrics that analyze how 2- and 3-node graphlets shape the structure of networks. The main strengths of the GHuST framework are enhanced topological description, size independence, and computational simplicity. It allows for straight comparison among different networks disregarding their size. It also reduces the complexity of graphlet counting, since it does not use 4- and 5-node graphlets. The application of the novel framework to a large set of networks shows that it can classify networks of distinct nature based on their topological properties. To ease network classification and enhance the graphical r...

Uncovering Biological Network Function via Graphlet Degree Signatures

Cancer Informatics, 2008

Motivation: Proteins are essential macromolecules of life and thus understanding their function is of great importance. The number of functionally unclassified proteins is large even for simple and well studied organisms such as baker's yeast. Methods for determining protein function have shifted their focus from targeting specific proteins based solely on sequence homology to analyses of the entire proteome based on protein-protein interaction (PPI) networks. Since proteins interact to perform a certain function, analyzing structural properties of PPI networks may provide useful clues about the biological function of individual proteins, protein complexes they participate in, and even larger subcellular machines. Results: We design a sensitive graph theoretic method for comparing local structures of node neighborhoods that demonstrates that in PPI networks, biological function of a node and its local network structure are closely related. The method summarizes a protein's local ...

GUISE: Uniform Sampling of Graphlets for Large Graph Analysis

2012

Graphlet frequency distribution (GFD) has recently become popular for characterizing large networks. However, the computation of GFD for a network requires the exact count of embedded graphlets in that network, which is a computationally expensive task. As a result, it is practically infeasible to compute the GFD for even a moderately large network. In this paper, we propose GUISE, which uses a Markov Chain Monte Carlo (MCMC) sampling method for constructing the approximate GFD of a large network. Our experiments on networks with millions of nodes show that GUISE obtains the GFD within a few minutes, whereas the exhaustive counting based approach takes several days.
[Figure/table residue, dataset sizes: Cit 92-94 (V=4,340, E=12,917); Cit 92-96 (V=9,186, E=53,183); Cit 92-98 (V=14,572, E=125,346); Cit 92-00 (V=8,000, E=20,523); Cit 92-03 (V=27,770, E=352,807)]
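As a rough illustration of MCMC-based GFD estimation, the sketch below runs a Metropolis-Hastings walk over connected 3-node vertex sets, moving between sets that differ in one vertex. This is a simplified stand-in under stated assumptions, not GUISE itself: it handles only 3-node graphlets, the neighborhood definition and all function names are hypothetical, and the walk only reaches states connected to the starting triple.

```python
import random

def build_adj(edges):
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj

def edges_inside(adj, triple):
    a, b, c = triple
    return (b in adj[a]) + (c in adj[a]) + (c in adj[b])

def neighbors(adj, state):
    """Connected 3-sets reachable from `state` by swapping exactly one vertex."""
    out = set()
    for v in state:
        rest = state - {v}
        candidates = set().union(*(adj[u] for u in rest)) - state
        for w in candidates:
            new = rest | {w}
            if edges_inside(adj, new) >= 2:  # 3 vertices, >= 2 edges: connected
                out.add(new)
    return sorted(out, key=sorted)  # fixed order keeps runs reproducible

def estimate_gfd(adj, start, steps, seed=0):
    """Estimate the path/triangle frequencies with a Metropolis-Hastings walk."""
    rng = random.Random(seed)
    x = frozenset(start)
    nx = neighbors(adj, x)
    counts = {"path": 0, "triangle": 0}
    for _ in range(steps):
        if nx:
            y = rng.choice(nx)
            ny = neighbors(adj, y)
            # Accept with min(1, deg(x)/deg(y)) so the target is uniform over states.
            if rng.random() < min(1.0, len(nx) / len(ny)):
                x, nx = y, ny
        counts["triangle" if edges_inside(adj, x) == 3 else "path"] += 1
    return {k: c / steps for k, c in counts.items()}

adj = build_adj([(1, 2), (2, 3), (1, 3), (3, 4)])  # triangle plus a pendant vertex
gfd = estimate_gfd(adj, start=(1, 2, 3), steps=20000, seed=1)
```

The acceptance ratio corrects for the fact that some graphlets have more neighboring graphlets than others; without it the walk would oversample well-connected regions. On this toy graph the exact answer is one triangle and two induced paths, so the estimate should settle near 1/3 and 2/3.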

Topological Inquisition into the PPI Networks Associated with Human Diseases Through Graphlet Frequency Distribution

Lecture Notes in Computer Science, 2017

In this article, we propose a new framework to compare the topological structure of protein-protein interaction (PPI) networks constructed from disease-associated proteins. Here, similarity of local topological structure between networks is discovered through the analysis of frequent sub-patterns occurring in them, using a novel similarity measure based on graphlet frequency distribution. Graphlets are small connected non-isomorphic induced subgraphs of a network, which provide detailed topological statistics about it. We have analyzed the pairwise similarity of 22 disease-associated PPI networks and compared their topological and biological characteristics. It has been observed that the PPI networks associated with disease classes 'metabolic' and 'neurological' have the highest similarity scores. Higher similarity has also been observed for networks of disease classes 'bone' and 'skeletal'; 'endocrine' and 'multiple'; and 'gastrointestinal' and 'respiratory'. Topological analysis of the networks also reveals that degree and betweenness centrality of proteins are strongly correlated for the network pairs with high similarity scores. We have also performed gene ontology and pathway based analysis of the proteins involved in the disease-associated networks.

Graft: An Efficient Graphlet Counting Method for Large Graph Analysis

IEEE Transactions on Knowledge and Data Engineering, 2014

The majority of existing work on network analysis studies properties that are related to the global topology of a network. Examples of such properties include diameter, power-law exponent, and spectra of the graph Laplacian. Such work enhances our understanding of real-life networks and enables us to generate synthetic graphs with real-life graph properties. However, many existing problems on networks require the study of local topological structures of a network, which has not received the attention it deserves in existing work. In this work, we use graphlet frequency distribution (GFD) as an analysis tool for understanding the variance of local topological structure in a network; we also show that it can help in comparing and characterizing real-life networks. The main bottleneck in obtaining the GFD is the excessive computation cost of obtaining the frequency of each of the graphlets in a large network. To overcome this, we propose a simple yet powerful algorithm, called GRAFT, that obtains the approximate graphlet frequency for all graphlets that have up to five vertices. Compared to an exact counting algorithm, our algorithm achieves a speedup factor between 10 and 100 for a negligible counting error, which is, on average, less than 5 percent.
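The trade-off between speed and counting error can be illustrated with a small edge-sampling estimator for triangles. This sketch is an assumption-laden illustration of approximate counting in general, not a reproduction of GRAFT: it covers only the triangle graphlet, and the sampling scheme, the rescaling, and the name `approx_triangle_count` are hypothetical.

```python
import random

def approx_triangle_count(edges, sample_fraction, seed=0):
    """Estimate the number of triangles from a uniform sample of edges.

    Each triangle contains exactly 3 edges, so summing the triangles found
    around sampled edges and rescaling by (num_edges / (3 * sample_size))
    gives an unbiased estimate of the total.
    """
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    # Deduplicated, canonically ordered edge list.
    edge_list = sorted({tuple(sorted(e)) for e in edges if e[0] != e[1]})
    if not edge_list:
        return 0.0
    k = min(len(edge_list), max(1, round(sample_fraction * len(edge_list))))
    rng = random.Random(seed)
    sample = rng.sample(edge_list, k)
    # Common neighbors of u and v are exactly the triangles containing edge (u, v).
    hits = sum(len(adj[u] & adj[v]) for u, v in sample)
    return hits * len(edge_list) / (3 * k)

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]  # one triangle, one pendant edge
exact = approx_triangle_count(edges, sample_fraction=1.0)
rough = approx_triangle_count(edges, sample_fraction=0.5, seed=3)
```

With `sample_fraction=1.0` every edge is inspected and the result is exact; smaller fractions do proportionally less work at the cost of variance in the estimate.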