Introduction of Empirical Topology in Construction of Relationship Networks of Informative Objects

A Link-Analysis Extension of Correspondence Analysis for Mining Relational Databases

IEEE Transactions on Knowledge and …, 2010

This work introduces a link-analysis procedure for discovering relationships in a relational database or a graph, generalizing both simple and multiple correspondence analysis. It is based on a random-walk model through the database, defining a Markov chain having as many states as elements in the database. Suppose we are interested in analyzing the relationships between some elements (or records) contained in two different tables of the relational database. To this end, in a first step, a much smaller reduced Markov chain, containing only the elements of interest and preserving the main characteristics of the initial chain, is extracted by stochastic complementation. This reduced chain is then analyzed by jointly projecting the elements of interest into the diffusion-map subspace [42] and visualizing the results. This two-step procedure reduces to simple correspondence analysis when only two tables are defined and to multiple correspondence analysis when the database takes the form of a simple star schema. In addition, a kernel version of the diffusion-map distance, generalizing the basic diffusion-map distance to directed graphs, is introduced and the links with spectral clustering are discussed. Several datasets are analyzed using the proposed methodology, showing the usefulness of the technique for extracting relationships in relational databases or graphs.
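The first step described above, stochastic complementation, can be illustrated with a minimal sketch. It assumes the standard textbook definition: for a transition matrix partitioned into kept and dropped states, the reduced chain is S = P11 + P12 (I - P22)^(-1) P21. The function names, the toy matrix `P`, and the choice of kept states are hypothetical and are not taken from the paper.

```python
def solve(a, b):
    """Solve a @ x = b for the matrix x by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def stochastic_complement(P, keep):
    """Reduce the chain P (list of rows) to the states listed in `keep`."""
    n = len(P)
    drop = [i for i in range(n) if i not in keep]
    P11 = [[P[i][j] for j in keep] for i in keep]
    if not drop:
        return P11
    P12 = [[P[i][j] for j in drop] for i in keep]
    P21 = [[P[i][j] for j in keep] for i in drop]
    i_minus_p22 = [[(1.0 if i == j else 0.0) - P[i][j] for j in drop] for i in drop]
    X = solve(i_minus_p22, P21)  # X = (I - P22)^(-1) P21
    return [
        [P11[r][c] + sum(P12[r][k] * X[k][c] for k in range(len(drop)))
         for c in range(len(keep))]
        for r in range(len(keep))
    ]


# Toy 3-state chain (hypothetical data); keep states 0 and 1, eliminate state 2.
P = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
S = stochastic_complement(P, keep=[0, 1])
```

Each row of `S` still sums to 1, so the reduced matrix is again a valid transition matrix over the kept states only.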

Machine-assisted discovery of relationships in astronomy

Monthly Notices of the Royal Astronomical Society, 2013

High-volume, feature-rich data sets are becoming the bread and butter of 21st-century astronomy but present significant challenges to scientific discovery. In particular, identifying scientifically significant relationships between sets of parameters is non-trivial. Similar problems in the biological sciences and geosciences have led to the development of systems which can explore large parameter spaces and identify potentially interesting sets of associations. In this paper, we describe the application of automated relationship-discovery systems to astronomical data sets, focusing on an evolutionary programming technique and an information-theory technique. We demonstrate their use with two classical astronomical relationships: the Hertzsprung-Russell diagram and the Fundamental Plane of elliptical galaxies. We also show how they handle binary classification, which is relevant to the next generation of large synoptic sky surveys, such as the Large Synoptic Survey Telescope (LSST). We find that results comparable to those of more familiar techniques, such as decision trees, are achievable. Finally, we consider the reality of the relationships discovered and how this can be used for feature selection and extraction.

Relational topological clustering

2010

This paper introduces a new topological clustering formalism dedicated to categorical data arising in the form of a binary matrix or a sum of binary matrices. The proposed approach is based on the principle of Kohonen's model (conservation of topological order) and uses the Relational Analysis formalism by optimizing a cost function defined as a Condorcet criterion. We propose a hybrid algorithm which scales linearly with the size of the dataset, provides a natural identification of clusters, and allows the clustering result to be visualized on a two-dimensional grid while preserving the a priori topological order of the data. The proposed approach, called RTC, was validated on several datasets, and the experimental results showed very promising performance.

Local and global mappings of topology representing networks

Information Sciences; vol. 179, num. 21, pp. 3791-3803, 2009

As data analysis tasks often have to deal with complex data structures, nonlinear dimensionality reduction methods play an important role in exploratory data analysis. A number of nonlinear dimensionality reduction techniques have been proposed in the literature (e.g. Sammon mapping, Locally Linear Embedding). These techniques attempt to preserve either the local or the global geometry of the original data, and they perform metric or non-metric dimensionality reduction. Nevertheless, it is difficult to apply most of them to large data sets. There is a need for new algorithms that are able to combine vector quantisation and mapping methods in order to visualise the data structure in a low-dimensional vector space. In this paper we define a new class of algorithms to quantify and disclose the data structure; they are based on topology representing networks and apply different mapping methods to the low-dimensional visualisation. Not only are existing methods combined for that purpose, but a novel group of mapping methods (Topology Representing Network Map) is also introduced as part of this class. Topology Representing Network Maps utilise the main benefits of topology representing networks and of multidimensional scaling methods to disclose the real structure of the data set under study. To determine the main properties of the topology representing network based mapping methods, a detailed analysis of classical benchmark examples (the Wine and Optical Recognition of Handwritten Digits data sets) is presented.

Mining Graph Topological Patterns: Finding Covariations among Vertex Descriptors

IEEE Transactions on Knowledge and Data Engineering, 2013

In this article, we propose to mine the graph topology of a large attributed graph by finding regularities among vertex descriptors. Such descriptors are of two types: (1) the vertex attributes that correspond to the information conveyed by the vertices themselves and (2) some topological properties, used to describe the connectivity of each vertex in the graph. Such topological properties and attributes are mostly of numerical or ordinal types, and their similarity can be captured by quantifying their co-variation, that is, whether their largest or smallest values are supported mostly by the same set of vertices. A topological pattern is thus defined as a set of vertex attributes and topological properties that strongly co-vary over the vertices of the graph. Such a pattern mining task relies on frequent pattern mining and graph topology analysis to reveal the links that exist between the relation encoded by the graph and the vertex attributes. For instance, a topological pattern in a co-authorship graph, where vertices represent authors, edges encode co-authorship, and vertex attributes reveal the number of publications in several journals, could be "the higher the number of publications in IEEE TKDE, the higher the closeness centrality of the vertex within the graph". Hence, such a pattern discloses the fact that the number of times an author publishes in IEEE TKDE is positively correlated with her having co-authored papers with other central authors, inducing a rather short distance to other graph vertices. We propose several interestingness measures of topological patterns that differ w.r.t. the pairs of vertices considered while evaluating up and down co-variations between properties and attributes: (1) considering all the pairs of vertices makes it possible to find patterns that are true all over the graph; (2) taking into account only the vertex pairs that are in a specific order w.r.t. a selected attribute reveals the topological patterns that emerge with respect to this attribute; (3) examining the vertex pairs that are connected in the graph makes it possible to identify patterns that are structurally correlated to the relationship encoded by the graph. An efficient algorithm that combines searching and pruning strategies in the identification of the most relevant topological patterns is presented. Besides a classical empirical study, we report case studies on four real-life networks showing that our approach provides valuable knowledge in a feasible time.
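To make the notion of co-variation over all vertex pairs more concrete, the following is a minimal sketch, not the authors' interestingness measure. It assumes a simplified score: the fraction of vertex pairs that are ordered the same way by a vertex attribute and by a topological property (here the degree, chosen only because it is easy to compute). The graph, the attribute values, and all function names are hypothetical.

```python
from itertools import combinations


def degrees(edges):
    """Degree of each vertex in an undirected graph given as a list of edges."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def covariation_support(x, y):
    """Fraction of index pairs ordered the same way by x and y (ties do not count)."""
    pairs = list(combinations(range(len(x)), 2))
    if not pairs:
        return 0.0
    concordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    return concordant / len(pairs)


# Hypothetical co-authorship graph and per-author publication counts.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
publications = {"a": 9, "b": 4, "c": 5, "d": 1}

deg = degrees(edges)
vertices = sorted(publications)
support = covariation_support(
    [publications[v] for v in vertices],
    [deg[v] for v in vertices],
)
```

In this toy graph five of the six vertex pairs agree in order, the exception being `b` and `c`, which have the same degree, so `support` is 5/6. A score close to 1 suggests an "the higher the attribute, the higher the property" pattern.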

Extracting Labeled Topological Patterns from Samples of Networks

PLoS ONE, 2013

An advanced graph-theoretical approach is introduced that enables a higher level of functional interpretation of samples of directed networks with identical, fixed, pairwise different vertex labels that are drawn from a particular population. Compared to the analysis of single networks, their investigation promises to yield more detailed information about the represented system. Often, patterns of directed edges in sample element networks are too intractable for direct evaluation and interpretation. The new approach addresses the problem of simplifying topological information and characterizes such a sample of networks by finding its locatable characteristic topological patterns. These patterns, essentially sample-specific network motifs with vertex labeling, might represent the essence of the intricate topological information contained in all sample element networks and also provide a means of differentiating network samples. Central to the accuracy of this approach are the null model and its properties, which are needed to assign significance to topological patterns. As a proof of principle, the proposed approach has been applied to the analysis of networks that represent brain connectivity before and during painful stimulation in patients with major depression and in healthy subjects. The accomplished reduction of topological information enables a cautious functional interpretation of the altered neuronal processing of pain in both groups.

A systematic survey of point set distance measures for link discovery

Semantic Web, 2018

Large amounts of geo-spatial information have been made available with the growth of the Web of Data. While discovering links between resources on the Web of Data has been shown to be a demanding task, discovering links between geo-spatial resources proves to be even more challenging. This is partly due to the resources being described by means of vector geometry. In particular, discrepancies in granularity and error measurements across data sets make it difficult to select appropriate distance measures for geo-spatial resources. In this paper, we survey the existing literature for point-set measures that can be used to measure the similarity of vector geometries. We then present and evaluate the ten measures that we derived from the literature. We evaluate these measures with respect to their time-efficiency and their robustness against discrepancies in measurement and in granularity. To this end, we use samples of real data sets of different granularity as input for our evaluation framework. The results obtained on three different data sets suggest that most distance approaches can be made to scale. Moreover, while some distance measures are significantly slower than others, distance measures based on means, surjections and sums of minimal distances are robust against the different types of discrepancies.
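As an illustration of what point-set distance measures look like, here is a minimal sketch of three of the families named above, using commonly cited definitions: the Hausdorff distance, a mean-based distance (distance between the centroids), and a sum-of-minimums distance. It assumes planar Euclidean distance between points, whereas geo-spatial data would normally call for a geodesic distance; the function names and sample point sets are hypothetical and may differ in detail from the surveyed definitions.

```python
import math


def hausdorff(a, b):
    """Largest distance from any point in one set to its nearest point in the other."""
    def directed(x, y):
        return max(min(math.dist(p, q) for q in y) for p in x)
    return max(directed(a, b), directed(b, a))


def mean_distance(a, b):
    """Distance between the centroids of the two point sets."""
    def centroid(points):
        n = len(points)
        return tuple(sum(coord) / n for coord in zip(*points))
    return math.dist(centroid(a), centroid(b))


def sum_of_minimums(a, b):
    """Half the sum of nearest-neighbour distances, taken in both directions."""
    a_to_b = sum(min(math.dist(p, q) for q in b) for p in a)
    b_to_a = sum(min(math.dist(q, p) for p in a) for q in b)
    return 0.5 * (a_to_b + b_to_a)


# Two hypothetical geometries given as point sets.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 0.0), (3.0, 0.0)]

d_hausdorff = hausdorff(a, b)    # 2.0
d_mean = mean_distance(a, b)     # 1.0
d_som = sum_of_minimums(a, b)    # 1.5
```

The Hausdorff distance is driven entirely by the single worst-matched point, while the other two average over points, which is one intuition for why mean-based and sum-based measures can be less sensitive to outlying vertices.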

Mining Topological Relationship Patterns from Spatiotemporal Databases

International Journal of Data Mining & Knowledge Management Process, 2012

Mining topological relationship patterns involves three aspects. The first is the discovery of geometric relationships such as disjoint, cover, intersection and overlap between every pair of spatiotemporal objects. The second is tracking the change of such relationships over time in spatiotemporal databases. The third is mining the topological relationship patterns. Spatiotemporal databases deal with changes to spatial objects over time. Applications in this domain process spatial, temporal and attribute data elements to find the evolution of spatial objects and changes in their topological relationships over time. These advanced database applications require the storage, management and processing of complex spatiotemporal data. In this paper we discuss a model-view-controller based architecture of the system, the design of the spatiotemporal database, and a methodology for mining spatiotemporal topological relationship patterns. A prototype of the system is implemented on top of the open source object-relational spatial database management system PostgreSQL with PostGIS. The algorithms are tested on historical cadastral datasets created using OpenJump. The resulting topological relationship patterns are presented.
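The first two aspects, classifying the relationship between a pair of objects and tracking how it changes over time, can be sketched in a few lines. This is a deliberately simplified illustration that assumes axis-aligned rectangles rather than general geometries and does not use PostGIS; the `Box` type, the relation labels, and the sample parcels are all hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def relation(a: Box, b: Box) -> str:
    """Classify the topological relationship of box a with respect to box b."""
    if a == b:
        return "equal"
    # No shared point at all.
    if a.xmax < b.xmin or b.xmax < a.xmin or a.ymax < b.ymin or b.ymax < a.ymin:
        return "disjoint"
    if a.xmin <= b.xmin and a.ymin <= b.ymin and a.xmax >= b.xmax and a.ymax >= b.ymax:
        return "covers"
    if b.xmin <= a.xmin and b.ymin <= a.ymin and b.xmax >= a.xmax and b.ymax >= a.ymax:
        return "covered_by"
    # Interiors intersect, but neither box contains the other.
    if a.xmax > b.xmin and b.xmax > a.xmin and a.ymax > b.ymin and b.ymax > a.ymin:
        return "overlap"
    # Only boundary points are shared.
    return "touch"


def relation_history(fixed: Box, snapshots: list) -> list:
    """Relationship of a fixed box to each successive snapshot of a changing box."""
    return [relation(fixed, snap) for snap in snapshots]


# A fixed parcel and four successive states of a second parcel (hypothetical data).
parcel_a = Box(0, 0, 4, 4)
parcel_b_over_time = [
    Box(6, 0, 8, 2),
    Box(4, 0, 6, 2),
    Box(3, 1, 6, 3),
    Box(1, 1, 3, 3),
]
history = relation_history(parcel_a, parcel_b_over_time)
# history == ["disjoint", "touch", "overlap", "covers"]
```

A sequence such as `history` is the kind of input from which relationship patterns over time could then be mined.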

On the relationships between topological measures in real-world networks

Networks and Heterogeneous Media, 2008

Over the past several years, a number of measures have been introduced to characterize the topology of complex networks. We perform a statistical analysis of real data sets representing the topology of different real-world networks. First, we show that some measures are either fully related to other topological measures or significantly limited in the range of their possible values. Second, we observe that subsets of measures are highly correlated, indicating redundancy among them. Our study thus suggests that the set of commonly used measures is too extensive to concisely characterize the topology of complex networks. It also provides an important basis for the classification and unification of a definite set of measures that would serve in future topological studies of complex networks.