Context-Aware Visual Exploration of Molecular Datab
Related papers
Fast training of self organizing maps for the visual exploration of molecular compounds
2007
Visual exploration of scientific data in the life sciences is a growing research field due to the large amount of available data. Kohonen's self-organizing map (SOM) is a widely used tool for visualizing multidimensional data. In this paper we present a fast learning algorithm for SOMs that uses a simulated annealing method to adapt the learning parameters. The algorithm has been adopted in a data analysis framework for the generation of similarity maps.
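A minimal sketch of conventional SOM training may help make the abstract above concrete. It uses a plain exponential decay for the learning rate and neighbourhood radius, which is an assumption for illustration and not the simulated annealing schedule the paper describes. The function names `train_som` and `best_matching_unit` and all default parameter values are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid node whose weight vector is closest to x (squared Euclidean)."""
    return min(weights, key=lambda n: sum((wi - xi) ** 2 for wi, xi in zip(weights[n], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a rows x cols SOM on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    # One weight vector per grid node, initialised uniformly at random in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        # Assumed schedule: simple exponential decay, NOT the paper's annealing method.
        decay = math.exp(-epoch / epochs)
        lr, sigma = lr0 * decay, sigma0 * decay
        for x in rng.sample(data, len(data)):  # visit samples in random order
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Gaussian neighbourhood measured on the grid, centred on the BMU.
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights
```

After training, `best_matching_unit(weights, x)` gives the map cell for a descriptor vector `x`, which is the basis for placing compounds on a two-dimensional similarity map.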
Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods
Fingerprint similarity is a common method for comparing chemical structures. Similarity is an appealing approach because, with many fingerprint types, it provides intuitive results: a chemist looking at two molecules can understand why they have been determined to be similar. This transparency is partially lost with the fuzzier similarity methods that are often used for scaffold hopping and tends to vanish completely when molecular fingerprints are used as inputs to machine-learning (ML) models. Here we present similarity maps, a straightforward and general strategy to visualize the atomic contributions to the similarity between two molecules or the predicted probability of a ML model. We show the application of similarity maps to a set of dopamine D3 receptor ligands using atom-pair and circular fingerprints as well as two popular ML methods: random forests and naïve Bayes. An open-source implementation of the method is provided.
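To illustrate the idea of atomic contributions to a fingerprint similarity, the sketch below computes the Tanimoto similarity between two sets of "on" bits and then assigns each atom a weight equal to the drop in similarity when that atom's bits are removed. This is one plausible reading of the strategy, not the paper's open-source implementation; the atom-to-bit mapping `atom_bits` and the function names are hypothetical, and a real fingerprint would come from a cheminformatics toolkit.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def atom_weights(atom_bits, reference):
    """Leave-one-atom-out weights for a molecule compared against a reference.

    atom_bits maps each atom index to the set of bits that atom contributes to
    (a toy stand-in for a real fingerprint's atom-to-bit bookkeeping).
    """
    full = set().union(*atom_bits.values())
    base = tanimoto(full, reference)
    weights = {}
    for atom in atom_bits:
        # Bits that stay on are those still set by at least one other atom.
        remaining = set().union(*(b for a, b in atom_bits.items() if a != atom))
        # Positive weight: removing the atom lowers similarity, so it contributes to it.
        weights[atom] = base - tanimoto(remaining, reference)
    return weights


reference = {1, 2, 3, 4}
atom_bits = {0: {1, 2}, 1: {2, 3}, 2: {9}}
weights = atom_weights(atom_bits, reference)
```

In this toy case atoms 0 and 1 receive positive weights because they set bits shared with the reference, while atom 2 receives a negative weight because its only bit is absent from the reference.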
Multidimensional scaling and visualization of large molecular similarity tables
Journal of Computational Chemistry, 2001
Multidimensional scaling (MDS) is a collection of techniques that attempt to embed a set of patterns described by means of a dissimilarity matrix into a low-dimensional display plane in a way that preserves their original pairwise interrelationships as closely as possible. Unfortunately, current MDS algorithms are notoriously slow, and their use is limited to small data sets. In this article, we present a family of algorithms that combine nonlinear mapping techniques with neural networks, and make possible the scaling of very large data sets that are intractable with conventional methodologies. The method employs a nonlinear mapping algorithm to project a small random sample, and then "learns" the underlying transform using one or more multilayer perceptrons. The distinct advantage of this approach is that it captures the nonlinear mapping relationship in an explicit function, and allows the scaling of additional patterns as they become available, without the need to reconstruct the entire map. A novel encoding scheme is described, allowing this methodology to be used with a wide variety of input data representations and similarity functions. The potential of the algorithm is illustrated in the analysis of two combinatorial libraries and an ensemble of molecular conformations. The method is particularly useful for extracting low-dimensional Cartesian coordinate vectors from large binary spaces, such as those encountered in the analysis of large chemical data sets.
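As a rough illustration of why conventional MDS scales poorly, the following sketch embeds a dissimilarity matrix by plain gradient descent on the raw stress (the sum of squared differences between map distances and input dissimilarities). It is a generic textbook-style procedure, not the sampling-plus-neural-network method of the paper; the function names, step size, and iteration count are assumptions chosen for a small example.

```python
import math
import random


def stress(coords, dist):
    """Raw stress: squared mismatch between map distances and input dissimilarities."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - dist[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def mds(dist, dim=2, steps=500, lr=0.05, seed=0):
    """Embed n patterns in `dim` dimensions by gradient descent on the raw stress."""
    rng = random.Random(seed)
    n = len(dist)
    coords = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(n)]
        # Every step visits all ordered pairs, so the cost per step is O(n^2).
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j]) or 1e-12  # avoid division by zero
                coef = 2.0 * (d - dist[i][j]) / d
                for k in range(dim):
                    grads[i][k] += coef * (coords[i][k] - coords[j][k])
        for i in range(n):
            for k in range(dim):
                coords[i][k] -= lr * grads[i][k]
    return coords
```

Because each iteration touches every pair of patterns, the running time grows quadratically with the number of patterns, which is the bottleneck the abstract's sample-then-learn strategy is meant to avoid.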
QSAR & Combinatorial Science, 2003
The ChemSpaceShuttle toolbox provides a graphical interface allowing for ligand-based design of focused compound libraries by means of linear and non-linear projection techniques and clustering algorithms. The software implements a non-linear encoder network and non-linear partial least squares for the projection of high-dimensional descriptor vectors into a three dimensional space for visualisation. Compound clustering by a self-organising map (SOM) is incorporated. Visualization can facilitate the selection of compounds with desired properties from large compound libraries. Two sample applications are presented: similarity-based compound selection for focused library design, and classification of drugs and nondrugs. A version of ChemSpaceShuttle is freely available at
Assisted descriptor selection based on visual comparative data analysis
Computer Graphics …, 2011
Exploration and selection of data descriptors representing objects using a set of features are important components in many data analysis tasks. Usually, for a given dataset, an optimal data description does not exist, as the suitable data representation is strongly use case dependent. Many solutions for selecting a suitable data description have been proposed. In most instances, they require data labels and often are black box approaches. Non-expert users have difficulty comprehending how the input, parameters, and output of these algorithms relate. Alternative approaches, interactive systems for visual feature selection, overburden the user with an overwhelming set of options and data views. Therefore, it is essential to offer users guidance in this analytical process. In this paper, we present a novel system for data description selection, which facilitates the user's access to the data analysis process. As finding a suitable data description consists of several steps, we support the user with guidance. Our system combines automatic data analysis with interactive visualizations. In this way, the system provides a recommendation for suitable data descriptor selections. It supports the comparison of data descriptors with differing dimensionality for unlabeled data. We propose specialized scores and interactive views for descriptor comparison. The visualization techniques are scatterplot-based and grid-based. For the latter case, we apply Self-Organizing Maps as adaptive grids, which are well suited for large multi-dimensional data sets. As an example, we demonstrate the usability of our system on a real-world biochemical application.
Data Visualization with Simultaneous Feature Selection
2006 IEEE Symposium on Computational Intelligence and Bioinformatics and Computational Biology, 2006
Data visualization algorithms and feature selection techniques are both widely used in bioinformatics but as distinct analytical approaches. Until now there has been no method of deciding feature saliency while training a data visualization model. We derive a generative topographic mapping (GTM) based data visualization approach which estimates feature saliency simultaneously with the training of the visualization model. The approach not only provides a better projection by modeling irrelevant features with a separate noise model but also gives feature saliency values which help the user assess the significance of each feature. We compare the quality of the projection obtained using the new approach with the projections from traditional GTM and self-organizing maps (SOM) algorithms. The results obtained on a synthetic and a real-life chemoinformatics dataset demonstrate that the proposed approach successfully identifies feature significance and provides coherent (compact) projections.
High-resolution Self-Organizing Maps for advanced visualization and dimension reduction
Neural networks : the official journal of the International Neural Network Society, 2018
Kohonen's Self-Organizing feature Map (SOM) provides an effective way to project high-dimensional input features onto a low-dimensional display space while preserving the topological relationships among the input features. Recent advances in algorithms that take advantage of modern computing hardware introduced the concept of high-resolution SOMs (HRSOMs). This paper investigates the capabilities and applicability of the HRSOM as a visualization tool for cluster analysis and its suitability to serve as a pre-processor in ensemble learning models. The evaluation is conducted on a number of established benchmarks and real-world learning problems, namely, the policeman benchmark, two web spam detection problems, a network intrusion detection problem, and a malware detection problem. It is found that the visualization resulting from an HRSOM provides new insights concerning these learning problems. It is furthermore shown empirically that broad benefits from the use of HRSOMs in b...
An image-based approach to visual feature space analysis
2008
Methods for management and analysis of non-standard data often rely on the so-called feature vector approach. The technique describes complex data instances by vectors of characteristic numeric values, which make it possible to index the data and to calculate similarity scores between the data elements. Feature vectors are therefore often a key ingredient of intelligent data analysis algorithms, including instances of clustering, classification, and similarity search algorithms. However, identification of appropriate feature vectors for a given database of a given data type is a challenging task. Determining good feature vector extractors usually involves benchmarks relying on supervised information, which makes it an expensive and data-dependent process. In this paper, we address the feature selection problem by a novel approach based on analysis of certain feature space images. We develop two image-based analysis techniques for the automatic discrimination power analysis of feature spaces. We evaluate the techniques on a comprehensive feature selection benchmark, demonstrating the effectiveness of our analysis and its potential toward automatically addressing the feature selection problem.
Supervised Visualization for Data Exploration
2020
Dimensionality reduction is often used as an initial step in data exploration, either as preprocessing for classification or regression or for visualization. Most dimensionality reduction techniques to date are unsupervised; they do not take class labels into account (e.g., PCA, MDS, t-SNE, Isomap). Such methods require large amounts of data and are often sensitive to noise that may obfuscate important patterns in the data. Various attempts at supervised dimensionality reduction methods that take into account auxiliary annotations (e.g., class labels) have been successfully implemented with goals of increased classification accuracy or improved data visualization. Many of these supervised techniques incorporate labels in the loss function in the form of similarity or dissimilarity matrices, thereby creating over-emphasized separation between class clusters, which does not realistically represent the local and global relationships in the data. In addition, these approaches are often ...
Visual analysis of image collections
The Visual Computer International Journal of Computer Graphics, 2009
Multidimensional visualization techniques are invaluable tools for analysis of structured and unstructured data with variable dimensionality. This paper introduces PEx-Image (Projection Explorer for Images), a tool aimed at supporting analysis of image collections. The tool supports a methodology that employs interactive visualizations to aid user-driven feature detection and classification tasks, thus offering improved analysis and exploration capabilities. The visual mappings employ similarity-based multidimensional projections and point placement to lay out the data on a plane for visual exploration. In addition to its application to image databases, we also illustrate how the proposed approach can be successfully employed in simultaneous analysis of different data types, such as text and images, offering a common visual representation for data expressed in different modalities. Keywords: Visual data mining • Image analysis • Biomedical imaging and visualization