Associative database of protein sequences

A functional hierarchical organization of the protein sequence space

BMC Bioinformatics, 2004

Background: It is a major challenge of computational biology to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually-validated alignments of known protein families. Such methods can achieve high sensitivity, but are limited by the necessary manual labor. This makes our current view of the protein world incomplete and biased. This paper concerns ProtoNet, an automatic unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins, based solely on sequence similarity. Results: In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. Furthermore, a novel feature is an automatic procedure that reduces the tree to 12% of its original size. This procedure utilizes only parameters intrinsic to the clustering process. Despite the substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. Consequently, 78% of the clusters in the compressed tree (5,300 clusters) are assigned a biological function with high confidence. The clustering and compression processes are unsupervised and robust. Conclusions: We present an automatically generated, unbiased method that provides a hierarchical classification of all currently known proteins.
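
As a rough illustration of how a hierarchical tree can be built from pairwise sequence similarity alone, the Python sketch below merges clusters bottom-up. The average-linkage merging rule, the function name `build_tree`, and the toy similarity scores are assumptions made for this example; the abstract does not state which merging rule or similarity measure ProtoNet uses.

```python
from itertools import combinations

def build_tree(names, sim):
    """Bottom-up (agglomerative) clustering with average linkage.

    names: list of item labels (hypothetical protein identifiers).
    sim:   dict mapping frozenset({a, b}) to a similarity score,
           where a higher score means the two items are more alike.
    Returns the tree as nested 2-tuples of labels.
    """
    # Each cluster is (tuple of member labels, subtree built so far).
    clusters = [((name,), name) for name in names]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_a, members_b = clusters[i][0], clusters[j][0]
            # Average similarity over all cross-cluster pairs.
            total = sum(sim[frozenset((x, y))] for x in members_a for y in members_b)
            score = total / (len(members_a) * len(members_b))
            if best is None or score > best[0]:
                best = (score, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0], (clusters[i][1], clusters[j][1]))
        # Drop the two merged clusters and append their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][1]

# Toy input: two tight pairs that are only weakly similar to each other.
names = ["p1", "p2", "p3", "p4"]
sim = {frozenset(pair): 0.1 for pair in combinations(names, 2)}
sim[frozenset(("p1", "p2"))] = 0.9
sim[frozenset(("p3", "p4"))] = 0.8
tree = build_tree(names, sim)
```

This naive version recomputes every pairwise average on each pass, so it only suits small inputs. A system clustering on the order of a million proteins would need a far more economical scheme.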

MotifCluster: an interactive online tool for clustering and visualizing sequences using shared motifs

Genome Biology, 2008

MotifCluster finds related motifs in a set of sequences, and clusters the sequences into families using the motifs they contain. MotifCluster, at http://bmf.colorado.edu/motifcluster, lets users test whether proteins are related, cluster sequences by shared conserved motifs, and visualize motifs mapped onto trees, sequences and three-dimensional structures. We demonstrate MotifCluster's accuracy using gold-standard protein superfamilies; using recommended settings, families were assigned to the correct superfamilies with 0.17% false positive and no false negative assignments.

Biological Sequence Databases

Biological data available today surpasses information content in several fields. It is critical to logically organize and disseminate these contents to end users. In this chapter, we learn about biological databases that serve as the gateway for researchers.

Forest, a browser for huge DNA sequences

We present a new tool, FOREST, aiming at representing the content of a large nucleic acid sequence (e.g. >100KB) in a suitable form for the biologist. More precisely, FOREST builds all subsequences repeated in a sequence or a set of sequences. It not only locates the various occurrences of a given subsequence but also points to interesting subsequences with respect to a given criterion. This tool is based on two key ideas. The first idea is to build a suffix-tree representation of a sequence and to associate with each node of this tree a set of synthesized attributes, computed on the set of subsequences under this node. This allows biologists to "browse" the sequence with a constant abstract view of what they may expect to find in the section of the tree they are currently investigating. The second idea is to summarize the distribution of the information with boolean vectors associated with the sequence. These vectors may be easily displayed in the form of a linear map of events, as is done in genetic mapping. Both representations allow various efficient operations on the sequence. They provide a powerful filtering capacity over the data, while reducing the set of elementary filtering operations to a minimum of conceptual operations. This allows biologists to easily investigate the most prominent features of the lexical structure of their sequences.
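
The two ideas above can be sketched in Python under simplifying assumptions. This is not FOREST's implementation: it uses a depth-limited suffix trie as a stand-in for a true suffix tree, stores a single attribute per node (the list of occurrence start positions), and all names (`Node`, `build_suffix_trie`, `repeated_substrings`, `occurrence_vector`) are hypothetical.

```python
class Node:
    """One trie node; the path from the root spells a substring."""
    def __init__(self):
        self.children = {}
        self.starts = []  # start positions of every occurrence of that substring

def build_suffix_trie(seq, max_depth):
    """Insert the first max_depth characters of every suffix of seq."""
    root = Node()
    for start in range(len(seq)):
        node = root
        for ch in seq[start:start + max_depth]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root

def repeated_substrings(seq, min_len=2, max_depth=8):
    """Map each substring of length min_len..max_depth that occurs
    at least twice to the list of its start positions."""
    root = build_suffix_trie(seq, max_depth)
    found = {}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node.children.items():
            if len(child.starts) < 2:
                continue  # a substring seen once has no repeated extension
            word = prefix + ch
            if len(word) >= min_len:
                found[word] = child.starts
            stack.append((child, word))
    return found

def occurrence_vector(seq, starts, length):
    """Boolean vector over seq marking positions covered by the occurrences."""
    vec = [False] * len(seq)
    for s in starts:
        for i in range(s, s + length):
            vec[i] = True
    return vec

seq = "ACGTACGA"
repeats = repeated_substrings(seq, min_len=3)
mask = occurrence_vector(seq, repeats["ACG"], 3)
```

The depth limit keeps memory bounded at the cost of missing repeats longer than `max_depth`. A real suffix tree avoids that trade-off by compressing unbranched paths.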

SRS browser: a visual interface to the sequence retrieval system

2006

This paper presents a novel approach to the visual exploration and navigation of complex association networks of biological data sets, e.g., published papers, gene or protein information. The generic approach was implemented in the SRS Browser as an alternative visual interface to the widely used Sequence Retrieval System (SRS)[1]. SRS supports keyword-based search of about 400 biomedical databases.

Unsupervised genome-wide recognition of local relationship patterns

BMC Genomics, 2013

Background: Phenomena such as incomplete lineage sorting, horizontal gene transfer, gene duplication and subsequent sub- and neo-functionalisation can result in distinct local phylogenetic relationships that are discordant with species phylogeny. In order to assess the possible biological roles for these subdivisions, they must first be identified and characterised, preferably on a large scale and in an automated fashion. Results: We developed Saguaro, a combination of a Hidden Markov Model (HMM) and a Self Organising Map (SOM), to characterise local phylogenetic relationships among aligned sequences using cacti, matrices of pair-wise distance measures. While the HMM determines the genomic boundaries from aligned sequences, the SOM hypothesises new cacti in an unsupervised and iterative fashion based on the regions that were modelled least well by existing cacti. After testing the software on simulated data, we demonstrate the utility of Saguaro by testing two different data sets: (i) 181 Dengue virus strains, and (ii) 5 primate genomes. Saguaro identifies regions under lineage-specific constraint for the first set, and genomic segments that we attribute to incomplete lineage sorting in the second dataset. Intriguingly for the primate data, Saguaro also classified an additional ~3% of the genome as most incompatible with the expected species phylogeny. A substantial fraction of these regions was found to overlap genes associated with both the innate and adaptive immune systems. Conclusions: Saguaro detects distinct cacti describing local phylogenetic relationships without requiring any a priori hypotheses.
We have successfully demonstrated Saguaro's utility with two contrasting data sets, one containing many members with short sequences (Dengue viral strains: n = 181, genome size = 10,700 nt), and the other with few members but complex genomes (related primate species: n = 5, genome size = 3 Gb), suggesting that the software is applicable to a wide variety of experimental populations. Saguaro is written in C++, runs on the Linux operating system, and can be downloaded from

A Novel Representation of Genomic Sequences for taxonomic clustering and visualization by means of Self-Organizing Maps

2016

Motivation: Self-Organizing Maps (SOMs) are readily-available bioinformatics methods for clustering and visualizing high-dimensional data, provided that such biological information is previously transformed to fixed-size, metric-based vectors. To increase the usefulness of SOM-based approaches for the analysis of genomic sequence data, novel representation methods are required that automatically and bijectively transform aligned nucleotide sequences into numeric vectors, dealing with both nucleotide ambiguity and gaps derived from sequence alignment. Results: Six different codification variants based on Euclidean space, just like SOM processing, have been tested using two SOM models: the classical Kohonen’s SOM and Growing Cell Structures (GCS). They have been applied to two different sets of sequences: 32 sequences of small subunit ribosomal RNA from organisms belonging to the three