Applying Fuzzy Technologies to Equivalence Learning in Protein Classification
Related papers
Equivalence Learning in Protein Classification
Lecture Notes in Computer Science, 2007
We present a method, called equivalence learning, which applies a two-class classification approach to object pairs defined within a multi-class scenario. The underlying idea is that instead of classifying objects into their respective classes, we classify object pairs either as equivalent (belonging to the same class) or non-equivalent (belonging to different classes). The method is based on a vectorisation of the similarity between the objects and the application of a machine learning algorithm (SVM, ANN, LogReg, Random Forests) to learn the differences between equivalent and non-equivalent object pairs. We also define a unique kernel function that can be obtained via equivalence learning. Using a small dataset of archaeal, bacterial and eukaryotic 3-phosphoglycerate-kinase sequences, we found that the classification performance of equivalence learning slightly exceeds that of several simple machine learning algorithms, at the price of a minimal increase in time and space requirements.
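The pair construction described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the component-wise product used to vectorise a pair, the helper names pair_vector and make_pairs, and the toy feature vectors are all assumptions. The resulting two-class dataset could then be passed to any of the learners the abstract lists.

```python
from itertools import combinations


def pair_vector(x, y):
    """Vectorise the similarity of two objects (assumed here: component-wise product)."""
    return [a * b for a, b in zip(x, y)]


def make_pairs(features, labels):
    """Turn a multi-class dataset into a two-class dataset of object pairs.

    The target is 1 for an equivalent pair (same class) and 0 for a
    non-equivalent pair (different classes).
    """
    vectors, targets = [], []
    for i, j in combinations(range(len(features)), 2):
        vectors.append(pair_vector(features[i], features[j]))
        targets.append(1 if labels[i] == labels[j] else 0)
    return vectors, targets


# Hypothetical toy data: four objects with 2-dimensional feature vectors.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
labels = ["archaea", "archaea", "bacteria", "bacteria"]
pair_vectors, pair_targets = make_pairs(features, labels)
```

Note that n objects yield n(n-1)/2 pairs, so the pair dataset grows quadratically with the number of objects.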
Fuzzy Classification of Genome Sequences Prior to Assembly Based on Similarity Measures
NAFIPS 2007 - 2007 Annual Meeting of the North American Fuzzy Information Processing Society, 2007
Nucleotide sequencing of genomic data is an important step towards building an understanding of gene expression. Current sequencing technology limits the number of base pairs that can be processed to only several hundred at a time. Consequently, these sequenced substrings need to be assembled into the overall genome. However, the existence of insertions, deletions and substitutions can complicate the assembly of subsequences and confuse existing methods. What has been needed is an approach that deals with ambiguity in trying to match and assemble a genome from its sequenced subsequences. This research develops fuzzy similarity measures between subsequences that are then incorporated into an assembler based on fuzzy logic. The research also addresses the problem of the extensive computation required to cluster data into meaningful groups. Preliminary evaluation of this approach in conjunction with K-Means clustering suggests that it is at least as good as standard approaches and in some cases better.
Alignment-free similarity analysis for protein sequences based on fuzzy integral
Scientific Reports, 2019
Sequence comparison is an essential part of modern molecular biology research. In this study, we estimated the parameters of a Markov chain by considering the frequencies of occurrence of all possible amino acid pairs in each unaligned protein sequence. These estimated Markov chain parameters were used to calculate the similarity between two protein sequences based on a fuzzy integral algorithm. For validation, our results were compared with both alignment-based (ClustalW) and alignment-free methods on six benchmark datasets. The results indicate that our algorithm has better clustering performance for protein sequence comparison. With the advent of advanced sequencing techniques, researchers are generating a large number of protein sequences. This brings a new challenge [1,2] for phylogenetic and comparative study of these protein sequences. Phylogenetic study and comparative analysis between taxa are an essential part of molecular biology and bioinformatics. These studies traditionally depended on multiple or pairwise sequence alignments, which are the well-established classical approach and are regarded as a standard method for sequence analysis. However, producing reliable multiple sequence alignments becomes extremely difficult when more dissimilar protein sequences are considered. The traditional alignment-based methods [3-5] rely on largely empirical choices when selecting and creating a sequence alignment score matrix, and variation in this matrix may affect the alignment results. Various alignment-free tools [6-13] have been developed over the past two decades to overcome the alignment complexity for phylogenetic analysis. An alignment-free approach consists of two steps for comparing protein sequences. In the first step, the protein sequences are converted into fixed-length feature vectors. Feature extraction is a series of processes for extracting the required information from the query sequences, and it is critical for the accuracy of an alignment-free method.
In the second step, these extracted feature vectors are used as input to a vector similarity comparison algorithm to perform downstream analysis such as phylogenetic analysis. Methods based on graphical representation, distance frequency matrices, numerical characterization, K-string dictionaries, etc., have been introduced to overcome the complications of sequence alignment. Graphical representation [14,15] of protein sequences provides a simple way of viewing, sorting and comparing various sequences. It also provides mathematical descriptors that help in identifying differences among similar protein sequences quantitatively. The distance frequency of amino acid pairs suggests a new numerical characterization of a protein sequence, which converts the sequence into a distance frequency matrix [16]. A numerical characterization extracted directly from a protein sequence captures the essence of the amino acid composition and its distribution along the sequence in quantitative terms. In this approach, each sequence is mapped into a vector or matrix based on the numerical characterization extracted from the protein sequence. Subsequently, a similarity score is calculated between the corresponding vectors or matrices using a distance measure such as Euclidean distance, cosine distance or Manhattan distance. The K-string dictionary approach [17] permits users to use a much lower-dimensional frequency or probability vector to represent a protein sequence. It also significantly reduces the space required for implementation. Furthermore, after the lower-dimensional frequency vectors are obtained, Singular Value Decomposition (SVD) is used to get a better protein vector representation, which helps the user to obtain a precise phylogenetic tree. However, the above-mentioned methods still lag behind in terms of accuracy. Thus, more discriminatory features still need to be developed.
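The two-step alignment-free pipeline described above (fixed-length feature vector, then a vector distance) can be sketched as follows. This is only an illustration under stated assumptions: the feature used here is the relative frequency of adjacent amino acid pairs, and the function names are hypothetical. It does not implement the Markov chain parameter estimation or the fuzzy integral similarity proposed in the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def pair_frequency_vector(seq):
    """Step 1: map a sequence to a fixed-length (400-dimensional) vector of
    relative frequencies of adjacent amino acid pairs."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values()) or 1  # avoid division by zero for short input
    return [counts[a + b] / total for a in AMINO_ACIDS for b in AMINO_ACIDS]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """Step 2 (one option): 1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0
```

A pairwise distance matrix built from any of these measures could then feed a standard tree-building or clustering routine.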
In addition to limited accuracy, these methods have another drawback: computational complexity. Motivated by the aforementioned work, in this study we propose to use a fuzzy integral algorithm [18,19] for the analysis of protein sequences based on a Markov chain [20]. Fuzzy integral
Generating fuzzy rules for protein classification
2008
ABSTRACT. This paper considers the generation of interpretable fuzzy rules for assigning an amino acid sequence to the appropriate protein superfamily. Since the main objective of this classifier is the interpretability of its rules, we have used the distribution of amino acids in the protein sequences as features. These features are the occurrence probabilities of six exchange groups in the sequences. To generate the fuzzy rules, we have used modified versions of a common approach. The generated rules are simple and understandable, especially for biologists. To evaluate our fuzzy classifiers, we have used four protein superfamilies from the UniProt database. Experimental results show the comprehensibility of the generated fuzzy rules, with comparable classification accuracy.
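The exchange-group features mentioned in this abstract can be sketched as follows. The abstract does not list the six groups, so the grouping below is an assumption taken from the grouping commonly used in the protein classification literature. The triangular membership function is likewise a generic, hypothetical example of how such a feature might enter a fuzzy rule, not the rule generation method of the paper.

```python
# Assumed grouping (not stated in the abstract): six amino acid exchange groups.
EXCHANGE_GROUPS = {
    "e1": "HRK",
    "e2": "DENQ",
    "e3": "C",
    "e4": "STPAG",
    "e5": "MILV",
    "e6": "FYW",
}


def exchange_group_probabilities(seq):
    """Occurrence probability of each exchange group in a protein sequence."""
    total = len(seq)
    if total == 0:
        return {name: 0.0 for name in EXCHANGE_GROUPS}
    return {
        name: sum(seq.count(aa) for aa in members) / total
        for name, members in EXCHANGE_GROUPS.items()
    }


def triangular(x, left, peak, right):
    """Generic triangular fuzzy membership function on [left, right]."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


features = exchange_group_probabilities("HHCC")
# Degree to which the e1 probability is "medium" (hypothetical fuzzy set).
medium_e1 = triangular(features["e1"], 0.0, 0.5, 1.0)
```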
2003
Abstract: Traditionally, two protein sequences are classified into the same class if their feature patterns have high homology. These feature patterns were originally extracted by sequence alignment algorithms, which measure similarity between an unseen protein sequence and identified protein sequences. Neural network approaches, while reasonably accurate at classification, give no information about the relationship between the unseen case and the classified items that would be useful to biologists.
Hybrid DNA Sequence Similarity Scheme for Training Support Vector Machines
2014
Similarity between two DNA sequences is based on alignment. There are different approaches to alignment, each capturing different information about the DNA sequences. This paper presents a study of similarity kernels based on different similarity schemes and proposes a hybrid one. A similarity kernel is required in order to represent the distance or similarity between two DNA sequences. The variety of alignment schemes, and the cost of computing them, makes it even more difficult to decide which scheme to use. In this study we combine different similarity schemes, each derived from alignment. We demonstrate that combining different similarity schemes does in fact generalize well in machine learning. The scoring scheme also turned out to have an impact on generalization.
Genome data classification based on fuzzy matching
CSI Transactions on ICT, 2012
Genomic data mining and knowledge extraction is an important problem in bioinformatics. Some research has been done on unknown genome identification based on exact pattern matching of n-grams. In most real-world biological problems, exact matching may not give the desired results, and the problem with using n-grams is exponential explosion. In this paper we propose a method for genome data classification based on approximate matching. The algorithm works by selecting random samples from the genome database. Tolerance is allowed by generating candidates of varied length to query from these sample sequences. The Levenshtein distance is then computed for each candidate to check whether the strings are k-fuzzily equal. The total number of fuzzy matches for each sequence is then calculated. The sequences are then classified using data mining techniques, namely naive Bayes, support vector machine, back propagation and nearest neighbor. Experimental results are provided for different tolerance levels, and they show that accuracy increases as tolerance does. We also show the effect of sampling size on classification accuracy: accuracy increases with sampling size. Genome data of two species, yeast and E. coli, are used to verify the proposed method.
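The fuzzy matching step described in this abstract can be sketched as follows. This is a simplified illustration: it treats two strings as k-fuzzily equal when their Levenshtein distance is at most k, and it only compares a candidate against windows of the same length, whereas the abstract mentions candidates of varied length. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def k_fuzzy_equal(a, b, k):
    """Assumed definition: two strings are k-fuzzily equal if they differ by at most k edits."""
    return levenshtein(a, b) <= k


def count_fuzzy_matches(sequence, candidate, k):
    """Count windows of `sequence` (same length as `candidate`) that fuzzily match it."""
    n = len(candidate)
    return sum(
        k_fuzzy_equal(sequence[i:i + n], candidate, k)
        for i in range(len(sequence) - n + 1)
    )
```

The per-sequence match counts for a set of candidates would form the feature vector handed to the classifiers the abstract names.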
BMC Bioinformatics, 2005
The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30,000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12,000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes. Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence simi...
Lecture Notes in Computer Science, 2009
We evaluated methods of protein classification that use kernels built from BLAST output parameters. Protein sequences were represented as vectors of parameters (e.g. similarity scores) determined with respect to a reference set, and used in Support Vector Machines (SVM) as well as in simple nearest neighbor (1NN) classification. We found, using ROC analysis, that aggregate representations, which use aggregate similarities with respect to a few object classes, were as accurate as the full vectorial representations, and that a jury of 6 1NN-based aggregate classifiers performed as well as the best SVM classifiers while requiring much less computational time.
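The difference between a full vectorial representation and an aggregate one can be sketched as follows. Everything here is an assumption made for illustration: toy_similarity is a hypothetical stand-in for a BLAST-derived score (it counts shared 3-mers), the aggregate is taken as the maximum similarity per class, and the reference sequences are invented.

```python
def toy_similarity(a, b):
    """Hypothetical stand-in for a BLAST score: number of shared 3-mers."""
    def kmers(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}
    return len(kmers(a) & kmers(b))


def full_vector(seq, reference):
    """One component per reference sequence."""
    return [toy_similarity(seq, ref) for ref, _ in reference]


def aggregate_vector(seq, reference, classes):
    """One component per class: best similarity to any reference member of that class."""
    return [
        max(toy_similarity(seq, ref) for ref, cls in reference if cls == c)
        for c in classes
    ]


def nearest_neighbour(query, train_vectors, train_labels):
    """1NN classification with squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(query, v)) for v in train_vectors]
    return train_labels[dists.index(min(dists))]


# Invented reference set: (sequence, class label).
reference = [("AAAAAA", "x"), ("AAAAAC", "x"), ("GGGGGG", "y"), ("GGGGGT", "y")]
classes = ["x", "y"]
train_vectors = [aggregate_vector(seq, reference, classes) for seq, _ in reference]
train_labels = [cls for _, cls in reference]
query_vector = aggregate_vector("AAAACC", reference, classes)
predicted = nearest_neighbour(query_vector, train_vectors, train_labels)
```

The aggregate vector has one component per class rather than one per reference sequence, which is where the reduction in computational cost reported in the abstract would come from.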
Complementary classification approaches for protein sequences
"Protein Engineering, Design and Selection", 1996
We have studied five methods of protein classification and have applied them to the 768 groups of related proteins in the PROSITE catalog. Four of these methods are based on searching a database of blocks, and the other uses the frequently occurring motifs found in the protein families combined with a fingerprint technique. Our experimental results show that the block-based methods perform well when taking into account the probability of amino acids occurring in a block. Furthermore, the five methods give information that is complementary to each other. Thus, using the five methods together, one can obtain high confidence classifications (if the results agree) or suggest alternative hypotheses (if the results disagree). We also list those proteins whose current families documented in the PROSITE catalog differ from those suggested by our results. There are remarkably few of them, which is a testimony to the quality of PROSITE.