Experimental study of a similarity measure for two dimensional sequences

2011

A variety of metrics have been introduced to measure the similarity of two given sequences. These widely used metrics serve applications ranging from spell correctors and categorizers to new sequence mining tasks. Different metrics consider different aspects of sequences, but the essence of any sequence lies in the ordering of its elements. In this paper, we propose a novel sequence similarity measure that is based on all ordered pairs of one sequence and where a Hasse diagram is built in the other sequence. In contrast with existing approaches, the idea behind the proposed metric is to extract all ordering features in order to capture sequence properties. We designed a clustering problem to evaluate our sequence similarity metric. Experimental results showed that the proposed metric achieves higher clustering purity than metrics such as d2, Smith-Waterman, Levenshtein, and Needleman-Wunsch. The limitations of those methods originate from sequence features that they neglect and that our proposed metric takes into account.

A general measure of similarity for categorical sequences

Knowledge and Information Systems, 2010

Measuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is extracting and making use of significant features hidden behind the chronological and structural dependencies found in these sequences. Almost all existing algorithms designed to perform this task are based on the matching of patterns in chronological order, but such sequences often have similar structural features in chronologically different order.

2008 Eighth IEEE International Conference on Data Mining, 2008

Measuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is to extract and make use of significant features hidden behind the chronological and structural dependencies found in these sequences. Almost all existing algorithms designed to perform this task are based on the matching of patterns in chronological order, but such sequences often have similar structural features in chronologically different positions.

International Journal of Data Warehousing and Mining, 2010

In many data mining applications, both classification and clustering algorithms require a distance/similarity measure. The central problem in similarity-based clustering/classification of sequential data is choosing an appropriate similarity metric. Existing metrics such as Euclidean, Jaccard, and Cosine do not explicitly exploit the sequential nature of the data. In this paper, the authors propose a similarity preserving function called Sequence and Set Similarity Measure (S3M) that captures both the order of occurrence of items in sequences and the constituent items of sequences. The authors demonstrate the usefulness of the proposed measure for classification and clustering tasks. Experiments were conducted on the benchmark datasets DARPA’98 and msnbc, for a classification task in intrusion detection and a clustering task in web mining, respectively. Results show the usefulness of the proposed measure.
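The abstract above does not state how S3M combines order and content. As a minimal sketch, assuming a weighted sum of an order-sensitive term (longest common subsequence length divided by the longer sequence length) and a set term (Jaccard similarity of the item sets), the idea could look as follows. The function names, the weight parameter `p`, and the handling of empty inputs are illustrative assumptions, not taken from the abstract.

```python
from typing import Hashable, Sequence


def lcs_length(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Length of the longest common subsequence, via standard dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def s3m_sketch(a: Sequence[Hashable], b: Sequence[Hashable], p: float = 0.5) -> float:
    """Assumed form: p * order similarity + (1 - p) * set similarity."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if not a and not b:
        return 1.0  # assumption: two empty sequences count as identical
    order_sim = lcs_length(a, b) / max(len(a), len(b))
    set_a, set_b = set(a), set(b)
    set_sim = len(set_a & set_b) / len(set_a | set_b)
    return p * order_sim + (1.0 - p) * set_sim
```

With `p = 1` only the ordering term counts, and with `p = 0` the sketch reduces to plain Jaccard similarity, so two sequences with the same items in reversed order score high on the set term but low on the order term.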

A Novel Method of Sequence Similarity Evaluation in N-dimensional Sequence Space

Current Bioinformatics, 2012

The aim of this work is to establish a universal method of searching for similarities between sequences in an n-dimensional sequence space. The presented idea extends the original Dot-Matrix and semihomology methods with the possibility of performing analyses in an n-dimensional sequence space, and it indicates a method of similarity evaluation. The main novelty of the implemented dotPicker program is that it allows searches for similarities in an n-dimensional sequence space. Sets of identity fragments, which represent given protein families, have been obtained using this program. A method for evaluating the obtained identity fragments is proposed and its use is presented. Moreover, the potential of the dotPicker program is shown, especially for analyzing and identifying previously unknown similarities in protein families.

2011

Well, you will have to excuse me, but it is getting late and they are waiting for me at the print shop. I have probably left something unsaid; do not hold it against me. Thank you all, and see you soon. It has been a pleasure! Darío. Consistency is the last refuge of the unimaginative.

Journal of Computational Methods in Sciences and Engineering, 2017

Using only numerical differences of descriptors of entries in databases and the subsequent generation of iterated difference sequences, a simple similarity characterization and efficient approximate similarity measures can be obtained for the database entries. For some families of functional relations among the entries, information on the dominant power dependence of the functions interrelating them can also be obtained, without the involvement of any fitting algorithm. Once the dominant power dependence has been determined by this simple approach, and if more precise functional relations are needed, this knowledge allows one to choose better trial functions fulfilling this constraint, instead of the "trial and error" approach more commonly used to determine unknown functions describing the relations among the database entries with respect to the given property.

Australasian Conference on Knowledge Discovery and Data Mining, 2008

In data mining, computing the similarity of objects is an essential task, for example to identify regularities or to build homogeneous clusters of objects. For sequential data, which arise in various fields of application (e.g. series of customer purchases, Internet navigation), the problem of comparing the similarity of sequences is very important. There are already some similarity mea-

International Journal of Data Science

Deoxyribonucleic acid (DNA) has an enormous capacity to carry very important information in the form of character strings. Sequence analysis is the process of applying a wide range of methods to DNA sequences in order to understand the structure, features, or evolution of these nucleotide strings. The analysis uses mathematical methods to convert the character strings to numerical values, and these numerical values are used to find similarity between the sequences. DNA sequences contain only four nucleotides (A, C, G, and T), so sequence comparison is essential for extracting information from them. In this paper, various methods to analyse DNA sequences, including the use of entropy, divergence, LZ complexity, and the role of hybridisation, are explored. A hybrid model based on the composition vector and distance methods is proposed to find dissimilarity between sequences, and this hybrid model is tested on sequences of species downloaded from the National Center for Biotechnology Information (NCBI).

LCSk: A refined similarity measure

Theoretical Computer Science, 2016

In this paper we define a new similarity measure, LCSk, which aims to find the maximal number of k-length substrings that match in both input strings while preserving their order of appearance. The traditional LCS is the special case k = 1. We examine this generalization in both theory and practice. We first describe its basic solution and give experimental evidence on real data of its ability to differentiate between sequences that are considered similar according to the LCS measure. We then examine extensions of the LCSk definition to LCS in at least k-length substrings (LCS ≥ k) and to 2-dimensional LCSk, and we also define the complementary EDk and ED ≥ k distances.
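The definition above can be illustrated with a small dynamic program. This is a minimal sketch of one straightforward way to compute LCSk, assuming the matched k-length substrings must not overlap in either string; it is not claimed to be the paper's own algorithm, and the function name `lcsk` is chosen here for illustration.

```python
def lcsk(a: str, b: str, k: int) -> int:
    """Maximal number of order-preserving, non-overlapping k-length substring matches."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n, m = len(a), len(b)
    # dp[i][j] holds the LCSk value for the prefixes a[:i] and b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Option 1: skip the last character of one of the prefixes.
            best = max(dp[i - 1][j], dp[i][j - 1])
            # Option 2: end both prefixes with a matching k-length substring.
            if i >= k and j >= k and a[i - k:i] == b[j - k:j]:
                best = max(best, dp[i - k][j - k] + 1)
            dp[i][j] = best
    return dp[n][m]
```

For k = 1 the recurrence is the classic LCS recurrence, which matches the abstract's statement that LCS is the special case k = 1. The substring comparison makes this sketch run in O(nmk) time, which is fine for short strings but not tuned for long inputs.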
