A unifying semantic distance model for determining the similarity of attribute values

A Unifying Semantic Distance Measure for Determining the Similarity of Attribute Values

2007

The relative difference between two data values is of interest in a number of application domains including temporal and spatial applications, schema versioning, data warehousing (particularly data preparation), internet searching, validation and error correction, and data mining. Moreover, consistency across systems in determining such distances and the robustness of such calculations are essential in some domains and useful in many. Despite this, there is no generally adopted approach to determining such ...

Computing the semantic distance between terms: An Ontology-based approach

A semantic measure determines how two terms or concepts relate to each other. Calculating the similarity between terms has become an important research area with many applications in fields such as artificial intelligence. The development of efficient measures for computing semantic similarity is fundamental to computational semantics. Semantic distance is a measure of the strength of the relationship between two concepts in an ontology. This paper presents a novel method, called NaoBig, that expresses the semantic distance between concepts of an ontology-based knowledge base as a numerical factor. The semantic distance between concepts is shown graphically as a directed graph. BigData RDF is also used as the search engine and to index triples.

Efficient Distance Computation Using SQL Queries and UDFs

2008 IEEE International Conference on Data Mining Workshops, 2008

Distance computation is one of the most computationally intensive operations employed by many data mining algorithms. Performing such matrix computations within a DBMS creates many optimization challenges. We propose techniques to efficiently compute Euclidean distance using SQL queries and User-Defined Functions (UDFs). We concentrate on efficient Euclidean distance computation for the well-known K-means clustering algorithm. We present SQL query optimizations and a scalar UDF to compute Euclidean distance. We experimentally evaluate performance and scalability of our proposed SQL queries and UDF with large data sets on a modern DBMS. We benchmark distance computation on two important data mining techniques: clustering and classification. In general, UDFs are faster than SQL queries because they are executed in main memory. Data set size is the main factor impacting performance, followed by data set dimensionality.
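As a rough illustration of the operation this abstract refers to, the sketch below computes Euclidean distance and the K-means assignment step in plain Python. It is not the paper's SQL or UDF code; the function names `euclidean_distance` and `nearest_centroid` and the sample data are assumptions made for this example.

```python
import math


def euclidean_distance(x, c):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(c):
        raise ValueError("vectors must have the same dimensionality")
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def nearest_centroid(x, centroids):
    """Index of the closest centroid (the K-means assignment step)."""
    distances = [euclidean_distance(x, c) for c in centroids]
    return min(range(len(centroids)), key=distances.__getitem__)


# Hypothetical two-dimensional data set and two centroids.
points = [(1.0, 2.0), (8.0, 9.0), (0.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
assignments = [nearest_centroid(p, centroids) for p in points]
```

The work per point grows with both the number of centroids and the number of dimensions, which is in line with the abstract's observation that data set size and dimensionality drive performance.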

SSDDM: Distance Metric for Graph-based Semi-structured Data

Data mining of semi-structured data (data with no exact schema) is an emerging field of interest. Data mining algorithms such as clustering and classification need a distance metric defined over the data records. The most natural representation for semi-structured data is a graph with labeled nodes and edges. A graph-edit distance measure can be used to compare the similarity of two data records, but semi-structured data can also contain several attached attributes and values.

SimDB: a similarity-aware database system

Proceedings of the 2010 …, 2010

The identification and processing of similarities in the data play a key role in multiple application scenarios. Several types of similarity-aware operations have been studied in the literature. However, in most of the previous work, similarity-aware operations are studied in isolation from other regular or similarity-aware operations. Furthermore, most of the previous research in the area considers a standalone implementation, i.e., without any integration with a database system. In this demonstration we present SimDB, a similarity-aware database management system. SimDB supports multiple similarity-aware operations as first-class database operators. We describe the architectural changes to implement the similarity-aware operators. In particular, we present the way conventional operators' implementation machinery is used to support similarity-aware operators. We also show how these operators interact with other similarity-aware and regular operators. In particular, we show the effectiveness of multiple equivalence rules that can be used to extend cost-based query optimization to the case of similarity-aware operations.

Web-based Dynamic Similarity Distance Tool

Journal of Telecommunication, Electronic and Computer Engineering, 2018

Similarity or distance measures are well-known methods commonly used to calculate the distance between two samples of a dataset. The distance between dataset samples is an important concept in multivariate analysis research. This paper proposes a tool that provides seven common distance methods that can be used in various research areas. The tool is a web-based application that can be accessed through an internet browser. Its objective is to offer a web-based similarity distance application for a range of analysis and research purposes. In addition, a ranking method based on Mean Average Precision is implemented in the tool to increase classification accuracy. The tool can process features that contain numerical values from any type of dataset.

Uniform techniques for deriving similarities of objects and subschemes in heterogeneous databases

IEEE Transactions on Knowledge and Data Engineering, 2003

The availability of automatic tools for inferring semantics of database schemes is useful for solving several database design problems, such as obtaining Cooperative Information Systems or Data Warehouses from large sets of data sources. In this context, a main problem is to single out similarities or dissimilarities among scheme objects (interscheme properties). This paper presents graph-based techniques for a uniform derivation of interscheme properties including synonymies, homonymies, type conflicts, and subscheme similarities. These techniques are characterized by a common core: the computation of maximum weight matchings on some bipartite weighted graphs derived using a suitable metric to measure semantic closeness of objects. The techniques have been implemented in a system prototype. Several experiments conducted with it, and (in part) reported in the paper, confirmed the effectiveness of our approach.
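The common core named in this abstract, maximum weight matching on a bipartite weighted graph, can be sketched with a brute-force search that is only practical for very small graphs. This is an illustration, not the paper's algorithm: the function name `max_weight_matching`, the integer closeness scores, and the assumptions that weights are non-negative and that the left side is no larger than the right side are all made up for this example.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Brute-force maximum weight matching on a small bipartite graph.

    weights[i][j] is the (assumed non-negative) closeness score between
    left node i and right node j. Every left node is matched to a
    distinct right node, so len(left) must not exceed len(right).
    """
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0
    if n_left > n_right:
        raise ValueError("left side must not be larger than right side")

    best_total, best_pairs = float("-inf"), []
    # Try every way of assigning distinct right nodes to the left nodes.
    for perm in permutations(range(n_right), n_left):
        total = sum(weights[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(perm))
    return best_total, best_pairs


# Hypothetical closeness scores between two objects of one scheme (rows)
# and three objects of another scheme (columns).
scores = [
    [9, 1, 3],
    [2, 8, 4],
]
total, pairs = max_weight_matching(scores)
```

Here the matched pairs would be read as candidate correspondences between objects of the two schemes. A polynomial-time method such as the Hungarian algorithm would be the usual choice beyond toy sizes.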

Similarity queries: their conceptual evaluation, transformations, and processing

The VLDB Journal, 2012

Many application scenarios can significantly benefit from the identification and processing of similarities in the data. Even though some work has been done to extend the semantics of some operators, for example join and selection, to be aware of data similarities, there has not been much study on the role and implementation of similarity-aware operations as first-class database operators. Furthermore, very little work has addressed the problem of evaluating and optimizing queries that combine several similarity operations. The focus of this paper is the study of similarity queries that contain one or multiple first-class similarity database operators such as Similarity Selection, Similarity Join, and Similarity Group-by. Particularly, we analyze the implementation techniques of several similarity operators, introduce a consistent and comprehensive conceptual evaluation model for similarity queries, and present a rich set of ...
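To make the idea of similarity operators concrete, the sketch below shows one common reading of two of the operators named above: a range-based selection and join over numeric values with a distance threshold. The function names, the `eps` parameter, and the use of absolute difference as the distance are assumptions for illustration; the paper's own operator semantics may differ.

```python
def similarity_selection(values, query, eps):
    """Values whose distance to `query` is at most `eps`."""
    return [v for v in values if abs(v - query) <= eps]


def similarity_join(left, right, eps):
    """Pairs (l, r) whose distance is at most `eps` (nested-loop join)."""
    return [(l, r) for l in left for r in right if abs(l - r) <= eps]


# Hypothetical numeric attribute values.
selected = similarity_selection([1, 5, 9, 11], query=10, eps=1)
joined = similarity_join([1, 5], [2, 7, 4], eps=1)
```

A query that combines both operators, for example selecting near a value and then joining the result, is the kind of case in which equivalence rules and a conceptual evaluation order matter.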