Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language (original) (raw)

An Enriched Information-Theoretic Definition of Semantic Similarity in a Taxonomy

IEEE Access, 2021

This paper addresses the notion of semantic similarity between concepts organized according to a taxonomy, based on the well-known information content approach. This approach has been widely experimented in the literature over the years and, in general, outperforms other proposals which do not originate from it. However, it shows some limitations related to the notion of generic sense of a concept. In this paper we illustrate the problem arising by using the traditional approach, and a novel information-theoretic definition of semantic similarity in a taxonomy is proposed which also takes into account the intended sense of a concept in a given context. This proposal has been applied to some among the most representative stateof-the-art similarity measures based on the information content approach, and the experiment shows that it achieves very high correlation values with human judgment.

An information theoretic approach to improve semantic similarity assessments across multiple ontologies

Semantic similarity has become, in recent years, the backbone of numerous knowledgebased applications dealing with textual data. From the different methods and paradigms proposed to assess semantic similarity, ontology-based measures and, more specifically, those based on quantifying the Information Content (IC) of concepts are the most widespread solutions due to their high accuracy. However, these measures were designed to exploit a single ontology. They thus cannot be leveraged in many contexts in which multiple knowledge bases are considered. In this paper, we propose a new approach to achieve accurate IC-based similarity assessments for concept pairs spread throughout several ontologies. Based on Information Theory, our method defines a strategy to accurately measure the degree of commonality between concepts belonging to different ontologies-this is the cornerstone for estimating their semantic similarity. Our approach therefore enables classic IC-based measures to be directly applied in a multiple ontology setting. An empirical evaluation, based on well-established benchmarks and ontologies related to the biomedical domain, illustrates the accuracy of our approach, and demonstrates that similarity estimations provided by our approach are significantly more correlated with human ratings of similarity than those obtained via related works. unambiguously retrieved from ontologies and similarities can be assessed from structured knowledge that has been explicitly formalised by human experts.

Ontology-driven web-based semantic similarity

Journal of Intelligent Information Systems, 2010

Estimation of the degree of semantic similarity/distance between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval or data mining. In the past, many similarity measures have been proposed, exploiting explicit knowledge-such as the structure of a taxonomy-or implicit knowledge-such as information distribution. In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, their excessive dependency of the taxonomical/ontological structure; in the second case, the lack of semantics of a pure statistical analysis of occurrences and/or the ambiguity of estimating concept statistical distribution from term appearances. Measures based on Information Content (IC) of taxonomical concepts combine both approaches. However, they heavily depend on a properly pre-tagged and disambiguated corpus according to the ontological entities in order to compute accurate concept appearance probabilities. This limits the applicability of those measures to other ontologies -like specific domain ontologies-and massive corpus -like the Web-. In this paper, several of the presented issues are analyzed. Modifications of classical similarity measures are also proposed. They are based on J Intell Inf Syst a contextualized and scalable version of IC computation in the Web by exploiting taxonomical knowledge. The goal is to avoid the measures' dependency on the corpus pre-processing to achieve reliable results and minimize language ambiguity. Our proposals are able to outperform classical approaches when using the Web for estimating concept probabilities.

Measuring semantic similarity in the taxonomy of WordNet

Proceedings of the Twenty-eighth …, 2005

This paper presents a new model to measure semantic similarity in the taxonomy of WordNet, using edgecounting techniques. We weigh up our model against a benchmark set by human similarity judgment, and achieve a much improved result compared with other methods: the correlation with average human judgment on a standard 28 word pair dataset is 0.921, which is better than anything reported in the literature and also significantly better than average individual human judgments. As this set has been effectively used for algorithm selection and tuning, we also cross-validate an independent 37 word pair test set (0.876) and present results for the full 65 word pair superset (0.897).

A New Similarity measure for taxonomy based on edge counting

International journal of Web & Semantic Technology, 2012

This paper introduces a new similarity measure based on edge counting in a taxonomy like WorldNet or Ontology. Measurement of similarity between text segments or concepts is very useful for many applications like information retrieval, ontology matching, text mining, and question answering and so on. Several measures have been developed for measuring similarity between two concepts: out of these we see that the measure given by Wu and Palmer [1] is simple, and gives good performance. Our measure is based on their measure but strengthens it. Wu and Palmer [1] measure has a disadvantage that it does not consider how far the concepts are semantically. In our measure we include the shortest path between the concepts and the depth of whole taxonomy together with the distances used in Wu and Palmer [1]. Also the measure has following disadvantage i.e. in some situations, the similarity of two elements of an IS-A ontology contained in the neighbourhood exceeds the similarity value of two elements contained in the same hierarchy. Our measure introduces a penalization factor for this case based upon shortest length between the concepts and depth of whole taxonomy.

Exploiting Taxonomical Knowledge to Compute Semantic Similarity: An Evaluation in the Biomedical Domain

Trends in Applied Intelligent Systems, 2010

Determining the semantic similarity between concept pairs is an important task in many language related problems. In the biomedical field, several approaches to assess the semantic similarity between concepts by exploiting the knowledge provided by a domain ontology have been proposed. In this paper, some of those approaches are studied, exploiting the taxonomical structure of a biomedical ontology (SNOMED-CT). Then, a new measure is presented based on computing the amount of overlapping and non-overlapping taxonomical knowledge between concept pairs. The performance of our proposal is compared against related ones using a set of standard benchmarks of manually ranked terms. The correlation between the results obtained by the computerized approaches and the manual ranking shows that our proposal clearly outperforms previous works.

Description and Evaluation of Semantic Similarity Measures Approaches

International Journal of Computer Applications, 2013

In recent years, semantic similarity measure has a great interest in Semantic Web and Natural Language Processing (NLP). Several similarity measures have been developed, being given the existence of a structured knowledge representation offered by ontologies and corpus which enable semantic interpretation of terms. Semantic similarity measures compute the similarity between concepts/terms included in knowledge sources in order to perform estimations. This paper discusses the existing semantic similarity methods based on structure, information content and feature approaches. Additionally, we present a critical evaluation of several categories of semantic similarity approaches based on two standard benchmarks. The aim of this paper is to give an efficient evaluation of all these measures which help researcher and practitioners to select the measure that best fit for their requirements.

An intrinsic information content-based semantic similarity measure considering the disjoint common subsumers of concepts of an ontology

Journal of the Association for Information Science and Technology, 2018

Finding similarity between concepts based on semantics has become a new trend in many applications (e.g., biomedical informatics, natural language processing). Measuring the Semantic Similarity (SS) with higher accuracy is a challenging task. In this context, the Information Content (IC)-based SS measure has gained popularity over the others. The notion of IC evolves from the science of information theory. Information theory has very high potential to characterize the semantics of concepts. Designing an IC-based SS framework comprises (i) an IC calculator, and (ii) an SS calculator. In this article, we propose a generic intrinsic IC-based SS calculator. We also introduce here a new structural aspect of an ontology called DCS (Disjoint Common Subsumers) that plays a significant role in deciding the similarity between two concepts. We evaluated our proposed similarity calculator with the existing intrinsic IC-based similarity calculators, as well as corpora-dependent similarity calculators using several benchmark data sets. The experimental results show that the proposed similarity calculator produces a high correlation with human evaluation over the existing state-of-the-art ICbased similarity calculators.

Algorithmic detection of semantic similarity

Proceedings of the …, 2005

Automatic extraction of semantic information from text and links in Web pages is key to improving the quality of search results. However, the assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web. Here we propose to leverage human-generated metadata -namely topical directories -to measure semantic relationships among massive numbers of pairs of Web pages or topics. The Open Directory Project classifies millions of URLs in a topical ontology, providing a rich source from which semantic relationships between Web pages can be derived. While semantic similarity measures based on taxonomies (trees) are well studied, the design of well-founded similarity measures for objects stored in the nodes of arbitrary ontologies (graphs) is an open problem. This paper defines an information-theoretic measure of semantic similarity that exploits both the hierarchical and non-hierarchical structure of an ontology. An experimental study shows that this measure improves significantly on the traditional taxonomy-based approach. This novel measure allows us to address the general question of how text and link analyses can be combined to derive measures of relevance that are in good agreement with semantic similarity. Surprisingly, the traditional use of text similarity turns out to be ineffective for relevance ranking.