A New Model to Compute the Information Content of Concepts from Taxonomic Knowledge
Related papers
Semantic similarity has become, in recent years, the backbone of numerous knowledge-based applications dealing with textual data. From the different methods and paradigms proposed to assess semantic similarity, ontology-based measures and, more specifically, those based on quantifying the Information Content (IC) of concepts are the most widespread solutions due to their high accuracy. However, these measures were designed to exploit a single ontology. They thus cannot be leveraged in many contexts in which multiple knowledge bases are considered. In this paper, we propose a new approach to achieve accurate IC-based similarity assessments for concept pairs spread throughout several ontologies. Based on Information Theory, our method defines a strategy to accurately measure the degree of commonality between concepts belonging to different ontologies, which is the cornerstone for estimating their semantic similarity. Our approach therefore enables classic IC-based measures to be directly applied in a multiple ontology setting. An empirical evaluation, based on well-established benchmarks and ontologies related to the biomedical domain, illustrates the accuracy of our approach, and demonstrates that similarity estimations provided by our approach are significantly more correlated with human ratings of similarity than those obtained via related works. unambiguously retrieved from ontologies and similarities can be assessed from structured knowledge that has been explicitly formalised by human experts.
Ontology-based information content computation
Knowledge-Based Systems, 2011
The information content (IC) of a concept provides an estimation of its degree of generality/concreteness, a dimension which enables a better understanding of a concept's semantics. As a result, IC has been successfully applied to the automatic assessment of the semantic similarity between concepts. In the past, IC has been estimated as the probability of appearance of concepts in corpora. However, the applicability and scalability of this method are hampered due to corpora dependency and data sparseness. More recently, some authors proposed IC-based measures using taxonomical features extracted from an ontology for a particular concept, obtaining promising results. In this paper, we analyse these ontology-based approaches for IC computation and propose several improvements aimed at better capturing the semantic evidence modelled in the ontology for the particular concept. Our approach has been evaluated and compared with related works (both corpus-based and ontology-based ones) when applied to the task of semantic similarity estimation. Results obtained for a widely used benchmark show that our method enables similarity estimations which are better correlated with human judgements than related works.
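To make the contrast between the two families of IC estimators concrete, the following is a minimal Python sketch, not the method proposed in the paper above. It assumes a hypothetical toy taxonomy (`PARENT`) and hypothetical occurrence counts. `corpus_ic` uses the classical definition IC(c) = -log p(c), and `intrinsic_ic` uses one well-known earlier ontology-only formulation based on hyponym counts (the formula usually attributed to Seco et al.), chosen here purely for illustration.

```python
import math

# Hypothetical toy is-a taxonomy: child -> parent.
PARENT = {
    "mammal": "animal",
    "bird": "animal",
    "dog": "mammal",
    "cat": "mammal",
}

def descendants(concept):
    """Return all concepts strictly below `concept` in the toy taxonomy."""
    found = set()
    frontier = [concept]
    while frontier:
        node = frontier.pop()
        for child, parent in PARENT.items():
            if parent == node and child not in found:
                found.add(child)
                frontier.append(child)
    return found

def corpus_ic(concept, counts):
    """Corpus-based IC: -log p(c), where an occurrence of any descendant
    also counts as an occurrence of `concept`.
    Raises ValueError if the concept and its descendants never occur."""
    total = sum(counts.values())
    freq = counts.get(concept, 0) + sum(counts.get(d, 0) for d in descendants(concept))
    return -math.log(freq / total)

def intrinsic_ic(concept):
    """Ontology-only IC from hyponym counts:
    1 - log(hypo(c) + 1) / log(number of concepts in the taxonomy)."""
    n_concepts = len(set(PARENT) | set(PARENT.values()))
    return 1.0 - math.log(len(descendants(concept)) + 1) / math.log(n_concepts)
```

In this sketch the root gets an intrinsic IC of 0 and every leaf gets 1, with no corpus needed. The corpus-based variant, by contrast, fails for concepts that never appear in the counts, which illustrates the data sparseness problem the abstract mentions.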
Trends in Applied Intelligent Systems, 2010
Determining the semantic similarity between concept pairs is an important task in many language-related problems. In the biomedical field, several approaches to assess the semantic similarity between concepts by exploiting the knowledge provided by a domain ontology have been proposed. In this paper, some of those approaches are studied, exploiting the taxonomical structure of a biomedical ontology (SNOMED-CT). Then, a new measure is presented based on computing the amount of overlapping and non-overlapping taxonomical knowledge between concept pairs. The performance of our proposal is compared against related ones using a set of standard benchmarks of manually ranked terms. The correlation between the results obtained by the computerized approaches and the manual ranking shows that our proposal clearly outperforms previous works.
An Enriched Information-Theoretic Definition of Semantic Similarity in a Taxonomy
IEEE Access, 2021
This paper addresses the notion of semantic similarity between concepts organized according to a taxonomy, based on the well-known information content approach. This approach has been widely tested in the literature over the years and, in general, outperforms other proposals which do not originate from it. However, it shows some limitations related to the notion of generic sense of a concept. In this paper we illustrate the problem that arises when using the traditional approach, and a novel information-theoretic definition of semantic similarity in a taxonomy is proposed which also takes into account the intended sense of a concept in a given context. This proposal has been applied to some of the most representative state-of-the-art similarity measures based on the information content approach, and the experiment shows that it achieves very high correlation values with human judgment.
Journal of the Association for Information Science and Technology, 2018
Finding similarity between concepts based on semantics has become a new trend in many applications (e.g., biomedical informatics, natural language processing). Measuring the Semantic Similarity (SS) with higher accuracy is a challenging task. In this context, the Information Content (IC)-based SS measure has gained popularity over the others. The notion of IC evolves from the science of information theory. Information theory has very high potential to characterize the semantics of concepts. Designing an IC-based SS framework comprises (i) an IC calculator, and (ii) an SS calculator. In this article, we propose a generic intrinsic IC-based SS calculator. We also introduce here a new structural aspect of an ontology called DCS (Disjoint Common Subsumers) that plays a significant role in deciding the similarity between two concepts. We evaluated our proposed similarity calculator against the existing intrinsic IC-based similarity calculators, as well as corpora-dependent similarity calculators, using several benchmark data sets. The experimental results show that the proposed similarity calculator correlates more highly with human evaluation than the existing state-of-the-art IC-based similarity calculators.
Arxiv preprint arXiv:1105.5444, 2011
This article presents a measure of semantic similarity in an is-a taxonomy based on the notion of shared information content. Experimental evaluation against a benchmark set of human similarity judgments demonstrates that the measure performs better than the traditional edge-counting approach. The article presents algorithms that take advantage of taxonomic similarity in resolving syntactic and semantic ambiguity, along with experimental results demonstrating their effectiveness.
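The idea of shared information content is commonly formulated as the highest IC among the concepts that subsume both inputs, with IC(c) = -log p(c). The Python sketch below illustrates that formulation only. The toy taxonomy (`PARENT`) and the concept probabilities (`PROB`) are hypothetical values invented for the example, and the function names are not taken from the article.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and concept probabilities.
# Each probability already includes the mass of the concept's descendants,
# so it grows monotonically towards the root.
PARENT = {"mammal": "animal", "bird": "animal", "dog": "mammal", "cat": "mammal"}
PROB = {"animal": 1.0, "mammal": 0.625, "bird": 0.25, "dog": 0.375, "cat": 0.125}

def subsumers(concept):
    """Return the concept itself plus all of its ancestors."""
    result = {concept}
    while concept in PARENT:
        concept = PARENT[concept]
        result.add(concept)
    return result

def shared_ic_similarity(c1, c2):
    """Highest IC (-log p) among the concepts that subsume both c1 and c2."""
    common = subsumers(c1) & subsumers(c2)
    return max(-math.log(PROB[c]) for c in common)
```

Here "dog" and "cat" share the subsumer "mammal", so their similarity is the IC of "mammal". "dog" and "bird" share only the root, whose probability is 1, so their similarity is zero. Unlike edge counting, the result does not depend on how many links separate the two concepts.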
Ontology-driven web-based semantic similarity
Journal of Intelligent Information Systems, 2010
Estimation of the degree of semantic similarity/distance between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval or data mining. In the past, many similarity measures have been proposed, exploiting explicit knowledge (such as the structure of a taxonomy) or implicit knowledge (such as information distribution). In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, their excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a pure statistical analysis of occurrences and/or the ambiguity of estimating concept statistical distribution from term appearances. Measures based on Information Content (IC) of taxonomical concepts combine both approaches. However, they heavily depend on a properly pre-tagged and disambiguated corpus according to the ontological entities in order to compute accurate concept appearance probabilities. This limits the applicability of those measures to other ontologies (like specific domain ontologies) and massive corpora (like the Web). In this paper, several of the presented issues are analyzed. Modifications of classical similarity measures are also proposed. They are based on a contextualized and scalable version of IC computation in the Web by exploiting taxonomical knowledge. The goal is to avoid the measures' dependency on the corpus pre-processing to achieve reliable results and minimize language ambiguity. Our proposals are able to outperform classical approaches when using the Web for estimating concept probabilities.
A Combination-based Semantic Similarity Measure using Multiple Information Sources
2006 IEEE International Conference on Information Reuse & Integration, 2006
Semantic similarity techniques aim to determine how similar two concepts, or terms, are according to a given ontology. This paper proposes a method for measuring semantic similarity/distance between terms. The measure combines strengths and complements weaknesses of existing measures that use an ontology as the primary source. The proposed measure uses a new feature of common specificity (CSpec) besides the path length feature. The CSpec feature is derived from (1) information content of concepts, and (2) information content of the ontology given a corpus. We evaluated the proposed measure with a benchmark test set of term pairs scored for similarity by human experts. The experimental results demonstrated that our similarity measure is effective and outperforms the existing measures. The proposed semantic similarity measure gives the best correlation (0.874) with human scores in the benchmark test set compared to the existing measures.
Unifying ontological similarity measures: A theoretical and empirical investigation
International Journal of Approximate Reasoning, 2013
This paper theoretically and empirically investigates ontological similarity. Tversky's parameterized ratio model of similarity [3] is shown as a unifying basis for many of the well-known ontological similarity measures. A new family of ontological similarity measures is proposed that allows parameterizing the characteristic set used to represent an ontological concept. The three subontologies of the prominent Gene Ontology (GO) are used in an empirical investigation of several ontological similarity measures. Another study using well-known semantic similarity within two different anatomy ontologies, the NCIT anatomy and the mouse anatomy, is also presented for comparison to several of the GO results. A discussion of the correlation among the measures is presented as well as a comparison of the effects of two different methods of determining a concept's information content, corpus-based and ontology-based.
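As a rough illustration of why a ratio model can unify several measures, the Python sketch below implements the standard form of Tversky's ratio model over plain sets, S(A, B) = |A ∩ B| / (|A ∩ B| + α|A − B| + β|B − A|). Representing each concept by the set of its subsumers is an assumption made for this example only; it is not the family of characteristic sets the paper proposes, and the function name and sample sets are hypothetical.

```python
def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky's ratio model over two feature sets, using set cardinality
    as the salience function. Returns 0.0 when both sets are empty."""
    common = len(a & b)
    denominator = common + alpha * len(a - b) + beta * len(b - a)
    return common / denominator if denominator else 0.0

# Hypothetical characteristic sets: each concept plus its ancestors.
dog = {"dog", "mammal", "animal"}
cat = {"cat", "mammal", "animal"}

jaccard_like = tversky(dog, cat, alpha=1.0, beta=1.0)  # 2 / (2 + 1 + 1)
dice_like = tversky(dog, cat, alpha=0.5, beta=0.5)     # 2 / (2 + 0.5 + 0.5)
```

Setting α = β = 1 reproduces the Jaccard coefficient and α = β = 0.5 reproduces the Dice coefficient, so changing the two weights (or the characteristic set) moves between familiar measures without changing the formula.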
2011
Semantic similarity estimation is an important component of analysing natural language resources like clinical records. Proper understanding of concept semantics allows for improved use and integration of heterogeneous clinical sources as well as higher information retrieval accuracy. Semantic similarity has been the focus of much research, which has led to the definition of heterogeneous measures using different theoretical principles and knowledge resources in a variety of contexts and application domains.