George Tsatsaronis | Technische Universität Dresden
Papers by George Tsatsaronis
Understanding the structure of complex networks and uncovering the properties of their constituents has been for many decades at the center of study of several fundamental sciences, such as discrete mathematics and graph theory. Especially during the previous decade, we have witnessed an explosion in complex network data, with two cornerstone paradigms being the biological networks and the social networks.
It is common knowledge that plagiarism in academia goes as far back in time as research itself. However, in the last two decades this phenomenon of academic deception has turned into an academic plague. Undoubtedly, the rapid expansion of the Web and the vast amount of publicly available information and documents facilitate the unethical malpractice of computer-aided plagiarism, which in turn has inflated the problem.
Background
The complexity and scale of the knowledge in the biomedical domain have motivated research work towards mining heterogeneous data from both structured and unstructured knowledge bases. Towards this direction, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. This work addresses this problem by using indirect knowledge connecting two concepts in a knowledge graph to discover hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (textual) data. In this graph, path patterns, i.e. sequences of relations, that potentially characterize a biomedical relation are mined using distant supervision.
Results
It is possible to identify characteristic path patterns of biomedical relations from this representation using machine learning. For experimental evaluation, two frequent biomedical relations, namely “has target” and “may treat”, are chosen. Results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8, a great improvement over random classification, which shows that good predictions can be prioritized by following the suggested approach.
Conclusions
Analysis of the results indicates that the models can successfully learn expressive path patterns for the examined relations. Furthermore, this work demonstrates that the constructed graph allows for the easy integration of heterogeneous information and discovery of indirect connections between biomedical concepts.
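To make the notion of a path pattern concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy directed graph given as (source, relation, target) triples and enumerates the relation sequences on simple paths between two concepts; the concept names, relation labels, and the `max_len` cutoff are all hypothetical.

```python
from collections import defaultdict


def build_graph(triples):
    """Index (source, relation, target) triples by source vertex."""
    graph = defaultdict(list)
    for source, relation, target in triples:
        graph[source].append((relation, target))
    return graph


def path_patterns(graph, start, end, max_len=3):
    """Collect the relation sequences of all simple paths from start to end.

    Each pattern is a tuple of relation labels; max_len bounds the number
    of edges on a path (an illustrative cutoff, not taken from the paper).
    """
    patterns = []

    def walk(node, relations, visited):
        if node == end and relations:
            patterns.append(tuple(relations))
            return
        if len(relations) == max_len:
            return
        for relation, nxt in graph.get(node, []):
            if nxt not in visited:
                walk(nxt, relations + [relation], visited | {nxt})

    walk(start, [], {start})
    return patterns


# Hypothetical toy triples, for illustration only.
triples = [
    ("drug_A", "inhibits", "protein_X"),
    ("protein_X", "associated_with", "disease_D"),
    ("drug_A", "similar_to", "drug_B"),
    ("drug_B", "may_treat", "disease_D"),
]
graph = build_graph(triples)
patterns = path_patterns(graph, "drug_A", "disease_D")
```

In a distant-supervision setting, patterns collected this way for concept pairs already known to hold a relation could serve as features for a classifier; that step is left out of the sketch.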
Ontologies such as the SNOMED Clinical Terms (SNOMED CT) and the Medical Subject Headings (MeSH) play a major role in life sciences. Formally modeling the concepts and the roles in this domain is a crucial process to allow for the integration of biomedical knowledge across applications. In this direction we propose a novel methodology to learn formal definitions for biomedical concepts from unstructured text. We evaluate experimentally the suggested methodology in learning formal definitions of SNOMED CT concepts, using their text definitions from MeSH. The evaluation is focused on the learning of three roles which are among the most populated roles in SNOMED CT: Associated Morphology, Finding Site and Causative Agent. Results show that our methodology may provide an Accuracy of up to 75%. For the representation of the instances three main approaches are suggested, namely, Bag of Words, word n-grams and character n-grams.
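The three instance representations named above can be illustrated with a short sketch. It assumes whitespace tokenization and lowercasing, which the abstract does not specify, and the example definition string is made up.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text (whitespace tokens, lowercased)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of a text (lowercased, spaces kept)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


# Hypothetical text definition, for illustration only.
definition = "Inflammation of the lung"

bag_of_words = Counter(word_ngrams(definition, 1))   # unigram counts
word_bigrams = Counter(word_ngrams(definition, 2))   # word 2-grams
char_trigrams = Counter(char_ngrams(definition, 3))  # character 3-grams
```

Each `Counter` is a sparse feature vector that could be passed to a classifier after vectorization.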
Methods, 2014
The amount of biomedical literature has been increasing rapidly during the last decade. Text mining techniques can harness this large-scale data, shed light onto complex drug mechanisms, and extract relation information that can support computational polypharmacology. In this work, we introduce a fully corpus-based and unsupervised method which utilizes the MEDLINE indexed titles and abstracts to infer drug-gene associations and assist drug repositioning. The method measures the Pointwise Mutual Information (PMI) between biomedical terms derived from the Gene Ontology and the Medical Subject Headings. Based on the PMI scores, drug and gene profiles are generated and candidate drug-gene associations are inferred by computing the relatedness of their profiles. Results show that an Area Under the Curve (AUC) of up to 0.88 can be achieved. The method can successfully identify direct drug-gene associations with high precision and prioritize them. Validation shows that the statistically derived profiles from literature perform as well as manually curated profiles. In addition, we examine the potential application of our approach towards drug repositioning. For all FDA approved drugs repositioned over the last 5 years, we generate profiles from publications before 2009 and show that new indications rank high in the profiles. In summary, literature-mined profiles can accurately predict drug-gene associations and provide insights into potential repositioning cases.
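As a rough illustration of the PMI step, the sketch below uses the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))) with probabilities estimated from document-level co-occurrence. The estimation scheme, the function name `pmi_scores`, and the toy documents are assumptions, not details taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Compute PMI for every pair of terms that co-occur in a document.

    documents: iterable of term collections, one per document.
    Probabilities are document frequencies divided by the number of documents.
    Keys of the result are alphabetically ordered term pairs.
    """
    documents = list(documents)
    n_docs = len(documents)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in documents:
        unique = sorted(set(terms))
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    scores = {}
    for (a, b), joint in pair_counts.items():
        p_joint = joint / n_docs
        p_a = term_counts[a] / n_docs
        p_b = term_counts[b] / n_docs
        scores[(a, b)] = math.log2(p_joint / (p_a * p_b))
    return scores


# Hypothetical annotated abstracts, for illustration only.
documents = [
    {"aspirin", "inflammation"},
    {"aspirin", "inflammation"},
    {"aspirin", "pain"},
    {"kinase", "apoptosis"},
]
scores = pmi_scores(documents)
```

A profile for a drug or gene could then be the vector of its PMI scores against all other terms, with profile relatedness measured by, for example, cosine similarity; the abstract does not state which measure is used.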
The wealth of the publicly available data repositories related to chemical compounds and substances allows current research methodologies to integrate pieces of information across different resources. Typical compound-to-compound relatedness measures are based on structural commonalities between the compounds, sequential information for their targets and/or toxicological liabilities. In this paper, we take a step further towards compound-to-compound relatedness and integrate a significantly larger number of compound characteristics, including chemical properties, indications, related sequences, pathways, genes, toxicity, and other pharmacological information. The main novelty of the suggested methodology is the systematic data integration and the combination of several similarity measures, including a string kernel, Jaccard and Tanimoto. We evaluate the performance of the proposed relatedness measure through a manually curated benchmark dataset, also introduced in this work. Our results suggest that the proposed method generates meaningful associations among the tested compounds; examples of these associations are presented and discussed analytically.
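Two of the similarity measures mentioned above have standard definitions that are easy to sketch: Jaccard over sets of annotations and Tanimoto over numeric feature vectors. How the paper combines them is not stated in the abstract, so the weighted average in `combined_relatedness` is purely a hypothetical placeholder, and the string kernel is omitted.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tanimoto(u, v):
    """Tanimoto coefficient u·v / (|u|^2 + |v|^2 - u·v) of two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    denom = sum(x * x for x in u) + sum(y * y for y in v) - dot
    return dot / denom if denom else 0.0


def combined_relatedness(scores, weights=None):
    """Weighted average of per-feature similarities (hypothetical scheme)."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical compound annotations, for illustration only.
pathways_1 = {"pathway_a", "pathway_b"}
pathways_2 = {"pathway_b", "pathway_c"}
fingerprint_1 = [1, 1, 0]
fingerprint_2 = [0, 1, 1]

relatedness = combined_relatedness([
    jaccard(pathways_1, pathways_2),
    tanimoto(fingerprint_1, fingerprint_2),
])
```

For binary vectors the Tanimoto coefficient coincides with the Jaccard similarity of the corresponding sets, which is why both toy scores above are equal.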
Ontologies play a major role in life sciences, enabling a number of applications, from new data integration to knowledge verification. SNOMED CT is a large medical ontology that is formally defined so that it ensures global consistency and support of complex reasoning tasks. Most biomedical ontologies and taxonomies on the other hand define concepts only textually, without the use of logic. Here, we investigate how to automatically generate formal concept definitions from textual ones. We develop a method that uses machine learning in combination with several types of lexical and semantic features and outputs formal definitions that follow the structure of SNOMED CT concept definitions.
The increasing volume of biomedical literature, e.g., PubMed-indexed articles, constitutes a huge data source for applying text mining and predicting trends and new biomedical terminology. In this work we explore, for the first time to the best of our knowledge, the application of temporal language models in the PubMed-indexed literature, in order to identify trends in terms. The suggested methodology comprises three steps: (i) training of temporal language models using a parametric window of
The complexity and scale of the knowledge in the biomedical domain has motivated research work towards mining heterogeneous data from structured and unstructured knowledge bases. Towards this direction, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. In this work we attempt to address this problem by using indirect knowledge connecting two concepts in a graph to identify hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (text) data. In this graph we attempt to mine path patterns which potentially characterize a biomedical relation. For our experimental evaluation we focus on two frequent relations, namely "has target", and "may treat". Our results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8. Finally, analysis of the results indicates that the models can successfully learn expressive path patterns for the examined relations.
Data intensive applications produce complex information that is posing requirements for novel Database Management Systems (DBMSs). Such information is characterized by its huge volume of data and by its diversity and complexity, since data processing methods such as pattern recognition, data mining and knowledge extraction result in knowledge artifacts like clusters, association rules, decision trees and others. These artifacts, which we call patterns, need to be stored and retrieved efficiently. In order to accomplish this we have to express them within a formalism and a language.
... Mourtzoukos1, Joseph Roumier3, Nikolaos Matskanis3, Michael Schroeder2, Philippe Massonet3, Dimitrios Koutsouris1, Theodora ... O., Ontology Alignment Design Patterns 2010, http://scharffe.fr/pub/ontology ... 4. Harper, M. (2012) The Truly Staggering Cost Of Inventing New ...
In proceedings of the 6th International Workshop on Personalized Access, Profile Management, and Context Awareness in Databases in conjunction with VLDB 2012 (VLDB/PersDB 2012), Aug 1, 2012
The rapidly increasing volume of published clinical and nonclinical data at a variety of sources and the resulting great effort required for researchers to access them and mine information of interest lead to clinical trials that are based on only a limited set of knowledge in the domain they cover. This restricted view of the clinical trials' context is quite often the reason behind unsuccessful trials and/or successful ones which, however, underestimate drugs' unwanted effects and thus their results are of low external validity in the much more complicated environment of clinical healthcare. In this paper, we present a context-aware approach, which has been developed in the PONTE project, for effectively guiding medical researchers during clinical trial protocol design and allowing for more efficient and effective access to scientific literature. The suggested approach incorporates intelligent services and advanced text mining mechanisms for scientific literature querying and mining during protocol design, taking into account the study context (i.e. active substance, target and disease) and the domain context in literature.
Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, though they are considerably less accessible. One option to expand patent search beyond pure keywords is the inclusion of classification information: Since every patent is assigned at least one class code, it should be possible for these assignments to be automatically used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. This report describes our comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows a strong structural similarity of the hierarchies, but significant differences of terms and annotations. The low number of IPC class assignments and the lack of occurrences of class labels in patent texts imply that current patent search is severely limited. To overcome these limits, we evaluate a method for the automated assignment of additional classes to patent documents, and we propose a system for guided patent search based on the use of class co-occurrence information and external resources.
Biomedical Data Journal, 2015
The rapidly growing wealth of published scientific work, produced by researchers and scholars, has resulted in a pressing need for more effective processes towards reviewing scientific articles and research data, organizing data journals, as well as for improved tools and techniques for bibliographic analysis and management of scientometrics. The ongoing EU research project OpenScienceLink aims to address these needs, as well as offer a wide range of opportunities for better collaboration between researchers, by introducing a web-based Platform which offers efficient and intelligent applications and services for exploiting open access scientific information in the biomedical domain. The Platform is empowered by the semantic and social networking capabilities of three leading edge background infrastructures, which have been adapted and integrated for the scope of the project. In this paper, we present the five pilot services that are provided by the OpenScienceLink project. All five services are integrated into the web-based OpenScienceLink platform that is publicly accessible at
The complexity and scale of the knowledge in the biomedical domain has motivated research work to... more The complexity and scale of the knowledge in the biomedical domain has motivated research work towards mining heterogeneous data from structured and unstructured knowledge bases. Towards this direction, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. In this work we attempt to address this problem by using indirect knowledge connecting two concepts in a graph to identify hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (text) data. In this graph we attempt to mine path patterns which potentially characterize a biomedical relation. For our experimental evaluation we focus on two frequent relations, namely "has target", and "may treat". Our results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8. Finally, analysis of the results indicates tha...
Abstract Understanding the structure of complex networks and uncovering the properties of their c... more Abstract Understanding the structure of complex networks and uncovering the properties of their constituents has been for many decades at the center of study of several fundamental sciences, such as discrete mathematics and graph theory. Especially during the previous decade, we have witnessed an explosion in complex network data, with two cornerstone paradigms being the biological networks and the social networks.
Abstract It is common knowledge that plagiarism in academia goes as back in time as research itse... more Abstract It is common knowledge that plagiarism in academia goes as back in time as research itself. However, in the last two decades this phenomenon of academic deception has turned into an academic plague. Undoubtedly, the rapid expansion of the Web and the vast amount of publicly available information and documents facilitate the unethical malpractice of computer-aided plagiarism, which in turn has inflated the problem.
Background The complexity and scale of the knowledge in the biomedical domain has motivated rese... more Background
The complexity and scale of the knowledge in the biomedical domain has motivated research work towards mining heterogeneous data from both structured and unstructured knowledge bases. Towards this direction, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. This work addresses this problem by using indirect knowledge connecting two concepts in a knowledge graph to discover hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (textual) data. In this graph, path patterns, i.e. sequences of relations, are mined using distant supervision that potentially characterize a biomedical relation.
Results
It is possible to identify characteristic path patterns of biomedical relations from this representation using machine learning. For experimental evaluation two frequent biomedical relations, namely “has target”, and “may treat”, are chosen. Results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8, a result which is a great improvement compared to the random classification, and which shows that good predictions can be prioritized by following the suggested approach.
Conclusions
Analysis of the results indicates that the models can successfully learn expressive path patterns for the examined relations. Furthermore, this work demonstrates that the constructed graph allows for the easy integration of heterogeneous information and discovery of indirect connections between biomedical concepts.
Ontologies such as the SNOMED Clinical Terms (SNOMED CT), and the Medical Subject Headings (MeSH)... more Ontologies such as the SNOMED Clinical Terms (SNOMED CT), and the Medical Subject Headings (MeSH) play a major role in life sciences. Modeling formally the concepts and the roles in this domain is a crucial process to allow for the integration of biomedical knowledge across applications. In this direction we propose a novel methodology to learn formal definitions for biomedical concepts from unstructured text. We evaluate experimentally the suggested methodology in learning formal definitions of SNOMED CT concepts, using their text definitions from MeSH. The evaluation is focused on the learning of three roles which are among the most populated roles in SNOMED CT: Associated Morphology, Finding Site and Causative Agent. Results show that our methodology may provide an Accuracy of up to 75%. For the representation of the instances three main approaches are suggested, namely, Bag of Words, word n-grams and character n-grams.
Methods, 2014
The amount of biomedical literature has been increasing rapidly during the last decade. Text mini... more The amount of biomedical literature has been increasing rapidly during the last decade. Text mining techniques can harness this large-scale data, shed light onto complex drug mechanisms, and extract relation information that can support computational polypharmacology. In this work, we introduce a fully corpus-based and unsupervised method which utilizes the MEDLINE indexed titles and abstracts to infer drug gene associations and assist drug repositioning. The method measures the Pointwise Mutual Information (PMI) between biomedical terms derived from the Gene Ontology and the Medical Subject Headings. Based on the PMI scores, drug and gene profiles are generated and candidate drug gene associations are inferred when computing the relatedness of their profiles. Results show that an Area Under the Curve (AUC) of up to 0.88 can be achieved. The method can successfully identify direct drug gene associations with high precision and prioritize them. Validation shows that the statistically derived profiles from literature perform as good as manually curated profiles. In addition, we examine the potential application of our approach towards drug repositioning. For all FDA approved drugs repositioned over the last 5years, we generate profiles from publications before 2009 and show that new indications rank high in the profiles. In summary, literature mined profiles can accurately predict drug gene associations and provide insights onto potential repositioning cases.
The wealth of the publicly available data repositories related to chemical compounds and substanc... more The wealth of the publicly available data repositories related to chemical compounds and substances allows current research methodologies to integrate pieces of information across different resources. Typical compound-to-compound relatedness measures are based on structural commonalities between the compounds, sequential information for their targets or/and toxicological liabilities. In this paper, we take a step further towards compound-to-compound relatedness and integrate a significantly larger number of compound characteristics, including chemical properties, indications, related sequences, pathways, genes, toxicity, and other pharmacological information. The main novelty of the suggested methodology is the systematic data integration and the combination of several similarity measures, including a string kernel, Jaccard and Tanimoto. We evaluate the performance of the proposed relatedness measure through a manually curated benchmark dataset, also introduced in this work. Our results suggest that the proposed method generates meaningful associations among the tested compounds; examples of these associations are presented and discussed analytically.
Ontologies play a major role in life sciences, enabling a number of applications, from new data i... more Ontologies play a major role in life sciences, enabling a number of applications, from new data integration to knowledge verification. SNOMED CT is a large medical ontology that is formally defined so that it ensures global consistency and support of complex reasoning tasks. Most biomedical ontologies and taxonomies on the other hand define concepts only textually, without the use of logic. Here, we investigate how to automatically generate formal concept definitions from textual ones. We develop a method that uses machine learning in combination with several types of lexical and semantic features and outputs formal definitions that follow the structure of SNOMED CT concept definitions.
The increasing volume of biomedical literature. e.g., PubMed indexed articles constitutes a huge ... more The increasing volume of biomedical literature. e.g., PubMed indexed articles constitutes a huge data source for applying text mining and predicting trends and new biomedical terminology. In this work we explore, for the first time to the best of our knowledge, the application of temporal language models in the PubMed indexed literature, in order to identify trends in terms. The suggested methodology comprises three steps: (i) training of temporal language models using a parametric window of
The complexity and scale of the knowledge in the biomedical domain has motivated research work to... more The complexity and scale of the knowledge in the biomedical domain has motivated research work towards mining heterogeneous data from structured and unstructured knowledge bases. Towards this direction, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. In this work we attempt to address this problem by using indirect knowledge connecting two concepts in a graph to identify hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (text) data. In this graph we attempt to mine path patterns which potentially characterize a biomedical relation. For our experimental evaluation we focus on two frequent relations, namely "has target", and "may treat". Our results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8. Finally, analysis of the results indicates that the models can successfully learn expressive path patterns for the examined relations.
Methods, 2014
The amount of biomedical literature has been increasing rapidly during the last decade. Text mini... more The amount of biomedical literature has been increasing rapidly during the last decade. Text mining techniques can harness this large-scale data, shed light onto complex drug mechanisms, and extract relation information that can support computational polypharmacology. In this work, we introduce a fully corpus-based and unsupervised method which utilizes the MEDLINE indexed titles and abstracts to infer drug gene associations and assist drug repositioning. The method measures the Pointwise Mutual Information (PMI) between biomedical terms derived from the Gene Ontology and the Medical Subject Headings. Based on the PMI scores, drug and gene profiles are generated and candidate drug gene associations are inferred when computing the relatedness of their profiles. Results show that an Area Under the Curve (AUC) of up to 0.88 can be achieved. The method can successfully identify direct drug gene associations with high precision and prioritize them. Validation shows that the statistically derived profiles from literature perform as good as manually curated profiles. In addition, we examine the potential application of our approach towards drug repositioning. For all FDA approved drugs repositioned over the last 5years, we generate profiles from publications before 2009 and show that new indications rank high in the profiles. In summary, literature mined profiles can accurately predict drug gene associations and provide insights onto potential repositioning cases.
Ontologies such as the SNOMED Clinical Terms (SNOMED CT), and the Medical Subject Headings (MeSH)... more Ontologies such as the SNOMED Clinical Terms (SNOMED CT), and the Medical Subject Headings (MeSH) play a major role in life sciences. Modeling formally the concepts and the roles in this domain is a crucial process to allow for the integration of biomedical knowledge across applications. In this direction we propose a novel methodology to learn formal definitions for biomedical concepts from unstructured text. We evaluate experimentally the suggested methodology in learning formal definitions of SNOMED CT concepts, using their text definitions from MeSH. The evaluation is focused on the learning of three roles which are among the most populated roles in SNOMED CT: Associated Morphology, Finding Site and Causative Agent. Results show that our methodology may provide an Accuracy of up to 75%. For the representation of the instances three main approaches are suggested, namely, Bag of Words, word n-grams and character n-grams.
Data intensive applications produce complex information that is posing requirements for novel Dat... more Data intensive applications produce complex information that is posing requirements for novel Database Management Systems (DBMSs). Such information is characterized by its huge volume of data and by its diversity and complexity, since the data processing methods such as pattern recognition, data mining and knowledge extraction result in knowledge artifacts like clusters, association rules, decision trees and others. These artifacts that we call patterns need to be stored and retrieved efficiently. In order to accomplish this we have to express them within a formalism and a language.
... Mourtzoukos1, Joseph Roumier3, Nikolaos Matskanis3, Michael Schroeder2, Philippe Massonet3, D... more ... Mourtzoukos1, Joseph Roumier3, Nikolaos Matskanis3, Michael Schroeder2, Philippe Massonet3, Dimitrios Koutsouris1, Theodora ... O., Ontology Alignment Design Patterns 2010, http://scharffe.fr/pub/ontology ... 4. Harper, M. (2012) The Truly Staggering Cost Of Inventing New ...
In proceedings of the 6th International Workshop on Personalized Access, Profile Management, and Context Awareness in Databases in conjunction with VLDB 2012 (VLDB/PersDB 2012), Aug 1, 2012
The rapidly increasing volume of published clinical and nonclinical data at a variety of sources ... more The rapidly increasing volume of published clinical and nonclinical data at a variety of sources and the resulting great effort required for researchers to access them and mine information of interest lead to clinical trials that are based on only a limited set of knowledge in the domain they cover. This restricted view of the clinical trials' context is quite often the reason behind unsuccessful trials and/or successful ones which, however, underestimate drugs' unwanted effects and thus their results are of low external validity in the much more complicated environment of clinical healthcare. In this paper, we present a context-aware approach, which has been developed in the PONTE project, for effectively guiding medical researchers during clinical trial protocol design and allowing for more efficient and effective access to scientific literature. The suggested approach incorporates intelligent services and advanced text mining mechanisms for scientific literature querying and mining during protocol design, taking into account the study context (i.e. active substance, target and disease) and the domain context in literature.
Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Patent documents are another important information source, though they are considerably less accessible. One option to expand patent search beyond pure keywords is the inclusion of classification information: Since every patent is assigned at least one class code, it should be possible for these assignments to be automatically used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. This report describes our comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms/classes respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows a strong structural similarity of the hierarchies, but significant differences of terms and annotations. The low number of IPC class assignments and the lack of occurrences of class labels in patent texts imply that current patent search is severely limited. To overcome these limits, we evaluate a method for the automated assignment of additional classes to patent documents, and we propose a system for guided patent search based on the use of class co-occurrence information and external resources.
Biomedical Data Journal, 2015
The rapidly growing wealth of published scientific work, produced by researchers and scholars, has resulted in a pressing need for more effective processes towards reviewing scientific articles and research data, organizing data journals, as well as for improved tools and techniques for bibliographic analysis and management of scientometrics. The ongoing EU research project OpenScienceLink aims to address these needs, as well as offer a wide range of opportunities for better collaboration between researchers, by introducing a web-based Platform which offers efficient and intelligent applications and services for exploiting open access scientific information in the biomedical domain. The Platform is empowered by the semantic and social networking capabilities of three leading edge background infrastructures, which have been adapted and integrated for the scope of the project. In this paper, we present the five pilot services that are provided by the OpenScienceLink project. All five services are integrated into the web-based OpenScienceLink platform that is publicly accessible at
The complexity and scale of the knowledge in the biomedical domain have motivated research work towards mining heterogeneous data from structured and unstructured knowledge bases. Towards this direction, it is necessary to combine facts in order to formulate hypotheses or draw conclusions about the domain concepts. In this work we attempt to address this problem by using indirect knowledge connecting two concepts in a graph to identify hidden relations between them. The graph represents concepts as vertices and relations as edges, stemming from structured (ontologies) and unstructured (text) data. In this graph we attempt to mine path patterns which potentially characterize a biomedical relation. For our experimental evaluation we focus on two frequent relations, namely "has target" and "may treat". Our results suggest that relation discovery using indirect knowledge is possible, with an AUC that can reach up to 0.8. Finally, analysis of the results indicates tha...