Yehoshua Perl - Academia.edu
Papers by Yehoshua Perl
Proceedings Amia Annual Symposium Amia Symposium, Feb 1, 2002
The Unified Medical Language System integrates about 800,000 concepts from 99 biomedical terminologies. Each concept is assigned to at least one semantic type of the Semantic Network. During the integration, it is unavoidable that some classification errors and inconsistencies will be introduced. In this paper, we present an auditing technique to find such errors and inconsistencies. Our technique is based on an expert reviewing the pure intersections of meta-semantic types of the metaschema, a compact abstract view of the Semantic Network. Results regarding the pure intersections are reported. The analysis results for pure intersections with 1 to 6 concepts are presented. Various kinds of errors are identified.
Networks, 1982
The following problem is considered: Given an integer K and a graph G with two distinct vertices s and t, find the maximum number of disjoint paths of length K from s to t. The problem has several variants: the paths may be vertex-disjoint or edge-disjoint, the lengths of the paths may be equal to K or bounded by K, and the graph may be undirected or directed. It is shown that, except for small values of K, all the problems are NP-complete. Assuming P ≠ NP, for each problem, the largest value of K for which the problem is not NP-complete is found. Whenever a polynomial algorithm exists, an efficient algorithm is described.
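To make the problem statement concrete, here is a minimal brute-force sketch of one variant: vertex-disjoint paths in an undirected graph, with lengths bounded by K. It is not the algorithm from the paper. The function names and the adjacency-dictionary input format are assumptions made for illustration, and the search is exponential, so it is only suitable for very small graphs.

```python
from itertools import combinations


def bounded_paths(adj, s, t, k):
    """Enumerate all simple s-t paths that use at most k edges (plain DFS)."""
    paths = []

    def dfs(node, path):
        if node == t:
            paths.append(tuple(path))
            return
        if len(path) - 1 == k:  # k edges already used, cannot extend further
            return
        for nxt in adj.get(node, ()):
            if nxt not in path:
                dfs(nxt, path + [nxt])

    dfs(s, [s])
    return paths


def max_vertex_disjoint(adj, s, t, k):
    """Size of the largest set of s-t paths (each <= k edges) sharing no internal vertex."""
    paths = bounded_paths(adj, s, t, k)
    for size in range(len(paths), 0, -1):
        for combo in combinations(paths, size):
            seen, ok = set(), True
            for p in combo:
                inner = set(p[1:-1])  # internal vertices only; s and t are shared
                if seen & inner:
                    ok = False
                    break
                seen |= inner
            if ok:
                return size
    return 0


# Hypothetical example graph: a short route s-a-t and a longer route s-b-c-t.
adj = {
    "s": ["a", "b"],
    "a": ["s", "t"],
    "b": ["s", "c"],
    "c": ["b", "t"],
    "t": ["a", "c"],
}
print(max_vertex_disjoint(adj, "s", "t", 2))
print(max_vertex_disjoint(adj, "s", "t", 3))
```

With the bound K = 2 only the short route qualifies, while K = 3 admits both routes, which shows how the answer depends on K.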
Journal of biomedical informatics, Jan 28, 2018
In previous research, we have demonstrated for a number of ontologies that structurally complex concepts (for different definitions of "complex") in an ontology are more likely to exhibit errors than other concepts. Thus, such complex concepts often become fertile ground for quality assurance (QA) in ontologies. They should be audited first. One example of complex concepts is given by "overlapping concepts" (to be defined below). Historically, a different auditing methodology had to be developed for every single ontology. For better scalability and efficiency, it is desirable to identify family-wide QA methodologies. Each such methodology would be applicable to a whole family of similar ontologies. In past research, we had divided the 685 ontologies of BioPortal into families of structurally similar ontologies. We showed for four ontologies of the same large family in BioPortal that "overlapping concepts" are indeed statistically significantly more like...
Journal of healthcare engineering, 2017
Ontologies are important components of health information management systems. As such, the quality of their content is of paramount importance. It has been proven to be practical to develop quality assurance (QA) methodologies based on automated identification of sets of concepts expected to have higher likelihood of errors. Four kinds of such sets (called QA-sets) organized around the themes of complex and uncommonly modeled concepts are introduced. A survey of different methodologies based on these QA-sets and the results of applying them to various ontologies are presented. Overall, following these approaches leads to higher QA yields and better utilization of QA personnel. The formulation of additional QA-set methodologies will further enhance the suite of available ontology QA tools.
Studies in health technology and informatics, 2017
Maintenance and use of a large ontology, consisting of thousands of knowledge assertions, are hampered by its scope and complexity. It is important to provide tools for summarization of ontology content in order to facilitate user "big picture" comprehension. We present a parameterized methodology for the semi-automatic summarization of major topics in an ontology, based on a compact summary of the ontology, called an "aggregate partial-area taxonomy", followed by manual enhancement. An experiment is presented to test the effectiveness of such summarization, measured by coverage of a given list of major topics of the corresponding application domain. SNOMED CT's Specimen hierarchy is the test-bed. A domain expert provided a list of topics that serves as a gold standard. The enhanced results show that the aggregate taxonomy covers most of the domain's main topics.
Studies in health technology and informatics, 2017
In previous research we have shown that hierarchically complex overlapping concepts have a higher error rate than control concepts. In this poster we show an example from Neoplasm concepts of the NCI Thesaurus (NCIt) demonstrating that erroneous overlapping concepts, reflected in the partial-area units of a partial-area taxonomy, display visual complexity. Furthermore, correcting these erroneous concepts causes visual simplification.
Journal of Biomedical Informatics
Annals of the New York Academy of Sciences
The purpose of the Big Data to Knowledge initiative is to develop methods for discovering new knowledge from large amounts of data. However, if the resulting knowledge is so large that it resists comprehension, referred to here as Big Knowledge (BK), how can it be used properly and creatively? We call this secondary challenge Big Knowledge to Use. Without a high-level mental representation of the kinds of knowledge in a BK knowledgebase, effective or innovative use of the knowledge may be limited. We describe summarization and visualization techniques that capture the big picture of a BK knowledgebase, possibly created from Big Data. In this research, we distinguish between assertion BK and rule-based BK (rule BK) and demonstrate the usefulness of summarization and visualization techniques of assertion BK for clinical phenotyping. As an example, we illustrate how a summary of many intracranial bleeding concepts can improve phenotyping, compared to the traditional approach. We also demonstrate the usefulness of summarization and visualization techniques of rule BK for drug-drug interaction discovery.
Journal of biomedical informatics, Mar 1, 2017
Thousands of changes are applied to SNOMED CT's concepts during each release cycle. These changes are the result of efforts to improve or expand the coverage of health domains in the terminology. Understanding which concepts changed, how they changed, and the overall impact of a set of changes is important for editors and end users. Each SNOMED CT release comes with delta files, which identify all of the individual additions and removals of concepts and relationships. These files typically contain tens of thousands of individual entries, overwhelming users. They also do not identify the editorial processes that were applied to individual concepts and they do not capture the overall impact of a set of changes on a subhierarchy of concepts. In this paper we introduce a methodology and accompanying software tool called a SNOMED CT Visual Semantic Delta ("semantic delta" for short) to enable a comprehensive review of changes in SNOMED CT. The semantic delta displays a grap...
AMIA ... Annual Symposium proceedings. AMIA Symposium, 2016
SNOMED CT's content undergoes many changes from one release to the next. Over the last year SNOMED CT's Bacterial infectious disease subhierarchy has undergone significant editing to bring consistent modeling to its concepts. In this paper we analyze the stated and inferred structural modifications that affected the Bacterial infectious disease subhierarchy between the Jan 2015 and Jan 2016 SNOMED CT releases using a two-phase approach. First, we introduce a methodology for creating a human-readable list of changes. Next, we utilize partial-area taxonomies, which are compact summaries of SNOMED CT's content and structure, to identify the "big picture" changes that occurred in the subhierarchy. We illustrate how partial-area taxonomies can be used to help identify groups of concepts that were affected by these editing operations and the nature of these changes. Modeling issues identified using our two-phase methodology are discussed.
Journal of biomedical informatics, Aug 1, 2016
Software tools play a critical role in the development and maintenance of biomedical ontologies. One important task that is difficult without software tools is ontology quality assurance. In previous work, we have introduced different kinds of abstraction networks to provide a theoretical foundation for ontology quality assurance tools. Abstraction networks summarize the structure and content of ontologies. One kind of abstraction network that we have used repeatedly to support ontology quality assurance is the partial-area taxonomy. It summarizes structurally and semantically similar concepts within an ontology. However, the use of partial-area taxonomies was ad hoc and not generalizable. In this paper, we describe the Ontology Abstraction Framework (OAF), a unified framework and software system for deriving, visualizing, and exploring partial-area taxonomy abstraction networks. The OAF includes support for various ontology representations (e.g., OWL and SNOMED CT's relational ...
Journal of biomedical informatics, Jul 1, 2017
Biomedical ontologies often reuse content (i.e., classes and properties) from other ontologies. Content reuse enables a consistent representation of a domain and reusing content can save an ontology author significant time and effort. Prior studies have investigated the existence of reused terms among the ontologies in the NCBO BioPortal, but as of yet there has not been a study investigating how the ontologies in BioPortal utilize reused content in the modeling of their own content. In this study we investigate how 355 ontologies hosted in the NCBO BioPortal reuse content from other ontologies for the purposes of creating new ontology content. We identified 197 ontologies that reuse content. Among these ontologies, 108 utilize reused classes in the modeling of their own classes and 116 utilize reused properties in class restrictions. Current utilization of reuse and quality issues related to reuse are discussed.
Artificial intelligence in medicine, Jun 19, 2017
To examine whether disjoint partial-area taxonomy, a semantically-based evaluation methodology that has been successfully tested in SNOMED CT, will perform with similar effectiveness on Uberon, an anatomical ontology that belongs to a structurally similar family of ontologies as SNOMED CT. A disjoint partial-area taxonomy was generated for Uberon. One hundred randomly selected test concepts that overlap between partial-areas were matched to a same-size control sample of non-overlapping concepts. The samples were blindly inspected for non-critical issues and presumptive errors first by a general domain expert whose results were then confirmed or rejected by a highly experienced anatomical ontology domain expert. Reported issues were subsequently reviewed by Uberon's curators. Overlapping concepts in Uberon's disjoint partial-area taxonomy exhibited a significantly higher rate of all issues. Clear-cut presumptive errors trended similarly but did not reach statistical significa...
Great Lakes Computer Science Conference, 1989
Two common schemes in data compression are fixed-to-variable-length coding and variable-to-fixed-length coding. Higher compression is expected from the more flexible scheme of variable-to-variable-length coding. In such a scheme a compression dictionary is used to transfer variable-length strings over the text alphabet into variable-length strings over the coding alphabet. The compression is achieved by matching longer, more frequent text strings with shorter coding strings. To obtain a variable-to-variable-length coding we choose to cascade the LZW (variable-to-fixed) coding with the Huffman (fixed-to-variable) coding. In this work we consider the effective way of performing this cascading, to optimize the compression using limited time resources.
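The cascade described in this abstract can be illustrated with a minimal sketch: a textbook LZW encoder produces dictionary indices, and a Huffman code built from the frequencies of those indices assigns shorter bit strings to the more frequent ones. This is a naive composition under stated assumptions (single-byte input characters, an unbounded LZW table, Huffman statistics gathered in a separate pass), not the tuned cascading scheme the paper studies. All function names are hypothetical.

```python
import heapq
from collections import Counter


def lzw_encode(text):
    """Variable-to-fixed stage: map growing text strings to dictionary indices."""
    table = {chr(i): i for i in range(256)}  # assumes characters below code point 256
    current, codes = "", []
    for ch in text:
        if current + ch in table:
            current += ch
        else:
            codes.append(table[current])
            table[current + ch] = len(table)
            current = ch
    if current:
        codes.append(table[current])
    return codes


def huffman_code(symbols):
    """Fixed-to-variable stage: build a prefix code from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial codebook).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def cascade(text):
    """Run LZW, then Huffman-code the resulting index stream."""
    codes = lzw_encode(text)
    book = huffman_code(codes)
    return "".join(book[c] for c in codes), book


bits, book = cascade("ABABABA")
print(lzw_encode("ABABABA"), bits)
```

A decoder would also need the Huffman codebook (or the index frequencies) alongside the bit string. That overhead is one of the costs a practical cascade has to weigh.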
The purpose of this paper is to demonstrate how the transformation of a medical vocabulary based on a Semantic Network (SN) model into a vocabulary based on an Object-Oriented Database (OODB) model helps in the maintenance of the vocabulary. We describe an OODB schema which captures the overall structure of the vocabulary in a compact form and uncovers some errors and inconsistencies made in the vocabulary's original modeling. The resolution of these mistakes leads to an improved version of the SN-based vocabulary. A new OODB schema for the vocabulary is then derived based on the improved SN version. This experience demonstrates how the abstraction and modeling capabilities of OODBs can be used to enhance a user's understanding of the overarching structure of a complex medical vocabulary system. The OODB schema developed herein serves as the basis for the Object-Oriented Healthcare Vocabulary Repository (OOHVR), a medical vocabulary implemented as an ONTOS database.
J Amer Med Inform Assoc, 2009
Proceedings Amia Annual Symposium Amia Symposium, Feb 1, 2002
The UMLS's Semantic Network (SN) serves as a valuable abstraction for the underlying concept repository called the Metathesaurus (META). Specifically, the SN forms a classification layer for the META, with each of the META's constituent concepts assigned to one or more semantic types in the SN. The rule in the design of the SN is to have concepts explicitly assigned to the lowest possible semantic types in the SN's IS-A hierarchy. Implicit assignment to higher semantic types can be inferred via the IS-A relationships. However, in subsequent versions of the UMLS, unnecessary, simultaneous assignments to descendant and ancestor semantic types have been discovered (e.g., 8,622 in the UMLS 1998 version and 12,657 in the 2001 version). The assignment of concepts to such ancestor semantic types is called redundant classification. There is a need for an automated auditing tool that can identify all these redundant classifications. In this paper, an efficient algorithm for this auditing task is introduced. Details of its application to the current (2001) version of the UMLS are presented and the results are discussed.
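The notion of redundant classification defined in this abstract can be sketched in a few lines: a concept's assignment to a semantic type is redundant when the concept is also assigned to a descendant of that type. The sketch below is a straightforward check, not the efficient algorithm the paper introduces. The data layout (a parent map for the IS-A hierarchy and a concept-to-types map) and the type and concept names are hypothetical placeholders.

```python
def ancestors(sem_type, parents):
    """All strict ancestors of sem_type under the IS-A parent map."""
    seen, stack = set(), list(parents.get(sem_type, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def redundant_assignments(assignments, parents):
    """Map each concept to its assigned types that are ancestors of another assigned type."""
    result = {}
    for concept, types in assignments.items():
        assigned = set(types)
        redundant = set()
        for t in assigned:
            redundant |= ancestors(t, parents) & assigned
        if redundant:
            result[concept] = redundant
    return result


# Hypothetical IS-A hierarchy: C IS-A B IS-A A, and D IS-A A.
parents = {"C": ["B"], "B": ["A"], "D": ["A"]}
# Hypothetical concept-to-type assignments.
assignments = {
    "concept1": ["C", "A"],  # A is an ancestor of C, so A is redundant
    "concept2": ["C", "D"],  # neither type is an ancestor of the other
    "concept3": ["B"],
}
print(redundant_assignments(assignments, parents))
```

Only `concept1` is reported, because its explicit assignment to `A` can already be inferred from its assignment to `C` via the IS-A relationships.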