Olga Giraldo | Universidad Politécnica de Madrid
Papers by Olga Giraldo
Zenodo (CERN European Organization for Nuclear Research), Sep 30, 2022
Precision, recall and F1 score calculated to compare the automatic annotation with that from domain experts.
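As an illustration of how such scores are typically computed, the following minimal sketch compares a set of automatic annotations with an expert gold standard using exact matching. The annotation tuples, offsets and labels are hypothetical and are not taken from the deposited dataset; the matching criterion used by the authors is not stated here.

```python
def precision_recall_f1(predicted, gold):
    """Score automatic annotations against expert annotations (exact-match sets)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of automatic annotations that the experts also made.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of expert annotations that the automatic process found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations as (start offset, end offset, label) tuples.
automatic = {(0, 5, "Sample"), (10, 18, "Reagent"), (25, 31, "Instrument")}
expert = {(0, 5, "Sample"), (10, 18, "Reagent"), (40, 47, "Reagent"), (50, 58, "Instrument")}

p, r, f1 = precision_recall_f1(automatic, expert)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Exact matching is the strictest choice; a relaxed variant that accepts overlapping spans would give higher scores on the same data.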
With the continuous creation, sharing and transformation of data resulting from Research and Development in the Life Sciences domain, provenance tracking and valorization of research outputs remain challenges. Tracing biomedical materials and associated data throughout the research life cycle requires tracking materials, methods, transformations, partial results, locations and many other facets. Research is not always carried out in one place; it is usually distributed across several laboratories. Accordingly, research outcomes of various kinds are constantly being produced, recorded, transformed and shared in a decentralized manner. In consequence, the digital continuum is very often lost for practical purposes. Moreover, the value of all the assets produced throughout the research life cycle is neglected because we assign all the value to the product that comes at the very end of the process: the scholarly publication holds all the value. We argue that distributed ledgers, Blo...
Background: Information reported by scientific literature still remains locked up in discrete documents that are not always interconnected or machine-readable. The Semantic Web together with approaches such as the Resource Description Framework (RDF) and the Linked Open Data (LOD) initiative offer a connectivity tissue that can be used to support the generation of self-describing, machine-readable documents. Results: Biotea is an approach to generate RDF from scholarly documents. Our RDF model makes extensive use of existing ontologies and semantic enrichment services. Our dataset comprises 270,834 articles from PubMed Open Central in RDF/XML distributed in 404 zipped files. The RDFization process takes care of metadata, e.g., title, authors and journal, as well as semantic annotations on biological entities throughout the full text. Biological entities are extracted by using the NCBO Annotator and Whatizit. We use the Bibliographic Ontology (BIBO), Dublin Core Metadata Initiative Terms (DCMI-terms), and the Provenance Ontology (PROV-O) to model the bibliographic metadata. Links to related pages such as PubMed HTML articles are provided via rdfs:seeAlso, while links to other semantic representations such as Bio2RDF PubMed articles are provided via owl:sameAs. The NCBO Annotator is used to extract entities covering ChEBI for chemicals; Pathway, and Functional Genomics Data Society (MGED) for genes and proteins; Master Drug Data Base (MDDB), NDDF, and NDFRT for drugs; SNOMED, SYMP, MedDRA, MeSH, MedlinePlus Health Topics (MedlinePlus), Online Mendelian Inheritance in Man (OMIM), FMA, ICD10, and Ontology for Biomedical Investigations (OBI) for diseases and medical terms; PO for plants; and MeSH, SNOMED, and NCIt for general terms. Whatizit is used for GO, UniProt proteins, UniProt Taxonomy, and diseases mapped to the UMLS; UniProt taxa are also mapped to NCBI Taxon vocabulary. Conclusions: Biotea delivers models and tools for metadata enrichment and semantic pro [...]
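To make the linking pattern concrete, here is a minimal sketch of the kind of statements described above: bibliographic metadata typed with BIBO and DCMI-terms, a related page linked via rdfs:seeAlso, and an equivalent resource linked via owl:sameAs. It is an assumption-laden illustration, not the Biotea model itself: the `example.org` IRIs are hypothetical, the choice of `bibo:AcademicArticle` is a guess, and the output is N-Triples for brevity whereas the dataset is distributed as RDF/XML.

```python
# Well-known namespace IRIs for the vocabularies named in the abstract.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
BIBO = "http://purl.org/ontology/bibo/"
DCTERMS = "http://purl.org/dc/terms/"


def iri(value):
    """Wrap an IRI in N-Triples angle brackets."""
    return f"<{value}>"


def literal(text):
    """Serialize a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def article_triples(article_iri, title, html_page, equivalent_iri):
    """Return N-Triples lines describing one article and its outgoing links."""
    subject = iri(article_iri)
    rows = [
        (subject, iri(RDF + "type"), iri(BIBO + "AcademicArticle")),  # assumed class
        (subject, iri(DCTERMS + "title"), literal(title)),
        (subject, iri(RDFS + "seeAlso"), iri(html_page)),        # related web page
        (subject, iri(OWL + "sameAs"), iri(equivalent_iri)),     # same article elsewhere
    ]
    return [f"{s} {p} {o} ." for s, p, o in rows]


# All IRIs below are hypothetical placeholders.
for line in article_triples(
    "http://example.org/biotea/article/123",
    'A study of "linked" documents',
    "http://example.org/pubmed/123.html",
    "http://example.org/bio2rdf/pubmed/123",
):
    print(line)
```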
Biotea-2-Bioschemas maps the Biotea model to schema.org following the approach proposed by Bioschemas. Here we present the test data used in Biotea GitHub pages, corresponding to 2596 PubMed Open Access (PMC-OA) subset publications together with the software used to render schema.org markup. Data deposited includes (i) publications retrieved from the PMC-OA API, i.e., full text in JATS/XML, (ii) ontology terms recognized in the abstracts and obtained from the NCBO Annotator, i.e., semantic annotations, and (iii) the same annotations following the PubAnnotation format. Software deposited includes (i) biotea-bioschemas-metadata, which parses JATS/XML files and creates Bioschemas markup including metadata, abstract and references, (ii) biotea-bioschemas-annotations, which parses PubAnnotation annotations and creates Bioschemas markup, and (iii) biotea-bioschemas-showcase, which uses the other two in order to display markup in a basic graphical way and render it as a script element in the HTML fo...
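Rendering schema.org markup as a script element can be sketched as follows. This is not the deposited software: the function name `render_jsonld_script` and the sample metadata are hypothetical, and the properties used (`headline`, `identifier`, `abstract`, `citation`) are generic schema.org terms rather than a confirmed Bioschemas profile.

```python
import json


def render_jsonld_script(metadata):
    """Wrap schema.org metadata in a JSON-LD script element for an HTML page."""
    markup = {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        **metadata,
    }
    body = json.dumps(markup, indent=2, ensure_ascii=False)
    # Escape "</" so text inside the JSON cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical publication metadata.
print(render_jsonld_script({
    "headline": "An example open-access article",
    "identifier": "PMC0000000",
    "abstract": "Short abstract text.",
    "citation": ["First cited work", "Second cited work"],
}))
```

A page that embeds this element exposes the same metadata to crawlers without changing what human readers see.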
Dataset of experimental protocols analyzed for academic purposes.
This is the first release.
Guidelines on how to manually annotate experimental protocols in life sciences. The annotation focuses on identifying words or phrases that relate to i) the Sample(s) tested in a protocol, ii) the Instruments used, iii) the Reagents employed, and iv) the overall Objective of a protocol (the SIRO elements).
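One way to picture a SIRO annotation is as a labeled text span. The sketch below is a hypothetical data structure, not the format prescribed by the guidelines; the class names, the helper `annotate` and the example sentence are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SiroElement(Enum):
    SAMPLE = "Sample"
    INSTRUMENT = "Instrument"
    REAGENT = "Reagent"
    OBJECTIVE = "Objective"


@dataclass(frozen=True)
class Annotation:
    start: int              # character offset where the phrase begins
    end: int                # character offset just past the phrase
    element: SiroElement


def annotate(text, phrase, element):
    """Create an annotation for the first occurrence of phrase in text."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Annotation(start, start + len(phrase), element)


def annotated_spans(text, annotations):
    """List (phrase, SIRO label) pairs in reading order."""
    ordered = sorted(annotations, key=lambda a: a.start)
    return [(text[a.start:a.end], a.element.value) for a in ordered]


# Hypothetical protocol sentence and annotations.
text = "Extract DNA from leaf tissue using CTAB buffer and a microcentrifuge."
annotations = [
    annotate(text, "leaf tissue", SiroElement.SAMPLE),
    annotate(text, "microcentrifuge", SiroElement.INSTRUMENT),
    annotate(text, "CTAB buffer", SiroElement.REAGENT),
    annotate(text, "Extract DNA", SiroElement.OBJECTIVE),
]
for phrase, label in annotated_spans(text, annotations):
    print(f"{label}: {phrase}")
```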
Guidelines for authors from journals publishing experimental protocols in life sciences.
Missing information in protocols steps
Rather than a document that is constantly being written as in the wiki approach, the Living Document (LD) is a document that also acts as a document router, operating by means of structured and organized social tagging and using existing ontologies. It offers an environment where users can manage papers and related information, share their knowledge with their peers and discover hidden associations among the shared knowledge. The LD builds upon both the Semantic Web, which values the integration of well-structured data, and the Social Web, which aims to facilitate interaction amongst people by means of user-generated content. In this vein, the LD is similar to a social networking system, with users as central nodes in the network, with the difference that interaction is focused on papers rather than people. Papers, with their ability to represent research interests, expertise, affiliations, and links to web-based tools and databanks, are the central axis for interaction amongst user...
This presentation is an introduction to the basic functionality of the BioH annotation tool.
The Annotation Ontology (AO) has proven to be a valuable resource for structuring annotations in scientific documents. We are representing elements of discourse with the AO; by using our proposed extension it is possible to mark up specific rhetorical structures and build a network of interconnected documents. The extension presented in this paper also makes it possible to represent more expressive associations across nanopublications.
In this paper we present “SMART Protocols”, a semantic and NLP-based infrastructure for processing and enacting experimental protocols. Our contribution is twofold: on the one hand, SMART Protocols delivers a semantic layer that represents the knowledge encoded in experimental protocols; on the other hand, it builds the groundwork for making use of such semantics within an NLP framework. We emphasize the semantic and NLP components, namely the SMART Protocols (SP) Ontology, the Sample Instrument Reagent Objective (SIRO) model and the text mining integrative architecture GATE. The SMART Protocols (SP) Ontology results from the analysis of over 300 experimental protocols in various domains (molecular biology, cell and developmental biology and others). The gathered terminology is then evaluated, rules are improved accordingly and then a new iteration starts. The SIRO model defines an extended layer of metadata for experimental protocols; SIRO is also a Minimal Information (MI) model...
Correspondence: ogiraldo@fi.upm.es, Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain. Abstract: We are manually annotating a corpus of 100 full-text experimental protocols in cell biology, neuroscience, developmental biology, microbiology and molecular biology. We are gathering the protocols from repositories like bio-protocols, Cold Spring Harbor Protocols and Nature protocols. Our annotation focuses on the SIRO model, namely i) the Sample tested, ii) the Instruments employed, iii) the Reagents used and iv) the overall Objective of the protocol. The SIRO model represents the minimal information for describing an experimental protocol. Our manual annotation experience illustrates how to use annotations from domain experts in the enrichment of an ontology; we are identifying key terms in the text and relating these to concepts in the ontology. By the same token, our annotation experienc...
Websites are commonly used to expose data to end users, enabling search, filter, and download capabilities making it easier for users to find, organize and obtain data relevant to their own interests. With the continuous growth of data in the Life Sciences domain, it becomes difficult for users to easily find information required for their research on one single website. Search engines should make it easier for researchers to search and retrieve collated information from multiple sites so they can better decide where to go next. Schema.org is a collaborative project providing schemas for semantically structuring data in web pages. By adding semantic mark-up it becomes easier to determine whether a web page refers to a book or a movie. It also facilitates summarizing information in a fashion similar to infoboxes used in Wikipedia. Bioschemas is a community effort aiming to extend schema.org to support mark-up for Life Sciences websites. Here we present an overview of the main types u...
In this poster we present the semantic and NLP layers in the development of our repository for experimental protocols. We have studied existing repositories for experimental protocols as well as the experimental protocols themselves. We have identified end-user features across existing repositories; we have also structured the semantics for these documents, defined by an ontology and a Minimal Information model for experimental protocols. In addition, we have built an NLP layer that makes extensive use of semantics. Our integrative approach focuses on facilitating search, retrieval and socialization of experimental protocols. We also focus on facilitating the generation of documents that are born semantic.
Although ontologies were originally conceived as abstract knowledge representations understandable by computers, there is an ever-increasing need to provide this knowledge to the user in a friendlier way. One method to facilitate this process is ontology localization, which has acquired great importance in research, in that it tries to present the ontology information in the user’s language. However, localizing ontologies is not a trivial task. In this paper we propose some ontology design patterns that can guide users in the process of assigning labels or identifiers to ontology entities. Based on our experience in localizing ontologies, we have developed some good practices as patterns following the Ontology Design Patterns initiative.
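The abstract does not list the proposed patterns, so the following is only a sketch of one common localization practice, assumed here for illustration: keep a language-neutral identifier for each entity and attach one label per language, falling back to a default language and finally to the identifier. The class `OntologyEntity`, its methods and the sample entity are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyEntity:
    identifier: str                              # language-neutral identifier
    labels: dict = field(default_factory=dict)   # language code -> label

    def add_label(self, language, text):
        """Attach or replace the label for one language."""
        self.labels[language] = text

    def label_for(self, language, fallback="en"):
        """Return the label in the user's language, else the fallback, else the id."""
        return self.labels.get(language) or self.labels.get(fallback) or self.identifier


# Hypothetical entity with English and Spanish labels.
reagent = OntologyEntity("EX_0000001")
reagent.add_label("en", "reagent")
reagent.add_label("es", "reactivo")

print(reagent.label_for("es"))
print(reagent.label_for("fr"))
```

Separating the identifier from the labels means a label can be corrected or translated without changing the identifier that other data refer to.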
Genomics & Informatics
2019, Korea Genome Organization. This is an open-access article distributed under the terms of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The total number of scholarly publications grows day by day, making it necessary to explore and use simple yet effective ways to expose their metadata. Schema.org supports adding structured metadata to web pages via markup, making it easier for data providers but also for search engines to provide the right search results. Bioschemas is based on the standards of schema.org, providing new types, properties and guidelines for metadata, i.e., providing metadata profiles tailored to the Life Sciences domain. Here we present our proposed contribution to Bioschemas (from the project "Biotea"), which supports metadata contributions for scholarly publications via profiles and web components. Biotea comprises a semantic model to represent publications together with annotated elements recognized from the scientific text; our Biotea model has been mapped to schema.org following Bioschemas standards.