Organizing gene literature retrieval, profiling, and visualization training workshops for early career researchers

B.E.A.R. GeneInfo: a tool for identifying gene-related biomedical publications through user modifiable queries

BMC bioinformatics, 2004

Once specific genes are identified through high-throughput genomics technologies, there is a need to sort the final gene list to a manageable size for validation studies. The triaging and sorting of genes often relies on the use of supplemental information related to gene structure, metabolic pathways, and chromosomal location. Yet in disease states where the genes may lack identifiable structural elements, have poorly defined metabolic pathways, or have limited chromosomal data, flexible systems for obtaining additional data are necessary. In these situations, a tool for searching the biomedical literature using the list of identified genes while simultaneously defining additional search terms would be useful. We have built a tool, BEAR GeneInfo, that allows flexible searches based on the investigator's knowledge of the biological process, thus allowing for data mining that is specific to the scientist's strengths and interests. This tool allows a user to upload a series of GenBa...

pubmed2ensembl: A Resource for Mining the Biological Literature on Genes

PLoS ONE, 2011

Background: The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms, dedicated teams manually curate publications about genes; however, for species with no such dedicated staff, many thousands of articles are never mapped to genes or genomic regions. Methodology/Principal Findings: To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features. Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data.
Conclusion/Significance: By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome informatics inspired solution to accessing the ever-increasing biomedical literature.

TPX: Biomedical literature search made easy

Bioinformation, 2012

TPX is a web-based PubMed search enhancement tool that enables faster article searching using analysis and exploration features. These features include identification of relevant biomedical concepts from search results with linkouts to source databases, concept-based article categorization, concept-assisted search and filtering, and query refinement. A distinguishing feature is the ability to add user-defined concept names and/or concept types for named entity recognition. The tool allows contextual exploration of knowledge sources by providing concept association maps derived from the MEDLINE repository. It also has a full-text search mode that can be configured on request to access local text repositories, incorporating entity co-occurrence search at sentence/paragraph levels. Local text files can also be analyzed on-the-fly.

Accessing biomedical literature in the current information landscape

Methods in molecular biology (Clifton, N.J.), 2014

Biomedical and life sciences literature is unique because of its exponentially increasing volume and interdisciplinary nature. Biomedical literature access is essential for several types of users including biomedical researchers, clinicians, database curators, and bibliometricians. In the past few decades, several online search tools and literature archives, generic as well as biomedicine specific, have been developed. We present this chapter in the light of three consecutive steps of literature access: searching for citations, retrieving full text, and viewing the article. The first section presents the current state of practice of biomedical literature access, including an analysis of the search tools most frequently used by the users, including PubMed, Google Scholar, Web of Science, Scopus, and Embase, and a study on biomedical literature archives such as PubMed Central. The next section describes current research and the state-of-the-art systems motivated by the challenges a us...

Mining the Biomedical Literature in the Genomic Era: An Overview

Journal of Computational Biology, 2003

The past decade has seen a tremendous growth in the amount of experimental and computational biomedical data, specifically in the areas of genomics and proteomics. This growth is accompanied by an accelerated increase in the number of biomedical publications discussing the findings. In the last few years, there has been a lot of interest within the scientific community in literature-mining tools to help sort through this abundance of literature and find the nuggets of information most relevant and useful for specific analysis tasks. This paper provides a road map to the various literature-mining methods, both in general and within bioinformatics. It surveys the disciplines involved in unstructured-text analysis, categorizes current work in biomedical literature mining with respect to these disciplines, and provides examples of text analysis methods applied towards meeting some of the current challenges in bioinformatics.

Enhancing access to the Bibliome: the TREC 2004 Genomics Track

Journal of biomedical discovery and collaboration, 2006

The goal of the TREC Genomics Track is to improve information retrieval in the area of genomics by creating test collections that will allow researchers to improve and better understand failures of their systems. The 2004 track included an ad hoc retrieval task, simulating use of a search engine to obtain documents about biomedical topics. This paper describes the Genomics Track of the Text Retrieval Conference (TREC) 2004, a forum for evaluation of IR research systems, where retrieval in the genomics domain has recently begun to be assessed. A total of 27 research groups submitted 47 different runs. The most effective runs, as measured by the primary evaluation measure of mean average precision (MAP), used a combination of domain-specific and general techniques. The best MAP obtained by any run was 0.4075. Techniques that expanded queries with gene name lists as well as words from related articles had the best efficacy. However, many runs performed more poorly than a simple baselin...
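The primary evaluation measure named above, mean average precision (MAP), can be illustrated with a minimal sketch. It assumes the common definition in which a topic's average precision is the mean of the precision values at each rank where a relevant document appears, divided by the total number of relevant documents for that topic. The document identifiers and the two toy topics are hypothetical and are not taken from the track's data.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one topic: mean of precision@k over the ranks k
    at which a relevant document is retrieved, normalized by the total
    number of relevant documents (unretrieved ones contribute zero)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(topics):
    """MAP: arithmetic mean of per-topic average precision.
    `topics` is a list of (ranked_ids, relevant_ids) pairs."""
    scores = [average_precision(ranked, rel) for ranked, rel in topics]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy run with two topics.
toy_topics = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                    # AP = (1/2) / 1
]
toy_map = mean_average_precision(toy_topics)
print(round(toy_map, 4))
```

Under this definition, a run is rewarded both for retrieving relevant documents and for ranking them early, which is why query expansion that pulls relevant articles higher in the list can raise MAP.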

Gene: a gene-centered information resource at NCBI

Nucleic acids research, 2015

The National Center for Biotechnology Information's (NCBI) Gene database (www.ncbi.nlm.nih.gov/gene) integrates gene-specific information from multiple data sources. NCBI Reference Sequence (RefSeq) genomes for viruses, prokaryotes and eukaryotes are the primary foundation for Gene records in that they form the critical association between sequence and a tracked gene upon which additional functional and descriptive content is anchored. Additional content is integrated based on the genomic location and RefSeq transcript and protein sequence data. The content of a Gene record represents the integration of curation and automated processing from RefSeq, collaborating model organism databases, consortia such as Gene Ontology, and other databases within NCBI. Records in Gene are assigned unique, tracked integers as identifiers. The content (citations, nomenclature, genomic location, gene products and their attributes, phenotypes, sequences, interactions, variation details, maps, expre...

Literature Mapping with PubAtlas - extending PubMed with a 'BLASTing' interface

Summit on Translational Bioinformatics, 2009

PubAtlas (www.pubatlas.org) is a web service and standalone program providing literature maps for the biomedical research literature. It accepts user-defined sets of terms (PubMed queries) as input, and permits 'BLASTing' of one set against another: for all terms x and y in these sets, deriving the results of the pairwise intersections x AND y. This all vs. all capability extends PubMed with a literature analysis interface. Correspondingly, the basic form of literature map that PubAtlas provides for exploring associations among sets of terms is an interactive tabular display, in heatmap/microarray format. PubAtlas supports development of specialized lexica: hierarchies of controlled terminology that can represent sets of related concepts or a 'user-defined query language'. PubAtlas also provides historical perspectives on the literature, with temporal query features that highlight historical patterns. Generally, it is a framework for extending the PubMed interface, and an extensible platform for producing interactive literature maps.
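The all vs. all idea described above (every term x in one set intersected with every term y in another) can be sketched in a few lines. This is only an illustration of how the pairwise query strings might be formed, not PubAtlas's actual implementation; the function name `pairwise_queries` and the example terms are hypothetical, and no search service is contacted.

```python
from itertools import product


def pairwise_queries(set_a, set_b):
    """Build one boolean query string per (x, y) pair, of the form
    "(x) AND (y)". Parentheses keep multi-word or compound terms grouped."""
    return {(x, y): f"({x}) AND ({y})" for x, y in product(set_a, set_b)}


# Hypothetical term sets; in a literature map, each pair would correspond
# to one cell of the heatmap, filled with that query's hit count.
genes = ["BRCA1", "TP53"]
topics = ["apoptosis", "DNA repair"]

queries = pairwise_queries(genes, topics)
for pair, query in queries.items():
    print(pair, "->", query)
```

Submitting each string to a search service and recording the hit count would give the matrix that a tabular or heatmap display visualizes; that step is omitted here because it depends on the service being used.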

Gelsius: A Literature-Based Workflow for Determining Quantitative Associations between Genes and Biological Processes

IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2000

An effective knowledge extraction and quantification methodology from biomedical literature would allow the researcher to organize and analyze the results of high-throughput experiments on microarrays and next-generation sequencing technologies. Despite the large amount of raw information available on the web, a tool able to extract a measure of the correlation between a list of genes and biological processes is not yet available. In this paper, we present Gelsius, a workflow that incorporates biomedical literature to quantify the correlation between genes and terms describing biological processes. To achieve this, we build different modules focusing on query expansion and document canonicalization. In this way, we improve the measurement of correlation, which is performed using a latent semantic analysis approach. To the best of our knowledge, this is the first complete tool able to extract a measure of the correlation between genes and biological processes from the literature. We demonstrate the effectiveness of the proposed workflow on six biological processes and a set of genes, by showing that correlation results for known relationships are in accordance with definitions of gene functions provided by NCI Thesaurus. On the other hand, the tool is able to propose new candidate relationships for later experimental validation. The tool is available at http://bioeda1.polito.it:8080/medSearchServlet/.