GXD: a Gene Expression Database for the laboratory mouse: current status and recent enhancements

Abstract

The Gene Expression Database (GXD) is a community resource of gene expression information for the laboratory mouse. The database is designed as an open-ended system that can integrate different types of expression data. New expression data are made available on a daily basis. Thus, GXD provides increasingly complete information about what transcripts and proteins are produced by what genes; where, when and in what amounts these gene products are expressed; and how their expression varies in different mouse strains and mutants. GXD is integrated with the Mouse Genome Database (MGD). Continuously refined interconnections with sequence databases and with databases from other species place the gene expression information in the larger biological and analytical context. GXD is accessible through the Mouse Genome Informatics Web site at http://www.informatics.jax.org/ or directly at http://www.informatics.jax.org/menus/expression_menu.shtml

INTRODUCTION

Gene expression patterns provide important insight into the molecular mechanisms of development, differentiation and disease. The mouse serves as a pivotal animal model because it is closely related to the human and because tissues from many different mouse strains and mutants are readily available for detailed expression analysis.

To cope with the large volumes and the complexity of gene expression data for the laboratory mouse, we developed the Gene Expression Database (GXD) (1–3). GXD approaches gene expression information in a comprehensive manner by addressing the following three key issues.

(i) Integration of expression data. The ultimate goal of gene expression analysis is to determine which RNAs and proteins are produced from a given gene, and where, when and in what amounts these products are expressed at the cellular level. Each of the currently available expression detection methods provides only partial clues to these questions. GXD is therefore designed to integrate different types of expression data. Integration is achieved by storing ‘primary’ expression data. ‘Primary’ expression results such as the time and tissue of expression, the genetic origin of the sample, the number and sizes of detected bands and sequence information are described together with the molecular probe, the expression assay type and the experimental conditions used. In this format new data and new assay types can be added readily, and novel insights resulting from new data can be represented dynamically.

(ii) Standardized description of expression pattern. GXD describes expression patterns by using an extensive dictionary of anatomical terms developed in collaboration with our Edinburgh colleagues (4). The anatomical dictionary names the tissues and structures for each developmental stage, and organizes the terms hierarchically from body region or system to tissue to tissue substructure (4). This model enables an integrated description of expression patterns for various assays with differing spatial resolution, computational analysis of expression patterns at different levels of detail and continuous extensions of the anatomical hierarchy itself. Expression records are directly connected to digitized images of original expression data. Our Edinburgh colleagues are developing a 3D atlas/graphical gene expression database for mouse development that will enable 3D graphical storage, display and analysis of in situ expression patterns (5). In the longer term, GXD and the 3D atlas/graphical gene expression database will be integrated to generate the Mouse Gene Expression Information Resource that will combine text-based and image-based analysis methods (1).

(iii) Integration with other databases. To be really useful, gene expression data must be placed in the larger biological context. GXD is integrated with the Mouse Genome Database (MGD) (6) to enable global analysis of genotype, expression and phenotype information for the laboratory mouse, and has comprehensive links to sequence databases (7–11), OMIM, MEDLINE, and to databases from other species. GXD actively maintains and extends those links to foster a comprehensive analysis of gene expression information.
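The idea of storing 'primary' results together with the assay that produced them, as outlined in (i), can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the class names, field names and example values (GeneA, probe-1, etc.) are hypothetical and do not reflect the actual GXD schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Result:
    """One primary expression result (hypothetical fields)."""
    structure: str   # anatomical term
    age: str         # developmental age, e.g. "E9.5"
    detected: bool   # was expression detected?


@dataclass
class Assay:
    """One gene, one method, one probe, one or more results."""
    gene: str
    assay_type: str
    probe: str
    results: List[Result] = field(default_factory=list)


def integrate(assays: List[Assay], gene: str) -> Dict[Tuple[str, str], List[Tuple[str, bool]]]:
    """Group all results for a gene by (structure, age) across assay types."""
    merged: Dict[Tuple[str, str], List[Tuple[str, bool]]] = {}
    for assay in assays:
        if assay.gene != gene:
            continue
        for r in assay.results:
            merged.setdefault((r.structure, r.age), []).append(
                (assay.assay_type, r.detected)
            )
    return merged


# Invented example records, for illustration only.
assays = [
    Assay("GeneA", "RNA in situ hybridization", "probe-1",
          [Result("midbrain", "E9.5", True)]),
    Assay("GeneA", "immunohistochemistry", "antibody-2",
          [Result("midbrain", "E9.5", True), Result("heart", "E9.5", False)]),
    Assay("GeneB", "northern blot", "probe-3",
          [Result("embryo", "E9.5", True)]),
]
view = integrate(assays, "GeneA")
```

Because each result keeps its assay type and probe, a new assay type only adds records; the combined view per gene is recomputed from the stored primary data rather than edited by hand.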

GXD is implemented in the Sybase relational database management system. A WWW interface provides access to the database using HTML-based query forms combined with CGI scripts. Direct SQL access is also possible. Users wishing to open an SQL account may contact MGI User Support (see below). The design of GXD, its data fields and their significance for database queries, and the WWW query interface have been described in more detail previously (3). Here we report on the current status of the GXD project and progress made during the last year.

CURRENT STATUS AND RECENT ENHANCEMENTS

Since the release of GXD 1.0 in July 1998, the database has been updated and new expression data have been made available on a daily basis. Therefore, GXD’s data content and its utility as a community resource have grown considerably during the last year. Data are acquired from the literature by editorial staff and, so far to a limited extent, via electronic submission from laboratories. We have developed a new tool, the Gene Expression Notebook, that will facilitate standardized annotation of expression data and electronic submission of these data to GXD. We have continued our collaborative efforts on establishing links to other resources and developing new classification schemes for gene products that will provide additional sorting parameters for filtering complex expression data and extracting meaningful biological information. In the following, we describe these achievements in more detail.

The GXD Index

Since the start of the GXD project, we have identified all newly published articles documenting data on endogenous gene expression during mouse development and indexed these articles with respect to Authors, Journals, Gene(s) and Embryonic Age(s) analyzed and Expression Assays used. The GXD Index is updated daily. As of September 30, 1999, it contained 14 338 entries covering 4822 references and expression information for 3420 genes. The data are searchable via the GXD Index query form. It enables queries such as: ‘What publications contain immunohistochemistry data for Ncam at day 9 of mouse development?’. The GXD Index is thus a powerful tool for locating specific types of expression information in the literature.
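A query of the kind quoted above can be sketched in Python as a filter over index entries. The entry layout, the reference labels and the rule that 'day 9' covers ages from 9.0 up to but not including 10.0 are assumptions made for this illustration; they are not taken from the GXD Index implementation.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    """One hypothetical GXD Index entry: reference x gene x age x assay."""
    reference: str
    gene: str
    age_days: float
    assay: str


def find_references(index: List[IndexEntry], gene: str, assay: str, day: int) -> List[str]:
    """Return references with data for `gene` by `assay` at embryonic `day`.

    Assumption: an age such as 9.5 counts as 'day 9'.
    """
    return sorted({
        e.reference
        for e in index
        if e.gene == gene and e.assay == assay and int(e.age_days) == day
    })


# Invented entries, for illustration only.
INDEX = [
    IndexEntry("Ref-1", "Ncam", 9.0, "immunohistochemistry"),
    IndexEntry("Ref-2", "Ncam", 9.5, "RNA in situ hybridization"),
    IndexEntry("Ref-3", "Ncam", 9.5, "immunohistochemistry"),
    IndexEntry("Ref-4", "Ncam", 12.5, "immunohistochemistry"),
    IndexEntry("Ref-5", "En1", 9.0, "immunohistochemistry"),
]
hits = find_references(INDEX, "Ncam", "immunohistochemistry", 9)
```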

Gene expression data

At present, GXD includes RNA in situ hybridization and immunohistochemistry data, northern and western blot data, RT–PCR data, RNAase protection data and mouse cDNA/EST data.

The majority of mouse cDNA data is obtained from the IMAGE consortium and the WashU/HHMI projects (12,13). The database currently contains data for >354 000 IMAGE mouse cDNA clones and their corresponding ESTs. Access to mouse cDNA information is provided via the ‘cDNA clone query form’. This form is designed for interrogating expression information deduced from the tissue or cell line origin of cDNAs via queries such as: ‘From which tissues and developmental stages have cDNAs for the gene Lepr been isolated?’ or ‘For which genes on chromosome 4 have cDNAs been isolated from kidney?’.

The Gene Expression Data query form provides access to all other types of expression data. It enables a variety of important ‘expression’ queries, for example: ‘What assays have been used to study expression of En1 and what are the results reported?’; ‘Where and when is Notch1 expressed?’; ‘What genes are expressed in diencephalon or in any substructure of the diencephalon?’; ‘For which genes was expression analyzed but not detected in either the diencephalon or any structure that contains the diencephalon using RT–PCR, RNA in situ hybridization or immunohistochemistry experiments?’; ‘Which genes within 3 cM of Pltr6 are expressed in muscle?’. Clearly, GXD provides powerful query capabilities. Equally clearly, their utility is proportional to the data content in GXD. Since GXD 1.0 was launched in July 1998 with one large set of electronically submitted RT–PCR data, the GXD editorial staff have been entering expression data, mainly from the literature, on a daily basis. As of September 30, 1999, GXD included 47 368 annotated expression results from 2944 assays that together cover expression information for 1166 genes. (As defined in GXD, one ‘assay’ analyzes the expression of one gene in one or multiple samples by a specific method using a specific probe under specific experimental conditions.) Almost all annotated expression results are linked to digitized images of original data. Images are either scanned from the literature or, as in the case of the journal Development, obtained from the publisher in electronic format. A large proportion of the data in GXD are derived from RNA in situ hybridization or immunohistochemistry experiments. Further, more than one third of the assays include expression analysis in mutant mice (targeted mutations for the most part) and the number of those studies is increasing. This illustrates the biological complexity of the data captured by GXD. 
The amount of expression data in GXD should grow rapidly in the near future, as we receive more data via electronic submission and include new types of expression data in GXD (see below).
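The two diencephalon queries quoted above rely on the anatomical hierarchy in opposite directions: a positive result in a substructure implies expression in the enclosing structure, whereas a negative result for an enclosing structure implies absence in the structures it contains. A minimal Python sketch of this reasoning follows. The tiny hierarchy, the gene names and the result tuples are invented for illustration, and the sketch ignores the stage-specific organization of the real anatomical dictionary.

```python
# child -> parent; a deliberately small, hypothetical slice of an anatomy tree
PARENT = {
    "brain": None,
    "forebrain": "brain",
    "diencephalon": "forebrain",
    "thalamus": "diencephalon",
    "hypothalamus": "diencephalon",
    "midbrain": "brain",
}


def ancestors(term):
    """All structures that contain `term`, nearest first."""
    out = []
    parent = PARENT[term]
    while parent is not None:
        out.append(parent)
        parent = PARENT[parent]
    return out


def descendants(term):
    """All substructures of `term`, at any depth."""
    children = [t for t, p in PARENT.items() if p == term]
    out = list(children)
    for child in children:
        out.extend(descendants(child))
    return out


def genes_expressed_in(results, term):
    """Genes detected in `term` or in any of its substructures."""
    scope = {term, *descendants(term)}
    return sorted({g for g, s, detected in results if detected and s in scope})


def genes_not_detected_in(results, term):
    """Genes analyzed but not detected in `term` or any structure containing it."""
    scope = {term, *ancestors(term)}
    return sorted({g for g, s, detected in results if not detected and s in scope})


# (gene, structure, detected) tuples; invented values
RESULTS = [
    ("GeneA", "thalamus", True),
    ("GeneB", "forebrain", False),
    ("GeneC", "midbrain", True),
    ("GeneD", "diencephalon", False),
]
expressed = genes_expressed_in(RESULTS, "diencephalon")
absent = genes_not_detected_in(RESULTS, "diencephalon")
```

In this toy data set, GeneA is returned by the first query because the thalamus is a substructure of the diencephalon, and GeneB is returned by the second because the forebrain contains the diencephalon.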

Alternate Transcripts Report

As a new database report, we provide an extensive list of mouse genes that produce alternate transcripts. For several years, MGD and GXD editorial staff have flagged molecular probe records if that probe has been described in publications as ‘derived from a gene that produces alternate transcripts’. The Alternate Transcripts Report is based on this information (14). The list is updated daily and, as of September 30, 1999, included 833 genes. In the future, we will provide similar but more comprehensive reports that will be based on primary expression data in GXD.

Electronic data submission: the Gene Expression Notebook

Annotation of expression data from the literature is time-intensive because data must be extracted from published reports and brought into a standardized format. This effort is limited by the number of available database curators. Furthermore, due to space limitations standard publications normally include only a small portion of the primary data generated by the authors. The GXD project is therefore putting a strong emphasis on direct data submission from laboratories. Based on our previous prototype work with the Gene Expression Annotator (2) we have, during the last year, developed the Gene Expression Notebook (Fig. 1). The application is designed as a tool that can be used as a laboratory notebook to manage expression data locally. In addition, data can be exported in standardized format for electronic data submissions to GXD. The Gene Expression Notebook is implemented in Excel, an application available on both the Macintosh and PC platforms and familiar to many biologists. Expression results, images, molecular probes, specimens, experimental conditions, etc. can be entered, and researchers can easily add fields to store laboratory-specific information (such as the place where specimens are stored or the date of preparation). They can include all expression experiments and data generated in the laboratory and later designate which data to include in a submission. Data from laboratory-specific fields are automatically filtered out by the submission software. Thus, electronic submission of expression data is becoming a small extension of laboratory work, rather than a large extra burden for researchers. The Gene Expression Notebook is currently being tested by a number of laboratories and refined based on their feedback. In the near future we plan to make the application available to the broader research community. 
Data submissions will receive accession numbers that can be cited in publications, and they will be subject to several levels of review as described previously (3). The Gene Expression Notebook is primarily designed for conventional laboratories that study expression on a gene-by-gene basis. We are also working with groups that generate mouse expression data in a high-throughput fashion. Those laboratories normally maintain their own laboratory databases from which we can download data in bulk.
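The submission step described above, in which researchers designate records for submission and laboratory-specific fields are filtered out, can be sketched as follows. The Notebook itself is implemented in Excel; this Python fragment is only a conceptual illustration, and the field names, the 'submit' flag and the example rows are hypothetical.

```python
# Fields assumed to belong to the standardized submission format (hypothetical).
STANDARD_FIELDS = {"gene", "assay_type", "probe", "specimen", "structure", "result"}


def export_for_submission(rows):
    """Keep only rows flagged for submission and drop laboratory-specific fields."""
    return [
        {key: value for key, value in row.items() if key in STANDARD_FIELDS}
        for row in rows
        if row.get("submit")
    ]


# Invented notebook rows; 'freezer_box' and 'prepared_on' stand in for
# laboratory-specific fields added by the researcher.
notebook = [
    {"gene": "GeneA", "assay_type": "RNA in situ hybridization", "probe": "probe-1",
     "specimen": "spec-7", "structure": "midbrain", "result": "present",
     "freezer_box": "B12", "prepared_on": "1999-06-01", "submit": True},
    {"gene": "GeneB", "assay_type": "northern blot", "probe": "probe-3",
     "specimen": "spec-9", "structure": "embryo", "result": "present",
     "freezer_box": "C03", "submit": False},
]
submission = export_for_submission(notebook)
```

Filtering against a fixed set of standard fields means that any field a laboratory adds locally is excluded by default, without the submission tool needing to know about it in advance.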

Figure 1.

The Gene Expression Notebook. Expression results and images are entered into ‘assay sheets’ together with detailed descriptions of molecular probes, specimens and experimental conditions used. Parts of the RNA in situ hybridization assay sheet are shown at the bottom; additional worksheets for specimen and probe information are shown at the top. Information entered into the assay sheets about ‘genes’, ‘probes’, ‘probe preparation methods’, ‘specimens’ and ‘specimen preparation methods’ is automatically stored on separate worksheets (top). Alternatively, probes and specimen information, etc., can be entered directly into the respective worksheet as probes and specimens are generated in the laboratory. Identifier fields (the white fields) on the assay sheet harbor pull-down menus that list all identifiers (names) of the objects in corresponding worksheets. Upon selecting, for example, the name of a specific probe, all the other information for that probe is automatically inserted into the assay sheet. Expression patterns can be described in the form of figure legends using plain text descriptions, by entering ad-hoc terms for tissues and structures in the results record, or, preferably, by entering terms from the anatomical dictionary into the results record using cut and paste procedures. Example data shown are from GXD entry MGI:1269967; data annotated from de la Pompa et al., Development (1997), 124, 1139–1148.

During the past few years, large scale cDNA sequencing projects have been the most productive tool for gene discovery and have generated large numbers of probes that can now be used in gene expression studies. High-throughput expression methods such as the analysis of high density cDNA or oligonucleotide arrays with complex cDNA probes can produce huge amounts of expression data (15–19). Computational and statistical methods will be very important for analyzing and understanding these data (20). However, the utility of these methods will be limited if the array expression data are not integrated with additional biological information. We are addressing this issue in the following manner.

As part of their annotation work, GXD and MGD are establishing links between ‘genes’ in MGD/GXD and sequence entries in DDBJ/EMBL/GenBank (7–9). This information provides the basis for cross-references between ‘genes’ in our database and protein entries in SWISS-PROT (11) that are maintained and curated by editorial staff, in collaboration with SWISS-PROT. Based on these links, we derive additional cross-references to mouse nucleotide sequences in DDBJ/EMBL/GenBank. The NCBI Unigene project (21) uses our gene/nucleotide sequence associations to putatively assign ESTs and Unigene clusters to mouse genes, and we are in the process of implementing similar links to Unigene. In this way, EST data become integrated with all the other information for the corresponding gene that is or will be available in GXD and MGD, and the resources they link to.

Cross-references to SWISS-PROT also provide access to biochemical and structural classification schemes and increased interconnectivity with many additional resources. Along similar lines, we are collaborating with FlyBase (22), the Saccharomyces Genome Database (23) and MGD in building shared controlled vocabularies for describing cellular functions and locations of gene products and biological processes they are involved in. These classification schemes, combined with skilled data curation, will provide important new search parameters for expression data.

FUTURE DIRECTIONS

We will continue to populate GXD with expression data, improve its database infrastructure, and develop advanced query and display tools. Array-based methods make it possible to analyze the expression of tens of thousands of genes in parallel in different tissues that can be derived from many different mouse strains and mutants. GXD will be expanded to include these data. It will foster the analysis of these and other expression data by combining different types of expression information, by integrating them with genotype and phenotype/disease data for mouse strains and mutants in MGD, by links to other databases, and by the development of new biological search parameters.

USER SUPPORT

GXD provides user support through online documentation and dedicated User Support Staff. User Support can be contacted by Email (mgi-help@informatics.jax.org), phone (+1 207 288 6445) or Fax (+1 207 288 6132).

CITING GXD

To reference the database itself, please cite this article. For references to specific GXD data, we suggest the following format: these data were retrieved from the Gene Expression Database (GXD), Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, Maine, USA, WWW URL: http://www.informatics.jax.org [type in date (month, year) when you retrieved the data cited].

SUPPLEMENTARY MATERIAL

Relevant URL links are available at NAR Online.

ACKNOWLEDGEMENTS

We would like to thank Jon Beal for his work on the Gene Expression Notebook, Lori Corbani, Glenn Colby, John Gilbert and Prita Mani for help in software development and database maintenance, Marjorie May for help in user support and Janice Ormsby for secretarial assistance. It is a pleasure to thank our colleagues Drs Jonathan Bard and Matthew Kaufman at the University of Edinburgh and Drs Duncan Davidson and Richard Baldock at the MRC Human Genetics Unit in Edinburgh for making the Dictionary of Mouse Developmental Anatomy available to us. We also would like to thank the journal Development for making electronic image files available to us on a regular basis. The Gene Expression Database is supported by NIH grant HD33745.

REFERENCES
