INTERFEROME: the database of interferon regulated genes

Abstract

INTERFEROME is an open access database of types I, II and III Interferon regulated genes (http://www.interferome.org) collected from analysing expression data sets of cells treated with IFNs. This database of interferon regulated genes integrates information from high-throughput experiments with annotation, ontology, orthologue sequences from 37 species, tissue expression patterns and gene regulatory information to enable a detailed investigation of the molecular mechanisms underlying IFN biology. INTERFEROME fulfils a need in infection, immunity, development and cancer research by providing computational tools to assist in identifying interferon signatures in gene lists generated by high-throughput expression technologies, and their potential molecular and biological consequences.

INTRODUCTION

The interferons (IFNs) are a family of cytokines identified more than 50 years ago as potent antiviral proteins. In addition to their antiviral, antibacterial and anti-parasitic host-defense functions, they are now also recognized as crucial regulators of cell proliferation, differentiation, survival and death, and as activators of specialized cell functions, particularly in the immune system (1,2). They play important roles in infectious and inflammatory diseases, autoimmunity and cancer (3). Their biological properties have led to their use as therapeutics in certain viral infections, such as chronic hepatitis C virus (HCV) and hepatitis B virus (HBV) infections, in combination with antiretroviral therapy in AIDS, in haematological cancers and solid tumours, and in multiple sclerosis (4); conversely, they are targeted for blockade in autoimmune diseases such as systemic lupus erythematosus (5). Therefore, understanding the molecular pathways and genes regulated by IFNs should lead to a better understanding of disease mechanisms and to more effective clinical use of both IFNs and related cytokines that utilize similar signal transduction pathways.

There are three types of IFNs, namely types I (IFN α, β, δ, ε, ζ, κ, ν, τ, ω), II (IFNγ) and III (IFNλ) (3). The members of each IFN type share sequence similarity, signal via a specific cell surface receptor complex (IFNAR, IFNGR or IFNLR) and are produced by particular cells in response to specific stimuli (6,7). Despite these differences, all types of IFNs mediate their effects by activating multiple signaling pathways, some of which are common, including the well-characterized JAK-STAT pathway (8) as well as other pathways (9). The JAK-STAT signaling events lead to the activation of transcription factor complexes such as IFN stimulated gene factor 3 (ISGF3) and homo- or hetero-dimerized signal transducers and activators of transcription (STAT) molecules. These complexes then bind to interferon response elements, such as the interferon stimulated response element (ISRE), gamma activated sequence (GAS) and STAT-binding sites, in the promoters of interferon regulated genes (IRGs), resulting in the transcriptional activation of many IRGs (10). The advent of expression profiling technology has enabled us to ascertain a broad picture of gene expression changes resulting from the action on target cells of cytokines such as IFNs. Some IRGs are well characterized in relation to specific functions (e.g. the antiviral Mx proteins, 2′,5′-OAS, PKR and RNase L) (11,12). However, many IRGs have no known function, or have functions that have not previously been associated with IFNs. The challenge now is first to determine the scope of the genes regulated by IFNs, and then to understand why some, and not others, are activated or repressed in specific circumstances and to relate this to biological outcomes.

In this exercise, we have catalogued IRGs from all published reports where cells were treated with an IFN of any type, and incorporated these into a searchable database (http://www.interferome.org). The utility of the database is to enable the reliable identification of interferon regulated gene signatures from high-throughput data sets (e.g. microarray and proteomic data). It will also assist in identifying regulatory elements and enable comparison of normal tissue expression of IRGs in human and mouse. This will have implications for determining hitherto unknown roles of IFNs or IRGs in specific circumstances, whether in regulating homoeostasis or disease pathogenesis, or in understanding the basis of patient responses to IFN or related therapies.

OVERVIEW

The INTERFEROME database enables the identification of interferon regulated genes and integrates IRG-specific sequence, annotation and regulatory information. IRGs in INTERFEROME were collected manually by analysing in-house IFN microarray data and published microarray/proteomic data sets. In addition, extensive manual literature mining and searching of regulatory databases, such as Transcription Regulatory Region Database (TRRD) (13), was performed to identify previously characterized IRGs.

Sequence and annotation data for IRGs were mined from public databases such as Ensembl v.49 (14) and NCBI Entrez Gene (15) using an automated data mining and integration software pipeline developed in our laboratory. The main component of this pipeline is the data downloader program, which is written in the Perl programming language and interacts with the application programming interfaces (APIs) provided by external databases. It utilizes the list of manually curated IRGs to extract corresponding sequence and annotation information from public databases and ensures that the sequence and annotation data in INTERFEROME are current. The sequence data collected include gene (59 308), amino acid (79 215), putative promoter (79 206), and 3′- and 5′-UTR sequences. Annotation includes multiple gene identifiers and gene alias information, gene ontologies and orthologue information. Orthologues of human and mouse IRGs were collected from 37 species and will enable detailed phylogenetic analysis of these proteins. Basal tissue expression data for human and mouse IRGs were collected from the Novartis Foundation GNF portal (16). Quality control of sequence data included BLAST2GO (17) analysis to ensure that amino acid sequences corresponded to the relevant IRGs. These diverse data sets were then processed to generate regulatory information such as transcription factor binding sites (TFBS), protein domains and motifs, gene ontologies and normalized tissue expression.

An example of possible search results is shown in the screenshots depicted in Figure 1. In addition to identifying IRG signatures in gene lists, we provide the ability to identify putative TFBS in the proximal promoter regions of IRGs. The integration of expression information also makes it possible to search for normal tissue expression of IRGs in 79 human and 61 mouse tissues. We have also generated BLAST databases of IRG sequences, integrated with NCBI BLAST software, enabling users to perform BLAST analysis (15). Addition of data sets and incorporation of new functionality will be announced in the INTERFEROME blog (http://www.interferome.org/blog). As new data sets are incorporated, INTERFEROME will grow significantly and enable more sophisticated analysis of IRGs.

Figure 1.

The functionality of the INTERFEROME database is summarized in Figure 1, which displays some of the screen captures resulting from searching the database with a gene list. Its main functionality is the IRG signature page which identifies IRGs that were present in the gene list and displays associated annotation. In addition INTERFEROME's expression and promoter analysis capability is highlighted. An example search is demonstrated in Supplementary Figures.

DESIGN AND IMPLEMENTATION

Data were collected from in-house IFN-treated microarrays and more than 28 publications (18–45) identified through literature searches where high-throughput analysis (microarray or proteomic) was performed on cells/tissues treated only with IFNs. The statistical analysis of the data sets in the relevant publications adhered to currently accepted methods and standards. Gene lists were analysed, and genes that demonstrated 1.5-fold or more differential expression were identified as IRGs. This included up- and downregulated genes from 23 human data sets, five mouse data sets and single chimp, cow and sheep data sets. There were 28 type I IFN data sets, 11 type II and three type III data sets. The subtypes used include type I IFNs (IFNA1a, IFNA1, IFNA2, IFNA2a, IFNA2b, IFN conA, IFNB, IFNT and IFNE1), type II (IFNG) and type III (IFNL). Overall, 1996 human and 1925 mouse IRGs were identified.
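The 1.5-fold selection criterion described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build INTERFEROME: the function name, the input format (a treated/untreated expression ratio per gene) and the gene labels are hypothetical, and treating a ratio of 1/1.5 or less as 1.5-fold downregulation is an assumption about how the threshold was applied.

```python
def select_irgs(fold_changes, threshold=1.5):
    """Return genes changed at least `threshold`-fold in either direction.

    fold_changes maps a gene identifier to the ratio of treated to
    untreated expression (hypothetical input format).
    """
    selected = {}
    for gene, ratio in fold_changes.items():
        if ratio <= 0:
            continue  # a fold change is undefined for non-positive ratios
        if ratio >= threshold:
            selected[gene] = "up"
        elif ratio <= 1.0 / threshold:
            # Assumption: 1.5-fold downregulation means ratio <= 1/1.5
            selected[gene] = "down"
    return selected


# Placeholder gene labels, not real identifiers
example = {"GENE_A": 4.2, "GENE_B": 1.2, "GENE_C": 0.5, "GENE_D": 0.9}
print(select_irgs(example))
```

In this toy input, only GENE_A and GENE_C pass the threshold; GENE_B and GENE_D change by less than 1.5-fold and are not reported.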

INTERFEROME uses an Apache web server in a Linux environment, with a collection of Perl CGI and PHP scripts providing the user interface, coupled to a MySQL relational database management system. The PHP scripts include a tag cloud generator, a Venn visualizer and all the search form pages. Two Perl CGI programs, which generate TFBS graphics [utilizing BioPerl graphics modules (46)] and hierarchically clustered heat maps [using Cluster 3.0 and matrix2png (47)] of tissue expression data, provide the other data visualization methods.

USING INTERFEROME

Searching interferon regulated gene signatures

The main function of this database is to enable the identification of interferon regulated genes in a gene list generated from a high-throughput microarray or proteomic experiment. This ability to detect an interferon signature is invaluable in identifying interferon regulated pathways involved in innate immune, inflammatory or anti-tumour responses and will enable the extraction of more meaningful information about the biology associated with a set of differentially regulated genes. A mouse or human gene list of up to 100 genes can be submitted to INTERFEROME, with the option of identifying either all IRGs or only those that are induced by a specific interferon type (type I, II or III). The submitted gene list must consist of Ensembl IDs, as the use of gene names and gene symbols may result in ambiguities (48). A link to external gene identifier conversion resources is provided to enable the conversion of other gene identifiers (49). When the list is submitted to the INTERFEROME server, a new page listing the identified IRGs is generated (Supplementary Figure S1A and B). This page provides information such as the Ensembl ID, Entrez Gene ID and Gene Symbol, all of which are clickable links that direct the user to the relevant page in external database resources such as the Ensembl genome browser, NCBI Gene page or HGNC gene nomenclature page. In addition, a short description of each gene and its chromosomal location are provided, together with links to external bioinformatic resources such as iHOP (50), UniProt (51) and the manually curated VEGA database (52), to enable the collection of additional gene- and protein-specific information. A list of IFN types (I, II or III) that induce the gene is shown, together with the number of data sets (in brackets) in which the gene was differentially regulated. This enables the differentiation of core IRGs (those that are induced by most IFNs in most situations) from those that are induced in specific situations.
This page also displays a tag cloud of all the ontologies associated with the IRG list, where ontologies (obtained from Ensembl and GO) with higher frequencies are shown in a larger font. Clicking on an ontology term redirects the user to the gene ontology organization web page (53) for that particular term. Finally, a Venn diagram visualization of type I versus type II is also generated to show the proportional induction by different IFNs. A collection of help pages offers assistance in using INTERFEROME, and an example search is available in Supplementary Data.
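The logic of a signature search can be sketched in Python as a lookup of submitted identifiers against a catalogue of IRGs, followed by a tally of type I versus type II membership of the kind a Venn diagram would display. This is only an illustration under stated assumptions: the in-memory catalogue, its placeholder identifiers, the data set counts and both function names are hypothetical, and the real service is implemented with PHP and Perl scripts over a MySQL database.

```python
# Hypothetical catalogue: Ensembl-style ID -> {IFN type: number of data sets}
# The identifiers and counts are placeholders, not real INTERFEROME records.
CATALOGUE = {
    "ENSG_EXAMPLE_1": {"I": 12, "II": 5},
    "ENSG_EXAMPLE_2": {"I": 3},
    "ENSG_EXAMPLE_3": {"II": 2, "III": 1},
}


def find_signature(gene_ids, ifn_type=None, max_genes=100):
    """Return the submitted genes that are catalogued IRGs.

    If ifn_type is given ("I", "II" or "III"), keep only genes reported
    as regulated by that IFN type.
    """
    if len(gene_ids) > max_genes:
        raise ValueError(f"gene list exceeds {max_genes} entries")
    hits = {}
    for gene_id in gene_ids:
        record = CATALOGUE.get(gene_id)
        if record is None:
            continue  # not a known IRG
        if ifn_type is not None and ifn_type not in record:
            continue  # known IRG, but not for the requested IFN type
        hits[gene_id] = record
    return hits


def venn_counts(hits):
    """Count hits regulated by type I only, type II only, or both."""
    type_i = {g for g, rec in hits.items() if "I" in rec}
    type_ii = {g for g, rec in hits.items() if "II" in rec}
    return {
        "I only": len(type_i - type_ii),
        "II only": len(type_ii - type_i),
        "both": len(type_i & type_ii),
    }


submitted = ["ENSG_EXAMPLE_1", "ENSG_EXAMPLE_3", "NOT_AN_IRG"]
hits = find_signature(submitted)
print(hits)
print(venn_counts(hits))
```

Keeping the number of supporting data sets per IFN type in each record mirrors the bracketed counts shown on the results page, which are what allow core IRGs to be distinguished from context-specific ones.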

Searching normal tissue expression of IRGs

The human and mouse expression data sets were downloaded from the GNF portal and mapped to IRGs, and the expression of each gene was log2 transformed and normalized across tissues to a mean of 0 and an SD of 1. When a list of either human or mouse IRGs is submitted, a results page with a heat map depicting the tissue expression of the genes in 79 human or 61 mouse tissues is displayed (Supplementary Figure S2). Red and blue denote up and down regulation of gene expression, respectively, and the intensity of the colour is proportional to the intensity of expression. This heat map enables the identification of IRGs with tissue-specific expression. It should be noted that the heat map reflects the tissues where these IRGs are basally expressed, not their expression after induction by IFNs.
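The per-gene normalization described above (log2 transformation followed by scaling to a mean of 0 and an SD of 1 across tissues) can be sketched as follows. The function name, the input format and the example intensities are hypothetical, and the use of the population SD rather than the sample SD is an assumption, since the text does not say which was used.

```python
import math
import statistics


def normalize_across_tissues(expression):
    """Log2 transform one gene's expression and z-score it across tissues.

    expression maps a tissue name to a positive raw intensity
    (hypothetical input format).
    """
    logged = {tissue: math.log2(value) for tissue, value in expression.items()}
    mean = statistics.fmean(logged.values())
    # Assumption: population SD; the sample SD would be statistics.stdev
    sd = statistics.pstdev(logged.values())
    if sd == 0:
        # Identical expression in every tissue: no tissue specificity
        return {tissue: 0.0 for tissue in logged}
    return {tissue: (x - mean) / sd for tissue, x in logged.items()}


# Made-up intensities for a single gene
z = normalize_across_tissues({"liver": 8.0, "spleen": 32.0, "brain": 2.0})
for tissue, score in z.items():
    print(f"{tissue}: {score:+.2f}")
```

After this scaling, positive values mark tissues where the gene is expressed above its own cross-tissue mean and negative values mark tissues below it, which is what makes rows of a heat map comparable between genes with very different absolute intensities.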

IRG promoter analysis

INTERFEROME stores more than 274 000 TFBS and enables the visualization of selected TFBS, such as ISRE, STAT, IRF, ICSBP, TATA and NFKB, that are known to be important in the regulation of IRGs. TFBS are displayed for the 1000 bp immediately upstream of the transcription start site (TSS), together with the 5′-UTR. The form also provides the user with the option of displaying selected TFBS within all the known transcripts of a particular gene (Supplementary Figure S3A and B). Previous analysis of IRGs suggests that core promoters for most IRGs are located within the first 1000 bp upstream of the TSS (54). However, ∼64% of IRGs collected in INTERFEROME have ISRE elements within the 10 kb upstream of the TSS. Distal promoter elements that lie beyond the first 1000 bp may be important for transcriptional regulation of IRGs; thus, care should be taken in interpreting preliminary promoter analysis results, and predictions should be validated by experimental techniques such as chromatin immunoprecipitation (ChIP). We utilized Transfac Professional version 12.1 to preprocess sequences and store binding sites and their positions within INTERFEROME. The Transfac database is a collection of position weight matrices for transcription factor binding sites (55). These matrices are generated from experimentally derived TFBS (using sequences that are known to be bound with high affinity by particular transcription factors), obtained either by in vitro selection of tightly bound oligonucleotides (SELEX) or by ChIP or bandshift experiments that identify specific transcription factor–DNA interactions. These matrices summarize how many sequences have a given nucleotide at a given position in the transcription factor binding site. A position weight matrix (PWM) can be used to define a position-specific scoring matrix (PSSM), whose values are proportional to binding free energies (56). The search form enables the selection of core and matrix match cut-offs between 0 and 1.
When these match values are closer to one, only higher stringency TFBS are displayed. The ability to specify the cut-off values enables the user to determine the optimal values necessary to minimize the identification of false-positive and false-negative TFBS.
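To make the relationship between count matrices, PSSMs and a 0 to 1 cut-off concrete, the following Python sketch converts a toy count matrix into log-odds scores and scans a sequence with a match threshold. It is a simplified illustration and not the Transfac scoring procedure: the min-max rescaling of the score to the 0 to 1 range, the pseudocount, the uniform background frequency, the toy motif and the function names are all assumptions, and the separate core match score is not modelled.

```python
import math

BASES = "ACGT"


def counts_to_pssm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position nucleotide counts into log2-odds scores."""
    pssm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pssm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pssm


def scan(sequence, pssm, cutoff=0.8):
    """Return (start, site, score) for windows scoring at or above cutoff.

    Scores are rescaled so that 0 is the worst possible site and 1 the
    best possible site for this matrix (an assumed normalization).
    """
    width = len(pssm)
    lowest = sum(min(col.values()) for col in pssm)
    highest = sum(max(col.values()) for col in pssm)
    if highest == lowest:
        return []  # uninformative matrix: every site would score the same
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases such as N
        raw = sum(col[base] for col, base in zip(pssm, window))
        score = (raw - lowest) / (highest - lowest)
        if score >= cutoff:
            hits.append((start, window, round(score, 3)))
    return hits


# Toy 4-bp motif built from ten made-up binding sites
toy_counts = [{"T": 9, "A": 1}, {"T": 10}, {"T": 10}, {"C": 9, "G": 1}]
toy_pssm = counts_to_pssm(toy_counts)
print(scan("GGTTTCAA", toy_pssm, cutoff=0.9))
print(len(scan("GGTTTCAA", toy_pssm, cutoff=0.0)))
```

With the stringent cut-off only the perfect match to the toy motif is reported, whereas a cut-off of 0 reports every window, which illustrates the trade-off between false negatives and false positives that the adjustable thresholds are meant to control.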

Orthologue sequence download

A search form is provided for submitting a human or mouse IRG and retrieving its orthologous amino acid sequences stored in INTERFEROME. Orthologue sequences are derived from 37 species in Ensembl version 49. When a human or mouse IRG is used to search the database, a table with a list of orthologues is displayed, and a link to download the sequences in FASTA format is provided (Supplementary Figure S4).
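For readers unfamiliar with the FASTA format used for the orthologue download, the short Python sketch below writes a set of amino acid sequences as FASTA text. The function name, the header layout and the example records are hypothetical and are not taken from INTERFEROME output; the 60-character line width is a common convention rather than a requirement of the format.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text.

    Each record becomes a '>' header line followed by the sequence
    wrapped at `width` characters per line.
    """
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Placeholder orthologue records (made-up identifiers and sequences)
orthologues = [
    ("example_protein|species_one", "MSTNGDDHQVKDSLEQ"),
    ("example_protein|species_two", "MSANGEDHQVRDSLEQ"),
]
print(to_fasta(orthologues, width=10), end="")
```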

Search IRG BLAST databases

We have generated IRG protein BLAST databases for IRG sequences from human, mouse and 37 vertebrate species, enabling protein or nucleotide BLAST (BLASTP or BLASTX) analysis to be performed on our sequence data sets. This functionality will enable other sequences/novel genes to be searched against IRGs and allow the identification of common domains or novel IRGs. The BLAST search forms provide the ability to specify settings and display graphical results similar to NCBI BLAST (15).

FUTURE DIRECTIONS

As expression, genomic and other sequence information is updated in external databases, we will reflect those changes in INTERFEROME. In addition, as more interferon-treated expression data sets become available, we hope to add these to INTERFEROME and expand the IRG data sets. Each additional data set will increase the reliability and specificity of the IRG gene set. We propose eventually to provide normalized expression information on the IRG data sets and to enable IFN subtype-, dose-, time- and tissue- or cell-type-specific IRG analysis. We will also provide more comprehensive IRG-specific regulatory and interaction information, statistical and computational methods to analyse _cis_-regulatory modules, pathway information, sophisticated data visualization methods and disease data sets to identify disease-specific IRG signatures.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

This work was supported by the National Health and Medical Research Council, Australia and CRC for chronic inflammatory disease. Funding for open access charge: National Health and Medical Research Council of Australia.

Conflict of interest statement. None declared.

ACKNOWLEDGMENTS

The Bio21 UROP program.

REFERENCES