Home | cisRED

The cisRED database holds conserved sequence motifs identified by motif discovery methods applied on a genome scale. Currently we work with regions 1.5 Kb upstream of a TSS, net of repeats; motifs in such regions should be implicated in gene regulation. For each target human gene, we search for over-represented motifs using the target, its co-expressed genes and its orthologues. We identify co-expressed genes with a separate pipeline, and we query orthologues from an EnsEMBL compara database. v1.0 uses the genomic sequences and annotations in EnsEMBL v22, in which the human assembly is NCBI 34.
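The search-region definition above (1.5 Kb upstream of a TSS, net of repeats) can be illustrated with a minimal Python sketch. It is not cisRED's pipeline code: the function names, the 0-based half-open coordinate convention and the example repeat intervals are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed convention: 0-based, half-open [start, end) genomic intervals.
Interval = Tuple[int, int]


def upstream_region(tss: int, strand: int, length: int = 1500) -> Interval:
    """Return the interval covering `length` bases upstream of a TSS.

    strand is +1 (forward) or -1 (reverse). On the reverse strand,
    "upstream" lies at higher genomic coordinates.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    if strand == 1:
        return (max(0, tss - length), tss)
    return (tss + 1, tss + 1 + length)


def subtract_intervals(region: Interval, masked: List[Interval]) -> List[Interval]:
    """Return the parts of `region` not covered by any masked interval
    (for example, annotated repeats)."""
    start, end = region
    kept: List[Interval] = []
    cursor = start
    for m_start, m_end in sorted(masked):
        if m_end <= cursor or m_start >= end:
            continue  # no overlap with the remaining part of the region
        if m_start > cursor:
            kept.append((cursor, m_start))
        cursor = max(cursor, m_end)
    if cursor < end:
        kept.append((cursor, end))
    return kept


# Hypothetical example: forward-strand TSS at 10,000 with two repeats.
region = upstream_region(10_000, 1)
search_segments = subtract_intervals(region, [(8_600, 8_700), (9_900, 10_100)])
```

Here `search_segments` holds the repeat-free pieces of the upstream region that would be passed on to motif discovery.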

cisRED 1.0 holds predicted regulatory elements for ~5.5K human genes. This set includes 130 of the ~500 ENCODE genes. Each of these target genes had at least one high-confidence co-expressed gene and one orthologue, as well as discovered motifs with p-values < 0.05. P-values were estimated from motif score distributions for large sets of 'unrelated' sequence sets.
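As a rough illustration of estimating a p-value from a background score distribution, the Python sketch below computes an empirical p-value for one motif score. The add-one estimator, the function name and the toy scores are assumptions; the source does not specify how cisRED computes its estimates.

```python
from typing import Sequence


def empirical_p_value(score: float, background_scores: Sequence[float]) -> float:
    """Estimate P(background score >= observed score).

    background_scores stands in for motif scores obtained from 'unrelated'
    sequence sets. The +1 terms (an assumed choice) keep the estimate
    from ever being exactly zero.
    """
    if not background_scores:
        raise ValueError("background_scores must not be empty")
    n_at_least = sum(1 for s in background_scores if s >= score)
    return (n_at_least + 1) / (len(background_scores) + 1)


# Toy background of 99 scores (1..99); keep motifs with p < 0.05.
background = list(range(1, 100))
p_kept = empirical_p_value(97, background)     # 3 scores >= 97 -> 4/100
p_dropped = empirical_p_value(96, background)  # 4 scores >= 96 -> 5/100
```

With a threshold of p < 0.05, the first motif would be retained and the second would not.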

cisRED's predicted human regulatory elements can be viewed directly from the database, using either the UCSC genome browser (human, July 2003) or a Java WebStart version of the Sockeye workspace. Results will be viewable at EnsEMBL in the near future.

cisRED makes many types of data available. For each regulatory element, you can access a motif matrix in TRANSFAC, JASPAR and Excel formats, or a JPG image of the sequence logo. Data and SQL create-table statements are available for the current and previous MySQL databases (v1.0: 14 MB). Database schema diagrams are available. You can download a compressed file that contains all of cisRED's input FASTA sequence sets (v1.0: 35 MB). A table of high-confidence co-expressed genes for ~5.5K human genes is available (2 MB, XLS).

Software

Sockeye

Sockeye is a Java/Java3D application for assembling, viewing and working with genomic information in a direct, interactive and extensible 3D workspace. Designed primarily for phylogenetic footprinting and regulatory element discovery, it lets a user assemble and quickly extract meaning from complex comparative genomics datasets. Sockeye queries and displays EnsEMBL data. It directly queries a gene's orthologues or coexpressed genes. It integrates an extensible library of command-line bioinformatics applications with the 3D workspace through the Chinook application server. It uses JASPAR and TRANSFAC TFBS resources. It offers a powerful environment for work with sequence alignment methods, sequence conservation profiles and conserved regions. It facilitates comparing results from different algorithms and different algorithm parameter settings. It can import and display results in GFF files. It can zoom easily between a whole genome display and a basepair display. InstallAnywhere and WebStart installers are freely available.

Chinook

Chinook is a peer-to-peer (P2P) bioinformatics service. Chinook turns command-line applications into services that are broadcast over a virtual network. A Java application, Chinook was developed to provide compute resources for genome-wide regulatory motif detection. From this, it has grown into a platform that facilitates the exchange of analysis techniques within a local community and worldwide. Currently, over ten analysis services have been made "Chinook-ready". The algorithms range from sequence alignment to regulatory element prediction. Chinook uses XML to make it extremely easy to add new services.

DUNE: Drift Under Neutral Evolution

DUNE is a Java application that transforms an input DNA sequence from one species into an in-silico mutated sequence for another species, using a model of molecular evolution that is "neutral", i.e. one in which no regions are under selective constraint. The source and output can be any species in the application's internal phylogenetic tree, which currently contains: 1) Fugu, zebrafish, chicken, rat, mouse, chimpanzee and human; and 2) C. elegans and C. briggsae. Neutral evolution is modeled by allowing substitution, deletion and insertion events to occur according to parameters in recent published analyses. The package is part of the component in the cisRED production pipeline that estimates p-values for computationally discovered motifs from input sequence sets that include orthologues to the target gene. DUNE is not yet available for licensing; however, if you are interested in using it, please contact us.
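The idea of mutating a sequence through substitution, deletion and insertion events can be sketched in a few lines of Python. This is a deliberately simplified illustration, not DUNE itself: it applies uniform per-base rates in a single pass, ignores the phylogenetic tree and branch lengths, and the function name and rate values are hypothetical.

```python
import random

BASES = "ACGT"


def mutate_neutral(seq: str, sub_rate: float, del_rate: float,
                   ins_rate: float, rng: random.Random) -> str:
    """Apply independent per-base substitution, deletion and insertion
    events to an uppercase DNA sequence (all rates are assumptions)."""
    if min(sub_rate, del_rate, ins_rate) < 0 or sub_rate + del_rate > 1 or ins_rate > 1:
        raise ValueError("rates must be probabilities with sub_rate + del_rate <= 1")
    out = []
    for base in seq:
        r = rng.random()
        if r < del_rate:
            continue  # deletion: drop this base (and skip any insertion after it)
        if r < del_rate + sub_rate:
            # substitution: pick a different base uniformly at random
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
        if rng.random() < ins_rate:
            out.append(rng.choice(BASES))  # insertion after this position
    return "".join(out)


# Hypothetical usage with a seeded generator so that runs are repeatable.
rng = random.Random(0)
mutated = mutate_neutral("ACGTACGTACGTACGT", sub_rate=0.1, del_rate=0.02,
                         ins_rate=0.02, rng=rng)
```

Because every position is treated identically, no region of the simulated sequence is under selective constraint, which is the property a neutral background model needs.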

HitPlotter

This Java application lets a user quickly evaluate results from one or multiple motif discovery applications. It reads a simple, compact GMM-format text file that is typically generated from a discovery method output report using a Perl script, and displays discovered "motifs" as horizontal coloured lines. This direct display mode lets a user quickly assess discovery datasets that can include hundreds of 'hits' from a large set of different discovery methods, each of which was run with different parameter settings on tens to hundreds of input sequences. The visualization is supported by tools for showing and hiding sequences and datasets, filtering hits by score, etc. Annotations in GMM format (known TFBS, exons, repeats, etc.) can be displayed to give genomic context to hits. 'Hits' from pattern scans can also be displayed, e.g. from 'known' matrices from JASPAR or TRANSFAC. Perl '2gmm' scripts are easily written. They are currently available for ANN-Spec, BioProspector, CONSENSUS, Gibbs Sampler, MDmodule, MDscan, MEME, PhyloCon, Teiresias, MotifSampler, RSAT oligo-analysis and WCONSENSUS. Although HitPlotter is still in a pre-beta stage of development, we are willing to license it. Please contact us if you are interested in using it.

cisRED.

Databases of genome-wide regulatory module and element predictions

| Database | Assembly | Search regions | Search region type | Nbr. of input species | Conserved motifs | Discovery p-value threshold | Ensembl compatibility | Release date |
|---|---|---|---|---|---|---|---|---|
| Human 9 | NCBI v36b | 18.7k | promoter | 41 | 236k | 0.01 | Build 38-49 | 26 Jul. 2007 |
| Mouse 4 | NCBI m37 | 17.5k | promoter | 38 | 223k | 0.1 | Build 47-49 | 26 Sep. 2007 |
| Mouse 3.1 | NCBI m35 | 17.5k | promoter | 38 | 223k | 0.1 | Build 38 | 18 Apr. 2007 |
| Rat 1.1 | RGSC v3.1 | 6.7k | promoter | 28 | 116k | 0.25 | n/a | 12 Feb. 2006 |
| C. elegans 4 | WormBase WS170 | 3.8k | promoter | 8 | 158k | 1.0 | Build 44-46 | 18 Jul. 2008 |
| Human Stat1 ChIP-seq peaks 1 | NCBI v35 | 226 | ChIP-seq | 23 | ~6k | 1.0 | n/a | 03 Apr. 2007 |

Overview

The cisRED database holds conserved sequence motifs identified by genome-scale motif discovery, similarity, clustering, co-occurrence and coexpression calculations. Sequence inputs include low-coverage genome sequence data and ENCODE data. A Nucleic Acids Research article describes the system architecture; please use this publication to cite cisRED. PubMed publications that cite cisRED are listed here.

cisRED makes three levels of information available for regulatory elements:

1. 'Atomic' motifs: These are conserved, over-represented sequence sets, typically 6 to 12 bp long, that have been discovered in a 'search region' sequence set.
2. Groups of 'similar' motifs: These are identified either by a) annotating motifs with site sequences from the TRANSFAC, JASPAR and ORegAnno databases (annotation-based groups), or by b) 'de novo' hierarchical clustering with the OPTICS algorithm ('de novo' groups).
3. Patterns of motif group labels that co-occur in many search regions: These putative regulatory modules are ranked using genome-scale statistical and functional properties. Motifs in highly ranked patterns are likely the most reliable predictions.

In promoter-based cisRED databases, sequence search regions for motif discovery extend from 1.5 Kb upstream to 200 bp downstream of a transcription start site, net of most types of repeats and of coding exons. Many transcription factor binding sites are located in such regions. For each target gene's search region, we use a base set of probabilistic ab initio discovery tools, in parallel, to find over-represented atomic motifs. Discovery methods use comparative genomics with over 40 vertebrate input genomes.

In ChIP-seq-based cisRED databases, sequence search regions for motif discovery correspond to significant peaks that represent genome-wide sites of protein-DNA binding. Because such peaks occur in a wide range of genic and intergenic locations, ChIP-seq and promoter-based databases are complementary. Currently, motif discovery for ChIP-seq data uses scan-based approaches that make more explicit use of sets of sequences known to be functional transcription factor binding sites, and that consider a wide range of levels of conservation. For the human STAT1 ChIP-seq database, search regions in the target species (human) were selected as +/- 300 bp around the ChIP-seq peak maximum. Repeats and coding regions were masked. Multiple sequence alignments were used to assemble orthologous input sequences from other species.

You can access cisRED's data in three ways:

1. View predicted regulatory elements directly in cisRED's web user interface. From this interface, motifs can be viewed 'live' in the UCSC or Ensembl genome browsers.
2. Download the data and SQL structure for each species' MySQL 4.x database, with a schema diagram and example SQL queries, from the Databases and Methods tab.
3. Query the databases directly with SQL at db.cisred.org. Queries can be driven from command-line or graphical clients (e.g. the MySQL QueryBrowser), or programmatically from Perl, Python, Java, Ruby, etc. The username is 'anonymous' and the password should be left blank.

cisRED human motifs are available as a native data type at the Ensembl genome browser.

cisRED is an ongoing project. Updates will be released frequently.
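The programmatic SQL access described above might look like the following Python sketch. The host name, username and blank password come from the text; everything else is an assumption, including the choice of the third-party PyMySQL client, whether that client can talk to a MySQL 4.x server, and whether the server is still reachable. No cisRED table names are used, since the schema is not given here; the sketch only lists the databases visible to the anonymous user.

```python
from typing import Dict, List


def connection_settings() -> Dict[str, str]:
    """Connection parameters as stated in the text: anonymous user,
    blank password."""
    return {"host": "db.cisred.org", "user": "anonymous", "password": ""}


def list_databases() -> List[str]:
    """Connect and return the database names visible to the anonymous user.

    Requires the third-party PyMySQL package (pip install pymysql),
    which is an assumed client choice, not one named by cisRED.
    """
    import pymysql  # imported here so the rest of the sketch needs no extra packages

    conn = pymysql.connect(**connection_settings())
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW DATABASES")
            return [row[0] for row in cur.fetchall()]
    finally:
        conn.close()


# Usage (needs network access to the server):
#     for name in list_databases():
#         print(name)
```

From there, the downloadable schema diagram and example SQL queries would be the place to find real table and column names.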


Publications

Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing

Gordon Robertson, Martin Hirst, Matthew Bainbridge, Misha Bilenky, Yongjun Zhao, Thomas Zeng, Ghia Euskirchen, Bridget Bernier, Richard Varhol, Allen Delaney, Nina Thiessen, Obi L. Griffith, Ann He, Marco Marra, Michael Snyder, and Steven Jones

Nature Methods. In press. (doi:10.1038/nmeth1068)

Supplementary Information

We developed a method, ChIP-sequencing (ChIP-seq), combining chromatin immunoprecipitation (ChIP) and massively parallel sequencing to identify mammalian DNA sequences bound by transcription factors in vivo. We used ChIP-seq to map STAT1 targets in interferon-γ (IFN-γ)-stimulated and unstimulated human HeLa S3 cells, and compared the method's performance to ChIP-PCR and to ChIP-chip for four chromosomes. By ChIP-seq, using 15.1 and 12.9 million uniquely mapped sequence reads, and an estimated false discovery rate of less than 0.001, we identified 41,582 and 11,004 putative STAT1-binding regions in stimulated and unstimulated cells, respectively. Of the 34 loci known to contain STAT1 interferon-responsive binding sites, ChIP-seq found 24 (71%). ChIP-seq targets were enriched in sequences similar to known STAT1 binding motifs. Comparisons with two ChIP-PCR data sets suggested that ChIP-seq sensitivity was between 70% and 92% and specificity was at least 95%.

cisRED: A database system for genome scale computational discovery of regulatory elements

Robertson A.G, Bilenky M, Lin K, He A, Yuen W, Dagpinar M, Varhol R, Teague K, Griffith O.L, Zhang X, Pan Y, Hassel M, Sleumer M.C, Pan W, Pleasance E.D, Chuang M, Hao H, Li Y.Y, Robertson N, Fjell C, Li B, Montgomery S.B, Astakhova T, Zhou J, Sander J, Siddiqui A.S and Jones S.J.M

Nucleic Acids Res. 2006 Jan 1;34(Database issue):D68-73. | View this article on PubMed.

We describe cisRED, a database for conserved regulatory elements that are identified and ranked by a genome-scale computational system. The database and high-throughput predictive pipeline are designed to address diverse target genomes in the context of rapidly evolving data resources and tools. Motifs are predicted in promoter regions using multiple discovery methods applied to sequence sets that include corresponding sequence regions from vertebrates. We estimate motif significance by applying discovery and post-processing methods to randomized sequence sets that are adaptively derived from target sequence sets, retain motifs with p-values below a threshold, and identify groups of similar motifs and co-occurring motif patterns. The database offers information on atomic motifs, motif groups and patterns. It is web-accessible, and can be queried directly, downloaded, or installed locally.

Sockeye: a 3D environment for comparative genomics

Montgomery SB, Astakhova T, Bilenky M, Birney E, Fu T, Hassel M, Melsopp C, Rak M, Robertson AG, Sleumer M, Siddiqui AS, Jones SJ.

Genome Res. 2004 May;14(5):956-62. | View this article on PubMed.

Comparative genomics techniques are used in bioinformatics analyses to identify the structural and functional properties of DNA sequences. As the amount of available sequence data steadily increases, the ability to perform large-scale comparative analyses has become increasingly relevant. In addition, the growing complexity of genomic feature annotation means that new approaches to genomic visualization need to be explored. We have developed a Java-based application called Sockeye that uses three-dimensional (3D) graphics technology to facilitate the visualization of annotation and conservation across multiple sequences. This software uses the Ensembl database project to import sequence and annotation information from several eukaryotic species. A user can additionally import their own custom sequence and annotation data. Individual annotation objects are displayed in Sockeye by using custom 3D models. Ensembl-derived and imported sequences can be analyzed by using a suite of multiple and pair-wise alignment algorithms. The results of these comparative analyses are also displayed in the 3D environment of Sockeye. By using the Java3D API to visualize genomic data in a 3D environment, we are able to compactly display cross-sequence comparisons. This provides the user with a novel platform for visualizing and comparing genomic feature organization.

Assessment and integration of publicly available SAGE, cDNA microarray, and oligonucleotide microarray expression data for global coexpression analyses.

Griffith OL, Pleasance ED, Fulton DL, Oveisi M, Ester M, Siddiqui AS, Jones SJ.

Genomics. 2005 Oct;86(4):476-88. | View this article on PubMed.

Background: Large amounts of gene expression data from several different platforms are being made available to the scientific community and increasingly used as tools for validation and integration of other studies. Several studies have compared two or three platforms to evaluate the consistency of expression profiles for a single tissue or sample series but few have determined if these translate into reliable gene co-expression patterns across many conditions.

Results: We have analyzed Homo sapiens data from 1202 cDNA microarray experiments, 242 SAGE libraries and 667 Affymetrix oligonucleotide microarray experiments. Using standard co-expression analysis methods, we have assessed each platform for internal consistency, performed inter-platform comparisons, and tested each platform's predictions against the Gene Ontology. An overall correlation of correlations (rc) analysis showed that the platforms agree significantly better than random (p<0.001, 1000 randomizations) but with very low correlations of rc < 0.102. A rank analysis also showed significant but poor agreement with only 3-8% better performance than randomized data. Comparison against the Gene Ontology (GO) revealed that all three platforms identify more co-expressed gene pairs with common biological processes than random data and as the Pearson correlation for a gene pair increased it was more likely to be confirmed by GO.

Conclusions: The three datasets compared demonstrate significant but low levels of global concordance. When evaluated for biological relevance, the Affymetrix dataset performed best with gene pairs of correlation 0.9-1.0 confirmed by GO in 74% of cases. However, our results suggest that all three datasets may provide some biologically relevant predictions of co-expression. Researchers are cautioned against using any one dataset exclusively for their analyses.

An application of peer-to-peer technology to the discovery, use and assessment of bioinformatics programs.

Montgomery SB, Fu T, Guan J, Lin K, Jones SJ.

Nature Methods 2, 563 (2005). | View this article on PubMed.

We have created an open-source peer-to-peer system for bioinformatics analysis. Our system enables researchers across the globe to freely access state-of-the-art algorithms and computational resources. Algorithms are found and new jobs are submitted to remote servers through either BioPerl scripting or a sophisticated Java-based user interface. Furthermore, this system has been designed to provide support to applications that require access to a diverse range of bioinformatics functionality.

Currently, over 20 algorithms are accessible via this network at multiple locations. Each node's ability to advertise and upgrade new services ensures that users of our system are accessing the most current versions of algorithms (in a few cases directly from the authors). Additionally, each node can customize the methods of annotation and sequence retrieval for its clients; typically, we use EnsEMBL for sequence retrieval.

We hypothesize that the peer-to-peer approach can facilitate improved communication between the biologists who want to use bioinformatics tools and the authors of such techniques themselves.