Sequence Comparison of Human and Mouse Genes Reveals a Homologous Block Structure in the Promoter Regions

DBTSS: DataBase of Human Transcription Start Sites, progress report 2006

Nucleic Acids Research, 2006

DBTSS was first constructed in 2002 based on precise, experimentally determined 5′ end clones. Several major updates and additions have been made since the last report. First, the number of human clones has drastically increased, going from 190 964 to 1 359 000. Second, information about potential alternative promoters is presented because the number of 5′ end clones is now sufficient to determine several promoters for one gene. Namely, we defined putative promoter groups by clustering transcription start sites (TSSs) separated by <500 bases. A total of 8308 human genes and 4276 mouse genes were found to have putative multiple promoters. Third, DBTSS provides detailed sequence comparisons of user-specified TSSs. Finally, we have added TSS information for zebrafish, malaria and schyzon (a red algae model organism). DBTSS is accessible at http://dbtss.hgc.jp.
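The grouping rule described above (TSSs separated by fewer than 500 bases fall into one putative promoter group) can be illustrated with a minimal sketch. It assumes single-linkage grouping of sorted coordinates on one strand of one gene; the abstract does not say whether DBTSS links TSSs in exactly this way, and the function name `cluster_tss` and the example coordinates are hypothetical.

```python
def cluster_tss(positions, max_gap=500):
    """Group TSS coordinates into putative promoter groups.

    Assumption: a TSS joins the current group when it lies fewer than
    max_gap bases from the previous (sorted) TSS; otherwise it starts
    a new group. This is an illustration, not the DBTSS pipeline.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] < max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Hypothetical TSS coordinates for a single gene.
example_tss = [1000, 1120, 1490, 2500, 2600, 9000]
groups = cluster_tss(example_tss)
print(groups)
```

Under this assumption, a gene whose TSSs yield more than one group would be counted as having putative multiple promoters.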

Genome-Wide Analysis of Promoters: Clustering by Alignment and Analysis of Regular Patterns

PLoS ONE, 2014

In this paper we perform a genome-wide analysis of H. sapiens promoters. To this aim, we developed and combined two mathematical methods that allow us to (i) classify promoters into groups characterized by specific global structural features, and (ii) recover, in full generality, any regular sequence in the different classes of promoters. One of the main findings of this analysis is that H. sapiens promoters can be classified into three main groups. Two of them are distinguished by the prevalence of weak or strong nucleotides and are characterized by short compositionally biased sequences, while the most frequent regular sequences in the third group are strongly correlated with transposons. Taking advantage of the generality of these mathematical procedures, we have compared the promoter database of H. sapiens with those of other species. We have found that the above-mentioned features characterize also the evolutionary content appearing in mammalian promoters, at variance with ancestral species in the phylogenetic tree, that exhibit a definitely lower level of differentiation among promoters.

Clustering of DNA Sequences in Human Promoters

Genome Research, 2004

We have determined the distribution of each of the 65,536 DNA sequences that are eight bases long (8-mer) in a set of 13,010 human genomic promoter sequences aligned relative to the putative transcription start site (TSS). A limited number of 8-mers have peaks in their distribution (cluster), and most cluster within 100 bp of the TSS. The 156 DNA sequences exhibiting the greatest statistically significant clustering near the TSS can be placed into nine groups of related sequences. Each group is defined by a consensus sequence, and seven of these consensus sequences are known binding sites for the transcription factors (TFs) SP1, NF-Y, ETS, CREB, TBP, USF, and NRF-1. One sequence, which we named Clus1, is not a known TF binding site. The ninth sequence group is composed of the strand-specific Kozak sequence that clusters downstream of the TSS. An examination of the co-occurrence of these TF consensus sequences indicates a positive correlation for most of them except for sequences bound by TBP (the TATA box). Human mRNA expression data from 29 tissues indicate that the ETS, NRF-1, and Clus1 sequences that cluster are predominantly found in the promoters of housekeeping genes (e.g., ribosomal genes). In contrast, TATA is more abundant in the promoters of tissue-specific genes. This analysis identified eight DNA sequences in 5082 promoters that we suggest are important for regulating gene expression.
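As a rough illustration of the counting step described above, the sketch below tallies where each 8-mer occurs relative to the TSS in a set of aligned promoter sequences. It is not the authors' procedure: the function name `kmer_position_counts`, the `tss_index` parameter, and the toy sequences are assumptions, and the statistical test for clustering is omitted.

```python
from collections import defaultdict

K = 8
# Four possible bases at each of eight positions gives the 65,536 8-mers.
NUM_KMERS = 4 ** K


def kmer_position_counts(promoters, k=K, tss_index=0):
    """Count occurrences of every k-mer at each offset from the TSS.

    Assumption: all sequences are aligned so that the TSS sits at the
    same string index (tss_index). Returns counts[kmer][offset].
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in promoters:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows with N or other codes
                counts[kmer][i - tss_index] += 1
    return counts


# Toy promoters (hypothetical), TSS assumed at index 10.
toy = ["AAGGGCGGGGCTT", "CTGGGCGGGGAAA"]
counts = kmer_position_counts(toy, tss_index=10)
print(NUM_KMERS, dict(counts["GGGCGGGG"]))
```

An 8-mer whose counts concentrate at a few offsets, rather than spreading evenly, is the kind of peak the abstract calls a cluster.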

Identification and Characterization of the Potential Promoter Regions of 1031 Kinds of Human Genes

To understand the mechanism of transcriptional regulation, it is essential to identify and characterize the promoter, which is located proximal to the mRNA start site. To identify the promoters from the large volumes of genomic sequences, we used mRNA start sites determined by a large-scale sequencing of the cDNA libraries constructed by the “oligo-capping” method. We aligned the mRNA start sites with the genomic sequences and retrieved adjacent sequences as potential promoter regions (PPRs) for 1031 genes. The PPR sequences were searched to determine the frequencies of major promoter elements. Among 1031 PPRs, 329 (32%) contained TATA boxes, 872 (85%) contained initiators, 999 (97%) contained GC box, and 663 (64%) contained CAAT box. Furthermore, 493 (48%) PPRs were located in CpG islands. This frequency of CpG islands was reduced in TATA+/Inr+ PPRs and in the PPRs of ubiquitously expressed genes. In the PPRs of the CGM2 gene, the DRA gene, and the TM30pl genes, which showed highly colon specific expression patterns, the consensus sequences of E boxes were commonly observed. The PPRs were also useful for exploring promoter SNPs.
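The reported percentages can be checked directly against the counts given above. The short sketch below recomputes each frequency from the stated count and the total of 1031 PPRs; the variable names are illustrative only.

```python
TOTAL_PPRS = 1031

# Counts as stated in the abstract.
element_counts = {
    "TATA box": 329,
    "initiator": 872,
    "GC box": 999,
    "CAAT box": 663,
    "CpG island": 493,
}

# Percentage of PPRs containing each element, rounded to whole percent.
percentages = {
    name: round(100 * count / TOTAL_PPRS)
    for name, count in element_counts.items()
}

for name, pct in percentages.items():
    print(f"{name}: {element_counts[name]}/{TOTAL_PPRS} = {pct}%")
```

Each rounded value agrees with the figure quoted in the abstract.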

Genome-wide analysis of mammalian promoter architecture and evolution

Nature Genetics, 2006

Mammalian promoters can be separated into two classes, conserved TATA box-enriched promoters, which initiate at a well-defined site, and more plastic, broad and evolvable CpG-rich promoters. We have sequenced tags corresponding to several hundred thousand transcription start sites (TSSs) in the mouse and human genomes, allowing precise analysis of the sequence architecture and evolution of distinct promoter classes. Different tissues and families of genes differentially use distinct types of promoters. Our tagging methods allow quantitative analysis of promoter usage in different tissues and show that differentially regulated alternative TSSs are a common feature in protein-coding genes and commonly generate alternative N termini. Among the TSSs, we identified new start sites associated with the majority of exons and with 3′ UTRs. These data permit genome-scale identification of tissue-specific promoters and analysis of the cis-acting elements associated with them.

Position dominant sequence elements in experimentally verified human promoters and their putative relation to cancer

Cancer genomics & proteomics

Promoter regions of the human genome play a key role in our understanding of the regulatory mechanisms related to the physiological and disease states. The aim of this study was to investigate the sequence positional properties of experimentally verified human promoters. Consequently, we determined short sequence elements ranging from 4 to 9mers presenting position dominance close to, or away from the transcription start site (TSS). For this purpose rigid statistical criteria were used and whether position dominance was in any way related to transcription control was determined. To achieve this goal we designed and implemented a dedicated filtering method to massively detect position-dominant sequence elements embedded in the promoter set. Additionally, via a high throughput procedure, we gathered data on the majority of the publicly available transcription factor-binding sites (TFBSs) and matched them to our findings, aiming to accomplish a large-scale correlation between position-...

Identification and functional analysis of human transcriptional promoters

Genome research, 2003

Genomic and full-length cDNA sequences provide opportunities for understanding human gene structure and transcriptional regulatory elements. The simplest regulatory elements to identify are promoters, as their positions are dictated by the location of transcription start sites. We aligned full-length cDNA clones from the Mammalian Gene Collection to the human genome rough draft sequence to estimate the start sites of more than 10,000 human transcripts. We selected genomic sequence just upstream from the 5' end of these cDNA sequences and designated these as putative promoters. We assayed the functions of 152 of these DNA fragments, chosen at random from the entire set, in a luciferase-based transfection assay in four human cultured cell types. Ninety-one percent of these DNA fragments showed significant transcriptional activity in at least one of the cell lines, whereas 89% showed activity in at least two of the lines. We analyzed the distributions of strengths of these promoter...

DBTSS: DataBase of human Transcriptional Start Sites and full-length cDNAs

Nucleic Acids Research, 2002

Although the information of cDNAs is indispensable for analyzing gene function, most of the cDNA sequences stored in current databases are imperfect in the sense that they lack the precise information of 5′ end termini. To overcome this difficulty, we have developed the oligo-capping method to obtain full-length cDNAs, the information of which has been partly deposited in public databases. In this study, we further constructed human cDNA libraries enriched in clones containing the cap structure to systematically explore the 5′ end structure of expressed genes. Of approximately 217 402 5′ end sequences obtained, 111 382 have been matched to cDNA sequences of known genes (7889 genes) and are presented in our new database, DataBase of Transcriptional Start Sites (DBTSS; http://elmo.ims.u-tokyo.ac.jp/dbtss/). Sequence comparison between our entries and those of a reference sequence database, RefSeq, revealed that 4683 (34%) of RefSeq sequences should be extended towards the 5′ ends. We also mapped each sequence on the human draft genome sequence to identify its transcriptional start site, which provides us with more detailed information on distribution patterns of transcriptional start sites and adjacent regulatory regions.

Identification of Conserved Regulatory Elements in Mammalian Promoter Regions: A Case Study Using the PCK1 Promoter

Genomics, Proteomics & Bioinformatics, 2008

A systematic phylogenetic footprinting approach was performed to identify conserved transcription factor binding sites (TFBSs) in mammalian promoter regions using human, mouse and rat sequence alignments. We found that the score distributions of most binding site models did not follow the Gaussian distribution required by many statistical methods. Therefore, we performed an empirical test to establish the optimal threshold for each model. We gauged our computational predictions by comparing with previously known TFBSs in the PCK1 gene promoter of the cytosolic isoform of phosphoenolpyruvate carboxykinase, and achieved a sensitivity of 75% and a specificity of approximately 32%. Almost all known sites overlapped with predicted sites, and several new putative TFBSs were also identified. We validated a predicted SP1 binding site in the control of PCK1 transcription using gel shift and reporter assays. Finally, we applied our computational approach to the prediction of putative TFBSs within the promoter regions of all available RefSeq genes. Our full set of TFBS predictions is freely available at http://bfgl.anri.barc.usda.gov/tfbsConsSites.
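The abstract does not describe the empirical test itself. One common way to set a threshold without assuming a Gaussian score distribution is to take a quantile of the scores a model produces on background sequence, and the sketch below shows that idea only. The function name `empirical_threshold`, the `false_positive_rate` parameter, and the toy scores are all assumptions and should not be read as the authors' method.

```python
import math


def empirical_threshold(background_scores, false_positive_rate=0.05):
    """Pick a score cutoff from the empirical background distribution.

    Returns a threshold t such that the fraction of background scores
    strictly greater than t is at most false_positive_rate. No
    distributional shape (e.g. Gaussian) is assumed.
    """
    if not background_scores:
        raise ValueError("background_scores must not be empty")
    if not 0 <= false_positive_rate < 1:
        raise ValueError("false_positive_rate must be in [0, 1)")
    ranked = sorted(background_scores, reverse=True)
    # Number of background scores permitted to exceed the threshold.
    allowed = math.floor(false_positive_rate * len(ranked))
    return ranked[allowed]


# Toy background scores (hypothetical): the integers 0..99.
background = list(range(100))
threshold = empirical_threshold(background, false_positive_rate=0.05)
exceeding = sum(score > threshold for score in background)
print(threshold, exceeding)
```

A candidate site would then be called only when its score exceeds the per-model threshold, with a separate threshold computed for each binding site model.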