P. Bucher - Academia.edu

Papers by P. Bucher

The Eukaryotic Promoter Database: expansion of EPDnew and new promoter analysis tools

Nucleic Acids Research, 2015

We present an update of EPDnew (http://epd.vital-it.ch), a recently introduced new part of the Eukaryotic Promoter Database (EPD) which has been described in more detail in a previous NAR Database Issue. EPD is an old database of experimentally characterized eukaryotic POL II promoters, which are conceptually defined as transcription initiation sites or regions. EPDnew is a collection of automatically compiled, organism-specific promoter lists complementing the old corpus of manually compiled promoter entries of EPD. This new part is exclusively derived from next-generation sequencing data from high-throughput promoter mapping experiments. We report on the recent growth of EPDnew, its extension to additional model organisms and its improved integration with other bioinformatics resources developed by our group, in particular the Signal Search Analysis and ChIP-Seq web servers.

A cSNP map and database for human chromosome 21

Genome Research, 2001

Single nucleotide polymorphisms (SNPs) are likely to contribute to the study of complex genetic diseases. The genomic sequence of human chromosome 21q was recently completed with 225 annotated genes, thus permitting efficient identification and precise mapping of potential cSNPs by bioinformatics approaches. Here we present a human chromosome 21 (HC21) cSNP database and the first chromosome-specific cSNP map. Potential cSNPs were generated using three approaches: (1) Alignment of the complete HC21 genomic sequence to cognate ESTs and mRNAs. Candidate cSNPs were automatically extracted using a novel program for context-dependent SNP identification that efficiently discriminates between true variation, poor quality sequencing, and paralogous gene alignments. (2) Multiple alignment of all known HC21 genes to all other human database entries. (3) Gene-targeted cSNP discovery. To date we have identified 377 cSNPs averaging ~1 SNP per 1.5 kb of transcribed sequence, covering 65% of known ...

The cis-acting elements controlling mouse IL-2R alpha transcription

Mouse interleukin-2 receptor alpha gene expression. Delimitation of cis-acting regulatory elements in transgenic mice and by mapping of DNase-I hypersensitive sites

The Journal of Biological Chemistry, Jan 5, 1995

The alpha chain of the interleukin-2 receptor (IL-2R alpha) is a key regulator of lymphocyte proliferation. To analyze the mechanisms controlling its expression in normal cells, we used the 5'-flanking region (base pairs -2539/+93) of the mouse gene to drive chloramphenicol acetyltransferase expression in four transgenic mouse lines. Constitutive transgene activity was restricted to lymphoid organs. In mature T lymphocytes, transgene and endogenous IL-2R alpha gene expression was stimulated by concanavalin A and up-regulated by IL-2 with very similar kinetics. In thymic T cell precursors, IL-1 and IL-2 cooperatively induced transgene and IL-2R alpha gene expression. These results show that regulation of the endogenous IL-2R alpha gene occurs mainly at the transcriptional level. They demonstrate that cis-acting elements in the 5'-flanking region present in the transgene confer correct tissue specificity and inducible expression in mature T cells and their precursors in respon...

Significant similarity and dissimilarity in homologous proteins

Molecular Biology and Evolution, 1992

Common practice emphasizes significant sequence similarities between different members of protein families. These similarities presumably reflect evolutionary conservation of structurally and functionally essential residues. The nonconserved regions, on the other hand, may be either selectively neutral or differentiated. We propose several distributional sequence statistics (e.g., clustering of charged residues, compositional biases, and repetitive patterns) as indicators of differentiation events. These ideas are illustrated with various examples, including comparisons among G protein-coupled receptors, herpesvirus proteins, and GTPase-activating proteins.

The PROSITE database, its status in 1999

Related Content - Bioinformatics - Oxford Journals

How Much Does It Cost?: Optimization of Costs in Sequence Analysis of Social Science Data

Sociological Methods & Research, 2009

One major methodological problem in analysis of sequence data is the determination of costs from which distances between sequences are derived. Although this problem is currently not optimally dealt with in the social sciences, it has some similarity with problems that have been solved in bioinformatics for three decades. In this article, the authors propose an optimization of substitution and deletion/insertion costs based on computational methods. The authors provide an empirical way of determining costs for cases, frequent in the social sciences, in which theory does not clearly promote one cost scheme over another. Using three distinct data sets, the authors tested the distances and cluster solutions produced by the new cost scheme in comparison with solutions based on cost schemes associated with other research strategies. The proposed method performs well compared with other cost-setting strategies, while it alleviates the justification problem of cost schemes.
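To make concrete how a cost scheme turns into a distance, here is a minimal sketch of a cost-weighted edit distance between two state sequences. It is not the authors' optimization procedure; the function names, the state codes, and all cost values are assumptions chosen for illustration only.

```python
from typing import Callable, Sequence


def sequence_distance(
    a: Sequence[str],
    b: Sequence[str],
    sub_cost: Callable[[str, str], float],
    indel: float = 1.0,
) -> float:
    """Cheapest total cost of turning `a` into `b` using substitutions
    and insertions/deletions (standard dynamic programming)."""
    n, m = len(a), len(b)
    # d[i][j] = cheapest cost of turning a[:i] into b[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel
    for j in range(1, m + 1):
        d[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                       # delete a[i-1]
                d[i][j - 1] + indel,                       # insert b[j-1]
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
            )
    return d[n][m]


def flat_cost(x: str, y: str) -> float:
    """Hypothetical scheme: every substitution costs 2."""
    return 0.0 if x == y else 2.0


def cheap_cost(x: str, y: str) -> float:
    """Hypothetical scheme: every substitution costs 0.5."""
    return 0.0 if x == y else 0.5


# Two made-up careers coded per period (E = employed, U = unemployed).
d_flat = sequence_distance("EEUU", "EEEU", flat_cost)
d_cheap = sequence_distance("EEUU", "EEEU", cheap_cost)
```

The same pair of sequences ends up at a different distance under each scheme, which is why the choice of costs needs justification before distances are clustered.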

Improving the sensitivity of the sequence profile method

Protein Science, 2008

The sequence profile method (Gribskov M, McLachlan AD, Eisenberg D, 1987, Proc Natl Acad Sci USA 84:4355-4358) is a powerful tool to detect distant relationships between amino acid sequences. A profile is a table of position-specific scores and gap penalties, providing a generalized description of a protein motif, which can be used for sequence alignments and database searches instead of an individual sequence. A sequence profile is derived from a multiple sequence alignment. We have found 2 ways to improve the sensitivity of sequence profiles: (1) Sequence weights: Usage of individual weights for each sequence avoids bias toward closely related sequences. These weights are automatically assigned based on the distance of the sequences using a published procedure (Sibbald PR, Argos P, 1990, J Mol Biol 216:813-818). (2) Amino acid substitution table: In addition to the alignment, the construction of a profile also needs an amino acid substitution table. We have found that in some cases a new table, the BLOSUM45 table (Henikoff S, Henikoff JG, 1992, Proc Natl Acad Sci USA 89:10915-10919), is more sensitive than the original Dayhoff table or the modified Dayhoff table used in the current implementation. Profiles derived by the improved method are more sensitive and selective in a number of cases where previous methods have failed to completely separate true members from false positives.
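As a rough illustration of how sequence weights and a substitution table enter a profile, the sketch below builds position-specific scores as a weighted average of substitution scores over an alignment column. It is a simplified sketch, not the published implementation: gap penalties are omitted, the weights are supplied by hand rather than computed by the cited procedure, and the two-letter alphabet and substitution values are made up (they are not BLOSUM45 or Dayhoff values).

```python
from typing import Dict, List, Sequence, Tuple

Profile = List[Dict[str, float]]


def build_profile(
    alignment: Sequence[str],
    weights: Sequence[float],
    subst: Dict[Tuple[str, str], float],
    alphabet: str,
) -> Profile:
    """One score per alignment column and residue type: the weighted
    average of substitution scores against the residues in that column."""
    total = sum(weights)
    profile: Profile = []
    for pos in range(len(alignment[0])):
        column: Dict[str, float] = {}
        for a in alphabet:
            s = 0.0
            for seq, w in zip(alignment, weights):
                r = seq[pos]
                if r != "-":          # gaps contribute nothing in this sketch
                    s += w * subst[(a, r)]
            column[a] = s / total
        profile.append(column)
    return profile


def score_segment(profile: Profile, segment: str) -> float:
    """Ungapped score of a segment of the same length as the profile."""
    return sum(col[r] for col, r in zip(profile, segment))


# Toy two-letter alphabet with hypothetical scores: match 2, mismatch -1.
TOY_SUBST = {(x, y): (2.0 if x == y else -1.0) for x in "AG" for y in "AG"}
ALIGNMENT = ["AG", "AG", "GG"]

# Equal weights let the two identical sequences dominate column 0;
# down-weighting them gives the divergent sequence equal influence.
unweighted = build_profile(ALIGNMENT, [1.0, 1.0, 1.0], TOY_SUBST, "AG")
weighted = build_profile(ALIGNMENT, [0.25, 0.25, 0.5], TOY_SUBST, "AG")
```

In the first column the unweighted profile favours A over G, while the weighted profile scores both equally, which is the kind of bias correction the abstract attributes to sequence weights.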

Comparative T cell receptor repertoire selection by antigen after adoptive transfer: A glimpse at an antigen-specific preimmune repertoire

Proceedings of the National Academy of Sciences, 2000

The Eukaryotic Promoter Database EPD

Nucleic Acids Research, 1998

The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes a description of the initiation site mapping data, exhaustive cross-references to the EMBL nucleotide sequence database, SWISS-PROT, TRANSFAC and other databases, as well as bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis. WWW-based interfaces have been developed that enable the user to view EPD entries in different formats, to select and extract promoter sequences according to a variety of criteria, and to navigate to related databases exploiting different cross-references. The EPD web site also features yearly updated base frequency matrices for major eukaryotic promoter elements. EPD can be accessed at http://www.epd.isb-sib.ch

CleanEx: new data extraction and merging tools based on MeSH term annotation

Nucleic Acids Research, 2009

The CleanEx expression database (http://www.cleanex.isb-sib.ch) provides access to public gene expression data via unique gene names as well as via the biomedical characteristics of experiments. To achieve this, a dual annotation of both sequences and experiments has been generated. First, the system links official gene symbols to any kind of sequences used for gene expression measurements (cDNA, Affymetrix, oligonucleotide arrays, SAGE or MPSS tags, Expressed Sequence Tags or other mRNA sequences, etc.). For the biomedical annotation, we re-annotate each experiment from the CleanEx database with the MeSH (Medical Subject Headings) terms, primarily used by NLM (National Library of Medicine) for indexing articles for the MEDLINE/PubMED database. This annotation allows fast and easy retrieval of expression data with common biological or medical features. The numerical data can then be exported as matrix-like tab-delimited text files. Data can be extracted from either one dataset or from heterogeneous datasets.

BlastR--fast and accurate database searches for non-coding RNAs

Nucleic Acids Research, 2011

The Eukaryotic Promoter Database (EPD): recent developments

Nucleic Acids Research, 1999

The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters, for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes a description of the initiation site mapping data, cross-references to other databases, and bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis. Recent efforts have focused on exhaustive cross-referencing to the EMBL nucleotide sequence database, and on the improvement of the WWW-based user interfaces and data retrieval mechanisms. EPD can be accessed at http://www.epd.isb-sib.ch

EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era

Nucleic Acids Research, 2013

The Eukaryotic Promoter Database (EPD), available online at http://epd.vital-it.ch, is a collection of experimentally defined eukaryotic POL II promoters which has been maintained for more than 25 years. A promoter is represented by a single position in the genome, typically the major transcription start site (TSS). EPD primarily serves biologists interested in analysing the motif content, chromatin structure or DNA methylation status of co-regulated promoter subsets. Initially, promoter evidence came from TSS mapping experiments targeted at single genes and published in journal articles. Today, the TSS positions provided by EPD are inferred from next-generation sequencing data distributed in electronic form. Traditionally, EPD has been a high-quality database with low coverage. The focus of recent efforts has been to reach complete gene coverage for important model organisms. To this end, we introduced a new section called EPDnew, which is automatically assembled from multiple, carefully selected input datasets. As another novelty, we started to use chromatin signatures in addition to mRNA 5' tags to locate promoters of weakly expressed genes. Regarding user interfaces, we introduced a new promoter viewer which enables users to explore promoter-defining experimental evidence in a UCSC genome browser window.

High-throughput SELEX–SAGE method for quantitative modeling of transcription-factor binding sites

Nature Biotechnology, 2002

Sea urchin histone mRNA termini are located in gene regions downstream from putative regulatory sequences

RNA Profiling and Chromatin Immunoprecipitation-Sequencing Reveal that PTF1a Stabilizes Pancreas Progenitor Identity via the Control of MNX1/HLXB9 and a Network of Other Transcription Factors

Molecular and Cellular Biology, 2012

Pancreas development is initiated by the specification and expansion of a small group of endodermal cells. Several transcription factors are crucial for progenitor maintenance and expansion, but their interactions and the downstream targets mediating their activity are poorly understood. Among those factors, PTF1a, a basic helix-loop-helix (bHLH) transcription factor which controls pancreas exocrine cell differentiation, maintenance, and functionality, is also needed for the early specification of pancreas progenitors. We used RNA profiling and chromatin immunoprecipitation (ChIP) sequencing to identify a set of targets in pancreas progenitors. We demonstrate that Mnx1, a gene that is absolutely required in pancreas progenitors, is a major direct target of PTF1a and is regulated by a distant enhancer element. Pdx1, Nkx6.1, and Onecut1 are also direct PTF1a targets whose expression is promoted by PTF1a. These proteins, most of which were previously shown to be necessary for pancreas bud maintenance or formation, form a transcription factor network that allows the maintenance of pancreas progenitors. In addition, we identify Bmp7, Nr5a2, RhoV, and P2rx1 as new targets of PTF1a in pancreas progenitors.

Pancreatic transcription factor 1a (Ptf1a) encodes a basic helix-loop-helix (bHLH) transcription factor most closely related to the Twist subclass (31). It was first identified as one of three subunits of the PTF1 transcription factor complex that is required for the expression of pancreatic digestive enzyme genes (11, 53-55). The PTF1 complex also comprises a class A bHLH protein, p64, also known as PTF1b/TCF12/HEB and p75/TCF3/E12/E47, a subunit that is required for the import of the PTF1 complex into the cell nucleus (6, 60).
In addition to this initially identified tripartite complex, PTF1 was also shown to interact with recombination signal binding protein for immunoglobulin kappa J region (RBPJ/RBPJK) or recombination signal binding protein for immunoglobulin kappa J region-like (RBPJL) depending on cell types and developmental stages (6, 39, 46). The PTF1 complex binds a bipartite cognate site that contains two distinct sequence motifs (11, 55). p64 was shown to contact a TGGGAAA/TTTCCCA sequence (A box/TC box), and although p64 was identified as HEB (NCBI), RBPJL and RBPJK subsequently were shown to bind this sequence (6, 39, 46). PTF1a binds to CANNTG, the canonical binding site for bHLH proteins (E box; formerly called B box) (11, 55). Interactions with NR5A2/LRH-1 also were recently uncovered (23). PTF1a is a protein that is required for the differentiation of the nervous system (2, 18, 24, 50), retina (14, 15, 44), and pancreas. The truncation of the human PTF1A gene leads to permanent neonatal diabetes mellitus due to pancreas agenesis (58, 59, 62). In Ptf1a knockout (KO) mice, exocrine pancreas agenesis was similarly observed (29, 32). Although the expression of this gene was initially thought to be limited to exocrine cells (32), tracing experiments have clearly shown that it is also expressed in early pancreas progenitors that give rise to exocrine and endocrine cells, including insulin-secreting beta cells (9, 16, 29). This is further supported by the reduction in endocrine cell numbers in the absence of PTF1a in mice and zebrafish (16, 29, 38). In the absence of

Evidence for selective evolution in codon usage in conserved amino acid segments of human alphaherpesvirus proteins

Journal of Molecular Evolution, 1991

The genomes of human viruses herpes simplex 1 (HSV1) and varicella zoster (VZV), although similar in biology, largely concordant in gene order, and identical in many amino acid segments, differ widely in their genomic G+C (abbreviated S) content, which is high in HSV1 (68%) and low in VZV (46%). This paper analyzes several striking codon usage contrasts. The S difference in coding regions is dramatically large in codon site 3 (S3), about 42%. The large difference in S3 is maintained at the same level in a subset of closely similar genes and even in corresponding identical amino acid blocks. A similar difference in S levels in silent site 1 (S1) is found in leucine and arginine. The difference in S3 levels occurs in every gene and in every multicodon amino acid form. The S difference also exists in amino acid usage, with HSV1 using significantly more codon types SSN, while VZV uses more codon types WWN (where W stands for A or T). The nonoverlapping and narrow histograms of S3 gene frequencies in both viruses suggest that the difference has arisen and been maintained by a process of selective rather than nonselective effects. This is in sharp contrast to the relatively large variance seen for highly similar genes in the human versus yeast analysis. Interpretations and hypotheses to explain the HSV1 vs VZV codon usage disparity relate to virus-host interactions, to the role of viral genes in DNA metabolism, to availability of molecular resources (molecular Gause exclusion principle), and to differences in genomic structure.
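The G+C content (S) at each codon site is a simple count over an in-frame coding sequence. The following is a minimal sketch of that calculation; the function name and the example sequence are hypothetical and are not taken from the HSV1 or VZV genomes.

```python
from typing import Tuple


def s_content_by_codon_site(cds: str) -> Tuple[float, float, float]:
    """Fraction of G or C bases at codon sites 1, 2 and 3 of an in-frame
    coding sequence. Trailing bases that do not fill a codon are ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence contains no complete codon")
    counts = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            counts[i % 3] += 1   # i % 3 is the 0-based codon site
    return (counts[0] / n_codons, counts[1] / n_codons, counts[2] / n_codons)


# Made-up three-codon example: ATG GCC GCG
s1, s2, s3 = s_content_by_codon_site("ATGGCCGCG")
```

Comparing the third value between orthologous genes of two genomes gives the kind of site-3 contrast the abstract describes.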

Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences

Journal of Molecular Biology, 1990

Optimized weight matrices defining four major eukaryotic promoter elements, the TATA-box, cap signal, CCAAT-, and GC-box, are presented; they were derived by comparative sequence analysis of 502 unrelated RNA polymerase II promoter regions. The new TATA-box and cap signal descriptions differ in several respects from the only hitherto available base frequency tables. The CCAAT-box matrix, obtained with no prior assumption but CCAAT being the core of the motif, reflects precisely the sequence specificity of the recently discovered nuclear factor NY-I/CP1 but does not include typical recognition sequences of two other purported CCAAT-binding proteins, CTF and CBP. The GC-box description is longer than the previously proposed consensus sequences but is consistent with Sp1 protein-DNA binding data. The notion of a CACCC element distinct from the GC-box seems not to be justified any longer in view of the new weight matrix. Unlike the two fixed-distance elements, neither the CCAAT- nor the GC-box occurs at significantly high frequency in the upstream regions of non-vertebrate genes. Preliminary attempts to predict promoters with the aid of the new signal descriptions were unexpectedly successful. The new TATA-box matrix locates eukaryotic transcription initiation sites as reliably as do the best currently available methods to map Escherichia coli promoters. This analysis was made possible by the recently established Eukaryotic Promoter Database (EPD) of the EMBL Nucleotide Sequence Data Library. In order to derive the weight matrices, a novel algorithm has been devised that is generally applicable to sequence motifs positionally correlated with a biologically defined position in the sequences. The signal must be sufficiently over-represented in a particular region relative to the given site, but need not be present in all members of the input sequence collection. The algorithm iteratively redefines the set of putative motif representatives from which a weight matrix is derived, so as to maximize a quantitative measure of local over-representation, an optimization criterion that naturally combines structural and positional constancy. A comprehensive description of the technique is presented in Methods and Data.
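For readers unfamiliar with weight matrices, the sketch below shows only the scoring step: sliding a matrix along a sequence and reporting the best-scoring window. It does not reproduce the paper's iterative optimization, and the four-column matrix is a made-up toy with hypothetical weights, not one of the published promoter-element matrices.

```python
from typing import Dict, List, Optional, Tuple

WeightMatrix = List[Dict[str, int]]


def best_match(seq: str, matrix: WeightMatrix) -> Optional[Tuple[int, int]]:
    """Return (start, score) of the highest-scoring window, or None if the
    sequence is shorter than the matrix. Ties keep the leftmost window."""
    width = len(matrix)
    best: Optional[Tuple[int, int]] = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # A window's score is the sum of one weight per column.
        score = sum(col[base] for col, base in zip(matrix, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


def toy_column(preferred: str) -> Dict[str, int]:
    """Hypothetical column: +2 for the preferred base, -1 otherwise."""
    return {b: (2 if b == preferred else -1) for b in "ACGT"}


# Toy matrix preferring the word TATA (illustrative values only).
TOY_MATRIX: WeightMatrix = [toy_column(b) for b in "TATA"]

hit = best_match("GGCTATAAGG", TOY_MATRIX)
```

Restricting the scan to a region at a fixed distance from a reference position, such as a transcription start site, is how positional constraints of the kind described in the abstract would enter.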

Research paper thumbnail of The Eukaryotic Promoter Database: expansion of EPDnew and new promoter analysis tools

Nucleic acids research, 2015

We present an update of EPDNew (http://epd.vital-it.ch), a recently introduced new part of the Eu... more We present an update of EPDNew (http://epd.vital-it.ch), a recently introduced new part of the Eukaryotic Promoter Database (EPD) which has been described in more detail in a previous NAR Database Issue. EPD is an old database of experimentally characterized eukaryotic POL II promoters, which are conceptually defined as transcription initiation sites or regions. EPDnew is a collection of automatically compiled, organism-specific promoter lists complementing the old corpus of manually compiled promoter entries of EPD. This new part is exclusively derived from next generation sequencing data from high-throughput promoter mapping experiments. We report on the recent growth of EPDnew, its extension to additional model organisms and its improved integration with other bioinformatics resources developed by our group, in particular the Signal Search Analysis and ChIP-Seq web servers.

Research paper thumbnail of A cSNP map and database for human chromosome 21

Genome research, 2001

Single nucleotide polymorphisms (SNPs) are likely to contribute to the study of complex genetic d... more Single nucleotide polymorphisms (SNPs) are likely to contribute to the study of complex genetic diseases. The genomic sequence of human chromosome 21q was recently completed with 225 annotated genes, thus permitting efficient identification and precise mapping of potential cSNPs by bioinformatics approaches. Here we present a human chromosome 21 (HC21) cSNP database and the first chromosome-specific cSNP map. Potential cSNPs were generated using three approaches: (1) Alignment of the complete HC21 genomic sequence to cognate ESTs and mRNAs. Candidate cSNPs were automatically extracted using a novel program for context-dependent SNP identification that efficiently discriminates between true variation, poor quality sequencing, and paralogous gene alignments. (2) Multiple alignment of all known HC21 genes to all other human database entries. (3) Gene-targeted cSNP discovery. To date we have identified 377 cSNPs averaging ~1 SNP per 1.5 kb of transcribed sequence, covering 65% of known ...

Research paper thumbnail of The cis-acting elements controlling mouse IL-2R alpha transcription

Research paper thumbnail of Mouse interleukin-2 receptor alpha gene expression. Delimitation of cis-acting regulatory elements in transgenic mice and by mapping of DNase-I hypersensitive sites

The Journal of biological chemistry, Jan 5, 1995

The alpha chain of the interleukin-2 receptor (IL-2R alpha) is a key regulator of lymphocyte prol... more The alpha chain of the interleukin-2 receptor (IL-2R alpha) is a key regulator of lymphocyte proliferation. To analyze the mechanisms controlling its expression in normal cells, we used the 5'-flanking region (base pairs -2539/+93) of the mouse gene to drive chloramphenicol acetyltransferase expression in four transgenic mouse lines. Constitutive transgene activity was restricted to lymphoid organs. In mature T lymphocytes, transgene and endogenous IL-2R alpha gene expression was stimulated by concanavalin A and up-regulated by IL-2 with very similar kinetics. In thymic T cell precursors, IL-1 and IL-2 cooperatively induced transgene and IL-2R alpha gene expression. These results show that regulation of the endogenous IL-2R alpha gene occurs mainly at the transcriptional level. They demonstrate that cis-acting elements in the 5'-flanking region present in the transgene confer correct tissue specificity and inducible expression in mature T cells and their precursors in respon...

Research paper thumbnail of Significant similarity and dissimilarity in homologous proteins

Molecular biology and evolution, 1992

Common practice emphasizes significant sequence similarities between different members of protein... more Common practice emphasizes significant sequence similarities between different members of protein families. These similarities presumably reflect on evolutionary conservation of structurally and functionally essential residues. The nonconserved regions, on the other hand, may be either selectively neutral or differentiated. We propose several distributional sequence statistics (e.g., clustering of charged residues, compositional biases, and repetitive patterns) as indicators of differentiation events. These ideas are illustrated with various examples, including comparisons among G protein-coupled receptors, herpesvirus proteins, and GTPase-activating proteins.

Research paper thumbnail of Paper: THE PROSITE DATABASE, ITS STATUS IN 1999

Research paper thumbnail of Related Content - Bioinformatics - Oxford Journals

Research paper thumbnail of How Much Does It Cost?: Optimization of Costs in Sequence Analysis of Social Science Data

Sociological Methods & Research, 2009

One major methodological problem in analysis of sequence data is the determination of costs from ... more One major methodological problem in analysis of sequence data is the determination of costs from which distances between sequences are derived. Although this problem is currently not optimally dealt with in the social sciences, it has some similarity with problems that have been solved in bioinformatics for three decades. In this article, the authors propose an optimization of substitution and deletion/insertion costs based on computational methods. The authors provide an empirical way of determining costs for cases, frequent in the social sciences, in which theory does not clearly promote one cost scheme over another. Using three distinct data sets, the authors tested the distances and cluster solutions produced by the new cost scheme in comparison with solutions based on cost schemes associated with other research strategies. The proposed method performs well compared with other cost-setting strategies, while it alleviates the justification problem of cost schemes.

Research paper thumbnail of Improving the sensitivity of the sequence profile method

Protein Science, 2008

The sequence profile method (Gribskov M, McLachlan AD, Eisenberg D, 1987, Proc Natl Acad Sci USA ... more The sequence profile method (Gribskov M, McLachlan AD, Eisenberg D, 1987, Proc Natl Acad Sci USA 844355-4358) is a powerful tool to detect distant relationships between amino acid sequences. A profile is a table of position-specific scores and gap penalties, providing a generalized description of a protein motif, which can be used for sequence alignments and database searches instead of an individual sequence. A sequence profile is derived from a multiple sequence alignment. We have found 2 ways to improve the sensitivity of sequence profiles: (1) Sequence weights: Usage of individual weights for each sequence avoids bias toward closely related sequences. These weights are automatically assigned based on the distance of the sequences using a published procedure (Sibbald PR, Argos P, 1990, JMolBiol216:813-818). (2) Amino acid substitution table: In addition to the alignment, the construction of a profile also needs an amino acid substitution table. We have found that in some cases a new table, the BLOSUM45 table (Henikoff S, Henikoff JG, 1992, Proc Natl Acud Sci USA 89: 10915-10919), is more sensitive than the original Dayhoff table or the modified Dayhoff table used in the current implementation. Profiles derived by the improved method are more sensitive and selective in a number of cases where previous methods have failed to completely separate true members from false positives.

Research paper thumbnail of Comparative T cell receptor repertoire selection by antigen after adoptive transfer: A glimpse at an antigen-specific preimmune repertoire

Proceedings of the National Academy of Sciences, 2000

Research paper thumbnail of The Eukaryotic Promoter Database EPD

Nucleic Acids Research, 1998

The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL... more The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes a description of the initiation site mapping data, exhaustive cross-references to the EMBL nucleotide sequence database, SWISS-PROT, TRANSFAC and other databases, as well as bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis. WWW-based interfaces have been developed that enable the user to view EPD entries in different formats, to select and extract promoter sequences according to a variety of criteria, and to navigate to related databases exploiting different cross-references. The EPD web site also features yearly updated base frequency matrices for major eukaryotic promoter elements. EPD can be accessed at http://www.epd.isb-sib.ch

Research paper thumbnail of CleanEx: new data extraction and merging tools based on MeSH term annotation

Nucleic Acids Research, 2009

The CleanEx expression database (http://www.clea nex.isb-sib.ch) provides access to public gene e... more The CleanEx expression database (http://www.clea nex.isb-sib.ch) provides access to public gene expression data via unique gene names as well as via experiments biomedical characteristics. To reach this, a dual annotation of both sequences and experiments has been generated. First, the system links official gene symbols to any kind of sequences used for gene expression measurements (cDNA, Affymetrix, oligonucleotide arrays, SAGE or MPSS tags, Expressed Sequence Tags or other mRNA sequences, etc.). For the biomedical annotation, we re-annotate each experiment from the CleanEx database with the MeSH (Medical Subject Headings) terms, primarily used by NLM (National Library of Medicine) for indexing articles for the MEDLINE/PubMED database. This annotation allows a fast and easy retrieval of expression data with common biological or medical features. The numerical data can then be exported as matrix-like tab-delimited text files. Data can be extracted from either one dataset or from heterogeneous datasets.

BlastR--fast and accurate database searches for non-coding RNAs

Nucleic Acids Research, 2011

The Eukaryotic Promoter Database (EPD): recent developments

Nucleic Acids Research, 1999

The Eukaryotic Promoter Database (EPD) is an annotated non-redundant collection of eukaryotic POL II promoters for which the transcription start site has been determined experimentally. Access to promoter sequences is provided by pointers to positions in nucleotide sequence entries. The annotation part of an entry includes a description of the initiation site mapping data, cross-references to other databases, and bibliographic references. EPD is structured in a way that facilitates dynamic extraction of biologically meaningful promoter subsets for comparative sequence analysis. Recent efforts have focused on exhaustive cross-referencing to the EMBL nucleotide sequence database, and on the improvement of the WWW-based user interfaces and data retrieval mechanisms. EPD can be accessed at http://www.epd.isb-sib.ch

EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era

Nucleic Acids Research, 2013

The Eukaryotic Promoter Database (EPD), available online at http://epd.vital-it.ch, is a collection of experimentally defined eukaryotic POL II promoters which has been maintained for more than 25 years. A promoter is represented by a single position in the genome, typically the major transcription start site (TSS). EPD primarily serves biologists interested in analysing the motif content, chromatin structure or DNA methylation status of co-regulated promoter subsets. Initially, promoter evidence came from TSS mapping experiments targeted at single genes and published in journal articles. Today, the TSS positions provided by EPD are inferred from next-generation sequencing data distributed in electronic form. Traditionally, EPD has been a high-quality database with low coverage. The focus of recent efforts has been to reach complete gene coverage for important model organisms. To this end, we introduced a new section called EPDnew, which is automatically assembled from multiple, carefully selected input datasets. As another novelty, we started to use chromatin signatures in addition to mRNA 5′ tags to locate promoters of weakly expressed genes. Regarding user interfaces, we introduced a new promoter viewer which enables users to explore promoter-defining experimental evidence in a UCSC genome browser window.

High-throughput SELEX–SAGE method for quantitative modeling of transcription-factor binding sites

Nature Biotechnology, 2002

Sea urchin histone mRNA termini are located in gene regions downstream from putative regulatory sequences

RNA Profiling and Chromatin Immunoprecipitation-Sequencing Reveal that PTF1a Stabilizes Pancreas Progenitor Identity via the Control of MNX1/HLXB9 and a Network of Other Transcription Factors

Molecular and Cellular Biology, 2012

Pancreas development is initiated by the specification and expansion of a small group of endodermal cells. Several transcription factors are crucial for progenitor maintenance and expansion, but their interactions and the downstream targets mediating their activity are poorly understood. Among those factors, PTF1a, a basic helix-loop-helix (bHLH) transcription factor which controls pancreas exocrine cell differentiation, maintenance, and functionality, is also needed for the early specification of pancreas progenitors. We used RNA profiling and chromatin immunoprecipitation (ChIP) sequencing to identify a set of targets in pancreas progenitors. We demonstrate that Mnx1, a gene that is absolutely required in pancreas progenitors, is a major direct target of PTF1a and is regulated by a distant enhancer element. Pdx1, Nkx6.1, and Onecut1 are also direct PTF1a targets whose expression is promoted by PTF1a. These proteins, most of which were previously shown to be necessary for pancreas bud maintenance or formation, form a transcription factor network that allows the maintenance of pancreas progenitors. In addition, we identify Bmp7, Nr5a2, RhoV, and P2rx1 as new targets of PTF1a in pancreas progenitors.
Pancreatic transcription factor 1a (Ptf1a) encodes a basic helix-loop-helix (bHLH) transcription factor most closely related to the Twist subclass (31). It was first identified as one of three subunits of the PTF1 transcription factor complex that is required for the expression of pancreatic digestive enzyme genes (11, 53-55). The PTF1 complex also comprises a class A bHLH protein, p64, also known as PTF1b/TCF12/HEB and p75/TCF3/E12/E47, a subunit that is required for the import of the PTF1 complex into the cell nucleus (6, 60).
In addition to this initially identified tripartite complex, PTF1 was also shown to interact with recombination signal binding protein for immunoglobulin kappa J region (RBPJ/RBPJK) or recombination signal binding protein for immunoglobulin kappa J region-like (RBPJL) depending on cell types and developmental stages (6, 39, 46). The PTF1 complex binds a bipartite cognate site that contains two distinct sequence motifs (11, 55). p64 was shown to contact a TGGGAAA/TTTCCCA sequence (A box/TC box), and although p64 was identified as HEB (NCBI), RBPJL and RBPJK subsequently were shown to bind this sequence (6, 39, 46). PTF1a binds to CANNTG, the canonical binding site for bHLH proteins (E box; formerly called B box) (11, 55). Interactions with NR5A2/LRH-1 also were recently uncovered (23). PTF1a is a protein that is required for the differentiation of the nervous system (2, 18, 24, 50), retina (14, 15, 44), and pancreas. The truncation of the human PTF1A gene leads to permanent neonatal diabetes mellitus due to pancreas agenesis (58, 59, 62). In Ptf1a knockout (KO) mice, exocrine pancreas agenesis was similarly observed (29, 32). Although the expression of this gene was initially thought to be limited to exocrine cells (32), tracing experiments have clearly shown that it is also expressed in early pancreas progenitors that give rise to exocrine and endocrine cells, including insulin-secreting beta cells (9, 16, 29). This is further supported by the reduction in endocrine cell numbers in the absence of PTF1a in mice and zebrafish (16, 29, 38). In the absence of

Evidence for selective evolution in codon usage in conserved amino acid segments of human alphaherpesvirus proteins

Journal of Molecular Evolution, 1991

The genomes of human viruses herpes simplex 1 (HSV1) and varicella zoster (VZV), although similar in biology, largely concordant in gene order, and identical in many amino acid segments, differ widely in their genomic G+C (abbreviated S) content, which is high in HSV1 (68%) and low in VZV (46%). This paper analyzes several striking codon usage contrasts. The S difference in coding regions is dramatically large in codon site 3 (S3), about 42%. The large difference in S3 is maintained at the same level in a subset of closely similar genes and even in corresponding identical amino acid blocks. A similar difference in S levels in silent site 1 (S1) is found in leucine and arginine. The difference in S3 levels occurs in every gene and in every multicodon amino acid form. The S difference also exists in amino acid usage, with HSV1 using significantly more codon types SSN, while VZV uses more codon types WWN (where W stands for A or T). The nonoverlapping and narrow histograms of S3 gene frequencies in both viruses suggest that the difference has arisen and been maintained by a process of selective rather than nonselective effects. This is in sharp contrast to the relatively large variance seen for highly similar genes in the human versus yeast analysis.
Interpretations and hypotheses to explain the HSV1 versus VZV codon usage disparity relate to virus-host interactions, to the role of viral genes in DNA metabolism, to availability of molecular resources (molecular Gause exclusion principle), and to differences in genomic structure.
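
The per-site G+C measure discussed in this abstract can be illustrated with a short sketch. It is not taken from the paper: the function name `gc_by_codon_site` and both toy sequences are invented for illustration, and the sequences are not real HSV1 or VZV genes. The two toy sequences encode the same peptide (Met-Ala-Ala-Leu-Arg) with different synonymous codons, so the difference shows up at site 3 and, for leucine and arginine, at site 1.

```python
def gc_by_codon_site(cds: str) -> dict[int, float]:
    """Return the G+C fraction (S) at codon sites 1, 2 and 3 of a coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    fractions = {}
    for site in (1, 2, 3):
        # Every third base, starting at the chosen codon site.
        bases = cds[site - 1:usable:3]
        strong = sum(base in "GC" for base in bases)
        fractions[site] = strong / len(bases) if bases else 0.0
    return fractions


# Hypothetical toy sequences (assumptions, not viral data); same amino acids.
toy_gc_rich = "ATGGCCGCGCTGCGC"  # ATG GCC GCG CTG CGC
toy_gc_poor = "ATGGCAGCTTTAAGA"  # ATG GCA GCT TTA AGA

rich = gc_by_codon_site(toy_gc_rich)
poor = gc_by_codon_site(toy_gc_poor)
print("S3 difference:", rich[3] - poor[3])
print("S1 difference:", rich[1] - poor[1])
```

Site 2 is identical in both toy sequences because no synonymous change touches the second codon position here; only the silent positions carry the contrast.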

Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences

Journal of Molecular Biology, 1990

Optimized weight matrices defining four major eukaryotic promoter elements, the TATA-box, cap signal, CCAAT-, and GC-box, are presented; they were derived by comparative sequence analysis of 502 unrelated RNA polymerase II promoter regions. The new TATA-box and cap signal descriptions differ in several respects from the only hitherto available base frequency tables. The CCAAT-box matrix, obtained with no prior assumption but CCAAT being the core of the motif, reflects precisely the sequence specificity of the recently discovered nuclear factor NY-I/CP1 but does not include typical recognition sequences of two other purported CCAAT-binding proteins, CTF and CBP. The GC-box description is longer than the previously proposed consensus sequences but is consistent with Sp1 protein-DNA binding data. The notion of a CACCC element distinct from the GC-box seems not to be justified any longer in view of the new weight matrix. Unlike the two fixed-distance elements, neither the CCAAT- nor the GC-box occurs at significantly high frequency in the upstream regions of non-vertebrate genes. Preliminary attempts to predict promoters with the aid of the new signal descriptions were unexpectedly successful. The new TATA-box matrix locates eukaryotic transcription initiation sites as reliably as do the best currently available methods to map Escherichia coli promoters. This analysis was made possible by the recently established Eukaryotic Promoter Database (EPD) of the EMBL Nucleotide Sequence Data Library. In order to derive the weight matrices, a novel algorithm has been devised that is generally applicable to sequence motifs positionally correlated with a biologically defined position in the sequences. The signal must be sufficiently over-represented in a particular region relative to the given site, but need not be present in all members of the input sequence collection.
The algorithm iteratively redefines the set of putative motif representatives from which a weight matrix is derived, so as to maximize a quantitative measure of local over-representation, an optimization criterion that naturally combines structural and positional constancy. A comprehensive description of the technique is presented in Methods and Data.
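
For readers unfamiliar with weight matrices, the following minimal sketch shows how a matrix can be derived from a fixed set of aligned sites and used to locate the best-scoring window in a sequence. It is not the iterative optimization procedure described in the abstract, which also re-selects the motif representatives; it covers only a single build-and-scan pass. The function names, the pseudocount, the uniform background of 0.25, and the toy sites are all assumptions made for illustration and are not the published TATA-box matrix.

```python
import math


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites (A, C, G, T only)."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        # Log2 ratio of the smoothed base frequency to the background frequency.
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def best_match(matrix, sequence):
    """Return (score, offset) of the highest-scoring window; the first window wins ties."""
    width = len(matrix)
    best = None
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best


# Hypothetical aligned sites and target sequence (assumptions, not EPD data).
toy_sites = ["TATAAA", "TATATA", "TATAAG", "TTTAAA"]
matrix = build_weight_matrix(toy_sites)
score, offset = best_match(matrix, "GCGCGTATAAAGGCGC")
print(f"best window starts at offset {offset} with score {score:.2f}")
```

Restricting the scan to a window of offsets around a known reference position, such as a transcription start site, would be one way to reflect the positional correlation that the abstract emphasizes.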