Stacia Engel - Academia.edu

Papers by Stacia Engel

Saccharomyces genome database update: server architecture, pan-genome nomenclature, and external resources

Genetics

As one of the first model organism knowledgebases, Saccharomyces Genome Database (SGD) has been supporting the scientific research community since 1993. As technologies and research evolve, so does SGD: from updates in software architecture, to curation of novel data types, to incorporation of data from, and collaboration with, other knowledgebases. We are continuing to make steps toward providing the community with an S. cerevisiae pan-genome. Here, we describe software upgrades, a new nomenclature system for genes not found in the reference strain, and additions to gene pages. With these improvements, we aim to remain a leading resource for students, researchers, and the broader scientific community.

Gene Ontology annotation at SGD

Copyright information: Taken from "Genome Snapshot: a new resource at the Saccharomyces Genome Database (SGD) presenting an overview of the Saccharomyces cerevisiae genome", Nucleic Acids Research 2005;34(Database issue):D442-D445. Published online 28 Dec 2005. PMCID: PMC1347479. © The Author 2006. Published by Oxford University Press. All rights reserved.

Summary of GO annotations. The column 'Total Number of Annotations' refers to the total number of gene products (protein and RNA gene products) currently annotated to one or more terms (other than 'unknown') in each of the three GO ontologies: Biological Process, Molecular Function and Cellular Component. The number of gene products annotated to 'unknown' for any ontology is provided in the second column. The third column offers links to the graphs.

Distribution of gene products by process, function and component. Shown are percentages of gene products annotated to a specific term that maps up the ontology to a yeast GO Slim term. The yeast GO Slim is a high-level subset of GO terms that allows grouping of genes into broad categories (see text for details). Annotations to 'unknown' are excluded. (Note that the Cellular Component graph is not shown.)

Chromosomal features annotated in the genome at SGD

Copyright information: Taken from "Genome Snapshot: a new resource at the Saccharomyces Genome Database (SGD) presenting an overview of the Saccharomyces cerevisiae genome", Nucleic Acids Research 2005;34(Database issue):D442-D445. Published online 28 Dec 2005. PMCID: PMC1347479. © The Author 2006. Published by Oxford University Press. All rights reserved.

Graphical View of Protein Coding Genes. ORFs are classified by SGD as 'Verified', 'Uncharacterized' and 'Dubious'. 'Verified' ORFs are those for which experimental evidence demonstrates that a gene product is produced in S. cerevisiae. 'Uncharacterized' ORFs have orthologs in at least one other species and are likely to encode proteins, although experimental proof has not yet been published. 'Dubious' ORFs are those unlikely to encode a protein because they are not conserved in closely related species, and because no data exist demonstrating that a protein is produced.

The Genome Inventory. A total count of each feature type in the genome, as well as a count of each feature type on each chromosome, is displayed in this table. (Note that data for chromosomes IV through XIV are not shown in the figure.) Definitions for each feature type can be found in SGD's Glossary. Clicking on any feature type initiates a search that uses SGD's Advanced Search tool to find all the features in SGD of that type. This table also lists the current length of each chromosome in base pairs.

Summary table of the Model Organism BLASTP Best Hits page

Copyright information: Taken from "Fungal BLAST and Model Organism BLASTP Best Hits: new comparison resources at the Saccharomyces Genome Database (SGD)", Nucleic Acids Research 2004;33(Database Issue):D374-D377. Published online 17 Dec 2004. PMCID: PMC539977. Copyright © 2005 Oxford University Press.

A summary table similar to this representative table is generated for each locus having a 'hit' in one or more model organism databases. In this figure, protein Yer179wp results are shown as an example. Columns of the table are as follows: species of the hit protein; name of the database for the hit protein, hyperlinked to the home page of that database; name of the hit protein from its database, hyperlinked to the database page of that protein or its gene; description of the hit protein, as found in its database; E-value (expectation value), reflecting the number of hits expected to be found by chance; percent aligned, showing the percentage of the length of the query protein over which it aligns with the hit protein; source range, showing the amino acid coordinates of the region of the query protein that was aligned; and target range, showing the amino acid coordinates of the region of the 'hit' protein that was aligned with the query protein.

DOI: 10.1093/nar/gkh033

tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms

Term Matrix: A novel Gene Ontology annotation quality control system based on ontology term co-annotation patterns

Biological processes are accomplished by the coordinated action of gene products. Gene products often participate in multiple processes, and can therefore be annotated to multiple Gene Ontology (GO) terms. Nevertheless, processes that are functionally, temporally, and/or spatially distant may have few gene products in common, and co-annotation to unrelated processes likely reflects errors in literature curation, ontology structure, or automated annotation pipelines. We have developed an annotation quality control workflow that uses rules based on mutually exclusive processes to detect annotation errors, based on and validated by case studies including the three we present here: fission yeast protein-coding gene annotations over time; annotations for cohesin complex subunits in human and model species; and annotations using a selected set of GO biological process terms in human and five model species. For each case study, we reviewed available GO annotations, identified pairs of biol...
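The rule-based check described in this abstract can be illustrated with a minimal sketch. It is not the authors' Term Matrix implementation: the function name `find_rule_violations`, the toy gene names, and the example rule are all hypothetical, and the sketch compares directly annotated terms only, ignoring the ontology structure that a real workflow would need to account for.

```python
from itertools import combinations


def find_rule_violations(annotations, exclusive_pairs):
    """Return sorted (gene, term_a, term_b) triples that break a rule.

    annotations: dict mapping a gene identifier to a set of GO term names.
    exclusive_pairs: iterable of 2-tuples of terms assumed (for this
    sketch) to be mutually exclusive.
    """
    # Store each rule as an unordered pair so term order does not matter.
    rules = {frozenset(pair) for pair in exclusive_pairs}
    violations = []
    for gene, terms in annotations.items():
        # Check every pair of terms co-annotated to the same gene.
        for term_a, term_b in combinations(sorted(terms), 2):
            if frozenset((term_a, term_b)) in rules:
                violations.append((gene, term_a, term_b))
    return sorted(violations)


# Hypothetical toy data, invented for illustration only.
annotations = {
    "geneA": {"cytoplasmic translation", "DNA replication"},
    "geneB": {"DNA replication", "DNA repair"},
}
exclusive_pairs = [("DNA replication", "cytoplasmic translation")]

violations = find_rule_violations(annotations, exclusive_pairs)
for gene, term_a, term_b in violations:
    print(f"{gene}: co-annotated to '{term_a}' and '{term_b}'")
```

Flagged pairs would be candidates for manual review rather than automatic deletion, since a violation may point to a faulty rule as easily as to a faulty annotation.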

Automated generation of gene summaries at the Alliance of Genome Resources

Database, 2020

Short paragraphs that describe gene function, referred to as gene summaries, are valued by users of biological knowledgebases for the ease with which they convey key aspects of gene function. Manual curation of gene summaries, while desirable, is difficult for knowledgebases to sustain. We developed an algorithm that uses curated, structured gene data at the Alliance of Genome Resources (Alliance; www.alliancegenome.org) to automatically generate gene summaries that simulate natural language. The gene data used for this purpose include curated associations (annotations) to ontology terms from the Gene Ontology, Disease Ontology, model organism knowledgebase (MOK)-specific anatomy ontologies and Alliance orthology data. The method uses sentence templates for each data category included in the gene summary in order to build a natural language sentence from the list of terms associated with each gene. To improve readability of the summaries when numerous gene annotations are present, w...
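The template idea in this abstract, one sentence pattern per data category filled from a list of terms, might look roughly like the sketch below. The template wording, the category keys, and the function names are assumptions made for illustration; they are not taken from the Alliance algorithm.

```python
def join_terms(terms):
    """Join terms into an English list: 'a', 'a and b', 'a, b, and c'."""
    terms = list(terms)
    if len(terms) <= 1:
        return "".join(terms)
    if len(terms) == 2:
        return " and ".join(terms)
    return ", ".join(terms[:-1]) + ", and " + terms[-1]


# Hypothetical templates, one per data category (illustrative wording).
TEMPLATES = {
    "function": "Exhibits {terms}.",
    "process": "Involved in {terms}.",
    "component": "Localizes to {terms}.",
}


def build_summary(gene_data):
    """Build a summary from a dict mapping category -> list of terms."""
    sentences = []
    for category, template in TEMPLATES.items():
        terms = gene_data.get(category)
        if terms:  # skip categories with no annotations
            sentences.append(template.format(terms=join_terms(terms)))
    return " ".join(sentences)


# Hypothetical gene data, invented for illustration only.
example = {
    "function": ["protein kinase activity"],
    "process": ["DNA repair", "cell cycle"],
}
summary = build_summary(example)
print(summary)
```

Keeping the templates in a single mapping means a new data category can be added without touching the sentence-building logic.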

RNAcentral: a hub of information for non-coding RNA sequences

Nucleic Acids Research, 2018

RNAcentral is a comprehensive database of noncoding RNA (ncRNA) sequences, collating information on ncRNA sequences of all types from a broad range of organisms. We have recently added a new genome mapping pipeline that identifies genomic lo

Outreach and online training services at the Saccharomyces Genome Database

Database, 2017

The Saccharomyces Genome Database (SGD; www.yeastgenome.org), the primary genetics and genomics resource for the budding yeast S. cerevisiae, provides free public access to expertly curated information about the yeast genome and its gene products. As the central hub for the yeast research community, SGD engages in a variety of social outreach efforts to inform our users about new developments, promote collaboration, increase public awareness of the importance of yeast to biomedical research, and facilitate scientific discovery. Here we describe these various outreach methods, from networking at scientific conferences to the use of online media such as blog posts and webinars, and include our perspectives on the benefits provided by outreach activities for model organism databases.

Detection of Fermentation-Related Microorganisms

The Saccharomyces Genome Database Variant Viewer

Nucleic Acids Research, Jan 17, 2015

The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. In recent years, we have moved toward increased representation of sequence variation and allelic differences within S. cerevisiae. The publication of numerous additional genomes has motivated the creation of new tools for their annotation and analysis. Here we present the Variant Viewer: a dynamic open-source web application for the visualization of genomic and proteomic differences. Multiple sequence alignments have been constructed across high quality genome sequences from 11 different S. cerevisiae strains and stored in the SGD. The alignments and summaries are encoded in JSON and used to create a two-tiered dynamic view of the budding yeast pan-genome, available at http://www.yeastgenome.org/variant-viewer.

Correction: AGAPE (Automated Genome Analysis PipelinE) for Pan-Genome Analysis of Saccharomyces cerevisiae

PLOS ONE, 2015

There are missing Author Contributions. The correct contributions are: Conceived and designed the experiments: GS MS JMC. Performed the experiments: GS JG KC BD. Analyzed the data: GS BJAD JD. Contributed reagents/materials/analysis tools: GS JG KC BD. Wrote the paper: GS BJAD JD SE BD JMC. There is an omission in the Acknowledgments. The following sentence should be included in the Acknowledgments: We thank SGD Project staff for the creation of the high quality and detailed database of S. cerevisiae genes and their products and Webb Miller for helpful comments. Illumina sequencing services were performed by the Stanford Center for Genomics and Personalized Medicine.

AGAPE (Automated Genome Analysis PipelinE) for Pan-Genome Analysis of Saccharomyces cerevisiae

PLOS ONE, 2015

The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced the genomes of 25 strains that are commonly used in the yeast research community using advanced sequencing technology at high quality. We also developed a pipeline for automated pan-genome analysis, which ...

The Reference Genome Sequence of Saccharomyces cerevisiae: Then and Now

The genome of the budding yeast Saccharomyces cerevisiae was the first completely sequenced from a eukaryote. It was released in 1996 as the work of a worldwide effort of hundreds of researchers. In the time since, the yeast genome has been intensively studied by geneticists, molecular biologists, and computational scientists all over the world. Maintenance and annotation of the genome sequence have long been provided by the Saccharomyces Genome Database, one of the original model organism databases. To deepen our understanding of the eukaryotic genome, the S. cerevisiae strain S288C reference genome sequence was updated recently in its first major update since 1996. The new version, called "S288C 2010," was determined from a single yeast colony using modern sequencing technologies and serves as the anchor for further innovations in yeast genomic science.

KEYWORDS: Saccharomyces cerevisiae; model organism; reference sequence; genome release; S288C

Research regarding the genetics of yeast began in earnest during the 1930s and 1940s, with the pioneering work of Øjvind Winge (Szybalski 2001) and the work of Carl Lindegren (Lindegren 1949). The 16 chromosomes of Saccharomyces cerevisiae comprise the first completely finished eukaryotic genome and were sequenced in the early 1990s by an international consortium of researchers from 19 countries working in 94 laboratories using several different sequencing methods and technologies (Goffeau et al. 1996). The genome sequence is that of strain background S288C, and the strains used for the sequencing were predominantly AB972 (ATCC 76269) and FY1679 (ATCC 96604), two strains isogenic with S288C. Some sections of chromosome III were sequenced from XJ24-4a, A364A (ATCC 204626), and DC5 (ATCC 64665), and a small portion of chromosome XIV was taken from strain A364A (Table 1).

Here, we recount the genealogical history of S288C and the key derivative strains AB972 and FY1679. We also discuss the early S. cerevisiae sequencing efforts of the 1990s. Finally, we describe the resequencing and update of the S. cerevisiae reference genome.

MATERIALS AND METHODS

Provenance of S288C

S288C is a common gal2 mutant haploid laboratory strain with a long history of use in genetic and molecular biology studies. S288C has a complex genealogy; it is a contrived strain produced through numerous deliberate crosses, first by Carl Lindegren, and in later years by Robert Mortimer (Figure 1). Almost 90% of the S288C gene pool is from strain EM93, isolated by Emil Mrak in 1938 from a rotting fig collected outside the town of Merced in California's Central Valley (Mortimer and Johnston 1986). Lindegren obtained Mrak's EM93 for use in a laborious project to develop fertile breeding stocks for his genetic studies concerning the fermentation of different carbohydrates (Lindegren 1949). Lindegren obtained from L. J. Wickerham a culture

Gene function, metabolic pathways and comparative genomics in yeast

Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003

The budding yeast, Saccharomyces cerevisiae, has been experimentally manipulated for several decades. Much of the information generated is available in the Saccharomyces Genome Database (SGD, http://www.yeastgenome.org/). SGD contains large datasets of both genomic and proteomic information, as well as tools for data analysis. This paper will highlight three datasets that are maintained by SGD. First, a large dataset of hand-curated information is provided in machine readable format for each gene of the Saccharomyces genome. These hand-curated annotations use the Gene Ontology (GO) controlled vocabularies for Biological Process, Molecular Function and Cellular Component and each contains categorical evidence codes and literature references. A second area of focus is on metabolic pathways. A new dataset of hand-curated information on metabolic pathways within budding yeast was released in May 2003. This resource can be searched to view biochemical reactions and pathways and their component gene products. This resource also maps data from genome-wide expression analyses onto the pathway overview providing a visualization of the changes in gene expression in the context of cellular metabolism. These pathways are created and edited using the Pathway Tools software but the content is reviewed and updated by SGD. A third dataset has recently become available as the result of two comparative genomic analyses. Two groups sequenced the genomes of several yeasts closely related to S. cerevisiae, and then completed a gene-by-gene comparison of these genomes. These genome comparisons were combined with available experimental evidence by SGD. Using these data, the annotations for the S. cerevisiae reference genome were improved. All these datasets are freely available from the SGD ftp site (see Online Resources section).

Genetic Analysis of Eutypa Strains from California Supports the Presence of Two Pathogenic Species

Phytopathology®, 1999

Eutypa dieback is a perennial canker disease that adversely affects grape (Vitis vinifera) production throughout the world. The causal agent has been known as either Eutypa armeniacae or E. lata, and it has been unclear whether the two taxa are separate species. We analyzed 115 isolates of Eutypa and conspecific strains, including 106 from California, using amplified fragment length polymorphism (AFLP) and sequence analysis of the ribosomal DNA (rDNA) internal transcribed spacer (ITS) sequence. Strains from cultivated plant species exhibited an average genetic distance of 0.34, as calculated by the DICE coefficient (NTSYS-pc software). An unweighted pair-group method with arithmetic averages dendrogram revealed a genetically distinct (distance of 0.73) group of Eutypa strains from valley oak (Quercus lobata) and madrone (Arbutus menziesii) and a strain from grape. Analysis of rDNA ITS sequences strongly supported the genetically distinct cluster detected in the AFLP data. Combined d...
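For readers unfamiliar with the Dice coefficient mentioned in this abstract, here is a minimal sketch of how a Dice-based distance between two AFLP band profiles can be computed. It assumes each strain is represented as the set of bands scored as present and that distance is taken as one minus the Dice similarity; how NTSYS-pc was configured in the study is not stated in the abstract, and the band identifiers below are invented.

```python
def dice_distance(bands_x, bands_y):
    """Return 1 - Dice similarity for two sets of present bands.

    Dice similarity = 2 * |shared| / (|x| + |y|), so identical profiles
    give a distance of 0.0 and profiles with no shared band give 1.0.
    """
    total = len(bands_x) + len(bands_y)
    if total == 0:
        raise ValueError("at least one profile must contain a band")
    shared = len(bands_x & bands_y)
    return 1.0 - (2.0 * shared) / total


# Hypothetical band profiles (band identifiers are arbitrary labels).
strain_1 = {1, 2, 3, 4}
strain_2 = {3, 4, 5, 6}

distance = dice_distance(strain_1, strain_2)
print(distance)
```

A matrix of such pairwise distances is the usual input to a clustering method such as the unweighted pair-group method with arithmetic averages named in the abstract.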

Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms

Nucleic Acids Research, 2004

The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/), a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, has recently developed several new resources that allow the comparison and integration of information on a genome-wide scale, enabling the user not only to find detailed information about individual genes, but also to make connections across groups of genes with common features and across different species. The Fungal Alignment Viewer displays alignments of sequences from multiple fungal genomes, while the Sequence Similarity Query tool displays PSI-BLAST alignments of each S. cerevisiae protein with similar proteins from any species whose sequences are contained in the non-redundant (nr) protein data set at NCBI. The Yeast Biochemical Pathways tool integrates groups of genes by their common roles in metabolism and displays the metabolic pathways in a graphical form. Finally, the Find Chromosomal Features search interface provides a versatile tool for querying multiple types of information in SGD.

Saccharomyces Genome Database (SGD) provides biochemical and structural information for budding yeast proteins

Nucleic Acids Research, 2003

The Saccharomyces Genome Database (SGD: http://genome-www.stanford.edu/Saccharomyces/) has recently developed new resources to provide more complete information about proteins from the budding yeast Saccharomyces cerevisiae. The PDB Homologs page provides structural information from the Protein Data Bank (PDB) about yeast proteins and/or their homologs. SGD has also created a resource that utilizes the eMOTIF database for motif information about a given protein. A third new resource is the Protein Information page, which contains protein physical and chemical properties, such as molecular weight and hydropathicity scores, predicted from the translated ORF sequence.

Fungal BLAST and Model Organism BLASTP Best Hits: new comparison resources at the Saccharomyces Genome Database (SGD)

Nucleic Acids Research, 2004

The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is a scientific database of gene, protein and genomic information for the yeast Saccharomyces cerevisiae. SGD has recently developed two new resources that facilitate nucleotide and protein sequence comparisons between S. cerevisiae and other organisms. The Fungal BLAST tool provides directed searches against all fungal nucleotide and protein sequences available from GenBank, divided into categories according to organism, status of completeness and annotation, and source. The Model Organism BLASTP Best Hits resource displays, for each S. cerevisiae protein, the single most similar protein from several model organisms and presents links to the database pages of those proteins, facilitating access to curated information about potential orthologs of yeast proteins.

The Gene Ontology: enhancements for 2011

Nucleic Acids Research, 2011

The Gene Ontology (GO) (http://www.geneontology.org) is a community bioinformatics resource that represents gene product function through the use of structured, controlled vocabularies. The number of GO annotations of gene products has increased due to curation efforts among GO Consortium (GOC) groups, including focused literature-based annotation and ortholog-based functional inference. The GO ontologies continue to expand and improve as a result of targeted ontology development, including the introduction of computable logical definitions and development of new tools for the streamlined addition of terms to the ontology. The GOC continues to support its user community through the use of e-mail lists, social media and web-based resources.

Research paper thumbnail of Saccharomyces genome database update: server architecture, pan-genome nomenclature, and external resources

Genetics

As one of the first model organism knowledgebases, Saccharomyces Genome Database (SGD) has been s... more As one of the first model organism knowledgebases, Saccharomyces Genome Database (SGD) has been supporting the scientific research community since 1993. As technologies and research evolve, so does SGD: from updates in software architecture, to curation of novel data types, to incorporation of data from, and collaboration with, other knowledgebases. We are continuing to make steps toward providing the community with an S. cerevisiae pan-genome. Here, we describe software upgrades, a new nomenclature system for genes not found in the reference strain, and additions to gene pages. With these improvements, we aim to remain a leading resource for students, researchers, and the broader scientific community.

Research paper thumbnail of Gene Ontology annotation at SGD

<b>Copyright information:</b>Taken from "Genome Snapshot: a new resource at the ... more <b>Copyright information:</b>Taken from "Genome Snapshot: a new resource at the Genome Database (SGD) presenting an overview of the genome"Nucleic Acids Research 2005;34(Database issue):D442-D445.Published online 28 Dec 2005PMCID:PMC1347479.© The Author 2006. Published by Oxford University Press. All rights reserved () Summary of GO annotations (). The column, 'Total Number of Annotations', refers to the total number of gene products (protein and RNA gene products) currently annotated to one or more terms (other than 'unknown') in each of the three GO ontologies: Biological Process, Molecular Function and Cellular Component. The number of gene products annotated to 'unknown' for any ontology is provided in the second column. The third column offers links to the graphs shown in . () Distribution of gene products by process, function and component (). Shown are percentages of gene products annotated to a specific term that maps up the ontology to a yeast GO Slim term. The yeast GO Slim is a high-level subset of GO terms that allows grouping of genes into broad categories (see text for details). Annotations to 'unknown' are excluded. (Note that the Cellular Component graph is not shown.)

Research paper thumbnail of Chromosomal features annotated in the genome at SGD

<b>Copyright information:</b>Taken from "Genome Snapshot: a new resource at the ... more <b>Copyright information:</b>Taken from "Genome Snapshot: a new resource at the Genome Database (SGD) presenting an overview of the genome"Nucleic Acids Research 2005;34(Database issue):D442-D445.Published online 28 Dec 2005PMCID:PMC1347479.© The Author 2006. Published by Oxford University Press. All rights reserved () Graphical View of Protein Coding Genes (). ORFs are classified by SGD as 'Verified', 'Uncharacterized' and 'Dubious'. 'Verified' ORFs are those for which experimental evidence demonstrates that a gene product is produced in . 'Uncharacterized' ORFs have orthologs in at least one other species and are likely to encode proteins although experimental proof has not yet been published. 'Dubious' ORFs are those unlikely to encode a protein because they are not conserved in closely related species, and because no data exist demonstrating that a protein is produced. () The Genome Inventory (). A total count of each feature type in the genome as well as a count of each feature type on each chromosome is displayed in this table. (Note that data for chromosomes IV through XIV are not shown in the figure.) Definitions for each feature type can be found in SGD's Glossary. Clicking on any feature type initiates a search that uses SGD's Advanced Search tool () to find all the features in SGD of that type. This table also lists the current length of each chromosome in base pair.

Research paper thumbnail of Summary table of the Model Organism BLASTP Best Hits page

<b>Copyright information:</b>Taken from "Fungal BLAST and Model Organism BLASTP ... more <b>Copyright information:</b>Taken from "Fungal BLAST and Model Organism BLASTP Best Hits: new comparison resources at the Genome Database (SGD)"Nucleic Acids Research 2004 ;33(Database Issue):D374-D377.Published online 17 Dec 2004 PMCID:PMC539977.Copyright © 2005 Oxford University Press A summary table similar to this representative table is generated for each locus having a 'hit' in one or more model organism databases. In this figure, protein Yer179wp results are shown as an example. Columns of the table are as follows: species of the hit protein; name of the database for the hit protein, hyperlinked to the home page of that database; name of the hit protein from its database, hyperlinked to the database page of that protein or its gene; description of the hit protein, as found in its database; -value (expectation value), reflecting the number of hits expected to be found by chance; percent aligned, showing the percentage of the length of the query protein over which it aligns with the hit protein; source range, showing the amino acid coordinates of the region of the query protein that was aligned; and target range, showing the amino acid coordinates of the region of the 'hit' protein that was aligned with the query protein.

Research paper thumbnail of DOI: 10.1093/nar/gkh033

tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from ... more tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms

Research paper thumbnail of Term Matrix: A novel Gene Ontology annotation quality control system based on ontology term co-annotation patterns

Biological processes are accomplished by the coordinated action of gene products. Gene products o... more Biological processes are accomplished by the coordinated action of gene products. Gene products often participate in multiple processes, and can therefore be annotated to multiple Gene Ontology (GO) terms. Nevertheless, processes that are functionally, temporally, and/or spatially distant may have few gene products in common, and co-annotation to unrelated processes likely reflects errors in literature curation, ontology structure, or automated annotation pipelines. We have developed an annotation quality control workflow that uses rules based on mutually exclusive processes to detect annotation errors, based on and validated by case studies including the three we present here: fission yeast protein-coding gene annotations over time; annotations for cohesin complex subunits in human and model species; and annotations using a selected set of GO biological process terms in human and five model species. For each case study, we reviewed available GO annotations, identified pairs of biol...

Research paper thumbnail of Automated generation of gene summaries at the Alliance of Genome Resources

Database, 2020

Short paragraphs that describe gene function, referred to as gene summaries, are valued by users ... more Short paragraphs that describe gene function, referred to as gene summaries, are valued by users of biological knowledgebases for the ease with which they convey key aspects of gene function. Manual curation of gene summaries, while desirable, is difficult for knowledgebases to sustain. We developed an algorithm that uses curated, structured gene data at the Alliance of Genome Resources (Alliance; www.alliancegenome.org) to automatically generate gene summaries that simulate natural language. The gene data used for this purpose include curated associations (annotations) to ontology terms from the Gene Ontology, Disease Ontology, model organism knowledgebase (MOK)-specific anatomy ontologies and Alliance orthology data. The method uses sentence templates for each data category included in the gene summary in order to build a natural language sentence from the list of terms associated with each gene. To improve readability of the summaries when numerous gene annotations are present, w...

RNAcentral: a hub of information for non-coding RNA sequences

Nucleic Acids Research, 2018

RNAcentral is a comprehensive database of noncoding RNA (ncRNA) sequences, collating information on ncRNA sequences of all types from a broad range of organisms. We have recently added a new genome mapping pipeline that identifies genomic lo

Outreach and online training services at the Saccharomyces Genome Database

Database, 2017

The Saccharomyces Genome Database (SGD; www.yeastgenome.org), the primary genetics and genomics resource for the budding yeast S. cerevisiae, provides free public access to expertly curated information about the yeast genome and its gene products. As the central hub for the yeast research community, SGD engages in a variety of social outreach efforts to inform our users about new developments, promote collaboration, increase public awareness of the importance of yeast to biomedical research, and facilitate scientific discovery. Here we describe these various outreach methods, from networking at scientific conferences to the use of online media such as blog posts and webinars, and include our perspectives on the benefits provided by outreach activities for model organism databases.

Detection of Fermentation-Related Microorganisms

The Saccharomyces Genome Database Variant Viewer

Nucleic Acids Research, Jan 17, 2015

The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. In recent years, we have moved toward increased representation of sequence variation and allelic differences within S. cerevisiae. The publication of numerous additional genomes has motivated the creation of new tools for their annotation and analysis. Here we present the Variant Viewer: a dynamic open-source web application for the visualization of genomic and proteomic differences. Multiple sequence alignments have been constructed across high quality genome sequences from 11 different S. cerevisiae strains and stored in the SGD. The alignments and summaries are encoded in JSON and used to create a two-tiered dynamic view of the budding yeast pan-genome, available at http://www.yeastgenome.org/variant-viewer.

Correction: AGAPE (Automated Genome Analysis PipelinE) for Pan-Genome Analysis of Saccharomyces cerevisiae

PLOS ONE, 2015

There are missing Author Contributions. The correct contributions are: Conceived and designed the experiments: GS MS JMC. Performed the experiments: GS JG KC BD. Analyzed the data: GS BJAD JD. Contributed reagents/materials/analysis tools: GS JG KC BD. Wrote the paper: GS BJAD JD SE BD JMC. There is an omission in the Acknowledgments. The following sentence should be included in the Acknowledgments: We thank SGD Project staff for the creation of the high quality and detailed database of S. cerevisiae genes and their products and Webb Miller for helpful comments. Illumina sequencing services were performed by the Stanford Center for Genomics and Personalized Medicine.

AGAPE (Automated Genome Analysis PipelinE) for Pan-Genome Analysis of Saccharomyces cerevisiae

PLOS ONE, 2015

The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced the genomes of 25 strains that are commonly used in the yeast research community using advanced sequencing technology at high quality. We also developed a pipeline for automated pan-genome analysis, which ...

The Reference Genome Sequence of Saccharomyces cerevisiae: Then and Now

The genome of the budding yeast Saccharomyces cerevisiae was the first completely sequenced from a eukaryote. It was released in 1996 as the work of a worldwide effort of hundreds of researchers. In the time since, the yeast genome has been intensively studied by geneticists, molecular biologists, and computational scientists all over the world. Maintenance and annotation of the genome sequence have long been provided by the Saccharomyces Genome Database, one of the original model organism databases. To deepen our understanding of the eukaryotic genome, the S. cerevisiae strain S288C reference genome sequence was updated recently in its first major update since 1996. The new version, called "S288C 2010," was determined from a single yeast colony using modern sequencing technologies and serves as the anchor for further innovations in yeast genomic science.

KEYWORDS: Saccharomyces cerevisiae; model organism; reference sequence; genome release; S288C

Research regarding the genetics of yeast began in earnest during the 1930s and 1940s, with the pioneering work of Øjvind Winge (Szybalski 2001) and the work of Carl Lindegren (Lindegren 1949). The 16 chromosomes of Saccharomyces cerevisiae comprise the first completely finished eukaryotic genome and were sequenced in the early 1990s by an international consortium of researchers from 19 countries working in 94 laboratories using several different sequencing methods and technologies (Goffeau et al. 1996). The genome sequence is that of strain background S288C, and the strains used for the sequencing were predominantly AB972 (ATCC 76269) and FY1679 (ATCC 96604), two strains isogenic with S288C. Some sections of chromosome III were sequenced from XJ24-4a, A364A (ATCC 204626), and DC5 (ATCC 64665), and a small portion of chromosome XIV was taken from strain A364A (Table 1).

Here, we recount the genealogical history of S288C and the key derivative strains AB972 and FY1679. We also discuss the early S. cerevisiae sequencing efforts of the 1990s. Finally, we describe the resequencing and update of the S. cerevisiae reference genome.

MATERIALS AND METHODS

Provenance of S288C

S288C is a common gal2 mutant haploid laboratory strain with a long history of use in genetic and molecular biology studies. S288C has a complex genealogy; it is a contrived strain produced through numerous deliberate crosses, first by Carl Lindegren, and in later years by Robert Mortimer (Figure 1). Almost 90% of the S288C gene pool is from strain EM93, isolated by Emil Mrak in 1938 from a rotting fig collected outside the town of Merced in California's Central Valley (Mortimer and Johnston 1986). Lindegren obtained Mrak's EM93 for use in a laborious project to develop fertile breeding stocks for his genetic studies concerning the fermentation of different carbohydrates (Lindegren 1949). Lindegren obtained from L. J. Wickerham a culture

Gene function, metabolic pathways and comparative genomics in yeast

Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003

The budding yeast, Saccharomyces cerevisiae, has been experimentally manipulated for several decades. Much of the information generated is available in the Saccharomyces Genome Database (SGD, http://www.yeastgenome.org/). SGD contains large datasets of both genomic and proteomic information, as well as tools for data analysis. This paper will highlight three datasets that are maintained by SGD. First, a large dataset of hand-curated information is provided in machine readable format for each gene of the Saccharomyces genome. These hand-curated annotations use the Gene Ontology (GO) controlled vocabularies for Biological Process, Molecular Function and Cellular Component and each contains categorical evidence codes and literature references. A second area of focus is on metabolic pathways. A new dataset of hand-curated information on metabolic pathways within budding yeast was released in May 2003. This resource can be searched to view biochemical reactions and pathways and their component gene products. This resource also maps data from genome-wide expression analyses onto the pathway overview providing a visualization of the changes in gene expression in the context of cellular metabolism. These pathways are created and edited using the Pathway Tools software but the content is reviewed and updated by SGD. A third dataset has recently become available as the result of two comparative genomic analyses. Two groups sequenced the genomes of several yeasts closely related to S. cerevisiae, and then completed a gene-by-gene comparison of these genomes. These genome comparisons were combined with available experimental evidence by SGD. Using these data the annotations for the S. cerevisiae reference genome were improved. All these datasets are freely available from the SGD ftp site (see Online Resources section).

Genetic Analysis of Eutypa Strains from California Supports the Presence of Two Pathogenic Species

Phytopathology®, 1999

Eutypa dieback is a perennial canker disease that adversely affects grape (Vitis vinifera) production throughout the world. The causal agent has been known as either Eutypa armeniacae or E. lata, and it has been unclear whether the two taxa are separate species. We analyzed 115 isolates of Eutypa and conspecific strains, including 106 from California, using amplified fragment length polymorphism (AFLP) and sequence analysis of the ribosomal DNA (rDNA) internal transcribed spacer (ITS) sequence. Strains from cultivated plant species exhibited an average genetic distance of 0.34, as calculated by the DICE coefficient (NTSYS-pc software). An unweighted pair-group method with arithmetic averages dendrogram revealed a genetically distinct (distance of 0.73) group of Eutypa strains from valley oak (Quercus lobata) and madrone (Arbutus menziesii) and a strain from grape. Analysis of rDNA ITS sequences strongly supported the genetically distinct cluster detected in the AFLP data. Combined d...

Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms

Nucleic Acids Research, 2004

The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/), a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, has recently developed several new resources that allow the comparison and integration of information on a genome-wide scale, enabling the user not only to find detailed information about individual genes, but also to make connections across groups of genes with common features and across different species. The Fungal Alignment Viewer displays alignments of sequences from multiple fungal genomes, while the Sequence Similarity Query tool displays PSI-BLAST alignments of each S. cerevisiae protein with similar proteins from any species whose sequences are contained in the non-redundant (nr) protein data set at NCBI. The Yeast Biochemical Pathways tool integrates groups of genes by their common roles in metabolism and displays the metabolic pathways in a graphical form. Finally, the Find Chromosomal Features search interface provides a versatile tool for querying multiple types of information in SGD.

Saccharomyces Genome Database (SGD) provides biochemical and structural information for budding yeast proteins

Nucleic Acids Research, 2003

The Saccharomyces Genome Database (SGD: http://genome-www.stanford.edu/Saccharomyces/) has recently developed new resources to provide more complete information about proteins from the budding yeast Saccharomyces cerevisiae. The PDB Homologs page provides structural information from the Protein Data Bank (PDB) about yeast proteins and/or their homologs. SGD has also created a resource that utilizes the eMOTIF database for motif information about a given protein. A third new resource is the Protein Information page, which contains protein physical and chemical properties, such as molecular weight and hydropathicity scores, predicted from the translated ORF sequence.

Fungal BLAST and Model Organism BLASTP Best Hits: new comparison resources at the Saccharomyces Genome Database (SGD)

Nucleic Acids Research, 2004

The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is a scientific database of gene, protein and genomic information for the yeast Saccharomyces cerevisiae. SGD has recently developed two new resources that facilitate nucleotide and protein sequence comparisons between S. cerevisiae and other organisms. The Fungal BLAST tool provides directed searches against all fungal nucleotide and protein sequences available from GenBank, divided into categories according to organism, status of completeness and annotation, and source. The Model Organism BLASTP Best Hits resource displays, for each S. cerevisiae protein, the single most similar protein from several model organisms and presents links to the database pages of those proteins, facilitating access to curated information about potential orthologs of yeast proteins.

The Gene Ontology: enhancements for 2011

Nucleic Acids Research, 2011

The Gene Ontology (GO) (http://www.geneontology.org) is a community bioinformatics resource that represents gene product function through the use of structured, controlled vocabularies. The number of GO annotations of gene products has increased due to curation efforts among GO Consortium (GOC) groups, including focused literature-based annotation and ortholog-based functional inference. The GO ontologies continue to expand and improve as a result of targeted ontology development, including the introduction of computable logical definitions and development of new tools for the streamlined addition of terms to the ontology. The GOC continues to support its user community through the use of e-mail lists, social media and web-based resources.