Ensembl 2004

*To whom correspondence should be addressed. Tel: +44 1223 494983; Fax: +44 1223 494919; Email: th@sanger.ac.uk

Published: 01 January 2004

E. Birney, D. Andrews, P. Bevan, M. Caccamo, G. Cameron, Y. Chen, L. Clarke, G. Coates, T. Cox, J. Cuff, V. Curwen, T. Cutts, T. Down, R. Durbin, E. Eyras, X. M. Fernandez‐Suarez, P. Gane, B. Gibbins, J. Gilbert, M. Hammond, H. Hotz, V. Iyer, A. Kahari, K. Jekosch, A. Kasprzyk, D. Keefe, S. Keenan, H. Lehvaslaiho, G. McVicker, C. Melsopp, P. Meidl, E. Mongin, R. Pettett, S. Potter, G. Proctor, M. Rae, S. Searle, G. Slater, D. Smedley, J. Smith, W. Spooner, A. Stabenau, J. Stalker, R. Storey, A. Ureta‐Vidal, C. Woodwark, M. Clamp, T. Hubbard, Ensembl 2004, Nucleic Acids Research, Volume 32, Issue suppl_1, 1 January 2004, Pages D468–D470, https://doi.org/10.1093/nar/gkh038

Abstract

The Ensembl ( http://www.ensembl.org/ ) database project provides a bioinformatics framework to organize biology around the sequences of large genomes. It is a comprehensive and integrated source of annotation of large genome sequences, available via interactive website, web services or flat files. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. The facilities of the system range from sequence analysis to data storage and visualization and installations exist around the world both in companies and at academic sites. With a total of nine genome sequences available from Ensembl and more genomes to follow, recent developments have focused mainly on closer integration between genomes and external data.

Received September 16, 2003; Accepted September 18, 2003

INTRODUCTION

Genome sequences provide a natural framework about which to organize biological data. In the short time in which they have been available, genome databases have proved invaluable resources to researchers. Ensembl, a joint project between the EBI and the Sanger Institute, provides one of the most popular sources of automatic analysis and integration of large genome sequence data. It now contains nine genomes: five vertebrates (human, mouse, rat, fugu and zebrafish), two worms (Caenorhabditis briggsae and Caenorhabditis elegans) and two insects (Drosophila melanogaster and Anopheles gambiae). Ensembl has been involved in the continued analysis of human data, analysis of the mouse genome ( 1 ), analysis of the A.gambiae genome ( 2 ) and the C.briggsae genome. Ensembl gene predictions have also formed the core set of annotations for the forthcoming rat genome analysis. Ensembl remains an entirely open project with all data freely available and code openly licensed. Ensembl has developed a strong developer network of users in both academia and industry and is being installed both to mirror Ensembl‐generated data and to be used as a software foundation for user projects. Several papers describing specific aspects of Ensembl have recently been submitted ( 3–6 ). This paper briefly outlines some of the developments of the project since the report last year ( 7 ).

NEW DEVELOPMENTS

Regular update cycle

To streamline the handling of this ever changing and increasing amount of data, from February 2003, Ensembl adopted a monthly release cycle, allowing improvements to the web interface and database schema to be released monthly, with new data being incorporated as it became available. Database dumps and flat files are released in sync with updates to the website.

Pre‐ensembl website

A full Ensembl annotation of a genome takes some weeks to complete. To provide users with immediate access to newly released genome assemblies Ensembl now offers a pre‐ensembl website ( http://pre.ensembl.org/ ) with limited functionality. This can be made available only a few days after the release of the genome and provides BLAST and SSAHA searching, placement of all known proteins, repeat masking and ab initio gene predictions.

Otter: an extended Ensembl schema for gene curation

During the year, Ensembl developed a new software component called Otter. Otter is an Ensembl database, but with an extended schema and an associated client/server system to support manual gene annotation. The Sanger Institute vertebrate annotation system is being migrated to use Otter, which will then put both automatic (Ensembl) and manual annotation under a single software framework and help greatly with subsequent data integration. The Otter server communicates with annotation clients via an XML format, which allows easy exchange and verification of annotation generated with different systems.

The Apollo genome browser ( 4 ), a GMOD component ( http://www.gmod.org/ ) under joint development by Ensembl and the Berkeley Drosophila genome project ( http://www.bdgp.org/ ), can be used as an annotation client for Otter. Apollo has also been extended to display data from DAS (distributed annotation system) servers. As an editor, Apollo has the advantage of being able to view and edit annotation in a comparative genomic context: by connecting to two Otter servers (e.g. human and mouse) and an Ensembl compara database containing pre‐calculated synteny information between the two genomes, it is possible to view annotation for both genomes and edit each in the context of the synteny with the other.

ENHANCEMENTS

Other than these new developments, there have been continuous enhancements to existing features of Ensembl over the year. Users are encouraged to read the ‘What’s new’ pages accompanying every release, as user interface improvements are frequently subtle but can save researchers considerable time. Some of the more significant improvements are listed here.

Ensembl genome annotation and comparative analysis

The quality of the annotation produced by the core automatic gene building system has continued to improve, with builds delivered on seven genome assemblies during the year. The most recent is the first version of the finished human genome sequence (NCBI33) announced in April, which also has pseudogenes automatically predicted. In parallel with gene building, comparative analysis is now routinely carried out for each new assembly. DNA synteny is generated between human, mouse and rat, and putative gene orthologues are automatically generated between all five vertebrates, between the two worms and between the two insects.

Ensembl website

Last year’s move to the new schema enabled the development of significant enhancements to the Ensembl webviews. These include the addition of a fourth basepair level panel to Contigview, showing nucleotide, six frame amino acid translation and restriction enzyme site features. Additional pre‐processing of SNP data during the building of the Ensembl‐lite database (a denormalized database to speed web access), with respect to other annotation, has allowed Contigview, Transview and Protview to be extended to show SNPs against transcripts and their protein products, including labelling of synonymous and non‐synonymous coding SNPs. Other enhancements to Contigview include labelled syntenic blocks shown on the overview panel and access to a new interface, Dotterview, from DNA conservation tracks on the detailed view panel. Dotterview is a web interface to the program Dotter, showing a dotplot of DNA similarity by default over a 10 kb window in two genomes, with Ensembl annotation. The interface for adding DAS ( 8 ) sources to Contigview has continued to be developed, giving the user much greater control over display of each source.
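To make the distinction between synonymous and non‐synonymous coding SNPs concrete, the following is a minimal Python sketch and not Ensembl's own code. It assumes the standard genetic code, a forward‐strand coding sequence supplied as a plain string and a zero‐based SNP position; the function name classify_coding_snp is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
# Assumption: other translation tables (e.g. mitochondrial) are not handled.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(cds, pos, alt):
    """Hypothetical helper: label a SNP that falls inside a coding sequence.

    cds: coding sequence on the forward strand (length a multiple of 3)
    pos: zero-based position of the SNP within cds
    alt: alternative base observed at that position
    """
    cds = cds.upper()
    alt = alt.upper()
    offset = pos % 3                # position of the SNP within its codon
    start = pos - offset            # first base of the affected codon
    ref_codon = cds[start:start + 3]
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    # A SNP is synonymous when both codons encode the same amino acid.
    same = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same else "non-synonymous"


cds = "ATGGCTTAA"  # Met-Ala-stop
print(classify_coding_snp(cds, 5, "C"))  # GCT -> GCC, both alanine
print(classify_coding_snp(cds, 3, "A"))  # GCT -> ACT, alanine to threonine
```

A production pipeline would also need to handle reverse‐strand transcripts, codons split across exon boundaries and alternative transcripts of the same gene, none of which this sketch attempts.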

EnsemblMart: data mining for genomes

Ensembl has continued to import new externally generated data sets and resources into its system. These are frequently available in Contigview via the DAS source menu; however, many are also being incorporated into EnsemblMart as additional data mining indices. Examples include the STACK expression database eVOC nomenclature (collaboration with SANBI), rat QTLs, and microarray identifiers from Affymetrix and others. All of these data types are queryable via the Mart data mining interface, which has increased substantially in functionality over the year; it now has its own ‘What’s new’ web pages and includes features such as integration with the ArrayExpress microarray repository at EBI.

Ensembl software system

The flexibility of the components of the Ensembl software system is increasingly leading to their reuse elsewhere. Within the Sanger Institute alone, the Ensembl pipeline is being used to support gene curation by both the Wormbase and Havana (vertebrate annotation) groups. Havana is also in the process of making use of the Otter database for storing its gene annotation. The Ensembl website code has been reused to power the Vega website ( http://vega.sanger.ac.uk/ ), which shows curated annotation of vertebrate genomes collected from a number of annotation groups into a single database. The fact that Ensembl data are also being served via DAS servers ( 8 ) is encouraging data to be combined in novel ways to provide specialist data displays. The website code has already been reused to build Contigview‐like webviews of a virtual database composed entirely of different DAS sources.
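As an illustration of how a client might combine annotation from DAS sources, the sketch below builds a features request for a genomic segment and parses a response into simple records. It is a hedged example only: the URL layout and the element names (DASGFF, SEGMENT, FEATURE, TYPE, START, END, ORIENTATION) follow our reading of the DAS 1 conventions described in ( 8 ), while the server das.example.org, the source name demo and the sample response are invented for illustration and do not correspond to any real Ensembl service.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def das_features_url(server, source, segment, start, stop):
    """Build a DAS-style features request for one segment (assumed layout)."""
    query = urlencode({"segment": f"{segment}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


def parse_features(xml_text):
    """Extract features from a DASGFF-style document into plain dicts."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feat in segment.findall("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feat.get("id"),
                "type": feat.findtext("TYPE"),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
                "strand": feat.findtext("ORIENTATION"),
            })
    return features


# Invented sample response; a real client would fetch this over HTTP.
SAMPLE_RESPONSE = """<DASGFF>
  <GFF version="1.0" href="http://das.example.org/das/demo/features">
    <SEGMENT id="1" start="1000" stop="2000">
      <FEATURE id="feat1" label="example gene">
        <TYPE id="gene">gene</TYPE>
        <METHOD id="demo">demo</METHOD>
        <START>1200</START>
        <END>1800</END>
        <ORIENTATION>+</ORIENTATION>
      </FEATURE>
    </SEGMENT>
  </GFF>
</DASGFF>"""

print(das_features_url("http://das.example.org", "demo", "1", 1000, 2000))
for feature in parse_features(SAMPLE_RESPONSE):
    print(feature)
```

Because every source answers in the same segment‐based form, a viewer can merge the records returned by several servers for one region, which is the property that makes a display built entirely from DAS sources possible.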

FUTURE DIRECTIONS

Ensembl remains focused on providing a genome information infrastructure of use to many researchers, principally via the web. As well as providing the baseline annotation for a number of genomes, Ensembl is continuously trying to improve all aspects of its work, from software engineering through to data analysis. 2004 promises a number of new genomes (e.g. chicken, chimp and honey bee) but also continued technology and presentation improvements, such as new views of cross‐species data, organized around the putative gene orthologues predicted by the comparative analysis pipeline.

CONTACTING ENSEMBL

Ensembl is a joint project of the European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute (WTSI), both of which are located on the Wellcome Trust Genome Campus, Cambridge, UK. To receive announcements about updates, subscribe to the ‘announce’ mailing list by sending the message ‘subscribe ensembl‐announce’ to majordomo@ebi.ac.uk. To follow the day‐to‐day development of Ensembl, subscribe to the ‘development’ mailing list by sending ‘subscribe ensembl‐dev’ to the same address. Requests for information and support can be sent to helpdesk@ensembl.org, which is a fully supported helpdesk. Extensive additional documentation, including installation guides and tutorials on using both the software system and the web interface, can be found on the Ensembl website.

ACKNOWLEDGEMENTS

We are grateful to users of our website and the developers on our mailing lists for much useful feedback and discussion. The Ensembl project is funded principally by the Wellcome Trust with additional funding from EMBL and NIH‐NIAID.

References

1. Waterston,R.H., Lindblad‐Toh,K., Birney,E., Rogers,J., Abril,J.F., Agarwal,P., Agarwala,R., Ainscough,R., Alexandersson,M., An,P. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.

2. Holt,R.A., Subramanian,G.M., Halpern,A., Sutton,G.G., Charlab,R., Nusskern,D.R., Wincker,P., Clark,A.G., Ribeiro,J.M., Wides,R. et al. (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science, 298, 129–149.

3. Birney,E., Clamp,M.E. and Hubbard,T.J. (2002) Databases and tools for browsing genomes. Annu. Rev. Genom. Hum. Genet., 3, 293–310.

4. Lewis,S.E., Searle,S.M., Harris,N., Gibson,M., Lyer,V., Richter,J., Wiel,C., Bayraktaroglir,L., Birney,E., Crosby,M.A. et al. (2002) Apollo: a sequence annotation editor. Genome Biol., 3, RESEARCH0082.

5. Hoon,S., Ratnapu,K.K., Chia,J.M., Kumarasamy,B., Juguang,X., Clamp,M., Stabenau,A., Potter,S., Clarke,L. and Stupka,E. (2003) Biopipe: a flexible framework for protocol‐based bioinformatics analysis. Genome Res., 13, 1904–1915.

6. Clamp,M. (2003) The Jalview Java Alignment Editor. Bioinformatics, in press.

7. Clamp,M., Andrews,D., Barker,D., Bevan,P., Cameron,G., Chen,Y., Clark,L., Cox,T., Cuff,J., Curwen,V. et al. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res., 31, 38–42.

8. Dowell,R.D., Jokerst,R.M., Day,A., Eddy,S.R. and Stein,L. (2001) The Distributed Annotation System. BMC Bioinformatics, 2, 7.

Oxford University Press
