A new bioinformatics analysis tools framework at EMBL–EBI

Journal Article

European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

*To whom correspondence should be addressed. Tel: +44 1223 494423; Fax: +44 1223 494468; Email: rls@ebi.ac.uk

Received: 27 January 2010

Revision received: 06 April 2010

Mickael Goujon, Hamish McWilliam, Weizhong Li, Franck Valentin, Silvano Squizzato, Juri Paern, Rodrigo Lopez, A new bioinformatics analysis tools framework at EMBL–EBI, Nucleic Acids Research, Volume 38, Issue suppl_2, 1 July 2010, Pages W695–W699, https://doi.org/10.1093/nar/gkq313

Abstract

The EMBL-EBI provides access to various mainstream sequence analysis applications. These include sequence similarity search services such as BLAST, FASTA and InterProScan, and multiple sequence alignment tools such as ClustalW, T-Coffee and MUSCLE. Through the sequence similarity search services, users can search mainstream sequence databases such as EMBL-Bank and UniProt, and more than 2000 completed genomes and proteomes. We present here a new framework aimed at both novice and expert users that exposes novel methods of obtaining annotations and visualizing sequence analysis results through one uniform and consistent interface. These services are available over the web and via Web Services interfaces for users who require systematic access or want to interface with customized pipelines and workflows using common programming languages. The framework features novel result visualizations and integration of domain and functional predictions for protein database searches. It is available at http://www.ebi.ac.uk/Tools/sss for sequence similarity searches and at http://www.ebi.ac.uk/Tools/msa for multiple sequence alignments.

INTRODUCTION

Bioinformatics is a vast and complex multidisciplinary research area where numerous tools have been developed over the years to analyse constantly growing amounts of data. Since 1998, the European Bioinformatics Institute (EMBL–EBI) has provided public access to various mainstream sequence analysis applications (1,2). These include sequence similarity search services (http://www.ebi.ac.uk/Tools/similarity.html), such as FASTA (3), BLAST (4,5) and InterProScan (6), and multiple sequence alignment tools (http://www.ebi.ac.uk/Tools/sequence.html), such as ClustalW (7), T-Coffee (8), MUSCLE (9), Kalign (10) and MAFFT (11). These services are provided via a Perl-CGI job dispatcher framework for managing job submission and result representation. This infrastructure handled more than 16 million jobs during 2009. The popularity of these services has made it necessary to redesign the system in order to minimize maintenance and enhance the integration of features requested by users. A new and modular framework, called JDispatcher, has been developed to improve the accessibility and quality of the services relevant to the biological community.

JDispatcher framework

JDispatcher is aimed at both novice and expert users and exposes novel methods of obtaining annotations and visualizing sequence analysis results through one uniform and consistent interface. These services are available interactively over the web and via SOAP and REST interfaces for systematic or programmatic use. The new framework provides input validation to ensure successful job submissions, offers new visualization features to assist in the interpretation of results and uses the EBI search engine, EB-eye (12), to integrate relevant annotations.

A user can submit sequences using web forms that contain all supported parameters and their possible values. The different tools have been grouped into categories based on their purpose (Table 1).

Table 1.

Tools available in the JDispatcher framework

Category	Tool
Sequence Similarity Searches (sss)	psisearch, psiblast, ncbiblast, wublast, fasta, ssearch, ggsearch and glsearch
Multiple Sequence Alignments (msa)	clustalw2, tcoffee, kalign, muscle, mafft, and prank

Within a category, the tools share the same interface design, which uses well-established usability patterns, such as wizard-like steps to guide the user through the submission process. The interface uses decision trees to validate all the parameters required for a successful job submission. If the validation fails, the user is notified about which specific parameters or data are invalid, and the job is not submitted. Otherwise, JDispatcher assigns a unique job identifier and sends a request to a workload management system for the job to be executed. The identifier is then used to keep track of the tasks and to retrieve the results when they become available. The results of each job are kept for a maximum of 7 days.
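The submission flow described above can be illustrated with a minimal Python sketch. It is not the JDispatcher implementation: the rule table, parameter names and values, the `submit` function and the in-memory queue are all hypothetical, and a flat table of allowed values stands in for the decision trees mentioned in the text. Only the overall behaviour (reject with specific messages, otherwise assign an identifier and queue the job, with results expiring after 7 days) is taken from the article.

```python
import uuid
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # results are kept for at most 7 days (from the text)

# Hypothetical parameter rules for one tool; the real rule set is not given in the article.
RULES = {
    "stype": {"protein", "dna"},
    "database": {"uniprotkb_swissprot", "em_rel"},
}


def validate(params):
    """Return a list of problems; an empty list means the job may be submitted."""
    problems = []
    for name, allowed in RULES.items():
        if name not in params:
            problems.append(f"missing parameter: {name}")
        elif params[name] not in allowed:
            problems.append(f"invalid value for {name}: {params[name]!r}")
    if not params.get("sequence", "").strip():
        problems.append("missing parameter: sequence")
    return problems


def submit(tool, params, queue, now=None):
    """Validate, then either report the problems or queue the job under a unique identifier."""
    problems = validate(params)
    if problems:
        # The job is not submitted; the caller learns which parameters are invalid.
        return {"accepted": False, "problems": problems}
    now = now or datetime.now(timezone.utc)
    job_id = f"{tool}-{uuid.uuid4().hex}"  # hypothetical identifier format
    # The list stands in for the request sent to the workload management system.
    queue.append({"job_id": job_id, "tool": tool, "params": params,
                  "expires": now + RETENTION})
    return {"accepted": True, "job_id": job_id}
```

The returned identifier is the only handle the caller needs afterwards, which mirrors how the text describes tracking a job and retrieving its results.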

Results representation

The results of an analysis are made available using various representations (e.g. HTML tables, XML files, images, etc.). In order to produce these representations, each result is converted into a generic category-specific model that is used by a renderer that generates the requested output. The renderers are specific to the model and not to the tool, and thus are available across all the tools in a category. The availability of multiple views of the same data helps the user to interpret and compare results from different tools within a category.
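As a rough illustration of this separation, the Python sketch below converts the output of one tool into a shared model and renders that model in two formats. The `Hit` fields, the tab-separated input layout and the renderer names are assumptions made for the example; the article does not describe the actual model or renderer interfaces.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Hit:
    """Hypothetical category-level model for one similarity search match."""
    accession: str
    score: float
    evalue: float


def parse_tabular(raw):
    """Hypothetical tool-specific parser: tab-separated 'accession, score, evalue' lines."""
    hits = []
    for line in raw.strip().splitlines():
        accession, score, evalue = line.split("\t")
        hits.append(Hit(accession, float(score), float(evalue)))
    return hits


def render_text(hits):
    """Render the model as a tab-separated table with a header row."""
    rows = [f"{h.accession}\t{h.score:g}\t{h.evalue:g}" for h in hits]
    return "\n".join(["accession\tscore\tevalue"] + rows)


def render_json(hits):
    """Render the same model as a JSON array."""
    return json.dumps([asdict(h) for h in hits])


# Renderers are keyed by output format, not by tool.
RENDERERS = {"text": render_text, "json": render_json}


def render(hits, fmt):
    return RENDERERS[fmt](hits)
```

Because the renderers only see `Hit` objects, supporting another tool in the same category would need a new parser but no new renderer, which is the property the paragraph above attributes to the design.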

Sequence search algorithms produce only limited annotation for the hits they report. With the new framework it is possible to navigate hits and access related information. Figure 1 shows the ‘Summary Table’ of an SSEARCH of mouse glomulin (UniProtKB/Swiss-Prot GLMN_MOUSE), which is essential for the development of the vascular system, against the UniProtKB/Swiss-Prot database (13). Each column heading has clickable arrows that allow the user to sort the results according to the values in the columns [e.g. sequence length, score, percentage identity, positives and E()-value]. Each match is enriched with links to cross-references and related information in various data resources (e.g. gene expression, genomic sequences, structures, function, ontologies and literature citations). Optionally, the alignment from the search and/or the full annotation for the selected matches can be displayed. A selection of hits can also be downloaded in FASTA format.

Figure 1.

Summary Table view of the results obtained when searching the sequence of mouse glomulin against the UniProtKB/Swiss-Prot database using SSEARCH.

Figure 2 shows the ‘Visual Output’ obtained from searches using SSEARCH and NCBI BLAST of the glomulin sequence against UniProtKB/Swiss-Prot using default parameters. Comparison of the two images reveals notable differences in the sequence matches reported by the two search methods. For example, differences in the aligned regions between glomulin and aberrant root formation protein 4 from Arabidopsis (ALF4_ARATH) are clearly visible in the two images. In addition, SSEARCH identifies two MON2 homologues at E()-values <1 (MON2_XENLA and MON2_HUMAN), which may indicate a structural relationship between GLMN and the C-terminus of the MON2 homologues, although these may not share related functions.

Figure 2.

Comparisons between the Visual output results obtained when searching the sequence of mouse glomulin against the UniProtKB/Swiss-Prot database using SSEARCH and NCBI BLAST, respectively.

Determining which functional domains and families a protein belongs to is critical to the understanding of the biological processes it may be involved in. This is important for the characterization of existing drug targets as well as for the identification of novel ones. Family and domain functional predictions have been built into the framework, using pre-calculated matches from the InterPro Consortium (14) data. This enables users not only to search for sequence similarities when using the UniProt databases, but also to characterize the query sequence in terms of domain architectures that may elucidate its function. Figure 3 shows ‘Functional Predictions’ for a hypothetical bioactive lysophospholipid that was compared against UniProtKB/Swiss-Prot using NCBI BLAST. The hypothetical sequence has several good homologues, all belonging to the GPCR rhodopsin-like superfamily, which are clearly seen. This indicates the query protein could represent a potential target for receptor-binding studies.

Figure 3.

Functional prediction view of the results obtained when comparing the sequence of a putative bioactive lysophospholipid against UniProtKB/Swiss-Prot using NCBI BLAST.

In both the ‘Visual Output’ and ‘Functional Predictions’ result representations, the matches are coloured from red to blue according to E()-value, using a relative scale that runs from the most to the least significant hit within the result. An absolute scale, which ranges from E() = 0 to E() = 1.0, is also available. These scales aim to aid the user in deciding whether weak similarities may be biologically significant. The images are available in Scalable Vector Graphics (SVG), Portable Network Graphics (PNG) and JPEG output, providing wide compatibility. The raw result and processed forms, such as the ‘Summary Table’ content and XML formats, are downloadable for further processing by the user.
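The two colour scales can be sketched in Python as follows. The article does not say how E()-values are interpolated between red and blue, so the linear interpolation, the RGB encoding and the function names here are assumptions for illustration only; a logarithmic mapping would be an equally plausible reading.

```python
def lerp_colour(t):
    """Map t in [0, 1] to an RGB hex string: 0 is red (most significant), 1 is blue."""
    t = min(max(t, 0.0), 1.0)  # clamp values outside the scale
    red = round(255 * (1.0 - t))
    blue = round(255 * t)
    return f"#{red:02x}00{blue:02x}"


def absolute_colours(evalues):
    """Absolute scale: E() = 0 maps to red and E() >= 1.0 maps to blue."""
    return [lerp_colour(e) for e in evalues]


def relative_colours(evalues):
    """Relative scale: the smallest E() in this result is red, the largest is blue."""
    if not evalues:
        return []
    lo, hi = min(evalues), max(evalues)
    if hi == lo:
        # All hits are equally significant, so they share one colour.
        return [lerp_colour(0.0) for _ in evalues]
    return [lerp_colour((e - lo) / (hi - lo)) for e in evalues]
```

On the relative scale the colours depend on the other hits in the same result, so the same match can be drawn differently in two searches; the absolute scale trades that contrast for comparability between results.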

The examples above illustrate how, from a single sequence similarity search, it is possible to access related sources of annotation, determine visually which results are relevant and infer gene and protein functional associations, using the JDispatcher framework.

Web Services

Web Services technologies have opened up important opportunities for the analysis of life sciences data. It is now well established that sharing resources, across geographically distributed networks, is advantageous to scientists and bioinformaticians through the re-use of generic services, such as those presented in this article. The new JDispatcher framework provides multiple front-ends: in addition to the web interface, SOAP and REST APIs (http://www.ebi.ac.uk/Tools/webservices/) have been implemented to offer programmatic access using accepted web services standards.

The SOAP and REST APIs cater for users requiring systematic access to a wide range of sequence similarity search and multiple sequence alignment services, which can be built into local analytical workflows and pipelines (e.g. Taverna (15), Triana (http://www.trianacode.org/), KNIME (www.knime.org) (16) and Pipeline Pilot (http://accelrys.com/products/scitegic/index.html)). Typical usage scenarios include the characterization of novel genomes and proteomes and the analysis of data derived from metagenome experiments.

Using the APIs, complex applications can be developed in various programming languages, including C/C++, C#, Java, Perl, PHP, Python and Ruby, or in scripting environments such as Bash, csh, batch and PowerShell. This allows integration of services into existing and/or new applications that require access to fast sequence database searching or multiple sequence alignment methods. To facilitate this type of usage, the services provide extensive meta-information describing the available parameters, including their possible values and descriptions of their purpose.
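A programmatic client for such a service typically submits a job, polls its status and then fetches a result. The Python sketch below shows that pattern only. The base URL, the `run`/`status`/`result` path layout, the status strings and the parameter names are assumptions that are not stated in the article and should be checked against the service documentation at http://www.ebi.ac.uk/Tools/webservices/ before use.

```python
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL and endpoint layout; not taken from the article.
BASE_URL = "http://www.ebi.ac.uk/Tools/services/rest"


def http_get(url):
    """Plain GET returning the response body as text."""
    with urlopen(url) as response:
        return response.read().decode()


def http_post(url, body):
    """Form-encoded POST returning the response body as text."""
    with urlopen(url, data=body.encode()) as response:
        return response.read().decode()


def run_job(tool, params, post=http_post, get=http_get, poll_seconds=5, sleep=time.sleep):
    """Submit a job, poll until it leaves the waiting states, return (job_id, status)."""
    job_id = post(f"{BASE_URL}/{tool}/run", urlencode(params)).strip()
    status = "RUNNING"
    while status in ("RUNNING", "PENDING", "QUEUED"):  # assumed status strings
        sleep(poll_seconds)
        status = get(f"{BASE_URL}/{tool}/status/{job_id}").strip()
    return job_id, status


def fetch_result(tool, job_id, result_type, get=http_get):
    """Retrieve one representation of a finished job, e.g. raw output or XML."""
    return get(f"{BASE_URL}/{tool}/result/{job_id}/{result_type}")
```

The transport functions are passed in as arguments so the polling logic can be exercised without network access; a real pipeline would also want a timeout and explicit handling of error statuses.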

Typical applications of the JDispatcher framework services include: providing an alternative interface for specialist usage targeted at a specific community; integrating a service into an existing data portal to provide analysis services; and enhancing analysis results by directly connecting the result with the data. These are of importance to service providers and users of pipelines who may not have the resources to run and maintain the infrastructure required to support equivalent functionality.

CONCLUSIONS

The modularity of this new framework reduces maintenance overheads and simplifies the addition of tools and features. Keeping the result data model and the renderers separate provides the flexibility to add additional representations to all functionally related tools. This improves the level of usability for both novice and expert users. The presented visualization examples highlight important insights in the understanding of existing and new nucleotide and protein sequences from both genomes and metagenome experiments and suggest novel ways in which these data can be interpreted.

Academic and commercial laboratories can integrate the JDispatcher framework services with their local analytical pipelines or workflows. These represent an important contribution to the growing number of available services in bioinformatics and have been submitted to the BioCatalogue (17) (www.biocatalogue.org), a registry of freely available web services in the life sciences.

FUNDING

The European Commission under FELICS [contract number 021902 (RII3), within the Research Infrastructure Action of the FP6 ‘Structuring the European Research Area’ Programme]; core funding from the European Molecular Biology Laboratory; European Patent Office. Funding for open access charge: EMBL.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We acknowledge valuable feedback from Prof. William Pearson from the University of Virginia, USA and the InterPro and UniProt teams at EMBL-EBI.

REFERENCES

1. Web services at the European Bioinformatics Institute—2009. Nucleic Acids Res. 2009 (pg. W10).

2. The European Bioinformatics Institute’s data resources. Nucleic Acids Res. 2010 (pg. D17–D25).

3. Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA 1988 (pg. 2444–2448).

4. WU-Blast2 server at the European Bioinformatics Institute. Nucleic Acids Res. 2003 (pg. 3795–3798).

5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997 (pg. 3389–3402).

6. InterProScan: protein domains identifier. Nucleic Acids Res. 2005 (pg. W116–W120).

7. et al. ClustalW2 and ClustalX version 2.0. Bioinformatics 2007 (pg. 2947–2948).

8. T-Coffee: a novel method for multiple sequence alignments. J. Mol. Biol. 2000, vol. 302 (pg. 205–217).

9. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004 (pg. 1792–1797).

10. Kalign – an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 2005 (pg. 298).

11. Multiple alignment of DNA sequences with MAFFT. Methods Mol. Biol. 2009, vol. 537.

12. Fast and efficient searching of biological data resources—using EB-eye. Brief. Bioinformatics 2010. doi:10.1098/bib/bbp065 [Epub ahead of print 11 February 2010].

13. The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010 (pg. D142–D148).

14. et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009 (pg. D211–D215).

15. Taverna: a tool for building and running workflows of services. Nucleic Acids Res. 2006 (pg. W729–W732).

16. KNIME: The Konstanz Information Miner. Data Analysis, Machine Learning and Applications – Proceedings of the 31st Annual Conference of the Gesellschaft für Klassifikation e.V., Studies in Classification, Data Analysis, and Knowledge Organization. 2007. Berlin, Germany: Springer (pg. 319–326).

17. BioCatalogue: a curated web service registry for the life science community. Nature Precedings 2009. http://www.iscb.org/uploaded/css/36/11627.pdf

© The Author(s) 2010. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
