A new bioinformatics analysis tools framework at EMBL–EBI

Journal Article

Mickael Goujon, Hamish McWilliam, Weizhong Li, Franck Valentin, Silvano Squizzato, Juri Paern, Rodrigo Lopez*

European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

*To whom correspondence should be addressed. Tel: +44 1223 494423; Fax: +44 1223 494468; Email: rls@ebi.ac.uk

Received: 27 January 2010

Revision received: 06 April 2010

Mickael Goujon, Hamish McWilliam, Weizhong Li, Franck Valentin, Silvano Squizzato, Juri Paern, Rodrigo Lopez, A new bioinformatics analysis tools framework at EMBL–EBI, Nucleic Acids Research, Volume 38, Issue suppl_2, 1 July 2010, Pages W695–W699, https://doi.org/10.1093/nar/gkq313

Abstract

The EMBL-EBI provides access to various mainstream sequence analysis applications. These include sequence similarity search services such as BLAST, FASTA and InterProScan, and multiple sequence alignment tools such as ClustalW, T-Coffee and MUSCLE. Through the sequence similarity search services, users can search mainstream sequence databases such as EMBL-Bank and UniProt, and more than 2000 completed genomes and proteomes. We present here a new framework aimed at both novice and expert users that exposes novel methods of obtaining annotations and visualizing sequence analysis results through one uniform and consistent interface. These services are available over the web and via Web Services interfaces for users who require systematic access or want to interface with customized pipelines and workflows using common programming languages. The framework features novel result visualizations and integration of domain and functional predictions for protein database searches. It is available at http://www.ebi.ac.uk/Tools/sss for sequence similarity searches and at http://www.ebi.ac.uk/Tools/msa for multiple sequence alignments.

INTRODUCTION

Bioinformatics is a vast and complex multidisciplinary research area where numerous tools have been developed over the years to analyse constantly growing amounts of data. Since 1998, the European Bioinformatics Institute (EMBL–EBI) has provided public access to various mainstream sequence analysis applications (1,2). These include sequence similarity search services (http://www.ebi.ac.uk/Tools/similarity.html), such as FASTA (3), BLAST (4,5) and InterProScan (6), and multiple sequence alignment tools (http://www.ebi.ac.uk/Tools/sequence.html), such as ClustalW (7), T-Coffee (8), MUSCLE (9), Kalign (10) and MAFFT (11). These services are provided via a Perl-CGI job dispatcher framework that manages job submission and result representation. This infrastructure handled more than 16 million jobs during 2009. The popularity of these services has made it necessary to redesign the system in order to minimize maintenance and make it easier to integrate features requested by users. A new and modular framework, called JDispatcher, has been developed to improve the accessibility and quality of the services relevant to the biological community.

JDispatcher framework

JDispatcher is aimed at both novice and expert users and exposes novel methods of obtaining annotations and visualizing sequence analysis results through one uniform and consistent interface. These services are available interactively over the web and via SOAP and REST interfaces for systematic or programmatic use. The new framework provides input validation to ensure successful job submissions, offers new visualization features to assist in the interpretation of results and uses the EBI search engine, EB-eye (12), to integrate relevant annotations.

A user can submit sequences using web forms that contain all supported parameters and their possible values. The different tools have been grouped into categories based on their purpose (Table 1).

Table 1. Tools available in the JDispatcher framework

Category | Tools
Sequence Similarity Searches (sss) | psisearch, psiblast, ncbiblast, wublast, fasta, ssearch, ggsearch and glsearch
Multiple Sequence Alignments (msa) | clustalw2, tcoffee, kalign, muscle, mafft and prank

Within a category, the tools share the same interface design, which uses well-established usability patterns, such as wizard-like steps to guide the user through the submission process. It makes use of decision trees to validate all the parameters required to ensure successful job submissions. If the validation fails, the user is notified about which specific parameters or data are invalid, and the job is not submitted. Otherwise, JDispatcher assigns a unique job identifier and sends a request to a workload management system for the job to be executed. The identifier is then used to keep track of the tasks and to retrieve the results when they become available. The results of each job are kept for a maximum of 7 days.
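The submission flow described above can be sketched in a few lines of Python. This is a minimal illustration only, not the JDispatcher implementation: the parameter names, the allowed values, the simplified sequence alphabets and the `dispatch` callback that stands in for the workload management system are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical parameter rules for one tool; real forms expose many more.
ALLOWED = {
    "stype": {"protein", "dna"},
    "database": {"uniprotkb_swissprot", "em_rel"},
}

# Simplified alphabets, assumed for illustration only.
ALPHABETS = {
    "dna": set("ACGTUN"),
    "protein": set("ACDEFGHIKLMNPQRSTVWYBXZ"),
}


@dataclass
class Submission:
    job_id: Optional[str] = None
    errors: dict = field(default_factory=dict)


def validate(params):
    """Return {parameter name: reason} for every invalid parameter."""
    errors = {}
    for name, allowed in ALLOWED.items():
        value = params.get(name)
        if value is None:
            errors[name] = "missing"
        elif value not in allowed:
            errors[name] = f"unsupported value {value!r}"
    # Dependent check, in the spirit of a decision tree: the alphabet is
    # only tested once the sequence type itself is known to be valid.
    sequence = params.get("sequence", "").strip()
    if not sequence:
        errors["sequence"] = "missing"
    elif "stype" not in errors:
        if not set(sequence.upper()) <= ALPHABETS[params["stype"]]:
            errors["sequence"] = "characters outside the expected alphabet"
    return errors


def submit(tool, params, dispatch):
    """Validate, then either report errors or assign an id and dispatch."""
    errors = validate(params)
    if errors:
        return Submission(errors=errors)  # the job is not submitted
    job_id = f"{tool}-{uuid.uuid4().hex}"
    dispatch(job_id, params)  # hand over to the (hypothetical) scheduler
    return Submission(job_id=job_id)
```

A caller would inspect `errors` first and only poll for results when `job_id` is set.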

Results representation

The results of an analysis are made available using various representations (e.g. HTML tables, XML files, images, etc.). In order to produce these representations, each result is converted into a generic category-specific model that is used by a renderer that generates the requested output. The renderers are specific to the model and not to the tool, and thus are available across all the tools in a category. The availability of multiple views of the same data helps the user to interpret and compare results from different tools within a category.
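The separation between a category-level model and its renderers might look roughly like the sketch below. The `Hit` fields, the renderer names and the output layouts are assumptions made for illustration; the step that converts each tool's raw output into the model is omitted.

```python
from dataclasses import dataclass
from html import escape
from xml.etree import ElementTree as ET


@dataclass
class Hit:
    """Hypothetical generic model shared by all similarity search tools."""
    accession: str
    description: str
    score: float
    evalue: float


def render_html(hits):
    """Render the model as a small HTML table."""
    rows = "".join(
        f"<tr><td>{escape(h.accession)}</td><td>{escape(h.description)}</td>"
        f"<td>{h.score:g}</td><td>{h.evalue:.2g}</td></tr>"
        for h in hits
    )
    header = "<tr><th>Accession</th><th>Description</th><th>Score</th><th>E()</th></tr>"
    return f"<table>{header}{rows}</table>"


def render_xml(hits):
    """Render the same model as XML."""
    root = ET.Element("hits")
    for h in hits:
        ET.SubElement(
            root, "hit",
            accession=h.accession, description=h.description,
            score=str(h.score), evalue=str(h.evalue),
        )
    return ET.tostring(root, encoding="unicode")


# Renderers are keyed by output format, not by tool, so any tool whose
# output has been converted to a list of Hit objects can use all of them.
RENDERERS = {"html": render_html, "xml": render_xml}


def render(hits, output_format):
    return RENDERERS[output_format](hits)
```

Adding a new representation then means registering one more function in `RENDERERS`, and every tool in the category gains it.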

Sequence search algorithms produce only limited annotation for the hits they report. With the new framework it is possible to navigate hits and access related information. Figure 1 shows the ‘Summary Table’ of an SSEARCH of mouse glomulin (UniProtKB/Swiss-Prot GLMN_MOUSE), which is essential for the development of the vascular system, against the UniProtKB/Swiss-Prot database (13). Each column heading has clickable arrows that allow the user to sort the results according to the values in the columns [e.g. sequence length, score, percentage identity, positives and E()-value]. Each match is enriched with links to cross-references and related information in various data resources (e.g. gene expression, genomic sequences, structures, function, ontologies and literature citations). Optionally, the alignment from the search and/or the full annotation for the selected matches can be displayed. A selection of hits can also be downloaded in FASTA format.
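The column sorting and FASTA download described above are simple operations on the list of hits. The following sketch assumes a hypothetical `Hit` record and a 60-character line width; neither is taken from the framework itself.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical row of the Summary Table."""
    identifier: str
    length: int
    score: float
    identity: float
    evalue: float
    sequence: str


def sort_hits(hits, column, descending=False):
    """Return a new list sorted by the named column (a Hit attribute)."""
    return sorted(hits, key=lambda h: getattr(h, column), reverse=descending)


def to_fasta(hits, width=60):
    """Write the selected hits as FASTA, wrapping sequences at `width`."""
    lines = []
    for h in hits:
        lines.append(f">{h.identifier}")
        for start in range(0, len(h.sequence), width):
            lines.append(h.sequence[start:start + width])
    return "".join(line + "\n" for line in lines)
```

For example, `to_fasta(sort_hits(hits, "evalue")[:10])` would export the ten most significant matches.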

Figure 1. Summary Table view of the results obtained when searching the sequence of mouse glomulin against the UniProtKB/Swiss-Prot database using SSEARCH.

Figure 2 shows the ‘Visual Output’ obtained from searches of the glomulin sequence against UniProtKB/Swiss-Prot using SSEARCH and NCBI BLAST with default parameters. Comparison of the two images reveals notable differences in the sequence matches reported by the two search methods. For example, differences in the aligned regions between glomulin and aberrant root formation protein 4 from Arabidopsis (ALF4_ARATH) are clearly visible in both. In addition, SSEARCH identifies two MON2 homologues at E()-values <1 (MON2_XENLA and MON2_HUMAN), which may indicate a structural relationship between GLMN and the C-terminus of the MON2 homologues, although these may not share related functions.

Figure 2. Comparison of the Visual Output results obtained when searching the sequence of mouse glomulin against the UniProtKB/Swiss-Prot database using SSEARCH and NCBI BLAST, respectively.

Determining which functional domains and families a protein belongs to is critical to understanding the biological processes it may be involved in. This is important for the characterization of existing drug targets as well as for the identification of novel ones. Family and domain functional predictions have been built into the framework, using pre-calculated matches from the InterPro Consortium (14) data. This enables users not only to search for sequence similarities when using the UniProt databases, but also to characterize the query sequence in terms of domain architectures that may elucidate its function. Figure 3 shows ‘Functional Predictions’ for a hypothetical bioactive lysophospholipid that was compared against UniProtKB/Swiss-Prot using NCBI BLAST. The hypothetical sequence has several good homologues, all belonging to the GPCR rhodopsin-like superfamily, which are clearly seen. This indicates that the query protein could represent a potential target for receptor-binding studies.

Figure 3. Functional prediction view of the results obtained when comparing the sequence of a putative bioactive lysophospholipid against UniProtKB/Swiss-Prot using NCBI BLAST.

In both the ‘Visual Output’ and ‘Functional Predictions’ result representations, the matches are coloured, from red to blue, according to E()-value, using a relative scale, from the most to the least significant hits within the result. An absolute scale, which ranges from E() = 0 to E() = 1.0, is also available. These scales aim to aid the user in deciding whether weak similarities may be biologically significant. The images are available in Scalable Vector Graphics (SVG), Portable Network Graphics (PNG) and JPEG output, providing wide compatibility. The raw result and processed forms, such as the ‘Summary Table’ content and XML formats, are downloadable for further processing by the user.
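The two colour scales can be illustrated with a short sketch. The text does not state how E()-values are mapped onto the red-to-blue gradient, so the linear interpolation in RGB space used here is purely an assumption, as are the function names; a real implementation may well use a logarithmic transform instead.

```python
def evalue_colour(evalue, lowest, highest):
    """Map an E()-value to a hex colour: red at `lowest`, blue at `highest`.

    Linear interpolation is an assumption made for this sketch.
    Values outside the range are clamped to the nearest end.
    """
    if highest <= lowest:
        fraction = 0.0  # a single distinct value is drawn as most significant
    else:
        fraction = (evalue - lowest) / (highest - lowest)
        fraction = min(max(fraction, 0.0), 1.0)
    red = round(255 * (1.0 - fraction))
    blue = round(255 * fraction)
    return f"#{red:02x}00{blue:02x}"


def relative_colours(evalues):
    """Scale between the best and worst E()-values in this result.

    Expects a non-empty list.
    """
    lowest, highest = min(evalues), max(evalues)
    return [evalue_colour(e, lowest, highest) for e in evalues]


def absolute_colours(evalues):
    """Scale over the fixed range E() = 0 to E() = 1.0."""
    return [evalue_colour(e, 0.0, 1.0) for e in evalues]
```

With the relative scale the best hit in a result is always red, whereas with the absolute scale a result containing only weak hits is drawn mostly in blue.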

The examples above illustrate how, from a single sequence similarity search, it is possible to access related sources of annotation, determine visually which results are relevant and infer gene and protein functional associations, using the JDispatcher framework.

Web Services

Web Services technologies have opened up important opportunities for the analysis of life sciences data. It is now well established that sharing resources, across geographically distributed networks, is advantageous to scientists and bioinformaticians through the re-use of generic services, such as those presented in this article. The new JDispatcher framework provides multiple front-ends: in addition to the web interface, SOAP and REST APIs (http://www.ebi.ac.uk/Tools/webservices/) have been implemented to offer programmatic access using accepted web services standards.
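Programmatic access over REST typically follows a submit, poll and fetch pattern. The sketch below shows that pattern with the Python standard library only. The `run`, `status` and `result` path layout and the `RUNNING` status string are assumptions that should be checked against the service documentation at the URL above; the base URL is supplied by the caller, and the transport function can be replaced for testing.

```python
import time
import urllib.parse
import urllib.request


def http_call(url, data=None):
    """Default transport: GET when data is None, else a form-encoded POST."""
    body = urllib.parse.urlencode(data).encode() if data is not None else None
    with urllib.request.urlopen(url, data=body) as response:
        return response.read().decode()


class JobClient:
    """Submit/poll/fetch client for one tool endpoint (assumed layout)."""

    def __init__(self, base_url, transport=http_call):
        self.base_url = base_url.rstrip("/")
        self.transport = transport

    def run(self, **params):
        """Submit a job and return the job identifier sent back."""
        return self.transport(f"{self.base_url}/run", params).strip()

    def status(self, job_id):
        return self.transport(f"{self.base_url}/status/{job_id}").strip()

    def result(self, job_id, result_type):
        return self.transport(f"{self.base_url}/result/{job_id}/{result_type}")

    def wait(self, job_id, poll_seconds=5.0, max_polls=120):
        """Poll until the job leaves the assumed RUNNING state."""
        for _ in range(max_polls):
            state = self.status(job_id)
            if state != "RUNNING":
                return state
            time.sleep(poll_seconds)
        raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```

Keeping the transport injectable means the same client logic can be exercised without network access, and the polling interval can be tuned so that systematic use does not overload the service.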

The SOAP and REST APIs cater for users requiring systematic access to a wide range of sequence similarity search and multiple sequence alignment services, which can be built into local analytical workflows and pipelines [e.g. Taverna (15), Triana (http://www.trianacode.org/), KNIME (www.knime.org) (16) and Pipeline Pilot (http://accelrys.com/products/scitegic/index.html)]. Typical usage scenarios include the characterization of novel genomes and proteomes and the analysis of data derived from metagenome experiments.

Using the APIs, complex applications can be developed in various programming languages, including C/C++, C#, Java, Perl, PHP, Python and Ruby, or in scripting environments such as Bash, csh, batch and PowerShell. This allows integration of services into existing and/or new applications that require access to fast sequence database searching or multiple sequence alignment methods. To facilitate this type of usage, the services provide extensive meta-information describing the available parameters, including their possible values and descriptions of their purpose.
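A client can use such meta-information to build forms or to check values before submitting a job. The sketch below parses a parameter description with the Python standard library. The XML layout in `SAMPLE` is hypothetical and was invented for this example; the actual format returned by the services should be taken from their documentation.

```python
from xml.etree import ElementTree as ET

# Hypothetical description of one parameter, invented for this sketch.
SAMPLE = """
<parameter name="stype">
  <description>Type of the query sequence.</description>
  <values>
    <option label="Protein" value="protein" default="true"/>
    <option label="DNA" value="dna" default="false"/>
  </values>
</parameter>
"""


def describe_parameter(xml_text):
    """Turn a parameter description into a plain dictionary."""
    root = ET.fromstring(xml_text)
    options = [
        {
            "label": option.get("label"),
            "value": option.get("value"),
            "default": option.get("default") == "true",
        }
        for option in root.findall("./values/option")
    ]
    return {
        "name": root.get("name"),
        "description": (root.findtext("description") or "").strip(),
        "options": options,
    }


def default_value(info):
    """Return the value flagged as default, or None if there is none."""
    for option in info["options"]:
        if option["default"]:
            return option["value"]
    return None


def is_allowed(info, value):
    """Check a user-supplied value against the advertised options."""
    return value in {option["value"] for option in info["options"]}
```

Driving validation from the service's own metadata keeps a client in step with the server when parameters or their permitted values change.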

Typical applications of the JDispatcher framework services include: providing an alternative interface for specialist usage targeted at a specific community; integrating a service into an existing data portal to provide analysis services; and enhancing analysis results by directly connecting the result with the data. These are of importance to service providers and users of pipelines who may not have the resources to run and maintain the infrastructure required to support equivalent functionality.

CONCLUSIONS

The modularity of this new framework reduces maintenance overheads and simplifies the addition of tools and features. Keeping the result data model and the renderers separate provides the flexibility to add further representations to all functionally related tools. This improves the level of usability for both novice and expert users. The visualization examples presented here highlight important insights into existing and new nucleotide and protein sequences from both genome and metagenome experiments and suggest novel ways in which these data can be interpreted.

Academic and commercial laboratories can integrate the JDispatcher framework services with their local analytical pipelines or workflows. These represent an important contribution to the growing number of available services in bioinformatics and have been submitted to the BioCatalogue (17) (www.biocatalogue.org), a registry of freely available web services in the life sciences.

FUNDING

The European Commission under FELICS [contract number 021902 (RII3), within the Research Infrastructure Action of the FP6 ‘Structuring the European Research Area’ Programme]; core funding from the European Molecular Biology Laboratory; European Patent Office. Funding for open access charge: EMBL.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We acknowledge valuable feedback from Prof. William Pearson from the University of Virginia, USA and the InterPro and UniProt teams at EMBL-EBI.

REFERENCES

1. Web services at the European Bioinformatics Institute—2009, Nucleic Acids Res., 2009, vol. 37 (pg. W6-W10)

2. The European Bioinformatics Institute’s data resources, Nucleic Acids Res., 2010, vol. 38 (pg. D17-D25)

3. Improved tools for biological sequence comparison, Proc. Natl Acad. Sci. USA, 1988, vol. 85 (pg. 2444-2448)

4. WU-Blast2 server at the European Bioinformatics Institute, Nucleic Acids Res., 2003, vol. 31 (pg. 3795-3798)

5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res., 1997, vol. 25 (pg. 3389-3402)

6. InterProScan: protein domains identifier, Nucleic Acids Res., 2005, vol. 33 (pg. W116-W120)

7. et al. ClustalW2 and ClustalX version 2.0, Bioinformatics, 2007, vol. 23 (pg. 2947-2948)

8. T-Coffee: a novel method for multiple sequence alignments, J. Mol. Biol., 2000, vol. 302 (pg. 205-217)

9. MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res., 2004, vol. 32 (pg. 1792-1797)

10. Kalign – an accurate and fast multiple sequence alignment algorithm, BMC Bioinformatics, 2005, vol. 6 pg. 298

11. Multiple alignment of DNA sequences with MAFFT, Methods Mol. Biol., 2009, vol. 537 (pg. 39-64)

12. Fast and efficient searching of biological data resources—using EB-eye, Brief. Bioinformatics, 2010, doi:10.1093/bib/bbp065 [Epub ahead of print 11 February 2010]

13. The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010, Nucleic Acids Res., 2010, vol. 38 (pg. D142-D148)

14. et al. InterPro: the integrative protein signature database, Nucleic Acids Res., 2009, vol. 37 (pg. D211-D215)

15. Taverna: a tool for building and running workflows of services, Nucleic Acids Res., 2006, vol. 34 (pg. W729-W732)

16. KNIME: The Konstanz Information Miner, Data Analysis, Machine Learning and Applications – Proceedings of the 31st Annual Conference of the Gesellschaft für Klassifikation e.V., Studies in Classification, Data Analysis, and Knowledge Organization, 2007, Berlin, Germany, Springer (pg. 319-326)

17. BioCatalogue: a curated web service registry for the life science community, Nature Precedings, 2009, http://www.iscb.org/uploaded/css/36/11627.pdf

© The Author(s) 2010. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
