The Bioperl toolkit: Perl modules for the life sciences

Genome Res. 2002 Oct;12(10):1611-8.

doi: 10.1101/gr.361602.

Jason E Stajich, David Block, Kris Boulez, Steven E Brenner, Stephen A Chervitz, Chris Dagdigian, Georg Fuellen, James G R Gilbert, Ian Korf, Hilmar Lapp, Heikki Lehväslaiho, Chad Matsalla, Chris J Mungall, Brian I Osborne, Matthew R Pocock, Peter Schattner, Martin Senger, Lincoln D Stein, Elia Stupka, Mark D Wilkinson, Ewan Birney

Jason E Stajich et al. Genome Res. 2002 Oct.

Abstract

The Bioperl project is an international open-source collaboration of biologists, bioinformaticians, and computer scientists that has evolved over the past 7 years into the most comprehensive library of Perl modules available for managing and manipulating life-science information. Bioperl provides an easy-to-use, stable, and consistent programming interface for bioinformatics application programmers. The Bioperl modules have been successfully and repeatedly used to reduce otherwise complex tasks to only a few lines of code. The Bioperl object model has proven flexible enough to support enterprise-level applications such as EnsEMBL, while maintaining an easy learning curve for novice Perl programmers. Bioperl is capable of executing analyses and processing results from programs such as BLAST, ClustalW, or the EMBOSS suite. Interoperation with modules written in Python and Java is supported through the evolving BioCORBA bridge. Bioperl provides access to data stores such as GenBank and SwissProt via a flexible series of sequence input/output modules, and to the emerging common sequence data storage format of the Open Bioinformatics Database Access project. This study describes the overall architecture of the toolkit and the problem domains that it addresses, and gives specific examples of how the toolkit can be used to solve common life-sciences problems. We conclude with a discussion of how the open-source nature of the project has contributed to the development effort.

Figures

Figure 1

Rendering a sequence graphically with Bio::Graphics. This image represents a 20-Kb segment of the C. elegans genome containing annotated genes, a cross-species alignment (C. elegans to C. briggsae), EST alignments, SNPs, PCR primer pairs, and a GC content histogram. The module's flexible glyph-based architecture allows the application programmer to control precisely how biological objects are displayed. Glyphs allow the programmer to define different symbols for different data types or data sources, and each is drawn as a separate track in the image. The module is also suitable for illustrating the extent of protein domains, physical (clone) maps, and horizontal maps.

Figure 2

This figure shows a portion of the Bioperl object model, including the interfaces (shown in italicized type) for sequences (PrimarySeqI, SeqI, RichSeqI) and their implementations: PrimarySeq (general sequence), Seq (sequence with features), RichSeq (sequence with features and rich annotation), LargePrimarySeq (for sequences too large to be held in a program's memory), and LargeSeq (large sequences with features). Also included in the diagram are the sequence feature interface (SeqFeatureI) and its implementations: Similarity (manages similarity information), FeaturePair (paired feature information), and SimilarityPair (paired similarity information such as pairwise alignment information). Additionally, the diagram shows the location objects that manage Simple (start, end, and strand information), Split (multiple start and end spots on a sequence, such as a set of exons), and so-called Fuzzy locations (where the start, end, or span is not exact) for sequence features.
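
The distinction between Simple and Split locations in this caption can be illustrated with a small, language-neutral sketch. The Python below is not Bioperl code and does not reproduce its API: the class names `SimpleLocation` and `SplitLocation`, their fields, and the 1-based inclusive coordinates are assumptions made for illustration only. Fuzzy locations are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleLocation:
    """One contiguous span (hypothetical stand-in for a Simple location)."""
    start: int        # assumed 1-based, inclusive
    end: int          # assumed inclusive
    strand: int = 1   # +1 for forward, -1 for reverse

    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class SplitLocation:
    """Several spans treated as one location, e.g. the exons of a gene."""
    parts: List[SimpleLocation]

    @property
    def start(self) -> int:
        # Leftmost coordinate covered by any part.
        return min(part.start for part in self.parts)

    @property
    def end(self) -> int:
        # Rightmost coordinate covered by any part.
        return max(part.end for part in self.parts)

    def length(self) -> int:
        # Sum of the parts only; the gaps between them are not counted.
        return sum(part.length() for part in self.parts)


# Three made-up exons on the forward strand.
exons = SplitLocation([
    SimpleLocation(100, 200),
    SimpleLocation(300, 450),
    SimpleLocation(600, 650),
])
print(exons.start, exons.end, exons.length())
```

A Split location here spans positions 100 to 650 but covers only the residues inside its three parts, which is why a feature such as a spliced transcript needs more than a single start and end.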

Figure 3

Retrieving a sequence from a remote database with Bio::DB::EMBL. This code retrieves the mRNA sequence with accession no. U14680 from the EBI EMBL databank in EMBL format and writes it to the terminal in GenBank format. One could replace Bio::DB::EMBL with Bio::DB::GenBank and retrieve the sequence from NCBI just as easily, because the software can read and write both EMBL and GenBank formats and can connect to both services through the World Wide Web. The retrieved sequence can then be passed to Bio::Graphics for graphical rendering, to the Bio::SeqIO interface for writing to a file, or to the OBDA interfaces for storage in a relational database.

Figure 4

Report parsing with Bio::SearchIO. This code parses a BLAST report from a file called report.bls and saves, in an array called @HitsToSave, only the hits that have High-scoring Segment Pairs (HSPs) meeting an e-value and length threshold. In this case, any hit with an e-value > 0.001 or a length < 120 residues is excluded. Once the array is built, the name of each hit that had an HSP meeting the criteria is printed. To parse a FASTA (Pearson and Lipman 1988) report file, one simply changes the format specification from blast to fasta.
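
The filtering rule described in this caption can be illustrated independently of Bioperl. The following is a minimal Python sketch, not the code shown in Figure 4 and not the Bio::SearchIO API: the `Hsp` and `Hit` classes, the `hits_to_save` function, and the sample data are hypothetical stand-ins for an already parsed report. A hit is kept when at least one of its HSPs has an e-value of at most 0.001 and a length of at least 120 residues.

```python
from dataclasses import dataclass, field
from typing import List

MAX_EVALUE = 0.001   # threshold taken from the caption
MIN_LENGTH = 120     # threshold taken from the caption


@dataclass
class Hsp:
    """Hypothetical record for one high-scoring segment pair."""
    evalue: float
    length: int


@dataclass
class Hit:
    """Hypothetical record for one database hit and its HSPs."""
    name: str
    hsps: List[Hsp] = field(default_factory=list)


def hits_to_save(hits, max_evalue=MAX_EVALUE, min_length=MIN_LENGTH):
    """Return the hits that have at least one HSP passing both thresholds."""
    return [
        hit for hit in hits
        if any(h.evalue <= max_evalue and h.length >= min_length
               for h in hit.hsps)
    ]


# Made-up data standing in for a parsed report.
hits = [
    Hit("hitA", [Hsp(1e-30, 250)]),                  # passes both thresholds
    Hit("hitB", [Hsp(1e-10, 80)]),                   # too short
    Hit("hitC", [Hsp(0.5, 300), Hsp(1e-5, 150)]),    # second HSP passes
    Hit("hitD", [Hsp(0.01, 400)]),                   # e-value too high
]

for hit in hits_to_save(hits):
    print(hit.name)
```

With this sample data the sketch prints hitA and hitC. Note that both conditions must hold for the same HSP: hitC is kept because of its second HSP alone.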

References

    1. Achard F, Vaysseix G, Barillot E. XML, Bioinformatics, and data integration. Bioinformatics. 2001;17:115–125.
    2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
    3. Beck K. Extreme programming examined: Embrace change. Reading, MA: Addison Wesley; 1999.
    4. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997;268:78–94.
    5. Chervitz SA, Fuellen G, Dagdigian C, Brenner SE, Birney E, Korf I. Bioperl: Standard perl modules for bioinformatics. Bio Informatics Technology and Systems (BITS) 1998. http://www.bitsjournal.com/bioperl.html
