BFAST: an alignment tool for large scale genome resequencing - PubMed (original) (raw)

BFAST: an alignment tool for large scale genome resequencing

Nils Homer et al. PLoS One. 2009.

Abstract

Background: The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, result in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25-100 base range, in the presence of errors and true biological variation.

Methodology: We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels.

Conclusions: We compare BFAST to a selection of large-scale alignment tools -- BLAT, MAQ, SHRiMP, and SOAP -- in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at (http://bfast.sourceforge.net).

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1

Figure 1. Algorithmic steps and underlying data structure used by BFAST.

BFAST has three sequential steps: create indexes of the reference, find CALs (candidate alignment locations) using the indexes, and perform gapped local alignment. Gapped local alignment is performed on all possible CALs in order to identify the best possible alignment. Thus, a highly sensitive and comprehensive search step is followed by an integrated local alignment to maintain high sensitivity and high accuracy. The indexes can be reused on new sets of reads when the reference remains the same. In Panels A–D we detail the underlying data structure and storage format used to represent and access an index. Panel A represents an example genome (shaded region) to be indexed and a list of all suffixes of the genome with length greater than 10. The shaded regions are the only portions that must actually be stored in RAM: the genome sequence, the ordering of the suffix list, and the entries of the _k_-mer hash lookup. Panel B shows a mask with a key-size of 4 and width 11 to be indexed (10100000011), with a list of all suffixes ordered by the 4-mer keys defined by the mask that are possible from the target sequence. The vertical shaded region in Panel B gives the original ordering of the suffixes, or alternatively their start positions in the genome. A 2-mer hash table is shown in Panel C and is used to rapidly lookup 4-mer entries in the index with the key based on the 2-mer prefix of the 4-mer word. This hash, or an index into the index, returns a range (order start and order end) over the ordered suffix list in Panel B. Panel D shows the possible lookups for a 13-base read using the index defined in Panels A, B, and C. There is an error in the seventh position shown in red. Using the index, three different 4-mers can be defined by sliding the mask along the read. To lookup a given 4-mer, such as ATAG, the hash table is first consulted to find that the prefix AT extends from entry 2 to entry 3 in the index. Since this is not a unique range (i.e. the start and end are equal), the index is bisected using the range from the hash lookup, until all positions with ATAG are located. The read error (or mutation) interferes with alignment in that only one of the three 4-mers are found in this created index, but still yields the proper location for this read in the genome, further indexes would make reads with more errors/mutations mappable correctly. In practice longer keys are used.

Figure 2

Figure 2. Distribution of short sequence CALs to the human genome at various _k_-mer keys.

For varying key sizes, k, the number of lookup locations for each _k_-mer key from the Human reference genome was calculated, forward and reverse complementary sequence included (∼6·109 keys). The figures show the computed percentage of keys that have a given number of genomic locations, the equivalent of CALs for index lookup.

Figure 3

Figure 3. Evaluation of alignment algorithms from simulated 50 base pair reads.

10,000 50 bp reads each fitting to different read classes were aligned using indicated algorithms back to the human genome. Sensitivity is defined as the percent of all reads that were mapped correctly and is plotted on the y-axis in panels A, C, and E. The percentage of reads that are mapped to an incorrect location are plotted on the y axis in panels B, D, and F. The x-axis of panels A and B maps the number of sequence differences in the reads relative to the consensus genome. The x-axis plots the length of a contiguous deletion in panels C and D, and the length of a contiguous insertion in panels E and F. The alignment method is indicated as colored lines per the key within the first figure panel.

Figure 4

Figure 4. Evaluation of alignment algorithms from simulated 50 base pair color space reads.

10,000 50 base color space reads were simulated from the human genome in different read classes to assess sensitivity and accuracy of BFAST. Sensitivity is defined as the percent of all reads that were mapped correctly and is plotted on the y axis in panels A, C, and E. The percentage of reads that are mapped to an incorrect location are plotted on the y axis in panels B, D, and F. The X axis of all panels plots the number of color differences in the reads relative to the consensus genome for a variety of models. Panels A and B plot mapping of reads with a single SNP in addition to the color errors. Panels C and D plot mapping of reads with a single 10 bp deletion in addition to the color errors and single SNP. Panels E and F plot mapping of reads with a single 10 bp insertion in addition to the color errors and single SNP. The alignment method is indicated as colored lines per the key within the first figure panel.

References

    1. Holt RA, Jones SJ. The new paradigm of flow cell sequencing. Genome Res. 2008;18:839–846. - PubMed
    1. Hutchison CA,, 3rd DNA sequencing: bench to bedside and beyond. Nucleic Acids Res. 2007;35:6227–6237. - PMC - PubMed
    1. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A. 1977;74:5463–5467. - PMC - PubMed
    1. Bentley DR. Whole-genome re-sequencing. Curr Opin Genet Dev. 2006;16:545–552. - PubMed
    1. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380. - PMC - PubMed

Publication types

MeSH terms

LinkOut - more resources