Local alignment of two-base encoded DNA sequence

Nils Homer et al. BMC Bioinformatics. 2009.

Abstract

Background: DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity.

Results: We present an extension of the standard dynamic programming method for local alignment that simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two-base encoded alignment method and contrast them with standard DNA sequence alignment under the same conditions.
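The idea of decoding and aligning in one pass can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the common XOR form of two-base encoding (A, C, G, T as 0 to 3, color = XOR of adjacent bases), uses a linear rather than affine gap penalty, aligns the whole read to any substring of the reference, and all names and score values (`align_colors`, `color_err`, and so on) are hypothetical. Each cell tracks four states, one per possible decoded base, so a color disagreement is charged as a measurement error instead of corrupting the rest of the read.

```python
NEG = float("-inf")
BASES = "ACGT"

def align_colors(start_base, colors, ref,
                 match=1, mismatch=-1, color_err=-1, gap=-2):
    """Best score aligning an entire color read to any substring of ref.

    Assumed scoring (illustrative only): linear gaps, one penalty per
    color that disagrees with the decoded base pair.
    """
    n, m = len(colors), len(ref)
    s = BASES.index(start_base)
    # H[i][j][k]: best score after consuming i colors and j reference
    # bases, with the i-th decoded read base equal to BASES[k].
    H = [[[NEG] * 4 for _ in range(m + 1)] for _ in range(n + 1)]
    for j in range(m + 1):
        H[0][j][s] = 0  # the alignment may start anywhere in the reference
    for i in range(1, n + 1):
        for j in range(m + 1):
            for k in range(4):
                best = NEG
                for p in range(4):  # p: previous decoded base
                    # Charge a color error if the observed color does not
                    # match the color implied by the pair (p, k).
                    trans = 0 if (p ^ k) == colors[i - 1] else color_err
                    if j > 0:  # read base aligned to a reference base
                        sub = match if BASES[k] == ref[j - 1] else mismatch
                        best = max(best, H[i - 1][j - 1][p] + trans + sub)
                    # read base inserted relative to the reference
                    best = max(best, H[i - 1][j][p] + trans + gap)
                if j > 0:  # reference base deleted from the read
                    best = max(best, H[i][j - 1][k] + gap)
                H[i][j][k] = best
    return max(H[n][j][k] for j in range(m + 1) for k in range(4))

print(align_colors("T", [1, 1, 3, 1], "ACGTACGT"))  # 4: clean read GTAC
print(align_colors("T", [1, 1, 0, 1], "ACGTACGT"))  # 3: one color error absorbed
```

In the second call the third color is misread. Decoding first and aligning afterwards would yield GTTG and two base mismatches, whereas the joint search pays a single color-error penalty and still recovers GTAC.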

Conclusion: The new local alignment algorithm for two-base encoded data has substantial power to detect and correct measurement errors while identifying underlying sequence variants, thereby facilitating genome re-sequencing efforts based on this form of sequence data.

Figures

Figure 1

The function Φ. Φ is a function that encodes two bases as a color. Each color is represented by a number ∈ {0, 1, 2, 3}.
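A short sketch of what such an encoding function can look like follows. The caption does not give the color table, so this assumes the widely used XOR scheme (A, C, G, T mapped to 0 to 3, color = XOR of the two indices); the names `phi` and `encode` and the convention of a known leading base are illustrative assumptions.

```python
BASES = "ACGT"

def phi(b1, b2):
    """Color for an adjacent base pair under the assumed XOR scheme."""
    return BASES.index(b1) ^ BASES.index(b2)

def encode(start_base, seq):
    """Encode seq as colors, beginning from a known leading base."""
    colors, prev = [], start_base
    for b in seq:
        colors.append(phi(prev, b))
        prev = b
    return colors

print(encode("T", "GTAC"))  # [1, 1, 3, 1]
print(encode("T", "GTCC"))  # [1, 1, 2, 0]
```

The second call changes a single base (A to C) and alters two adjacent colors, which is the signature that lets a true substitution be told apart from an isolated color error.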

Figure 2

Power evaluation for sequences with errors. We assess the power to align sequences with and without two-base encoding in the presence of a per-color or per-base error rate, respectively.

Figure 3

Power evaluation for sequences with errors and base substitutions. We assess the power to align sequences with and without two-base encoding in the presence of errors and base substitutions.

Figure 4

Power evaluation for sequences with errors and a contiguous deletion. We assess the power to align sequences with and without two-base encoding in the presence of errors and a contiguous deletion.

Figure 5

Power evaluation for sequences with errors and a contiguous insertion. We assess the power to align sequences with and without two-base encoding in the presence of errors and a contiguous insertion.

Figure 6

The function Γ. Γ is a function that encodes one base and one color as a base.
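As an illustration of the decoding direction, the sketch below assumes the same XOR color scheme (A, C, G, T as 0 to 3), under which the next base is the previous base's index XOR the color. The names `gamma` and `decode` are hypothetical, and the caption itself does not specify the table.

```python
BASES = "ACGT"

def gamma(base, color):
    """Next base given the previous base and a color (assumed XOR scheme)."""
    return BASES[BASES.index(base) ^ color]

def decode(start_base, colors):
    """Decode colors into bases by chaining gamma from a known leading base."""
    out, prev = [], start_base
    for c in colors:
        prev = gamma(prev, c)
        out.append(prev)
    return "".join(out)

print(decode("T", [1, 1, 3, 1]))  # GTAC
print(decode("T", [1, 0, 3, 1]))  # GGCA
```

In the second call only the second color is misread, yet every base from that position onward changes. This propagation is why naive decoding followed by ordinary alignment handles color errors poorly.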
