The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing

Nathan L Clement et al. Bioinformatics. 2010.

Abstract

Motivation: The advent of next-generation sequencing technologies has increased the accuracy and quantity of sequence data, opening the door to greater opportunities in genomic research.

Results: In this article, we present GNUMAP (Genomic Next-generation Universal MAPper), a program capable of overcoming two major obstacles in the mapping of reads from next-generation sequencing runs. First, we have created an algorithm that probabilistically maps reads to repeat regions in the genome on a quantitative basis. Second, we have developed a probabilistic Needleman-Wunsch algorithm which utilizes _prb.txt and _int.txt files produced in the Solexa/Illumina pipeline to improve the mapping accuracy for lower quality reads and increase the amount of usable data produced in a given experiment.

Availability: The source code for the software can be downloaded from http://dna.cs.byu.edu/gnumap.
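The probabilistic Needleman-Wunsch idea in the Results can be illustrated with a minimal sketch. It assumes each read position has already been turned into a probability for each of A, C, G and T (parsing of `_prb.txt` and `_int.txt` files is not shown), and it scores a position against a reference base by its expected match/mismatch value. The function names, the scoring constants and the linear gap penalty are illustrative assumptions, not GNUMAP's actual implementation.

```python
from typing import Dict, List

def certain(seq: str) -> List[Dict[str, float]]:
    """Hypothetical helper: turn a plain string into a read with probability 1 per called base."""
    return [{b: (1.0 if b == base else 0.0) for b in "ACGT"} for base in seq]

def expected_match(probs: Dict[str, float], ref_base: str,
                   match: float = 1.0, mismatch: float = -1.0) -> float:
    """Expected substitution score of one probabilistic read position against one reference base."""
    return sum(p * (match if b == ref_base else mismatch) for b, p in probs.items())

def prob_needleman_wunsch(read: List[Dict[str, float]], ref: str,
                          gap: float = -2.0) -> float:
    """Global alignment score of a probabilistic read against a reference string (assumed scoring)."""
    n, m = len(read), len(ref)
    # score[i][j] holds the best score for aligning read[:i] with ref[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + expected_match(read[i - 1], ref[j - 1])
            up = score[i - 1][j] + gap      # gap in the reference
            left = score[i][j - 1] + gap    # gap in the read
            score[i][j] = max(diag, up, left)
    return score[n][m]
```

In this sketch an uncertain base call contributes a partial score instead of a full match or a full mismatch, which is one way a low-quality read can still produce a usable alignment score.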

Figures

Fig. 1.

A flow-chart of the GNUMAP algorithm. First, the algorithm will incrementally find a _k_-mer piece in the consensus Solexa read. This _k_-mer is used as an index into the hash table, producing a list of positions in the genome with the exact _k_-mer sequence. These locations are expanded to align the same l nucleotides from the read to the genomic location. If the alignment score passes the user-defined threshold, the location is considered a hit, and recorded on the genome for future output.
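The seed-and-extend flow described in this caption can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the read is a plain string, not a probabilistic consensus, and the "alignment score" is just a count of matching positions in an ungapped window, standing in for the real alignment step.

```python
from collections import defaultdict
from typing import Dict, List

def build_index(genome: str, k: int) -> Dict[str, List[int]]:
    """Hash every k-mer in the genome to the list of positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index

def map_read(read: str, genome: str, index: Dict[str, List[int]],
             k: int, min_score: int) -> List[int]:
    """Return genome start positions where the read scores at least min_score."""
    hits = set()
    # Slide along the read, using each k-mer as a seed into the hash table
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset  # where the full read would begin on the genome
            if start < 0 or start + len(read) > len(genome):
                continue
            window = genome[start:start + len(read)]
            # Stand-in score (assumption): number of matching positions, no gaps
            score = sum(1 for a, b in zip(read, window) if a == b)
            if score >= min_score:
                hits.add(start)
    return sorted(hits)
```

For example, with `genome = "ACGTACGTTTGACCA"` and `k = 3`, the read `"ACGTT"` is seeded at starts 0 and 4; both pass a threshold of 4 matches, and only start 4 passes a threshold of 5.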

Fig. 2.

Position-specific base-calling error rates for the various methods (PRB = Most probable base (MPB), INT = Highest intensity base, RMAP = MPB+quality, GER = MPB+chastity, GNU = Probability, GNUQ = Probability+quality). Notice that GNU seems to have the lowest error rate for the beginning of the read in both cases. In (b), GNUMAP's performance worsens, but the overall error rate is still smaller; this also means that GNUMAP will rely more on the beginning of the read while mapping. Panel (a) indicates that the RMAPQ filtering may not be highly effective against base-calling errors.

Fig. 3.

The spike found by GNUMAP in the promoter region of the RALGPS2 gene was not found by any other program. This spike is located in a repeat region and has a ‘GGAA’ motif, indicating that this region may be bound by ETS1. Note: The option for SOAP to report every match to a particular location was used here. This may not give an accurate representation of the method, but does show the noise that would occur if each matching location was reported.

Fig. 4.

Benchmark real spikes (bottom) compared with SeqMap (Jiang and Wong, 2008), SOAP (Li, R. et al., 2008), RMAP (Smith et al., 2008), Novocraft (unpublished data) and GNUMAP. Benchmark data were constructed by sampling from 1000 promoter regions in the _C. elegans_ genome. In (b) [an enlargement of the first boxed region in (a)], SeqMap incremented every location for reads from identical regions, producing a significant false positive spike. Attempting to remove false positives by discarding these reads, such as was done by RMAP, results in missing important information, as can be seen in (c) [an enlargement of the second boxed region in (a)]. Note: The intention of this figure is not to discuss the relative mapping capabilities of all currently implemented programs specifically, but to show the trend that would occur if each of these read-placement techniques were used.
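The three read-placement techniques contrasted in this caption can be compared in a small sketch. Each read is represented as a hypothetical mapping from candidate genome position to a positive alignment score. The `"all"` strategy increments every matching location, `"unique"` discards reads that match more than one location, and `"proportional"` splits one unit of signal across locations according to score. The proportional rule is an assumed stand-in for probabilistic placement; the source does not give GNUMAP's exact weighting formula.

```python
from typing import Dict, List

def place_reads(read_hits: List[Dict[int, float]], strategy: str) -> Dict[int, float]:
    """Accumulate per-position signal from reads under one placement strategy."""
    coverage: Dict[int, float] = {}
    for hits in read_hits:
        if not hits:
            continue  # unmapped read contributes nothing
        if strategy == "all":
            # Count the read once at every matching location
            for pos in hits:
                coverage[pos] = coverage.get(pos, 0.0) + 1.0
        elif strategy == "unique":
            # Keep only reads with exactly one matching location
            if len(hits) == 1:
                pos = next(iter(hits))
                coverage[pos] = coverage.get(pos, 0.0) + 1.0
        elif strategy == "proportional":
            # Split one unit of signal across locations by relative score (assumption)
            total = sum(hits.values())
            for pos, score in hits.items():
                coverage[pos] = coverage.get(pos, 0.0) + score / total
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return coverage
```

With one read unique to position 10 and a second read scoring 3.0 at position 10 and 1.0 at position 50, `"all"` gives position 50 a full count, `"unique"` drops the second read entirely, and `"proportional"` assigns 1.75 to position 10 and 0.25 to position 50.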

Fig. 5.

Comparison of false positive rates in the detection of Solexa/Illumina spikes. For each point on the line corresponding to a particular algorithm, the value on the _x_-axis indicates the spike number for that algorithm. The _y_-axis value for that point indicates the number of spikes actually in the benchmark dataset. The difference between the number of spikes found by the algorithm and the diagonal is the number of false positives; GNUMAP and Novocraft had the lowest false positive rates. GNUMAP was correctly able to identify the top 36 spikes in the test dataset. SeqMap and RMAP performed similarly, as did MAQ and SOAP.

References

    1. Barski A, et al. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–837.
    2. Butler J, et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008;18:810–820.
    3. Chen W, et al. Mapping translocation breakpoints by next-generation sequencing. Genome Res. 2008;18:1143–1149.
    4. Harismendy O, et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol. 2009;10:R32.
    5. Jiang H, Wong W. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics. 2008;24:2395–2396.
