The deployment of an alignment free distance for DNA reads pairs filtering

Background. DNA assembly consists of reconstructing the unknown primary structure of a DNA sequence from a large number of its fragments, called reads, which are obtained in the sequencing process. The need for fast assembly methods has increased with the introduction of next-generation sequencing (NGS) machines, which can produce, at low cost, a large number of short reads from a genomic source. A large class of DNA assembly methods relies on a filtering step, in which promising read pairs are separated from non-promising ones in order to reduce the computational burden of the main assembly algorithm. Faster filtering can thus contribute significantly to speeding up the reconstruction of sequenced DNA.

Methods. We propose a filtering method for read pairs based on an alignment-free distance. The similarity of two reads is assessed by comparing the frequencies of their substrings of fixed length (k-mers). We compare this alignment-free distance with the Needleman-Wunsch edit distance and with the quality of the BLAST alignment. Our comparison rests on a simple assumption: the most accurate distance is the one obtained by knowing in advance the reference sequence that we are trying to align. We compute the overlap between the reads once they have been aligned to the original DNA sequence and use it as a reference distance; we then verify how well the alignment-free and the alignment-based distances reproduce this ideal distance. The ability to reproduce this ideal distance is evaluated on samples of read pairs from Saccharomyces cerevisiae (yeast), Escherichia coli, and Homo sapiens (human) genomic sequences. Comparisons are based on the correctness of threshold predictors and are measured and cross-validated over different samples from the same sets of reads.

Results. We show that, for the considered sequences, the adopted alignment-free distance performs as well as, or better than, the more time-consuming distances that require aligning the reads. This assessment is based on the prediction precision of the analyzed distances on both training and test sets.

Conclusions. We present computational results that show the efficacy of an alignment-free distance in estimating a good read-to-read distance measure. We conclude that read-pair filtering based on alignment-free distances may significantly accelerate the assembly process without a substantial loss in accuracy in the reconstruction of the DNA sample sequence.
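
As an illustration of the k-mer frequency comparison described under Methods, the following is a minimal sketch and not the implementation evaluated here. The abstract does not specify the distance formula, the value of k, or the threshold, so the Euclidean metric, the default `k=3`, and the function names `kmer_frequencies`, `kmer_distance`, and `filter_pairs` are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(read: str, k: int) -> dict:
    """Relative frequency of every k-mer (substring of length k) in a read."""
    total = len(read) - k + 1
    if total <= 0:
        return {}  # read shorter than k: no k-mers to count
    counts = Counter(read[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_distance(read_a: str, read_b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer frequency profiles of two reads.

    Assumption: the Euclidean metric and k=3 are illustrative choices only.
    """
    fa = kmer_frequencies(read_a, k)
    fb = kmer_frequencies(read_b, k)
    kmers = set(fa) | set(fb)  # k-mers absent from a read have frequency 0
    return sqrt(sum((fa.get(x, 0.0) - fb.get(x, 0.0)) ** 2 for x in kmers))


def filter_pairs(pairs, threshold: float, k: int = 3):
    """Keep only the read pairs whose distance is at most the threshold."""
    return [(a, b) for a, b in pairs if kmer_distance(a, b, k) <= threshold]
```

No alignment is computed at any point: each read is scanned once to build its frequency profile, which is why a filter of this kind can be cheaper than an edit-distance or BLAST-based one. In practice the threshold would be learned from read pairs whose true overlap on the reference sequence is known.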