Minimap2: pairwise alignment for nucleotide sequences - PubMed

Minimap2: pairwise alignment for nucleotide sequences

Heng Li. Bioinformatics. 2018.

Abstract

Motivation: Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb on average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs either cannot process such data at scale or do so inefficiently, which calls for the development of new alignment algorithms.

Results: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment.
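The abstract mentions a concave gap cost without giving its form. One standard way to obtain a concave cost is a two-piece affine function, the minimum of two affine costs: one with a cheap opening and expensive extension (short gaps), the other with an expensive opening and cheap extension (long gaps). The Python sketch below illustrates that idea only; the function names and all parameter values are assumptions chosen for illustration and are not taken from the text above.

```python
def affine_gap_cost(length, gap_open=4, gap_ext=2):
    """Ordinary affine cost: gap_open + length * gap_ext (0 for no gap)."""
    if length < 0:
        raise ValueError("gap length must be non-negative")
    return 0 if length == 0 else gap_open + length * gap_ext


def two_piece_gap_cost(length, open1=4, ext1=2, open2=24, ext2=1):
    """Concave cost: the cheaper of two affine pieces.

    Piece 1 (open1, ext1) wins for short gaps; piece 2 (open2, ext2)
    wins for long gaps because its per-base extension is cheaper.
    All four parameter values are illustrative assumptions.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return min(open1 + length * ext1, open2 + length * ext2)


if __name__ == "__main__":
    # Compare the two models: they agree on short gaps and diverge on long ones.
    for n in (1, 10, 20, 100, 1000):
        print(n, affine_gap_cost(n), two_piece_gap_cost(n))
```

With these assumed values the two pieces cross at a gap length of 20; beyond that point each extra gap base costs 1 instead of 2, so a long insertion or deletion is penalized far less than under a single affine cost.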

Availability and implementation: https://github.com/lh3/minimap2.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.

Evaluation on aligning simulated reads. Simulated reads were mapped to the primary assembly of human genome GRCh38. A read is considered correctly mapped if its longest alignment overlaps with the true interval, and the overlap length is ≥10% of the true interval length. Read alignments are sorted by mapping quality in descending order. For each mapping quality threshold, the fraction of alignments (out of the number of input reads) with mapping quality above the threshold and their error rate are plotted along the curve. (a) Long-read alignment evaluation. 33 088 ≥1000 bp reads were simulated using pbsim (Ono et al., 2013) with error profile sampled from file ‘m131017_060208_42213_*.1.*’ downloaded at http://bit.ly/chm1p5c3. The N50 read length is 11 628. Aligners were run under the default setting for SMRT reads. Kart output all alignments at mapping quality 60, so it is not shown in the figure. It mapped nearly all reads with 4.1% of alignments being wrong, less accurate than others. (b) Short-read alignment evaluation. 10 million pairs of 150 bp reads were simulated using mason2 (Holtgrewe, 2010) with option ‘--illumina-prob-mismatch-scale 2.5’. Short-read aligners were run under the default setting except for changing the maximum fragment length to 800 bp
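The correctness criterion and the accuracy curve described in the caption can be sketched in Python as follows. This is a minimal illustration, not the evaluation script used for the figure: the `Alignment` record, the function names, the half-open coordinate convention, and the requirement that the chromosome match are all assumptions, and the threshold is treated as "at or above" for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    chrom: str   # reference sequence name (assumed field)
    start: int   # 0-based, half-open interval [start, end) (assumed convention)
    end: int
    mapq: int    # mapping quality


def is_correct(aln, true_chrom, true_start, true_end, min_frac=0.10):
    """True if the alignment overlaps the true interval by >= min_frac of its length."""
    if aln.chrom != true_chrom:
        return False
    overlap = min(aln.end, true_end) - max(aln.start, true_start)
    return overlap > 0 and overlap >= min_frac * (true_end - true_start)


def accuracy_curve(records, n_reads):
    """Build (mapq_threshold, fraction_mapped, error_rate) points.

    records: one (mapq, correct) pair per mapped read, taken from the
    read's longest alignment. n_reads: number of input reads.
    """
    ordered = sorted(records, key=lambda r: r[0], reverse=True)
    curve = []
    mapped = wrong = 0
    i = 0
    while i < len(ordered):
        q = ordered[i][0]
        # Consume every alignment sharing this mapping quality.
        while i < len(ordered) and ordered[i][0] == q:
            mapped += 1
            wrong += 0 if ordered[i][1] else 1
            i += 1
        curve.append((q, mapped / n_reads, wrong / mapped))
    return curve


if __name__ == "__main__":
    demo = [(60, True), (60, True), (30, False), (0, True)]
    for point in accuracy_curve(demo, n_reads=5):
        print(point)
```

Each point accumulates all alignments at or above a mapping quality, so moving along the curve trades a larger mapped fraction against a possibly higher error rate, which is how the panels compare aligners.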

Cited by

A chromosomal reference genome sequence for the malaria mosquito, Anopheles marshallii, Theobald, 1903.
Makanga BK, Ayala D, Rahola N, Bouafou LBA, Johnson HF, Heaton H, Wagah MG, Collins JC, Krasheninnikova K, Pelan SE, Pointon DB, Sims Y, Torrance JW, Tracey A, Uliano-Silva M, Wood JMD, von Wyschetzki K; Wellcome Sanger Institute Scientific Operations: Sequencing Operations; McCarthy SA, Neafsey DE, Makunin A, Lawniczak MKN. Wellcome Open Res. 2024 Sep 26;9:554. doi: 10.12688/wellcomeopenres.22989.1. eCollection 2024. PMID: 39507815. Free PMC article.
Time course transcriptomic profiling suggests Crp/Fnr transcriptional regulation of nosZ gene in a N2O-reducing thermophile.
Tsuchiya J, Mino S, Fujiwara F, Okuma N, Ichihashi Y, Morris RM, Nunn BL, Timmins-Schiffman E, Sawabe T. iScience. 2024 Sep 30;27(11):111074. doi: 10.1016/j.isci.2024.111074. eCollection 2024 Nov 15. PMID: 39507244. Free PMC article.

References

Abouelhoda M.I., Ohlebusch E. (2005) Chaining algorithms for multiple genome comparison. J. Discrete Algorithms, 3, 321–341.
Altschul S.F., Erickson B.W. (1986) Optimal sequence alignment using affine gap costs. Bull. Math. Biol., 48, 603–616.
Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Berlin K. et al. (2015) Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol., 33, 623–630.
Byrne A. et al. (2017) Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells. Nat. Commun., 8, 16027.
