Fast computation and applications of genome mappability
Thomas Derrien et al. PLoS One. 2012.
Abstract
We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).
Conflict of interest statement
Competing Interests: The authors have declared that no competing interests exist.
Figures
Figure 1. Effect of our approximation on the frequencies of the C. elegans genome, for and .
Both the exact and the approximated data were obtained with gem-mappability, the former by setting the value of parameter to , the latter with the default value of automatically selected by the program based on the length of the C. elegans genome. Each panel shows how our approximation scatters the k-mers originally populating a non-approximate -bit frequency bin into more than one approximate bin. Using the panel – as an example, one can see that about 80% of the k-mers fall into the correct bin, while the remaining 20% are dispersed in bins from – to –, with most of the k-mers staying in bins close to the correct one. In addition, the color of the bins shows that this 20% of k-mers corresponds in absolute terms to a small number (in this example about 90% of the k-mers of the genome are unique and hence fall into the [1–1] bin, which, as explained in the text, is not perturbed by our approximation owing to the good properties of the latter).
Figure 2. Effect of our approximation on the frequencies of chromosome of H. sapiens, for and .
Both the exact and the approximated data were obtained with gem-mappability, the former by setting the value of parameter to , the latter with the default value of automatically selected by the program based on the length of chromosome of H. sapiens. Each panel shows how our approximation scatters the k-mers originally populating a non-approximate -bit frequency bin into more than one approximate bin.
Figure 3. Visualization of mappability on the UCSC browser: the example of the human TK1 gene.
Six mappability tracks (green) are shown here corresponding to k-mer sizes , , , and bp (from top to bottom of the figure). Regions with a low mappability score have high frequencies, and vice versa. This example illustrates that the uniqueness of the TK1 locus (especially within the introns) could be inversely correlated with the presence of some repetitive elements as identified by RepeatMasker.
Figure 4. Pileup mappability.
The number of all possible k-mers covering a particular position of the genome (corresponding to nucleotide C) is equal to k ( in this example). The average of the mappabilities of these k-mers can be taken as the pileup mappability. This quantity represents how mappable the position would be in a pileup from a whole-genome sequencing study with reads of length k.
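The averaging described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the GEM implementation: the function name `pileup_mappability` and the input layout (one mappability value per k-mer start position) are assumptions made for the example, and positions near the sequence ends, which are covered by fewer than k k-mers, are simply averaged over the k-mers that do cover them.

```python
def pileup_mappability(kmer_mappability, k):
    """Average the mappabilities of all k-mers covering each position.

    kmer_mappability[i] is assumed to hold the mappability of the k-mer
    starting at position i, so a sequence of length n has n - k + 1 entries.
    """
    n_kmers = len(kmer_mappability)
    n_positions = n_kmers + k - 1
    pileup = []
    for pos in range(n_positions):
        # k-mers starting in [pos - k + 1, pos] cover this position;
        # clip the window at both ends of the sequence.
        first = max(0, pos - k + 1)
        last = min(pos, n_kmers - 1)
        window = kmer_mappability[first:last + 1]
        pileup.append(sum(window) / len(window))
    return pileup


# Hypothetical per-k-mer mappabilities for a sequence of length 6 with k = 3.
example = pileup_mappability([1.0, 0.5, 0.5, 1.0], k=3)
```

Interior positions are averaged over exactly k values, which matches the count given in the caption.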
Figure 5. Relation between heterozygosity and pileup mappability.
Regions with low pileup mappability are more prone to show a high value of heterozygosity than those with high mappability. This is due to the spurious contribution of reads which originate from similar regions belonging to the same mappability group. This figure was obtained for H. sapiens chromosomes and from an in-house experiment with average coverage 30.
Figure 6. Influence of mismatch values and read lengths on the mappability of the human projected transcriptome as defined by GENCODE.
For simplicity, we display the proportion of k-mers having a frequency of 1 (i.e. uniquely mappable) and those having a frequency greater than 1 (ambiguous) in the first and second row, respectively. The influence of the number of mismatches and of the k-mer length is presented in the first and second column, respectively.
Figure 7. Cluster of 5S rRNAs on human chr1 exhibiting a very low mappability profile.
This locus explains the peaks observed for annotated rRNAs in the frequency range -.
Figure 8. Comparison of GENCODE protein-coding genes RPKM and RPKUM expression values as measured in brain tissue (data from [19]).
Both axes are log-scaled, and each dot represents a protein-coding gene with or without annotated paralogous genes (in green and red, respectively). Protein-coding genes totally or partially included in segmental duplications are presented in the top panel, whereas those not overlapping segmental duplications are shown in the bottom panel. The figure illustrates the importance of taking mappability information into account in order not to underestimate expression levels. Without mappability correction, two main factors are shown to bias the quantification of expression levels: genes having paralogs, and genes overlapping segmental duplications.
Figure 9. Influence of paralogous genes on the mappability scores: the example of the HLA-A gene.
The HLA-A gene is part of the Major Histocompatibility Complex (MHC), which involves a large gene family with numerous paralogs. This screenshot of the UCSC genome browser (with the six mappability tracks in green) illustrates the low uniqueness of the HLA-A gene (especially its exon 4), which could make it difficult to target by RNA-Seq if only uniquely mapping reads are considered.
Figure 10. Read mapping and mappability are different concepts: there is no straightforward relation between the number of times a read matches the genome and the mappability of the regions it maps to.
Within an edit distance of 1 mismatch, the sequence IMG maps uniquely to location ING in the schematic genome “......PING-PONG.....”. However, the matched position is not unique in the genome, since considering 1 mismatch it has a frequency of 2 due to location ONG.
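The PING-PONG example can be checked with a brute-force scan. The sketch below is only an illustration of the distinction drawn in the caption, not the algorithm used by gem-mappability; the helper names `hamming` and `approximate_hits` are made up for this example, and only substitutions (no insertions or deletions) are considered.

```python
def hamming(a, b):
    """Number of mismatching characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def approximate_hits(pattern, genome, max_mismatches):
    """Start positions where pattern matches genome with at most max_mismatches."""
    k = len(pattern)
    return [
        i
        for i in range(len(genome) - k + 1)
        if hamming(pattern, genome[i:i + k]) <= max_mismatches
    ]


genome = "......PING-PONG....."

# The read IMG has a single hit: the location of ING.
read_hits = approximate_hits("IMG", genome, 1)

# The matched location itself, ING, occurs twice within 1 mismatch (ING and ONG).
locus_hits = approximate_hits("ING", genome, 1)
```

Here `read_hits` contains one position while `locus_hits` contains two, so the read maps uniquely even though the region it maps to has a frequency of 2.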
Figure 11. Schematic representation of the computation of the paired-end mappability.
In this example the average of the single-end mappabilities at the target position (base C) is bigger than the average of the single-end mappabilities at one of the pairs (base A). Hence the resulting paired-end mappability will be the average of the mean mappabilities at C and A.
Figure 12. Behavior of pileup single-end and paired-end mappabilities at different loci of human chromosome 1 (HSA1).
Parameters used to generate this example were: k-mer length 100, 2 mismatches and a library size of 800 bases. Top left: Heatmap of the number of locations in HSA1 as a function of their single-end and paired-end mappabilities. Bottom left: Histogram of the number of locations in HSA1 that show different single-end and paired-end mappabilities, plotted versus their position along the chromosome. Top right: Heatmap of the number of locations in HSA1 as a function of their single-end mappability and their position along the chromosome. Bottom right: Heatmap of the number of locations in HSA1 as a function of their paired-end mappability and their position along the chromosome.
Figure 13. Proportion of completely rescuable positions for all human chromosomes.
In this figure we only consider positions having a single-end mappability greater than 1, and for different library sizes (300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 4000, 6000, 8000 and 10000 bp) we plot the fraction of locations which could be rescued by taking advantage of the fact that they have a paired-end mappability equal to one.
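The fraction plotted in this figure can be expressed as a small computation. The sketch below is a hedged illustration only: `rescuable_fraction` is a hypothetical name, the two inputs are assumed to be per-position values on the scale used in this caption (1 means unique, greater than 1 means ambiguous), and how the paired-end values are obtained for a given library size is outside its scope.

```python
def rescuable_fraction(single_end, paired_end):
    """Fraction of ambiguous single-end positions that become unique in paired-end mode.

    Both arguments are equal-length sequences of per-position values where
    1 means unique and anything greater than 1 means ambiguous.
    """
    # Keep only positions that are ambiguous in single-end mode.
    ambiguous = [(s, p) for s, p in zip(single_end, paired_end) if s > 1]
    if not ambiguous:
        return 0.0
    # A position counts as rescued when its paired-end value is exactly 1.
    rescued = sum(1 for _, p in ambiguous if p == 1)
    return rescued / len(ambiguous)


# Made-up values for five positions, for illustration only.
fraction = rescuable_fraction([1, 2, 3, 4, 1], [1, 1, 2, 1, 1])
```

Repeating the computation with paired-end values derived for each library size would give one point per library size, as in the figure.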
Similar articles
- Umap and Bismap: quantifying genome and methylome mappability.
Karimzadeh M, Ernst C, Kundaje A, Hoffman MM. Nucleic Acids Res. 2018 Nov 16;46(20):e120. doi: 10.1093/nar/gky677. PMID: 30169659. Free PMC article.
- Detection of False-Positive Deletions from the Database of Genomic Variants.
Duan J, Liu H, Zhao L, Yuan X, Wang YP, Wan M. Biomed Res Int. 2019 Apr 4;2019:8420547. doi: 10.1155/2019/8420547. eCollection 2019. PMID: 31080831. Free PMC article.
- Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score.
Lee H, Schatz MC. Bioinformatics. 2012 Aug 15;28(16):2097-105. doi: 10.1093/bioinformatics/bts330. Epub 2012 Jun 4. PMID: 22668792. Free PMC article.
- A beginners guide to SNP calling from high-throughput DNA-sequencing data.
Altmann A, Weber P, Bader D, Preuss M, Binder EB, Müller-Myhsok B. Hum Genet. 2012 Oct;131(10):1541-54. doi: 10.1007/s00439-012-1213-z. Epub 2012 Aug 11. PMID: 22886560. Review.
- Advancements in Next-Generation Sequencing.
Levy SE, Myers RM. Annu Rev Genomics Hum Genet. 2016 Aug 31;17:95-115. doi: 10.1146/annurev-genom-083115-022413. Epub 2016 Jun 9. PMID: 27362342. Review.
Cited by
- Evaluation of tools for identifying large copy number variations from ultra-low-coverage whole-genome sequencing data.
Smolander J, Khan S, Singaravelu K, Kauko L, Lund RJ, Laiho A, Elo LL. BMC Genomics. 2021 May 17;22(1):357. doi: 10.1186/s12864-021-07686-z. PMID: 34000988. Free PMC article.
- Causes and consequences of chromatin variation between inbred mice.
Hosseini M, Goodstadt L, Hughes JR, Kowalczyk MS, de Gobbi M, Otto GW, Copley RR, Mott R, Higgs DR, Flint J. PLoS Genet. 2013 Jun;9(6):e1003570. doi: 10.1371/journal.pgen.1003570. Epub 2013 Jun 13. PMID: 23785304. Free PMC article.
- CRAG: de novo characterization of cell-free DNA fragmentation hotspots in plasma whole-genome sequencing.
Zhou X, Zheng H, Fu H, Dillehay McKillip KL, Pinney SM, Liu Y. Genome Med. 2022 Dec 8;14(1):138. doi: 10.1186/s13073-022-01141-8. PMID: 36482487. Free PMC article.
- Dosage regulation, and variation in gene expression and copy number of human Y chromosome ampliconic genes.
Vegesna R, Tomaszkiewicz M, Medvedev P, Makova KD. PLoS Genet. 2019 Sep 16;15(9):e1008369. doi: 10.1371/journal.pgen.1008369. eCollection 2019 Sep. PMID: 31525193. Free PMC article.
- Quality control and evaluation of plant epigenomics data.
Schmitz RJ, Marand AP, Zhang X, Mosher RA, Turck F, Chen X, Axtell MJ, Zhong X, Brady SM, Megraw M, Meyers BC. Plant Cell. 2022 Jan 20;34(1):503-513. doi: 10.1093/plcell/koab255. PMID: 34648025. Free PMC article.
References
- Ribeca P. The GEM (GEnome Multitool) library. 2008. URL: http://gemlibrary.sourceforge.net. Accessed 2011 Dec 23.