Fast computation and applications of genome mappability

Thomas Derrien et al. PLoS One. 2012.

Abstract

We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1

Figure 1. Effect of our approximation on the frequencies of the C. elegans genome, for formula image and formula image.

Both the exact and the approximated data were obtained with gem-mappability, the former by setting the value of parameter formula image to formula image, the latter with the default value of formula image automatically selected by the program based on the length of the C. elegans genome. Each panel shows how our approximation scatters the k-mers originally populating a non-approximate formula image-bit frequency bin into more than one approximate bin. Using the panel – as an example, one can see that about 80% of the k-mers fall into the correct bin, while the remaining 20% are dispersed in bins from – to –, with most of the k-mers staying in bins close to the correct one. In addition, the color of the bins shows that this 20% of k-mers corresponds in absolute terms to a small number (in this example about 90% of the k-mers of the genome are unique and hence fall into the [1–1] bin, which, as explained in the text, is not perturbed by our approximation owing to the good properties of the latter).

Figure 2

Figure 2. Effect of our approximation on the frequencies of chromosome formula image of H. sapiens, for formula image and formula image.

Both the exact and the approximated data were obtained with gem-mappability, the former by setting the value of parameter formula image to formula image, the latter with the default value of formula image automatically selected by the program based on the length of chromosome formula image of H. sapiens. Each panel shows how our approximation scatters the k-mers originally populating a non-approximate formula image-bit frequency bin into more than one approximate bin.

Figure 3

Figure 3. Visualization of mappability on the UCSC browser: the example of the human TK1 gene.

Six mappability tracks (green) are shown here corresponding to k-mer sizes formula image, formula image, formula image, formula image and formula image bp (from top to bottom of the figure). Regions with low mappability score have high frequencies, and conversely. This example illustrates that the uniqueness of the TK1 locus (especially within the introns) could be inversely correlated with the presence of some repetitive elements as identified by RepeatMasker.

Figure 4

Figure 4. Pileup mappability.

The number of all possible k-mers covering a particular position of the genome (corresponding to nucleotide C) is equal to k (formula image in this example). The average of the mappabilities of these k-mers can be taken as the pileup mappability. This quantity represents how mappable this position would be in a pileup from a whole-genome sequencing study with reads of length k.
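The averaging described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pileup_mappability` and the toy `track` values are hypothetical, and the input is assumed to be a list whose entry `s` holds the mappability of the k-mer starting at position `s`. Near the ends of the sequence fewer than k k-mers cover a position, and the sketch simply averages the ones that exist.

```python
from statistics import mean


def pileup_mappability(kmer_mappability, k, position):
    """Average the mappabilities of all k-mers covering `position`.

    kmer_mappability[s] is the mappability of the k-mer starting at s
    (hypothetical input, e.g. values read from a mappability track).
    """
    n_starts = len(kmer_mappability)
    # A k-mer starting at s covers positions s .. s + k - 1.
    first = max(0, position - k + 1)
    last = min(position, n_starts - 1)
    if first > last:
        raise ValueError("position is not covered by any k-mer")
    return mean(kmer_mappability[first:last + 1])


# Toy track for a sequence of 8 bases and k = 3 (6 possible start positions).
track = [1.0, 1.0, 0.5, 0.5, 1.0, 0.25]
k = 3
inner = pileup_mappability(track, k, 3)  # averages starts 1, 2 and 3
edge = pileup_mappability(track, k, 7)   # only the k-mer starting at 5 covers it
```

Here `inner` averages three k-mers, matching the statement that k k-mers cover an interior position, while `edge` shows the boundary case.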

Figure 5

Figure 5. Relation between heterozygosity and pileup mappability.

Low-pileup-mappability regions are more prone to show a high value of heterozygosity than those with high mappability. This is due to the spurious contribution of reads which originate from similar regions belonging to the same mappability group. This figure was obtained for H. sapiens chromosomes formula image and formula image from an in-house experiment with average coverage 30×.

Figure 6

Figure 6. Influence of mismatch values and read lengths on the mappability of the human projected transcriptome as defined by GENCODE.

For simplicity, we display the proportion of k-mers having a frequency of 1 (i.e. uniquely mappable) and those having a frequency formula image (ambiguous) on the first and second row, respectively. The influence of the number of mismatches and of the k-mer length is presented in the first and second column, respectively.

Figure 7

Figure 7. Cluster of 5S rRNAs on human chr1 exhibiting a very low mappability profile.

This locus explains the peaks observed for annotated rRNAs in the frequency range formula image-formula image.

Figure 8

Figure 8. Comparison of RPKM and RPKUM expression values of GENCODE protein-coding genes as measured in brain tissue (data from [19]).

Both axes are log-scaled, and each dot represents a protein-coding gene with or without annotated paralogous genes (in green and red, respectively). Protein-coding genes totally or partially included in segmental duplications are presented in the top panel, whereas those not overlapping segmental duplications are shown in the bottom panel. The figure illustrates the importance of taking mappability information into account in order not to underestimate expression levels. Without mappability correction, two main factors are shown to bias the quantification of expression levels: genes having paralogs, and genes overlapping segmental duplications.
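To make the difference between the two measures concrete, the following Python sketch contrasts them on invented numbers. RPKM is the standard "reads per kilobase per million mapped reads". The RPKUM formula used here is an assumption, not taken from this page: it normalizes by the number of uniquely mappable bases of the gene instead of its full length. The function names and all input values are hypothetical.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase of gene model per million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def rpkum(unique_read_count, uniquely_mappable_bp, total_mapped_reads):
    """Assumed RPKUM: normalize by uniquely mappable bases only."""
    if uniquely_mappable_bp == 0:
        # No uniquely mappable base: the value is undefined.
        return None
    return unique_read_count * 1e9 / (uniquely_mappable_bp * total_mapped_reads)


# Invented example: a 2000 bp gene of which only 1000 bp are uniquely mappable.
reads = 500
total = 10_000_000
plain = rpkm(reads, 2000, total)
corrected = rpkum(reads, 1000, total)
```

With these numbers the corrected value is twice the uncorrected one, which is the kind of underestimation the caption warns about for genes with paralogs or in segmental duplications.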

Figure 9

Figure 9. Influence of paralogous genes on the mappability scores: the example of the HLA-A gene.

The HLA-A gene is part of the Major Histocompatibility Complex (MHC) involving a large gene family with numerous paralogs. This screenshot of the UCSC genome browser (with the six mappability tracks in green) illustrates the low uniqueness of the HLA-A gene (especially its exon 4), which could make it difficult to target by RNA-Seq if only uniquely mapping reads are considered.

Figure 10

Figure 10. Read mapping and mappability are different concepts: there is no straightforward relation between the number of times a read matches the genome and the mappability of the regions it maps to.

Within an edit distance of 1 mismatch, the sequence IMG maps uniquely to location ING in the schematic genome “......PING-PONG.....”. However, the matched position is not unique in the genome, since considering 1 mismatch it has a frequency of 2 due to location ONG.
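The PING-PONG example can be checked with a brute-force scan. The Python sketch below is an illustration only, not the GEM algorithm: it compares the query against every window of the schematic genome and keeps those within the allowed number of mismatches (Hamming distance). The helper names `hamming` and `match_positions` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching characters between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def match_positions(query, genome, max_mismatches):
    """Start positions where `query` matches within `max_mismatches`."""
    k = len(query)
    return [
        i
        for i in range(len(genome) - k + 1)
        if hamming(query, genome[i:i + k]) <= max_mismatches
    ]


genome = "......PING-PONG....."

# The read IMG maps to a single location (ING) with up to 1 mismatch.
read_hits = match_positions("IMG", genome, 1)

# The matched location ING itself occurs twice with up to 1 mismatch
# (ING and ONG), so its frequency is 2.
locus_hits = match_positions("ING", genome, 1)
frequency = len(locus_hits)
```

The read has one hit while the locus it maps to has frequency 2, which is the distinction the caption draws between read mapping and mappability.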

Figure 11

Figure 11. Schematic representation of the computation of the paired-end mappability.

In this example the average of the single-end mappabilities at the target position (base C) is bigger than the average of the single-end mappabilities at one of the pairs (base A). Hence the resulting paired-end mappability will be the average of the mean mappabilities at C and A.

Figure 12

Figure 12. Behavior of pileup single-end and paired-end mappabilities at different loci of human chromosome 1 (HSA1).

Parameters used to generate this example were: k-mer length 100, 2 mismatches and a library size of 800 bases. Top left: Heatmap of the number of locations in HSA1 as a function of their single-end and paired-end mappabilities. Bottom left: Histogram of the number of locations in HSA1 that show different single-end and paired-end mappabilities, plotted versus their position along the chromosome. Top right: Heatmap of the number of locations in HSA1 as a function of their single-end mappability and their position along the chromosome. Bottom right: Heatmap of the number of locations in HSA1 as a function of their paired-end mappability and their position along the chromosome.

Figure 13

Figure 13. Proportion of completely rescuable positions for all human chromosomes.

In this figure we only consider positions having a single-end mappability greater than 1, and for different library sizes (300, 400, 500, 600, 700, 800, 900, 1000, 1500, 2000, 4000, 6000, 8000 and 10000 bp) we plot the fraction of locations which could be rescued by taking advantage of the fact that they have a paired-end mappability equal to one.

References

    1. Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinformatics. 2010;11:473–83.
    2. Ribeca P. The GEM (GEnome Multitool) library. 2008. URL http://gemlibrary.sourceforge.net. Accessed 2011 Dec 23.
    3. Huda A, Mariño-Ramírez L, Landsman D, Jordan IK. Repetitive DNA elements, nucleosome binding and human gene expression. Gene. 2009;436:12–22.
    4. Feschotte C. Transposable elements and the evolution of regulatory networks. Nat Rev Genet. 2008;9:397–405.
    5. Whiteford N, Haslam N, Weber G, Prügel-Bennett A, Essex JW, et al. An analysis of the feasibility of short read sequencing. Nucleic Acids Res. 2005;33:e171.
