Reliable identification of genomic variants from RNA-seq data

Robert Piskol et al. Am J Hum Genet. 2013.

Abstract

Identifying genomic variation is a crucial step for unraveling the relationship between genotype and phenotype and can yield important insights into human diseases. Prevailing methods rely on cost-intensive whole-genome sequencing (WGS) or whole-exome sequencing (WES) approaches while the identification of genomic variants from often existing RNA sequencing (RNA-seq) data remains a challenge because of the intrinsic complexity in the transcriptome. Here, we present a highly accurate approach termed SNPiR to identify SNPs in RNA-seq data. We applied SNPiR to RNA-seq data of samples for which WGS and WES data are also available and achieved high specificity and sensitivity. Of the SNPs called from the RNA-seq data, >98% were also identified by WGS or WES. Over 70% of all expressed coding variants were identified from RNA-seq, and comparable numbers of exonic variants were identified in RNA-seq and WES. Despite our method's limitation in detecting variants in expressed regions only, our results demonstrate that SNPiR outperforms current state-of-the-art approaches for variant detection from RNA-seq data and offers a cost-effective and reliable alternative for SNP discovery.

Copyright © 2013 The American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

Figures

Figure 1

A Computational Framework for the Identification of SNPs from Transcriptome Data. Shown are RNA-seq reads mapped to the human reference genome (blue lines) and all regions spanning known splice junctions (yellow lines separated by dashes). Subsequent variant calling used GATK and filtering to remove spurious sites, generating a high-confidence set of SNVs.

Figure 2

Comparison of SNPs Identified via RNA-Seq and WGS of GM12878 Cells and PBMCs. SNPiR achieved high precision for both GM12878 (A) and PBMC (B) data sets, given that most of the RNA-seq variants were also identified by WGS of the same subject. Numbers in parentheses give the percentage of RNA-seq variants found in WGS.

Figure 3

Characteristics of SNPs Identified from RNA-Seq Data of GM12878 Cells. (A) The composition of genomic regions for variants in WGS, WES, and RNA-seq suggests a high enrichment of RNA-seq variants in functionally important regions. Sites present in RNA-seq and WES occurred substantially more often in coding exons. (B) Overlap in coding variants detected from RNA-seq and WGS. Of all coding variants, 40.2% were found by RNA-seq. The majority of the remaining sites were not detected as a result of the lack of expression. “No variation” indicates that the position was homozygous in RNA, “OK but filtered” indicates that the position was heterozygous but was removed by one of our filtering steps, and “not expressed” indicates that the position was not covered by RNA-seq reads.

Figure 4

High Sensitivity of SNPiR Variant Calling in Coding Regions of Expressed Genes of GM12878 Cells. (A) Sensitivity and number of detected variants called from RNA-seq data as a function of the minimum gene expression (in FPKM). (B) Cumulative distribution of expression levels (in FPKM) for all reference genes.
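One way to read panel (A) is as sensitivity recomputed over progressively stricter expression cutoffs. The sketch below illustrates that reading under stated assumptions: the variant IDs, gene names, FPKM values, and the function name `sensitivity_by_min_fpkm` are all hypothetical, and the exact procedure used in the paper may differ.

```python
def sensitivity_by_min_fpkm(wgs_coding_variants, rna_variant_ids, gene_fpkm, thresholds):
    """For each minimum-FPKM cutoff, keep WGS coding variants in genes at or
    above the cutoff and report the fraction also called from RNA-seq.

    wgs_coding_variants: list of (variant_id, gene) pairs (hypothetical layout)
    rna_variant_ids:     set of variant IDs called from RNA-seq
    gene_fpkm:           dict mapping gene name to expression in FPKM
    """
    rows = []
    for cutoff in thresholds:
        eligible = [vid for vid, gene in wgs_coding_variants
                    if gene_fpkm.get(gene, 0.0) >= cutoff]
        found = sum(1 for vid in eligible if vid in rna_variant_ids)
        # Sensitivity is undefined when no variant passes the cutoff.
        sens = found / len(eligible) if eligible else None
        rows.append((cutoff, len(eligible), sens))
    return rows


# Toy data, invented purely for illustration.
gene_fpkm = {"G1": 0.5, "G2": 5.0, "G3": 50.0}
wgs_coding = [("v1", "G1"), ("v2", "G1"), ("v3", "G2"), ("v4", "G2"), ("v5", "G3")]
rna_found = {"v3", "v5"}

curve = sensitivity_by_min_fpkm(wgs_coding, rna_found, gene_fpkm, [0, 1, 10])
for cutoff, n_eligible, sens in curve:
    print(cutoff, n_eligible, sens)
```

In this toy example, raising the cutoff removes variants in weakly expressed genes from the denominator, so the measured sensitivity rises as the set of eligible variants shrinks.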

Figure 5

Subsampling of RNA-Seq Reads. Subsamplings of 5, 10, 20, 50, and 100 million reads were generated from the total set of 499 million GM12878 RNA-seq reads. We compared (A) the number of discovered variants, (B) the number of variants in coding regions, and (C) the genomic location of variants between the random samplings and the complete set of RNA-seq reads, as well as (D) the mutational profile of known RNA-seq variants and genomic variants. In (A) and (B), “known” variants denote all variant sites that were discovered from RNA-seq and were either confirmed through WGS or present in dbSNP. Conversely, “novel” denotes all variants that were previously not found from WGS or dbSNP. The total amounts of novel variants per sample size are shown as small numbers above the data series.
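The subsampling design described in this caption can be sketched as drawing reads at random without replacement at several depths. This is a minimal illustration, not the authors' procedure: the caption does not say whether the subsamples were nested or drawn independently, so independent draws with a fixed seed are assumed here, and the read IDs, the function name `subsample_reads`, and the scaled-down depths are hypothetical.

```python
import random


def subsample_reads(read_ids, n_reads, seed=0):
    """Draw n_reads distinct reads uniformly at random (without replacement)."""
    if n_reads > len(read_ids):
        raise ValueError("cannot draw more reads than are available")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return rng.sample(read_ids, n_reads)


# Toy stand-in for the full read set; the caption's depths are in millions,
# so the sizes below are scaled down for illustration only.
all_reads = [f"read_{i}" for i in range(1000)]
subsamples = {n: subsample_reads(all_reads, n, seed=42) for n in (5, 10, 20, 50, 100)}

for depth, reads in subsamples.items():
    print(depth, len(reads))
```

Each subsample would then go through the same variant-calling steps as the complete read set, so that variant counts can be compared across depths.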

Figure 6

Comparison of Genomic Variants Identified in CCDS and Exonic Regions by WGS, WES, or RNA-Seq. An equal number of reads (94.1 million) of RNA-seq and WES data was used for fair comparison of variants identified in CCDS regions (A) and exonic regions (B).

Figure 7

Comparison of SNPiR with RNASEQR. Overlap between the sites detected by SNPiR and RNASEQR on the same RNA-seq data set for GM12878 cells (A) and the number of known and novel variants discovered by SNPiR and RNASEQR, the precision and sensitivity of variant calling, and the ts/tv ratio for each category (B). Precision was calculated as the fraction of RNA-seq variants either supported by WGS or present in dbSNP. Sensitivity was determined as the fraction of WGS variants both found in coding regions and discovered in the RNA-seq data.
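The precision and sensitivity definitions in this caption, together with the transition/transversion (ts/tv) ratio, can be written out with set operations. The following is a minimal sketch under stated assumptions: variants are represented as hypothetical `(chrom, pos, ref, alt)` tuples, the toy call sets are invented, and the function names are illustrative rather than part of SNPiR or RNASEQR.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T)
# substitutions; every other single-base change is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def precision(rna_calls, wgs_calls, dbsnp_sites):
    """Fraction of RNA-seq variants supported by WGS or present in dbSNP."""
    if not rna_calls:
        return None
    supported = {v for v in rna_calls if v in wgs_calls or v in dbsnp_sites}
    return len(supported) / len(rna_calls)


def sensitivity(rna_calls, wgs_coding_calls):
    """Fraction of WGS coding variants that were also found in RNA-seq."""
    if not wgs_coding_calls:
        return None
    return len(wgs_coding_calls & rna_calls) / len(wgs_coding_calls)


def ts_tv_ratio(calls):
    """Ratio of transitions to transversions among single-nucleotide calls."""
    ts = sum(1 for _chrom, _pos, ref, alt in calls if (ref, alt) in TRANSITIONS)
    tv = len(calls) - ts
    return ts / tv if tv else None


# Toy call sets, invented for illustration only.
rna = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
       ("chr2", 50, "A", "C"), ("chr2", 75, "G", "A")}
wgs_coding = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
              ("chr3", 10, "T", "C"), ("chr2", 60, "G", "T")}
dbsnp = {("chr2", 50, "A", "C")}

print(precision(rna, wgs_coding, dbsnp))   # 3 of 4 RNA-seq calls are supported
print(sensitivity(rna, wgs_coding))        # 2 of 4 WGS coding calls are recovered
print(ts_tv_ratio(rna))                    # 3 transitions, 1 transversion
```

Note that the two measures use different denominators: precision is computed over the RNA-seq calls, whereas sensitivity is computed over the WGS variants in coding regions.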
