Simulating realistic genomic data with rare variants

Yaji Xu et al. Genet Epidemiol. 2013 Feb.

Abstract

Increasing evidence suggests that rare and generally deleterious genetic variants might strongly affect the risk of not only Mendelian diseases but also many common diseases. However, identifying such rare variants remains challenging, and novel statistical methods and bioinformatic software must be developed. These methods must in turn be evaluated extensively under reasonable genetic models. Although genomic data are abundant, they are of limited use for evaluating the methods because the underlying disease mechanism is unknown. It is therefore imperative to simulate genomic data that mimic real data containing rare variants and that allow a known disease penetrance model to be imposed. Resampling simulation methods have shown advantages in computational efficiency and in preserving important properties such as linkage disequilibrium (LD) and allele frequency, but, as we demonstrate, they still have limitations. We propose an algorithm that combines regression-based imputation with resampling to simulate genetic data with both rare and common variants. A logistic regression model was fitted to the relationship between a rare variant and its nearby common variants in the 1000 Genomes Project data; the fitted model was then applied to the real data to impute one rare variant at a time from the common variants. Individuals were then simulated from the real data with imputed rare variants. We compared our method with existing simulators and demonstrated that it performed well in qualitatively retaining properties of the real sample, such as LD and minor allele frequency (MAF).

© 2012 WILEY PERIODICALS, INC.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1. Simulating rare variants using datasets A and B

For each rare variant, a logistic regression model is built in dataset A using selected surrounding common variants, and the rare variant is then imputed in dataset B based on the fitted model.
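The fit-in-A, impute-in-B idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the rare variant is coded as binary carrier status (0/1), uses plain gradient descent instead of a statistical package, and all function names and toy genotypes are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit an intercept and one weight per common variant by batch
    gradient descent on the logistic log-loss (assumed fitting method)."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def carrier_probability(w, x):
    """Predicted probability of carrying the rare allele given common genotypes x."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def impute_rare(w, X_b, rng):
    """Draw a carrier status for each individual in dataset B from the fitted model."""
    return [1 if rng.random() < carrier_probability(w, x) else 0 for x in X_b]

# Toy dataset A: genotypes (0/1/2) at two common variants, plus observed
# carrier status at the rare variant (hypothetical values).
common_a = [[0, 0], [0, 1], [1, 0], [2, 1], [2, 2], [1, 2], [0, 0], [2, 2]]
rare_a = [0, 0, 0, 1, 1, 1, 0, 1]
weights = fit_logistic(common_a, rare_a)

# Toy dataset B: common variants only; the rare variant is filled in.
common_b = [[0, 0], [2, 2], [1, 1]]
rng = random.Random(0)
imputed_b = impute_rare(weights, common_b, rng)
```

Drawing the imputed status at random from the predicted probability, rather than thresholding it, is one design choice that keeps the imputed variant rare in proportion to the model's predictions; whether the original method draws or thresholds is not stated in the caption.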

Figure 2. Simulating a genotype vector from two individuals (four haplotype vectors)

Starting with the genomic data of individuals i and i + 1, a new individual is simulated by crossing over the pair of haplotypes from individual i and the pair of haplotypes from individual i + 1. An asterisk (*) in the figure marks the location of a cross-over event. The alleles in the rectangles are selected to form the new individual.
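A rough sketch of this resampling step follows. It assumes a single cross-over per haplotype pair at a uniformly chosen breakpoint and a random choice of which haplotype supplies the first segment; the caption does not specify either, and the function names and toy haplotypes are hypothetical.

```python
import random

def recombine(hap_a, hap_b, rng):
    """Single cross-over: copy one haplotype up to a random breakpoint,
    then switch to the other haplotype (assumed recombination model)."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same markers")
    if len(hap_a) < 2:
        raise ValueError("need at least two markers to place a cross-over")
    if rng.random() < 0.5:
        hap_a, hap_b = hap_b, hap_a  # pick the starting haplotype at random
    point = rng.randint(1, len(hap_a) - 1)  # breakpoint between two markers
    return hap_a[:point] + hap_b[point:]

def simulate_individual(ind_i, ind_next, rng):
    """Build a new individual from one recombinant haplotype per parent
    individual; the genotype is the per-site sum of the two haplotypes."""
    h1 = recombine(ind_i[0], ind_i[1], rng)
    h2 = recombine(ind_next[0], ind_next[1], rng)
    genotype = [a + b for a, b in zip(h1, h2)]
    return h1, h2, genotype

# Toy data: each individual is a pair of 0/1 haplotype vectors.
individual_i = ([0, 0, 1, 0, 1], [1, 0, 0, 0, 0])
individual_next = ([0, 1, 0, 0, 0], [0, 0, 0, 1, 1])
rng = random.Random(1)
new_h1, new_h2, new_genotype = simulate_individual(individual_i, individual_next, rng)
```

Because each new haplotype is assembled from contiguous segments of real haplotypes, short-range LD within a segment is carried over from the source data.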

Figure 3. LD patterns by different simulation methods and their reference (HapMap CEU)

Each curve shows how the average pairwise LD (R²) changes with marker distance for one of four simulators or for the reference HapMap CEU sample (black). Note that simuRare implements our regression-based resampling approach.
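One way to produce such an LD-by-distance curve is sketched below. It assumes phased 0/1 haplotypes, the standard haplotype-based R² = D² / (pA(1 − pA)pB(1 − pB)), and fixed-width distance bins; the bin width, function names, and toy data are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

def r_squared(col_a, col_b):
    """Haplotype-based R² between two biallelic markers coded 0/1.
    Returns None when either marker is monomorphic."""
    n = len(col_a)
    p_a = sum(col_a) / n
    p_b = sum(col_b) / n
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        return None
    p_ab = sum(a * b for a, b in zip(col_a, col_b)) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def mean_r2_by_distance(haplotypes, positions, bin_width):
    """Average R² over all marker pairs, grouped by physical distance bin."""
    columns = list(zip(*haplotypes))  # one tuple of alleles per marker
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            r2 = r_squared(columns[i], columns[j])
            if r2 is None:
                continue
            distance_bin = abs(positions[j] - positions[i]) // bin_width
            sums[distance_bin] += r2
            counts[distance_bin] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}

# Toy data: four haplotypes at three markers (hypothetical positions in bp).
toy_haplotypes = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
toy_positions = [100, 600, 5200]
ld_profile = mean_r2_by_distance(toy_haplotypes, toy_positions, bin_width=1000)
# ld_profile == {0: 1.0, 4: 0.0, 5: 0.0}
```

Running the same function on a simulated sample and on its reference sample gives two profiles that can be overlaid for comparison.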

Figure 4. LD patterns by different simulation methods and their reference (1000 Genomes)

Each curve shows how the average pairwise LD (R²) changes with marker distance for HAPGEN2, the regression-based algorithm, or their reference, the 1000 Genomes Phase I interim sample (black).

Figure 5. LD patterns by different simulation methods and their reference (1000 Genomes without rare SNPs)

Each curve shows how the average pairwise LD (R²) changes with marker distance for HAPGEN2, the regression-based algorithm, or their reference, the 1000 Genomes Phase I interim sample after removing SNPs with MAF less than 0.01 (black).

Figure 6. LD patterns by different simulation methods and their reference (1000 Genomes CEU)

Each curve shows how the average pairwise LD (R²) changes with marker distance for HAPGEN2, the regression-based algorithm, or their reference, the 1000 Genomes Phase I interim CEU sample (black).

Figure 7. LD patterns by different simulation methods and their reference (1000 Genomes CEU without rare SNPs)

Each curve shows how the average pairwise LD (R²) changes with marker distance for HAPGEN2, the regression-based algorithm, or their reference, the 1000 Genomes Phase I interim CEU sample after removing SNPs with MAF less than 0.01 (black).

Figure 8. Simulated allele frequencies from four simulators against the reference (HapMap CEU)

Each panel compares the MAFs from one simulator with those from the reference HapMap CEU sample with imputed rare SNPs. Each point represents the deviation of a simulated allele frequency from the real allele frequency obtained from the HapMap CEU sample.
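The per-SNP deviation plotted here can be computed along the following lines. This is a sketch under assumptions: genotypes are coded as counts (0/1/2) of one allele, the deviation is taken as simulated MAF minus reference MAF, and the function names and toy genotypes are hypothetical.

```python
def allele_frequencies(genotypes):
    """Frequency of the counted allele at each SNP; rows are individuals,
    columns are SNPs, entries are allele counts 0/1/2."""
    n = len(genotypes)
    return [sum(col) / (2 * n) for col in zip(*genotypes)]

def maf(freq):
    """Minor allele frequency for a biallelic SNP."""
    return min(freq, 1.0 - freq)

def maf_deviations(simulated, reference):
    """Simulated MAF minus reference MAF, SNP by SNP."""
    sim_freqs = allele_frequencies(simulated)
    ref_freqs = allele_frequencies(reference)
    return [maf(s) - maf(r) for s, r in zip(sim_freqs, ref_freqs)]

# Toy data: four individuals, two SNPs.
reference_genotypes = [[0, 1], [1, 0], [0, 2], [0, 0]]
simulated_genotypes = [[0, 1], [1, 1], [1, 1], [0, 1]]
deviations = maf_deviations(simulated_genotypes, reference_genotypes)
# deviations == [0.125, 0.125]
```

A histogram of these values gives a compact summary of how far a simulator drifts from the reference allele frequencies.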

Figure 9. Simulated allele frequencies from two simulators against the reference (1000 Genomes)

Each panel compares the MAFs from one simulator with those from the reference: either the 1000 Genomes CEU sample or a mixed population comprising all individuals from Africa, Asia, Europe, and the Americas. Each point represents the deviation of a simulated allele frequency from the real allele frequency obtained from the 1000 Genomes CEU or mixed sample.

Figure 10. Histograms of the simulated MAF errors for the 1000 Genomes CEU sample

The upper panel is the plot for HAPGEN2, and the lower panel is the plot for simuRare.

Figure 11. Histograms of the simulated MAF errors for the 1000 Genomes mixed sample

The upper panel is the plot for HAPGEN2, and the lower panel is the plot for simuRare.
