Phased whole-genome genetic risk in a family quartet using a major allele reference sequence

PLoS Genet. 2011 Sep;7(9):e1002280.

doi: 10.1371/journal.pgen.1002280. Epub 2011 Sep 15.

Frederick E Dewey, Rong Chen, Sergio P Cordero, Kelly E Ormond, Colleen Caleshu, Konrad J Karczewski, Michelle Whirl-Carrillo, Matthew T Wheeler, Joel T Dudley, Jake K Byrnes, Omar E Cornejo, Joshua W Knowles, Mark Woon, Katrin Sangkuhl, Li Gong, Caroline F Thorn, Joan M Hebert, Emidio Capriotti, Sean P David, Aleksandra Pavlovic, Anne West, Joseph V Thakuria, Madeleine P Ball, Alexander W Zaranek, Heidi L Rehm, George M Church, John S West, Carlos D Bustamante, Michael Snyder, Russ B Altman, Teri E Klein, Atul J Butte, Euan A Ashley



Frederick E Dewey et al. PLoS Genet. 2011 Sep.

Abstract

Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation. Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of genomes from a nuclear family with history of familial thrombophilia. We demonstrate that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites to the lowest median resolution demonstrated to date (< 1,000 base pairs). We use family inheritance state analysis to control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific, family-based approaches to interpretation of genetic variation are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.


Conflict of interest statement

JVT and AWZ are founders, consultants, and equity holders in Clinical Future; GMC has advisory roles in and research sponsorships from several companies involved in genome sequencing technology and personal genomics (see http://arep.med.harvard.edu/gmc/tech.html); MS is on the scientific advisory board of DNA Nexus and holds stock in Personalis; RBA has received consultancy fees from Novartis and 23andMe and holds stock in Personalis; AJB is a scientific advisory board member and founder for NuMedii and Genstruct, a scientific advisory board member for Johnson and Johnson, has received consultancy fees from Lilly, NuMedii, Johnson and Johnson, Genstruct, Tercica, and Prevendia and honoraria from Lilly and Siemens, and holds stock in NuMedii, Genstruct, and Personalis. EAA holds stock in Personalis.

Figures


Figure 1. Pedigree and genetic risk prediction workflow.

A, Family pedigree with known medical history. The displayed ages represent the age of death for deceased subjects or the age at the time of medical history collection (9/2010) for living family members. Arrows denote sequenced family members. Abbreviations: AD, Alzheimer's disease; CABG, coronary artery bypass graft surgery; CHF, congestive heart failure; CVA, cerebrovascular accident; DM, diabetes mellitus; DVT, deep venous thrombosis; GERD, gastroesophageal reflux disease; HTN, hypertension; IDDM, insulin-dependent diabetes mellitus; MI, myocardial infarction; SAB, spontaneous abortion; SCD, sudden cardiac death. B, Workflow for phased genetic risk evaluation using whole genome sequencing.


Figure 2. Development of major allele reference sequences.

Allele frequencies from the low coverage whole genome sequencing pilot of the 1000 genomes data were used to estimate the major allele for each of the three main HapMap populations. The major allele was substituted for the NCBI reference sequence 37.1 reference base at every position at which the reference base differed from the major allele, resulting in approximately 1.6 million single nucleotide substitutions in the reference sequence. A, Approximately half of these positions were shared between all three HapMap population groups, with the YRI population containing the greatest number of major alleles differing from the NCBI reference sequence. B, Number of disease-associated variants represented in the NCBI reference genome by the minor allele in each of the three HapMap populations. C, Number of positions per Mbp at which the major allele differed from the reference base by chromosome and HapMap population.
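The substitution step described in this caption can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_major_allele_reference`, the `allele_counts` input layout, and the rule that ties are resolved in favor of the existing reference base are all assumptions made for the example.

```python
def build_major_allele_reference(reference, allele_counts):
    """Substitute the population major allele into a reference sequence.

    reference     -- string of reference bases
    allele_counts -- dict: 0-based position -> {base: observed count}
                     (hypothetical input layout, chosen for illustration)

    Returns the new sequence and the sorted list of substituted positions.
    """
    bases = list(reference)
    substituted = []
    for pos, counts in allele_counts.items():
        if not counts:
            continue
        ref_base = bases[pos]
        # Pick the most frequent allele; on a tie, prefer the reference base
        # (an assumption of this sketch, not stated in the source).
        major = max(counts, key=lambda b: (counts[b], b == ref_base))
        if major != ref_base:
            bases[pos] = major
            substituted.append(pos)
    return "".join(bases), sorted(substituted)


# Toy data: position 1 has a non-reference major allele, position 3 does not,
# and position 4 is an exact tie.
ref = "ACGTAC"
counts = {
    1: {"C": 30, "T": 90},
    3: {"T": 80, "G": 40},
    4: {"A": 50, "G": 50},
}
new_ref, changed = build_major_allele_reference(ref, counts)
```

Running the same function once per population's allele counts would yield one major allele reference per population, which is the arrangement the caption describes for the three HapMap groups.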


Figure 3. Inheritance state analysis, error estimation, and phasing.

A, A Hidden Markov Model (HMM) was used to infer one of four Mendelian and two non-Mendelian inheritance states for each allele assortment at variant positions across the quartet. “MIE-rich” refers to Mendelian-inheritance error (MIE) rich regions. “Compression” refers to genotype errors from heterozygous structural variation in the reference or study subjects, manifest as a high proportion of uniformly heterozygous positions across the quartet. B, A combination of quality score calibration using orthogonal genotyping technology and filtering SNVs in error-prone regions (MIE-rich and compression regions) identified by the HMM resulted in >90% reduction in the genotype error rate estimated by the MIE rate. C, Consistent with PRDM9 allelic status, approximately half of all recombinations in each parent occurred in hotspots. The mother has two haplotypes in the gene RNF212 associated with low recombination rates, while the father has one haplotype each associated with high and low recombination rates. Notation denotes base at [rs3796619, rs1670533]. D, Variant phasing using pedigree, inheritance state, and population linkage disequilibrium data. Pedigree data were first used to phase informative allele assortments in trios (top). The inheritance state of neighboring regions was used to phase positions in which all members of a mother-father-child trio were heterozygous and the sibling was homozygous for the reference or non-reference allele (middle). For uniformly heterozygous positions, we phased the non-reference allele using a maximum likelihood model to assign the non-reference allele to paternal or maternal chromosomes based on population linkage disequilibrium with phased SNVs within 250 kbp (bottom). In all panels a corresponds to the reference allele and b to the non-reference allele.
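The first phasing step in panel D (pedigree-based phasing of informative trio positions) can be illustrated with a short sketch. It covers only that step: the inheritance-state and linkage-disequilibrium steps are not modeled, and the function name `phase_child_by_trio` and the tuple genotype encoding are hypothetical choices for this example. Alleles follow the caption's notation, with `a` for the reference allele and `b` for the non-reference allele.

```python
from itertools import product


def phase_child_by_trio(father, mother, child):
    """Assign a child's two alleles to paternal and maternal origin.

    Each genotype is an unordered pair of alleles, e.g. ("a", "b").
    Returns (paternal_allele, maternal_allele) when exactly one assignment
    is consistent with Mendelian inheritance, None when the position is
    uninformative (more than one assignment fits), and raises ValueError
    on a Mendelian inheritance error (no assignment fits).
    """
    # Enumerate every (allele from father, allele from mother) combination
    # and keep those that reproduce the child's unordered genotype.
    consistent = {
        (p, m)
        for p, m in product(set(father), set(mother))
        if sorted((p, m)) == sorted(child)
    }
    if not consistent:
        raise ValueError("Mendelian inheritance error")
    if len(consistent) > 1:
        return None  # e.g. all three trio members heterozygous
    return next(iter(consistent))


# Informative position: only the mother can have contributed "b".
informative = phase_child_by_trio(("a", "a"), ("a", "b"), ("a", "b"))

# Uniformly heterozygous trio: pedigree data alone cannot resolve phase.
ambiguous = phase_child_by_trio(("a", "b"), ("a", "b"), ("a", "b"))
```

Positions that return `None` here correspond to the cases the caption hands to the later steps (neighboring inheritance state, then population linkage disequilibrium).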


Figure 4. Ancestry and immunogenotyping using phased variant data.

A, Ancestry analysis of maternal and paternal origins based on principal components analysis of SNP genotypes intersected with the Population Reference Sample dataset. B, The HMM identified a recombination spanning the HLA–B locus and facilitated resolution of haplotype phase at HLA loci. Contig colors in the lower panel correspond to the inheritance state as depicted in Figure 3A. C, Common HLA types for family quartet based on phased sequence data.


Figure 5. Common variant risk prediction.

A, Common variant risk prediction for 28 disease states for each of the family members (f, father; m, mother; s, son; d, daughter) and 174 ethnicity-matched HapMap subjects. The x-axis in each plot represents the log10(likelihood ratio) for each disease according to allelic distribution of SNPs identified in the literature as significantly associated with disease by 2 or more studies including 2000 or more total subjects. B, Upper left: pre (base) and post (bar end) estimates of disease risk for the father according to common variant risk prediction, derived from the pre-probability of disease multiplied by the composite likelihood ratio from all SNPs meeting the criteria described above. Upper right: Composite likelihood ratio estimates for disease risk according to common genetic variation. Blue bars represent paternal estimate, pink bars represent maternal estimate, red points represent the estimate for the daughter, and blue points represent the estimate for the son. Lower panels: parental haplotype contribution to disease risk for each child (points) for the daughter (lower left) and son (lower right). Blue shading represents paternal haplotype risk allele contribution and pink shading represents maternal haplotype risk allele contribution.
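The arithmetic behind a composite likelihood ratio and a post-test estimate can be sketched as below. The caption does not give the exact computation, so this example makes two labeled assumptions: per-SNP likelihood ratios are treated as independent and therefore multiply (their log10 values add), and the pre-test probability is converted to odds before the likelihood ratio is applied, which is the conventional form of Bayes' rule. The function names and the numbers are illustrative only.

```python
import math


def composite_log10_lr(likelihood_ratios):
    """Sum per-SNP log10 likelihood ratios (assumes independent SNPs)."""
    return sum(math.log10(lr) for lr in likelihood_ratios)


def post_test_probability(pre_test_probability, log10_lr):
    """Update a pre-test probability with a composite likelihood ratio.

    Uses the odds form of Bayes' rule (an assumption of this sketch):
    post-test odds = pre-test odds * likelihood ratio.
    """
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * 10 ** log10_lr
    return post_odds / (1.0 + post_odds)


# Made-up per-SNP likelihood ratios; their product is 4.
snp_lrs = [2.0, 0.5, 4.0]
log_lr = composite_log10_lr(snp_lrs)

# A pre-test probability of 0.2 (odds 0.25) moves to about 0.5 (odds 1.0).
post = post_test_probability(0.2, log_lr)
```

Working on the log10 scale matches the x-axis described for panel A and keeps the composite value numerically stable when many SNPs contribute.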


