A public resource facilitating clinical use of genomes

Proc Natl Acad Sci U S A. 2012 Jul 24;109(30):11920-7.

doi: 10.1073/pnas.1201904109. Epub 2012 Jul 13.

Madeleine P Ball, Joseph V Thakuria, Alexander Wait Zaranek, Tom Clegg, Abraham M Rosenbaum, Xiaodi Wu, Misha Angrist, Jong Bhak, Jason Bobe, Matthew J Callow, Carlos Cano, Michael F Chou, Wendy K Chung, Shawn M Douglas, Preston W Estep, Athurva Gore, Peter Hulick, Alberto Labarga, Je-Hyuk Lee, Jeantine E Lunshof, Byung Chul Kim, Jong-Il Kim, Zhe Li, Michael F Murray, Geoffrey B Nilsen, Brock A Peters, Anugraha M Raman, Hugh Y Rienhoff, Kimberly Robasky, Matthew T Wheeler, Ward Vandewege, Daniel B Vorhaus, Joyce L Yang, Luhan Yang, John Aach, Euan A Ashley, Radoje Drmanac, Seong-Jin Kim, Jin Billy Li, Leonid Peshkin, Christine E Seidman, Jeong-Sun Seo, Kun Zhang, Heidi L Rehm, George M Church

Madeleine P Ball et al. Proc Natl Acad Sci U S A. 2012.

Abstract

Rapid advances in DNA sequencing promise to enable new diagnostics and individualized therapies. Achieving personalized medicine, however, will require extensive research on highly reidentifiable, integrated datasets of genomic and health information. To assist with this, participants in the Personal Genome Project choose to forgo privacy via our institutional review board-approved "open consent" process. The contribution of public data and samples facilitates both scientific discovery and standardization of methods. We present our findings after enrollment of more than 1,800 participants, including whole-genome sequencing of 10 pilot participant genomes (the PGP-10). We introduce the Genome-Environment-Trait Evidence (GET-Evidence) system. This tool automatically processes genomes and prioritizes both published and novel variants for interpretation. In the process of reviewing the presumed healthy PGP-10 genomes, we find numerous literature references implying serious disease. Although it is sometimes impossible to rule out a late-onset effect, stringent evidence requirements can address the high rate of incidental findings. To that end we develop a peer production system for recording and organizing variant evaluations according to standard evidence guidelines, creating a public forum for reaching consensus on interpretation of clinically relevant variants. Genome analysis becomes a two-step process: using a prioritized list to record variant evaluations, then automatically sorting reviewed variants using these annotations. Genome data, health and trait information, participant samples, and variant interpretations are all shared in the public domain; we invite others to review our results using our participant samples and contribute to our interpretations. We offer our public resource and methods to further personalized medical research.

Conflict of interest statement

G.M.C. has advisory roles in and research sponsorships from several companies involved in genome sequencing technology and personal genomics (http://arep.med.harvard.edu/gmc/tech.html).

Figures

Fig. 1.

PGP enrollment and data collection process. Enrollment in the PGP involves a series of steps meant to ensure informed consent for the public release of personal, reidentifiable genome and trait data. Current and historical copies of our consent forms are publicly available at http://www.personalgenomes.org/consent/.

Fig. 2.

Venn diagram comparisons of variant calls in PGP1 genomes. Analysis of PGP1 genome variant calls from three different tissues: fibroblast cells, fibroblast-derived iPS cells, and EBV-transformed lymphocyte cells. (A) Overlap of all variant calls, limited to positions that are explicitly called as reference or variant in all three genomes. Positions where any of the three genomes have a no-call (lacking coverage to make a confident call) are discarded from analysis. The low residual discordance consists of sequencing errors or real differences between these three tissues and indicates high sequence quality in each of these samples. (B) Overlap of all variant calls; positions not called in other genomes are included in the analysis. Most locations that were called as “variant” by one genome and not by other genomes were due to a lack of coverage in the other genomes. Reporting the regions confidently called as matching reference (as opposed to regions lacking sufficient coverage) is critical to genome interpretation and data comparisons.
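The difference between panels A and B can be illustrated with a minimal sketch. It is not the authors' pipeline: the data layout (a position mapped to `"ref"`, a variant allele, or `None` for a no-call), the function names, and the toy calls are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, FrozenSet, Optional, Set, Tuple

# Hypothetical layout: position -> "ref", a variant allele, or None (no-call).
Calls = Dict[int, Optional[str]]


def variant_sets(genomes: Dict[str, Calls],
                 require_all_called: bool) -> Dict[str, Set[Tuple[int, str]]]:
    """Collect (position, allele) variant calls per genome.

    With require_all_called=True (panel A style), any position that is a
    no-call or missing in at least one genome is discarded first.
    """
    positions = set().union(*(g.keys() for g in genomes.values()))
    if require_all_called:
        positions = {p for p in positions
                     if all(g.get(p) is not None for g in genomes.values())}
    return {
        name: {(p, g[p]) for p in positions if g.get(p) not in (None, "ref")}
        for name, g in genomes.items()
    }


def venn_counts(sets: Dict[str, Set[Tuple[int, str]]]) -> Dict[FrozenSet[str], int]:
    """Count variants in each Venn region, keyed by the genomes sharing them."""
    counts: Counter = Counter()
    for item in set().union(*sets.values()):
        region = frozenset(name for name, s in sets.items() if item in s)
        counts[region] += 1
    return dict(counts)


# Toy data, invented for this example only.
toy = {
    "fibroblast": {1: "A", 2: "ref", 3: "T", 4: None},
    "iPS":        {1: "A", 2: "ref", 3: "ref", 4: "G"},
    "lymphocyte": {1: "A", 2: "C", 3: "T", 4: None},
}

panel_a = venn_counts(variant_sets(toy, require_all_called=True))
panel_b = venn_counts(variant_sets(toy, require_all_called=False))
```

In the toy data, position 4 is called only in the iPS genome, so it appears as an iPS-only variant under the panel B rule and is dropped under the panel A rule. This is the kind of apparent discordance that the caption attributes to missing coverage rather than to real differences.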

Fig. 3.

Drop in number of new variants in each additional genome. For each new genome that is analyzed, the number of new variants not already seen in a previous genome falls dramatically. If editors record variant evaluations, the process of genome evaluation becomes easier as the number of new variants that are prioritized within each new genome is reduced. Data represent the average of 1,000 simulations using random orderings of a combined set of 64 genomes (the PGP-10 and 54 unrelated public genomes released by CGI).
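The averaging over random orderings described in this caption could be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: the function name, the representation of each genome as a set of variant identifiers, and the toy data are assumptions.

```python
import random
from typing import List, Sequence, Set


def mean_new_variants(genomes: Sequence[Set[str]],
                      n_orderings: int = 1000,
                      seed: int = 0) -> List[float]:
    """Average count of not-yet-seen variants at each rank of a random ordering.

    Entry k of the result is the mean number of variants in the (k+1)-th
    genome analyzed that were absent from all genomes analyzed before it.
    """
    rng = random.Random(seed)
    totals = [0] * len(genomes)
    order = list(range(len(genomes)))
    for _ in range(n_orderings):
        rng.shuffle(order)
        seen: Set[str] = set()
        for rank, idx in enumerate(order):
            totals[rank] += len(genomes[idx] - seen)  # variants new at this rank
            seen |= genomes[idx]
    return [t / n_orderings for t in totals]


# Toy data, invented for this example: three genomes sharing most variants.
toy_genomes = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}]
curve = mean_new_variants(toy_genomes)
```

Averaging over many orderings removes the dependence on which genome happens to be analyzed first, so the resulting curve reflects only how much each additional genome adds on average.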

Fig. 4.

Assessment of prioritization scores using disease-specific mutation databases. To demonstrate successful prioritization of variants with our prioritization score, we calculated the prioritization scores assigned to variant lists from a variety of disease-specific mutation databases: the Albinism Database (Albinism), the ALS Online Genetics Database (ALSOD), the Cardiogenomics Sarcomere Protein Gene Mutation Database (Cardiogen), the Connexins and Deafness Homepage (Cx-Deafness), and the Autosomal Dominant Polycystic Kidney Disease Mutation Database (PKDB). A variety of factors contribute to variation in performance for these lists: some diseases, for example, are more likely to be caused by severe frameshift or nonsense mutations (which we score highly), and some lists may include genes that are not yet used in clinical testing.

Fig. 5.

GET-Evidence and genome reports. (A) Using GET-Evidence involves genome upload followed by review of prioritized insufficiently evaluated variants. Combining these reviews with previously reviewed variants produces the final genome report. (B) Insufficiently evaluated variants are ranked according to prioritization score and are listed with additional information of interest (allele frequency, number of associated articles, presence in databases, and computational predictions). (C) Sufficiently evaluated variants are presented in the genome report with summary information regarding variant effect, severity, and evidence.

Fig. 6.

Sample GET-Evidence variant report. Variant report pages on GET-Evidence allow editors to record and organize information relevant to variant interpretation. A scoring system is used for variant evidence and clinical importance categories to allow automatic sorting of interpreted variants. On the basis of the strong case/control evidence and high treatability and penetrance, this recessive pathogenic variant carried by PGP4 (listed in Table 2) was evaluated as well-established and high clinical importance.
