An evidence-based approach to establish the functional and clinical significance of CNVs in intellectual and developmental disabilities

Author manuscript; available in PMC: 2013 May 23.

Abstract

Purpose

Copy number variants (CNVs) have emerged as a major cause of human disease such as autism and intellectual disabilities. Because CNVs are common in normal individuals, determining the functional and clinical significance of rare CNVs in patients remains challenging. The adoption of whole-genome chromosomal microarray analysis (CMA) as a first-tier diagnostic test for individuals with unexplained developmental disabilities provides a unique opportunity to obtain large CNV datasets generated through routine patient care.

Methods

A consortium of diagnostic laboratories was established [the International Standards for Cytogenomic Arrays (ISCA) consortium] to share CNV and phenotypic data in a central, public database. We present the largest CNV case-control study to date comprising 15,749 ISCA cases and 10,118 published controls, focusing our initial analysis on recurrent deletions and duplications involving 14 CNV regions.

Results

Compared with controls, fourteen deletions and seven duplications were significantly overrepresented in cases, providing a clinical diagnosis as pathogenic.

Conclusion

Given the rapid expansion of clinical CMA testing, very large datasets will be available to determine the functional significance of increasingly rare CNVs. These data will provide an evidence-based guide to clinicians across many disciplines involved in the diagnosis, management, and care of these patients and their families.

Keywords: CNVs, evidence-based approach, clinical significance, ID/DD, consortium

INTRODUCTION

Copy number variation, defined as the gain or loss of genomic material >1 kb in size,1 has been the subject of intense research in both normal and disease populations over the last several years. These investigations were made possible by the completion of the Human Genome Project, which provided a detailed physical map and high-quality reference assembly of the human genome2 and enabled the development of whole-genome array technologies capable of accurate determination of copy number at very high resolution.

Copy number variants (CNVs) are common in normal individuals and have been identified in ~35% of the human genome.1 When present as hemizygous events in normal individuals, these imbalances are considered “benign” (i.e., no major phenotypic effect on human development); however, their role as susceptibility loci in common and complex genetic diseases and traits is now being actively explored. Data from control populations are being collected in databases of normal variation, including the Database of Genomic Variants (DGV)1 and the Database of Genomic Structural Variation (dbVar).3 These large datasets will contribute to a human gene dosage map through exclusion by defining those regions for which single copy loss or gain are tolerated and do not produce an overtly abnormal phenotype.

CNVs have also been identified as one of the most common causes of human disease. In fact, one of the earliest and most significant clinical benefits of the Human Genome Project has been the application of whole-genome CNV analysis to evaluate individuals with developmental disabilities, including developmental delay, intellectual disability, autism, epilepsy, and/or birth defects, a group of disorders representing up to 14% of the population.4 Commonly referred to as cytogenetic or chromosomal microarrays (CMA), these technologies have quickly replaced the standard G-banded karyotype as the first-tier genetic test for the evaluation of this patient population.5,6 There are many technology platforms available for whole-genome copy number analysis at resolutions of 100–500 kb (compared to ~5–10 Mb for karyotype), with even higher resolution at “clinical targets,” such as individual genes in which haploinsufficiency leads to dominant Mendelian disorders. From numerous published studies, the yield of clinically significant or pathogenic CNVs by CMA is 15–20%, compared with a yield of ~3–5% by standard cytogenetic analysis in the same patient population.5

In an important subset of CMA cases, the potential functional significance of a particular CNV may be unknown and is referred to as a variant of uncertain clinical significance (VOUS). Parental and family studies can be helpful in the clinical interpretation of these cases, as a de novo occurrence of the CNV strengthens the evidence that it is pathogenic. However, the significance of many CNVs still remains uncertain even after familial studies due to variable expressivity or incomplete penetrance. Therefore, it would be extremely beneficial to improve our knowledge of the functional significance of CNVs throughout the genome by performing comparative analyses of large datasets from case cohorts and control populations to definitively associate specific genomic regions with human disease.

Here we describe genome-wide CNV results from the first dataset from the International Standards for Cytogenomic Arrays (ISCA) consortium5 that includes analysis of 15,749 cases and 10,118 controls. This study was designed to assess the frequency of CNVs in this population and initiate an evidence-based process to determine the functional significance of structural variation across the genome. Compared to individually rare CNVs, recurrent CNVs lend themselves to large case-control studies due to their relatively higher frequency. Therefore, we have focused our initial analysis on 14 recurrent CNV regions to statistically assess the correlation between rare CNVs and developmental disorders. Furthermore, ongoing analysis of the ISCA CNV dataset compared to normal structural variation will delineate genomic regions and individual genes that are subject to dosage effects resulting in intellectual and other developmental disabilities. Such efforts will result in a human gene dosage map for developmental disorders.

MATERIALS AND METHODS

Cases

This study adhered to guidelines set by the Institutional Review Boards at the participating laboratories. CMA was performed in a subset of clinical ISCA laboratories on cases referred for diagnostic testing with various indications, including: unexplained developmental delay (DD), intellectual disability (ID), dysmorphic features, multiple congenital anomalies (MCA), autism spectrum disorders (ASD) or clinical features suggestive of a chromosomal syndrome. Anonymized data from 15,749 cases were included.

CNV detection

CMA was carried out following standard procedures. We used a consensus microarray design, focusing on unique genomic regions and avoiding repetitive sequences.7 The arrays were either 44K or 105K custom-designed 60-mer oligonucleotide arrays (Agilent Technologies, Santa Clara, CA) with a whole-genome backbone plus targeted, higher density coverage of known disease-causing regions.7 The backbone coverage included probes spaced every ~35–75 kb, allowing for CNVs of approximately 250 kb and greater to be detected. All clinically relevant CNVs ≥500 kb in the backbone are reported in this study. The 500kb threshold in the backbone regions was used since this size limit was consistently used as the reporting criteria by the ISCA laboratories. For the targeted regions, we could identify imbalances of ~20–50 kb.

Arrays were scanned using a GenePix Autoloader 4200AL, GenePix 4000B (Molecular Devices, Sunnyvale, CA) or Agilent scanner (Agilent Technologies, Santa Clara, CA). Results were analyzed using Feature Extraction and DNA Analytics software packages (Agilent Technologies, Santa Clara, CA). Data include only those imbalances that contained at least 4 consecutive probes with abnormal log2 ratios. Data are presented as minimum coordinates (sequence positions of the first and last probes within the CNV) in the NCBI36 genome assembly.
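The probe-count criterion above can be illustrated with a short sketch. It assumes a hypothetical log2-ratio threshold of ±0.25 (the text gives no cutoff) and uses the hypothetical function names `probe_state` and `call_imbalances`; only the requirement of at least 4 consecutive abnormal probes and the reporting of minimum coordinates (first and last probe within the CNV) come from the text.

```python
from itertools import groupby

def probe_state(ratio, threshold):
    """Label one probe as 'loss', 'gain' or 'normal' from its log2 ratio."""
    if ratio <= -threshold:
        return "loss"
    if ratio >= threshold:
        return "gain"
    return "normal"

def call_imbalances(positions, log2_ratios, threshold=0.25, min_probes=4):
    """Return (state, first_probe_pos, last_probe_pos, n_probes) for each run
    of at least `min_probes` consecutive abnormal probes.

    `threshold` is an illustrative assumption, not a value from the study.
    """
    states = [probe_state(r, threshold) for r in log2_ratios]
    calls = []
    index = 0
    for state, group in groupby(states):
        n = len(list(group))
        if state != "normal" and n >= min_probes:
            # Minimum coordinates: positions of the first and last probe in the run.
            calls.append((state, positions[index], positions[index + n - 1], n))
        index += n
    return calls
```

A run of three abnormal probes would be discarded under this rule, whereas a run of four or more is reported with the coordinates of its outermost probes.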

CNVs were categorized by clinical laboratories as pathogenic, VOUS or benign based on known clinically relevant regions, gene content and inheritance pattern as previously described.5,8 For both deletions and duplications, the genes located within the CNVs were assessed, as well as neighboring genes. Imbalances that involved large genomic segments from the chromosomal backbone coverage were considered to be likely pathogenic if they contained multiple known genes and did not overlap a confirmed benign CNV region. CNVs were classified as pathogenic if the CNV included an autosomal dominant gene known to cause a disease phenotype. The genomic regions associated with known pathogenic and benign CNVs are listed in Supplementary Tables 1–3 and were also deposited into dbVar (nstd45). Because the clinical laboratories that contributed data used different standards for reporting benign CNVs, an accurate assessment of the frequency of these benign CNVs was impossible for this dataset; therefore, benign CNVs identified in cases with otherwise normal array results were not included in this study.

Confirmation of abnormal array findings was carried out by fluorescence in situ hybridization (FISH), quantitative PCR (qPCR), standard G-banded chromosome analysis, multiplex ligation-dependent probe amplification (MLPA) or a second array analysis, depending on the size of the observed CNV. Since the great majority of pathogenic changes were confirmed by an independent method, the genotypic data quality is extremely high, providing a large dataset with high fidelity. Parental studies by FISH, qPCR, MLPA or array analysis were conducted to determine the inheritance in a subset of cases where parental samples were referred for follow-up testing. To the best of our knowledge, results from testing of parental and siblings’ samples were excluded from the final dataset if they showed the same genomic imbalance as the proband.

We developed an automated program to scan the data for inconsistencies in clinical interpretation for two or more reported genomic imbalances that overlapped in length by more than 50%, but that were classified differently (as pathogenic, VOUS, or benign). This program flagged the genomic regions in which there was inconsistent annotation of CNVs, and these CNVs were subsequently reviewed and, where appropriate, assigned a single classification. For cases with complex rearrangements involving several CNVs, the interpretation was based on each individual CNV. The reported CNVs from this study are included in Supplemental Table 4 and were submitted to dbVar (nstd37). The number of genes was assessed by counting partial and whole genes included in the region based on the UCSC known gene track.
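The consistency check described above might look roughly like the following sketch. It is not the consortium's program: the `CNV` record and function names are hypothetical, and two points are assumptions the text does not settle, namely that the 50% overlap is required reciprocally (relative to the length of each CNV) and that only CNVs of the same type (deletion or duplication) are compared.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class CNV:
    chrom: str
    start: int
    end: int
    kind: str   # "del" or "dup"
    call: str   # "pathogenic", "VOUS" or "benign"

def overlap_fraction(a, b):
    """Fraction of a's length that is covered by b (0.0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    return max(shared, 0) / (a.end - a.start)

def flag_inconsistent(cnvs, min_fraction=0.5):
    """Return pairs of same-type CNVs that overlap by more than `min_fraction`
    of each other's length but carry different clinical classifications."""
    flagged = []
    for a, b in combinations(cnvs, 2):
        if a.kind != b.kind or a.call == b.call:
            continue
        if overlap_fraction(a, b) > min_fraction and overlap_fraction(b, a) > min_fraction:
            flagged.append((a, b))
    return flagged
```

Flagged pairs would then go to manual review, as in the text, rather than being reclassified automatically.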

Statistical analysis

Our initial approach focuses on recurrent events since they are more common and lend themselves to case-control analysis; future studies will focus on non-recurrent CNVs as large enough case numbers become available. Recurrent rearrangements mediated by segmental duplications were identified by comparison to previously described hotspot regions.9 Imbalances were considered recurrent if they included the critical region of the deletion/duplication event and, based on probe coverage, were likely mediated by paired, flanking segmental duplications. We carried out statistical analysis of 14 selected regions, including (see Table 1 for chromosome coordinates): 1q21 Thrombocytopenia-absent radius (TAR) region (refs. 10,11), distal 1q21.1 (refs. 12,13), 3q29 (refs. 14,15), 5q35 (refs. 16,17), 7q11.23 (refs. 18,19), 8p23.1 (refs. 20,21), 15q11.2-q13 (refs. 22–24), 15q13 (refs. 25,26), 16p13.11 (refs. 27,28), 16p11.2 (refs. 29–31), 17p11.2 (refs. 32,33), 17q12 (refs. 34–36), 17q21.31 (refs. 37–39) and 22q11.2 (refs. 40,41). For the 1q21 regions, if the imbalance included both the 1q21 TAR region (ref. 10) and the distal 1q21.1 region (ref. 12), the imbalance was included in the distal 1q21.1 frequency. In the 15q11q13 region, imbalances that spanned BP2 through BP5 (ref. 42) were counted in the BP2-BP3 frequency and not the BP4-BP5 frequency. Both the smaller and larger rearrangements (~1.5 and ~3.0 Mb) for 16p13.11 (ref. 28) and 22q11 (ref. 43) were included in their respective CNV categories. For this study, we excluded recurrent CNVs involving 17p12 (HNPP/CMT1A), since these CNVs are either not associated with cognitive defects or are late-onset in nature (and therefore not expected to be enriched in our mostly pediatric patient population), and 15q11 (BP1-2), which were not consistently reported by the contributing laboratories. CNV data from 10,118 individuals from control populations were obtained from several recent reports.44–47 Processed CNV data were used directly from three of the previous control studies.44–46 For the data from the Shi et al. paper,47 we performed CNV analysis of the raw data for regions of interest using the Affymetrix Power Tools software (Affymetrix, Santa Clara, CA). Log2 ratio data were extracted and analyzed using the BEAST algorithm (Satten et al., submitted). All p-values and odds ratios for case-control analyses were calculated using Fisher’s exact test.
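As an illustration of the case-control statistic, here is a minimal pure-Python sketch of Fisher's exact test on a 2x2 table of carriers and non-carriers. It is not the authors' code. The function names are hypothetical, the two-sided p-value sums every table that is no more probable than the observed one (a common convention; the text does not state which sidedness was used), and the odds ratio is the simple cross-product ratio.

```python
from math import exp, inf, lgamma, nan

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table
         carriers  non-carriers
    cases    a          b
    controls c          d
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_total = log_choose(row1 + row2, col1)

    def log_prob(x):
        # Hypergeometric log-probability of x carriers among the cases.
        return log_choose(row1, x) + log_choose(row2, col1 - x) - log_total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    log_observed = log_prob(a)
    # Sum all tables at most as probable as the observed one (small tolerance for rounding).
    p = sum(exp(log_prob(x)) for x in range(lo, hi + 1)
            if log_prob(x) <= log_observed + 1e-7)
    return min(p, 1.0)

def odds_ratio(a, b, c, d):
    """Cross-product odds ratio; infinite when no controls carry the CNV."""
    if b * c == 0:
        return inf if a * d > 0 else nan
    return (a * d) / (b * c)

# 16p11.2 deletion counts from Table 3: 67 of 15,749 cases, 5 of 10,118 controls.
or_16p11 = odds_ratio(67, 15_749 - 67, 5, 10_118 - 5)   # about 8.64
p_16p11 = fisher_exact_two_sided(67, 15_749 - 67, 5, 10_118 - 5)
```

When no control carries the CNV, as for the 22q11.2 deletion, the cross-product odds ratio is infinite and only the p-value and the lower confidence bound are informative.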

Table 1.

Frequencies of recurrent deletions

| Deleted Region | Syndrome/Phenotype | Approximate Minimum Coordinates (NCBI36) | Number of Cases | Frequency in 15,749 Cases |
|---|---|---|---|---|
| 22q11.2 | 22q11.2 deletion syndrome (1.5 & 3 Mb; ref. 40) | chr22:17,400,436-18,676,130 | 93 | 1 in 169 |
| 16p11.2 | Autism (ref. 30) | chr16:29,557,497-30,107,356 | 67 | 1 in 235 |
| 1q21.1 | ID, microcephaly, cardiac and cataracts (refs. 12,13) | chr1:145,044,110-145,861,130 | 55 | 1 in 286 |
| 15q13.2-q13.3 BP4-BP5 | ID and epilepsy (ref. 25) | chr15:28,924,396-30,232,700 | 46 | 1 in 342 |
| 15q11.2-q13 BP2-BP3 | Prader-Willi/Angelman syndrome (BP1/2-3; ref. 22) | chr15:21,309,483-26,230,781 | 41 | 1 in 384 |
| 7q11.23 | Williams syndrome (ref. 18) | chr7:72,382,390-73,780,449 | 34 | 1 in 463 |
| 16p13.11 | Autism, ID and schizophrenia (1.5 & 3 Mb; refs. 27,28) | chr16:15,411,955-16,199,769 | 22 | 1 in 716 |
| 17q21.31 | 17q21 deletion syndrome (refs. 37,38) | chr17:41,060,948-41,650,183 | 22 | 1 in 716 |
| 17q12 | Renal cysts, diabetes, autism and schizophrenia (refs. 34–36) | chr17:31,930,169-33,323,031 | 18 | 1 in 875 |
| 1q21 | Thrombocytopenia-absent radius (TAR) syndrome (ref. 10) | chr1:144,097,430-144,463,097 | 17 | 1 in 926 |
| 17p11.2 | Smith-Magenis syndrome (ref. 32) | chr17:16,723,271-20,234,630 | 16 | 1 in 984 |
| 8p23.1 | 8p23.1 deletion syndrome (ref. 20) | chr8:8,156,705-11,803,128 | 10 | 1 in 1,575 |
| 3q29 | 3q29 deletion syndrome (refs. 14,15) | chr3:197,240,451-198,829,062 | 9 | 1 in 1,750 |
| 5q35 | Sotos syndrome (ref. 16) | chr5:175,661,584-176,946,567 | 8 | 1 in 1,969 |

RESULTS

CNV Characterization

We analyzed data from 15,749 whole-genome oligonucleotide arrays on individuals who presented for diagnostic array testing with abnormal clinical phenotypes, including DD/ID, ASD and/or MCA. We detected 4,628 imbalances consistent with our reporting criteria (defined in Methods) and classified 2,691 (17.1%) as pathogenic (pCNVs), in line with prior reports of the yield from CMA in diagnostic testing.5 Since a single individual may have had multiple pCNVs (e.g., unbalanced translocations), the diagnostic yield for this dataset was 14.7% (2,321 cases with pCNV/15,749 total cases). Excluding 106 whole-chromosome aneuploidies, there were 2,585 pCNVs with a mean size of ~6.5 Mb (median of ~2.8 Mb) and a mean of ~69 genes per CNV (median of 44 genes). Deletions were more commonly interpreted as pathogenic than duplications, accounting for 67.9% of the imbalances.
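The two percentages in this paragraph use the same denominator but different numerators, which a quick calculation makes explicit. The constants below are taken from the text; the variable names are illustrative only.

```python
TOTAL_CASES = 15_749
PATHOGENIC_CNVS = 2_691      # all CNVs classified as pathogenic
CASES_WITH_PCNV = 2_321      # individuals carrying at least one pCNV
ANEUPLOIDIES = 106           # whole-chromosome aneuploidies among the pCNVs

def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

pcnvs_per_100_cases = percent(PATHOGENIC_CNVS, TOTAL_CASES)   # pCNV count relative to cases
diagnostic_yield = percent(CASES_WITH_PCNV, TOTAL_CASES)      # cases with at least one pCNV
pcnvs_excluding_aneuploidy = PATHOGENIC_CNVS - ANEUPLOIDIES
```

The gap between the two percentages reflects individuals who carry more than one pathogenic imbalance.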

In 9.3% of cases, an observed genomic imbalance was classified as a VOUS, since there was insufficient evidence to conclude the CNV was either pathogenic or benign. There were ultimately 1,468 CNVs classified as VOUS, with a mean size of 765 kb (median of 569 kb) and a mean of ~10 genes per CNV (median of 5 genes). Duplications were more common than deletions, accounting for 68.8% of the imbalances.

The inheritance of a CNV was determined in a subset of cases to aid in the clinical interpretation and where both parental specimens were available. Of the 1,412 CNVs with known inheritance, 566 (~40%) were found to be de novo. The majority of the de novo events (513 CNVs, ~91%) were classified as pathogenic, whereas 51 CNVs (~9%) were classified as uncertain. Two de novo CNVs, interpreted to be benign, were incidentally identified in the course of parental studies to determine the inheritance of other CNVs classified as VOUS. The de novo benign CNVs included a duplication of the beta-defensin cluster on chromosome 8p and a duplication of the CHRNA7 (MIM 118511) gene on chromosome 15q; both of these CNVs have been observed as common polymorphisms in control populations.

Frequency of recurrent events

A subset of the imbalances identified by CMA includes recurrent imbalances that result from rearrangements between low-copy repeats, also known as segmental duplications. These rearrangements cause genomic disorders that have been recently reviewed.48 Sharp et al. described 130 rearrangement hotspots in the human genome by defining these regions as large genomic segments (50 kb–10 Mb) that are flanked by segmental duplications ≥10 kb in size and ≥95% identical.9 Of all CNVs detected in this case cohort, ~24% result from rearrangements between segmental duplications.

Tables 1 and 2 show the frequencies in the ISCA dataset for 14 CNV regions associated with recurrent deletions and duplications, respectively. It is important to note that many of the recognizable recurrent syndromes may still be tested for by targeted FISH studies, rather than CMA. Since cases ascertained from FISH testing were not included in this study, the frequencies of such syndromes are likely underestimated.
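The "1 in N" frequencies in Tables 1 and 2 follow directly from the case counts. The helper below is a hypothetical illustration that assumes conventional half-up rounding of 15,749 divided by the number of carriers.

```python
def one_in_n(carriers, total=15_749):
    """Express a carrier count as '1 in N' among `total` cases, rounding half up.

    Returns None when no carriers were observed (frequency unknown).
    """
    if carriers == 0:
        return None
    return int(total / carriers + 0.5)
```

For example, `one_in_n(93)` gives 169 for the 22q11.2 deletion, and a region with no observed carriers yields `None`, corresponding to the "Unknown" entry for the 17q21.31 duplication.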

Table 2.

Frequencies of recurrent duplications

| Duplicated Region | Syndrome/Phenotype | Number of Cases | Frequency in 15,749 Cases |
|---|---|---|---|
| 16p13.11 | Variable phenotype (1.5 & 3 Mb; refs. 27,28) | 45 | 1 in 350 |
| 16p11.2 | Autism (ref. 30) | 39 | 1 in 404 |
| 15q11.2-q13 BP2-BP3 | Autism (BP1/2-3; refs. 23,24) | 35 | 1 in 450 |
| 22q11.2 | Variable phenotype (1.5 & 3 Mb; ref. 41) | 32 | 1 in 492 |
| 1q21.1 | ID and autism (refs. 12,13) | 28 | 1 in 562 |
| 17q12 | Epilepsy (ref. 34) | 21 | 1 in 750 |
| 7q11.23 | Autism (ref. 19) | 16 | 1 in 984 |
| 17p11.2 | Potocki-Lupski syndrome (ref. 33) | 15 | 1 in 1,050 |
| 15q13.2-q13.3 BP4-BP5 | Psychiatric disease (ref. 26) | 14 | 1 in 1,125 |
| 1q21 | Reciprocal duplication of TAR region (ref. 11) | 9 | 1 in 1,750 |
| 3q29 | Variable phenotype (ref. 15) | 8 | 1 in 1,969 |
| 8p23.1 | Variable phenotype (ref. 21) | 6 | 1 in 2,625 |
| 5q35 | Short stature, microcephaly and speech delay (ref. 17) | 2 | 1 in 7,875 |
| 17q21.31 | Behavioral problems (ref. 39) | 0 | Unknown |

For the 14 recurrent regions, the numbers of deletions and duplications were often unequal, which can be explained by ascertainment (recurrent duplications may result in milder phenotypes and therefore not be ascertained in our cohort of affected individuals) and mechanism (deletions generated by NAHR occur more frequently than duplications).49 Not surprisingly, the most common deletion in this cohort, with 93 cases (1 in 169 abnormal cases), was the 22q11.2 deletion (MIM 188400),40 while the reciprocal duplication (MIM 608363), which has a milder phenotype,41 was detected in only 32 cases. The most common recurrent duplication in our dataset was in 16p13.11, seen in 45 cases, while the reciprocal deletion associated with neurodevelopmental defects was detected in only 22 cases. For both deletions and duplications, the second most commonly affected region was the recurrent 16p11.2 CNV (MIM 611913). Both deletions and duplications of this region have been reported in individuals with an abnormal neurological phenotype.30 The frequency of the 16p11.2 deletion in this abnormal cohort is approximately 1 in 235. Therefore, this CNV was detected nearly as often as the 22q11.2 deletion, indicating that this CNV is also a frequent cause of intellectual and developmental disabilities.

Frequency of non-recurrent events

Of all CNVs detected in this case cohort, most (~76%) were individually rare and not mediated by segmental duplications. This large group of CNVs provides a resource to examine regions of the genome that contain multiple CNVs with overlapping segments of deleted or duplicated material to define genotype-phenotype correlations. As an example, we highlight three recently described regions (2p15 deletion,50 16q24.3 deletion51 and 17p13 duplication52) where overlapping de novo CNVs were characterized to define the associated phenotype and identify candidate genes. In the ISCA case cohort, we found four de novo deletions in 2p15 with a smallest region of overlap (SRO) of ~2.4 Mb, five de novo deletions in 16q24 with a SRO of ~450 kb, and four de novo duplications in 17p13 with a SRO of ~312 kb. As the ISCA database grows, cases such as these will prove invaluable for identifying disease-causing genes.
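A smallest region of overlap is simply the intersection of the overlapping CNV intervals. The sketch below shows the idea with a hypothetical helper; intervals are (start, end) pairs on a single chromosome, and any coordinates passed to it in an example would be illustrative rather than taken from the ISCA cases.

```python
def smallest_region_of_overlap(intervals):
    """Return the (start, end) segment shared by every interval, or None
    if the list is empty or the intervals have no common segment."""
    if not intervals:
        return None
    start = max(s for s, _ in intervals)   # rightmost start
    end = min(e for _, e in intervals)     # leftmost end
    return (start, end) if start < end else None
```

Genes lying inside the returned segment are the natural first candidates for the shared phenotype, which is how the overlapping de novo CNVs above were used.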

Case-control analysis to define functional significance

The CNVs identified in this study of individuals with neurodevelopmental disorders are rare and highly heterogeneous, with no single CNV being identified in more than 1% of the cases. Therefore, methods are needed to begin to statistically assess the relationship between such rare variation and human disease. For this study, we first focused on deletions and duplications of 14 recurrent genomic regions since their relative frequency is higher than CNVs involving non-recurrent regions. We selected 14 of the most common and clinically relevant recurrent CNVs (listed in Methods) for a formal case-control study to initiate an evidence-based process for defining the clinical significance of structural variation across the genome. Many of these 14 regions have inconclusive or contradictory data in the literature regarding their phenotypic implications, so a targeted analysis of these regions is needed to inform their functional significance.

Tables 3 and 4 show the results of these analyses for recurrent deletions and duplications, respectively. We compared the ISCA case cohort of 15,749 cases to 10,118 combined controls from several recent publications.44–47 These reports used microarrays with levels of resolution equivalent to or higher than the ISCA array design; thus, there should be no significant difference in sensitivity in the calls between the case and control datasets given that the 14 regions analyzed in this study were ~600 kb or greater. Although not all the controls used in these studies were formally assessed for neurocognitive abnormalities, these datasets have been used before as control populations in other studies.

Table 3.

Case-control analysis of recurrent deletions

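| Deleted Region | Initial Call | Final Call | Cases | Controls | OR | Lower 95% CI | Upper 95% CI | p-value | Itsara et al. Study |
|---|---|---|---|---|---|---|---|---|---|
| 22q11.2 | pCNV | pCNV | 93 | 0 | ∞ | 15.96 | ∞ | 9.15E-21 | 7.93E-09 |
| 16p11.2 | pCNV | pCNV | 67 | 5 | 8.64 | 3.52 | 27.49 | 6.34E-10 | 0.186 |
| 1q21.1 | pCNV | pCNV | 55 | 3 | 11.82 | 3.84 | 59.07 | 5.38E-09 | 1.67E-04 |
| 15q13.2-q13.3 BP4-BP5 | pCNV | pCNV | 46 | 0 | ∞ | 7.71 | ∞ | 1.44E-10 | 1.08E-05 |
| 15q11.2-q13 BP2-BP3 | pCNV | pCNV | 41 | 0 | ∞ | 6.84 | ∞ | 2.77E-09 | |
| 7q11.23 | pCNV | pCNV | 34 | 0 | ∞ | 5.62 | ∞ | 8.49E-08 | |
| 16p13.11 | pCNV | pCNV | 22 | 3 | 4.72 | 1.42 | 24.62 | 0.0063 | |
| 17q21.31 | pCNV | pCNV | 22 | 0 | ∞ | 3.52 | ∞ | 2.49E-05 | |
| 17q12 | pCNV | pCNV | 18 | 0 | ∞ | 2.83 | ∞ | 0.00015 | |
| 1q21 | pCNV | pCNV | 17 | 1 | 10.93 | 1.71 | 456.06 | 0.0026 | * |
| 17p11.2 | pCNV | pCNV | 16 | 0 | ∞ | 2.48 | ∞ | 0.00045 | |
| 8p23.1 | pCNV | pCNV | 10 | 0 | ∞ | 1.44 | ∞ | 0.0084 | |
| 3q29 | pCNV | pCNV | 9 | 0 | ∞ | 1.27 | ∞ | 0.0147 | 0.164 |
| 5q35 | pCNV | pCNV | 8 | 0 | ∞ | 1.10 | ∞ | 0.026 | |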

Table 4.

Case-control analysis of recurrent duplications

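| Duplicated Region | Initial Call | Final Call | Cases | Controls | OR | Lower 95% CI | Upper 95% CI | p-value | Itsara et al. Study |
|---|---|---|---|---|---|---|---|---|---|
| 16p13.11 | VOUS | VOUS | 45 | 20 | 1.45 | 0.84 | 2.59 | 0.203 | |
| 16p11.2 | VOUS | pCNV | 39 | 4 | 6.28 | 2.26 | 24.19 | 2.50E-05 | 0.100 |
| 15q11.2-q13 BP2-BP3 | pCNV | pCNV | 35 | 0 | ∞ | 5.79 | ∞ | 4.57E-08 | 2.69E-04 |
| 22q11.2 | pCNV | pCNV | 32 | 5 | 4.12 | 1.59 | 13.54 | 0.0011 | 0.330 |
| 1q21.1 | pCNV | pCNV | 28 | 3 | 6.00 | 1.85 | 30.88 | 0.0004 | 0.041 |
| 17q12 | pCNV | pCNV | 21 | 4 | 3.38 | 1.14 | 13.53 | 0.022 | |
| 7q11.23 | pCNV | pCNV | 16 | 1 | 10.29 | 1.60 | 430.72 | 0.0046 | |
| 17p11.2 | pCNV | pCNV | 15 | 0 | ∞ | 2.31 | ∞ | 0.0008 | |
| 15q13.2-q13.3 BP4-BP5 | VOUS | VOUS | 14 | 3 | 3.00 | 0.84 | 16.28 | 0.083 | ** |
| 1q21 | VOUS | VOUS | 9 | 12 | 0.48 | 0.179 | 1.25 | 0.116 | * |
| 3q29 | pCNV | VOUS | 8 | 1 | 5.14 | 0.69 | 227.96 | 0.100 | 1 |
| 8p23.1 | pCNV | VOUS | 6 | 0 | ∞ | 0.76 | ∞ | 0.088 | |
| 5q35 | pCNV | VOUS | 2 | 0 | ∞ | 0.12 | ∞ | 0.52 | |
| 17q21.31 | N/A | N/A | 0 | 0 | nd | nd | nd | nd | |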

All fourteen recurrent deletions were significantly overrepresented in cases compared with controls (Table 3), demonstrating each is a pathogenic CNV. The 22q11.2 deletion was not seen in controls, confirming the pathogenic nature of this known disease-causing CNV (p=9.15E-21). The 16p11.2 deletion was observed in 67 cases in the ISCA cohort, but only five 16p11.2 deletions were found among the control population, providing strong evidence for the pathogenic nature of this CNV (OR=8.64; p=6.34E-10).

Other recurrent deletions detected with a high frequency in the abnormal cohort include those in 1q21.1 (MIM 612474; OR=11.82; p=5.38E-09), 15q13 (MIM 612001; OR=∞; p=1.44E-10) and 15q11-q13 [breakpoint (BP) 1/2-3 of the Prader-Willi (MIM 176270)/Angelman (MIM 105830) syndromes region; OR=∞; p=2.77E-09]. We also identified 18 deletions involving the 17q12 region (MIM 137920); these deletions were initially reported to have no neurocognitive phenotype.34 More recent studies, however, have shown an association between 17q12 deletions and developmental delays35 and autism/schizophrenia.36 The absence of the 17q12 deletion in 10,118 controls is strong evidence for classifying this deletion as pathogenic (p=0.00015).

We also analyzed the reciprocal duplications of the 14 recurrent deletion CNVs (Table 4). Determining the functional significance for duplications can be more challenging due to the more subtle and milder phenotypes associated with an increase in gene dosage compared to the more severe phenotypic effects of haploinsufficiency. The initial classifications for these CNVs ranged from VOUS to pathogenic events.

For six duplications initially classified as pathogenic (in 1q21.1 [MIM 612475], 7q11.23 [MIM 609757], 15q11.2-q13 [MIM 608636], 17p11.2 [MIM 610883], 17q12 and 22q11.2), the case-control analysis corroborated this classification (Table 4). The 16p11.2 duplication was initially classified as a VOUS; however, our case-control analysis demonstrates that this duplication is most likely pathogenic (OR=6.28; p=2.5E-05).

Several recurrent CNV regions have had equivocal reports in the literature. For example, duplications of 16p13.11 have been previously suggested to be linked with autism,27 while another study proposed that the duplications may be a benign CNV.28 Because of the uncertainty in the literature, duplications in three regions (16p13.11, 15q13 BP4-5 and proximal 1q21) were initially classified as VOUS. Since these duplications were not significantly enriched in the ISCA case cohort or in controls, the classification of these CNVs remains uncertain at this time using the formal case-control assessment.

Duplications of 3q29 (ref. 15), 8p23.1 (ref. 21) and 5q35 (ref. 17) have been previously reported in individuals with abnormal phenotypes. In this case-control analysis, these events were identified more often in cases than in controls. However, due to the low frequency of these duplications in the clinically affected population, the differences were not statistically significant. Therefore, as a conservative approach, we would classify these three CNVs as uncertain until larger sample sizes are available. More detailed phenotypic investigations of individuals carrying duplications of 3q29, 8p23.1 and 5q35 in the ISCA cohort and other patient cohorts will help to clarify whether the observed phenotypes are consistent with the previously reported syndromes associated with these duplications.

DISCUSSION

There are now many published reports of the significant role of rare, de novo CNVs with major phenotypic effects in various human disease populations, including intellectual disabilities, autism spectrum disorders, epilepsy, and schizophrenia, among others. Many of these studies are based on well-phenotyped research cohorts that were originally collected and characterized to optimize the ability to detect small effects in genome-wide association studies. Although positive associations have been identified for a few common diseases through these efforts, a surprising and remarkable finding has been the identification of rare, de novo CNVs with major phenotypic effects, particularly in neurocognitive and behavioral disorders. Because these events are rare, obtaining adequate evidence for their functional role in disease causation requires very large sample sizes as well as large control populations.

An alternative model for assessing the contribution of CNVs to disease, which has been utilized particularly in the study of children with unexplained developmental disabilities and congenital anomalies, has been the reporting of case series from clinical laboratory testing. Most of these published studies have represented CNV data from single laboratories and were based on previous generation targeted array analysis using bacterial artificial chromosome (BAC) genomic clones.5 Compared to analysis of research cohorts of well-phenotyped patients, the amount and quality of phenotypic data associated with clinical laboratory referrals is often quite limited.

For this study, we have combined these two approaches by exploiting a large CNV dataset derived from a consortium of clinical laboratories to explore the frequency and functional significance of rare CNVs. Our analysis of the first 15,749 ISCA cases, one of the largest CNV studies to date, has confirmed the power of this approach. We have defined the frequency (17.1%) of pathogenic CNVs in a cohort of individuals with intellectual and developmental disabilities and performed formal case-control studies of selected recurrent genomic regions whose frequency was sufficient for statistical analysis.

The determination of whether a CNV contributes to an abnormal phenotype depends on many factors, including gene content, previous evidence of pathogenic CNVs in the region, type of CNV (deletion or duplication), inheritance pattern, and frequency in unaffected populations. As such, larger CNVs may be more likely to be classified as pathogenic since they have a higher chance of including a dosage-sensitive gene and/or they include a larger number of genes that cumulatively result in an abnormal phenotype. Our experience, as well as that of other groups,53 has shown that the classification of a previously unreported CNV not associated with known disease genes can vary. To address such discrepancies, we used case-control statistical evidence for 14 selected recurrent CNV regions to objectively determine their significance.

We analyzed deletions and duplications of each region separately, resulting in 28 total recurrent CNV regions. Using this approach, we demonstrated and confirmed the pathogenic nature of 20 recurrent regions. For the 16p11.2 duplications that had previously been reported as uncertain in the literature, we were able to re-classify this CNV region as pathogenic. Overall, we conclude that 21 out of the 28 recurrent CNVs examined should be considered pathogenic and provide a clinical diagnosis for any individual harboring a CNV of these regions.

The statistical approach we used to classify recurrent CNVs and the results we obtained are useful tools for researchers and the clinical community in interpreting whether a CNV has pathologic effects. However, while such statistical analysis is possible for recurrent CNVs, where the frequency is high, this strategy is more difficult for the remaining ~76% of CNVs, which are not mediated by segmental duplications and are individually very rare. Therefore, other approaches need to be explored to address this class of CNVs. One possibility for these highly heterogeneous CNVs is to analyze all genomic intervals of a defined size (e.g., 500 kb or 1 Mb) or to use a “sliding-window” analysis to examine overlapping genomic intervals along the length of each chromosome. By comparing structural variation observed in cases to controls, disease-causing regions can be differentiated from those associated with normal variation by using the control data to define regions of the genome where dosage changes can be tolerated without overt phenotypic effects. Since non-recurrent CNVs are very rare events, the collection of data from hundreds of thousands of cases will be needed for this type of analysis to be successful. Continued efforts of the ISCA consortium, as well as other databases such as DECIPHER, will be essential to this process in order to obtain enough overlapping CNVs to provide the power needed for statistical analyses.
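A sliding-window comparison of the kind proposed above could be sketched as follows. This is only an illustration of the counting step under stated assumptions: the function name is hypothetical, CNVs are half-open (start, end) intervals on one chromosome, and the 1 Mb window with a 500 kb step is an arbitrary choice echoing the interval sizes mentioned in the text. The per-window counts would then feed a case-control test.

```python
def window_counts(case_cnvs, control_cnvs, chrom_length,
                  window=1_000_000, step=500_000):
    """Count case and control CNVs overlapping each window along a chromosome.

    Returns a list of (window_start, window_end, n_case_cnvs, n_control_cnvs).
    """
    def overlapping(cnvs, w_start, w_end):
        # Half-open intervals overlap when each starts before the other ends.
        return sum(1 for start, end in cnvs if start < w_end and end > w_start)

    results = []
    for w_start in range(0, chrom_length, step):
        w_end = min(w_start + window, chrom_length)
        results.append((w_start, w_end,
                        overlapping(case_cnvs, w_start, w_end),
                        overlapping(control_cnvs, w_start, w_end)))
    return results
```

Windows with many case CNVs and none in controls would be candidates for dosage-sensitive regions, whereas windows with frequent control CNVs would mark intervals where dosage changes appear to be tolerated.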

The ISCA consortium is continuing to grow and now includes over 150 clinical laboratories from across the world. Given the rapid increase in utilization of this testing on a routine clinical basis, and the ability to recruit an expanding number of collaborating labs contributing data to a central database, the size of this cohort will continue to grow rapidly, providing a highly cost-effective way to obtain very large CNV datasets. In addition, since these data will be publicly available through two NCBI resources, dbGaP (database of Genotypes and Phenotypes) and dbVar, this resource can be readily accessed by researchers as well as the clinical community. Having large datasets from individuals with abnormal phenotypes will foster more objective formal scientific analyses to predict which CNVs will impact human development. Such efforts will make it possible to develop a whole-genome dosage map in humans to determine which genes and regions are subject to haploinsufficiency or triplosensitivity compared to those that are tolerant of dosage changes.

Supplementary Material

Supplementary Data

Acknowledgments

We would like to thank all members of the clinical laboratories for performing the microarray experiments, including Daniel Saul, Stephanie Warren and Nancy Flores for technical assistance, and Angela DeLorenzo, Ken Chatterten and Kristi DeHaai for data entry. We would like to thank John C. Barber for helpful discussions, Eli Williams for critical reading of the manuscript, and Cheryl T. Strauss for editorial assistance. This work was funded in part by NIH grants HD064525 (D.H.L. and C.L.M.), MH074090 (D.H.L. and C.L.M.), MH080129 (S.T.W.), and MH083722 (S.T.W.), and the Intramural Research Program of the NIH, National Library of Medicine.

Footnotes

Supplemental Data

Supplemental Data include four tables.

Accession Numbers

CNVs for this dataset have been deposited in dbVar under study IDs nstd37 and nstd45.

References
