Impact of whole genome amplification on analysis of copy number variants

Journal Article

1 Genome Sciences Centre and 2 Department of Pathology, BC Cancer Agency, Vancouver, BC, Canada

*To whom correspondence should be addressed. Tel: +1 604 877 6082; Fax: +1 604 876 3561; Email: mmarra@bcgsc.ca

Received: 28 November 2007

Revision received: 27 May 2008

T. J. Pugh, A. D. Delaney, N. Farnoud, S. Flibotte, M. Griffith, H. I. Li, H. Qian, P. Farinha, R. D. Gascoyne, M. A. Marra, Impact of whole genome amplification on analysis of copy number variants, Nucleic Acids Research, Volume 36, Issue 13, 1 August 2008, Page e80, https://doi.org/10.1093/nar/gkn378

Abstract

Large-scale copy number variants (CNVs) have recently been recognized to play a role in human genome variation and disease. Approaches for analysis of CNVs in small samples such as microdissected tissues can be confounded by limited amounts of material. To facilitate analyses of such samples, whole genome amplification (WGA) techniques were developed. In this study, we explored the impact of Phi29 multiple-strand displacement amplification on detection of CNVs using oligonucleotide arrays. We extracted DNA from fresh frozen lymph node samples and used this for amplification and analysis on the Affymetrix Mapping 500k SNP array platform. We demonstrated that the WGA procedure introduces hundreds of potentially confounding CNV artifacts that can obscure detection of bona fide variants. Our analysis indicates that many artifacts are reproducible, and may correlate with proximity to chromosome ends and GC content. Pair-wise comparison of amplified products considerably reduced the number of apparent artifacts and partially restored the ability to detect real CNVs. Our results suggest WGA material may be appropriate for copy number analysis when amplified samples are compared to similarly amplified samples and that only the CNVs with the greatest significance values detected by such comparisons are likely to be representative of the unamplified samples.

INTRODUCTION

Initial analysis of the human genome identified single nucleotide polymorphisms (SNPs) as the primary source of genotypic and phenotypic variation among humans. However, subsequent studies identified large-scale copy number variants (CNVs) that apparently impacted millions of nucleotides ( 1–6 ). These large-scale variants included polymorphic deletions and duplications that are present in >1% of the population and therefore meet the traditional definition of polymorphism ( 2 ). As of November 2007, 4878 CNV loci impacting 808 Mbp of DNA sequence have been identified and these are listed in the Database of Genomic Variants ( http://projects.tcag.ca/variation/ ). CNVs are also features of several human diseases including Alzheimer disease ( 7 ), Cri du chat syndrome ( 8 ), mental retardation ( 9 ) and cancer ( 10 , 11 ). As robust array-based methods for copy number detection continue to mature, increasing numbers of these variants are being identified ( 2 ).

Current whole-genome methods to detect CNVs require relatively large input quantities of DNA that are difficult or impossible to obtain from rare cell populations such as biopsies and microdissected tissues. To address this challenge, whole genome amplification (WGA) techniques were developed that increase the amount of DNA for analysis. For example, multiple-strand displacement amplification (MDA) using Phi29 DNA polymerase was used to generate microgram quantities of high molecular weight DNA (>30 kb) from nanograms of high quality input material ( 12 , 13 ). A recent report described a protocol for amplification of picogram quantities of DNA from single cells ( 14 ), further expanding the applications for this technique.

The replication fidelity of WGA techniques has been investigated ( 15–20 ). Estimates of base-pair incorporation errors resulting from Phi29-mediated amplification have ranged from 2.2 × 10^−5 ( 21 ) to 9.5 × 10^−6 ( 16 ), and the concordance of genotypes between unamplified and amplified samples was reported to be >99.8% ( 16 , 19 ). Recurrent WGA-induced copy number biases were observed in previous studies ( 15–20 ), and were associated with sequence repeats and proximity to chromosome ends ( 17–20 ), increased GC content ( 17 , 20 ), and annotated CNVs ( 17 ). Many of these associations were explored descriptively without statistical analysis and there was no consensus on the 92 recurrent regions of bias explicitly defined by three of these studies ( 16 , 17 , 20 ). A recent study of 532 samples subjected to WGA and subsequent analysis using the Affymetrix 10k Mapping array identified a median of 438 WGA-induced copy number artifacts in comparisons between amplified samples and an unamplified reference set ( 15 ). While there is a consensus that at least partial compensation of systematic biases can be achieved through the use of an amplified reference ( 16–20 ), it is unknown to what degree such comparisons can capture real CNVs detected using more sensitive, higher resolution platforms.

Recently, bias induced by a number of whole genome amplification protocols was examined using a high-throughput, massively parallel whole genome pyrosequencing technique ( 22 ). In this comparison, which involved sequencing two bacterial genomes, Phi29 MDA-based approaches generated the most complete genome coverage (50–99%), and introduced the least bias compared to other PCR-based techniques. DNA sequences generated from Phi29-amplified material were 2.9–3.8% lower in GC-content than those from the unamplified material, suggesting a relationship between amplification bias and GC-content. However, over-amplification of certain sequences could not be explained by any of the previously mentioned sources of bias suggesting a need to directly investigate the nature of regions prone to over- or under-amplification. Although the study was of high resolution, direct comparison of the results from this study with those using human samples is difficult due to differences in chromosome organization, size and composition.

In this study, we investigated amplification bias resulting from whole genome amplification on DNA from fresh-frozen human tissues using the Affymetrix 500k Mapping Array Set. We quantified the effects of WGA on microarray signal and background noise, localized and statistically analysed genomic regions of WGA-induced bias, and directly compared the ability to resolve CNVs in comparisons of unamplified and amplified material.

MATERIALS AND METHODS

Tissue material and DNA extraction

Normal lymph nodes from three individuals were fresh frozen in Optimal Cutting Temperature (OCT; Sakura Finetek, Torrance, CA) compound and stored at −80°C by the service pathology laboratory at the BC Cancer Agency. Genomic DNA was extracted from these sources using the Gentra PureGene DNA purification kit (Gentra Systems, Minneapolis, MN). Prior to labelling and microarray hybridization, the genomic DNA was quantified using a NanoDrop spectrophotometer (NanoDrop Technologies, Wilmington, DE). Prior to whole genome amplification, the genomic DNA was diluted to ∼1.5 ng/μl and quantified using a PicoGreen assay (Invitrogen, Carlsbad, CA). To ensure consistent DNA quality across all samples, the DNA was visualized on an agarose gel to confirm the presence of undegraded, predominantly high molecular weight (>10 kb) DNA.

Whole genome amplification

We used Qiagen's Repli-G Mini whole genome amplification kit and protocol (QIAgen, Valencia, CA) to amplify 7 ng of PicoGreen-quantified DNA from fresh frozen samples to generate >10 µg of high molecular weight DNA. We performed the isothermal amplification reaction in 1.5 ml microcentrifuge tubes incubated in a 30°C water bath for 18 h and inactivated the enzyme by incubating the tubes in a 65°C water bath for 3 min. The amplified products were purified and quantified as described in the previous section and the amplification products were visualized on a 0.8% agarose gel stained with SYBR Green (Invitrogen, Carlsbad, CA).

Labelling and hybridization to the Affymetrix 500k array

500 ng samples of DNA were processed following the instructions in the GeneChip Mapping 500K manual (Affymetrix, Santa Clara, CA). Briefly, 250 ng of DNA was digested using one of two restriction enzymes, Nsp I or Sty I, and ligated to Nsp I or Sty I adaptors. These adaptor-ligated fragments were amplified by PCR and the purified products quantified using a Bio-Tek PowerWave X spectrophotometer and the concentration normalized to 2 µg/µl. The normalized products were then fragmented and labelled as described in the manual. Samples were hybridized to the GeneChip Human Mapping 250K Nsp or Sty array in an Affymetrix Hybridization Oven 640. Washing and staining of the arrays were performed using an Affymetrix Fluidics Station 450. Images of the arrays were obtained using an Affymetrix GeneChip Scanner 3000.

Sample preparation for NimbleGen 385k CGH array

Samples of >2.5 µg of DNA were prepared following the instructions provided by NimbleGen Systems Inc. (NimbleGen Systems Inc, Madison, Wisconsin). Briefly, purified samples were concentrated to 250 ng/µl and analysed for quality on an agarose gel. Samples were then shipped on ice to NimbleGen for subsequent labelling and hybridization to the 385k Human Whole-Genome CGH array.

Genotype and copy number analysis

Genotype calls were derived from microarray images using the GTYPE v4.0 software program (Affymetrix, Santa Clara, CA). We detected CNVs in individual samples using comparisons to a common reference data set and comparisons between pre- and post-amplification sample pairs ( Figure 1 ). These were performed using a software pipeline ( Figure 1 ) that utilizes the Affymetrix Chromosome Copy Number Analysis Tool (CNAT) version 4.0 (Affymetrix, Santa Clara, CA) and an exhaustive t-score optimization algorithm.

Figure 1.

Experimental design. ( A ) In this study, we aimed to assess the impact of WGA on the detection of CNVs, to explore copy number biases induced by this technique, and to assess the use of pair-wise analysis to address such biases. To this end, DNA samples from three fresh frozen tissues were subject to WGA and analyzed pre- and post-amplification on the Affymetrix Mapping 500k SNP array set. For each copy number analysis, different sets of microarray data were compared as shown in panels B–D. Log2 intensity ratios were calculated from the selected data comparisons using a software pipeline based on CNAT v4.0. These ratios were then screened by an ‘exhaustive search’ algorithm, in which t-scores were calculated in 3–30 probe windows and statistically significant aberrations identified above array-specific thresholds defined through permutation. To detect CNVs impacting more than 30 probes, aberrations found to contain more than 27 probes were subject to a t-score optimization using larger and larger window sizes until a local maximum t-score was found. The resulting high confidence lists of CNVs were then compared as appropriate for each analysis. ( B ) In this set of comparisons against a common reference set, we investigated the effect of WGA on array noise (i.e., the distribution of log2 ratios) and the ability to resolve CNVs. To this end, each unamplified and amplified sample was independently compared against the Affy48 reference set, log2 ratios calculated and detected CNVs were compared. ( C ) To assess the nature of bias induced by WGA, this data set directly compared matched pre- and post-WGA samples. Since matched samples were used, all CNVs detected in this analysis are due to the amplification technique. ( D ) This set of comparisons examined the ability of pair-wise analysis of amplified samples to reciprocate CNVs detected in unamplified samples. Three pair-wise comparisons were conducted using both unamplified and amplified material and the observed CNVs were compared.

To analyse sample pairs on the Affymetrix platform, we used CNAT to perform quantile normalization of probe intensities from the samples and calculated log2 intensity ratios for each probe set on the array. For unpaired analysis of individual samples against a common reference set, we used a set of average probe intensities from the reference set in place of the second sample. The reference set used for this purpose, referred to hereafter as the ‘Affy48 reference set’, was downloaded from the Affymetrix website ( http://www.affymetrix.com/support/technical/sample_data/500k_data.affx ) and consisted of 48 samples representing five HapMap CEPH trios, five HapMap Yoruban trios, three other non-HapMap trios, and nine unrelated HapMap Asian samples. To analyse sample pairs on the NimbleGen platform, we used qspline normalized data and log2 intensity ratios provided by NimbleGen for each probe on the array.
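The normalization and ratio steps described above can be illustrated with a minimal Python sketch. This is not the CNAT implementation: the function names (quantile_normalize, log2_ratios) and the toy intensities are hypothetical, and ties are broken naively by probe order.

```python
import math

def quantile_normalize(samples):
    """Give every sample the same intensity distribution.

    samples: list of equal-length lists of probe intensities.
    Each value is replaced by the mean of the values that hold the same
    rank across all samples (ties broken by probe order in this sketch).
    """
    n_probes = len(samples[0])
    # Probe indices of each sample, sorted by intensity.
    orders = [sorted(range(n_probes), key=lambda i: s[i]) for s in samples]
    # Mean intensity at each rank across samples.
    rank_means = [
        sum(s[order[r]] for s, order in zip(samples, orders)) / len(samples)
        for r in range(n_probes)
    ]
    normalized = []
    for order in orders:
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized

def log2_ratios(test, reference):
    """Per-probe log2(test / reference); 0 means equal signal."""
    return [math.log2(t / r) for t, r in zip(test, reference)]

# Toy intensities for four probes (hypothetical values).
test_sample = [120.0, 480.0, 250.0, 90.0]
reference = [100.0, 200.0, 260.0, 110.0]
norm_test, norm_ref = quantile_normalize([test_sample, reference])
ratios = log2_ratios(norm_test, norm_ref)
```

For the unpaired analysis, the reference vector in this sketch would hold the average probe intensities of the reference set rather than a second sample.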

To identify significant deviations in the log2 ratio data from both platforms, the following t-score optimization algorithm was used. First, log2 ratios were sorted by genome coordinate and moving windows representing a number of adjacent probes were subjected to a t-test against the rest of the data outside of the window on the same chromosome. This was done across the entire genome for all window sizes from 3 to 30 probe sets for the Affymetrix and NimbleGen data. To establish a comparison-specific false-positive threshold, the order of log2 ratios was then randomized and moving window t-tests were recalculated. Two t-score thresholds, one for amplifications and one for deletions, were then defined at which no amplifications or deletions were identified in the randomized data. These thresholds were then applied to the t-scores derived from the original data and regions with t-scores exceeding these thresholds were identified. To identify apparent variants impacting regions larger than our largest moving window size, t-scores were optimized for aberrations encompassing more than 27 probe sets using larger and larger windows until a local maximum t-score was found. As no CNVs met the false positive thresholds set for the NimbleGen data, a 50 probe window was used to detect statistically significant CNVs and a comparison-specific false positive threshold was not applied.
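The moving-window search and permutation thresholds can be sketched as follows for a single chromosome. This is an illustration under stated assumptions rather than the pipeline used in the study: the text does not specify the form of the t-test, so Welch's t statistic is assumed here, a single randomization is used, the window-extension step for aberrations of more than 27 probes is omitted, and all function names and toy values are hypothetical.

```python
import random
from statistics import mean, variance

def welch_t(inside, outside):
    """Welch's t statistic (assumed form) for a window versus the rest."""
    se = (variance(inside) / len(inside) + variance(outside) / len(outside)) ** 0.5
    return (mean(inside) - mean(outside)) / se if se > 0 else 0.0

def window_t_scores(ratios, min_size=3, max_size=30):
    """Yield (start, size, t) for every window of adjacent probes.

    ratios must already be sorted by genome coordinate on one chromosome.
    At least two probes are always left outside the window.
    """
    for size in range(min_size, min(max_size, len(ratios) - 2) + 1):
        for start in range(len(ratios) - size + 1):
            inside = ratios[start:start + size]
            outside = ratios[:start] + ratios[start + size:]
            yield start, size, welch_t(inside, outside)

def permutation_thresholds(ratios, rng):
    """Thresholds at which the randomized data yields no calls."""
    shuffled = list(ratios)
    rng.shuffle(shuffled)
    scores = [t for _, _, t in window_t_scores(shuffled)]
    return max(scores), min(scores)

def call_windows(ratios, rng):
    """Windows whose t-score exceeds the gain or loss threshold."""
    gain_cut, loss_cut = permutation_thresholds(ratios, rng)
    return [(start, size, t) for start, size, t in window_t_scores(ratios)
            if t > gain_cut or t < loss_cut]

# Toy chromosome: 60 probes of alternating +/-0.05 noise, with a gain of
# 1.0 added to probes 20-27 (hypothetical values).
ratios = [(0.05 if i % 2 == 0 else -0.05) + (1.0 if 20 <= i < 28 else 0.0)
          for i in range(60)]
calls = call_windows(ratios, random.Random(0))
best = max(window_t_scores(ratios), key=lambda w: w[2])
```

In this toy example the highest-scoring window coincides with the simulated gain (start 20, size 8). Overlapping significant windows would still need to be merged into single calls, which the sketch leaves out.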

Sequence analysis of recurrent whole genome amplification-induced artifacts

In the analysis of recurrent WGA-induced artifacts, several sets of genomic coordinates were defined based on the human genome reference sequence Build 36/hg18 (released March, 2006) downloaded from the NCBI website ( http://www.ncbi.nlm.nih.gov/ ). To define a set of regions that were consistently over- or under-amplified by the whole genome amplification technique, we analysed apparent variants arising from our comparison of matched pre- and post-WGA samples for overlapping genomic coordinates across all three comparisons and defined minimal overlapping regions ( Supplementary Tables 1 and 2 ). These minimal overlapping regions were defined as the smallest region overlapped by a WGA-induced variant in all three comparisons. To define a subset of recurrently under-amplified chromosome ends, the first or last 2.5% of the reference genome sequence of any chromosome was recorded if it was impacted by a region consistently under-amplified by the WGA technique. To serve as reference sets representing the remainder of the human genome, random sets of coordinates were generated with equivalent size distributions for the regions consistently over- or under-amplified by the whole genome amplification technique and for the subset of recurrently biased regions affecting chromosome ends. In these reference sets, 10 random segments were generated with sizes corresponding to each entry in the list of regions affected by WGA-induced bias (i.e. 1900 amplifications and 750 deletions). The GC and repeat content of each entry in the above sets of coordinates were calculated in the following manner. For each set, the genomic sequence for each coordinate was downloaded from the Ensembl database ( http://www.ensembl.org ). To calculate the GC content of the sequence, the number of Gs and Cs in the sequence was counted and that number divided by the total length of the sequence. 
To calculate the repeat content of the sequence, the coordinates of the UCSC Genome Browser ‘Simple Repeats’ track generated by Tandem Repeats Finder ( 23 ) were used to identify base pairs belonging to repeat sequences. The number of these base pairs was then divided by the total length of the sequence to give the percentage of repeat sequence in the region. As most of the sets were not normally distributed in GC or repeat content as found by the Jarque-Bera test, the two-sample Kolmogorov-Smirnov test (KS test) was used to test whether these sets differed in their distribution of these two parameters.
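The GC content, repeat content and KS comparison described above can be sketched in Python as follows. The function names, the half-open coordinate convention and the toy inputs are assumptions made for illustration. Only the KS statistic D (the largest gap between the two empirical distribution functions) is computed, not its P-value.

```python
from bisect import bisect_right

def gc_content(seq):
    """Number of G and C bases divided by the total sequence length."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def repeat_fraction(region_start, region_end, repeats):
    """Fraction of the half-open region [start, end) covered by repeats.

    repeats: list of (start, end) half-open intervals on the same
    chromosome. Overlapping repeats are counted once.
    """
    clipped = sorted(
        (max(s, region_start), min(e, region_end))
        for s, e in repeats
        if s < region_end and e > region_start
    )
    covered, cursor = 0, region_start
    for s, e in clipped:
        s = max(s, cursor)  # skip bases already counted
        if e > s:
            covered += e - s
            cursor = e
    return covered / (region_end - region_start)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D statistic."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Toy inputs (hypothetical values).
gc = gc_content("GGCCAATT")
rep = repeat_fraction(100, 200, [(90, 110), (105, 120), (180, 250)])
d = ks_statistic([0.40, 0.42, 0.45], [0.50, 0.55, 0.60])
```

Multiplying the returned fractions by 100 gives percentages. A D value of 1 means the two sets do not overlap at all, while 0 means identical empirical distributions.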

RESULTS

Array noise and CNV in samples pre- and post-WGA

To establish a baseline for array noise and CNV detection prior to amplification, each unamplified DNA sample was compared to the Affy48 reference set (Methods; Figure 1B) and candidate CNVs were identified. This comparison versus the Affy48 set was then repeated using amplified samples. As a measure of array noise, we quantified the distribution of log2 ratios resulting from these comparisons by calculating the mean, standard deviation (SD), and interquartile range (IQR) ( Table 1 , Figure 2 ). As expected due to normalization by CNAT4, the mean log2 ratios from both unamplified and amplified samples were very close to zero. The SDs and IQRs of log2 ratios from amplified samples were approximately 1.2 to 1.7 times those of the unamplified samples ( Table 1 ), suggesting an increase in array noise using WGA material.
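The three noise measures can be computed with the Python standard library, as in the sketch below. The function name and toy values are hypothetical. The source does not state whether a sample or population SD was used or which quantile method was applied, so the sample SD and the default quantile method of the statistics module are assumed.

```python
from statistics import mean, quantiles, stdev

def noise_summary(log2_ratios):
    """Mean, sample SD and interquartile range of a list of log2 ratios."""
    q1, _, q3 = quantiles(log2_ratios, n=4)  # quartile cut points
    return {"mean": mean(log2_ratios), "sd": stdev(log2_ratios), "iqr": q3 - q1}

# Toy log2 ratios (hypothetical); the second set is twice as spread out.
unamplified = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.4]
amplified = [2 * x for x in unamplified]
summary_unamp = noise_summary(unamplified)
summary_amp = noise_summary(amplified)
```

A wider spread at an unchanged mean, as in the second toy set, is the pattern interpreted in the text as increased array noise.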

Figure 2.

Boxplots comparing the spread of log2 ratios in unamplified and amplified samples. The log2 ratios resulting from comparison of each sample against the Affy48 reference set were plotted using a standard box and whisker plot displaying a five-number summary: maximum value or Q3 + 1.5 × IQR, Q3, mean, Q1, and minimum value or Q1 − 1.5 × IQR. Outliers, defined as values that fall more than 1.5 × IQR above Q3 or below Q1, are displayed as individual data points. Due to normalization as part of the CNAT4 analysis pipeline, the mean log2 ratio from each sample is close to zero. However, the IQR, as well as the maximum and minimum values, were further from the mean in the amplified samples relative to the unamplified samples. The increased spread of data distribution is likely due to increased array noise and the detection of amplification biases induced by WGA.

Table 1.

Distribution of log2 ratios from comparison of unamplified and amplified samples versus a common reference set of 48 individuals

Sample compared versus Affy48   Mean (a)     SD (b)   IQR (c)   Apparent amplifications   Apparent deletions
                                                                Count   P <               Count   P <
Sample 1 - Unamplified          0.0002517    0.3079   0.3428    2       1.99 × 10^−8      3       1.65 × 10^−9
Sample 1 - Amplified            0.001971     0.3790   0.4793    322     9.76 × 10^−7      368     9.39 × 10^−9
Sample 2 - Unamplified          0.002710     0.2602   0.3152    2       3.70 × 10^−7      2       1.00 × 10^−16
Sample 2 - Amplified            −0.0001297   0.4188   0.5412    254     8.91 × 10^−7      157     8.33 × 10^−9
Sample 3 - Unamplified          0.003530     0.2584   0.3176    3       5.42 × 10^−10     1       1.00 × 10^−16
Sample 3 - Amplified            −0.0004284   0.4076   0.5178    295     7.45 × 10^−7      176     1.36 × 10^−8

a Mean value of log2 ratios resulting from each comparison. A site with equivalent copy number in both samples would return a log2 ratio of 0.

b Standard deviation of log2 ratios resulting from each comparison. These values are interpreted as a measure of data noise from each comparison.

c Interquartile range of log2 ratios resulting from each comparison. These values are interpreted as a measure of data noise from each comparison.

To compare the CNVs detected pre- and post-WGA, we counted apparent CNVs with p-values more significant than each comparison's false-positive detection limit ( Table 1 , Figure 3 ). The analysis of unamplified samples detected 13 candidate CNVs, 11 of which overlapped the coordinates of genomic variants listed in the Database of Genomic Variants ( http://projects.tcag.ca ) ( 5 ) ( Table 2 ). In contrast, the analysis of the amplified samples identified 1572 apparent CNVs, an approximately 100-fold increase in the number of apparently significant amplifications and deletions versus the unamplified samples ( Table 1 ). These artifactual CNVs are likely the result of WGA-induced biases.
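Checking whether a candidate CNV overlaps a catalogued variant reduces to an interval comparison, sketched below in Python. The two candidate coordinates are taken from Table 2; the locus names and locus coordinates are hypothetical placeholders rather than entries from the Database of Genomic Variants, and closed coordinates are assumed.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) closed intervals share a base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def annotate(cnvs, known_loci):
    """Map each candidate CNV to the names of the known loci it overlaps."""
    return {
        cnv: [name for name, locus in known_loci.items() if overlaps(cnv, locus)]
        for cnv in cnvs
    }

# Candidate CNVs from Table 2 (Sample 1).
candidates = [
    ("chr14", 21451264, 22044096),
    ("chr7", 48424572, 48431182),
]
# Hypothetical catalogue entries, for illustration only.
known_loci = {
    "locus_A": ("chr14", 21400000, 21500000),
    "locus_B": ("chr14", 22000000, 22100000),
    "locus_C": ("chr7", 50000000, 50100000),
}
hits = annotate(candidates, known_loci)
n_overlapping = sum(1 for names in hits.values() if names)
```

Counting the candidates with at least one hit gives the kind of tally reported in the text (for example, 11 of 13 candidates overlapping known variants).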

Figure 3.

Apparent CNVs in unamplified and amplified samples. The variants detected in unamplified and amplified samples from comparison against the Affy48 reference set were counted. The amplified samples appear to contain hundreds of CNVs not seen in the unamplified samples, suggesting that WGA over- and under-represents specific regions of the genome.

Table 2.

Apparent amplifications and deletions detected prior to amplification through comparison with a reference set of 48 individuals

Sample compared versus Affy48   Genome coordinates of variant (NCBI Build 36/hg18/Mar 2006)   Size (bp)   CN within variant   CN outside variant   SNP count   P-value            Variation locus (a)
Amplifications
Sample 1                        chr7:48424572–48431182                                        6610        2.88184             2.04848              11          1.99 × 10^−8
Sample 1                        chr14:19381928–19492423                                       110495      2.93812             2.03610              28          4.85 × 10^−13      Locus 2636
Sample 2                        chr2:113809804–113849256                                      39452       2.28770             2.04023              12          3.70 × 10^−7       Locus 0397
Sample 2                        chr17:41569489–41709662                                       140173      3.07396             2.03694              41          2.31 × 10^−12      Locus 3029
Sample 3                        chr9:29695281–29706655                                        11374       2.19958             2.04042              4           <1.00 × 10^−16
Sample 3                        chr14:19309086–19459561                                       150475      2.65807             2.03481              25          5.42 × 10^−10      Locus 2639
Sample 3                        chr15:19163125–20077554                                       914429      2.66995             2.04165              72          <1.00 × 10^−16     Locus 2748
Deletions
Sample 1                        chr7:142030227–142210594                                      180367      1.54593             2.04848              27          1.61 × 10^−10      Locus 1656
Sample 1                        chr14:21451264–22044096                                       592832      1.51299             2.03610              161         <1.00 × 10^−16     Loci 2644 and 2645
Sample 1                        chr22:33661041–33725126                                       64085       1.75349             2.06794              21          1.65 × 10^−9       Locus 3489
Sample 2                        chr2:50682535–50865587                                        183052      1.44974             2.04023              40          <1.00 × 10^−16     Locus 0329
Sample 2                        chr14:21792331–22040096                                       247765      1.38419             2.02893              60          <1.00 × 10^−16     Locus 2645
Sample 3                        chr14:21800768–21932862                                       132094      1.53811             2.03481              32          <1.00 × 10^−16     Locus 2645

To assess experimental variation prior to amplification, each unamplified and amplified sample was subjected to a pair-wise comparison against an experimental replicate of itself (Table 3). The lack of fluctuation in the mean, SD and IQR of the log₂ ratios from unamplified replicates suggests a high degree of reproducibility of the array method used. Similarly, although the SDs and IQRs of amplified replicates were elevated relative to those of unamplified samples, they did not fluctuate substantially between amplified replicates, further supporting the notion that the WGA method behaves consistently. However, the values obtained from unamplified samples versus values obtained from amplified samples, using the Affy48 reference set, showed a substantial decrease in SDs and IQRs. This indicates that amplified samples produce different signal intensity distributions than unamplified samples, suggesting that comparison of amplified to unamplified data sets is potentially problematic.
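The summary statistics reported for each replicate comparison can be illustrated with a minimal sketch. It assumes per-probe intensities are available as two equal-length lists; the function name `log2_ratio_summary` and the example intensities are hypothetical, and any normalization applied by the actual pipeline is not reproduced here.

```python
import math
import statistics


def log2_ratio_summary(sample, replicate):
    """Return mean, SD and IQR of per-probe log2(sample / replicate)."""
    ratios = [math.log2(s / r) for s, r in zip(sample, replicate)]
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(ratios, n=4)
    return {
        "mean": statistics.mean(ratios),
        "sd": statistics.stdev(ratios),
        "iqr": q3 - q1,
    }


# Hypothetical probe intensities for one sample and its experimental replicate.
sample = [1000.0, 1200.0, 800.0, 950.0, 1100.0]
replicate = [1000.0, 1100.0, 900.0, 1000.0, 1000.0]
summary = log2_ratio_summary(sample, replicate)

# A perfectly reproducible replicate gives a mean, SD and IQR of zero.
identical = log2_ratio_summary(replicate, replicate)
```

A wider SD or IQR for the same mean indicates a noisier comparison, which is the pattern Table 3 shows for amplified relative to unamplified replicates.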

Table 3.

Distribution of log₂ ratios from pair-wise comparison of experimental replicates of unamplified and amplified samples

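| Sample | DNA | Mean | SD | IQR |
|---|---|---|---|---|
| Sample 1 | Unamplified | 0.005517 | 0.2579 | 0.3223 |
| Sample 1 | Amplified | 0.002538 | 0.2840 | 0.3544 |
| Sample 2 | Unamplified | 0.008175 | 0.2658 | 0.3299 |
| Sample 2 | Amplified | 0.0003263 | 0.3264 | 0.4153 |
| Sample 3 | Unamplified | 0.0064235 | 0.2585 | 0.3187 |
| Sample 3 | Amplified | 0.001687 | 0.2842 | 0.3517 |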

CNVs induced by whole genome amplification

To identify apparent CNVs arising from non-uniform amplification bias in the WGA technique, data from paired pre- and post-WGA samples were directly compared to each other (Figure 1b). Our analysis identified apparent WGA-induced over- and under-amplifications in each of the three comparisons of amplified versus unamplified material. In sample 1, we detected 502 amplifications (P-value threshold of detection, P < 1.68 × 10⁻⁶) and 580 deletions (P < 1.71 × 10⁻⁸). In sample 2, we detected 467 amplifications (P < 1.68 × 10⁻⁶) and 202 deletions (P < 1.64 × 10⁻⁸). In sample 3, we detected 546 amplifications (P < 1.68 × 10⁻⁶) and 259 deletions (P < 3.45 × 10⁻⁸). Our analysis also revealed a set of 265 recurrent apparent WGA-associated aberrations that were detected in all three comparisons. This set consisted of 190 over-amplifications (Supplementary Table 1) and 75 under-amplifications (Supplementary Table 2). Thirty-nine of these regions overlapped one of the 92 regions of bias (31 of 62 over-amplifications, 8 of 30 under-amplifications) identified by three previous studies (16, 17, 20). Of the regions we identified, 110 overlapped genomic regions with known CNVs (2) (64 over-amplifications, 46 under-amplifications), but there was no correlation between regions susceptible to WGA-associated bias and known CNVs (P = 1.00). In a set of 2650 random genomic coordinates with the same size distribution as the WGA-induced artifacts, 36.26% overlapped a known CNV, a proportion near the 41.51% overlap observed with the set of WGA-induced biases.
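The overlap comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: intervals are treated as closed `(chrom, start, end)` tuples, random segments are placed on a uniformly chosen chromosome (not weighted by length), and all names and toy coordinates are hypothetical.

```python
import random


def overlaps(a, b):
    """True if two closed (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def fraction_overlapping(regions, known):
    """Fraction of regions that overlap at least one known interval."""
    hits = sum(any(overlaps(r, k) for k in known) for r in regions)
    return hits / len(regions)


def random_regions(regions, chrom_lengths, rng):
    """Place one random segment per input region, preserving each region's size."""
    out = []
    for _, start, end in regions:
        size = end - start
        # Simplification: pick any chromosome long enough, with equal probability.
        chrom = rng.choice(sorted(c for c, n in chrom_lengths.items() if n > size))
        new_start = rng.randint(1, chrom_lengths[chrom] - size)
        out.append((chrom, new_start, new_start + size))
    return out


# Toy data (hypothetical coordinates, for illustration only).
toy_regions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
toy_known = [("chr1", 150, 160), ("chr2", 300, 400)]
toy_fraction = fraction_overlapping(toy_regions, toy_known)
toy_random = random_regions(toy_regions, {"chr1": 1000, "chr2": 2000}, random.Random(0))

# Counts reported in the text: 110 of 265 recurrent regions overlap a known CNV.
observed_percent = round(100 * 110 / 265, 2)
```

Generating ten random sets for a list of 265 regions would give the 2650 random coordinates mentioned in the text, although the text does not state that this is how the set was built.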

The minimal overlapping regions (see Methods) of WGA-induced over-amplifications ranged from 2207 bp to 357 399 bp with a median size of 58 961 bp and an IQR of 66 524 bp, and encompassed 13.6 Mbp of the reference human genome sequence. These recurrently over-amplified sites were distributed throughout the genome and had significantly higher GC content than a set of 1900 random genomic segments with identical size distribution (P = 8.36 × 10⁻⁴⁰). These over-amplified sites were also enriched for repeat sequences relative to the set of 1900 random genomic segments (P = 1.76 × 10⁻⁶). These results are compatible with the notion that over-amplification by the WGA technique is related to the GC and repeat content of the underlying sequence.
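One plausible reading of a minimal overlapping region is the intersection of calls that overlap in all three comparisons. The sketch below implements that reading only; the exact definition is given in the Methods and may differ. The function name and toy coordinates are hypothetical.

```python
import statistics


def minimal_overlaps(calls_a, calls_b, calls_c):
    """Intersect three lists of closed (chrom, start, end) calls.

    Returns the regions covered by a call from every list.
    """
    recurrent = []
    for chrom_a, start_a, end_a in calls_a:
        for chrom_b, start_b, end_b in calls_b:
            if chrom_b != chrom_a:
                continue
            for chrom_c, start_c, end_c in calls_c:
                if chrom_c != chrom_a:
                    continue
                start = max(start_a, start_b, start_c)
                end = min(end_a, end_b, end_c)
                if start <= end:
                    recurrent.append((chrom_a, start, end))
    return sorted(recurrent)


# Toy calls from three amplified-versus-unamplified comparisons (hypothetical).
sample1 = [("chr1", 100, 500), ("chr3", 10, 90)]
sample2 = [("chr1", 200, 600)]
sample3 = [("chr1", 150, 450), ("chr2", 1, 10)]
shared = minimal_overlaps(sample1, sample2, sample3)

# Size summary of the shared regions, as reported in the text for the real data.
sizes = [end - start for _, start, end in shared]
median_size = statistics.median(sizes)
```

The nested loops are quadratic or worse in the number of calls, which is acceptable for a few hundred calls per sample; an interval tree or sorted sweep would be the usual choice for larger sets.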

The minimal overlapping regions of the recurrent WGA-induced under-amplifications ranged from 5206 bp to 1.93 Mbp with a median size of 75 698 bp and an IQR of 64 619 bp, and encompassed 8.37 Mbp of the reference human genome sequence. These regions of under-amplification appeared to fall into two groups: those near chromosome ends and those distributed throughout the genome. Comparison of the 54 under-amplified sites distributed throughout the genome with a set of 540 random genomic segments with identical size distribution found no statistically significant difference in GC content (P = 0.0796) or repeat sequences (P = 0.1901). However, the under-amplifications were greatly depleted for GC-rich regions compared to the over-amplifications (P = 1.93 × 10⁻⁵), which supports the notion that WGA amplification efficiency is related to the GC content of the underlying sequence. A plot of GC content versus copy number shows a trend of increasing amplification magnitude (i.e. increasing copy number) with increasing GC content (Figure 4).
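The GC-content comparison and the trend against copy number can be illustrated with a short sketch. The sequence handling (ignoring bases other than A, C, G and T) and the copy-number bin width of 0.5 are assumptions chosen for illustration; they are not taken from the Methods.

```python
import math
from collections import defaultdict


def gc_percent(sequence):
    """Percentage of G and C among called bases (A, C, G, T), ignoring N and gaps."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no called bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def mean_gc_by_copy_number(variants, bin_width=0.5):
    """Average GC percentage of variants grouped into copy-number bins.

    variants: iterable of (copy_number, gc_percent) pairs.
    """
    bins = defaultdict(list)
    for copy_number, gc in variants:
        bins[math.floor(copy_number / bin_width) * bin_width].append(gc)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}


# Hypothetical variants: (copy number, % GC of the underlying sequence).
toy_variants = [(1.2, 40.0), (1.4, 44.0), (2.6, 55.0)]
trend = mean_gc_by_copy_number(toy_variants)
```

Plotting the binned averages against the bin values gives a curve of the kind summarized in Figure 4.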

Figure 4.

Copy number distribution and GC content of WGA-induced CNVs. The number of variants and percentage GC content were plotted against copy number magnitude for all of the CNVs detected by comparisons of each pre- and post-WGA sample pair. There appears to be a direct relationship between the magnitude of over-amplification and increased GC content.

Of the 39 chromosome ends (see Methods) assayed by probe sets, 15 contained regions of under-amplification (Table 4). Only three chromosome ends contained over-amplifications, suggesting that under-representation of chromosome ends is a consistent result of whole genome amplification. The set of chromosome end under-amplifications impacted 2.547 Mbp of the reference human genome sequence, and its GC content was significantly greater than that of a set of 150 random genomic segments with identical size distribution (P = 1.12 × 10⁻⁶). However, there was no statistically significant difference in GC content between the under-amplified chromosome ends and the 25 appropriately amplified chromosome ends (P = 0.8215). This suggests that amplification bias due to GC content does not play a role in under-amplification of specific subtelomeric regions. Under-amplified chromosome ends were enriched for repetitive sequences (see Methods) relative to both a set of 150 random genomic segments with identical size distribution (P = 1.52 × 10⁻⁹) and the 25 assayed chromosome ends that were not under-amplified (P = 0.0022), suggesting that increased repeat content of specific chromosome ends may result in their under-amplification.
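The distance of a region from its nearest chromosome end, as listed in Table 4, can be reproduced with a minimal sketch. The chromosome lengths below are assumed hg18 (NCBI Build 36) values and should be confirmed against the assembly's chromosome size table; the function name is hypothetical.

```python
# Assumed hg18 (NCBI Build 36) chromosome lengths in bp; verify before reuse.
HG18_LENGTHS = {"chr6": 170_899_992, "chr7": 158_821_424}


def mbp_from_nearest_end(chrom, start, end, lengths):
    """Distance in Mbp from a region to the closer of the two chromosome ends."""
    to_p_end = start                 # bases between the p-terminal end and the region
    to_q_end = lengths[chrom] - end  # bases between the region and the q-terminal end
    return min(to_p_end, to_q_end) / 1e6


# Two q-terminal regions from Table 4.
chr6_distance = round(mbp_from_nearest_end("chr6", 170198708, 170308225, HG18_LENGTHS), 3)
chr7_distance = round(mbp_from_nearest_end("chr7", 158582043, 158739710, HG18_LENGTHS), 3)
```

With these assumed lengths the two values round to 0.592 and 0.082 Mbp, matching the corresponding rows of Table 4.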

Table 4.

Regions of recurrent WGA under-amplification within chromosome ends

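| Chromosome end | Genome coordinates (Build 36/hg18/Mar 2006) | Size (Mbp) | % GC content | Mbp from nearest chromosome end |
|---|---|---|---|---|
| p-terminal | chr1:3058506–3129776 | 0.071 | 57.113 | 3.059 |
| p-terminal | chr1:5857077–5871605 | 0.015 | 57.168 | 5.857 |
| p-terminal | chr2:554079–613259 | 0.059 | 45.934 | 0.554 |
| p-terminal | chr2:1841469–1968296 | 0.127 | 45.876 | 1.841 |
| p-terminal | chr5:487981–738504 | 0.251 | 56.251 | 0.488 |
| p-terminal | chr5:2187888–2267721 | 0.080 | 49.395 | 2.188 |
| p-terminal | chr5:2836714–2884070 | 0.047 | 41.89 | 2.837 |
| p-terminal | chr5:3160861–3195828 | 0.035 | 46.205 | 3.161 |
| p-terminal | chr8:791584–850907 | 0.059 | 47.539 | 0.792 |
| p-terminal | chr8:1816651–1946694 | 0.130 | 49.183 | 1.817 |
| p-terminal | chr10:2593122–2624375 | 0.031 | 37.102 | 2.593 |
| p-terminal | chr19:373238–892603 | 0.519 | 59.541 | 0.373 |
| q-terminal | chr6:170198708–170308225 | 0.110 | 51.929 | 0.592 |
| q-terminal | chr7:158582043–158739710 | 0.158 | 45.905 | 0.082 |
| q-terminal | chr10:134327710–134332916 | 0.005 | 49.165 | 1.042 |
| q-terminal | chr12:130611957–130673802 | 0.062 | 51.924 | 1.676 |
| q-terminal | chr13:112193014–112294946 | 0.102 | 42.808 | 1.848 |
| q-terminal | chr13:113053814–113215730 | 0.162 | 50.548 | 0.927 |
| q-terminal | chr15:99580062–99745948 | 0.166 | 47.27 | 0.593 |
| q-terminal | chr16:87408466–87706274 | 0.298 | 59.068 | 1.121 |
| q-terminal | chr20:60967459–61027216 | 0.060 | 49.085 | 1.409 |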

To assess WGA-induced CNV artifacts using a second array platform, we compared pre- and post-amplification sample pairs in three comparative genome hybridization (CGH) experiments using the NimbleGen 385k array. The log₂ ratios from these experiments were widely distributed (average SD = 0.378, average IQR = 0.457). Although several thousand CNVs were detected, the high level of noise in these data meant that none had P-values passing the stringent false-positive thresholds set by our algorithm (P < 3.51 × 10⁻⁷ for over-amplifications, P < 3.30 × 10⁻¹¹ for under-amplifications). Analysis of these data using a 50-probe moving window without filtering for false positives detected 2116 WGA-induced CNVs (466 over-amplifications, 1650 under-amplifications), of which 141 occurred in all three comparisons (29 over-amplifications, 112 under-amplifications). Despite their relatively large size (average = 1.06 Mb, median = 0.36 Mb, SD = 4.10 Mb), only 28 of these overlapped recurrent artifacts detected by the Affymetrix comparisons (17 of 190 over-amplifications, 11 of 75 under-amplifications). This amount of overlap is similar to that seen with a set of 2116 random genomic coordinates with the same size distribution as the CNVs detected by the NimbleGen platform, of which 65 overlapped a WGA-induced CNV detected by the Affymetrix platform. These results suggest that the CNVs detected on the NimbleGen platform are artifacts resulting from the difficulty in distinguishing real CNVs from background noise when co-hybridizing amplified and unamplified samples, even when a large moving window of 50 probes is used.
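The moving-window smoothing mentioned above can be sketched as a simple running mean over probes ordered by genomic position. This shows only the windowing step; the statistical test the pipeline applies to each window is not described in this section and is not reproduced. The function name and toy values are hypothetical.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values (probes in genome order)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the number of values")
    total = sum(values[:window])
    averages = [total / window]
    for i in range(window, len(values)):
        # Slide the window by one probe: add the new value, drop the oldest.
        total += values[i] - values[i - window]
        averages.append(total / window)
    return averages


# Toy log2 ratios with a short elevated stretch (hypothetical values).
toy_log2 = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
smoothed = moving_average(toy_log2, window=3)
```

A window of 50 probes, as used for the NimbleGen data, averages out more probe-level noise than a short window but cannot resolve events spanning only a few probes.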

Use of amplified material for pair-wise copy number comparisons

To assess the use of WGA material in pair-wise comparisons, each sample was compared to each of the other samples, and relative differences in copy number among the three samples were assessed using: (i) unamplified samples versus unamplified samples, (ii) amplified samples versus unamplified samples, and (iii) amplified samples versus amplified samples (Figure 1d). An example of the output from one such set of comparisons is illustrated in Figure 5.
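The three classes of comparison can be enumerated with a short sketch, which also shows why three samples yield the twelve comparisons listed in Table 5. The function name and label strings are hypothetical.

```python
from itertools import combinations


def comparison_design(sample_ids):
    """List every pair-wise comparison for each of the three comparison classes."""
    design = {
        "unamplified vs unamplified": [],
        "amplified vs unamplified": [],
        "amplified vs amplified": [],
    }
    for a, b in combinations(sample_ids, 2):
        design["unamplified vs unamplified"].append((f"unamplified {a}", f"unamplified {b}"))
        # Each sample pair gives two mixed comparisons, one per amplified member.
        design["amplified vs unamplified"].append((f"amplified {a}", f"unamplified {b}"))
        design["amplified vs unamplified"].append((f"unamplified {a}", f"amplified {b}"))
        design["amplified vs amplified"].append((f"amplified {a}", f"amplified {b}"))
    return design


design = comparison_design([1, 2, 3])
counts = {name: len(pairs) for name, pairs in design.items()}
```

For three samples this gives 3, 6 and 3 comparisons in the respective classes.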

Figure 5.

Example of how a pair-wise comparison of amplified material can partially compensate for WGA-induced bias. Shown is the output of three copy number analyses conducted using our CNV discovery software pipeline. Copy number, calculated directly from log₂ ratios of probe intensities, is plotted against genome location using a sliding window of averaged data points, in this case 60 probes. In this example, a pair-wise comparison of two unamplified samples identified a gain of copy number (P < 1.00 × 10⁻¹⁶) in unamplified sample #1 relative to unamplified sample #2 at a locus documented to be copy number variable in the Database of Genomic Variants. Conducting the same comparison after WGA of sample #1 results in hundreds of confounding CNVs from which the known CNV is indistinguishable. However, conducting this comparison after WGA of both samples restores the ability to detect this CNV. Artifactual variants remain as a result of random variation in the WGA process; however, they do not reach the level of significance of the real event. Therefore, when interpreting results from comparisons of WGA samples, only the top-most hits are likely to be representative of the unamplified sample.

The unamplified versus unamplified comparisons identified 21 apparent differences in copy number among the three samples (Tables 5 and 6). These pair-wise comparisons identified 5 of the 13 apparent differences expected from the individual comparisons of samples to the Affy48 reference set. Twelve of these apparent differences, including the five differences expected from comparison with the Affy48 set, overlap variants listed in the Database of Genomic Variants (http://projects.tcag.ca). The amplified versus unamplified comparisons identified 3207 apparent differences in copy number among the three samples (Table 5). Only seven of these apparent differences were detected by both unamplified/amplified and amplified/unamplified comparisons, suggesting that systematic WGA-induced variants and random WGA-reaction variability mask real events.
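The totals quoted in the text can be recomputed from the per-comparison counts in Table 5. The sketch below transcribes those counts, using `U` and `A` as hypothetical shorthand for unamplified and amplified samples, and sums them by comparison class.

```python
# (first sample, second sample, apparent amplifications, apparent deletions),
# transcribed from Table 5; "U1" = unamplified sample 1, "A1" = amplified sample 1.
TABLE5_COUNTS = [
    ("U1", "U2", 4, 3), ("U1", "U3", 4, 4), ("U2", "U3", 4, 2),
    ("A1", "U2", 369, 367), ("U1", "A2", 69, 358),
    ("A1", "U3", 471, 498), ("U1", "A3", 110, 536),
    ("A2", "U3", 183, 49), ("U2", "A3", 67, 130),
    ("A1", "A2", 21, 49), ("A1", "A3", 18, 82), ("A2", "A3", 44, 61),
]


def totals_by_class(rows):
    """Sum apparent CNVs per comparison class: 'UU', 'AU' or 'AA'."""
    totals = {}
    for first, second, amplifications, deletions in rows:
        # Sorting the two status letters makes A-vs-U and U-vs-A the same class.
        key = "".join(sorted(first[0] + second[0]))
        totals[key] = totals.get(key, 0) + amplifications + deletions
    return totals


class_totals = totals_by_class(TABLE5_COUNTS)
```

The class totals are 21 for unamplified versus unamplified, 3207 for amplified versus unamplified and 275 for amplified versus amplified comparisons.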

Table 5.

Apparent copy number differences identified by pair-wise comparisons of all possible combinations of unamplified and amplified samples

| Samples compared (first) | Samples compared (second) | Apparent amplifications: count | Apparent amplifications: P < | Apparent deletions: count | Apparent deletions: P < | Total apparent CNVs | CNVs in common between matched comparisons |
|---|---|---|---|---|---|---|---|
| Unamplified sample 1 | Unamplified sample 2 | 4 | 4.26 × 10⁻⁷ | 3 | 1.40 × 10⁻⁸ | 7 | |
| Unamplified sample 1 | Unamplified sample 3 | 4 | 3.88 × 10⁻⁸ | 4 | 1.05 × 10⁻¹³ | 8 | |
| Unamplified sample 2 | Unamplified sample 3 | 4 | 1.09 × 10⁻¹⁰ | 2 | 3.44 × 10⁻¹⁵ | 6 | |
| Amplified sample 1 | Unamplified sample 2 | 369 | 1.26 × 10⁻⁶ | 367 | 7.77 × 10⁻⁹ | 736 | 2 |
| Unamplified sample 1 | Amplified sample 2 | 69 | 1.05 × 10⁻⁶ | 358 | 7.04 × 10⁻⁹ | 427 | |
| Amplified sample 1 | Unamplified sample 3 | 471 | 1.81 × 10⁻⁶ | 498 | 1.28 × 10⁻⁸ | 969 | 1 |
| Unamplified sample 1 | Amplified sample 3 | 110 | 1.60 × 10⁻⁶ | 536 | 1.53 × 10⁻⁸ | 646 | |
| Amplified sample 2 | Unamplified sample 3 | 183 | 1.07 × 10⁻⁶ | 49 | 5.64 × 10⁻⁸ | 232 | 4 |
| Unamplified sample 2 | Amplified sample 3 | 67 | 1.28 × 10⁻⁶ | 130 | 3.31 × 10⁻⁸ | 197 | |
| Amplified sample 1 | Amplified sample 2 | 21 | 2.03 × 10⁻⁶ | 49 | 1.71 × 10⁻⁸ | 70 | |
| Amplified sample 1 | Amplified sample 3 | 18 | 9.67 × 10⁻⁷ | 82 | 2.69 × 10⁻⁸ | 100 | |
| Amplified sample 2 | Amplified sample 3 | 44 | 1.82 × 10⁻⁶ | 61 | 8.23 × 10⁻⁸ | 105 | |

The amplified versus amplified comparisons identified 275 apparent differences in copy number among the three samples (Table 5). These amplified versus amplified comparisons identified 2 of the 12 apparent amplifications and 5 of the 9 apparent deletions seen in the unamplified comparisons (Table 6), suggesting that pair-wise comparisons of material where both samples have been subjected to WGA can partially compensate for reproducible WGA-induced bias (Figure 5). The most significant deletion identified by each unamplified comparison was recapitulated as the most significant deletion identified by the corresponding amplified comparison (Table 6). This was also true of the most significant amplification in two of the three comparisons (Table 6). The list of variants detected at lower levels of significance than these top-scoring events may still contain real CNVs, although it is difficult to separate these from the remaining artifactual events resulting from random experimental variation without independent validation of each one.
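The ranks in Table 6 appear to order variants by P-value, with tied variants sharing the best rank and the following rank skipped (for example 1, 1, 1, 4). This is an inference from the table, not a documented rule of the pipeline. A minimal sketch of that ranking, with a hypothetical function name, is:

```python
def rank_by_p_value(calls):
    """Rank (label, p_value) calls by ascending P-value.

    Tied calls share the lowest rank and the next rank is skipped
    (competition ranking). Returns a {label: rank} dictionary.
    """
    ordered = sorted(calls, key=lambda call: call[1])
    ranks = {}
    for i, (label, p_value) in enumerate(ordered):
        if i > 0 and p_value == ordered[i - 1][1]:
            ranks[label] = ranks[ordered[i - 1][0]]
        else:
            ranks[label] = i + 1
    return ranks


# Decreases detected in the unamplified 1 versus 3 comparison (Table 6).
decreases_1_vs_3 = [
    ("chr14:21715523-22040167", 1.00e-16),
    ("chr10:54588936-54590136", 1.00e-16),
    ("chr17:76310141-76321112", 1.00e-16),
    ("chr15:19876834-20005562", 1.05e-13),
]
ranks = rank_by_p_value(decreases_1_vs_3)
```

Under this scheme a rank such as "37 of 82" means that 36 other calls in the amplified comparison had a smaller P-value.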

Table 6.

Copy number variants detected by pair-wise comparisons of unamplified and amplified sample sets

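| Sample comparison | Relative CN difference | Unamplified samples: coordinates (Build 36) | Unamplified samples: P | Unamplified samples: rank | Amplified samples: coordinates (Build 36) | Amplified samples: P | Amplified samples: rank | Variation locus a |
|---|---|---|---|---|---|---|---|---|
| 1 versus 2 | Increase | chr2:50775422–51014967 | 1.00 × 10⁻¹⁶ | 1 | chr2:50828689–50960764 | 1.15 × 10⁻⁹ | 1 of 21 | 0329 b |
| 1 versus 2 | Increase | chr14:19272965–19489991 | 1.38 × 10⁻¹⁰ | 2 | | | | 2636 |
| 1 versus 2 | Increase | chr3:21942154–21975950 | 3.91 × 10⁻⁷ | 3 | | | | |
| 1 versus 2 | Increase | chr16:22640088–22688093 | 4.26 × 10⁻⁷ | 4 | | | | 2893 |
| 1 versus 2 | Decrease | chr17:41569489–41708649 | 1.00 × 10⁻¹⁶ | 1 | chr17:41587072–41709662 | 1.00 × 10⁻¹⁶ | 1 of 48 | 3029 |
| 1 versus 2 | Decrease | chr9:11936421–11997006 | 5.09 × 10⁻¹¹ | 2 | | | | 1901 |
| 1 versus 2 | Decrease | chr10:95243220–95304377 | 1.40 × 10⁻⁸ | 3 | | | | |
| 1 versus 3 | Increase | chr8:124654695–124656225 | 1.00 × 10⁻¹⁶ | 1 | | | | |
| 1 versus 3 | Increase | chr13:43692360–43696382 | 3.99 × 10⁻¹³ | 2 | | | | |
| 1 versus 3 | Increase | chr18:20691186–20697540 | 4.86 × 10⁻¹³ | 3 | | | | |
| 1 versus 3 | Increase | chr14:19402695–19502641 | 3.88 × 10⁻⁸ | 4 | | | | 2636 |
| 1 versus 3 | Decrease | chr14:21715523–22040167 | 1.00 × 10⁻¹⁶ | 1 | chr14:21531617–22057862 | 1.00 × 10⁻¹⁶ | 1 of 82 | 2644/5 |
| 1 versus 3 | Decrease | chr10:54588936–54590136 | 1.00 × 10⁻¹⁶ | 1 | | | | |
| 1 versus 3 | Decrease | chr17:76310141–76321112 | 1.00 × 10⁻¹⁶ | 1 | | | | |
| 1 versus 3 | Decrease | chr15:19876834–20005562 | 1.05 × 10⁻¹³ | 4 | chr15:19877365–20077554 | 2.11 × 10⁻¹⁰ | 37 of 82 | 2748 |
| 2 versus 3 | Increase | chr17:41572099–41708649 | 1.00 × 10⁻¹⁶ | 1 | chr17:41522422–41647903 | 8.47 × 10⁻¹³ | 1 of 44 | 3029 |
| 2 versus 3 | Increase | chr15:84684853–84693981 | 1.00 × 10⁻¹⁶ | 1 | | | | 2830 |
| 2 versus 3 | Increase | chr15:98087203–98095507 | 1.11 × 10⁻¹¹ | 3 | | | | 2860 |
| 2 versus 3 | Increase | chr16:77105899–77109454 | 1.09 × 10⁻¹⁰ | 4 | | | | |
| 2 versus 3 | Decrease | chr15:18711364–20079140 | 1.00 × 10⁻¹⁶ | 1 | chr15:19313868–20329239 | 1.00 × 10⁻¹⁶ | 1 of 61 | 2748 |
| 2 versus 3 | Decrease | chr2:50870615–51020480 | 3.44 × 10⁻¹⁵ | 2 | chr2:50828689–51018056 | 1.00 × 10⁻¹⁶ | 1 of 61 | |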

b This CNV locus is overlapped only by the coordinates expected from comparison versus the Affy48 reference set.

Validation of WGA pair-wise comparisons for copy number detection

To determine the extent to which amplified pair-wise comparisons mask known, validated CNVs, DNA from the blood of three father/child pairs with previously described CNVs (9) was subjected to WGA and copy number analysis using the 250k Nsp chip of the Affymetrix 500k set. The original analysis of unamplified DNA, performed using the Affymetrix Mapping 100k SNP array set (9), identified a total of 32 CNVs within the three father/child pairs, of which five (two amplifications, three deletions) were validated by conventional cytogenetic analysis (Table 7).

Table 7.

Copy number variants detected in MR families by pair-wise comparisons of unamplified and amplified sample sets (child versus father)

| Family ID [9] | Relative CN difference | Unamplified samples [9] (100k array set), validated aberrations: coordinates (Build 36) | Mbp | Validation | Cyto-band | Amplified samples (250k Nsp array): coordinates (Build 36) | P | Rank b | Variation locus a |
|---|---|---|---|---|---|---|---|---|---|
| 8379 | Increase | chr10:259695–23144645 | 22.88 | karyotyping | 10p12.2–p15.3 | chr10:1000464–24070263 | 1.00 × 10⁻¹⁶ | 1 of 13 | many |
| 8379 | Increase | chr15:19208413–19943075 | 0.73 | karyotyping | 15q11.2 | chr15:18850150–20335459 | 1.00 × 10⁻¹⁶ | 1 of 13 | 2748 |
| 8379 | Increase | | | | | chr14:21394980–21864733 | 1.00 × 10⁻¹⁶ | 1 of 13 | many |
| 1280 | Increase | | | | | chr9:10069844–10104307 | 5.54 × 10⁻⁷ | 1 of 2 | |
| 1280 | Increase | | | | | chr13:100974064–101034679 | 2.14 × 10⁻⁶ | 2 of 2 | |
| 1280 | Decrease | chr4:22943293–23102259 | 0.16 | FISH (BAC) | 4p15.2 | chr4:22828003–23025619 | 3.64 × 10⁻¹⁰ | 1 of 4 | 0794 |
| 3476 | Increase | | | | | chr5:64484426–64535538 | 1.00 × 10⁻¹⁶ | 1 of 6 | |
| 3476 | Increase | | | | | chr20:50794691–50801972 | 1.00 × 10⁻¹⁶ | 1 of 6 | 3405 |
| 3476 | Decrease | chr1:83242288–83274337 | 0.03 | FISH (fosmid) | 1p31.1 | | | | 0104 |
| 3476 | Decrease | chr4:82282746–85558739 | 3.28 | FISH (BAC) | 4q21.23 | chr4:82531241–92371701 | 1.00 × 10⁻¹⁶ | 1 of 10 | many |
| 3476 | Decrease | | | | | chr22:46869824–46963276 | 1.00 × 10⁻¹⁶ | 1 of 10 | |

b Ranked by significance (P-value). Only variants with the lowest P-values are shown.

The amplified child versus amplified father comparisons identified a total of 63 copy number differences within the three pairs. Analysis of amplified family pair #8379 identified 41 copy number differences (13 relative amplifications, P < 3.48 × 10⁻⁶; 28 relative deletions, P < 8.38 × 10⁻⁸), analysis of amplified family pair #1280 identified six (two relative amplifications, P < 2.14 × 10⁻⁶; four relative deletions, P < 1.05 × 10⁻⁸), and analysis of amplified family pair #3476 identified 16 (six relative amplifications, P < 2.07 × 10⁻⁶; 10 relative deletions, P < 6.09 × 10⁻⁹). These copy number differences were then ranked by P-value (most significant to least) and their coordinates compared to those of the validated aberrations. The amplified versus amplified comparisons identified four of the five CNVs (two amplifications, two deletions) validated by karyotyping or FISH (9), and each received the lowest P-value for its comparison (Table 7). The single validated CNV that was not detected by the amplified comparisons may have been missed due to a difference in array coverage at this site. On the 250k Nsp array, this region was covered by three probe sets (10 683 bp/probe set) compared to six probe sets (5341 bp/probe set) on the 100k array. This was also the smallest of the validated CNVs (0.03 Mb), so the miss may instead reflect a decrease in detection sensitivity when using amplified comparisons. Among the top-ranked variants (i.e. those with the most significant P-values), six variants were identified by the 250k WGA experiment that were not detected by the original experiments. Five of these are covered by six or fewer probe sets (5743–93 452 bp/probe set, one with no probes) on the 100k array. In addition to the possibility of an increased false positive rate due to increased array noise, differences in each array's probe coverage may explain why these regions were detected only by the experiment using amplified samples.
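
The coordinate comparison and the probe-set spacing quoted above can be checked with a short sketch. This is illustrative only and is not the authors' pipeline: the `Interval` type and the helper names are hypothetical, the coordinates are copied from Table 7, and spacing is assumed to be region length divided by the number of probe sets.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two closed intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def bp_per_probe_set(region: Interval, n_probe_sets: int) -> float:
    """Average spacing, assumed to be region length / probe-set count."""
    return (region.end - region.start) / n_probe_sets


# Each validated aberration (unamplified, 100k set) paired with the call from
# the amplified comparison listed on the same row of Table 7 (Build 36).
validated = [
    (Interval("chr10", 259695, 23144645), Interval("chr10", 1000464, 24070263)),
    (Interval("chr15", 19208413, 19943075), Interval("chr15", 18850150, 20335459)),
    (Interval("chr4", 22943293, 23102259), Interval("chr4", 22828003, 23025619)),
    (Interval("chr4", 82282746, 85558739), Interval("chr4", 82531241, 92371701)),
    (Interval("chr1", 83242288, 83274337), None),  # no amplified call reported
]

detected = sum(
    1 for known, call in validated if call is not None and overlaps(known, call)
)

# The missed chr1 deletion: three probe sets on 250k Nsp, six on the 100k set.
missed = validated[-1][0]
spacing_250k = bp_per_probe_set(missed, 3)
spacing_100k = bp_per_probe_set(missed, 6)

print(f"validated CNVs recovered: {detected} of {len(validated)}")
print(f"250k Nsp: {spacing_250k:.0f} bp/probe set; 100k: {spacing_100k:.1f} bp/probe set")
```

Under these assumptions the sketch recovers four of the five validated CNVs and reproduces the reported spacings (10 683 and roughly 5341 bp/probe set) for the 32 049 bp chr1 region.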

Genotype fidelity

To compare the fidelity of genotype calls derived from WGA product to those from the corresponding unamplified samples, data from matched pairs of these sources were compared. Average genotype call rates (±1 SD) were 96.74 ± 1.14% for the unamplified samples and 93.14 ± 2.68% for the WGA samples, suggesting a modest degree of information loss following amplification. Of the SNPs that could not be called in the amplified samples, only 2% were common to all three samples and only one of these fell within a region of WGA-induced bias (an over-amplification). Among calls successfully made from both the amplified and the unamplified sample in each matched pair, genotype concordance was 98.57 ± 0.53%. There was very little overlap between the coordinates of SNPs with non-concordant genotypes and regions of recurrent WGA-induced bias. Of the non-concordant calls, 58.77% were called heterozygous in the unamplified sample and homozygous in the amplified sample (i.e. AB called as AA or BB); 0.2% of these were located in regions of WGA-induced over-amplification and none in regions of WGA-induced under-amplification. A further 40.66% were called homozygous in the unamplified sample and heterozygous in the amplified sample (i.e. AA or BB called as AB), none of which were located in regions of WGA-induced bias. The remaining 0.57% were homozygous calls that switched to the opposite homozygote (i.e. AA called as BB or BB called as AA), and again none were located in regions of WGA-induced bias. In total, the three pre- and post-amplification comparisons identified 12 regions, each containing 3–7 SNPs, that displayed loss of heterozygosity (LOH). Three of the LOH regions had an allele-specific copy number of 3 while the others had a copy number of 2. These regions impacted a total of 58 SNPs, 0.01% of all of the SNPs assayed, and none overlapped a region recurrently over- or under-amplified by WGA. These results suggest that increased random array noise is likely a greater source of genotype non-concordance than systematic allele-specific amplification bias or polymerase error.
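
The concordance breakdown described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the category labels, the `AA`/`AB`/`BB` string encoding (with `None` for a no-call) and the toy SNP data are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Optional

Genotype = Optional[str]  # "AA", "AB", "BB", or None for a no-call (assumed encoding)


def is_het(genotype: str) -> bool:
    """A call is heterozygous when its two alleles differ."""
    return genotype[0] != genotype[1]


def compare_genotypes(unamplified: Dict[str, Genotype],
                      amplified: Dict[str, Genotype]) -> Counter:
    """Classify every SNP in the unamplified sample against its amplified call."""
    counts: Counter = Counter()
    for snp, before in unamplified.items():
        after = amplified.get(snp)
        if before is None or after is None:
            counts["no_call"] += 1           # excluded from concordance
        elif before == after:
            counts["concordant"] += 1
        elif is_het(before) and not is_het(after):
            counts["het_to_hom"] += 1        # AB called as AA or BB
        elif not is_het(before) and is_het(after):
            counts["hom_to_het"] += 1        # AA or BB called as AB
        else:
            counts["hom_to_other_hom"] += 1  # AA called as BB, or BB as AA
    return counts


def concordance(counts: Counter) -> float:
    """Fraction of concordant calls among SNPs called in both samples."""
    called_in_both = sum(counts.values()) - counts["no_call"]
    return counts["concordant"] / called_in_both


# Hypothetical toy data, for illustration only.
before = {"snp1": "AB", "snp2": "AA", "snp3": "BB", "snp4": "AB", "snp5": "AA", "snp6": None}
after = {"snp1": "AB", "snp2": "AA", "snp3": "AA", "snp4": "BB", "snp5": "AB", "snp6": "AA"}

counts = compare_genotypes(before, after)
print(dict(counts), f"concordance = {concordance(counts):.2f}")
```

Restricting the denominator to SNPs called in both samples mirrors the way concordance is reported in the text, so that call-rate loss and genotype disagreement are measured separately.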

DISCUSSION

The ability to discover CNVs in unamplified human DNA using data generated by the Affymetrix Mapping SNP array platform has been previously demonstrated by our group and others (1–3, 9). However, with small amounts of DNA, from tumour biopsies for example, amplification of the starting material prior to discovery of CNVs is often necessary to generate enough material to conduct such analyses. We aimed to assess the nature of the biases introduced by this amplification, to determine their impact on copy number detection, and to determine whether pair-wise comparisons could compensate for them. For the first time, we have used a high resolution microarray platform to explicitly define regions susceptible to WGA-induced bias, statistically assessed the sequence features underlying these biases, and demonstrated an ability to correct for these biases and resolve real CNVs. In this study, three unamplified DNA samples were used to establish a baseline for array noise and CNV detection. These were compared to the same DNA samples amplified in duplicate using a WGA technique. The apparent CNVs we detected by comparing unamplified samples to the unamplified Affy48 reference set were likely real events, as the variants were relatively large and statistically significant, and 11 of the 13 CNVs corresponded to previously documented genomic variants (5). While our variant detection approach adjusts its threshold of significance based on the level of noise of each array, comparisons using amplified samples still identified hundreds of apparent CNVs not seen in the unamplified comparisons on the Affymetrix array platform. Since these comparisons were performed against an unamplified reference, it is likely that these artifactual CNVs resulted from preferential amplification of regions of the genome and not from an increased level of array noise.

The data from the NimbleGen platform appeared to have a high level of noise that affected our ability to detect WGA-induced CNVs when co-hybridizing unamplified and amplified samples. Our results suggest that amplified and unamplified samples cannot be directly compared to uncover WGA-induced artifacts using the NimbleGen CGH array. However, this should not preclude the comparison of similarly amplified samples on this platform, as we have shown using Affymetrix arrays that the biases are largely systematic and that the noise is reduced substantially when comparing two amplified samples.

To explore the nature of this bias, we directly compared Affymetrix data from pre- and post-amplification sample pairs and observed a set of regions apparently over- or under-amplified in all three samples. These regions impacted a total of 21.97 Mb of sequence, consisted of 190 over-amplifications and 75 under-amplifications, and overlapped 39 of 92 regions of WGA-induced bias identified by other studies (16, 17, 20). The limited overlap is perhaps due to differences in genome coverage by the arrays used in these studies, particularly as there was no previous consensus on any region being susceptible to WGA-induced bias. The results reported here are for DNA amplified using the QIAgen Mini kit, and it is conceivable that DNA amplified using different protocols will exhibit different biases. While the lack of a correlation between regions of WGA-induced bias and known CNVs differs from a previous observation (17), we have demonstrated that the degree of overlap of the amplification biases we identified with known CNVs is only slightly greater than would be expected by chance. The amount of overlap observed is likely due to the fact that documented CNVs are generally large, 165 kb on average, and, in total, impact ∼27% of the genome.
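
To see why large, abundant CNVs make chance overlap common, a back-of-the-envelope estimate can be sketched as below. This is not the test used in the study: it assumes biased regions are placed uniformly at random, that CNVs are well separated, and that the mean sizes quoted in the text are typical; the function name is hypothetical.

```python
def chance_overlap_probability(cnv_coverage: float,
                               mean_cnv_bp: float,
                               mean_region_bp: float) -> float:
    """Approximate chance that a randomly placed region touches any CNV.

    A region of length r overlaps a CNV of length c when its start falls in a
    window of length c + r, so n well-separated CNVs in a genome of length G
    give n * (c + r) / G = f * (1 + r / c), where f = n * c / G is the covered
    fraction. Overlapping windows make this an overestimate, hence the cap.
    """
    p = cnv_coverage * (1.0 + mean_region_bp / mean_cnv_bp)
    return min(1.0, p)


# Figures quoted in the text: 21.97 Mb spread over 190 + 75 biased regions,
# documented CNVs averaging 165 kb and covering about 27% of the genome.
mean_bias_region_bp = 21.97e6 / (190 + 75)
p_overlap = chance_overlap_probability(0.27, 165e3, mean_bias_region_bp)

print(f"mean biased region: {mean_bias_region_bp:,.0f} bp")
print(f"approximate chance overlap per region: {p_overlap:.2f}")
```

With these inputs the mean biased region is about 83 kb and the estimate comes to roughly 0.41, so a substantial share of biased regions would be expected to touch a documented CNV even if the two were unrelated. Note that the genome length cancels out of the expression, so no genome size needs to be assumed.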

The difference in size and size distribution of the over- and under-amplifications that we identified suggests focal over-amplification of specific sequences and broader under-representation of others. We observed a direct relationship between amplification efficiency and GC content: over-amplified regions had a statistically significant increase in GC content relative to the under-amplified regions (P = 1.93 × 10⁻⁵), and the magnitude of over-amplification appeared to scale directly with GC richness (Figure 4). These results are consistent with the notion that WGA-induced over-amplification bias is related to the increased binding affinity of GC-rich hexamers relative to AT-rich hexamers and not to a shortage of hexamers corresponding to repetitive regions in the genome. There is also the possibility that, unlike many polymerases, Phi29 polymerase is more efficient in synthesizing GC-rich sequences, thereby resulting in over-amplification of these regions. These effects likely also contribute to under-amplification of GC-poor regions distributed throughout the genome, but probably not to the loss of chromosome ends. The lack of a relationship between regions of WGA-induced bias and the presence of known CNVs suggests that different mechanisms account for these phenomena.
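
The GC-content comparison rests on a simple per-region quantity, which can be sketched as below. This is an illustrative helper, not the authors' implementation: the function name, the choice to ignore ambiguous bases such as N, and the toy sequences are assumptions, and no statistical test is implied.

```python
from statistics import mean


def gc_fraction(sequence: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T) in a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)


# Hypothetical toy sequences standing in for over- and under-amplified regions.
over_amplified = ["GGCGCCATGC", "CCGGGCTAGC"]
under_amplified = ["ATTATAGCAT", "TTAANNATGA"]

mean_gc_over = mean(gc_fraction(s) for s in over_amplified)
mean_gc_under = mean(gc_fraction(s) for s in under_amplified)
print(f"mean GC, over-amplified: {mean_gc_over:.2f}; under-amplified: {mean_gc_under:.2f}")
```

In practice the per-region GC fractions of the two groups would be passed to whatever two-sample test is appropriate for the data; the sketch stops at the summary values.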

The loss of chromosome ends appears to be a consistent result of the WGA procedure, as 15 of the 39 ends assayed were under-amplified in all samples compared to only three that were over-amplified. Relative to chromosome ends that were not affected by bias, the under-amplified ends were enriched for repetitive sequences (P = 0.0022) but did not have a statistically significant difference in GC content (P = 0.8215). These results suggest that the source of amplification bias at chromosome ends is different from the GC-content-derived biases affecting the rest of the genome. One possible explanation is the positional effect of having fewer overlapping amplification products at the ends of linear strands of DNA than in the middle. However, if this were the case, then all chromosome ends should be similarly under-amplified, which they are not. Another possible explanation is that the limited quantities of hexamers corresponding to subtelomeric repeats result in fewer priming events in these regions. This may account for the more frequent loss of repetitive chromosome ends than of less repetitive ends.

We found that samples subject to Phi29-based WGA can be used for accurate genotyping, albeit with some data loss. From the WGA samples, we consistently observed a decrease in the average number of genotype calls and a wider range of call rates compared to those from the unamplified samples. However, of the genotype calls that were made, over 98% were concordant between amplified and unamplified sample pairs. Of the fewer than 2% of calls that were non-concordant, 99.43% were discrepant heterozygotes (i.e. AB called as AA or BB, or AA or BB called as AB) rather than incorrectly called homozygotes, and almost none (<0.12%) were located in regions of WGA-induced bias. This discrepancy rate is very near that observed between unamplified replicates on the Affymetrix 500k array (24). It is likely that the genotype call non-concordance reflects the genotyping accuracy of the array in the presence of increased noise due to WGA, and not true genotype changes induced by WGA through allele-specific amplification or polymerase error.

Regardless of the source of the systematic biases induced by WGA, we have shown that pair-wise analysis of amplified samples is a viable strategy for CNV detection, provided that an appropriate threshold of significance is applied to filter out the low-significance random artifacts induced by this technique. While the greater number of apparent copy number differences detected using amplified samples has the potential to mask real events, we observed that pair-wise comparisons of such samples can detect real differences between samples. When amplified samples are compared to amplified samples, the number of artifactual copy number differences is reduced by an order of magnitude relative to comparisons of amplified versus unamplified samples, owing to the systematic nature of the bias induced by the technique. Conceivably, the use of a large, amplified reference set would be a practical alternative to pair-wise comparisons for larger batches of amplified samples requiring a universal reference. Of the apparent copy number differences detected by the three pair-wise comparisons using unamplified material, all of the top deletions and two of the three top amplifications were identified as the most significant by the corresponding comparisons using amplified material. When this technique was applied to paired child/father samples with known, validated copy number differences (9), four of the five validated differences detected by the original study using unamplified DNA were the most significant in the same comparisons using amplified DNA. The only validated CNV that was missed using WGA material was likely missed because of a difference in coverage between the array platforms used. A similar difference in coverage partially explains the presence of six high confidence CNVs detected by the WGA experiments but not seen in the original study, as one of these has recently been observed in the unamplified material using a higher resolution platform. Therefore, when evaluating the results from amplified comparisons, CNVs with the top-ranked significance are more likely to be real CNVs in the unamplified sample.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge the expert technical assistance of Susanna Chan, Jennifer Asano and Adrian Aly of the Affymetrix Array group at the Genome Sciences Centre, BC Cancer Agency. Support for this work and funding to pay the Open Access publication charges for this article was received from the BC Cancer Foundation, the Canada Foundation for Innovation, and Hoffmann-La Roche Ltd. of Canada. TJP is a Senior Graduate Trainee of the Michael Smith Foundation for Health Research (MSFHR) and the BC Cancer Foundation and has been supported by fellowship grants from Eli Lilly and the University of British Columbia. MG is a Senior Graduate Trainee of the MSFHR and Genome BC and is supported by a fellowship from the Natural Sciences and Engineering Research Council. MAM is a senior scholar of the MSFHR and a Terry Fox Young Investigator.

Conflict of interest statement. None declared.

REFERENCES

1. Common deletion polymorphisms in the human genome, Nat. Genet., 2006, vol. 38 (pg. 86–92)
2. Structural variation in the human genome, Nat. Rev. Genet., 2006, vol. 7 (pg. 85–97)
3. A high-resolution survey of deletion polymorphism in the human genome, Nat. Genet., 2006, vol. 38 (pg. 75–81)
4. Segmental duplications and copy-number variation in the human genome, Am. J. Hum. Genet., 2005, vol. 77 (pg. 78–88)
5. Detection of large-scale variation in the human genome, Nat. Genet., 2004, vol. 36 (pg. 949–951)
6. Large-scale copy number polymorphism in the human genome, Science, 2004, vol. 305 (pg. 525–528)
7. APP locus duplication causes autosomal dominant early-onset Alzheimer disease with cerebral amyloid angiopathy, Nat. Genet., 2006, vol. 38 (pg. 24–26)
8. High-resolution mapping of genotype-phenotype relationships in cri du chat syndrome using array comparative genomic hybridization, Am. J. Hum. Genet., 2005, vol. 76 (pg. 312–326)
9. Oligonucleotide microarray analysis of genomic imbalance in children with mental retardation, Am. J. Hum. Genet., 2006, vol. 79 (pg. 500–513)
10. High-resolution genomic profiles of human lung cancer, Proc. Natl Acad. Sci. USA, 2005, vol. 102 (pg. 9625–9630)
11. An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays, Cancer Res., 2004, vol. 64 (pg. 3060–3071)
12. The use of whole genome amplification in the study of human disease, Prog. Biophys. Mol. Biol., 2005, vol. 88 (pg. 173–189)
13. Comprehensive human genome amplification using multiple displacement amplification, Proc. Natl Acad. Sci. USA, 2002, vol. 99 (pg. 5261–5266)
14. Whole-genome multiple displacement amplification from single cells, Nat. Protoc., 2006, vol. 1 (pg. 1965–1970)
15. SNP-based chromosomal copy number ascertainment following multiple displacement whole-genome amplification, Biotechniques, 2007, vol. 42 (pg. 77–83)
16. Genome coverage and sequence fidelity of phi29 polymerase-based multiple strand displacement whole genome amplification, Nucleic Acids Res., 2004, vol. 32 pg. e71
17. Evaluation of Phi29-based whole-genome amplification for microarray-based comparative genomic hybridisation, Lab. Invest., 2007, vol. 87 (pg. 75–83)
18. Whole genome analysis of genetic alterations in small DNA samples using hyperbranched strand displacement amplification and array-CGH, Genome Res., 2003, vol. 13 (pg. 294–307)
19. Genome-wide single-nucleotide polymorphism arrays demonstrate high fidelity of multiple displacement-based whole-genome amplification, Electrophoresis, 2005, vol. 26 (pg. 710–715)
20. Amplification of whole tumor genomes and gene-by-gene mapping of genomic aberrations from limited sources of fresh-frozen and paraffin-embedded DNA, J. Mol. Diagn., 2005, vol. 7 (pg. 171–182)
21. Fidelity of phi 29 DNA polymerase. Comparison between protein-primed initiation and DNA polymerization, J. Biol. Chem., 1993, vol. 268 (pg. 2719–2726)
22. Assessment of whole genome amplification-induced bias through high-throughput, massively parallel whole genome sequencing, BMC Genomics, 2006, vol. 7 pg. 216
23. Tandem repeats finder: a program to analyze DNA sequences, Nucleic Acids Res., 1999, vol. 27 (pg. 573–580)
24. Affymetrix. BRLMM: an improved genotype calling method for the GeneChip Human Mapping 500K Array Set, Technical Report, 2006. White Paper. Santa Clara, CA: Affymetrix, Inc.

© 2008 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
