Estimation and assessment of raw copy numbers at the single locus level (original) (raw)

Abstract

Motivation: Although copy-number aberrations are known to contribute to the diversity of the human DNA and cause various diseases, many aberrations and their phenotypes are still to be explored. The recent development of single-nucleotide polymorphism (SNP) arrays provides researchers with tools for calling genotypes and identifying chromosomal aberrations at an order-of-magnitude greater resolution than possible a few years ago. The fundamental problem in array-based copy-number (CN) analysis is to obtain CN estimates at a single-locus resolution with high accuracy and precision such that downstream segmentation methods are more likely to succeed.

Results: We propose a preprocessing method for estimating raw CNs from Affymetrix SNP arrays. Its core utilizes a multichip probe-level model analogous to that for high-density oligonucleotide expression arrays. We extend this model by adding an adjustment for sequence-specific allelic imbalances such as cross-hybridization between allele A and allele B probes. We focus on total CN estimates, which allows us to further constrain the probe-level model to increase the signal-to-noise ratio of CN estimates. Further improvement is obtained by controlling for PCR effects. Each part of the model is fitted robustly. The performance is assessed by quantifying how well raw CNs alone differentiate between one and two copies on Chromosome X (ChrX) at a single-locus resolution (27kb) up to a 200kb resolution. The evaluation is done with publicly available HapMap data.

Availability: The proposed method is available as part of an open-source R package named aroma.affymetrix. Because it is a bounded-memory algorithm, any number of arrays can be analyzed.

Contact: hb@stat.berkeley.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Due to high demand and strong competition, the number of loci interrogated by DNA microarrays has in recent years been pushed to levels that are of orders of magnitude greater than were available only a few years ago. In 2003, Affymetrix released a high-density oligonucleotide array named Mapping 10 K targeting genotyping and estimation of copy numbers (CNs) of 10 K (10 000) single-nucleotide polymorphisms (SNPs). Since then Affymetrix has released platforms that interrogate 100 K SNPs and 500 K SNPs (‘_SNP chips_’). More recently, Affymetrix released the Genome-wide SNP 5.0 and Genome-wide SNP 6.0 chip types (‘_SNP & CN chips_’). In addition to SNPs, these interrogate a large number of non-polymorphic (NP) loci targeting CN analysis by (often single) CN probes. The 5.0 chip has the same set of SNPs as the 500 K chip as well as 400 K NP loci. The 6.0 chip interrogates 900 K SNPs and 900 K NP loci. Illumina has released BeadArray platforms interrogating 240, 300, 370, 550, 650 and recently 1000 K loci. It is reasonable to expect that this competitive growth in density of markers will continue for at least another couple of years.

The Affymetrix SNP chips consist of a large number of units (‘probe sets’), each interrogating a particular SNP. Each unit consists of a set of 25-mer probe pairs quantifying the amount of allele A and allele B DNA target. Although aligned between the two alleles, the sequences across probe pairs are slightly shifted relative to each other such that they together cover a wider region around the SNP loci than if all pairs would be centered at the SNP loci. The number of probe pairs per unit and the amount of probe-pair shift varies with chip type and also between SNPs. For the SNP & CN chip types, the probe pairs are no longer shifted relative to each other, but instead they are technical replicates. Moreover, for the 5.0 chip type the two alleles within a probe pair are not necessarily aligned. Because of these updated chip designs, the model introduced in this article should only be applied to the SNP & CN chips with great care. We are currently developing alternative method that we believe are better suited for these new chip types.

We have developed a preprocessing method for estimating raw CNs from Affymetrix SNP arrays. By raw CNs we mean single-locus CN estimates on a continuous scale that have been obtained independently of each other in the sense that the information of their chromosomal positions have not been utilized. Raw CNs constitute the input to downstream methods for identifying chromosomal regions with CN aberrations, e.g. CN segmentation methods. The purpose of our method, referred to as ‘Copy-number estimation using Robust Multichip Analysis’ (CRMA), is to provide downstream methods with raw total CNs of as high accuracy and precision as possible, something which is likely to make those methods perform better.

The outline of this article is as follows. In Section 2, we give further details on the layout of Affymetrix SNP chips together with details on the most important systematic effects observed in these chip types. In Section 3, in addition to describing the data set used for the analysis, the CRMA method is given in detail. Other existing methods used for comparison are also described. At the end, the methods used for evaluation are explained. In Section 4, results are given. Because parts of the preprocessing can also be applied to genotyping, their outcomes are presented first. Thereafter, the method is evaluated and compared with the other methods. The proposed method and the results are discussed in Section 5. Conclusions and future research directions are given in Section 6.

2 APPROACH

Let Cij be the true number of DNA copies in a sample i = 1, … , I at locus j = 1, … , J with position xj on some chromosome. In normal individuals, with the exception for CN polymorphic regions, we expect that Cij = 2 for loci on autosomal chromosomes, as well as on ChrX in females. Furthermore, if the sample is homogeneous, we expect Cij to be discrete, otherwise continuous. One fundamental assumption in CN analysis is that loci close together are likely to have the same underlying CN.

2.1 Design of SNP arrays

Ignoring so-called mismatch (MM) probes, a SNP on an Affymetrix SNP chip is interrogated by a set of perfect-match (PM) probe pairs {(PM_jkA_, PM_jkB_)}k where each probe pair is indexed by (j, k) for k = 1, … , Kj, where Kj is the number of probe pairs for SNP j. The number of probe pairs per unit varies between different chip types and in some cases also between units.

2.2 Crosstalk between allele A and allele B

The PM_jkA_ and PM_jkB_ probes differ only by one nucleotide in the position that corresponds to the SNP locus in the genome. Their close similarity give rise to crosstalk between the two allele signals. Due to the way Affymetrix label a probe allele A or allele B, there are six possible nucleotide pairs at the interrogating position on the forward strand, namely (A,C), (A,G), (A,T), (C,G), (C,T) and (G,T). The corresponding ones on the reverse strand are (T,G), (T,C), (T,A), (G,C), (G,A) and (C,A). Note that the two sets become identical if the alleles on one of the strands are swapped. Allelic crosstalk can be observed when plotting the probe-pair signals stratified by nucleotide pair as illustrated in Figure 1 and further explained in Section 3.2.1.

Fig. 1.

Allelic crosstalk in sample NA06985 for the 107 590 PM pairs with allele nucleotide pairs (A, T) on forward strand and (T, A)-swapped on the reverse strand. The probe-pair signals cluster in three radial groups corresponding to genotypes AA (horizontal), AT (diagonal), and TT (vertical). Left: Before calibration there is a strong crosstalk between the two alleles; allele T picks up signals from allele A and vice versa. The crosstalk was estimated to have offset â = (123, 128) and crosstalk parameters Ŝ = (1.00, 0.062; 0.071, 0.986). Right: After calibration, the probe-pair signals for genotypes AA and TT are orthogonal.

2.3 Sequence-dependent effects

Ishikawa et al. (2005) and Nannya et al. (2005) observed a bias in the observed CNs (log ratios) with a strong dependency on PCR fragment length and GC content. During a PCR cycle, shorter DNA sequences are more likely to be fully replicated compared with longer ones. The outcome is that the longer the extracted fragments are, the less target DNA gets hybridized, and the weaker the observed signals become. The number of G and C nucleotides in a probe has a similar effect on the efficiency of the PCR and the hybridization to the probes. As further explained in Section 3.2.5, if these effects vary between samples, they give rise to non-linear effects in the observed log ratios.

3 METHODS

3.1 Data set

Mapping250K_Nsp CEL files (dated July 28 and 29, 2006) for the 30 male and 30 female CEU founders were downloaded from The International HapMap Project (Altshuler et al., 2005; The International HapMap Consortium, 2003). Offsprings were excluded in order to avoid biological relationships. Because female NA12145 has low true ChrX CN level (Ting et al., 2006), it was excluded from the evaluation step.

3.2 Proposed model

The CRMA method pre-processes the probe signals, summarizes them, post-processes the summarized signals, and calculates raw CNs relative to a reference. As described in detail below, the main pre-processing method is allelic crosstalk calibration, with the option to perform a follow-up quantile normalization. A multichip log-additive model fitted robustly is used to summarize the probe signals. Summarized signals are normalized for sequence-dependent effects that are related to main covariate fragment length with the option to also include covariate GC content. For paired analysis, a copy-neutral sample should be used as a reference. For non-paired analysis, a robust average across a pool of samples, where the majority are copy neutral, may be used as a reference. Here we use the pool of all samples.

3.2.1 Calibration for offset and crosstalk between alleles

By swapping the alleles for the reverse-strand probe pairs, we can model these together with the corresponding probe pairs on the forward strand, leaving us with the six unique classes identified in Section 2.2. Let xijk = (xijkA, xijkB) and yijk = (yijkA, yijkB) be the true and observed signals for probe pair (j, k) in sample i. Let {x*ijk} and {y*ijk} be the corresponding signals with the probe pairs on the antisense strand swapped. For a particular nucleotide pair, we model the allelic crosstalk and shift observed in {y*ijk} by a sample-specific affine transformation as

formula

(1)

where ai = (aiA, aiB)T denotes the offset,

formula

(2)

is the crosstalk matrix, and ε_ijk_ = (ɛ_ijkA_, ɛ_ijkB_)T is random noise. Because either xijkA or xijkB is zero for homozygote (i, j) data points, it is possible to estimate (ai, Si) from the observed data {y*ijk}jk stratified by nucleotide pair. We constrain Si such that for a subset of probe pairs that are likely to be diploid the average of {x*ijkA}jk and the average of {x*ijkB}jk equals an arbitrary constant (=2200). This constant should be the same for all arrays in order to bring the arrays to the same scale.

To obtain estimates (âi, Ŝi), we utilize a robust simplex fitting algorithm (Wirapati and Speed, 2001) implemented in the R package sfit (http://www.braju.com/R/). One advantage of this algorithm is that it is also robust against outliers in between the two homozygote arms (Fig. 1), which is the case for samples with CN gains. Estimates {◯ijk} of the true signals (up to a multiplicative constant) are obtained by backtransforming as

formula

(3)

and by reversing the swap of the probes on the antisense strand.

It is not obvious why different nucleotide pairs should contribute with different offsets. An alternative is to constrain Model (1) to have a common offset across nucleotide pairs and alleles. The small differences in offset estimates observed between groups (see Supplementary Materials) support the latter. One rationale for such a model is that there exists a scanner offset (Bengtsson et al., 2004) that dominates other offsets. However, due to limitations in the implemented algorithm available we use Model (1) and assume the model error is small.

3.2.2 Normalization for differences in probe signal distributions

Depending on the experimental setup, a reasonable assumption is that the distribution of the true amount of DNA targets over all probes should be the same across samples. This should also be true for the observed signals when all systematic effects have been removed. If this is not the case, probe signals may be transformed further by a process called quantile normalization, such that this assumption is met.

As illustrated in Section 4.1, the affine transform of the above allelic crosstalk model in Equations (1)–(3) explains most of the differences observed in the empirical distributions of the observed probe signals. However, if for reasons other than offsets and allelic crosstalk, there exist differences in signal distributions, an option is to apply quantile normalization following the crosstalk calibration.

To simplify the notation, let _k_′ = 1, … , _K_′ be the indices for all _K_′ PM probes on the array. Then, for quantile normalization, the (calibrated) probe signals {◯ _ik_′}_k_′ are transformed for each array i = 1, … , I such that their distribution become the same as a given target distribution. Various methods have been proposed to normalize empirical distributions, cf. Bolstad et al. (2003). For reasons explained below, instead of modeling the empirical distributions explicitly, we choose to normalize signals {◯ik′} as follows:

Pass A:

Choose a subset of the PM probes for which one can assume that the distribution of the true signals are equal across arrays. Let K* be the number of such probes.
For each array i = 1, … , I, construct the set {◯ i(k*)}(k*) consisting of the ordered above subset such that ◯ i(1) ≤ ◯ i(2) ≤ · · · ≤ ◯ i(K*).
For each rank (k*) = 1, … , K*, calculate the target distribution as the average signal across arrays by

(4)

Pass B:

For each array i = 1, … , I:

Fit a smooth spline fi (·) robustly to {(◯ i(k*), (k*))}(k*).
For all probes, normalize (backtransform) signals as

(5)

This algorithm will make the distribution of probe signals formula approximately the same across arrays for the subset of probes chosen in Step A1. The distributions are only approximately equal because smooth splines are used in Step B1.

There are at least two advantages with this algorithm compared with existing quantile normalization algorithms. First, because Step A3 uses a summation to calculate the average, it can be calculated incrementally such that only 2_K_′ probe signals are kept in memory at any time. As Steps B1-B2 is done for each array independently, the memory complexity of the above two-pass algorithm is O(_K_′). Second, probes that interrogate loci for which we know that the true CNs differ across samples can be excluded while estimating the normalization function fi (·) in Step B1. Note that all probe signals will still be normalized in Step B2. This allows us to exclude for instance ChrX probes in Step A1 to avoid bringing female and male distributions toward each other. Loci known to carry CN polymorphisms may also be excluded for the same reason.

The method just described is not always appropriate, for example, when samples are expected to contain a large number of CN aberrations. If not used, replace ŷijk by ◯ijk in what follows. We will see in Section 4 that this is not necessarily a bad choice.

3.2.3 Construction of non-polymorphic probe signals

In this article, we focus on optimal estimation of total CNs. In other words, we are not interested in the amount of target DNA in the individual strands but only the total amount. For this reason we sum the allele signals at the probe level as

formula

(6)

for all samples i = 1, … , I and all probes (j, k). This is also an approach taken by the dChip tool (private communication). We refer to formula as non-polymorphic probe signals to indicate that the signals no longer contain allele-specific information but only information on the amount of DNA at a specific locus. One advantage of summing the signals at the probe level is that we partly avoid the problem of modeling signals close to the signal-to-noise level (‘close to zero’), which becomes a problem especially for homozygote SNPs, as well as for the log-additive model introduced next.

3.2.4 Probe-level modeling

Let formula be as in Equation (6). For each locus j, we model the signals formula ik across samples and across probes by

formula

(7)

where e is a global constant (explained below), parameters {θ_ij_}i are the chip effects for samples i = 1, … , I at locus j, and parameters {φ_jk_}k are the affinities for probes k = 1, …, Kj at locus j, and ξ_ijk_ are zero-mean error terms with variance formula ⁠. For Equation (7) to be identifiable, probe affinities are constrained such that ∏k φ_jk_ = 1. This model is similar to the one Irizarry et al. (2003) suggested for expression arrays. The model is fitted robustly using M estimators via an iterative reweighted least squares (IWLS) method (Bolstad, 2004) implemented in the affyPLM package (Ben Bolstad, 2007). In cases when the true CN is zero, the corresponding chip effect (and raw CN) cannot be defined on a logarithmic scale. By allowing for a small bias in the chip effects, which in turn will give a small bias in the raw CNs, we can avoid this limitation of the model. Thus, by adding a global constant e, we lower the risk for obtaining non-defined chip effects due to either zero CNs or due to noisy signals close to zero. Here we use e = 300, but the exact value is not critical. We obtain very similar results for shifts in [250, 350], which, depending on array, correspond roughly to PM quantiles in [1.5%, 2.5%].

Furthermore, non-positive probe signals in formula introduced by the backtransformation in Equation (3), and that remain after the summation in Equation (6) and addition of the global constant e, are given zero weight when fitting the model. Excluding non-positive signals this way will in turn introduce a small bias in the chip effects. More research is needed in order to assess the impact of such biases on CN estimates. Alternative models that avoid the problem with non-positive signals on the logarithmic scale have been suggested (Li and Wong, 2001; Zhou and Rocke, 2005).

3.2.5 Normalization for sequence effects

By modeling the means of the chip effects as a smooth function of PCR fragment length and GC content, we correct for differences in sequence effects across arrays observable in the summarized signals (not shown). Let λ_j_ be the length in basepairs of the restriction fragment containing locus j. These lengths are predetermined by the restriction sites of the enzymes. Let κ_j_ ∈ [0,1] be the fraction of G and C nucleotides for the same sequence. For reasons outlined below, we model the sequence effects on formula instead of formula ⁠, where formula are reference signals at each locus (Section 3.2.7). Analogously to quantile normalization, normalization for PCR effects can be performed using a two-pass procedure as follows:

Pass A:

Choose a subset of the loci for which one can assume to be diploid in most or in all samples. Let J* be the number of such loci. In case non-diploid samples or loci are included by mistake, we rely on the robustness of the estimators.
For each locus j* = 1, … , J*, calculate the average log-signal by

(8)
Fit a smooth spline gT (·) robustly to ⁠, where zj = (λ_j_, κ_j_). This constitutes the average PCR effect across samples.

Pass B:

For each sample i = 1, … , I:

Fit a smooth spline robustly to {(zj*, log2 θ_ij_*)}j*. This constitutes the PCR effect for sample i.
For all loci, estimate the PCR discrepancies for sample i by

(9)
Finally, normalize all loci as

(10)

Steps B1–B3 allow us to use a pre-estimated gT(·), which turns the problem into a single-array estimation problem. This is computationally more efficient for large data sets. For further computational efficiency, constrain the fragment-length and GC-content effects to contribute additively (on the logarithmic scale), that is,

formula

(11)

where _g_FL, i (·) and _g_GC, i (·) are smooth functions that can be fitted and corrected for separately, and analogously for gT (·).

We note that the above normalization could have be done on the log-ratios {Mij}j [Equation (12)], cf. Nannya et al. (2005). However, in that case it is not clear how to exclude ChrX and other non-diploid loci from the fit and still be able to normalize all loci. Also, when normalizing {Mij}j directly there is no obvious way to go back to chip effects if additional normalization of such is wanted. For further clarification, see Bengtsson (2004) and Bengtsson and Hössjer (2006).

3.2.6 Calculation of raw copy numbers

The logarithm of the relative CN estimate, or shorter, the raw CN, for sample i and locus j, is defined as

formula

(12)

where formula is a reference (hence the subscript R) signal at locus j typically representing the mean diploid signal.

3.2.7 Reference signals

In this article, we consider the case of estimating the reference signals formula from a pool of samples. Consider locus j. Our objective is to estimate the mean μ2_j_ of the diploid signal for this locus such that the estimate has low variance and low bias. Furthermore, consider the set of samples that are truly diploid at this locus. An estimate is

formula

(13)

where the average is taken over the set of truly diploid samples. Note, we do not know exactly what samples are diploid at locus j, but instead we assume that the majority of the samples are diploid, and rely on the robustness of the estimator. Among the methods used for comparison (Section 3.4), CNAT uses Equation (13), dChip uses a trimmed sample mean, and CNAG uses the sample mean and robustifies by optimizing the selection of diploid samples (Nannya et al., 2005).

For loci on ChrX, the above estimate will be biased if male samples are included. For this reason, both CNAT and CNAG exclude males when estimating the means on ChrX. In dChip this is optional. When excluding non-diploid samples, the estimates formula have larger variances for loci on ChrX than on other chromosomes. An alternative is to utilize non-diploid samples as follows. Let

formula

(14)

be the robust average across all non-diploid samples. Then the alternative estimate of the diploid mean is

formula

(15)

where

formula

(16)

Equations (14)–(16) are applicable to all chromosomes, as well as to cases where some samples are known to be non-diploid, or where the ploidies of some samples are unknown. The approach may also be extended to handle shorter regions with known aberrations but also regions of CN polymorphism (Redon et al., 2006; Sebat et al., 2004; Iafrate et al., 2004).

3.3 Implementation

Not only does the number of loci investigated grow fast, but also the number of samples. Whilst smaller projects still involve 1 to 10s of samples, the larger projects now involve 1000s of samples. In order to handle such large data sets in everyday high-throughput analysis, as well as for developing new methods and evaluating them against existing ones, a new computational framework is needed. We have developed an open-source cross-platform environment in R (R Development Core Team, 2007) named aroma.affymetrix (http://www.braju.com/R/aroma.affymetrix/) that enables us to analyze everything from 1 to 1000s of Affymetrix arrays within virtually any given memory constraints. The CRMA method was implemented as part of this framework.

3.4 Raw copy numbers by other models

In order to assess the performance of CRMA, we compared it with dChip (Li and Wong, 2001), CNAT (Affymetrix, 2007), and CNAG (Nannya et al., 2005). For each method evaluated, we used the default settings.

dChip: For the dChip model, we used the dChip 2006 (Build: March 24, 2007; http://www.dchip.org/). Probe-level data was normalized using the invariant-set method (Li and Wong, 2001), allele signals for the PMs were summed, and the same for the MMs. Then MBEI scores (Li and Wong, 2001) (corresponding to {θ_ij_}) were estimated from {max(PM-MM, 1)}. The dChip software can calculate raw CNs from a pool of samples by using a trimmed-mean estimate. However, because there are no documented guidelines on how to choose the trimming parameter in [0, 1), we choose to export the MBEI scores to _R_ and use the more robust median for the estimates. For differences in using mean and median, see Section 4.4.

CNAT: For the CNAT v4 model, we used gtype-probeset-genotype v1.5.6 with options_—analysis quant-norm.sketch=50000.usepm=true,pm-only,sumz_ to obtain estimates of allele-specific chip effects {(θ_ijA_,θ_ijB_)} based on an allele-specific version of the log-additive model in Equation (7) fitted using median polish and applied to quantile normalized PM signals. The CNAT tool did not support using the pool of samples as a reference. Instead, a pooled reference was created by calculating the median allelic chip effect for each SNP across all samples and then using that as a common reference in a paired analysis (as suggested by Affymetrix in private communication). Then copynumber-pipeline v1.5.6_3 with options –workflow paired-copy-number –total-cn-only –gauss-bw 0 -cn cn_hmm_input_logSum0K.tsv was used to obtain raw CNs based on θ_ij_ = θ_ijA_ + θ_ijB_ that have been corrected for PCR fragment-length and GC-content effects. During the revision of this article, Affymetrix has released the Genotyping Console Software (GTC) v2.0 (Affymetrix Inc., 2007). That software uses the same CNAT v4 method as the above tool (private communcation with Affymetrix).

CNAG*: For the CNAG model, we used the CNAG v2 (Build: March 23, 2007; http://www.genome.umin.jp/). A non-paired test was used where the 60 samples was used as both test samples and reference samples, which was done by first creating an identical copy of the CEL set to be used as the reference set. This approach was taken because CNAG did not permit use of the pool of test samples as a reference set. Signals were extracted and the 60 test and 60 reference samples were labeled male or female. A ‘non-self analysis’ was used to calculate raw CNs. CNAG had to be forced to use all reference samples, because if allowing CNAG to choose an optimal subset of the reference samples, it would choose the single reference sample that is an identical copy of the test sample. This approach is slightly different from the one proposed by Nannya et al. (2005) and the reason for the asterisk.

3.5 Methods for evaluation

We assess the performance of the different methods by testing how well they can differentiate between one and two copies using raw or smoothed CN estimates. For this we use estimates from a set of females and males across the 5608 non-pseudoautosomal SNPs that are available on ChrX (located after position 2.8 Mb). For such a locus j, we assume that C ij = C i = 2 if sample i is a female, and C ij = C i = 1 if it is a male.

3.5.1 Single-locus classification

For a given locus j, we classify each sample i as formula = 2 or formula = 1 from raw CN estimates Mij by the rule

formula

(17)

where τ_j_ is an arbitrary threshold. The performance of this classification can be assessed by the Receiver Operator Characteristic (ROC) curve τ ↦ (β_j_(τ), γ_j_(τ)), which describes the relationship between the true-positive (TP) and false-positive (FP) at any given threshold. For a given τ, we estimate the TP rate β_j_ (τ) and the FP rate γ_j_(τ) from (C ij, formula (τ)) as

formula

(18)

where I I(·) is the indicator function. See Figure S1 (Supplementary Materials) for an example of ROC curves for two different loci.

Distribution of ROC curves across loci: If controlling for the FP rate, we can fix the FP rate at β_j_(τ_j_) = β for all loci and estimate the TP rates across loci j; {γ_j_(τ_j_): β_j_(τ_j_) = β}j. For an example, see Figure 4.

Distribution of TP rates on ChrX when controlling for FP rate (β = 0.0172) for CNAG*, CNAT, CRMA and dChip. The pattern remains for larger FP rates with smaller differences between methods.

Fig. 4.

Distribution of TP rates on ChrX when controlling for FP rate (β = 0.0172) for CNAG*, CNAT, CRMA and dChip. The pattern remains for larger FP rates with smaller differences between methods.

3.5.2 Multilocus classification with common threshold

We now assess how well individual estimators from a set of J* loci differentiate between one and two copies, on average, when all loci truly have the same copy number 1 or 2. This is entirely analogous to our discussion for single locus. Since we are only considering the case where all our loci have the same copy numbers, we can hope that our CN estimates also have this property. We evaluate the performance across loci using a common threshold, that is, we apply Rule (17) once more. For each set s = 1, … , S consisting of J* loci, we estimate the ROC curve formula for the multilocus case by

formula

(19)

We will consider the case where all non-pseudoautosomal loci on ChrX are included. See Figure 5 for results.

Performance of CRMA compared with CNAG* and CNAT, and dChip. Left: Full ROC curve. Right: Top-left corner of ROC curve.

Fig. 5.

Performance of CRMA compared with CNAG* and CNAT, and dChip. Left: Full ROC curve. Right: Top-left corner of ROC curve.

3.5.3 Resolution at given FP and TP rates

Consider any ensemble of S disjoint (non-overlapping) sets each with H consecutive loci. For each such set s = 1, … , S, we construct

formula

(20)

where index j_′ runs over the H loci in set s. Let the average distance between loci on the investigate chromosome be ρ_c. We can then assess the performance at resolution H · ρ_c_ by using Equation (19) to estimate ROC curves for { formula is}. Let (β(H)(τ), γ(H)(τ)) be the FP and TP rates for threshold τ. For ROC estimates of CRMA and dChip with H = 1, 2, 3, 4, see Figure 6. By constructing { formula is} with a mix of ⌊H_⌋ and ⌈_H ⌉ (the greatest/least integer less/greater than H) loci per window together with resampling techniques, we can obtain estimates of the average performance also for non-integer values on H. For an example using 1 ≤ H ≤ 7, see Figure 7.

Impact on the average ROC for CRMA (solid) and dChip (dot dash) when not averaging (H = 1; 27 kb distance), and when averaging over a window of H = 2 (54 kb), H = 3 (81 kb), and H = 4 (108 kb) loci.

Fig. 6.

Impact on the average ROC for CRMA (solid) and dChip (dot dash) when not averaging (H = 1; 27 kb distance), and when averaging over a window of H = 2 (54 kb), H = 3 (81 kb), and H = 4 (108 kb) loci.

TP rate as a function of resolution/smoothing at a 1.0% FP rate. In order of performance (low to high): CNAT, CNAG*, dChip, and CRMA. To achieve a 95% TP rate, CRMA has to use a window with on average 1.9 loci (average distance between window centers 52 kb), dChip 2.3 loci (62 kb), CNAG* 2.9 loci (78 kb), and CNAT 4.5 loci (121 kb).

Fig. 7.

4 RESULTS

In order for the results below to be less specific to an evaluation on ChrX alone, raw CNs are calculated using reference signals as in Equation (13) based on all samples. This will introduce a global bias in the ChrX estimates, but it has no effect on evaluation procedure proposed in Section 3.5. The effect of using the more efficient estimate in Equation (16) is evaluated in Section 4.4.

4.1 Improvements by various transformation steps

The effect of calibrating probe-pair signals for allelic crosstalk is illustrated in Figure 1. Before calibration, an allele A probe picks up signal from allele B, and vice versa. After calibration, much of this crosstalk is removed.

Note, for total CN estimates crosstalk that is perfectly symmetric between the two alleles will cancel out. This can be seen if using a symmetric crosstalk matrix in Equations (1)–(2) and then taking the sum in Equation (6). The crosstalk will not cancel out if it is asymmetric.

What is probably of greater importance is the correction for the offset in Equation (1). Offset in signals introduces intensity-dependent biases in log ratios (Bengtsson and Hössjer, 2006; Bengtsson et al., 2004). Such biases are of great concern in CN analysis, because they will add to the between-locus variability and because downstream segmentation methods assume no or equal bias for neighboring loci.

The crosstalk calibration method is applied to each array independently. Interestingly, this single-array method corrects for many of the systematic effects seen across arrays, such as differences in empirical distributions of probe signals. As shown in Figure 2, there are large differences before calibration, but afterward the distributions are approximately equal. As explained in detail in Bengtsson and Hössjer (2006) and Bengtsson (2004), observed differences in empirical distributions can often be explained by different offsets and different scale factors as modeled in Equations (1) and (2).

Distribution of PM probe log intensities of the 60 arrays before (left) and after (right) allelic-crosstalk calibration (ACC). ACC is a single-array calibration method.

Fig. 2.

Distribution of PM probe log intensities of the 60 arrays before (left) and after (right) allelic-crosstalk calibration (ACC). ACC is a single-array calibration method.

Note, although quantile normalization [Equations (4) and (5)] corrects for differences in distributions, it does not remove offset. It is reasonable to believe that this is the reason for seeing a better separation between one and two copies in the raw estimates when using allelic crosstalk alone compared with using quantile normalization alone, cf. left panel of Figure 3. The fact that the performance goes down when we apply quantile normalization following crosstalk calibration may be explained by the fact we are overfitting the data, at least for this particular data set. This is another reason why we suggest only using quantile normalization when large and unexpected differences in distributions remain after crosstalk calibration. Here we do not apply quantile normalization.

Fig. 3.

The impact of various CRMA steps on the average ChrX ROCs in order of performance (low to high). Left: Pre- and postprocessing using (1) quantile normalization (QN) alone, (2) allelic-crosstalk calibration (ACC) alone and (3) ACC followed by PCR fragment-length normalization (ACC+FLN). Right: Calculating ChrX reference signals (based on ACC+FLN), using (1) diploid samples only (diploid), (2) diploid and non-diploid samples with bias correction (adjusted) and (3) all samples without bias correction (all).

The existence of differences in PCR-effects across arrays affects the overall ROC performance. When correcting for PCR-fragment length effects, the TP rate increases 1% unit at a 7% FP rate (2% unit at a 4% FP rate), cf. left panel of Figure 3. We see no or even a negative effect when doing a GC-content normalization (not shown), why we suggest excluding the GC-contents function in Equation (11) unless a strong effect is observed.

4.2 Performance of different methods

The distributions of TP rates [γ(·)] for the different methods, while fixing the FP rate at β(·) = β = 1/2 · 1/29 = 0.0172 as described at the end of Section 3.5.1, are depicted in Figure 4. CRMA has a greater set of loci with high TP rates than other methods have. There is a notable increase of SNPs with very low TP rates. This may be due to some regions on ChrX being homologous to regions on chrY, but also due to misannotation or the particular model being used. For instance, the overall performance of CNAT is penalized from approximately 10% poorly performing SNPs, which we suspect is because it fits the two alleles seperately adding extra variance, especially for homozygote SNPs.

From the average ROC across all ChrX loci, we get an estimate of how well on average the different methods can separate one from two copies (based on a random locus). In Figure 5, the average ROC according to Equation (19) is shown for the four methods. We see that both CRMA and dChip outperform CNAG* and CNAT by obtaining higher TP rates at any given FP rate, and that CRMA gets slightly higher TP rates than dChip for FP rates less than 7%. See also the TP rates for the single-locus resolution in Table S1.

4.3 Gain in TP rate & loss in resolution by smoothing

By averaging raw CN estimates across a set of loci as in Equation (20), we obtain more sparse but also more precise CN estimates. As a result of the greater precision, the TP rate of Rule (17) increases at any given FP rate. This is depicted in Figure 6. Note that we neither imply that Equation (20) is optimal for smoothing raw CNs nor that raw CNs should be smoothed before passing them to the segmentation method. We use it solely for comparing methods at various amounts of smoothing.

As described in Section 3.5.3, by mixing differently-sized (non-overlapping) windows, we can obtain ROC estimates for any resolution. Thus, by fixing the FP rate, we can assess the performance as a function of resolution (in kb) for each of the methods evaluated. At a FP rate of 1.0%, the TP rate of CRMA is greater than dChip, CNAG* and CNAT at any given resolution. For TP rates for the various methods at certain resolutions, see Table S1 as well as Figure 7.

Furthermore, using the above mixing approach, it is possible to find the amount of averaging required to obtain a certain TP rate at any given FP rate. For instance, based on hybridizations on the Mapping250K_Nsp chip type alone, using CRMA it is possible to obtain a 95% TP rate at a 1.0% FP rate at a 52 kb resolution. For dChip, CNAG*, and CNAT the same is obtained at a 62, 78, and 121 kb resolution, respectively. See also Table S2.

4.4 Pooled diploid and non-diploid reference set

In the above evaluation, no distinction was made between female and male samples when estimating the reference signals from the pool of samples. This was done in order to not restrict the comparison to ChrX alone, but also to make the comparison between methods more fair. For ChrX loci we can obtain estimates of the diploid reference levels with greater precision by including male samples in the reference model, see Section 3.2.7. The impact of this varies with the ratio of males to females in the data set studied (here 29 females and 30 males). Thus, if using only female samples to estimate the reference level [Equation (13)], more than half of the data is discarded resulting in roughly twice the variance. This great loss in precision result in significant lower TP rates as depicted in the right panel of Figure 3. For instance, at a 5.0% FP rate, there is a decrease from 94.0% to 91.5% in TP rate when excluding male samples. By using the proposed improvement for calculating reference levels [Equations (13)–(16)] we obtain almost the same performance as if all samples were females.

Finally, in Equations (13)–(16) the median is used for estimating the reference signals instead of the more efficient sample mean. For the data set tested, the result differ only marginally when using the sample mean (not shown). For this reason, we recommend to use the median because of its robustness against outliers and model errors.

5 DISCUSSION

In the above evaluation, CNAT performs significantly worse than other methods. This may be explained by the fact that CNAT first estimates the chip effects for the two alleles separately and then sums them to obtain an estimate of the total amount of target DNA. Because the CNAT model operates on the logarithmic scale there will be a large variance in formula for a SNP with genotype BB, because on the intensity scale the probe signals for allele A are close to zero, and vice versa for allele B. This variance is likely to dominate the total CN estimate. See also Section 3.2.3. Furthermore, because of inflexibilities in CNAT, the reference signals had to be calculated for each allele separately, and then added. This could in addition contribute to the weaker performance.

Carvalho et al. (2007) show that by estimating chip effects for the two strands separately, the overall concordance rate for genotype calls are better. We investigated the same idea for total CN analysis by fitting one model for each strand and then averaging the two estimated chip effects (both on the intensity scale and the logarithmic scale). The result was that the ROC performance went down significantly (not shown). A reason for this may be that the allelic-crosstalk calibration removes most of the between strand effects and that when stratifying by strand more parameters are estimated based on fewer data points resulting in an increase in overall variances. If not controlling for allelic crosstalk, the systematic effect between strands are larger, which may explain why Carvalho et al. (2007) reports a gain for genotyping using an alternative approach.

Analysis of loss-of-heterozygousity (LOH) is known to provide information on copy-number losses, and can confirm or strengthen the confidence in CN regions identified by segmentation method. More generally, the genotype dimension, that is, the ratio of allele A and allele B signals, or flavors thereof, carries information on CN aberrations. In Peiffer et al. (2006), the authors use an estimate called ‘Allele Frequency’, which in our terms corresponds roughly to θ_ijB_/(θ_ijA_ + θ_ijB_), in order to confirm regions of loss and gain. A similar approach was adopted in the aforementioned GTC software, where Affymetrix present the ‘Allelic Difference’ score log2(θ_ijA_/θ_ijB_). In copy-neutral regions, these ‘raw genotyping scores’ will present themselves as a mixture of three (AA, AB and BB) distributions along the genome. If there is a deletion, there will be a mixture of two (A and B) distributions, and if there is a gain, there will be a mixture of four (AAA, AAB, ABB and BBB), and so on. These scores allow us to identify different classes of gains, but also copy-neutral aberrations such as uniparental disomy. In cases of heterogeneous DNA extracts, the structure will be more complicated as there will be a mixture of mixtures.

The above is beyond the problem of estimating raw total and/or allele-specific CNs as it requires making use of the genomic locations of the SNPs. Here we have deliberately left such ideas to the various downstream methods which each may target a different question. Given optimal raw CN and raw genotyping estimates one can for instance elaborate on methods where segmentation and genotyping calls are combined in a single- or a multi-array model. Having said this, we still want to emphasize that, if the preprocessing fails to control for all systematic effects, such as crosstalk between alleles, the raw genotypes and the raw CNs will not be orthogonal. In such cases, the accuracy of the raw CN estimates can be improved further by incorporating the raw genotypes but still without utilizing genomic locations. Related to the above, we have found that more precise total CNs can be obtained if estimated directly and not via allele-specific estimates, hence Equation (6). We are working on a method for estimating the allele-specific CNs given an estimate of the total CN.

6 CONCLUSION

In this article, we have proposed a method that, compared with popular existing methods, produces single-locus estimates that better differentiate between one and two copies on ChrX. One advantage of using ChrX data for the evaluation is that it can be applied to any data set consisting of male and female samples. As we find it to be a simple and readily accessible evaluation method for raw CNs, we propose its use in any CN analysis.

The CRMA method was designed with both paired and non-paired analysis in mind as well as analysis of other than germline data, e.g. tumor data. However, the performance of CRMA in combination with downstream segmentation methods still has to be investigated, especially in cases where the underlying assumption in the preprocessing method might no longer hold, e.g. in late-stage tumors with a very large number of genomic aberrations. We look forward to comparison on other kinds of data in order to determine the extent to which the presented conclusions apply to other chromosomes and other types of CN changes.

Finally, we would like to emphasize that parts of the CRMA method are likely to apply to the new SNP & CN chip types, e.g. the calibration of allelic crosstalk in SNPs. However, we also believe that other parts need modifications to deal with data from these new chip types. Moreover, since most of the NP units on these chips are single-probe units, the current approach where a PLM is fitted robustly for each unit can no longer be used, and other means for dealing with outliers must be found. We are currently developing such methods.

ACKNOWLEDGEMENTS

We wish to thank P. Wirapati (ISREC) for help on the simplex-fitting algorithm and suggesting the two-pass quantile normalization approach, K. Simpson (WEHI) for significant contributions to aroma.affymetrix, J. Bullard & K. Hansen (UC Berkeley) for affxparser, and B. Bolstad (Affymetrix) for modifying affyPLM according to our needs. We also thank the reviewers for valuable feedback helping improve this text. H.B. was supported by grants from the Wenner-Gren Foundation, the American-Scandinavian Foundation, and the Solander Foundation.

Conflict of Interest: none declared.

REFERENCES

Affymetrix

CNAT 4.0: Copy number and loss of heterozygosity estimation algorithms for the GeneChip human mapping 10/50/100/250/500K array set

2007

Technical report, Affymetrix

Affymetrix Inc

Affymetrix Genotyping Console 2.0 – User Manual.

2007

Affymetrix Inc

et al.

A haplotype map of the human genome

Nature

2005

, vol.

437

(pg.

1299

1320

)

Low-level analysis of microarray data.

2004

PhD Thesis, Centre for Mathematical Sciences, Division of Mathematical Statistics, Lund University

Methodological study of affine transformations of gene expression data with proposed robust non-parametric multi-dimensional normalization method

BMC Bioinformatics

2006

, vol.

pg.

100

et al.

Calibration and assessment of channel-specific biases in microarray data with extended dynamical range

BMC Bioinfo

2004

, vol.

pg.

177

et al.

A comparison of normalization methods for high density oligonucleotide array data based on variance and bias

Bioinformatics

2003

, vol.

(pg.

185

)

Low-level analysis of high-density oligonucleotide array data: background, normalization and summarization.

2004

Berkeley

PhD Thesis, University of California

affyPLM: methods for fitting probe-level models

2007

et al.

Exploration, normalization, and genotype calls of high density oligonucleotide SNP array data

Biostatistics

2007

, vol.

(pg.

485

499

)

et al.

Detection of large-scale variation in the human genome

Nature Genet

2004

, vol.

(pg.

949

951

)

et al.

Summaries of Affymetrix GeneChip probe level data

Nucleic Acids Res

2003

, vol.

pg.

e15

et al.

Allelic dosage analysis with genotyping microarrays

Biochem. Biophys. Res. Commun

2005

, vol.

333

(pg.

1309

1314

)

Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection

Proc. Natl. Acad. Sci. USA

2001

, vol.

(pg.

)

dchip

et al.

A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays

Cancer Res

2005

, vol.

(pg.

6071

6079

)

et al.

High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping

Genome Res

2006

, vol.

(pg.

1136

1148

)

R Development Core Team

R: A language and environment for statistical computing.

2007

Vienna, Austria

R Foundation for Statistical Computing

ISBN 3-900051-07-0

et al.

Global variation in copy number in the human genome

Nature

2006

, vol.

444

(pg.

444

454

)

et al.

Large-scale copy number polymorphism in the human genome

Science

2004

, vol.

305

(pg.

525

)

The International HapMap Consortium

The International HapMap Project

Nature

2003

, vol.

426

(pg.

789

796

)

et al.

Analysis and visualization of chromosomal abnormalities in SNP data with SNPscan

BMC Bioinfo

2006

, vol.

pg.

Fitting polyhedrial cones and simplices to multivariate data points

2001

Draft

An expression index for Affymetrix GeneChips based on the generalized logarithm

Bioinformatics

2005

, vol.

(pg.

3983

3989

)

Author notes

Associate Editor: Alfonso Valencia

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.