Comparing CNV detection methods for SNP arrays

Abstract

Data from whole genome association studies can now be used for dual purposes, genotyping and copy number detection. In this review we discuss some of the methods for using SNP data to detect copy number events. We examine a number of algorithms designed to detect copy number changes through the use of signal-intensity data and consider methods to evaluate the changes found. We describe the use of several statistical models in copy number detection in germline samples. We also present a comparison of data using these methods to assess accuracy of prediction and detection of changes in copy number.

INTRODUCTION

Structural variation in the human genome has been intensely studied in recent years [1–5]. Publications have linked rare copy number variations (CNVs) to certain diseases, and much work has also been done to study copy number polymorphisms (CNPs) in the population, their contribution to structural variation and their possible association with complex disease. Multiple methods for the detection of these structural variants exist [6, 7], but here we focus on methods designed to interpret results from SNP arrays.

The most prominent SNP array types are available from the commercial vendors Affymetrix and Illumina. Both companies sell competing arrays and continue to offer increased coverage for detecting copy number events and performing SNP assays simultaneously. The assay techniques for the arrays differ [8, 9], but the signal-intensity output from both platforms presents similar analysis and interpretation problems.

Successful application of these technologies has yielded a number of interesting individual CNVs with relationships to complex disease. For example, rare CNVs have been linked to schizophrenia [10] in a study where microdeletions and duplications were shown to be responsible for disrupting genes involved in neurodevelopment. The UGT2B17 gene on Chromosome 4q13.2 was linked to osteoporosis in a case-control study of 727 CNV regions in a Chinese sample set [11].

One approach to copy number event detection has been to investigate common events. The study by McCarroll et al. [12] characterized deletion variants in the genome, while Redon et al. [2] mapped the location of events found in multiple samples. Information about identified copy number events is recorded in databases such as the Database of Genomic Variants (DGV) [1]. Using this prior information about CNP location, we can investigate copy number events much as we would use SNP information in genotyping: known CNPs can be genotyped in case–control populations with methods similar to those of the SNP-based association study. Given the diversity of approaches and analysis options, it is important to choose the method best suited to the particular experimental needs. This review presents methods suggested for the analysis of germline CNVs, including both CNP analysis and the detection of rare CNVs.

CNV DISCOVERY AND DETECTION USING SNP CHIPS

The use of SNP arrays in copy number event detection has a number of advantages. As well as serving two applications, SNP genotyping and copy number analysis, the arrays have other features that promote their use over other techniques. SNP arrays use less sample per experiment than techniques such as comparative genomic hybridization (CGH) arrays. Cost is also an important factor in the selection of a method: the SNP array is a cost-effective technique that allows the user to increase the number of samples tested on a limited budget. Although advances in high-throughput sequencing technology have made copy number discovery much easier, the application of known CNP information means that we can target structural variation in a sample using cheaper techniques such as the SNP array without a large reduction in genome-wide coverage.

One important consideration, however, is the bias of the SNP chip coverage towards known CNVs [13]. Historically, when SNPs were selected for genotyping arrays, certain factors were considered which may decrease the number of copy number variants or polymorphisms typed [14]. Studies have found CNPs to be most common in regions containing high levels of segmental duplication [2], which are areas of low SNP coverage compared to other areas of the genome because of the difficulties of assay design and implementation. Common CNPs may cause assays to fail standard inheritance checks and Hardy–Weinberg tests. For example, where the father is (A, B) and the mother is (B, −), with − denoting a deleted allele, the child could be (A, B), (A, −), (B, B) or (B, −). In SNP genotyping results, however, the mother would be called (B, B) and the child would be called (A, B), (A, A) or (B, B). If the child is really (A, −), the resulting (A, A) call would seem to violate Mendelian inheritance patterns and would often cause the SNP to be rejected.
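The inheritance example above can be checked mechanically. The following is a minimal sketch and is not taken from any genotyping package: the function names, and the rule that a hemizygous genotype is reported as a homozygote, are our own illustrative assumptions.

```python
from itertools import product

NULL = "-"  # stands for a deleted (null) allele

def apparent_call(genotype):
    """Call a standard SNP genotyper might make (assumed rule):
    the null allele is invisible, so a hemizygote looks homozygous."""
    observed = [allele for allele in genotype if allele != NULL]
    if not observed:
        return None  # homozygous deletion: no call at all
    if len(observed) == 1:
        observed = observed * 2
    return tuple(sorted(observed))

def possible_children(father, mother):
    """All true genotypes a child can inherit, one allele from each parent."""
    return sorted({tuple(sorted(pair)) for pair in product(father, mother)})

def mendelian_consistent(child_call, father_call, mother_call):
    """True if the child's called alleles can be split between the parents."""
    first, second = child_call
    return ((first in father_call and second in mother_call)
            or (second in father_call and first in mother_call))

father = ("A", "B")
mother = ("B", NULL)
father_call = apparent_call(father)
mother_call = apparent_call(mother)

# Map each true child genotype to whether its apparent call passes the check.
results = {}
for child in possible_children(father, mother):
    call = apparent_call(child)
    results[child] = (call, mendelian_consistent(call, father_call, mother_call))

for child, (call, ok) in results.items():
    print(child, "called as", call, "consistent" if ok else "apparent violation")
```

Only the (A, −) child, called as (A, A), fails the check, which is the situation that leads to the SNP being rejected.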

Assays were also often selected and tested on the basis of their use in SNP genotyping, meaning that the final result may produce a noisy signal which, although it does not in itself affect the ability to genotype, is a major problem for accurate copy number detection. For instance, SNP data are typically standardized against a reference population in order to reduce the effect of factors including between-array variation and probe-specific hybridization effects. In doing so, normalization routines implicitly assume that all members (or the large majority) of the reference population have the same copy number but, at locations of common CNV, this assumption is clearly no longer appropriate. At these genomic locations, the process of SNP data normalization and the derivation of copy number estimates should be integrated for optimal performance and the correct derivation of normalization parameters.
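A toy calculation illustrates why the reference assumption matters. This is a sketch under our own assumptions: the intensities are invented, the reference value is taken as the population median, and no real normalization routine is reproduced.

```python
import math
import statistics

NORMAL_INTENSITY = 1000.0    # hypothetical signal for two copies
DELETION_INTENSITY = 500.0   # hypothetical signal for one copy

def log2_ratio(sample_intensity, reference_intensities):
    """Log2 ratio of a sample against the median of a reference population."""
    reference = statistics.median(reference_intensities)
    return math.log2(sample_intensity / reference)

# Rare deletion: 1 of 10 reference samples carries it.
rare_reference = [NORMAL_INTENSITY] * 9 + [DELETION_INTENSITY]
# Common CNP: 7 of 10 reference samples carry the deletion.
common_reference = [NORMAL_INTENSITY] * 3 + [DELETION_INTENSITY] * 7

rare_carrier = log2_ratio(DELETION_INTENSITY, rare_reference)
common_carrier = log2_ratio(DELETION_INTENSITY, common_reference)
common_noncarrier = log2_ratio(NORMAL_INTENSITY, common_reference)

print("rare deletion carrier:", rare_carrier)
print("common deletion carrier:", common_carrier)
print("two-copy sample at common CNP:", common_noncarrier)
```

With a rare deletion the carrier shows the expected negative ratio. At the common CNP the reference itself sits at the deleted level, so carriers appear normal and two-copy samples appear to carry a gain.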

Several of the newer array designs have taken copy number detection into account; for example, Illumina includes ‘unSNPable’ genome probes on some of its products. These markers were picked to cover events recorded in the Database of Genomic Variants (DGV) and some additional regions highlighted by experimental work. The Affymetrix SNP 6.0 chip was developed with the aim of assessing SNPs and CNVs simultaneously. McCarroll et al. [15] studied 270 HapMap samples to design probes for their hybrid array. With these changes in assay selection techniques the SNP array has become more appealing for copy number detection, and reliable interpretation of these results increases in importance.

ILLUMINA PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION

Illumina data can be initially viewed, checked and exported using the proprietary software BeadStudio. As well as the software's quality checking and genotype-calling functions it calculates a number of other values for the signal-intensity data. The normalized R value is used as a representation of intensity on individual SNP plots. The log R ratio value is then calculated from the expected normalized intensity of a sample and observed normalized intensity. The B allele frequency (BAF) is calculated from the difference between the expected position of the cluster group and the actual value. BAF and log R ratio are used by a number of the copy number event detection algorithms.
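The two derived values can be sketched as follows. This follows the commonly described formulation (a log2 ratio of observed to expected intensity, and a linear interpolation of the allelic angle between the canonical genotype clusters) rather than BeadStudio's own code, which we have not inspected; the function names and cluster positions are illustrative assumptions.

```python
import math

def log_r_ratio(r_observed, r_expected):
    """Log2 ratio of observed to expected normalized intensity."""
    return math.log2(r_observed / r_expected)

def b_allele_frequency(theta, theta_aa, theta_ab, theta_bb):
    """Interpolate the sample's allelic angle between the AA, AB and BB
    cluster positions, giving 0 at AA, 0.5 at AB and 1 at BB."""
    if theta <= theta_aa:
        return 0.0
    if theta >= theta_bb:
        return 1.0
    if theta < theta_ab:
        return 0.5 * (theta - theta_aa) / (theta_ab - theta_aa)
    return 0.5 + 0.5 * (theta - theta_ab) / (theta_bb - theta_ab)

# Hypothetical cluster positions for one SNP.
CLUSTERS = dict(theta_aa=0.1, theta_ab=0.5, theta_bb=0.9)

halved_signal = log_r_ratio(0.5, 1.0)            # intensity halved
het_baf = b_allele_frequency(0.5, **CLUSTERS)    # sits on the AB cluster
skewed_baf = b_allele_frequency(0.3, **CLUSTERS) # between AA and AB

print(halved_signal, het_baf, skewed_baf)
```

A negative log R ratio together with the loss of BAF values near 0.5 is the kind of joint pattern that the detection algorithms discussed below look for.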

Detection of copy number events within BeadStudio uses simple algorithms which can be run rapidly for an overview of larger events in a sample. The Loss of Heterozygosity (LOH) score is calculated using heterozygote frequency. The CNV partition plug-in uses the log R ratio and BAF and compares the data to 14 different Gaussian distribution models to assess copy number level. Values can be plotted in the Chromosome Browser allowing the user to compare predicted events with BAF or log R ratio at the location for event confirmation (Figure 1).

Figure 1: BeadStudio Chromosome Viewer. Image from BeadStudio Chromosome Browser showing copy number values for Sample NA10861. Chromosome 22 shown with an event at 23 999 142–24 239 255 confirmed by all statistics. CNV value produced by CNV Partition algorithm.

AFFYMETRIX PROPRIETARY SOFTWARE FOR COPY NUMBER DETECTION

Affymetrix SNP array data can be analysed with specially designed proprietary software. Within the Genotyping Console, samples are grouped into In Bounds (good samples) and Out of Bounds (problematic samples) after initial quality checks, and other quality control metrics allow the user to investigate probe mismatching and individual SNP clustering. LOH scores can be calculated, and the software contains a Chromosome Copy Number Analysis Tool (CNAT), which compares the experimental signal-intensity values against a reference set of data to evaluate copy number changes. Results are processed by the segment reporting tool to produce a basic output of the larger detected CNV events.

Tools for analysis of the different Affymetrix chip types vary, but the HumanGenomeSNP Array 6.0 utilizes two externally developed algorithms from the Birdsuite package [16], which dramatically improve detection. Birdseed is used for SNP genotyping and Canary genotypes the known CNPs on the chip. Each CNP has a number of targeted probes; data from these are summarized and then compared to a reference set to produce the final call. Results can be viewed in the Integrated Genome Browser (IGB) (Figure 2).

Figure 2: Genotyping Console Genome Viewer. Image from Affymetrix Genotyping Console showing sample NA10861. Event on chromosome 22 confirmed by CNAT algorithm (third plot) and the segmentation report (red mark) showing the single event.

HIDDEN MARKOV MODELS (HMMs) IN COPY NUMBER EVENT DETECTION

Limitations of available copy number analyses within proprietary software led to the use of other methods to analyse data. The HMM assumes that observed intensities are related to an unobserved copy number state at each locus via an emission distribution (often assumed to be Gaussian). The copy number states are assumed to have a dependence structure such that neighbouring loci are assumed to have similar copy number states. Transitions between copy number states are determined by a transition matrix which describes the probability of moving from one state to another. The probabilistic structure of the HMM allows parameters in the model to be efficiently learnt from data, in both Bayesian and non-Bayesian frameworks, by using dynamic programming-based algorithms, such as the expectation maximization (EM) algorithm. When applied to event detection each copy number possibility is assigned a state and the Viterbi algorithm is used to predict the state for each observation value.
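The decoding step can be illustrated with a small Viterbi implementation. This is a minimal sketch, not the model used by any of the packages reviewed here: the three states, their mean log R ratios, the shared standard deviation and the probability of staying in a state are all invented for illustration, and only the log R ratio is modelled.

```python
import math

def gaussian_logpdf(x, mean, sd):
    """Log density of a Gaussian emission distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(observations, state_means, sd, stay_prob):
    """Most probable state path for a Gaussian-emission HMM whose transition
    matrix has stay_prob on the diagonal and equal mass elsewhere."""
    n_states = len(state_means)
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_states - 1))
    log_init = -math.log(n_states)  # uniform initial distribution

    score = [log_init + gaussian_logpdf(observations[0], m, sd) for m in state_means]
    backpointers = []
    for x in observations[1:]:
        new_score, pointers = [], []
        for j, mean in enumerate(state_means):
            candidates = [score[i] + (log_stay if i == j else log_move)
                          for i in range(n_states)]
            best = max(range(n_states), key=candidates.__getitem__)
            pointers.append(best)
            new_score.append(candidates[best] + gaussian_logpdf(x, mean, sd))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=score.__getitem__)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

COPY_NUMBERS = [1, 2, 3]       # hidden states
STATE_MEANS = [-0.6, 0.0, 0.4]  # hypothetical mean log R ratio per state

log_r = [0.02, -0.05, 0.01, -0.58, -0.65, -0.61, 0.03, 0.0]
path = viterbi(log_r, STATE_MEANS, sd=0.15, stay_prob=0.99)
calls = [COPY_NUMBERS[s] for s in path]
print(calls)
```

The high probability of staying in a state encodes the assumption that neighbouring loci share a copy number, so a run of low values is called as one deletion while isolated noisy points are absorbed.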

With prior knowledge of modelling statistics there are a multitude of options for copy number detection. HMMSeg [17] is a command line operated algorithm that is designed to apply HMM to genomic data. Application of correct modelling procedures is not an obvious process to non-statisticians. For these reasons software has been developed which allows guided application of these types of advanced methods.

GUIDED APPLICATION OF THE HMM

A number of solutions for guided, accurate CNV detection in SNP array data have been published, but these are often platform specific. QuantiSNP [18] and PennCNV [19] are academically developed and freely available for prediction purposes. They use the HMM and assist users in applying it to their own data. The standard output from these tools is a list of detected events and brief summary statistics used for quality checking. Checking the quality of data is extremely important for accurate event prediction. Data with high signal noise often cause false positive predictions, and stringent checks at this stage are highly recommended to eliminate any problem data. Signal noise is a strong limitation, particularly with samples prepared by whole genome amplification. Output from QuantiSNP allows the user to plot averages and standard deviations for BAF by chromosome or sample to show outliers (Figure 3). PennCNV has a detailed set of guidelines for identifying and rejecting problem data on the software's support website. Both can be run from the command line or integrated into Illumina's BeadStudio as a plug-in, and each has unique features to recommend it.

Figure 3: Graphical representation of quality control data from the PennCNV and QuantiSNP algorithms. It is important to use quality control (QC) data from the algorithms to eliminate problem samples which would not be found during standard genotyping analysis. The plot shows the BAF score for each chromosome from the analysis of sample NA10861; chromosomes 4 and X are outliers. Values produced by the PennCNV log file are also shown. NB: values shown relate to the Illumina 1M Duo array.
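A per-chromosome check of this kind can be sketched in a few lines. The code below is an illustration only: it does not read QuantiSNP or PennCNV output, the BAF values are invented and restricted to heterozygous markers, and the outlier rule (more than three median absolute deviations from the median standard deviation) is our own assumption rather than a published threshold.

```python
import statistics

def baf_summary(baf_by_chromosome):
    """Mean and standard deviation of BAF for each chromosome."""
    return {chrom: (statistics.mean(values), statistics.pstdev(values))
            for chrom, values in baf_by_chromosome.items()}

def flag_outliers(summary, n_mads=3.0):
    """Chromosomes whose BAF standard deviation lies far from the median
    standard deviation, measured in median absolute deviations (MADs)."""
    sds = [sd for _, sd in summary.values()]
    centre = statistics.median(sds)
    mad = statistics.median([abs(sd - centre) for sd in sds])
    if mad == 0:
        return sorted(c for c, (_, sd) in summary.items() if sd != centre)
    return sorted(c for c, (_, sd) in summary.items()
                  if abs(sd - centre) / mad > n_mads)

# Hypothetical BAF values at heterozygous markers for one sample.
sample_baf = {
    "1": [0.48, 0.52, 0.50, 0.49, 0.51],
    "2": [0.47, 0.53, 0.50, 0.49, 0.51],
    "3": [0.48, 0.52, 0.50, 0.48, 0.52],
    "4": [0.30, 0.70, 0.50, 0.35, 0.65],
    "5": [0.49, 0.51, 0.50, 0.48, 0.52],
}

summary = baf_summary(sample_baf)
outliers = flag_outliers(summary)
print(outliers)
```

In practice the same summary would be computed per sample as well as per chromosome, and flagged data inspected before any events are accepted.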

The QuantiSNP output gives a log Bayes factor with each prediction, which allows the user to rank events in order of likelihood and to place their own cut-off on acceptable events. Users can modify parameters to suit their own dataset; for example, changing the length parameter can allow more accurate detection of different-sized events for a particular sample set. Later versions of QuantiSNP have increased flexibility for data other than the standard Illumina Infinium array: they can be used to process Affymetrix data and have proven accuracy on Illumina GoldenGate data [20] where SNP coverage is suitable.

PennCNV has a number of downstream analysis options. The most important to highlight is the use of family trio data in analysis [21], which allows easier detection of events novel to probands. PennCNV also integrates a pipeline for Affymetrix data analysis. The package includes a number of options for further analysis of event results, such as a script to compare events to known gene libraries, or scripts to change the output format to suit a viewer such as BeadStudio's Chromosome Browser or the web-based UCSC genome browser (http://www.genome.ucsc.edu/).

Dchip SNP [22] was originally developed for Affymetrix data but has been modified to allow the viewing of Illumina data. It produces an LOH score which can be plotted against chromosome but its functions are best suited to the Affymetrix platform generated values, in particular, the quality control options. The software also has options to carry out paired analysis for cancer data; major copy proportion analysis [22] uses HMM to analyse tumour samples.

APPLYING APPROACHES ORIGINALLY USED IN ARRAYCGH

A number of methods for copy number event detection were originally developed for arrayCGH analysis but have been modified for SNP array analysis. The Circular Binary Segmentation (CBS) [23] algorithm is one such method. It was designed to convert noisy intensity values into regions of equal copy number. The algorithm continues to divide a region into segments for as long as it can find a segment that differs from the neighbouring region. This change-point detection is designed to identify all the places which partition the chromosome into segments of the same copy number. An addition to the binary segmentation algorithm was made to allow a single change inside a large segment to be defined: segment ends are joined to form a circle, allowing a further likelihood ratio test of whether the enclosed arc has a different mean from the remainder. Final segments are then given a cluster value, which is the median log-ratio value of the probes within the region, and this value is used to define the copy number status.
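The recursive splitting described above can be sketched as follows. This is a simplified illustration rather than the published method: the arc statistic is a two-sample mean comparison, and significance is judged against a fixed threshold with an assumed noise standard deviation, whereas CBS itself uses a permutation-based reference distribution. All names and values are hypothetical.

```python
import math
import statistics

def best_arc(values):
    """Find the arc values[i:j] whose mean differs most from the rest of
    the (conceptually circular) segment. Returns (statistic, i, j)."""
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    total = prefix[-1]
    best = (0.0, 0, n)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the arc must leave something outside it
            arc_sum = prefix[j] - prefix[i]
            diff = arc_sum / k - (total - arc_sum) / (n - k)
            stat = abs(diff) / math.sqrt(1.0 / k + 1.0 / (n - k))
            if stat > best[0]:
                best = (stat, i, j)
    return best

def segment(values, noise_sd, threshold, offset=0):
    """Split recursively while an arc stands out; return (start, end) pairs."""
    n = len(values)
    if n < 2:
        return [(offset, offset + n)]
    stat, i, j = best_arc(values)
    if stat / noise_sd < threshold:
        return [(offset, offset + n)]
    pieces = []
    for start, end in ((0, i), (i, j), (j, n)):
        if end > start:
            pieces.extend(segment(values[start:end], noise_sd, threshold,
                                  offset + start))
    return pieces

log_ratios = [0.0, 0.05, -0.05, 0.0, -0.6, -0.55, -0.65, -0.6,
              0.0, 0.05, -0.05, 0.0]
segments = segment(log_ratios, noise_sd=0.05, threshold=4.0)
# Each segment's cluster value is the median log ratio of its probes.
cluster_values = [statistics.median(log_ratios[s:e]) for s, e in segments]
print(segments, cluster_values)
```

Because the arc may start and end anywhere, a change lying in the middle of a segment is found in one step, which plain binary segmentation would need two splits to isolate.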

An alternative to the CBS algorithm was developed by Pique-Regi et al. [24] and can now be applied to SNP arrays. The Genome Alteration Detection Algorithm (GADA) uses sparse Bayesian learning to predict CN changes. For our testing we used a package designed for use in the R environment, with helpful processing options and detailed instructions for Affymetrix and Illumina data. The advantage in speed of data processing was clear, and we were able to analyse data within a few minutes.

There are many other algorithms developed that could potentially be applied to SNP array data. Other reviews [6, 25] focused on the arrayCGH format present the reader with a variety of alternative options.

CNV DETECTION USING OTHER METHODS

Approaches that address CN event detection by other means are common in the literature. SNP conditional mixture modelling (SCIMM), developed by Cooper et al. [13], is based on the observation that samples with deletions appear to have unique signal-intensity clusters. They applied a mixture-likelihood clustering method within the R statistical package to identify deletions. A secondary algorithm (SCIMM-Search) was developed to help discover probes which detect copy number changes within an array dataset. The algorithms require knowledge of modelling techniques to carry out the analysis correctly.

The ITALICS [26] software focuses on the removal of unwanted effects found in Affymetrix data. Rigaill et al. developed ITALICS (Iterative and Alternative normaLIsation and Copy number calling for affymetrix Snp arrays) to remove probes with abnormal intensities. Each iteration of the algorithm estimates the biological signal and then uses multiple linear regressions to estimate the non-linear effects on the signal. The algorithm can be run in R and has the potential to analyse the Affymetrix Human mapping 500K, Genome Wide array 5.0 and 6.0 formats, but it was designed to process data from chip formats containing perfect match and mismatch probes.

COMMERCIALLY AVAILABLE SOFTWARE

The strength of the software packages available to purchase lies in a number of traits: the ability to combine data from other platforms for comparison, graphical user interfaces, integrated pipelines for analysis and workflows, optimized computational speed and technical support. These factors are all extremely useful to labs with limited or no bioinformatics core support. Unfortunately, commercial companies are limited in their use of some of the methods developed in the academic environment. They are often prevented from building user interfaces and other features around academic software by restrictions imposed by free software licences such as the GNU General Public Licence, and so may lack access to the latest methods.

For our own purposes, we have chosen to look in detail at the Nexus Biodiscovery software. This uses the rank segmentation approach for detection. This approach is based on CBS but has been modified to increase speed of processing. It can be used for Affymetrix, arrayCGH or Illumina data and although weaker for Illumina event detection is an extremely useful tool for practically trained scientists.

COMBINING COPY NUMBER PREDICTION AND GENOTYPING

The copy number detection approaches described thus far look only at a single aspect of the data. The Birdsuite set developed by Korn et al. [16] combines SNP genotyping and copy number detection as well as independently genotyping common CNPs. It uses four different methods to analyse an Affymetrix dataset. The Canary algorithm, which genotypes common CNPs, and Birdseed, which carries out SNP genotyping, are included in the Affymetrix Genotyping Console. Birdseye is used to discover rare CNVs; it uses the HMM to identify and assess previously unknown CNVs in the data. Fawkes is the final stage of Birdsuite and merges the results from the other three stages. Combining data in this way gives a more complete picture of structural variation in a sample and allows the user to proceed with a single stage of association analysis with increased coverage of the data. Korn et al. compared their software to commercially available algorithms, including Nexus, and report higher detection rates for Birdsuite.

Franke et al. [27] have also presented a combined approach which focuses on single SNP interpretation. TriTyper uses maximum likelihood estimation to detect deletions in Illumina SNP data in unrelated samples. It incorporates an extra null allele into its genotyping clusters and uses deviations from the HWE as an indicator of when to use triallelic genotyping. It can also use neighbouring SNP data to impute the success of the caller which increases the accuracy of the output.
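The Hardy–Weinberg indicator can be illustrated with the standard one-degree-of-freedom chi-square test. This is a sketch of the general test only, not TriTyper's implementation, which we have not reproduced; the genotype counts are invented, and the function assumes both alleles are observed in the sample.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 degree of freedom) of called genotype counts
    against Hardy-Weinberg proportions. Returns (chi2, p_value)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A among calls
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Counts in equilibrium, and counts with an excess of apparent homozygotes,
# as expected when hemizygous deletion carriers are called AA or BB.
balanced = hwe_chi_square(250, 500, 250)
with_null_allele = hwe_chi_square(300, 400, 300)
print(balanced, with_null_allele)
```

A strong deficit of heterozygotes like the second example is the kind of signal that could prompt a switch to triallelic genotyping with a null allele.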

COMPARING THE DETECTION ALGORITHMS

There is a large variety of algorithms and software available for copy number event detection. Table 1 shows a summary of the software discussed in this review. A number of these software packages have been tested during the review and a brief synopsis of the results is presented here.

Table 1: Summary of SNP array detection algorithms

Software	Platform	Related publication	Details	Strengths	Weaknesses
Birdsuite (Birdseye and Canary)	Affymetrix	[15]	Combined tool set to genotype SNPs & CNPs	Unique approach, single association of SNPs and CN	Availability limited to Affymetrix data
CNAT	Affymetrix	Technical notes	Proprietary—run in Genome Console	Integral part of Genome Console	Accuracy of event prediction (missed events)
CNVPartition 1.2.1	Illumina	Technical notes	Proprietary—run in BeadStudio	Integral part of BeadStudio	Accuracy of event prediction (missed events)
Dchip SNP	Affymetrix or Illumina	[22]	Stand alone software	Free viewer for all data	Limited applications for Illumina data
GADA	Affymetrix or Illumina	[24]	Model uses Sparse Bayesian Learning	Speed of processing and application within R	Accuracy on Illumina weaker
HMMSeg	Multiple	[17]	HMM application tool to any genomic data	Flexibility to any dataset	Statistical knowledge required for correct use; not CN specific
ITALICS	Affymetrix	[26]	R package for normalization and CN detection in Affymetrix data	Focus on removal of non-relevant effects	Designed to work on Affymetrix 100K + 500K chip (MM probe format)
Nexus Biodiscovery	Multiple	[23]	Commercial segmentation detection tool	Allows combined data from different platforms; integrated viewer	Freeware alternatives are available
PennCNV	Illumina or Affymetrix	[19]	Perl script based	Multiple downstream tools for output	No way of ranking events by likelihood
QuantiSNP	Illumina or Affymetrix	[18]	HMM; PC or Linux command line	Bayes factor score for events, flexibility of run parameters	Limited support for further event analysis
SCIMM and SCIMM-Search	Illumina	[13]	Modelling algorithm applied in R	High detection rates compared to sequence data	Statistical knowledge required for correct use
TriTyper	Illumina	[27]	Identify and genotype SNPs with null allele	Able to interpret single SNPs	Only genotypes deletions

Assessing Software

To assess the accuracy of the algorithms we compared our data to the results from a well-characterized sample. The sample NA12156 is the basis for our comparison (Table 2); it is from the HapMap collection and was sequenced for structural variation by Kidd et al. [28]. We have chosen to record the number of similar events between the software and the published data. We assume that outputs with low numbers of similar events have higher false positive rates; however, we have not experimentally validated the results. While no software is faultless, we found that at least 20% of the events detected by every algorithm were confirmed by Kidd et al. Of the overlapping detected events, 27% were found by more than one algorithm (Supplementary Table 1). Although some algorithms have a lower percentage of overlapping events, it is important to consider the number of events found as well as the proportion: 49% of PennCNV-detected events on the Affymetrix 6.0 array were confirmed, but other algorithms detected more confirmed events in total.

Table 2: Comparison of algorithms

Algorithm	Platform and array	Total of copy number events detected	Number of copy number events confirmed by Kidd et al. [28].
Birdsuite 1.5.5 (Birdseye & Canary)	Affymetrix 6.0	386	76 (20%)
CNAT (Genome Console 3.0.2)	Affymetrix 6.0	8	2 (25%)
GADA (R 0.7-5)	Affymetrix 6.0	546	128 (23%)
GADA (R 0.7-5)	Illumina 1M Duo	511	157 (31%)
PennCNV (2009Jan06)	Affymetrix 6.0	57	28 (49%)
PennCNV (2009Jan06)	Illumina 1M Duo	57	21 (37%)
QuantiSNP v2.0	Affymetrix 6.0	131	53 (41%)
QuantiSNP v1.1	Illumina 1M Duo	75	23 (31%)

Detected events from CEPH sample NA12156 are compared to events published in sequencing analysis by Kidd et al. [28]. Default parameters are used for each algorithm and any Y chromosome data was omitted. An overlap between software output and confirmed data by Kidd et al. is determined by comparing the start and end points of events. Details of events are shown in Supplementary Table 1. Percentage shows the number of confirmed CN events compared to the total detected by the algorithm.

We carried out a secondary comparison using the CEPH sample NA15510, which has been characterized in a number of publications [2, 7, 28]. Table 3 shows the variation in results between studies. Further investigation of event replication across studies is represented in the Venn diagrams (Figure 4). PennCNV and Illumina show similar patterns of overlap, although we note an increased similarity between the Korbel et al. data and the QuantiSNP output. We conclude that, although we found a difference between the events detected in our data and the published results, we found similar variation between different publications, suggesting that this is a problem in all comparisons and not unique to the algorithms we tested.

Table 3:

Overlap between events detected by SNP array algorithms using multiple publication data

Algorithm	Total events found in NA15510 by algorithm	Number of copy number events (Kidd) [28]	Number of copy number events (Korbel) [7]	Number of copy number events (Redon) [2]
Events in paper	-	299	466	219
CNVPartition 1.2.1	39	12 (4%)	22 (5%)	9 (4%)
GADA (R 0.7-5)	69	68 (23%)	85 (18%)	42 (19%)
PennCNV (2009Jan06)	81	18 (6%)	28 (6%)	30 (14%)
QuantiSNP v1.1	64	18 (6%)	41 (9%)	29 (13%)

Data from CEPH sample NA15510 on the Illumina 1M array are used to compare algorithms with each other and with published studies. Default parameters were used for each algorithm and Y chromosome data were omitted. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in the published data; multiple predictions by an algorithm for one published event were counted as one. The value in brackets shows the percentage of published events found by the algorithm. For GADA, although a high number of overlaps were found, this was due to the prediction of large events that included smaller events found by Kidd et al. and Korbel et al.
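
The counting rule described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: events are represented as hypothetical (chromosome, start, end) tuples with inclusive coordinates, the function names are our own, and the toy coordinates are invented rather than taken from any of the studies.

```python
from typing import List, Tuple

# Hypothetical event representation: (chromosome, start, end), inclusive coordinates.
Event = Tuple[str, int, int]


def overlaps(a: Event, b: Event) -> bool:
    """True if two events lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def published_events_found(published: List[Event], predicted: List[Event]) -> int:
    """Count published events overlapped by at least one prediction.

    Several predictions hitting the same published event still count once.
    """
    return sum(1 for p in published if any(overlaps(p, q) for q in predicted))


def percent_found(published: List[Event], predicted: List[Event]) -> float:
    """Percentage of published events found by the algorithm."""
    return 100.0 * published_events_found(published, predicted) / len(published)


# Toy data, invented for illustration only.
published = [("chr1", 100, 200), ("chr1", 500, 900), ("chr2", 50, 80)]
small_calls = [("chr1", 150, 160), ("chr1", 190, 210), ("chr1", 1000, 1100)]
one_large_call = [("chr1", 0, 1000)]

found_small = published_events_found(published, small_calls)
found_large = published_events_found(published, one_large_call)
```

In this toy case two small predictions fall on the same published event and count once, whereas a single large prediction covers two published events and counts twice, which is the effect noted above for GADA.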

Figure 4:

Venn diagrams comparing events for NA15510 between different studies. Visual representation of data from CEPH sample NA15510 on the Illumina 1M array, used to compare algorithms with other publications [2, 7, 28]. Default parameters were used for each algorithm and Y chromosome data were omitted from the count. Event lists from publications were generated by combining data from several tables to create a complete list (including all validated and unvalidated events). An event was counted if any overlap was found with a base event in the published data; multiple predictions by an algorithm for one published event were counted as one. Each total in the diagram comprises all the events found by the studies, meaning that each event in an overlapping pair is counted. Surprisingly, only 43 overlapping events are found for NA15510 in all three studies (A). Results from the PennCNV (D) and QuantiSNP (C) comparisons show that QuantiSNP detects more of the published events overall, owing to the detection of more events overlapping with the Korbel et al. study. Overlap between algorithms is shown in Venn diagram B, which compares events that are detected by the algorithm and found in at least one of the publications. A large proportion of detected events between PennCNV and QuantiSNP (43) overlap.

The proportion of events from the tested software that overlap published events is below 50% in all cases. We used default parameters for all our algorithms for ease of replication, which means that some algorithms were not run at their optimal level for our data. We deliberately chose data that did not use an array-based technique for our NA12156 comparison to prevent a bias between Affymetrix and Illumina, but in doing so we accepted an increase in the number of differently detected events. Kidd et al. have shown similar data when comparing studies and found only a 12.5% overlap of events larger than 5 kb between their results and CN data generated by the Affymetrix 6.0 array.

Similarities of events detected between different software

We chose to test a single sample (NA10861) on a range of the available algorithms to compare the similarity of event detection. In all cases we found the academically developed software to be more sensitive, detecting more events than proprietary algorithms (Table 4). The data also show an increased number of events found in the sample using the Affymetrix SNP6.0 array; we assume this reflects the larger number of CNP probes on the array relative to Illumina's 1M chip.

Table 4:

Comparison of event numbers detected for a single sample (NA10861)

Algorithm	Platform and array	Number of CN events detected
Birdsuite 1.5.5 (Canary & Birdseye)	Affymetrix 6.0	137
CNAT (Genome Console 3.0.2)	Affymetrix 6.0	10
CNVPartition 1.2.1	Illumina 1M Duo	16
GADA (R 0.7-5)	Affymetrix 6.0	613
GADA (R 0.7-5)	Illumina 1M Duo	87
Nexus Biodiscovery 4.0.1	Affymetrix 6.0	111
Nexus Biodiscovery 4.0.1	Illumina 1M Duo	8
PennCNV (2009Jan06)	Affymetrix 6.0	67
PennCNV (2009Jan06)	Illumina 1M Duo	43
QuantiSNP v2.0	Affymetrix 6.0	193
QuantiSNP v1.1	Illumina 1M Duo	60

HapMap samples provided as demonstration data were analysed on both Affymetrix and Illumina platforms to give an easily reproducible comparison of event prediction. The events shown were detected by each algorithm for CEPH sample NA10861. Default parameters were used for all algorithms and any Y chromosome data were omitted. Data from the Affymetrix array yield a higher number of detected events, probably because of the number of specifically targeted probes. Proprietary software from both Illumina and Affymetrix has a low detection rate.

Table 5 shows the amount of overlap in event prediction. We show two results for each comparison, counting the number of overlapping events for each algorithm separately. The difference in values reflects cases where several smaller events from one algorithm fall within a single event from another. In general, we found a higher number of overlapping events between algorithms run on Affymetrix 6.0 array data. We expected a low resemblance between data generated on different platforms as a result of the different probe sets; however, we are pleased to find some overlap. We have included a comparison to events published by Redon et al. [2]; although the study does not include a comprehensive list for this sample, it does show that the algorithms are detecting confirmed events.

Table 5:

Comparison of software event predictions

Software and platform	Published results (Redon)	Birdsuite Affymetrix	CNAT Affymetrix	CNV Partition Illumina	GADA Affymetrix	GADA Illumina	Nexus Affymetrix	Nexus Illumina	PennCNV Affymetrix	PennCNV Illumina	QuantiSNP Affymetrix	QuantiSNP Illumina
Published data (Redon)	-	17 (4%)	4 (40%)	3 (19%)	32 (5%)	2 (2%)	11 (10%)	2 (25%)	12 (18%)	7 (16%)	18 (9%)	8 (13%)
Birdsuite Affymetrix	17 (44%)	-	9 (90%)	13 (81%)	135 (22%)	21 (24%)	62 (56%)	6 (75%)	43 (64%)	20 (47%)	97 (50%)	20 (33%)
CNAT Affymetrix	4 (10%)	15 (4%)	-	4 (25%)	34 (6%)	0	23 (21%)	1 (13%)	13 (19%)	2 (5%)	17 (9%)	5 (8%)
CNV Partition Illumina	3 (8%)	16 (4%)	4 (40%)	-	37 (6%)	7 (8%)	20 (18%)	7 (88%)	9 (13%)	11 (26%)	16 (8%)	16 (27%)
GADA Affymetrix	17 (44%)	106 (28%)	9 (90%)	13 (81%)	-	32 (37%)	91 (82%)	7 (88%)	58 (87%)	23 (53%)	153 (79%)	27 (45%)
GADA Illumina	2 (5%)	96 (25%)	0	13 (81%)	208 (34%)	-	25 (23%)	2 (25%)	26 (30%)	17 (40%)	67 (35%)	23 (38%)
Nexus Affymetrix	7 (18%)	57 (15%)	10 (100%)	7 (44%)	116 (19%)	8 (9%)	-	4 (50%)	45 (67%)	15 (35%)	78 (40%)	17 (28%)
Nexus Illumina	2 (5%)	6 (2%)	1 (10%)	7 (44%)	22 (4%)	2 (2%)	4 (4%)	-	6 (9%)	7 (16%)	10 (5%)	9 (15%)
PennCNV Affymetrix	11 (28%)	51 (13%)	10 (100%)	9 (56%)	105 (17%)	10 (11%)	65 (59%)	6 (75%)	-	19 (44%)	71 (37%)	21 (35%)
PennCNV Illumina	6 (15%)	25 (7%)	2 (20%)	11 (69%)	44 (7%)	9 (10%)	23 (21%)	6 (75%)	18 (27%)	-	26 (13%)	28 (47%)
QuantiSNP Affymetrix	14 (36%)	97 (25%)	10 (100%)	10 (63%)	199 (32%)	18 (21%)	86 (77%)	7 (88%)	65 (97%)	21 (49%)	-	24 (40%)
QuantiSNP Illumina	6 (15%)	14 (4%)	5 (50%)	15 (94%)	55 (9%)	10 (11%)	30 (27%)	8 (100%)	23 (34%)	32 (74%)	31 (16%)	-

Algorithms were run on demonstration data for sample NA10861 on Affymetrix 6.0 chips and Illumina 1M Duo arrays. Default parameters were used and any Y chromosome data were omitted. For overall totals per algorithm see Table 4. Events detected by both software packages in each pair are shown. Events were counted as common between algorithms if any part of a predicted region overlaps with a region predicted by the other. Each comparison is carried out twice to show cases where several smaller events from one algorithm make up one event in the other; the overlap of events therefore depends on the orientation of the analysis. The count is the number of events from the software on the horizontal axis that are found in the other software's dataset; the bracketed value expresses this as a percentage of the events detected by that same software. We found the most similarities between data from the same platform or from similar algorithm methods; for example, PennCNV and QuantiSNP are both based on the HMM algorithm, so on Affymetrix data their event predictions should be very similar. We have also noted a higher number of similar events from algorithms using Affymetrix data.
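
The orientation dependence described in this caption can be illustrated with a short Python sketch. It assumes hypothetical (chromosome, start, end) tuples with inclusive coordinates; the function names and the toy calls are invented for illustration and do not come from any of the tested packages.

```python
from typing import List, Tuple

# Hypothetical event representation: (chromosome, start, end), inclusive coordinates.
Event = Tuple[str, int, int]


def overlaps(a: Event, b: Event) -> bool:
    """True if two events lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def count_overlapping(query: List[Event], reference: List[Event]) -> int:
    """Number of events in `query` that overlap any event in `reference`."""
    return sum(1 for q in query if any(overlaps(q, r) for r in reference))


# Toy calls, invented for illustration: caller A reports one large event,
# caller B splits the same region into three smaller events and adds one elsewhere.
caller_a = [("chr1", 100, 1000)]
caller_b = [("chr1", 100, 300), ("chr1", 400, 600), ("chr1", 700, 1000), ("chr3", 10, 20)]

a_found_in_b = count_overlapping(caller_a, caller_b)
b_found_in_a = count_overlapping(caller_b, caller_a)
pct_a = 100.0 * a_found_in_b / len(caller_a)
pct_b = 100.0 * b_found_in_a / len(caller_b)
```

Here one event from caller A is found in caller B (100% of A's calls), while three events from caller B are found in caller A (75% of B's calls), so the two directions of the same comparison give different counts and percentages.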

During our comparison we often saw a difference in the size of the predicted event between algorithms (Figure 5). This was to be expected when using different platforms, as probe locations vary, but was also seen when analysing an identical dataset. This kind of effect can even be produced simply by altering algorithm parameters and should be considered when looking at the breakpoints of detected events. We found that the available software tends to target and support one particular platform for analysis, which, unfortunately, can limit options.

Figure 5:

Image from UCSC Browser showing the detection of a single event using different algorithms. The deletion described is a known CNP and is recorded several times in the DGV. Each track represents a different algorithm or platform. All results for detection algorithms shown used default parameters and test sample NA10861.

Recommending algorithms

Comparing events in a dataset is a good way of assessing the accuracy of detection algorithms, but differing predictions are also informative: they can reveal false positives caused by noisy data, and, conversely, events on which algorithms agree are the strongest candidates. Multiple predictions from different software for the same event increase confidence in the data and give clearer indications of the event boundaries or of any discrepancy in this information. We would recommend using a second algorithm on a single dataset to produce the most informative results and to utilize the different advantages of each software. We also suggest using software designed specifically for the platform that generated the data, as several of the dual-use algorithms have been shown to be weaker in one format. We have selected a range of algorithms to discuss and test; the list in Table 1 is not exhaustive, only an overview of some of the possibilities. It is also important to state that, even using different algorithms, one cannot definitively confirm the presence of a CN event without separate biological replication, and it is unlikely that any list of detected events will contain all CNVs in a sample.
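
As a rough sketch of how the output of two algorithms on the same dataset might be combined, the Python below pairs overlapping calls, reports the region both agree on together with the breakpoint discrepancies, and lists calls with no support from the second algorithm. The tuple layout, function names and coordinates are all assumptions made for illustration; unsupported calls are only candidates for false positives, not confirmed ones.

```python
from typing import Dict, List, Tuple

# Hypothetical event representation: (chromosome, start, end), inclusive coordinates.
Event = Tuple[str, int, int]


def overlaps(a: Event, b: Event) -> bool:
    """True if two events lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def consensus(calls_a: List[Event], calls_b: List[Event]) -> List[Dict]:
    """For every overlapping pair, return the shared region and breakpoint differences."""
    results = []
    for a in calls_a:
        for b in calls_b:
            if overlaps(a, b):
                results.append({
                    "region": (a[0], max(a[1], b[1]), min(a[2], b[2])),
                    "start_diff": abs(a[1] - b[1]),  # disagreement at the left breakpoint
                    "end_diff": abs(a[2] - b[2]),    # disagreement at the right breakpoint
                })
    return results


def unsupported(calls_a: List[Event], calls_b: List[Event]) -> List[Event]:
    """Calls from the first algorithm with no overlapping call from the second."""
    return [a for a in calls_a if not any(overlaps(a, b) for b in calls_b)]


# Toy calls, invented for illustration only.
calls_a = [("chr1", 100, 500), ("chr2", 1000, 2000)]
calls_b = [("chr1", 150, 480), ("chr5", 10, 90)]

agreed = consensus(calls_a, calls_b)
only_a = unsupported(calls_a, calls_b)
```

In this toy example the two algorithms agree on chr1:150-480, with breakpoints differing by 50 and 20 bases, while the chr2 call from the first algorithm has no support from the second.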

FURTHER ANALYSIS OF DETECTED CNVs

With a number of reliable options available for the detection of copy number events, it becomes increasingly important to be able to summarize and use these data. Initially, we are often interested in looking for novel events in certain genes or regions. Tracks of events can be viewed in web-based genome browsers such as UCSC (http://www.genome.ucsc.edu/), and events can be compared to known copy number data in the DGV, as displayed in Figure 3. Importing several tracks of data into a browser simultaneously allows the user to compare different result sets.
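
One way to prepare such tracks is to write each algorithm's events as a BED-format custom track, which the UCSC browser accepts. The sketch below assumes the input events use 1-based inclusive coordinates and converts them to the 0-based, half-open convention that BED uses; the function name, event labels and track names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical input: (chromosome, start, end, label) with 1-based inclusive coordinates.
LabelledEvent = Tuple[str, int, int, str]


def to_bed_track(events: List[LabelledEvent], track_name: str, description: str) -> str:
    """Format events as a BED custom track, one track per algorithm."""
    lines = [f'track name="{track_name}" description="{description}"']
    for chrom, start, end, label in events:
        # BED starts are 0-based and ends are exclusive, so only the start shifts by one.
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Toy events, invented for illustration only.
bed_text = to_bed_track(
    [("chr1", 101, 200, "del_1"), ("chr4", 5001, 9000, "dup_1")],
    track_name="callerA",
    description="Calls from caller A",
)
```

Writing one such block per algorithm and uploading them together gives one browser track per result set, so that boundaries can be compared by eye.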

Analysis of multiple events per sample is a more complicated procedure. Events and samples can be explored using pathway analysis tools to look for interesting groups or combinations of events in different genes but methods of confirming the significance of an event are required. A number of publications exist presenting ways of applying association study methods to copy number data. Barnes et al. [29] developed an R package, CNVtools, which allows the user to carry out case-control association analysis on a single CNV of interest. The publication tests a series of five alternative modelling methods before recommending a likelihood ratio test which combines CNV calling and association testing into a single model. This method was designed to eliminate problems with signal noise which is a known trait of SNP assay data. Ionita-Laza et al. [30] suggested a method to apply genome-wide family-based association studies on raw-intensity data. The Birdsuite package includes a pipeline to prepare the data for PLINK analysis. Other sources have suggested similar association study-based strategies, but an agreed approach is a subject of great discussion. Calls have been made by authors such as Scherer et al. [31] to decide on a single technique but future decisions in the field will be extremely enlightening.
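
To make the idea of a case-control test on a single CNV concrete, the sketch below applies a two-sided Fisher's exact test to carrier counts. This is a deliberately simple stand-in and not the likelihood ratio method of CNVtools: it works on hard calls and therefore ignores the signal noise that the combined calling-and-testing model was designed to handle. The function name and the counts are invented for illustration.

```python
from math import comb


def fisher_exact_two_sided(case_carriers: int, case_non: int,
                           control_carriers: int, control_non: int) -> float:
    """Two-sided Fisher's exact test on a 2x2 table of CNV carrier counts.

    The p-value is the total probability of all tables with the same margins
    that are no more likely than the observed table.
    """
    n_cases = case_carriers + case_non
    n_controls = control_carriers + control_non
    n_carriers = case_carriers + control_carriers
    n_total = n_cases + n_controls

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers among the cases.
        return comb(n_cases, x) * comb(n_controls, n_carriers - x) / comb(n_total, n_carriers)

    p_observed = prob(case_carriers)
    lowest = max(0, n_carriers - n_controls)
    highest = min(n_cases, n_carriers)
    # Small relative tolerance so ties with the observed table are included.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Toy counts, invented for illustration: all three cases carry the CNV, no controls do.
p_extreme = fisher_exact_two_sided(3, 0, 0, 3)
# Balanced toy counts: carriers split evenly between cases and controls.
p_balanced = fisher_exact_two_sided(5, 5, 5, 5)
```

With such small toy counts even a perfect split between cases and controls gives a p-value of only 0.1, which illustrates how strongly sample size limits what an association test can show.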

As is often noted in the literature describing SNP association study techniques, sample size and the power of tests are major factors in a successful study [32]. This must also be considered when analysing copy number data. As we have discussed, there are a number of analysis options available for SNP array CNV detection: pipelines that allow guided analysis and stand-alone options for more flexible analysis. Some of these applications are platform targeted, but we have found that the best outcome is given by using multiple algorithms and comparing data.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bib.oxfordjournals.org/.

Key Points

A wide variety of software is available for CNV detection from data produced by SNP arrays. This review seeks to discuss options and statistical methods currently available for analysis of signal intensity data.
Changes in assay selection techniques for SNP arrays have made them more appealing for copy number detection as well as genotyping. Targeted probe design has made the SNP array a reliable and cheaper option for copy number analysis.
After testing a selection of the available software, comparisons were performed using HapMap samples and published copy number data. Of the events found in our data 20–49% were replicated in previously published studies, but the results clearly showed variation in data caused by differences in algorithms.
An important recommendation when choosing software for analysis is the use of a second algorithm on a dataset to produce more informative results. This enables the user to eliminate false positives not found by both software packages and increases confidence in replicated events.

FUNDING

JR and LW are funded by Wellcome Trust Grants. CY is funded by a UK Medical Research Council Special Training Fellowship in Biomedical Informatics (Ref No. G0701810).

Acknowledgements

The authors thank Dr Helen Butler for her ideas and contributions to the manuscript.

References

1. et al. Detection of large-scale variation in the human genome. Nat Genet 2004 (pg. 949)
2. Redon et al. Global variation in copy number in the human genome. Nature 2006, vol. 444(7118) (pg. 444)
3. et al. Fine-scale structural variation of the human genome. Nat Genet 2005 (pg. 727)
4. et al. Large-scale copy number polymorphism in the human genome. Science 2004, vol. 305(5683) (pg. 525)
5. et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Hum Mol Genet 2007 (pg. 2783)
6. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat Genet 2007, vol. 39(7 Suppl) (pg. S16)
7. Korbel et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 2007, vol. 318(5849) (pg. 420)
8. et al. Large-scale genotyping of complex DNA. Nat Biotechnol 2003 (pg. 1233)
9. et al. High-resolution genomic profiling of chromosomal aberrations using Infinium whole-genome genotyping. Genome Res 2006 (pg. 1136)
10. International Schizophrenia Consortium. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 2008, vol. 455(7210) (pg. 237)
11. et al. Genome-wide copy-number-variation study identified a susceptibility gene, UGT2B17, for osteoporosis. Am J Hum Genet 2008 (pg. 663)
12. et al. Common deletion polymorphisms in the human genome. Nat Genet 2006
13. et al. Systematic assessment of copy number variant detection via genome-wide SNP genotyping. Nat Genet 2008 (pg. 1199–203)
14. Copy-number variation and association studies of human disease. Nat Genet 2007, 7 Suppl (pg. S37)
15. et al. Integrated detection and population-genetic analysis of SNPs and copy number variation. Nat Genet 2008 (pg. 1166)
16. et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet 2008 (pg. 1253)
17. et al. Unsupervised segmentation of continuous genomic data. Bioinformatics 2007 (pg. 1424)
18. et al. QuantiSNP: an objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res 2007 (pg. 2013)
19. et al. PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Res 2007 (pg. 1665)
20. et al. High-density SNP association study and copy number variation analysis of the AUTS1 and AUTS5 loci implicate the IMMP2L-DOCK4 gene region in autism susceptibility. Mol Psychiatry 2009
21. et al. Modeling genetic inheritance of copy number variations. Nucleic Acids Res 2008 (pg. e138)
22. et al. Major copy proportion analysis of tumor samples using SNP arrays. BMC Bioinformatics 2008 (pg. 204)
23. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004 (pg. 557)
24. et al. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics 2008 (pg. 309)
25. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005 (pg. 3763)
26. et al. ITALICS: an algorithm for normalization and DNA copy number calling for Affymetrix SNP arrays. Bioinformatics 2008 (pg. 768)
27. et al. Detection, imputation, and association analysis of small deletions and null alleles on oligonucleotide arrays. Am J Hum Genet 2008 (pg. 1316)
28. Kidd et al. Mapping and sequencing of structural variation from eight human genomes. Nature 2008, vol. 453(7191)
29. Barnes et al. A robust statistical method for case-control association testing with copy number variation. Nat Genet 2008 (pg. 1245)
30. Ionita-Laza et al. On the analysis of copy-number variations in genome-wide association studies: a translation of the family-based association test. Genet Epidemiol 2008 (pg. 273)
31. Scherer et al. Challenges and standards in integrating surveys of structural variation. Nat Genet 2007, 7 Suppl
32. Association study designs for complex diseases. Nat Rev Genet 2001