Inferring combined CNV/SNP haplotypes from genotype data
1 Department of Epidemiology and Biostatistics, School of Public Health, Imperial College, London W2 1PG, UK, 2 Ernest Gallo Clinic and Research Center, Department of Bioinformatics, University of California, San Francisco, CA 94608, USA, 3 Department of Genomics of Common Disease, School of Public Health, Imperial College, Hammersmith Hospital, Du Cane Road, London W12 0NN, UK, 4 Institute of Health, University of Oulu, Oulu, Finland, 5 CNRS 8090-Institute of Biology, Pasteur Institute, France and 6 Institute of Genetics, University College London, London WC 1E 6BT, UK
* To whom correspondence should be addressed.
Received: 09 December 2009
Revision received: 15 March 2010
Shu-Yi Su, Julian E. Asher, Marjo-Riita Jarvelin, Phillipe Froguel, Alexandra I.F. Blakemore, David J. Balding, Lachlan J.M. Coin, Inferring combined CNV/SNP haplotypes from genotype data, Bioinformatics, Volume 26, Issue 11, June 2010, Pages 1437–1445, https://doi.org/10.1093/bioinformatics/btq157
Abstract
Motivation: Copy number variations (CNVs) are increasingly recognized as a substantial source of individual genetic variation, and hence there is a growing interest in investigating the evolutionary history of CNVs as well as their impact on complex disease susceptibility. CNV/SNP haplotypes are critical for this research, but although many methods have been proposed for inferring integer copy number, few have been designed for inferring CNV haplotypic phase and none of these is applicable at a genome-wide scale. Here, we present a method for inferring missing CNV genotypes, predicting CNV allelic configuration and inferring CNV haplotypic phase from SNP/CNV genotype data. Our method, implemented in the software polyHap v2.0, is based on a hidden Markov model, which models the joint haplotype structure between CNVs and SNPs. Thus, the haplotypic phase of CNVs and SNPs is inferred simultaneously. A sampling algorithm is employed to obtain a measure of confidence/credibility of each estimate.
Results: We generated diploid phase-known CNV–SNP genotype datasets by pairing male X chromosome CNV–SNP haplotypes. We show that polyHap provides accurate estimates of missing CNV genotypes, allelic configuration and CNV haplotypic phase on these datasets. We applied our method to a non-simulated dataset—a region on Chromosome 2 encompassing a short deletion. The results confirm that polyHap's accuracy extends to real-life datasets.
Availability: Our method is implemented in version 2.0 of the polyHap software package and can be downloaded from http://www.imperial.ac.uk/medicine/people/l.coin
Contact: l.coin@imperial.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Copy number variations (CNVs) are pervasive in the human genome (Feuk et al., 2006; Redon et al., 2006) and could play a key role in human diversity and disease susceptibility (Conrad et al., 2009; McCarroll and Altshuler, 2007). Despite this, the population genetics of CNVs—and particularly so for duplications—remain relatively poorly understood. Several analytical tools, such as haplotype analysis, which are standard for SNP-based population genetics have yet to be modified to be applicable to complex multi-allelic CNVs.
Several technologies enable high-throughput CNV detection, including array comparative genomic hybridization (aCGH) and SNP genotyping arrays. Many algorithms have been proposed to detect CNV regions and to estimate the integer copy-number (CN) genotypes in each region using these technologies (Colella et al., 2007; Fiegler et al., 2006; Korn et al., 2008; Lai et al., 2005; Olshen et al., 2004; Wang et al., 2007). In particular, using SNP genotyping arrays to simultaneously produce estimates of integer CN and SNP genotype has become popular, particularly as a means to identify both SNPs and CNVs associated with disease. CNV association analyses are conducted either on estimates of integer CN genotype (Barnes et al., 2008; Korn et al., 2008), or using normalized continuous intensity data measurements (Barnes et al., 2008).
As a result of intensive genotyping efforts worldwide for genome-wide association studies, there are many datasets containing inferred CNV regions, CNV genotypes in these regions, as well as SNP genotypes. However, CNV–SNP haplotypes are rarely determined in these datasets, due largely to a lack of algorithmic development in this area. Hence, haplotype-based approaches that have been shown to be more powerful than single-marker analyses (Liu et al., 2007; Mailund et al., 2006; Su et al., 2008a) are not fully exploited in CNV association studies.
Apart from improving the sensitivity of association studies, CNV–SNP haplotypes are also invaluable for studying the evolutionary history of CNVs. In particular, many techniques for detecting positive selection rely on accurate phasing (Sabeti et al., 2007). Similarly, CNV–SNP phasing will improve the accuracy of estimates of linkage disequilibrium (LD) between SNPs and CNVs (Conrad et al., 2009; de Smith et al., 2008), particularly for multi-allelic CNVs. Identification of the haplotypic background(s) of a given CNV will also help to distinguish single from recurrent deletion/amplification events and will shed light on the age of the CNV (from the size of the extended haplotype containing the CNV).
Methods for inferring haplotypic phase from diploid genotypes are well developed and provide accurate inference of haplotypic phase (Browning and Browning, 2009, 2007; Kimmel and Shamir, 2005; Scheet and Stephens, 2006; Stephens and Scheet, 2005; Su et al., 2008a). For polyploid organisms, two phasing programs, SATlotyper (Neigenfind et al., 2008) and polyHap(v1.0) (Su et al., 2008b), have been proposed. By treating a CN region as a region of variable ploidy, polyHap(v1.0) can also be used for phasing CNV regions provided that the ploidy is fixed for the entire genomic region under investigation (although this can be different for different individuals). Thus, phasing complex CNV regions, in which each individual can have different CNV breakpoints, is a problem that has yet to be fully addressed.
To properly define what is meant by CNV haplotyping, we consider in Figure 1 how a CNV might arise in an ancestral genome, and subsequently be transmitted from one generation to the next. We consider separately the cases of deletion, local duplication and dispersed duplication. For both a deletion and a local duplication, LD can accumulate between flanking SNPs and the CN state itself. Studies have estimated that the majority of common genotypeable CNVs are well tagged by SNPs (Conrad et al., 2009; de Smith et al., 2008). Exploiting this LD pattern to infer which haplotype contains the CNV is called non-internal phasing in our study. For bi-allelic SNPs within duplications, non-internal phasing also provides an estimate of _allelic configuration_, e.g. distinguishing two possible configurations AA/B and AB/A for a genotype AAB. On the other hand, the duplicated regions can themselves contain SNPs, which either arose prior to the duplication event, during the duplication event (through imperfect copying) or subsequently. We refer to exploiting the LD patterns within duplicated regions in order to identify the haplotypes comprising the duplication as internal phasing. The difference between internal and non-internal phasing (as applied to diploid genomes) is further illustrated in Figure 2. Non-internal phasing reconstructs haplotypes consisting of CNV–SNP alleles {−, A, B, AA, AB, BB, …} from diploid genotypes, but does not phase within duplicated alleles. Internal phasing, on the other hand, reconstructs all of the underlying SNP haplotypes present in the dataset, including those within duplicated regions. Thus, it considers diploid genomes as locally polyploid, where the ploidy is given by the maximum CN.
MOCSphaser (Kato et al., 2008b) was the first program developed for inferring (non-internal) CNV–SNP haplotypes using an expectation maximization (EM) algorithm. However, it only accommodates CNs in CNV regions (it does not consider variant bases in these regions) and SNP genotypes in non-CNV regions. Another recently proposed non-internal phasing program for CNV haplotype inference, CNVphaser, employed EM and partition–ligation (PL) algorithms to infer haplotypic phase given identified CNV regions and CNs (Kato et al., 2008a).
Fig. 1.
Illustration of the process of forming a deletion and a local/dispersed duplication. The light grey box represents deleted or duplicated region. The deleted or duplicated region is transmitted over generations. The pre-existing SNPs (represented in white circles) and new SNPs (represented in dark grey circles) form the background patterns of haplotypes, from which the correlation between deletion/duplication states and flanking SNPs is captured in our model for non-internal phasing. For internal phasing, our model exploits the correlation between SNPs within duplicated regions.
Fig. 2.
Illustration of non-internal and internal phasing with a deletion and a single copy amplification. Non-internal phasing considers the genotypes as diploids and treats the duplication and deletion as extra different alleles, whereas, internal phasing considers the genotypes as triploids and introduces an extra chromosome copy. This chromosome will accommodate the extra copy of the duplicated region and will otherwise be set to a deletion state.
In this article, we describe an algorithm for both internal and non-internal CNV haplotype inference from CNV/SNP genotype data, which takes account of the shared haplotype structure between individuals in a population. Our method, polyHap(v2.0), extends the model of polyHap(v1.0) (Su et al., 2008b) to phase complex CNV regions by allowing arbitrary changes of CN within individuals and along the genomic sequence.
To investigate the effectiveness of our approach, we took SNP/CNV genotype data on male X chromosomes and randomly paired these into phase-known diploid and triploid haplotypes. We then investigated how well we could reconstruct the known phase and allelic configuration as well as infer missing CNV genotypes. The results show that our method provides accurate estimates of missing CNV genotypes, allelic configuration and haplotypic phase. We applied polyHap to a region on Chromosome 2 encompassing a short deletion. The results show that polyHap correctly detected a haplotype comprising this deletion.
2 METHODS
Our method employs a hidden Markov model (HMM) to infer an ancestral haplotype for each haplotype at each marker, reflecting the idea that similar haplotypes are likely to have descended from the same ancestral haplotype. Assume we observe the genotypes, $g = (g_1, g_2, \dots, g_M)$, at $M$ SNPs for each individual. $g_m = \{g_{m1}, \dots, g_{mN}\}$ is an unordered list of the individual's alleles at marker $m$, where $N$ is the ploidy. For non-internal phasing, we infer haplotypic phase on diploid chromosomes, where $N$ equals 2. For internal phasing, we consider genotypes as polyploids, where $N$ is set to the maximum CN (ploidy) observed on the individual (Fig. 2). Also, $s_m = \{s_{m1}, \dots, s_{mN}\}$ and $s'_m = [s_{m1}, \dots, s_{mN}]$ are the unordered and ordered lists of ancestral haplotypes at marker $m$, respectively. We write $\pi(s'_m) = [s_{m\pi(1)}, \dots, s_{m\pi(N)}]$ for a permutation of $s'_m$, and $\Pi(s_m)$ for the set of all such permutations. Thus, for example, if $s'_m = [1, 2]$ there are two permutations, namely $[2, 1]$ and $[1, 2]$, whereas if $s'_m = [1, 1]$ there is only one permutation.
2.1 Emission probability
In our method, each allele is assumed to be descended from one of z ancestral haplotypes, which are the hidden states (haplotype states) in the HMM. The program first learns the ancestral haplotype structure from genotypes jointly for all individuals. Based on this structure, allelic configuration, missing CNV genotypes and CNV haplotypic phase are then inferred. This relationship between the allele and the haplotype hidden state is modelled by the emission probability. In this study, we allow a deletion and a single copy amplification. Thus, the set of possible alleles is {-, A, B, AA, AB, BB} underlying a diploid model when non-internal phasing is considered, while the set of possible alleles is {-, A, B} underlying a polyploid model for internal phasing.
First, we define the emission probability of each genotype given a haplotype state. Let $\theta_{ml_n}(h)$ denote the emission probability of allele $h$ at marker $m$ given the haplotype state $l_n$ in a haploid model, where $h \in \{-, A, B, AA, AB, BB\}$ for non-internal phasing and $h \in \{-, A, B\}$ for internal phasing.
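We first obtain the emission probability of a list of unordered haplotypes, given an unordered list of haplotype states $\{l_1, \dots, l_N\}$ by

$$p(\{h_1, \dots, h_N\} \mid \{l_1, \dots, l_N\}) = \sum_{\pi \in \Pi(h)} \prod_{n=1}^{N} \theta_{ml_n}(h_{\pi(n)}) \tag{1}$$

where $\theta_{ml_n}(h_{\pi(n)}) = p(h_{\pi(n)} \mid l_n)$.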
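As a minimal sketch of how an emission probability for an unordered allele list can be evaluated, the Python snippet below sums, over the distinct orderings of the observed alleles, the product of per-state emission probabilities. The function name `unordered_emission`, the dictionary layout of `theta` and the toy numbers are assumptions made for illustration and are not taken from polyHap.

```python
from itertools import permutations
from math import prod

def unordered_emission(alleles, states, theta):
    """Probability that the haplotype states emit the unordered allele list.

    theta[state][allele] is a hypothetical per-marker emission table.
    Orderings that differ only by swapping identical alleles count once.
    """
    distinct_orderings = set(permutations(alleles))
    return sum(
        prod(theta[s][h] for s, h in zip(states, ordering))
        for ordering in distinct_orderings
    )

# Toy emission table for two ancestral haplotype states (illustrative values).
theta = {
    1: {"-": 0.05, "A": 0.90, "B": 0.05},
    2: {"-": 0.05, "A": 0.10, "B": 0.85},
}

p_het = unordered_emission(["A", "B"], [1, 2], theta)  # 0.90*0.85 + 0.05*0.10
p_hom = unordered_emission(["A", "A"], [1, 2], theta)  # 0.90*0.10
```

Using a set of orderings mirrors the convention that a list such as [1, 1] has only one permutation, so a homozygous pair is not counted twice.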
For non-internal phasing, a given CNV genotype (e.g. AAB) may be consistent with more than one unordered haplotype pair (e.g. AA/B and AB/A). In this case, the observed data are represented as a probability distribution $p^*_m$ over unordered haplotype pairs (e.g. such that $p^*_m(\mathrm{AB/A}) = p^*_m(\mathrm{AA/B}) = 0.5$). We then write

$$p(g_m \mid \{l_1, \dots, l_N\}) = \sum_{h} p^*_m(h)\, p(h \mid \{l_1, \dots, l_N\}) \tag{2}$$

for the emission probability, using Equation (1) to calculate the terms in this sum. We note that a normal copy genotype, e.g. AA, is also consistent with two different unordered haplotype pairs, namely A/A as well as AA/-; however, we currently exclude the AA/- haplotype pair from our analysis. Equation (2) can also be used to accommodate uncertain CN genotypes, in which case $p^*_m$ reflects the probability of each CNV/SNP genotype as calculated by the CNV genotyping algorithm used. If $g_m$ is missing, we set $p^*_m$ to be the uniform distribution over all CNV/SNP genotypes.
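The ambiguity described above can be made concrete with a short enumeration. The sketch below lists the unordered haplotype pairs compatible with a CNV/SNP genotype and assigns them equal weight. The helper names `configurations` and `uniform_p_star` are hypothetical, and the rule that drops any pair combining a deletion with a duplicated allele is an assumption that generalizes the stated exclusion of AA/-.

```python
from itertools import combinations_with_replacement

# Non-internal alleles allowing a deletion and a single-copy amplification.
ALLELES = ["-", "A", "B", "AA", "AB", "BB"]

def base_counts(allele):
    """Return (number of A copies, number of B copies) carried by an allele."""
    return (allele.count("A"), allele.count("B"))

def configurations(genotype):
    """List unordered haplotype pairs whose pooled bases match the genotype."""
    target = base_counts(genotype)
    pairs = []
    for h1, h2 in combinations_with_replacement(ALLELES, 2):
        pooled = tuple(x + y for x, y in zip(base_counts(h1), base_counts(h2)))
        if pooled != target:
            continue
        # Assumed rule: skip a deletion paired with a duplication (e.g. AA/-).
        if "-" in (h1, h2) and max(len(h1), len(h2)) == 2:
            continue
        pairs.append((h1, h2))
    return pairs

def uniform_p_star(genotype):
    """Equal weight on every compatible pair, as in the AAB example."""
    pairs = configurations(genotype)
    return {pair: 1.0 / len(pairs) for pair in pairs}

p_star_aab = uniform_p_star("AAB")
```

For AAAB the enumeration returns a single pair, which is why such genotypes carry no configuration ambiguity.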
2.2 Transition probability
We first briefly describe a basic transition model for internal phasing. We then introduce the extension of this model for non-internal phasing by considering the transitions between the CNs and between the haplotypes. In this extension, a given haplotype hidden state has a fixed CN and there can be multiple haplotype states underlying each CN. In this study, we use eight ancestral haplotype states for internal phasing and nine haplotype states for non-internal phasing of which one haplotype state has the underlying CN = 0 (deletion), four haplotype states have the underlying CN = 1 (normal copy) and four haplotype states have the underlying CN = 2 (a single copy amplification). Note that the CN states are the super states which categorize haplotype states according to their underlying CNs.
2.2.1 A basic haplotype transition model
First, we define the transition probability in an HMM from haplotype state $k_n$ to $l_n$ between markers $m-1$ and $m$ by

$$p(l_n \mid k_n) = (1 - J_m)\,\delta(k_n, l_n) + J_m\,\alpha_{ml_n} \tag{3}$$

where $\delta(k_n, l_n)$ equals 1 if $k_n = l_n$ and 0 otherwise, $J_m$ is the probability of a jump occurring at marker $m-1$, and $\alpha_{ml_n}$ is the probability that this jump results in the haplotype $l_n$. For tightly linked markers, $J_m$ is small so that haplotype state changes occur infrequently, but are allowed between any pair of markers. Here, the parameter $J_m$ is independent of the state and $\alpha_{ml_n}$ only depends on the $l_n$ (the 'to') state.
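A minimal sketch of this basic transition model follows, assuming the usual jump form in which the chain keeps its state with probability 1 − J and otherwise draws a new state from α (a jump may land back on the same state). The function name and the parameter values are illustrative only.

```python
def basic_transition(jump_prob, alpha):
    """Matrix with entries (1 - J) * [k == l] + J * alpha[l].

    jump_prob: probability of a jump between two consecutive markers.
    alpha:     landing distribution over the z haplotype states (sums to 1).
    """
    z = len(alpha)
    return [
        [(1.0 - jump_prob) * (k == l) + jump_prob * alpha[l] for l in range(z)]
        for k in range(z)
    ]

# Four haplotype states, rare jumps, uniform landing distribution.
T = basic_transition(jump_prob=0.01, alpha=[0.25, 0.25, 0.25, 0.25])
```

Because the landing distribution depends only on the 'to' state, every row of the matrix differs from the others only in where the 1 − J mass sits.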
2.2.2 A modified haplotype transition model
We further modify this model to allow different models for the transition between CN states, for the transition between haplotype states that have the same CN state, and for the transition between haplotype states that have different CN states. To incorporate the CN state in the transition model, we introduce a hierarchical transition model: the first level is the transition between the CN states and the second is the transition between the haplotype states given the CN states (Fig. 3). The purpose of this model is to capture favoured transitions between the CNs.
Fig. 3.
Illustration of the two levels of transition based on the haploid model. Each box represents a CN state and the numbers in the box are the assigned haplotype states. The first level of transition (between the CN states) can be considered as the transition between the boxes. The dashed and solid lines in the left panel give an example of the possible transitions from CN state 0 (the solid line represents the most likely transition). The second level of transition (between the haplotype states) can be considered as the transition between the numbers given the boxes, where the dashed lines give an example of the transitions from haplotype state 1 to haplotype states 3, 4 and 5 given that the transition between CN states is from 0 to 1. Note that the number of haplotype states in each CN state can be specified by users.
The transition probability from haplotype state $k_n$ to $l_n$ is then the product of the transition probability between CN states and the transition probability between haplotype states given the CN states

$$p(l_n \mid k_n) = p\big(c(l_n) \mid c(k_n)\big)\; p\big(ci(l_n) \mid ci(k_n), c(l_n), c(k_n)\big) \tag{4}$$

where $c(l_n)$ and $c(k_n)$ are the underlying CN states for haplotype states $l_n$ and $k_n$, respectively; and $ci(l_n)$ and $ci(k_n)$ are the indices of haplotype states $l_n$ and $k_n$ within the CN states $c(l_n)$ and $c(k_n)$. Both transition probabilities (the two terms of the product in the equation) are calculated based on Equation (3) with different parameters.
In this modification, we allow the parameter $J_m$ to depend on the $k_n$ ('from') state, denoted $J_{mk_n}$, and $\alpha_m$ to depend on both the $k_n$ and $l_n$ ('from' and 'to') states, denoted $\alpha_{mk_nl_n}$. To capture linkage disequilibrium between duplication states and flanking SNPs, we use Equation (3) with parameters $J_{mk_n}$ and $\alpha_{mk_nl_n}$ to compute the transition probability between CN states and between haplotype states given that the transition occurs between different CN states. We use the basic transition model (the parameter $J_m$ is independent of the state and $\alpha_{ml_n}$ only depends on the $l_n$ state) to calculate the transition probability between the haplotype states given the same CN state.
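The two-level structure can be sketched as follows, under stated simplifications. The CN-level term uses a 'from'-dependent jump probability and landing distribution. The haplotype-level term uses the basic model when the CN state is unchanged and, as a simplifying assumption not taken from the source, redraws the haplotype state from a 'to'-only distribution whenever the CN state changes. All names and numbers are hypothetical.

```python
def hierarchical_transition(cn_of, cn_jump, cn_alpha, hap_jump, hap_alpha):
    """Product of a CN-level and a haplotype-level transition probability.

    cn_of[k]        CN state of haplotype state k
    cn_jump[c]      jump probability out of CN state c ('from'-dependent J)
    cn_alpha[c][d]  probability that a jump from CN c lands in CN d
    hap_jump        jump probability between haplotype states sharing a CN
    hap_alpha[l]    landing probability of state l within its own CN group
    """
    z = len(cn_of)
    T = [[0.0] * z for _ in range(z)]
    for k in range(z):
        for l in range(z):
            c, d = cn_of[k], cn_of[l]
            p_cn = (1.0 - cn_jump[c]) * (c == d) + cn_jump[c] * cn_alpha[c][d]
            if c == d:
                # Same CN state: basic model restricted to that CN's states.
                p_hap = (1.0 - hap_jump) * (k == l) + hap_jump * hap_alpha[l]
            else:
                # Simplifying assumption: a CN change always redraws the state.
                p_hap = hap_alpha[l]
            T[k][l] = p_cn * p_hap
    return T

# Nine states: one with CN = 0, four with CN = 1, four with CN = 2.
cn_of = [0, 1, 1, 1, 1, 2, 2, 2, 2]
cn_jump = [0.02, 0.01, 0.02]
cn_alpha = [[0.0, 0.9, 0.1], [0.5, 0.0, 0.5], [0.1, 0.9, 0.0]]
hap_alpha = [1.0] + [0.25] * 8   # sums to 1 within each CN group
T = hierarchical_transition(cn_of, cn_jump, cn_alpha, 0.05, hap_alpha)
```

Each row still sums to 1 because the haplotype-level term is a proper distribution over the states of every target CN group.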
2.2.3 Polyploid transition model
We use the modified haplotype transition model for non-internal phasing and the basic haplotype transition model for internal phasing. Based on these transition models, the transition probability between unordered lists of haplotype states $k = k_1, \dots, k_N$ and $l = l_1, \dots, l_N$ at marker $m$ is given by

$$p(l \mid k) = \sum_{\pi \in \Pi(l)} \prod_{n=1}^{N} p(l_{\pi(n)} \mid k_n) \tag{5}$$
2.3 The prior and computation
We use Dirichlet priors on all of our parameters. We let $\theta_{ml} \sim \mathrm{Dirichlet}(u_\theta m_\theta)$, where $m_\theta$ is the uniform vector with each element equal to $1/H$ ($H$ is the size of the allele space), and $\alpha_{m\cdot} \sim \mathrm{Dirichlet}(u_\alpha m_\alpha)$, where $m_\alpha$ is the uniform vector with each element equal to $1/z$ ($z$ is the number of ancestral haplotypes). We let $J_m \sim \mathrm{Beta}(u_J(1 - e^{-d_m r}), u_J e^{-d_m r})$, where $d_m$ is the physical distance between consecutive markers and $r = 10^{-8}$ per base pair in the population, reflecting the background recombination rate. We use $u_\theta = u_\alpha = 1$ and $u_J = 10^5$ for initialization of the EM algorithm and $u_\theta = u_\alpha = u_J = 0.1$ for the maximization step.
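To see what the Beta prior on the jump probability implies, the sketch below computes its two parameters and its mean for a given inter-marker distance. The mean of Beta(a, b) is a/(a + b), which here reduces to 1 − exp(−d·r) regardless of the prior weight. The function name, the 1400 bp spacing and the prior weight passed in the example are illustrative choices.

```python
from math import exp, expm1

R_PER_BP = 1e-8  # background recombination rate quoted in the text

def jump_prior(distance_bp, u_j, r=R_PER_BP):
    """Return the Beta(a, b) parameters and the prior mean of the jump probability."""
    p_jump = -expm1(-distance_bp * r)      # 1 - exp(-d*r), computed stably
    a = u_j * p_jump
    b = u_j * exp(-distance_bp * r)
    return a, b, a / (a + b)

a, b, mean = jump_prior(distance_bp=1400, u_j=1e5)
```

A larger prior weight leaves the mean unchanged and only concentrates the prior more tightly around it.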
Although our HMM has many parameters, approximate posterior mode estimates are readily obtained using the Baum–Welch algorithm, which is a form of the EM algorithm. The parameters in the model are updated at each step of the EM algorithm given the observed genotype data. The training process might converge to a local maximum of the likelihood function, which is a typical problem for the EM algorithm. To deal with this problem, we combine the results from 10 repetitions of the EM algorithm with different start values. In our model, a first-order Markov chain is employed to model ancestral haplotypes across the sequence. Thus, the number of EM iterations does not depend on the number of markers. The default number of iterations is 25 for each repetition of the training process; this can be changed in our parameter file.
After obtaining the estimates of parameters at each repetition, a specified number of haplotypes are sampled from the posterior distribution conditional on the genotype data of a given individual (Su et al., 2008b). Here, we obtain 100 samples for each repetition. The most likely haplotype is then inferred from all the sampled haplotypes across the 10 repetitions of the EM algorithm. The certainty rate of this estimate is the fraction of times it is sampled. Because we consider only a small number (e.g. 10) of local modes of the posterior distribution for the HMM parameters, the certainty value is not the probability of the imputed genotype under the model, which would require integration over the posterior distribution, but it may serve as a reasonable approximation to this probability.
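The certainty rate described above amounts to a frequency count over the pooled samples. A minimal sketch, with a hypothetical function name and toy samples standing in for draws from the posterior:

```python
from collections import Counter

def most_likely_haplotype(samples_per_repetition):
    """Pool sampled haplotype configurations and return (mode, certainty).

    samples_per_repetition: one list of sampled configurations per EM
    repetition; each configuration must be hashable (e.g. a tuple).
    """
    pooled = [s for rep in samples_per_repetition for s in rep]
    config, count = Counter(pooled).most_common(1)[0]
    return config, count / len(pooled)

# Two repetitions with four samples each (toy stand-in for 10 x 100).
samples = [
    [("AA", "B")] * 3 + [("AB", "A")],
    [("AA", "B")] * 2 + [("AB", "A")] * 2,
]
best, certainty = most_likely_haplotype(samples)
```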
3 SIMULATION STUDY
In this section, we present the details of the simulation study to evaluate the performance of our method for inferring allele configurations, CNV–SNP haplotypes and missing CNV genotypes. We simulated phase-known datasets based on data obtained from French and Finnish population cohorts, respectively, with different technologies for obtaining the CN status and using different genotyping chips. The French dataset contains fewer samples but denser CNV–SNP genotypes, while the Finnish dataset contains more samples but less dense genotypes.
3.1 The French samples
We obtained data for X chromosomes from 48 males of northern French origin who were genotyped both on the Illumina 1M platform and on a 244K aCGH platform. The 244K aCGH chips, custom-designed for focussed investigation of putative CNV regions, provide information on the locations of CNV regions as well as CNs in these regions across the entire genome (de Smith et al., 2007). In aCGH, test and reference DNA samples, which are labelled differentially with fluorescent tags, are competitively hybridized to genomic arrays. The fluorescence ratio of test and reference hybridization signal is then determined at different positions along the genome, which provides information on the difference in CNs between test and reference samples.
CNV regions on non-pseudo-autosomal regions of the male X chromosome were identified from 244K aCGH chip data using the ADM2 algorithm developed by Agilent Technologies (Santa Clara, CA, USA), which recursively searches for CNV intervals based on log R ratios (LRRs) of fluorescent signals from probes between test and reference DNA samples (de Smith et al., 2007). A single sample from the Coriell Cell Repository (NA15510) was used as reference. The boundary and size of the CNV intervals are defined on the basis of the positions of the first and last array probes identified as lying within the CNV. The integer CN of the CN region was set to 0 if the average LRR of probes within the region was less than −0.5 (i.e. deletion) and was set to 2 otherwise (i.e. amplification on male haploid background). Haploid SNP genotypes in non-CNV regions were obtained from BeadStudio, using the Illumina 1M chip. Within amplified regions, two-copy SNP genotypes were estimated from a Gaussian mixture model using the B-allele frequency from BeadStudio. For this dataset, we analysed a 2.7 Mb non-pseudo-autosomal region of the X chromosome (151 881 226–154 588 828 bp based on NCBI build 36). This region has 1904 aCGH probes (equivalent to 1 probe every 1.4 kb) and 1058 Illumina SNP probes.
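The CN assignment rule for a region already flagged as a CNV can be written in a few lines. This is a sketch of the thresholding step only, with a hypothetical function name; it does not reproduce the ADM2 interval search.

```python
def call_region_cn(probe_lrrs, deletion_threshold=-0.5):
    """Integer CN of a detected CNV region on a male (haploid) X chromosome.

    Returns 0 (deletion) when the mean LRR of the probes in the region is
    below the threshold, and 2 (amplification) otherwise.
    """
    mean_lrr = sum(probe_lrrs) / len(probe_lrrs)
    return 0 if mean_lrr < deletion_threshold else 2

cn_deleted = call_region_cn([-1.2, -0.9, -1.1])
cn_amplified = call_region_cn([0.4, 0.6])
```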
3.2 The Finnish samples
We also assessed our method using a larger dataset from the Northern Finland Birth Cohort (NFBC), from which we obtained non-pseudo-autosomal X-chromosome genotype data on 695 Finnish males assayed on Illumina Hap370 chips. aCGH data were unavailable for this cohort; however, Illumina's BeadStudio software generates the log ratio of observed to expected fluorescent signal intensity (LRR), as well as a normalized measure of relative signal intensity between the two SNP alleles, the B allele frequency (BAF), which can be used to detect CNV regions and infer CN genotypes (Colella et al., 2007; Wang et al., 2007). Haploid SNP genotypes were obtained from BeadStudio, while two-copy SNP genotypes within amplifications were obtained on the basis of BAF. For this dataset, we analysed a 20.9 Mb region on the X chromosome (19 502 220–40 491 848 bp based on NCBI build 36), which contains 2149 markers.
3.3 Simulation of phase-known genotypes
We randomly combined SNP/CNV genotypes on male X chromosomes into pairs to create diploid genomes with up to four copies for non-internal phasing (Fig. 2). We created 24 'non-internal' phase-known diploid genomes in the French dataset and 347 genomes in the Finnish dataset. These samples were inappropriate for internal phasing as the 'internal' haplotypes comprising the amplifications on each X-chromosome copy are not known. Thus, to evaluate internal phasing, we masked X-chromosome amplifications, and randomly grouped these X chromosomes into 15/231 French/Finnish triploid genomes, so that we obtained internal+external phase-known SNP/CNV genotype data with up to three copies.
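The pairing procedure can be sketched as a shuffle followed by grouping, with the unphased genotype at each marker obtained by pooling the alleles of the grouped haplotypes. The function names, the fixed seed and the list-of-alleles representation of a haplotype are assumptions made for illustration.

```python
import random

def pair_haplotypes(haplotypes, group_size=2, seed=0):
    """Shuffle haplotypes and split them into groups (leftovers are dropped)."""
    rng = random.Random(seed)
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    n_groups = len(shuffled) // group_size
    return [
        tuple(shuffled[i * group_size:(i + 1) * group_size])
        for i in range(n_groups)
    ]

def pooled_genotype(group, marker):
    """Unordered genotype at one marker: the sorted pool of all bases."""
    bases = "".join(h[marker] for h in group).replace("-", "")
    return "".join(sorted(bases)) or "-"

# 48 placeholder haplotype identifiers grouped into 24 diploid genomes.
diploids = pair_haplotypes(list(range(48)), group_size=2)

# Two toy haplotypes over three markers; '-' denotes a deletion.
toy_group = (["A", "AB", "-"], ["B", "A", "A"])
toy_genotypes = [pooled_genotype(toy_group, m) for m in range(3)]
```

Because the grouped haplotypes are retained, the true phase of every simulated genotype is known and can be compared with the inferred phase.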
3.4 Switch error rate
The switch error rate for each individual is defined as ψ/(n − 1), where n denotes the number of heterozygous sites for that individual and ψ the minimal number of switches needed to recover the true haplotypes. We assumed that at most one switch could occur between consecutive heterozygous sites.
For each individual, we determined if there was a switch by comparing the inferred haplotypes to the true haplotypes. If a discrepancy is identified at a heterozygous marker m, a switch error is counted and a switch is introduced in the inferred haplotypes to ensure that it matches the true haplotypes up to marker m. To identify a discrepancy, it is only necessary to compare haplotype sets as far back as to distinguish N distinct preceding haplotypes (N is the ploidy), which in diploids requires looking back to the previous heterozygous marker only.
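For the diploid case, the switch error rate can be computed by tracking, at each heterozygous site, whether the inferred pair is in the same or the swapped orientation relative to the truth, and counting orientation changes between consecutive heterozygous sites. The sketch below covers diploids only and assumes that the inferred alleles at every site match the true genotype; the function name is hypothetical.

```python
def switch_error_rate(true_pair, inferred_pair):
    """Diploid switch error rate: switches / (heterozygous sites - 1)."""
    t1, t2 = true_pair
    i1, i2 = inferred_pair
    orientations = []
    for a, b, x, y in zip(t1, t2, i1, i2):
        if a == b:
            continue  # homozygous sites carry no phase information
        if (x, y) == (a, b):
            orientations.append(0)   # same orientation as the truth
        elif (x, y) == (b, a):
            orientations.append(1)   # swapped orientation
        else:
            raise ValueError("inferred alleles do not match the true genotype")
    n = len(orientations)
    if n < 2:
        return 0.0
    switches = sum(o1 != o2 for o1, o2 in zip(orientations, orientations[1:]))
    return switches / (n - 1)

# Four heterozygous sites, with one switch between the second and third.
rate = switch_error_rate(("AABAB", "BBAAA"), ("AAAAA", "BBBAB"))
```

Haplotypes may also be passed as lists of allele strings, so that CNV alleles such as "AA" or "-" are compared as whole alleles.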
4 RESULTS
4.1 Missing data imputation
We first examined the accuracy of our method for missing data imputation with both French and Finnish data. In each dataset, either 5% or 10% of genotypes with one to four copies of alleles were set as missing at random. We report the proportion of missing genotypes for each CN that were estimated incorrectly (imputation error rate). Table 1 shows the imputation error rate in the French and Finnish datasets. Overall, our method provides accurate estimates of missing genotypes. For both missing rates (5% and 10%), our method gives an imputation error rate of at most 0.09.
Table 1.
Error rate for estimation of missing genotype, by CN of genotype
Dataset | Missing rate | CN 1 | CN 2 | CN 3 | CN 4 |
---|---|---|---|---|---|
French | 5% | 0.020 | 0.034 | 0.060 | 0.0 |
French | 10% | 0.009 | 0.030 | 0.090 | 0.0 |
Finnish | 5% | 0.062 | 0.075 | 0.053 | 0.028 |
Finnish | 10% | 0.050 | 0.081 | 0.050 | 0.027 |
4.2 Allelic configuration inference
We assess the performance of our method for inferring allelic configuration on a pair of haplotypes (such as AA/B versus A/AB). Table 2 presents the distribution of CNs observed on all markers and the error rate of estimated allele configurations. In the French data, there are 6317 and 754 3-CN and 4-CN genotypes of which 1075 and 94 are heterozygous, respectively. The allelic configuration is ambiguous for all of these heterozygous 3-CN and 4-CN genotypes (excluding genotypes AAAB and ABBB for which AA/AB and AB/BB, respectively, are the only possible configurations). The allelic configuration error rate amongst these ambiguous 3-CN and 4-CN genotypes is 0.119 and 0.0, respectively. In the Finnish data, the corresponding error rates are 0.016 and 0.188, based on 24 106 and 1572 heterozygous 3-CN and 4-CN genotypes.
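To make the notion of an ambiguous allelic configuration concrete, the sketch below enumerates the ways a genotype can be split across a pair of haplotypes. It assumes that each haplotype carries at most two copies, so that the possible haplotype alleles are {−, A, B, AA, AB, BB}; the function name `allelic_configurations` and the use of the empty string for the deletion allele "−" are illustrative choices, not taken from polyHap.

```python
from itertools import combinations
from typing import List, Tuple


def allelic_configurations(genotype: str, max_copies: int = 2) -> List[Tuple[str, str]]:
    """List the distinct unordered splits of a genotype over two haplotypes.

    `genotype` is a string of alleles such as "AAB"; each haplotype may
    carry between 0 and `max_copies` alleles (an assumption of this sketch).
    """
    alleles = sorted(genotype)
    n = len(alleles)
    configs = set()
    for size in range(n + 1):
        # Skip splits that would put too many copies on either haplotype.
        if size > max_copies or n - size > max_copies:
            continue
        for idx in combinations(range(n), size):
            first = "".join(alleles[k] for k in idx)
            second = "".join(alleles[k] for k in range(n) if k not in idx)
            # The pair of haplotypes is unordered, so store it in a canonical order.
            configs.add(tuple(sorted((first, second))))
    return sorted(configs)


if __name__ == "__main__":
    for g in ("AAB", "AABB", "AAAB", "ABBB"):
        print(g, allelic_configurations(g))
```

Under this assumption, AAB and AABB each admit two configurations (A/AB or AA/B, and AA/BB or AB/AB), whereas AAAB and ABBB admit only one, which is why the latter two are excluded from the ambiguous genotypes.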
Table 2. The distribution of CN and error rate of estimation of allelic configuration at heterozygous sites, by CN of genotype

Dataset | Quantity | CN 1 | CN 2 | CN 3 | CN 4 |
---|---|---|---|---|---|
French | CNs | 1155 | 18 318 | 6317 | 754 |
French | Heterozygous genotypes | 0 | 2932 | 1075 | 94 |
French | Error rate of allelic configuration | NA | NA | 0.119 | 0.0 |
Finnish | CNs | 5609 | 664 847 | 60 010 | 15 237 |
Finnish | Heterozygous genotypes | 0 | 210 239 | 24 106 | 1572 |
Finnish | Error rate of allelic configuration | NA | NA | 0.016 | 0.188 |
4.3 Inference of haplotypic phase of CN state relative to flanking SNPs (non-internal phasing)
We assessed the performance of our method for haplotypic phase inference using the switch error rate. In this case, the CNV/SNP alleles consist of {−, A, B, AA, AB, BB} and we do not distinguish the order of alleles within an amplification. Hence, when calculating the switch error rate, homozygous 3-CN genotypes (A/AA or B/BB) are considered as heterozygous sites, as the CNV/SNP alleles are different for each haplotype. Homozygous 4-CN genotypes (AA/AA or BB/BB) are still considered as homozygous sites. In calculating the switch error, we excluded the sites where the allelic configurations were incorrectly inferred.
For the French data, the overall switch error rate is 0.015. We then classified transitions by the ‘from’ and ‘to’ genotype CN (denoted by N₁ → N₂) to obtain the error rates in Table 3. The number of observed N₁ → N₂ transitions is shown in brackets. Overall, the switch error rates are <0.34, apart from two cases where the error rates are 0.57 and 0.41 at heterozygous sites with CNs 1 → 3 and 2 → 3, respectively. The reduced accuracy in the French data at such sites is due to the fact that the number of observations is small, and moreover may not all occur at the same CN breakpoints. Accuracy for these CN transitions is improved by increasing the population size, as can be seen in the corresponding results for the Finnish data.
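The classification of switch errors by CN transition can be illustrated with a small tabulation. This is a sketch with made-up records: the function name `switch_error_by_transition` and the record layout (first-site CN, second-site CN, whether a switch error occurred) are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Record = Tuple[int, int, bool]  # (CN at first site, CN at second site, switch error?)


def switch_error_by_transition(records: Iterable[Record]) -> Dict[Tuple[int, int], Tuple[float, int]]:
    """Map each observed (N1, N2) transition to (switch error rate, number of observations)."""
    counts: Counter = Counter()
    errors: Counter = Counter()
    for n1, n2, is_error in records:
        counts[(n1, n2)] += 1
        if is_error:
            errors[(n1, n2)] += 1
    return {t: (errors[t] / counts[t], counts[t]) for t in counts}


if __name__ == "__main__":
    # Toy data: a rare transition with 7 observations and a common one with 10.
    toy = [(1, 3, True)] * 4 + [(1, 3, False)] * 3 + [(2, 2, False)] * 10
    for transition, (rate, n) in sorted(switch_error_by_transition(toy).items()):
        print(transition, round(rate, 3), n)
```

With only seven observations of a transition, the estimated rate can change only in steps of 1/7, which illustrates why cells with few observations give coarse and unstable error rates.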
Table 3. Switch error rate for non-internal phasing, by CN on the first site (N₁, rows) and CN on the second site (N₂, columns); number of observed transitions in brackets

Dataset | N₁ | N₂ = 1 | N₂ = 2 | N₂ = 3 | N₂ = 4 |
---|---|---|---|---|---|
French | 1 | 0.0009 (1062) | 0.26 (15) | 0.571 (7) | NA (0) |
French | 2 | 0.2 (15) | 0.036 (2866) | 0.413 (29) | 0.0 (1) |
French | 3 | 0.333 (6) | 0.322 (31) | 0.0008 (6150) | 0.0 (2) |
French | 4 | NA (0) | NA (0) | 0 (3) | 0.057 (87) |
Finnish | 1 | 0.067 (1373) | 0.360 (3022) | 0.396 (232) | 0.142 (7) |
Finnish | 2 | 0.383 (2810) | 0.071 (204 467) | 0.264 (2551) | 0.386 (101) |
Finnish | 3 | 0.282 (436) | 0.158 (2163) | 0.001 (56 688) | 0.188 (303) |
Finnish | 4 | 0.333 (9) | 0.235 (289) | 0.357 (112) | 0.076 (286) |
Figure 4 shows the error rate for each transition in the Finnish data distributed according to certainty score. In general, estimates with a higher certainty score provide more reliable inference. However, in some cases, such as CNs 4 → 3, only a low proportion of estimates have a high certainty score (>0.9). Supplementary Figure S1 shows that the level of LD between SNPs and CNVs, as measured by r², is inversely correlated with switch error.
Fig. 4.
Histogram of certainty scores and switch error rate in each bin from the Finnish dataset. The circles indicate average switch error rates within each histogram bin. The error bar of each switch error rate is based on a 95% equal-tailed Bayesian interval given the prior Beta(1/2, 1/2). The number on the top of each cell graph represents CNs at a pair of heterozygous sites (first digit is the CN at the first site and second is the CN at the second site).
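The error bars described in the caption can be approximated as follows. The sketch assumes a binomial model for the number of switch errors in a bin, under which the Beta(1/2, 1/2) prior gives a Beta(k + 1/2, n − k + 1/2) posterior for k errors in n observations. The quantiles are estimated by Monte Carlo sampling with the standard library rather than computed exactly, and the function name `equal_tailed_interval` is hypothetical; how the authors computed their intervals is not stated in the text.

```python
import random
from typing import Tuple


def equal_tailed_interval(errors: int, total: int, level: float = 0.95,
                          draws: int = 20000, seed: int = 0) -> Tuple[float, float]:
    """Approximate equal-tailed Bayesian interval for an error rate.

    Uses the Beta(errors + 1/2, total - errors + 1/2) posterior that follows
    from a Beta(1/2, 1/2) prior and a binomial likelihood (assumed here).
    """
    rng = random.Random(seed)
    a = errors + 0.5
    b = total - errors + 0.5
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    # Empirical quantiles of the posterior draws at tail and 1 - tail.
    lower = samples[int(tail * draws)]
    upper = samples[int((1.0 - tail) * draws) - 1]
    return lower, upper


if __name__ == "__main__":
    low, high = equal_tailed_interval(errors=4, total=7)
    print(f"{low:.2f} to {high:.2f}")
```

Bins with few observations yield wide intervals, so the width of an error bar indicates how much weight the average switch error rate in that bin can bear.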
To compare the results with those from CNVphaser and MOCSphaser, we chose three and eight sites in two different CNV regions from the French data. The maximum number of CNV sites used in the original CNVphaser article (Kato et al., 2008a) is eight. MOCSphaser could not be run on the eight site data because it ran out of memory on a 32 GB machine. We also attempted to run CNVphaser using the same number of sites as presented for polyHap in Table 3, but we found that the scale of our simulated dataset was not computationally feasible for CNVphaser.
CNVphaser and MOCSphaser both return a posterior probability distribution over possible haplotypes given the observed genotypes. We selected the haplotype with the highest probability as the inferred haplotype. We show the number of individuals whose genotypes are not correctly phased at any heterozygous sites in Table 4. The CNV genotypes at three sites are all correctly phased by both polyHap and CNVphaser/MOCSphaser. For the genotypes at eight sites, the results from polyHap show that only one individual has a single switch error over all the sites, while most of the inferred haplotypes from CNVphaser are incorrect. The allele configurations are incorrectly inferred in most heterozygous sites by CNVphaser.
Table 4. Comparison between polyHap and CNVphaser/MOCSphaser: number of individuals having switch error

Number of sites | polyHap | CNVphaser | MOCSphaser |
---|---|---|---|
3 | 0 | 0 | 0 |
8 | 1 | 24 | NA |
Previous studies have also used fastPhase and Beagle for CNV phasing (Conrad et al., 2009). This approach is limited to phasing bi-allelic CNVs relative to flanking SNPs not in CNV regions, which is achieved by recoding bi-allelic CNV genotypes as SNP genotypes. To compare our method to this approach, we removed multi-allelic CNVs from the Finnish dataset, and also masked SNPs within CNV regions, and finally encoded CNV genotypes as SNP genotypes. We then ran each of fastPhase/Beagle and polyHap on this dataset (Supplementary Table 1). Comparing with Table 3, we see that switch error rates for polyHap have markedly increased in most cases due to loss of information from masking SNPs. Comparing algorithms on the masked dataset (Supplementary Table 1), we see that polyHap and fastPhase had comparable switch error rates between SNPs and CNVs with different CN transitions, while Beagle had higher error rates on these CN transitions except for CN 3 → 2.
Finally, to test polyHap's accuracy on a non-simulated dataset, we successfully phased a region on chromosome 2 containing a known short (<3 kb) deletion at 229.467 Mb, using data generated by a 244K Agilent array CGH chip (de Smith et al., 2007). A consistent haplotype including this deletion was detected (Supplementary Fig. S2). This deletion has been previously verified by polymerase chain reaction (PCR) across the breakpoints followed by sequencing (de Smith et al., 2007).
4.4 Inference of haplotypic phase of SNPs within CN states (internal phasing)
Internal phasing can be considered as a tool for further investigating haplotypic phase of duplicated alleles locally. Thus, we report the switch error rate at sites that have the same CN. Note that here we only consider up to a single copy amplification at the genotype level. Table 5 gives the switch error rate between a pair of consecutive heterozygous sites, which have the same CN. The count of each pair of CN is shown in parentheses. For both French and Finnish datasets, the error rates are ≤0.08 for locally inferring haplotypic phase of duplicated alleles.
Table 5. Switch error rate for internal phasing at a pair of sites with the same CN; number of pairs in parentheses

Dataset | 1 → 1 | 2 → 2 | 3 → 3 |
---|---|---|---|
French | 0 (34) | 0.005 (1034) | 0.070 (3514) |
Finnish | 0.002 (351) | 0.056 (864) | 0.080 (229 244) |
5 DISCUSSION
We have presented a method for inferring haplotypic phase for CNV/SNP genotype data among unrelated individuals. Our method allows CNV regions and ploidy to vary along the sequence and between the individuals. Our program accommodates both CNV and SNP genotype data and infers missing genotypes and haplotypic phase for both types of data. Our method allows uncertainty in the CN assignment by representing the CNV genotype as a probability distribution over multiple CNV genotypes.
It is necessary to first calculate CNV genotypes prior to running our program. In particular, polyHap does not accommodate a continuous measurement in place of the integer CN genotypes. In principle, polyHap can include an arbitrary maximum number of copies. However, the computational complexity scales roughly as (number of copies)² for non-internal phasing and as e^(number of copies) for internal phasing, meaning that internal phasing is feasible for up to 6 copies and non-internal phasing for up to 20 copies. In addition, polyHap cannot model complicated structural rearrangements, including inversions and translocations.
polyHap requires a pre-defined number of ancestral haplotypes. In this study, we use eight ancestral haplotype states for internal phasing (two CN = 0 and six CN = 1 states) and nine haplotype states for non-internal phasing (one CN = 0, four CN = 1 and four CN = 2 states). We have also tried different numbers of ancestral haplotypes and found that the results are comparable. We would suggest using a higher number of ancestral haplotypes when dealing with rare variants. The choice of ancestral haplotype number usually does not depend on the sample size but rather on the number of haplotypes present in the population. Thus, if a very diverse, heterogeneous population or a mixture of several populations were being analysed, then it would be advisable to include more states.
The results from the simulation study demonstrate that our program provides accurate estimates of missing genotypes, allele configuration and haplotypic phase for both CNV and SNP data. Our method gives an imputation error rate of at most 0.09 for imputing missing genotypes with one to four copies of alleles. Also, our method provides accurate estimates of allele configurations on a pair of haplotypes, with an error rate <0.19. Furthermore, polyHap successfully identified a haplotype comprising a short deletion on chromosome 2. Our method gives encouraging results for inferring CNV haplotypic phase over different CNs at heterozygous sites. Although there are several situations where the switch error rate is >0.3, this might result from rare haplotypes in the dataset, and the accuracy here would be improved by using a larger population sample. Also, reliable phase inferences can be distinguished using the uncertainty estimates. In general, a higher certainty score indicates higher accuracy of the estimate.
polyHap outperforms two existing methods for phasing CNV-SNP haplotypes—CNVphaser and MOCSphaser—in terms of accuracy and capacity of dealing with large-scale datasets. Comparing our method with fastPhase/Beagle for phasing bi-allelic CNV, polyHap is comparable to fastPhase and gives more accurate estimates than Beagle in most cases of CN transitions. One advantage of our new method over fastPhase/Beagle for phasing CNV–SNP haplotypes is that polyHap is designed for inferring CNV–SNP haplotypes and is able to accommodate some properties of CNV that differ from SNPs and to deal with multi-allelic CNV.
Our program provides two different levels of CNV phasing: non-internal and internal. With internal phasing, the individual is considered as polyploid, and thus the phasing process is similar to that described for polyHap (Su et al., 2008b). Internal phasing enables inference of the duplicated and original haplotype, but does not say which chromosome copy contains the amplification. Non-internal phasing, on the other hand, provides information about which chromosome copy contains the CNV, but not the internal structure of duplications. By providing both options, our program enables the researcher to choose a suitable level of phasing for the specific purposes of the study.
Our method is faster than CNVphaser, and is feasible for genome-wide analyses using a computing cluster. The computing time for the French dataset (1106 markers on each of 24 individuals), with nine ancestral haplotype states and 10 repetitions of the EM-training algorithm, was ∼0.8 h on an 8 GB computer, while the Finnish dataset (2149 markers on each of 347 individuals) took 1.5 h on a 16 GB computer. The computing time increases linearly with the number of markers and individuals.
Modelling the haplotypic background of CNVs will provide a better understanding of the evolutionary processes affecting CNVs. Moreover, it will help us to better model CNV–phenotype associations—to make CNV–disease associations more robust by simultaneously identifying the underlying haplotype harbouring the CNV and to disentangle associations between CNVs and phenotype from associations with flanking SNPs.
ACKNOWLEDGEMENTS
We thank Rob Sladek for providing Illumina data and Adam de Smith for providing aCGH data. The DNA extractions, sample quality controls, biobank up-keeping and aliquotting for the NFBC were performed in the National Public Health Institute, Biomedicum Helsinki, Finland. Genotyping of the NFBC samples was supported by the National Institute of Mental Health.
Funding: Research Council UK fellowship (to L.J.M.C.); Genome Canada and Genome Quebec funded genotyping on the French samples; the NFBC1966 received financial support from the Academy of Finland (project grants 104781, 120315, 132797, and Center of Excellence in Complex Disease Genetics); University Hospital Oulu, Biocenter, University of Oulu, Finland; the European Community's Fifth/Seventh Framework Programme (EURO-BLCS, QLG1-CT-2000-01643, FP7/2007-2013); NHLBI grant 5R01HL087679-02 through the STAMPEED program (1RL1MH083268-01); ENGAGE project (HEALTH-F4-2007-201413); the Medical Research Council (centre grant G0600705); the Wellcome Trust (project grant GR069224), UK; the National Institute of Health Research (NIHR) Biomedical Research Centre Programme at Imperial College; the DNA extractions, sample quality controls, biobank up-keeping and aliquotting were performed in the National Public Health Institute, Biomedicum Helsinki, Finland, and supported financially by the Academy of Finland and Biocentrum Helsinki.
Conflict of Interest: none declared.
REFERENCES
et al. A robust statistical method for case-control association testing with copy number variation, Nat. Genet., 2008, vol. 40 (pg. 1245-1252)
A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals, Am. J. Hum. Genet., 2009, vol. 84 (pg. 210-223)
Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering, Am. J. Hum. Genet., 2007, vol. 81 (pg. 1084-1097)
et al. QuantiSNP: an objective Bayes hidden-Markov model to detect and accurately map copy number variation using SNP genotyping data, Nucleic Acids Res., 2007, vol. 35 (pg. 2013-2025)
et al. Origins and functional impact of copy number variation in the human genome, Nature, 2009, vol. 464 (pg. 704-712)
et al. Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases, Hum. Mol. Genet., 2007, vol. 16 (pg. 2783-2794)
et al. Small deletion variants have stable breakpoints commonly associated with Alu elements, PLoS ONE, 2008, vol. 3 pg. e3104
et al. Structural variation in the human genome, Nat. Rev. Genet., 2006, vol. 7 (pg. 85-97)
et al. Accurate and reliable high-throughput detection of copy number variation in the human genome, Genome Res., 2006, vol. 16 (pg. 1566-1574)
et al. An algorithm for inferring complex haplotypes in a region of copy-number variation, Am. J. Hum. Genet., 2008, vol. 83 (pg. 157-169)
et al. MOCSphaser: a haplotype inference tool from a mixture of copy number variation and single nucleotide polymorphism data, Bioinformatics, 2008, vol. 24 (pg. 1645-1646)
A block-free hidden Markov model for genotypes and its application to disease association, J. Comput. Biol., 2005, vol. 12 (pg. 1243-1260)
et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs, Nat. Genet., 2008, vol. 40 (pg. 1253-1260)
et al. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data, Bioinformatics, 2005, vol. 21 (pg. 3763-3770)
et al. Incorporating single-locus tests into haplotype cladistic analysis in case-control studies, PLoS Genet., 2007, vol. 3 (pg. 0421-0430)
et al. Whole genome association mapping by incompatibilities and local perfect phylogenies, BMC Bioinformatics, 2006, vol. 7 pg. 454
Copy-number variation and association studies of human disease, Nat. Genet., 2007, vol. 39 Suppl. 7 (pg. S37-S42)
et al. Haplotype inference from unphased SNP data in heterozygous polyploids based on SAT, BMC Genomics, 2008, vol. 9 pg. 356
et al. Circular binary segmentation for the analysis of array-based DNA copy number data, Biostatistics, 2004, vol. 5 (pg. 557-572)
et al. Global variation in copy number in the human genome, Nature, 2006, vol. 444 (pg. 444-454)
et al. Genome-wide detection and characterization of positive selection in human populations, Nature, 2007, vol. 449 (pg. 913-918)
A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase, Am. J. Hum. Genet., 2006, vol. 78 (pg. 629-644)
Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation, Am. J. Hum. Genet., 2005, vol. 76 (pg. 449-462)
et al. Disease association tests by inferring ancestral haplotypes using a hidden Markov model, Bioinformatics, 2008, vol. 24 (pg. 972-978)
et al. Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions, BMC Bioinformatics, 2008, vol. 9 pg. 513
et al. PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data, Genome Res., 2007, vol. 17 (pg. 1665-1674)
Author notes
Associate Editor: Jeffrey Barrett
© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org