Copy number variant detection in inbred strains from short read sequence data (original) (raw)
Journal Article
,
Wellcome Trust Sanger Institute, Hinxton, CB10 1HH, UK
Search for other works by this author on:
,
Wellcome Trust Sanger Institute, Hinxton, CB10 1HH, UK
Search for other works by this author on:
,
Wellcome Trust Sanger Institute, Hinxton, CB10 1HH, UK
Search for other works by this author on:
Wellcome Trust Sanger Institute, Hinxton, CB10 1HH, UK
* To whom correspondence should be addressed.
Search for other works by this author on:
Revision received:
21 November 2009
Accepted:
15 December 2009
Published:
18 December 2009
Cite
Jared T. Simpson, Rebecca E. McIntyre, David J. Adams, Richard Durbin, Copy number variant detection in inbred strains from short read sequence data, Bioinformatics, Volume 26, Issue 4, February 2010, Pages 565–567, https://doi.org/10.1093/bioinformatics/btp693
Close
Navbar Search Filter Mobile Enter search term Search
Abstract
Summary: We have developed an algorithm to detect copy number variants (CNVs) in homozygous organisms, such as inbred laboratory strains of mice, from short read sequence data. Our novel approach exploits the fact that inbred mice are homozygous at virtually every position in the genome to detect CNVs using a hidden Markov model (HMM). This HMM uses both the density of sequence reads mapped to the genome, and the rate of apparent heterozygous single nucleotide polymorphisms, to determine genomic copy number. We tested our algorithm on short read sequence data generated from re-sequencing chromosome 17 of the mouse strains A/J and CAST/EiJ with the Illumina platform. In total, we identified 118 copy number variants (43 for A/J and 75 for CAST/EiJ). We investigated the performance of our algorithm through comparison to CNVs previously identified by array-comparative genomic hybridization (array CGH). We performed quantitative-PCR validation on a subset of the calls that differed from the array CGH data sets.
Availability: The software described in this manuscript, named cnD for copy number detector, is free and released under the GPL. The program is implemented in the D programming language using the Tango library. Source code and pre-compiled binaries are available at http://www.sanger.ac.uk/resources/software/cnd.html
Contact: rd@sanger.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Copy number variants (CNVs) are segments of DNA that have been duplicated, or lost, in the genome of one individual or strain with respect to another. CNVs are thought to contribute significantly to phenotypic differences between mouse strains. In humans, CNVs have been causally linked to a range of disorders including schizophrenia (Moon et al., 2006), autism (Sebat et al., 2007) and birth defect syndromes (Lu et al., 2008). High-resolution surveys for CNVs have been performed in common laboratory strains of mice using array-comparative genomic hybridization (array CGH) (Cahan et al., 2009; Cutler et al., 2007; Graubert et al., 2007; Henrichsen et al., 2009; She et al., 2008). These studies have found a significant level of variation between strains, such that as much as 15% of the reference C57BL/6J mouse genome may be found as CNVs in another strain. While array CGH can be an effective way of identifying CNVs, aCGH studies are limited in resolution by the number of probes that can be placed on a microarray. The widespread adoption of short read sequencing platforms has led to a rapid decrease in the cost of whole-genome re-sequencing making it a viable alternative to array CGH (Xie and Tammi, 2009). Hidden Markov Models (HMM) have previously been used to detect copy number variation from array CGH data (Cahan et al., 2008; Fridlyand et al., 2004). We have developed a HMM to detect CNVs in inbred strains from the alignments of short read sequences to a reference genome.
2 DESCRIPTION
The central idea behind our model is that the alignment of reads from regions with copy number gains (with respect to a reference genome) will be ‘collapsed’ to a single location on the reference genome. The effect of this will be 2-fold. First, the sequence depth of this location on the reference genome will be increased by an integral amount corresponding to the relative number of copies that exist in the sequenced strain. Second, any base-pair differences between the copied regions will appear to be heterozygous single nucleotide polymorphisms (SNPs) with respect to the reference. This fact is crucial to our model as laboratory strains of mice are inbred to be effectively homozygous at every position in the genome, hence any apparent heterozygous SNPs that are not sequencing errors are actually paralogous sequence variants and therefore define regions collapsed in the reference genome. Conversely, the alignment of reads from regions with copy number losses in the sequenced genome will be distributed over the corresponding copies in the reference genome and hence the reference regions will have lower sequence depth, with the important distinction that there will not be a heterozygous SNP signal. Our HMM exploits these factors to detect regions of copy number gain and loss.
Our algorithm proceeds in three stages. First, the sequence reads are aligned to the mouse reference genome (build NCBI 37, Mouse Genome Sequencing Consortium, Waterston et al., 2002) using the MAQ aligner (Li et al., 2008). MAQ calls SNPs and classifies them as homozygous or heterozygous. Summary statistics are computed for the sequence read depth, the number of heterozygous SNPs and the average number of hits per read over 1 kb windows of the reference genome sequence. This triplet of data for each 1 kb region of the reference genome is input to the HMM which classifies each region as corresponding to a gain, loss or no change in copy number.
2.1 The HMM
We developed a 10-state HMM of the copy number structure of the genome being sequenced. There are five major states of the model, representing normal sequence, a 2-fold increase in copy number, a 3-fold increase in copy number, a 2-fold decrease in copy number and zero copy number. In addition, each major state of the model has a sub-state corresponding to highly repetitive sequence, allowing the model to accommodate the frequent high-copy repeat elements dispersed throughout mammalian genomes. In all states expect for the repeat states the depth distribution is modeled by a normal distribution with the mean and variance reflecting the copy number of the state. For states representing a copy number gain, the heterozygous SNP rate is modeled by a negative binomial distribution. The heterozygous SNP rate is modeled by a Poisson distribution in all other states. More information about the HMM and emission distributions is given in the supplemental material.
The parameters of the model are learned for each chromosome in the input data set by Viterbi training for both the transition probabilities and emission distribution parameters (Durbin et al., 1998). After the model parameters have been determined, the sequence of states is computed by a final application of the Viterbi algorithm. The output of the Viterbi algorithm is processed to extract contiguous regions of gain or loss. The minimum threshold for detection is the input window size, typically one kilobase. There is a final optional filtering step to remove calls below a minimum size threshold.
3 RESULTS
We tested our model on Illumina short read sequence data from chromosome 17 for the A/J and CAST/EiJ strains of mouse that were sequenced to 22- and 34-fold, respectively (ERA accession number ERA000077). The data sets were generated using 36-bp paired-end reads of 200-bp insert libraries. For this experiment, we set a minimum call size threshold of 10 kb (see Supplementary data). We evaluated our calls against a collection of previously published aCGH copy number variation data (Cahan et al., 2009; Cutler et al., 2007; Henrichsen et al., 2009; She et al., 2008).
Our algorithm called 22 copy number gains (1.38 Mb of sequence) and 21 losses (0.49 Mb) for the A/J data set (see Fig. 1 and Supplementary Fig. 6 for example regions). The gain regions overlap 38% of the regions identified by aCGH (36% by sequence, 1.1 Mb). Seventy-seven percent of the gains cnD found were previously seen by aCGH. For CAST/EiJ, 45 gains (2.44 Mb of sequence) and 30 losses (1.16 Mb) were called. The gain regions overlap 76% of the gains called by aCGH (79% by sequence, 1.3 Mb). Thirty-six percent of the gains found by cnD were previously seen in the array CGH data set. This figure is much lower than that of A/J due to the fact that the CAST/EiJ strain was not used in the highest coverage aCGH study (Cahan et al., 2009). In both strains the regions of copy number loss called by our algorithm and aCGH differed widely (11% concordance by region for A/J and 32% for CAST/EiJ) owing to the relative difficulty of calling CNV losses compared to gains. We performed qPCR validation on a subset of both the gain calls that were novel to our algorithm (those not found by aCGH) and the novel gain calls found by aCGH. In total we attempted validation on 20 novel cnD gains, of which five were confirmed to be amplified relative to C57BL/6J. Of the 14 novel aCGH gains that we attempted to validate, one was confirmed to be a gain relative to C57BL/6J. Our concordance with array CGH and initial confirmation rates are similar to previously published copy number variation studies (Conrad et al., 2009; Redon et al., 2006; Scherer et al., 2007). Full details of the experimental validation are provided in the Supplementary data.
Fig. 1.
(A) Plot of sequencing depth across a one megabase region of A/J chromosome 17 clearly shows both a region of 3-fold increased copy number (30.6–31.1 Mb) and a region of decreased copy number (at 31.3 Mb). The solid black line above the depth plot indicates the called copy number gain and the solid black line below the plot indicates the called copy number loss. (B) Plot of the heterozygous SNP rate for the same region showing the high number of apparent heterozygous SNPs associated with the copy number gain.
ACKNOWLEDGMENTS
The authors would like to thank Thomas Keane and Jim Stalker for implementing the initial data processing pipeline and Ian Sudbery for generating the chromosome 17 sequencing data.
Funding: Medical Research Council-UK and the Wellcome Trust (077192/Z/05/Z); Cancer-Research-UK to D.J.A.
Conflict of Interest: none declared.
REFERENCES
et al.
wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data
,
Nucleic Acids Res.
,
2008
, vol.
36
pg.
e41
et al.
The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells
,
Nat. Genet.
,
2009
, vol.
41
(pg.
430
-
437
)
et al.
Origins and functional impact of copy number variation in the human genome
,
Nature
,
2009
[Epub ahead of print, doi: 1038/nature08516, October 7, 2009]
et al.
Significant gene content variation characterizes the genomes of inbred mouse strains
,
Genome Res.
,
2007
, vol.
17
(pg.
1743
-
1754
)
et al.
Markov chains and hidden Markov models
,
Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
,
1998
Cambridge, UK; New York
Cambridge University Press
pg.
356
et al.
Hidden Markov models approach to the analysis of array CGH data
,
J. Multivar. Anal.
,
2004
, vol.
90
(pg.
132
-
153
)
et al.
A high-resolution map of segmental DNA copy number variation in the mouse genome
,
PLoS Genet.
,
2007
, vol.
3
pg.
e3
et al.
Segmental copy number variation shapes tissue transcriptomes
,
Nat. Genet.
,
2009
, vol.
41
(pg.
424
-
429
)
et al.
Mapping short DNA sequencing reads and calling variants using mapping quality scores
,
Genome Res.
,
2008
, vol.
18
(pg.
1851
-
1858
)
et al.
Genomic imbalances in neonates with birth defects: high detection rates by using chromosomal microarray analysis
,
Pediatrics
,
2008
, vol.
122
(pg.
1310
-
1318
)
et al.
Identification of DNA copy-number aberrations by array-comparative genomic hybridization in patients with schizophrenia
,
Biochem. Biophys. Res. Commun.
,
2006
, vol.
344
(pg.
531
-
539
)
Mouse Genome Sequencing Consortium
et al.
Initial sequencing and comparative analysis of the mouse genome
,
Nature
,
2002
, vol.
420
(pg.
520
-
562
)
et al.
Global variation in copy number in the human genome
,
Nature
,
2006
, vol.
444
(pg.
444
-
454
)
et al.
Strong association of de novo copy number mutations with autism
,
Science
,
2007
, vol.
316
(pg.
445
-
449
)
et al.
Challenges and standards in integrating surveys of structural variation
,
Nat. Genet.
,
2007
, vol.
39
(pg.
S7
-
S15
)
et al.
Mouse segmental duplication and copy number variation
,
Nat. Genet.
,
2008
, vol.
40
(pg.
909
-
914
)
CNV-seq, a new method to detect copy number variation using high-throughput sequencing
,
BMC Bioinformatics
,
2009
, vol.
10
pg.
80
Author notes
Associate Editor: Joaquin Dopazo
© The Author(s) 2009. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Supplementary data
Citations
Views
Altmetric
Metrics
Total Views 1,015
802 Pageviews
213 PDF Downloads
Since 12/1/2016
Month: | Total Views: |
---|---|
December 2016 | 8 |
January 2017 | 2 |
February 2017 | 9 |
March 2017 | 5 |
April 2017 | 6 |
May 2017 | 11 |
June 2017 | 6 |
July 2017 | 7 |
August 2017 | 9 |
September 2017 | 2 |
October 2017 | 6 |
November 2017 | 8 |
December 2017 | 18 |
January 2018 | 22 |
February 2018 | 9 |
March 2018 | 24 |
April 2018 | 15 |
May 2018 | 14 |
June 2018 | 8 |
July 2018 | 7 |
August 2018 | 13 |
September 2018 | 7 |
October 2018 | 12 |
November 2018 | 15 |
December 2018 | 11 |
January 2019 | 8 |
February 2019 | 12 |
March 2019 | 18 |
April 2019 | 31 |
May 2019 | 12 |
June 2019 | 7 |
July 2019 | 7 |
August 2019 | 6 |
September 2019 | 14 |
October 2019 | 9 |
November 2019 | 11 |
December 2019 | 11 |
January 2020 | 14 |
February 2020 | 18 |
March 2020 | 10 |
April 2020 | 6 |
May 2020 | 3 |
June 2020 | 15 |
July 2020 | 5 |
August 2020 | 5 |
September 2020 | 6 |
October 2020 | 7 |
November 2020 | 6 |
December 2020 | 6 |
January 2021 | 1 |
February 2021 | 6 |
March 2021 | 8 |
April 2021 | 7 |
May 2021 | 6 |
June 2021 | 8 |
July 2021 | 14 |
August 2021 | 14 |
September 2021 | 10 |
October 2021 | 15 |
November 2021 | 14 |
December 2021 | 4 |
January 2022 | 11 |
February 2022 | 14 |
March 2022 | 3 |
April 2022 | 9 |
May 2022 | 8 |
June 2022 | 4 |
July 2022 | 7 |
August 2022 | 13 |
September 2022 | 14 |
October 2022 | 9 |
November 2022 | 6 |
December 2022 | 15 |
January 2023 | 15 |
February 2023 | 8 |
March 2023 | 8 |
April 2023 | 15 |
May 2023 | 10 |
June 2023 | 4 |
July 2023 | 6 |
August 2023 | 13 |
September 2023 | 13 |
October 2023 | 11 |
November 2023 | 15 |
December 2023 | 13 |
January 2024 | 18 |
February 2024 | 15 |
March 2024 | 13 |
April 2024 | 21 |
May 2024 | 24 |
June 2024 | 10 |
July 2024 | 25 |
August 2024 | 23 |
September 2024 | 11 |
October 2024 | 3 |
Citations
39 Web of Science
×
Email alerts
Citing articles via
More from Oxford Academic