Copy number variant detection in inbred strains from short read sequence data

Journal Article

Wellcome Trust Sanger Institute, Hinxton, CB10 1HH, UK

* To whom correspondence should be addressed.

Revision received:

21 November 2009

Accepted:

15 December 2009

Published:

18 December 2009

Jared T. Simpson, Rebecca E. McIntyre, David J. Adams, Richard Durbin, Copy number variant detection in inbred strains from short read sequence data, Bioinformatics, Volume 26, Issue 4, February 2010, Pages 565–567, https://doi.org/10.1093/bioinformatics/btp693

Abstract

Summary: We have developed an algorithm to detect copy number variants (CNVs) in homozygous organisms, such as inbred laboratory strains of mice, from short read sequence data. Our novel approach exploits the fact that inbred mice are homozygous at virtually every position in the genome to detect CNVs using a hidden Markov model (HMM). This HMM uses both the density of sequence reads mapped to the genome, and the rate of apparent heterozygous single nucleotide polymorphisms, to determine genomic copy number. We tested our algorithm on short read sequence data generated from re-sequencing chromosome 17 of the mouse strains A/J and CAST/EiJ with the Illumina platform. In total, we identified 118 copy number variants (43 for A/J and 75 for CAST/EiJ). We investigated the performance of our algorithm through comparison to CNVs previously identified by array-comparative genomic hybridization (array CGH). We performed quantitative-PCR validation on a subset of the calls that differed from the array CGH data sets.

Availability: The software described in this manuscript, named cnD for copy number detector, is free and released under the GPL. The program is implemented in the D programming language using the Tango library. Source code and pre-compiled binaries are available at http://www.sanger.ac.uk/resources/software/cnd.html

Contact: rd@sanger.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Copy number variants (CNVs) are segments of DNA that have been duplicated, or lost, in the genome of one individual or strain with respect to another. CNVs are thought to contribute significantly to phenotypic differences between mouse strains. In humans, CNVs have been causally linked to a range of disorders including schizophrenia (Moon et al., 2006), autism (Sebat et al., 2007) and birth defect syndromes (Lu et al., 2008). High-resolution surveys for CNVs have been performed in common laboratory strains of mice using array-comparative genomic hybridization (array CGH, or aCGH) (Cahan et al., 2009; Cutler et al., 2007; Graubert et al., 2007; Henrichsen et al., 2009; She et al., 2008). These studies have found a significant level of variation between strains, such that as much as 15% of the reference C57BL/6J mouse genome may be found as CNVs in another strain. While aCGH can be an effective way of identifying CNVs, its resolution is limited by the number of probes that can be placed on a microarray. The widespread adoption of short read sequencing platforms has led to a rapid decrease in the cost of whole-genome re-sequencing, making it a viable alternative to array CGH (Xie and Tammi, 2009). Hidden Markov models (HMMs) have previously been used to detect copy number variation from array CGH data (Cahan et al., 2008; Fridlyand et al., 2004). We have developed an HMM to detect CNVs in inbred strains from the alignments of short read sequences to a reference genome.

2 DESCRIPTION

The central idea behind our model is that the alignment of reads from regions with copy number gains (with respect to a reference genome) will be ‘collapsed’ to a single location on the reference genome. This has two effects. First, the sequence depth of this location on the reference genome will be increased by an integral amount corresponding to the relative number of copies that exist in the sequenced strain. Second, any base-pair differences between the copied regions will appear to be heterozygous single nucleotide polymorphisms (SNPs) with respect to the reference. This fact is crucial to our model: laboratory strains of mice are inbred to be effectively homozygous at every position in the genome, so any apparent heterozygous SNPs that are not sequencing errors are actually paralogous sequence variants and therefore define regions collapsed in the reference genome. Conversely, the alignment of reads from regions with copy number losses in the sequenced genome will be distributed over the corresponding copies in the reference genome, and hence the reference regions will have lower sequence depth, with the important distinction that there will not be a heterozygous SNP signal. Our HMM exploits these factors to detect regions of copy number gain and loss.
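The collapse effect can be illustrated with a toy pileup. The sketch below is not part of cnD; it assumes error-free reads, 0-based coordinates and sample copies of the same length as the reference locus, and the names `pileup` and `apparent_het_sites` are hypothetical. Two paralogous copies that differ at one base are tiled with reads onto a single reference locus, which doubles the depth and produces one apparent heterozygous site.

```python
from collections import Counter

def pileup(copies, read_len, ref_len):
    """Tile error-free reads from every sample copy onto one reference locus."""
    depth = [0] * ref_len
    alleles = [Counter() for _ in range(ref_len)]
    for seq in copies:
        # One read starting at every possible offset of this copy.
        for start in range(ref_len - read_len + 1):
            for pos in range(start, start + read_len):
                depth[pos] += 1
                alleles[pos][seq[pos]] += 1
    return depth, alleles

def apparent_het_sites(alleles, min_fraction=0.2):
    """Positions where the minor alleles make up at least min_fraction of reads."""
    sites = []
    for pos, counts in enumerate(alleles):
        total = sum(counts.values())
        if total == 0:
            continue
        minor = total - max(counts.values())
        if minor / total >= min_fraction:
            sites.append(pos)
    return sites

reference = "ACGTACGTAC"
copy_a = "ACGTACGTAC"   # identical to the reference
copy_b = "ACGTTCGTAC"   # paralogous copy, differs at position 4

single_depth, _ = pileup([copy_a], 4, len(reference))
dup_depth, dup_alleles = pileup([copy_a, copy_b], 4, len(reference))
het_sites = apparent_het_sites(dup_alleles)
```

In this toy case the duplicated sample shows exactly twice the depth of the single-copy sample at every position, and the only apparent heterozygous site is the paralogous difference.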

Our algorithm proceeds in three stages. First, the sequence reads are aligned to the mouse reference genome (build NCBI 37, Mouse Genome Sequencing Consortium, Waterston et al., 2002) using the MAQ aligner (Li et al., 2008). MAQ calls SNPs and classifies them as homozygous or heterozygous. Summary statistics are computed for the sequence read depth, the number of heterozygous SNPs and the average number of hits per read over 1 kb windows of the reference genome sequence. This triplet of data for each 1 kb region of the reference genome is input to the HMM which classifies each region as corresponding to a gain, loss or no change in copy number.
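As a rough sketch of the windowing step, the function below computes the three per-window summaries from a list of alignments. The input layout is an assumption made for illustration: each read is a `(start, length, hits)` tuple with 0-based start positions, and heterozygous SNPs are given as a list of positions. Mean depth is taken here as aligned bases divided by window size, which may differ from how cnD defines it.

```python
from dataclasses import dataclass

WINDOW = 1000  # 1 kb windows, as described in the text

@dataclass
class WindowStats:
    mean_depth: float  # aligned bases in the window / window size
    het_snps: int      # heterozygous SNP calls falling in the window
    mean_hits: float   # average number of hits for reads starting in the window

def summarize(reads, het_snp_positions, chrom_len, window=WINDOW):
    n = (chrom_len + window - 1) // window
    bases = [0] * n
    hit_sum = [0] * n
    read_count = [0] * n
    het = [0] * n
    for start, length, hits in reads:
        first = start // window
        hit_sum[first] += hits
        read_count[first] += 1
        # Split the read's aligned bases across the windows it spans.
        pos, end = start, min(start + length, chrom_len)
        while pos < end:
            w = pos // window
            w_end = min((w + 1) * window, end)
            bases[w] += w_end - pos
            pos = w_end
    for p in het_snp_positions:
        het[p // window] += 1
    stats = []
    for w in range(n):
        size = min(window, chrom_len - w * window)  # last window may be short
        mean_hits = hit_sum[w] / read_count[w] if read_count[w] else 0.0
        stats.append(WindowStats(bases[w] / size, het[w], mean_hits))
    return stats

example = summarize(
    reads=[(0, 36, 1), (990, 36, 2), (1500, 36, 1)],
    het_snp_positions=[995, 1200],
    chrom_len=2000,
)
```

Each element of `example` is one triplet of the kind that would be passed to the HMM for a 1 kb window.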

2.1 The HMM

We developed a 10-state HMM of the copy number structure of the genome being sequenced. There are five major states of the model, representing normal sequence, a 2-fold increase in copy number, a 3-fold increase in copy number, a 2-fold decrease in copy number and zero copy number. In addition, each major state of the model has a sub-state corresponding to highly repetitive sequence, allowing the model to accommodate the frequent high-copy repeat elements dispersed throughout mammalian genomes. In all states except the repeat states, the depth distribution is modeled by a normal distribution with the mean and variance reflecting the copy number of the state. For states representing a copy number gain, the heterozygous SNP rate is modeled by a negative binomial distribution. The heterozygous SNP rate is modeled by a Poisson distribution in all other states. More information about the HMM and emission distributions is given in the Supplementary data.
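The emission model for the five major states might be sketched as follows. Every numeric parameter here is a hypothetical placeholder (cnD learns its parameters from the data, and the exact parameterization is given only in the Supplementary data). The sketch also assumes that depth and heterozygous SNP count are independent given the state, scales mean and variance linearly with copy ratio, and omits both the repeat sub-states and the hits-per-read input.

```python
import math

# Hypothetical values for illustration only; not taken from cnD.
COPY_RATIO = {"normal": 1.0, "gain_2x": 2.0, "gain_3x": 3.0,
              "loss_half": 0.5, "zero": 0.0}
GAIN_STATES = {"gain_2x", "gain_3x"}
ERROR_HET_RATE = 0.1   # assumed Poisson mean of spurious het SNPs per window
NB_R, NB_P = 2.0, 0.2  # assumed negative binomial shape and success probability

def log_normal(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_poisson(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def log_neg_binomial(k, r, p):
    # P(K = k) = Gamma(k + r) / (k! * Gamma(r)) * p^r * (1 - p)^k
    return (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
            + r * math.log(p) + k * math.log(1 - p))

def log_emission(depth, het_snps, state, base_mean, base_var):
    """Log-probability of one window's (depth, het SNP count) under a state."""
    ratio = COPY_RATIO[state]
    mean = ratio * base_mean
    var = max(ratio, 0.1) * base_var  # floor keeps the zero-copy variance positive
    log_depth = log_normal(depth, mean, var)
    if state in GAIN_STATES:
        log_het = log_neg_binomial(het_snps, NB_R, NB_P)
    else:
        log_het = log_poisson(het_snps, ERROR_HET_RATE)
    return log_depth + log_het

def best_state(depth, het_snps, base_mean=22.0, base_var=25.0):
    """State with the highest emission log-probability for a single window."""
    return max(COPY_RATIO,
               key=lambda s: log_emission(depth, het_snps, s, base_mean, base_var))
```

With these placeholder parameters, a window at twice the baseline depth with several heterozygous SNPs scores highest under the 2-fold gain state, while a window at half depth with no heterozygous SNPs scores highest under the 2-fold loss state. In the full model the transition probabilities also influence the final call.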

The parameters of the model are learned for each chromosome in the input data set by Viterbi training for both the transition probabilities and emission distribution parameters (Durbin et al., 1998). After the model parameters have been determined, the sequence of states is computed by a final application of the Viterbi algorithm. The output of the Viterbi algorithm is processed to extract contiguous regions of gain or loss. The minimum threshold for detection is the input window size, typically one kilobase. There is a final optional filtering step to remove calls below a minimum size threshold.
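The final decoding and post-processing steps can be sketched as a log-space Viterbi pass followed by merging of consecutive windows in the same non-normal state. This is a generic illustration rather than the cnD implementation: the parameter re-estimation loop of Viterbi training is omitted, the two-state example and its probabilities are made up, and `extract_calls` is a hypothetical helper name.

```python
import math

def viterbi(log_emit, log_trans, log_init):
    """Most probable state path.

    log_emit[t][s]: log emission of window t under state s (non-empty list)
    log_trans[q][s]: log transition probability from state q to state s
    log_init[s]: log initial probability of state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda q: score[q] + log_trans[q][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def extract_calls(path, labels, window=1000, min_size=0):
    """Merge runs of one non-normal state into (start, end, label) calls."""
    calls = []
    run_start = 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[run_start]:
            label = labels[path[run_start]]
            size = (i - run_start) * window
            if label != "normal" and size >= min_size:
                calls.append((run_start * window, i * window, label))
            run_start = i
    return calls

# Toy example: two states, six 1 kb windows, windows 2-4 favour "gain".
labels = ["normal", "gain"]
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.99), math.log(0.01)],
             [math.log(0.01), math.log(0.99)]]
log_emit = [[-1, -10], [-1, -10], [-10, -1], [-10, -1], [-10, -1], [-1, -10]]

path = viterbi(log_emit, log_trans, log_init)
calls = extract_calls(path, labels)
filtered = extract_calls(path, labels, min_size=10000)
```

Here the three consecutive gain windows are reported as a single 3 kb call, which the optional 10 kb size filter then removes.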

3 RESULTS

We tested our model on Illumina short read sequence data from chromosome 17 for the A/J and CAST/EiJ strains of mouse, which were sequenced to 22- and 34-fold depth, respectively (ERA accession number ERA000077). The data sets were generated using 36-bp paired-end reads of 200-bp insert libraries. For this experiment, we set a minimum call size threshold of 10 kb (see Supplementary data). We evaluated our calls against a collection of previously published aCGH copy number variation data (Cahan et al., 2009; Cutler et al., 2007; Henrichsen et al., 2009; She et al., 2008).

Our algorithm called 22 copy number gains (1.38 Mb of sequence) and 21 losses (0.49 Mb) for the A/J data set (see Fig. 1 and Supplementary Fig. 6 for example regions). The gain regions overlap 38% of the regions identified by aCGH (36% by sequence, 1.1 Mb). Seventy-seven percent of the gains cnD found were previously seen by aCGH. For CAST/EiJ, 45 gains (2.44 Mb of sequence) and 30 losses (1.16 Mb) were called. The gain regions overlap 76% of the gains called by aCGH (79% by sequence, 1.3 Mb). Thirty-six percent of the gains found by cnD were previously seen in the array CGH data set. This figure is much lower than that of A/J because the CAST/EiJ strain was not used in the highest coverage aCGH study (Cahan et al., 2009). In both strains the regions of copy number loss called by our algorithm and aCGH differed widely (11% concordance by region for A/J and 32% for CAST/EiJ), owing to the relative difficulty of calling CNV losses compared to gains. We performed qPCR validation on a subset of both the gain calls that were novel to our algorithm (those not found by aCGH) and the novel gain calls found by aCGH. In total we attempted validation on 20 novel cnD gains, of which five (25%) were confirmed to be amplified relative to C57BL/6J. Of the 14 novel aCGH gains that we attempted to validate, one (about 7%) was confirmed to be a gain relative to C57BL/6J. Our concordance with array CGH and initial confirmation rates are similar to previously published copy number variation studies (Conrad et al., 2009; Redon et al., 2006; Scherer et al., 2007). Full details of the experimental validation are provided in the Supplementary data.
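Concordance figures of this kind, reported both by region and by sequence, can be computed from two sets of intervals. The sketch below is an assumption about how such a comparison might be done, not the procedure used in the paper: intervals are half-open `(start, end)` pairs on one chromosome, the calls within each set are assumed not to overlap one another, and the coordinates in the example are invented.

```python
def overlap_bp(a, b):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def concordance(calls, truth):
    """Fraction of truth regions overlapped by any call.

    Returns (fraction by region count, fraction by base pairs).
    Assumes the intervals in `calls` do not overlap one another.
    """
    hit_regions = 0
    hit_bp = 0
    total_bp = 0
    for region in truth:
        total_bp += region[1] - region[0]
        covered = sum(overlap_bp(region, call) for call in calls)
        if covered > 0:
            hit_regions += 1
        hit_bp += covered
    by_region = hit_regions / len(truth) if truth else 0.0
    by_bp = hit_bp / total_bp if total_bp else 0.0
    return by_region, by_bp

# Invented coordinates, for illustration only.
sequence_calls = [(100, 200), (400, 500)]
array_calls = [(150, 250), (300, 350), (450, 480)]

by_region, by_bp = concordance(sequence_calls, array_calls)
```

Swapping the two arguments gives the complementary figure, that is, the fraction of sequence-based calls that were also seen on the array.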

Fig. 1. (A) Plot of sequencing depth across a one megabase region of A/J chromosome 17 clearly shows both a region of 3-fold increased copy number (30.6–31.1 Mb) and a region of decreased copy number (at 31.3 Mb). The solid black line above the depth plot indicates the called copy number gain and the solid black line below the plot indicates the called copy number loss. (B) Plot of the heterozygous SNP rate for the same region showing the high number of apparent heterozygous SNPs associated with the copy number gain.

ACKNOWLEDGMENTS

The authors would like to thank Thomas Keane and Jim Stalker for implementing the initial data processing pipeline and Ian Sudbery for generating the chromosome 17 sequencing data.

Funding: Medical Research Council-UK and the Wellcome Trust (077192/Z/05/Z); Cancer-Research-UK to D.J.A.

Conflict of Interest: none declared.

REFERENCES

Cahan et al. wuHMM: a robust algorithm to detect DNA copy number variation using long oligonucleotide microarray data, Nucleic Acids Res., 2008, vol. 36, pg. e41

Cahan et al. The impact of copy number variation on local gene expression in mouse hematopoietic stem and progenitor cells, Nat. Genet., 2009, vol. 41 (pg. 430-437)

Conrad et al. Origins and functional impact of copy number variation in the human genome, Nature, 2009 [Epub ahead of print, doi: 10.1038/nature08516, October 7, 2009]

Cutler et al. Significant gene content variation characterizes the genomes of inbred mouse strains, Genome Res., 2007, vol. 17 (pg. 1743-1754)

Durbin et al. Markov chains and hidden Markov models, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1998, Cambridge, UK; New York: Cambridge University Press, pg. 356

Fridlyand et al. Hidden Markov models approach to the analysis of array CGH data, J. Multivar. Anal., 2004, vol. 90 (pg. 132-153)

Graubert et al. A high-resolution map of segmental DNA copy number variation in the mouse genome, PLoS Genet., 2007, vol. 3, pg. e3

Henrichsen et al. Segmental copy number variation shapes tissue transcriptomes, Nat. Genet., 2009, vol. 41 (pg. 424-429)

Li et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores, Genome Res., 2008, vol. 18 (pg. 1851-1858)

Lu et al. Genomic imbalances in neonates with birth defects: high detection rates by using chromosomal microarray analysis, Pediatrics, 2008, vol. 122 (pg. 1310-1318)

Moon et al. Identification of DNA copy-number aberrations by array-comparative genomic hybridization in patients with schizophrenia, Biochem. Biophys. Res. Commun., 2006, vol. 344 (pg. 531-539)

Mouse Genome Sequencing Consortium et al. Initial sequencing and comparative analysis of the mouse genome, Nature, 2002, vol. 420 (pg. 520-562)

Redon et al. Global variation in copy number in the human genome, Nature, 2006, vol. 444 (pg. 444-454)

Sebat et al. Strong association of de novo copy number mutations with autism, Science, 2007, vol. 316 (pg. 445-449)

Scherer et al. Challenges and standards in integrating surveys of structural variation, Nat. Genet., 2007, vol. 39 (pg. S7-S15)

She et al. Mouse segmental duplication and copy number variation, Nat. Genet., 2008, vol. 40 (pg. 909-914)

Xie and Tammi. CNV-seq, a new method to detect copy number variation using high-throughput sequencing, BMC Bioinformatics, 2009, vol. 10, pg. 80

Author notes

Associate Editor: Joaquin Dopazo

© The Author(s) 2009. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
