Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data (original) (raw)
Journal Article
,
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
* To whom correspondence should be addressed.
Search for other works by this author on:
,
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
Search for other works by this author on:
,
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
Search for other works by this author on:
,
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
Search for other works by this author on:
,
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
Search for other works by this author on:
,
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
Search for other works by this author on:
,
Isabelle Janoueix-Lerosey
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
Search for other works by this author on:
,
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
Search for other works by this author on:
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
1Institut Curie, 2INSERM, U900, Bioinformatics and Computational Systems Biology of Cancer, Paris 75248, 3Mines ParisTech, Fontainebleau 77300, 4INSERM, U830, Genetics and Biology of Cancers, Paris 75248 and 5INRIA Saclay, Orsay 91893, France
Search for other works by this author on:
Received:
09 September 2011
Revision received:
03 November 2011
Accepted:
29 November 2011
Published:
06 December 2011
Cite
Valentina Boeva, Tatiana Popova, Kevin Bleakley, Pierre Chiche, Julie Cappo, Gudrun Schleiermacher, Isabelle Janoueix-Lerosey, Olivier Delattre, Emmanuel Barillot, Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data, Bioinformatics, Volume 28, Issue 3, February 2012, Pages 423–425, https://doi.org/10.1093/bioinformatics/btr670
Close
Navbar Search Filter Mobile Enter search term Search
Abstract
Summary: More and more cancer studies use next-generation sequencing (NGS) data to detect various types of genomic variation. However, even when researchers have such data at hand, single-nucleotide polymorphism arrays have been considered necessary to assess copy number alterations and especially loss of heterozygosity (LOH). Here, we present the tool Control-FREEC that enables automatic calculation of copy number and allelic content profiles from NGS data, and consequently predicts regions of genomic alteration such as gains, losses and LOH. Taking as input aligned reads, Control-FREEC constructs copy number and B-allele frequency profiles. The profiles are then normalized, segmented and analyzed in order to assign genotype status (copy number and allelic content) to each genomic region. When a matched normal sample is provided, Control-FREEC discriminates somatic from germline events. Control-FREEC is able to analyze overdiploid tumor samples and samples contaminated by normal cells. Low mappability regions can be excluded from the analysis using provided mappability tracks.
Availability: C++ source code is available at: http://bioinfo.curie.fr/projects/freec/
Contact: freec@curie.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
Cancer genomes often display copy number alterations (CNAs) and/or losses of heterozygosity (LOH) (Hanahan and Weinberg, 2011). Genetic abnormalities in specific regions may be related to the aggressiveness of a cancer and be associated with clinical outcomes (Caren et al., 2010; Suzuki et al., 2000).
To detect CNA and LOH regions, single-nucleotide polymorphism (SNP) arrays have been recently much in use (Popova et al., 2009). Furthermore, next-generation sequencing (NGS) has been moving to replace SNP-arrays in prediction of CNAs (Boeva et al., 2010). A recent study presented ExomeCNV, a tool to predict CNAs and LOH using exome sequencing data (Sathirapongsasuti et al., 2011). However, detection of LOH regions and, more generally, prediction of genotype status (copy number and allelic content) of an altered region using whole-genome sequencing data has remained unsolved. The main challenges to doing so are non-uniform read coverage of genomic positions [for example, due to different mappability and GC-content (Boeva et al., 2010)] and alignment bias (reference allele coverage is usually higher than the coverage of the alternative allele). Thus, the resulting signal is noisier and more difficult to process than in the case of SNP arrays.
Here, we present Control-FREEC (Control-FREE Copy number and allelic content caller)—a tool that annotates genotypes and discovers CNAs and LOH. Control-FREEC inherits many features from FREEC (Boeva et al., 2010) (assessment of copy number variation and evaluation of contamination by normal cells) as well as the general methodology of the GAP algorithm for SNP arrays (Popova et al., 2009). Control-FREEC takes as an input aligned reads, then constructs and normalizes the copy number profile, constructs the B-allele frequency (BAF) profile, segments both profiles, ascribes the genotype status to each segment using both copy number and allelic frequency information, then annotates genomic alterations. If a control (matched normal) sample is available, Control-FREEC discerns somatic variants from germline ones.
2 METHODS
Workflow: the workflow of Control-FREEC consists of three steps: (i) calculation and segmentation of copy number profiles; (ii) calculation and segmentation of smoothed BAF profiles; (iii) prediction of final genotype status, i.e. copy number and allelic content for each segment (for example, A, AB, AAB, etc.).
- (i) Calculation of copy number profiles is mainly done as described in our previous publication (Boeva et al., 2010). The most important features of the procedure are: (a) possibility to use GC-content and mappability profiles to normalize read count if a control sample is unavailable; (b) proper characterization of overdiploid genomes; (c) correction for possible contamination by normal cells when constructing the copy number profile of a tumor genome. The new tool Control-FREEC can also be used on non-mammalian genomes and includes many new user control settings, such as (a) defining the program's behavior in low mappability regions (http://bioinfo.curie.fr/projects/freec/tutorial.html); (b) choosing the minimal number of consecutive windows required to call a CNA.
- (ii) We characterize the allelic content via the BAF introduced previously for SNP arrays (Popova et al., 2009). We limit the list of genomic positions that we consider to evaluate allelic content to known SNPs only (Sherry et al., 2001). By the B allele, we mean the alternative variant in SNP database (dbSNP). SNPs that are homozygous in the genome being considered give no information about allelic content (in SNP arrays they are denoted as non-informative); therefore putatively homozygous positions are discarded. A position is discarded if the probability of having variation due to sequencing errors under the condition of actual homozygosity is greater than a specified threshold (Supplementary Materials).
We calculate the total coverage and B-allele coverage for each known putatively heterozygous SNP position. For each window i, we calculate the median of the BAF values: Med_j_ = median(abs(x_ij_−0.5)), where {x_ij_} are BAF values of the remaining SNP positions. We segment {Med_j_} using the same lasso-based algorithm as used for copy numbers (Harchaoui and Lévy-Leduc, 2008). - (iii) We predict genotype status for each genomic segment independently, by choosing the allelic content that corresponds to the maximal log-likelihood, given the copy number detected previously.
First, we combine breakpoints issued from both copy number and median BAF segmentations to get genomic segments with presumably one status. Second, copy number status of each segment is detected as described previously (Boeva et al., 2010). If the CNA is present in most of the cells, there is no ambiguity in determining exact copy number of the region (see Supplementary Materials for more details on the strategy in the case of presence of subclones or normal contamination). Third, given the copy number of the region, we fit Gaussian mixture models (GMMs) with fixed means to the observed BAF values and select the model that provides the highest log-likelihood. For example, for a region with a copy number of two, we fit a two component model (mixture of ‘AA’ and ‘BB’ alleles) and a three component model (‘AA’, ‘AB’ and ‘BB’, with a condition on the minimal weight of ‘AB’). The component means in the GMM depend on the level of contamination by normal DNA (Supplementary Materials).
Input and output: the input consists of a SAM pileup (http://samtools.sourceforge.net/pileup.shtml) and a dbSNP file. The control dataset is optional if a reference genome is provided. The output contains a list of CNAs and LOH regions as well as read count, copy number, BAF and genotype information for each window. If a control (matched normal) dataset is available, each event is annotated as somatic or germline.
3 RESULTS
We applied Control-FREEC to detect CNAs and LOH regions in a tumor/normal dataset for a neuroblastoma patient (~ 30x-coverage, unpublished data). Control-FREEC detected somatic CNA and LOH regions covering 75% of the tumor genome (Fig. 1) and was able to identify the genotype status despite contamination of the tumor sample by normal cells (estimated percent of tumor cells was 60%).
Fig. 1.
Control-FREEC calculates copy number and BAF profiles and detects regions of copy number gain/loss and LOH regions. Tumor chromosomes 17 and 19 (bottom panels) versus ‘normal’ chromosomes (top panels; unpublished data). Predicted BAF and copy number profiles are shown in black. Gains, losses (left panels) and LOH (right panels) are shown in red, blue and light blue, respectively.
Our results agreed with the SNP-array analysis output. We obtained 95.4% consistency between the results of Control-FREEC and GAP (Popova et al., 2009), which we applied to SNP array data generated for the same tumor sample (Supplementary Materials).
4 CONCLUSION
Control-FREEC is a tool for automatic detection of CNAs and LOH regions using NGS data. It accurately calls genotype status even when no control experiment is available and/or the genome is polyploid. It corrects for GC-content and mappability biases. In the case of tumor samples, Control-FREEC is able to evaluate the level of contamination by normal cells. The software is written in C++ and freely available.
Funding: ‘Projet Incitatif et Collaboratif Bioinformatique et Biostatistiques’ of the Institut Curie; Ligue Nationale Contre le Cancer.
Conflict of Interest: none declared.
REFERENCES
et al.
Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization
,
Bioinformatics
,
2011
, vol.
27
(pg.
268
-
269
)
et al.
High-risk neuroblastoma tumors with 11q-deletion display a poor prognostic, chromosome instability phenotype with later onset
,
Proc. Natl Acad. Sci. USA
,
2010
, vol.
107
(pg.
4323
-
4328
)
Hallmarks of cancer: the next generation
,
Cell
,
2011
, vol.
144
(pg.
646
-
674
)
Catching change-points with lasso
,
Adv. Neural Inform. Process. Syst.
,
2008
, vol.
22
(pg.
617
-
624
)
et al.
Genome Alteration Print (GAP): a tool to visualize and mine complex cancer genomic profiles obtained by SNP arrays
,
Genome Biol.
,
2009
, vol.
10
pg.
R128
et al.
Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV
,
Bioinformatics
,
2011
, vol.
27
(pg.
2648
-
2654
)
et al.
dbSNP: the NCBI database of genetic variation
,
Nucleic Acids Res.
,
2001
, vol.
29
(pg.
308
-
311
)
et al.
An approach to analysis of large-scale correlations between genome changes and clinical endpoints in ovarian cancer
,
Cancer Res
,
2000
, vol.
60
(pg.
5382
-
5385
)
Author notes
Associate Editor: Alex Bateman
© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Supplementary data
Citations
Views
Altmetric
Metrics
Total Views 16,386
13,011 Pageviews
3,375 PDF Downloads
Since 11/1/2016
Month: | Total Views: |
---|---|
November 2016 | 16 |
December 2016 | 19 |
January 2017 | 68 |
February 2017 | 91 |
March 2017 | 105 |
April 2017 | 73 |
May 2017 | 88 |
June 2017 | 75 |
July 2017 | 91 |
August 2017 | 91 |
September 2017 | 94 |
October 2017 | 105 |
November 2017 | 62 |
December 2017 | 182 |
January 2018 | 217 |
February 2018 | 195 |
March 2018 | 313 |
April 2018 | 290 |
May 2018 | 300 |
June 2018 | 350 |
July 2018 | 341 |
August 2018 | 248 |
September 2018 | 261 |
October 2018 | 214 |
November 2018 | 229 |
December 2018 | 202 |
January 2019 | 222 |
February 2019 | 202 |
March 2019 | 201 |
April 2019 | 226 |
May 2019 | 193 |
June 2019 | 221 |
July 2019 | 274 |
August 2019 | 265 |
September 2019 | 255 |
October 2019 | 196 |
November 2019 | 172 |
December 2019 | 142 |
January 2020 | 146 |
February 2020 | 145 |
March 2020 | 119 |
April 2020 | 84 |
May 2020 | 103 |
June 2020 | 150 |
July 2020 | 109 |
August 2020 | 132 |
September 2020 | 128 |
October 2020 | 190 |
November 2020 | 137 |
December 2020 | 180 |
January 2021 | 105 |
February 2021 | 144 |
March 2021 | 185 |
April 2021 | 162 |
May 2021 | 201 |
June 2021 | 189 |
July 2021 | 201 |
August 2021 | 149 |
September 2021 | 168 |
October 2021 | 156 |
November 2021 | 181 |
December 2021 | 174 |
January 2022 | 148 |
February 2022 | 161 |
March 2022 | 205 |
April 2022 | 181 |
May 2022 | 143 |
June 2022 | 197 |
July 2022 | 160 |
August 2022 | 177 |
September 2022 | 172 |
October 2022 | 167 |
November 2022 | 142 |
December 2022 | 146 |
January 2023 | 147 |
February 2023 | 144 |
March 2023 | 236 |
April 2023 | 138 |
May 2023 | 147 |
June 2023 | 146 |
July 2023 | 165 |
August 2023 | 163 |
September 2023 | 160 |
October 2023 | 147 |
November 2023 | 176 |
December 2023 | 155 |
January 2024 | 193 |
February 2024 | 202 |
March 2024 | 403 |
April 2024 | 147 |
May 2024 | 204 |
June 2024 | 153 |
July 2024 | 163 |
August 2024 | 180 |
September 2024 | 167 |
October 2024 | 124 |
Citations
679 Web of Science
×
Email alerts
Citing articles via
More from Oxford Academic