Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data
Journal Article
1Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA,
2Department of Pathology, Yale University School of Medicine, New Haven, CT 06511, USA and
3Bioengineering Program, Faculty of Engineering, Bar-Ilan University, Ramat Gan 52900, Israel
*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.
Associate Editor: Alfonso Valencia
Received: 17 October 2014
Revision received: 30 April 2015
Namita T. Gupta, Jason A. Vander Heiden, Mohamed Uduman, Daniel Gadala-Maria, Gur Yaari, Steven H. Kleinstein, Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data, Bioinformatics, Volume 31, Issue 20, October 2015, Pages 3356–3358, https://doi.org/10.1093/bioinformatics/btv359
Abstract
Summary: Advances in high-throughput sequencing technologies now allow for large-scale characterization of B cell immunoglobulin (Ig) repertoires. The high germline and somatic diversity of the Ig repertoire presents challenges for biologically meaningful analysis, which requires specialized computational methods. We have developed a suite of utilities, Change-O, which provides tools for advanced analyses of large-scale Ig repertoire sequencing data. Change-O includes tools for determining the complete set of Ig variable region gene segment alleles carried by an individual (including novel alleles), partitioning of Ig sequences into clonal populations, creating lineage trees, inferring somatic hypermutation targeting models, measuring repertoire diversity, quantifying selection pressure, and calculating sequence chemical properties. All Change-O tools utilize a common data format, which enables the seamless integration of multiple analyses into a single workflow.
Availability and implementation: Change-O is freely available for non-commercial use and may be downloaded from http://clip.med.yale.edu/changeo.
Contact: steven.kleinstein@yale.edu
1 Introduction
Large-scale characterization of immunoglobulin (Ig) repertoires is now feasible due to dramatic improvements in high-throughput sequencing technology. Repertoire sequencing is a rapidly growing area, with applications including detection of minimal residual disease, prognosis following transplant, monitoring vaccination responses, identification of neutralizing antibodies and inferring B cell trafficking patterns (Robins, 2013; Stern et al., 2014). We previously developed the repertoire sequencing toolkit (pRESTO) for producing assembled and error-corrected reads from high-throughput lymphocyte receptor sequencing experiments (Vander Heiden et al., 2014), which may then be fed into existing methods for alignment against V(D)J germline databases [e.g. IMGT/HighV-QUEST (Alamyar et al., 2012), IgBLAST (Ye et al., 2013), iHMMune-align (Gaëta et al., 2007)]. However, extracting measures of biological and clinical interest from the resulting germline-annotated repertoire remains a time-consuming and error-prone process that is often dependent upon custom analysis scripts. Here, we introduce Change-O, a suite of utilities that cover a range of complex analysis tasks for Ig repertoire sequencing data.
2 Features
The Change-O suite is composed of four software packages: a collection of Python command-line tools (changeo-clt) and three separate R (R Core Team, 2015) packages (alakazam, shm, and tigger) (Table 1). Data are passed to Change-O utilities in the form of a tab-delimited text file. Each utility identifies the relevant input data based on standardized column names and adds new columns to the file with the output information to be carried through to the next analysis step. Change-O provides tools to import data from the frequently used IMGT/HighV-QUEST (Alamyar et al., 2012) tool as well as a set of utilities to perform basic database operations, such as sorting, filtering and modifying annotations.
Table 1.
Summary of Change-O features
Package | Analysis tasks |
---|---|
changeo-clt | Parsing of V(D)J assignment output |
changeo-clt | Basic database manipulation |
changeo-clt | Multiple alignment of sequence records |
changeo-clt | Assignment of sequences into clonal groups |
changeo-clt | Calculation of CDR3 physicochemical properties |
alakazam | Clonal diversity analysis |
alakazam | Lineage reconstruction |
shm | SHM hot/cold-spot modeling |
shm | Quantification of selection pressure |
tigger | Inference of novel germline alleles |
tigger | Construction of personalized germline genotype |
The more computationally expensive components have built-in multiprocessing support. Each utility includes detailed help documentation and optional logging to track errors. Example workflow scripts are provided on the website, which can easily be modified by adding, removing or reordering analysis steps to meet different analysis goals. As detailed later, several repertoire analyses may be carried out, depending on the nature of the study.
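The column-based hand-off between utilities described above can be illustrated with a minimal Python sketch. It reads a tab-delimited table, computes one new annotation per record and appends it as an extra column, leaving the existing columns untouched. The function name `add_column` and the column names `SEQ_ID`, `JUNCTION` and `JUNCTION_LENGTH_DEMO` are hypothetical placeholders chosen for this example; they are not taken from the Change-O file specification.

```python
import csv
import io


def add_column(tsv_text, new_column, compute):
    """Return the table as TSV text with one extra column appended.

    `compute` receives each row as a dict keyed by column name and
    returns the value to store in `new_column`.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fieldnames = list(reader.fieldnames) + [new_column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[new_column] = compute(row)  # existing columns are carried through
        writer.writerow(row)
    return out.getvalue()


# Hypothetical two-record table; column names are illustrative only.
example = "SEQ_ID\tJUNCTION\nseq1\tTGTGCGAGA\nseq2\tTGTGCGAAAGAT\n"
annotated = add_column(example, "JUNCTION_LENGTH_DEMO",
                       lambda row: len(row["JUNCTION"]))
print(annotated)
```

Because each step only appends columns, steps written this way can be chained or reordered as long as the columns a step reads are already present.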
2.1 Inference of novel alleles and individual genotype
Germline segment assignment tools, such as IMGT/HighV-QUEST, work by aligning each sequence against a database of known alleles. However, this process is inaccurate for sequences that utilize previously undetected alleles. In this case, the sequence will be assigned to the closest known allele and any polymorphisms will be incorrectly identified as somatic mutations. To address this problem, the Tool for Immunoglobulin Genotype Elucidation (TIgGER) (Gadala-Maria et al., 2015) has been implemented as an R package for inclusion in Change-O. TIgGER determines the complete set of variable region gene segments carried by an individual and identifies novel alleles, yielding a set of germline alleles personalized to an individual. The germline variable region allele assignments are then adjusted based on this individual Ig genotype. This process significantly improves the quality of germline assignments, thus increasing the confidence of downstream analysis dependent upon mutation profiles.
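The distinction between a germline polymorphism and a somatic mutation can be made concrete with a toy Python sketch. It rests on a simplifying assumption: a true polymorphism shows up at the same position in most sequences assigned to an allele, whereas somatic mutations are scattered. This is a deliberately simplified heuristic for illustration and is not the TIgGER procedure; the function name `candidate_polymorphisms` and the `min_fraction` threshold are hypothetical, and the sequences are assumed to be pre-aligned to the germline and of equal length.

```python
from collections import Counter


def candidate_polymorphisms(germline, sequences, min_fraction=0.8):
    """Return {position: base} for positions (0-based) where at least
    `min_fraction` of the sequences share the same non-germline base."""
    candidates = {}
    for pos, germ_base in enumerate(germline):
        # Tally observed bases at this position, ignoring ambiguous calls.
        observed = Counter(seq[pos] for seq in sequences if seq[pos] != "N")
        total = sum(observed.values())
        if total == 0:
            continue
        base, count = observed.most_common(1)[0]
        if base != germ_base and count / total >= min_fraction:
            candidates[pos] = base
    return candidates


germline = "ACGTAC"
reads = ["ACGTTC", "ACGTTC", "ACGTTC", "ATGTTC", "ACGTTA"]
print(candidate_polymorphisms(germline, reads))
```

In this toy input every read carries T at position 4, so that position is flagged, while the isolated differences at positions 1 and 5 are left to be interpreted as somatic mutations.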
2.2 Partitioning sequences into clonally related groups
Identifying sequences that are descended from the same B cell (clonal groups) is important to virtually all Ig repertoire analyses. Clonal group sizes and lineage structures provide information on the underlying response, and clonally related sequences cannot be treated independently in statistical analyses and models. Change-O provides several methods for partitioning sequences into clones. Along with published methods based on hierarchical clustering (Ademokun et al., 2011; Chen et al., 2010; Glanville et al., 2009), users also have the option to employ several published somatic hypermutation (SHM) hot/cold-spot targeting models as distance metrics in the clustering methods (Smith et al., 1996; Yaari et al., 2013; Stern et al., 2014). Users may alter the clustering thresholds, and Change-O also includes tools to tune the thresholds based on distance patterns in the repertoire (Glanville et al., 2009).
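A minimal Python sketch of threshold-based clonal grouping follows. It assumes one common scheme: records are first grouped by V gene, J gene and junction length, and junctions within a group are then linked when their normalized Hamming distance is at or below a threshold, which is equivalent to cutting a single-linkage hierarchical clustering at that threshold. The record keys (`id`, `v_gene`, `j_gene`, `junction`), the function names and the default threshold of 0.15 are assumptions made for this example, not Change-O's interface or defaults; junctions are assumed to be non-empty.

```python
from collections import defaultdict
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def assign_clones(records, threshold=0.15):
    """Map each record id to an integer clone label."""
    # Only sequences with the same V gene, J gene and junction length
    # are compared with each other.
    groups = defaultdict(list)
    for rec in records:
        key = (rec["v_gene"], rec["j_gene"], len(rec["junction"]))
        groups[key].append(rec)

    # Union-find structure: linked sequences end up sharing one root.
    parent = {rec["id"]: rec["id"] for rec in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for members in groups.values():
        for a, b in combinations(members, 2):
            if hamming_fraction(a["junction"], b["junction"]) <= threshold:
                parent[find(a["id"])] = find(b["id"])

    labels = {}
    clone_of = {}
    for rec in records:
        root = find(rec["id"])
        clone_of[rec["id"]] = labels.setdefault(root, len(labels) + 1)
    return clone_of


records = [
    {"id": "s1", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "TGTGCGAGAGAT"},
    {"id": "s2", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "TGTGCGAGAGAC"},
    {"id": "s3", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "TGTAAATTTGGG"},
    {"id": "s4", "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "junction": "TGTGCGAGAGAT"},
]
clones = assign_clones(records)
print(clones)
```

Replacing `hamming_fraction` with a distance derived from an SHM targeting model would change only the pairwise comparison, which is why the distance metric and the threshold can be varied independently.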
2.3 Quantification of repertoire diversity
To assess repertoire diversity, Change-O provides an implementation of the general diversity index (^qD) proposed by Hill (1973), which encompasses a range of diversity measures as a smooth curve over a single varying parameter q. Special cases of this general index of diversity correspond to the most popular diversity measures: species richness (q = 0), the exponential of the Shannon-Wiener index (as q→1), the inverse of the Simpson index (q = 2), and the reciprocal of the relative abundance of the largest clone (as q→∞). Resampling strategies are also provided to perform significance tests and allow comparison across samples with varying sequencing depth (Wu et al., 2014; Stern et al., 2014).
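The Hill index can be written as ^qD = (Σ p_i^q)^(1/(1−q)), where p_i is the relative abundance of clone i. The Python sketch below evaluates it from clone sizes and treats the two limiting cases (q→1 and q→∞) explicitly. The function name `hill_diversity` is chosen for this example, and the resampling step mentioned above is not included.

```python
import math


def hill_diversity(counts, q):
    """Hill diversity of order q from a list of clone sizes."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if q == 1:
        # Limit as q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    if math.isinf(q):
        # Limit as q -> infinity: reciprocal of the largest relative abundance.
        return 1.0 / max(p)
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


clone_sizes = [50, 30, 20]
for order in (0, 1, 2, math.inf):
    print(order, hill_diversity(clone_sizes, order))
```

Evaluating the function over a grid of q values gives the smooth diversity curve; larger q places more weight on the most abundant clones.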
2.4 Generation of B cell lineage trees
Lineage trees provide a means to trace the ancestral relationships of cells within a clone. This information has been used to estimate mutation rates (Kleinstein et al., 2003), infer B cell trafficking patterns (Stern et al., 2014) and trace the accumulation of mutations that drive affinity maturation (Uduman et al., 2014; Wu et al., 2012). Change-O provides a tool for generating lineage trees using PHYLIP’s maximum parsimony algorithm (Felsenstein, 1989), with modifications to meet the requirements of an Ig lineage tree (Barak et al., 2008; Stern et al., 2014). Trees may be viewed and exported into different file formats using the igraph (Csardi and Nepusz, 2006) R package.
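The maximum parsimony criterion prefers the tree that explains the observed sequences with the fewest mutations. As a small illustration of that criterion, the Python sketch below scores one fixed binary tree with Fitch's algorithm. It is not PHYLIP's implementation and includes none of the Ig-specific modifications mentioned above; the nested-tuple tree encoding and the function name `fitch_score` are assumptions made for this example, and all leaf sequences are assumed to be aligned and of equal length.

```python
def fitch_score(tree):
    """Minimum number of substitutions needed on a fixed binary tree.

    A leaf is a sequence string; an internal node is a (left, right) tuple.
    """
    def visit(node):
        if isinstance(node, str):
            # Leaf: one singleton state set per site, no cost yet.
            return [{base} for base in node], 0
        left_sets, left_cost = visit(node[0])
        right_sets, right_cost = visit(node[1])
        sets = []
        cost = left_cost + right_cost
        for a, b in zip(left_sets, right_sets):
            common = a & b
            if common:
                sets.append(common)
            else:
                # No shared state: one substitution is required at this site.
                sets.append(a | b)
                cost += 1
        return sets, cost

    return visit(tree)[1]


tree = (("ACGT", "ACGA"), ("ACTA", "ACTA"))
print(fitch_score(tree))
```

A tree search would compare this score across candidate topologies and keep the lowest-scoring one.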
2.5 Somatic hypermutation hot/cold-spot motifs
SHM is a process that operates in activated B cells and introduces point mutations into the DNA coding for the Ig receptor at a very high rate (≈10⁻³ per base pair per division) (Kleinstein et al., 2003; McKean et al., 1984). Accurate background models of SHM are critical, since SHM displays intrinsic hot/cold-spot biases (Yaari et al., 2013). Change-O provides utilities for estimating the mutability and substitution rates of DNA motifs from large-scale Ig sequencing data to construct hot/cold-spot motif models. Furthermore, models may be generated based solely on silent mutations, thereby avoiding the confounding influence of selection pressures (Yaari et al., 2013). These tools can be used to build models of SHM targeting and gain insight into the relative contributions of different error-prone repair pathways in SHM.
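The idea of estimating motif mutability can be sketched in Python as a simple ratio: for each k-mer context in the germline, count how often its central base is observed to be mutated and divide by how often the context occurs. This is a bare-bones illustration under stated assumptions, not the published targeting model: it does not restrict to silent mutations, does not normalize across motifs, and does not estimate substitution rates. The function name `motif_mutability`, the default `k=5` and the input layout (pairs of aligned, equal-length germline and observed sequences) are hypothetical.

```python
from collections import Counter


def motif_mutability(pairs, k=5):
    """Estimate per-motif mutation frequency from (germline, observed) pairs.

    Each motif is the k-mer of germline bases centered on the position
    being examined; k is assumed to be odd.
    """
    half = k // 2
    opportunities = Counter()
    mutated = Counter()
    for germline, observed in pairs:
        for i in range(half, len(germline) - half):
            motif = germline[i - half:i + half + 1]
            # Skip contexts or observations with ambiguous or gap characters.
            if set(motif) - set("ACGT") or observed[i] not in "ACGT":
                continue
            opportunities[motif] += 1
            if observed[i] != germline[i]:
                mutated[motif] += 1
    return {m: mutated[m] / n for m, n in opportunities.items()}


pairs = [("AAGCTAA", "AAGTTAA")]
print(motif_mutability(pairs))
```

With large data sets the same counts can be accumulated per sample and compared, which is one simple way to see hot-spot and cold-spot contexts emerge.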
2.6 Analysis of selection pressure
For quantifying selection pressure in Ig sequences, Change-O includes the BASELINe (Yaari et al., 2012) method, which has been implemented as an R package for inclusion in the suite. BASELINe quantifies deviations in the frequency of replacement mutations compared with a background model of SHM. Users may choose between published background models (Smith et al., 1996; Yaari et al., 2013) or infer the background from their own data using the SHM model building tools described above.
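The notion of a deviation in replacement-mutation frequency from a background expectation can be reduced to a single number for illustration. The Python sketch below computes a log-odds ratio between the observed fraction of replacement mutations and the fraction expected under a background model; positive values indicate an excess of replacements and negative values a deficit. This point estimate is a simplification chosen for this example and is not the BASELINe procedure; the function name `selection_log_odds` and its parameters are hypothetical, and the expected fraction is assumed to be supplied by the user.

```python
import math


def selection_log_odds(n_replacement, n_silent, expected_r_fraction):
    """Log-odds of observed versus expected replacement-mutation frequency."""
    total = n_replacement + n_silent
    if total == 0:
        raise ValueError("no mutations observed")
    observed = n_replacement / total
    if observed in (0.0, 1.0) or not 0.0 < expected_r_fraction < 1.0:
        raise ValueError("log-odds undefined for fractions of exactly 0 or 1")
    observed_odds = observed / (1.0 - observed)
    expected_odds = expected_r_fraction / (1.0 - expected_r_fraction)
    return math.log(observed_odds / expected_odds)


# 36 replacement and 4 silent mutations against an expected fraction of 0.75.
print(selection_log_odds(36, 4, 0.75))
```

A point estimate like this ignores sampling uncertainty, which matters when mutation counts are small.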
3 Conclusion
Change-O is a suite of utilities implementing a wide range of B cell repertoire analysis methods. Together these tools allow researchers to quickly implement advanced analysis pipelines for large datasets generated by repertoire sequencing experiments. A simple tab-delimited file with standardized column names allows for communication between the utilities and can easily be viewed using any spreadsheet application. This format also allows research groups the flexibility to incorporate other analysis tools into their in-house analysis pipelines by simply adding additional columns of information to the central file. Change-O, along with pRESTO (Vander Heiden et al., 2014), provides key components of an analytical ecosystem that enables sophisticated analysis of high-throughput Ig repertoire sequencing datasets.
Acknowledgements
The authors thank the Yale University Biomedical High Performance Computing Center [funded by National Institutes of Health grants RR19895 and RR029676-01] for use of their computing resources. The authors also thank Chris Bolen, Moriah Cohen, Jingli Shan and Sonia Timberlake for testing Change-O and providing helpful feedback.
Funding
This work was supported by the National Institutes of Health [R01AI104739 to S.H.K.; T15LM07056 to N.T.G., T15LM07056 to J.A.V.H. from National Library of Medicine (NLM)] and by the United States-Israel Binational Science Foundation [2013395 to G.Y. and S.H.K.].
Conflict of Interest: none declared.
References
Ademokun et al. (2011) Vaccination-induced changes in human B-cell repertoire and pneumococcal IgM and IgA antibody at different ages. Aging Cell, 10, 922–930.
Alamyar et al. (2012) IMGT(®) tools for the nucleotide analysis of immunoglobulin (IG) and T cell receptor (TR) V-(D)-J repertoires, polymorphisms, and IG mutations: IMGT/V-QUEST and IMGT/HighV-QUEST for NGS. Methods Mol. Biol., 882, 569–604.
Barak et al. (2008) IgTree: creating immunoglobulin variable region gene lineage trees. J. Immunol. Methods, 338, 67–74.
Chen et al. (2010) Clustering-based identification of clonally-related immunoglobulin gene sequence sets. Immunome Res., 6 (Suppl. 1), S4.
Csardi and Nepusz (2006) The igraph software package for complex network research. InterJournal, Complex Systems, 1695.
Felsenstein (1989) PHYLIP - Phylogeny inference package (Version 3.2). Cladistics, 5, 164–166.
Gadala-Maria et al. (2015) Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. Proc. Natl. Acad. Sci. USA, 112, 201417683.
Gaëta et al. (2007) iHMMune-align: hidden Markov model-based alignment and identification of germline genes in rearranged immunoglobulin gene sequences. Bioinformatics, 23, 1580–1587.
Glanville et al. (2009) Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire. Proc. Natl. Acad. Sci. USA, 106, 20216–20221.
Hill (1973) Diversity and evenness: a unifying notation and its consequences. Ecology, 54, 427.
Kleinstein et al. (2003) Estimating hypermutation rates from clonal tree data. J. Immunol., 171, 4639–4649.
McKean et al. (1984) Generation of antibody diversity in the immune response of BALB/c mice to influenza virus hemagglutinin. Proc. Natl. Acad. Sci. USA, 81, 3180–3184.
R Core Team (2015) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Robins (2013) Immunosequencing: applications of immune repertoire deep sequencing. Curr. Opin. Immunol., 25, 646–652.
Smith et al. (1996) Di- and trinucleotide target preferences of somatic mutagenesis in normal and autoreactive B cells. J. Immunol., 156, 2642–2652.
Stern et al. (2014) B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes. Sci. Transl. Med., 6, 248ra107.
Uduman et al. (2014) Integrating B cell lineage information into statistical tests for detecting selection in Ig sequences. J. Immunol., 192, 867–874.
Vander Heiden et al. (2014) pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics, 30, 1930–1932.
Wu et al. (2012) Age-related changes in human peripheral blood IGH repertoire following vaccination. Front. Immunol., 3, 193.
Wu et al. (2014) Influence of seasonal exposure to grass pollen on local and peripheral blood IgE repertoires in patients with allergic rhinitis. J. Allergy Clin. Immunol., 134, 604–612.
Yaari et al. (2012) Quantifying selection in high-throughput immunoglobulin sequencing data sets. Nucleic Acids Res., 40, e134.
Yaari et al. (2013) Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Front. Immunol., 4, 358.
Ye et al. (2013) IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res., 41 (Web Server Issue), W34–W40.
© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com