Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data

Journal Article

Namita T. Gupta 1,†, Jason A. Vander Heiden 1,†, Mohamed Uduman 2, Daniel Gadala-Maria 1, Gur Yaari 3 and Steven H. Kleinstein 1,2,*

1 Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA
2 Department of Pathology, Yale University School of Medicine, New Haven, CT 06511, USA
3 Bioengineering Program, Faculty of Engineering, Bar-Ilan University, Ramat Gan 52900, Israel

*To whom correspondence should be addressed.

†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

Associate Editor: Alfonso Valencia

Received: 17 October 2014
Revision received: 30 April 2015

Citation: Namita T. Gupta, Jason A. Vander Heiden, Mohamed Uduman, Daniel Gadala-Maria, Gur Yaari, Steven H. Kleinstein, Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data, Bioinformatics, Volume 31, Issue 20, October 2015, Pages 3356–3358, https://doi.org/10.1093/bioinformatics/btv359

Abstract

Summary: Advances in high-throughput sequencing technologies now allow for large-scale characterization of B cell immunoglobulin (Ig) repertoires. The high germline and somatic diversity of the Ig repertoire presents challenges for biologically meaningful analysis, which requires specialized computational methods. We have developed a suite of utilities, Change-O, which provides tools for advanced analyses of large-scale Ig repertoire sequencing data. Change-O includes tools for determining the complete set of Ig variable region gene segment alleles carried by an individual (including novel alleles), partitioning of Ig sequences into clonal populations, creating lineage trees, inferring somatic hypermutation targeting models, measuring repertoire diversity, quantifying selection pressure, and calculating sequence chemical properties. All Change-O tools utilize a common data format, which enables the seamless integration of multiple analyses into a single workflow.

Availability and implementation: Change-O is freely available for non-commercial use and may be downloaded from http://clip.med.yale.edu/changeo.

Contact: steven.kleinstein@yale.edu

1 Introduction

Large-scale characterization of immunoglobulin (Ig) repertoires is now feasible due to dramatic improvements in high-throughput sequencing technology. Repertoire sequencing is a rapidly growing area, with applications including detection of minimal residual disease, prognosis following transplant, monitoring vaccination responses, identification of neutralizing antibodies and inferring B cell trafficking patterns (Robins, 2013; Stern et al., 2014). We previously developed the repertoire sequencing toolkit (pRESTO) for producing assembled and error-corrected reads from high-throughput lymphocyte receptor sequencing experiments (Vander Heiden et al., 2014). These reads may then be fed into existing methods for alignment against V(D)J germline databases [e.g. IMGT/HighV-QUEST (Alamyar et al., 2012), IgBLAST (Ye et al., 2013), iHMMune-align (Gaëta et al., 2007)]. However, extracting measures of biological and clinical interest from the resulting germline-annotated repertoire remains a time-consuming and error-prone process that is often dependent upon custom analysis scripts. Here, we introduce Change-O, a suite of utilities that covers a range of complex analysis tasks for Ig repertoire sequencing data.

2 Features

The Change-O suite is composed of four software packages: a collection of Python command-line tools (changeo-clt) and three separate R (R Core Team, 2015) packages (alakazam, shm and tigger) (Table 1). Data are passed to Change-O utilities in the form of a tab-delimited text file. Each utility identifies the relevant input data based on standardized column names and adds new columns to the file with the output information to be carried through to the next analysis step. Change-O provides tools to import data from the frequently used IMGT/HighV-QUEST (Alamyar et al., 2012) tool as well as a set of utilities to perform basic database operations, such as sorting, filtering and modifying annotations.

Table 1. Summary of Change-O features

Package       Analysis tasks
changeo-clt   Parsing of V(D)J assignment output
              Basic database manipulation
              Multiple alignment of sequence records
              Assignment of sequences into clonal groups
              Calculation of CDR3 physicochemical properties
alakazam      Clonal diversity analysis
              Lineage reconstruction
shm           SHM hot/cold-spot modeling
              Quantification of selection pressure
tigger        Inference of novel germline alleles
              Construction of personalized germline genotype

The more computationally expensive components have built-in multiprocessing support. Each utility includes detailed help documentation and optional logging to track errors. Example workflow scripts are provided on the website, which can easily be modified by adding, removing or reordering analysis steps to meet different analysis goals. As detailed later, several repertoire analyses may be carried out, depending on the nature of the study.
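
The tab-delimited exchange format described above can be illustrated with a minimal sketch. It reads records keyed by column name, appends one new column and writes the result back out, which is the pattern each utility is described as following. The column names (SEQUENCE_ID, JUNCTION, JUNCTION_LENGTH) and the add_column helper are hypothetical and chosen only for illustration; they are not taken from the Change-O documentation.

```python
import csv
import io


def add_column(tsv_text, new_column, compute):
    """Return tab-delimited text with one extra column appended to every record."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fields = list(reader.fieldnames) + [new_column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Existing columns pass through untouched; only the new one is filled in.
        row[new_column] = compute(row)
        writer.writerow(row)
    return out.getvalue()


# Hypothetical column names and records, for illustration only.
EXAMPLE = "SEQUENCE_ID\tJUNCTION\nseq1\tTGTGCGAGA\nseq2\tTGTGCGAAAGAT\n"

annotated = add_column(EXAMPLE, "JUNCTION_LENGTH", lambda r: len(r["JUNCTION"]))
print(annotated)
```

Because every step only reads the columns it needs and appends its own, steps written this way can be chained or reordered without changing each other's input.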

2.1 Inference of novel alleles and individual genotype

Germline segment assignment tools, such as IMGT/HighV-QUEST, work by aligning each sequence against a database of known alleles. However, this process is inaccurate for sequences that utilize previously undetected alleles. In this case, the sequence will be assigned to the closest known allele and any polymorphisms will be incorrectly identified as somatic mutations. To address this problem, the Tool for Immunoglobulin Genotype Elucidation (TIgGER) (Gadala-Maria et al., 2015) has been implemented as an R package for inclusion in Change-O. TIgGER determines the complete set of variable region gene segments carried by an individual and identifies novel alleles, yielding a set of germline alleles personalized to an individual. The germline variable region allele assignments are then adjusted based on this individual Ig genotype. This process significantly improves the quality of germline assignments, thus increasing the confidence of downstream analysis dependent upon mutation profiles.

2.2 Partitioning sequences into clonal groups

Identifying sequences that are descended from the same B cell (clonal groups) is important to virtually all Ig repertoire analyses. Clonal group sizes and lineage structures provide information on the underlying response, and clonally related sequences cannot be treated independently in statistical analyses and models. Change-O provides several methods for partitioning sequences into clones. Along with published methods based on hierarchical clustering (Ademokun et al., 2011; Chen et al., 2010; Glanville et al., 2009), users also have the option to employ several published somatic hypermutation (SHM) hot/cold-spot targeting models as distance metrics in the clustering methods (Smith et al., 1996; Yaari et al., 2013; Stern et al., 2014). Users may alter the clustering thresholds, and Change-O also includes tools to tune the thresholds based on distance patterns in the repertoire (Glanville et al., 2009).
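
As a rough illustration of threshold-based clonal partitioning, the sketch below groups junction sequences by single-linkage clustering on a length-normalized Hamming distance. This is a simplified stand-in, not the Change-O implementation: the choice of distance, the single-linkage rule, the example sequences and the 0.1 threshold are all assumptions, and a realistic workflow would also take the V and J gene assignments into account before comparing junctions.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage_clones(junctions, threshold):
    """Assign a clone label to each junction.

    Cutting a single-linkage dendrogram at `threshold` is equivalent to taking
    the connected components of the graph that links every pair of sequences
    within that distance, so a union-find structure is sufficient.
    """
    parent = list(range(len(junctions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(junctions)), 2):
        a, b = junctions[i], junctions[j]
        # Only junctions of equal length are compared (an assumption).
        if len(a) == len(b) and hamming_fraction(a, b) <= threshold:
            parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(junctions))]


# Made-up junctions, for illustration only.
junctions = ["TGTGCGAGAGAT", "TGTGCGAGAGAC", "TGTGCGAAAGAC",
             "TGTACCTTTGGG", "TGTGCG"]
clones = single_linkage_clones(junctions, threshold=0.1)
print(clones)
```

The first three junctions end up in one clone even though the first and third differ at two positions, because each is within the threshold of the second. This chaining behavior is characteristic of single linkage and is one reason the threshold needs tuning.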

2.3 Quantification of repertoire diversity

To assess repertoire diversity, Change-O provides an implementation of the general diversity index (^qD) proposed by Hill (1973), which encompasses a range of diversity measures as a smooth curve over a single varying parameter q. Special cases of this general index correspond to the most popular diversity measures: species richness (q = 0), the exponential of the Shannon-Wiener index (as q → 1), the inverse of the Simpson index (q = 2) and the reciprocal of the relative abundance of the largest clone (as q → ∞). Resampling strategies are also provided to perform significance tests and allow comparison across samples with varying sequencing depth (Wu et al., 2014; Stern et al., 2014).
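
The Hill index has the closed form ^qD = (sum_i p_i^q)^(1/(1-q)), where p_i is the relative abundance of clone i. A minimal sketch follows, with the two limiting cases handled explicitly. The function name and the example clone sizes are illustrative assumptions, and the resampling step mentioned above is not included.

```python
import math


def hill_diversity(counts, q):
    """Hill (1973) diversity of order q for a list of clone sizes."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if q == 1:
        # Limit as q -> 1: exponential of the Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    if math.isinf(q):
        # Limit as q -> infinity: reciprocal of the largest relative abundance.
        return 1.0 / max(p)
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


# Made-up clone sizes, for illustration only.
clone_sizes = [50, 30, 10, 10]
for order in (0, 1, 2, math.inf):
    print(order, hill_diversity(clone_sizes, order))
```

For a fixed set of clone sizes the value is non-increasing in q, and it equals the number of clones at every q only when all clones are the same size.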

2.4 Generation of B cell lineage trees

Lineage trees provide a means to trace the ancestral relationships of cells within a clone. This information has been used to estimate mutation rates (Kleinstein et al., 2003), infer B cell trafficking patterns (Stern et al., 2014) and trace the accumulation of mutations that drive affinity maturation (Uduman et al., 2014; Wu et al., 2012). Change-O provides a tool for generating lineage trees using PHYLIP’s maximum parsimony algorithm (Felsenstein, 1989), with modifications to meet the requirements of an Ig lineage tree (Barak et al., 2008; Stern et al., 2014). Trees may be viewed and exported into different file formats using the igraph (Csardi and Nepusz, 2006) R package.

2.5 Somatic hypermutation hot/cold-spot motifs

SHM is a process that operates in activated B cells and introduces point mutations into the DNA coding for the Ig receptor at a very high rate (≈10^-3 per base pair per division) (Kleinstein et al., 2003; McKean et al., 1984). Accurate background models of SHM are critical, since SHM displays intrinsic hot/cold-spot biases (Yaari et al., 2013). Change-O provides utilities for estimating the mutability and substitution rates of DNA motifs from large-scale Ig sequencing data to construct hot/cold-spot motif models. Furthermore, models may be generated based solely on silent mutations, thereby avoiding the confounding influence of selection pressures (Yaari et al., 2013). These tools can be used to build models of SHM targeting and gain insight into the relative contributions of different error-prone repair pathways in SHM.
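
The idea of motif mutability can be sketched with a simple counting scheme: each germline position is keyed by the short motif centered on it, and the estimate is the fraction of occurrences of that motif whose central base is mutated in the observed sequence. This is a deliberately naive illustration, not the published targeting model. The 3-mer window, the function name and the example sequences are assumptions, and the sketch neither restricts to silent mutations nor estimates substitution rates.

```python
from collections import Counter


def motif_mutability(pairs, k=3):
    """Estimate per-motif mutation frequency from (germline, observed) pairs.

    Each germline position is keyed by the k-mer centered on it; the estimate
    is the fraction of that motif's occurrences whose center base differs
    in the observed sequence.
    """
    half = k // 2
    seen, mutated = Counter(), Counter()
    for germline, observed in pairs:
        usable = min(len(germline), len(observed))
        for i in range(half, usable - half):
            motif = germline[i - half:i + half + 1]
            seen[motif] += 1
            if observed[i] != germline[i]:
                mutated[motif] += 1
    return {motif: mutated[motif] / n for motif, n in seen.items()}


# Made-up aligned germline/observed pairs, for illustration only.
pairs = [("AGCTAGCT", "AGTTAGCT"),
         ("AGCTAGCT", "AGCTAGTT")]
mutability = motif_mutability(pairs)
print(mutability)
```

In this toy input both mutations fall on the central C of a GCT motif, so GCT receives a high estimate while every other motif stays at zero.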

2.6 Analysis of selection pressure

For quantifying selection pressure in Ig sequences, Change-O includes the BASELINe (Yaari et al., 2012) method, which has been implemented as an R package for inclusion in the suite. BASELINe quantifies deviations in the frequency of replacement mutations compared with a background model of SHM. Users may choose between published background models (Smith et al., 1996; Yaari et al., 2013) or infer the background from their own data using the SHM model building tools described above.
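
The quantity that selection analysis starts from, the split of observed mutations into replacement and silent, can be sketched as below. This covers only the counting step and is not the BASELINe method, which additionally compares the observed replacement frequency with the expectation under a background SHM model. The function name and example sequences are assumptions, and the sketch assumes in-frame, gap-free sequences containing only A, C, G and T.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_replacement_silent(germline, observed):
    """Count (replacement, silent) point mutations, codon by codon."""
    replacement = silent = 0
    usable = min(len(germline), len(observed)) // 3 * 3
    for start in range(0, usable, 3):
        g_codon = germline[start:start + 3]
        for offset in range(3):
            base = observed[start + offset]
            if base == g_codon[offset]:
                continue
            # Evaluate each mutation against the germline codon on its own.
            mutant = g_codon[:offset] + base + g_codon[offset + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[g_codon]:
                silent += 1
            else:
                replacement += 1
    return replacement, silent


# Made-up sequences: GCT->GCC is silent (Ala), AAA->AGA is Lys->Arg.
r_count, s_count = count_replacement_silent("GCTAAAGGT", "GCCAGAGGT")
print(r_count, s_count)
```

Scoring each mutation independently against the germline codon is a simplifying choice; codons with more than one mutation could instead be evaluated jointly, which would change some classifications.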

3 Conclusion

Change-O is a suite of utilities implementing a wide range of B cell repertoire analysis methods. Together these tools allow researchers to quickly implement advanced analysis pipelines for large datasets generated by repertoire sequencing experiments. A simple tab-delimited file with standardized column names allows for communication between the utilities and can easily be viewed using any spreadsheet application. This format also gives research groups the flexibility to incorporate other analysis tools into their in-house pipelines by simply adding columns of information to the central file. Change-O, along with pRESTO (Vander Heiden et al., 2014), provides key components of an analytical ecosystem that enables sophisticated analysis of high-throughput Ig repertoire sequencing datasets.

Acknowledgements

The authors thank the Yale University Biomedical High Performance Computing Center [funded by National Institutes of Health grants RR19895 and RR029676-01] for use of their computing resources. The authors also thank Chris Bolen, Moriah Cohen, Jingli Shan and Sonia Timberlake for testing Change-O and providing helpful feedback.

Funding

This work was supported by the National Institutes of Health [R01AI104739 to S.H.K.; T15LM07056 to N.T.G., T15LM07056 to J.A.V.H. from National Library of Medicine (NLM)] and by the United States-Israel Binational Science Foundation [2013395 to G.Y. and S.H.K.].

Conflict of Interest: none declared.

References

Ademokun et al. (2011) Vaccination-induced changes in human B-cell repertoire and pneumococcal IgM and IgA antibody at different ages. Aging Cell, 10, 922–930.

Alamyar et al. (2012) IMGT® tools for the nucleotide analysis of immunoglobulin (IG) and T cell receptor (TR) V-(D)-J repertoires, polymorphisms, and IG mutations: IMGT/V-QUEST and IMGT/HighV-QUEST for NGS. Methods Mol. Biol., 882, 569–604.

Barak et al. (2008) IgTree: creating immunoglobulin variable region gene lineage trees. J. Immunol. Methods, 338, 67–74.

Chen et al. (2010) Clustering-based identification of clonally-related immunoglobulin gene sequence sets. Immunome Res., 6 (Suppl. 1), S4.

Csardi and Nepusz (2006) The igraph software package for complex network research. InterJournal, Complex Systems, 1695.

Felsenstein (1989) PHYLIP - Phylogeny inference package (Version 3.2). Cladistics, 5, 164–166.

Gadala-Maria et al. (2015) Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. Proc. Natl. Acad. Sci. USA, 112, 201417683.

Gaëta et al. (2007) iHMMune-align: hidden Markov model-based alignment and identification of germline genes in rearranged immunoglobulin gene sequences. Bioinformatics, 23, 1580–1587.

Glanville et al. (2009) Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire. Proc. Natl. Acad. Sci. USA, 106, 20216–20221.

Hill (1973) Diversity and evenness: a unifying notation and its consequences. Ecology, 54, 427.

Kleinstein et al. (2003) Estimating hypermutation rates from clonal tree data. J. Immunol., 171, 4639–4649.

McKean et al. (1984) Generation of antibody diversity in the immune response of BALB/c mice to influenza virus hemagglutinin. Proc. Natl. Acad. Sci. USA, 81, 3180–3184.

R Core Team (2015) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Robins (2013) Immunosequencing: applications of immune repertoire deep sequencing. Curr. Opin. Immunol., 25, 646–652.

Smith et al. (1996) Di- and trinucleotide target preferences of somatic mutagenesis in normal and autoreactive B cells. J. Immunol., 156, 2642–2652.

Stern et al. (2014) B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes. Sci. Transl. Med., 6, 248ra107.

Uduman et al. (2014) Integrating B cell lineage information into statistical tests for detecting selection in Ig sequences. J. Immunol., 192, 867–874.

Vander Heiden et al. (2014) pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics, 30, 1930–1932.

Wu et al. (2012) Age-related changes in human peripheral blood IGH repertoire following vaccination. Front. Immunol., 3, 193.

Wu et al. (2014) Influence of seasonal exposure to grass pollen on local and peripheral blood IgE repertoires in patients with allergic rhinitis. J. Allergy Clin. Immunol., 134, 604–612.

Yaari et al. (2012) Quantifying selection in high-throughput immunoglobulin sequencing data sets. Nucleic Acids Res., 40, e134.

Yaari et al. (2013) Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Front. Immunol., 4, 358.

Ye et al. (2013) IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res., 41 (Web Server Issue), W34–W40.

© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
