Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data

1Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, USA,

2Department of Pathology, Yale University School of Medicine, New Haven, CT 06511, USA and

3Bioengineering Program, Faculty of Engineering, Bar-Ilan University, Ramat Gan 52900, Israel

*To whom correspondence should be addressed.

†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

Associate Editor: Alfonso Valencia

Received: 17 October 2014

Revision received: 30 April 2015

Namita T. Gupta, Jason A. Vander Heiden, Mohamed Uduman, Daniel Gadala-Maria, Gur Yaari, Steven H. Kleinstein, Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data, Bioinformatics, Volume 31, Issue 20, October 2015, Pages 3356–3358, https://doi.org/10.1093/bioinformatics/btv359

Abstract

Summary: Advances in high-throughput sequencing technologies now allow for large-scale characterization of B cell immunoglobulin (Ig) repertoires. The high germline and somatic diversity of the Ig repertoire presents challenges for biologically meaningful analysis, which requires specialized computational methods. We have developed a suite of utilities, Change-O, which provides tools for advanced analyses of large-scale Ig repertoire sequencing data. Change-O includes tools for determining the complete set of Ig variable region gene segment alleles carried by an individual (including novel alleles), partitioning of Ig sequences into clonal populations, creating lineage trees, inferring somatic hypermutation targeting models, measuring repertoire diversity, quantifying selection pressure, and calculating sequence chemical properties. All Change-O tools utilize a common data format, which enables the seamless integration of multiple analyses into a single workflow.

Availability and implementation: Change-O is freely available for non-commercial use and may be downloaded from http://clip.med.yale.edu/changeo.

Contact: steven.kleinstein@yale.edu

1 Introduction

Large-scale characterization of immunoglobulin (Ig) repertoires is now feasible due to dramatic improvements in high-throughput sequencing technology. Repertoire sequencing is a rapidly growing area, with applications including detection of minimum residual disease, prognosis following transplant, monitoring vaccination responses, identification of neutralizing antibodies and inferring B cell trafficking patterns (Robins, 2013; Stern et al., 2014). We previously developed the repertoire sequencing toolkit (pRESTO) for producing assembled and error-corrected reads from high-throughput lymphocyte receptor sequencing experiments (Vander Heiden et al., 2014), which may then be fed into existing methods for alignment against V(D)J germline databases [e.g. IMGT/HighV-QUEST (Alamyar et al., 2012), IgBLAST (Ye et al., 2013), iHMMune-align (Gaëta et al., 2007)]. However, extracting measures of biological and clinical interest from the resulting germline-annotated repertoire remains a time-consuming and error-prone process that is often dependent upon custom analysis scripts. Here, we introduce Change-O, a suite of utilities that cover a range of complex analysis tasks for Ig repertoire sequencing data.

2 Features

The Change-O suite is composed of four software packages: a collection of Python command-line tools (changeo-ctl) and three separate R (R Core Team, 2015) packages (alakazam, shm and tigger) (Table 1). Data are passed to Change-O utilities as a tab-delimited text file. Each utility identifies its input data by standardized column names and appends new columns containing its output, so that the results are carried through to the next analysis step. Change-O provides tools to import data from the frequently used IMGT/HighV-QUEST (Alamyar et al., 2012) tool, as well as a set of utilities to perform basic database operations, such as sorting, filtering and modifying annotations.

Table 1. Summary of Change-O features

Package	Analysis tasks
changeo-ctl	Parsing of V(D)J assignment output
	Basic database manipulation
	Multiple alignment of sequence records
	Assignment of sequences into clonal groups
	Calculation of CDR3 physicochemical properties
alakazam	Clonal diversity analysis
	Lineage reconstruction
shm	SHM hot/cold-spot modeling
	Quantification of selection pressure
tigger	Inference of novel germline alleles
	Construction of personalized germline genotype

The more computationally expensive components have built-in multiprocessing support. Each utility includes detailed help documentation and optional logging to track errors. Example workflow scripts are provided on the website, which can easily be modified by adding, removing or reordering analysis steps to meet different analysis goals. As detailed later, several repertoire analyses may be carried out, depending on the nature of the study.
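The column-based exchange format described above can be illustrated with a minimal Python sketch. It is not taken from the Change-O source code: the function name, the column names (SEQUENCE_ID, V_CALL, JUNCTION, JUNCTION_LENGTH) and the example records are assumptions chosen for illustration. The sketch only shows the general pattern of locating input by column name and appending an output column for the next step.

```python
import csv
import io


def add_junction_length(in_handle, out_handle,
                        junction_col="JUNCTION",
                        new_col="JUNCTION_LENGTH"):
    """Read a tab-delimited table, append one annotation column, write it out.

    Column names are hypothetical; any utility following the same pattern
    would look up the columns it needs by name and leave the rest untouched.
    """
    reader = csv.DictReader(in_handle, delimiter="\t")
    if junction_col not in reader.fieldnames:
        raise KeyError(f"required column {junction_col!r} not found")
    fields = list(reader.fieldnames) + [new_col]
    writer = csv.DictWriter(out_handle, fieldnames=fields,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Existing columns pass through unchanged; only the new one is added.
        row[new_col] = len(row[junction_col])
        writer.writerow(row)


# Illustrative input (made-up records, not real data).
example = (
    "SEQUENCE_ID\tV_CALL\tJUNCTION\n"
    "seq1\tIGHV1-2*01\tTGTGCGAGAGAT\n"
    "seq2\tIGHV3-23*01\tTGTGCGAAA\n"
)
out = io.StringIO()
add_junction_length(io.StringIO(example), out)
result = out.getvalue()
print(result)
```

Because each step only appends columns, steps written this way can be chained or reordered as long as the columns a step requires already exist in its input.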

2.1 Inference of novel alleles and individual genotype

Germline segment assignment tools, such as IMGT/HighV-QUEST, work by aligning each sequence against a database of known alleles. However, this process is inaccurate for sequences that utilize previously undetected alleles. In this case, the sequence will be assigned to the closest known allele and any polymorphisms will be incorrectly identified as somatic mutations. To address this problem, the Tool for Immunoglobulin Genotype Elucidation (TIgGER) (Gadala-Maria et al., 2015) has been implemented as an R package for inclusion in Change-O. TIgGER determines the complete set of variable region gene segments carried by an individual and identifies novel alleles, yielding a set of germline alleles personalized to an individual. The germline variable region allele assignments are then adjusted based on this individual Ig genotype. This process significantly improves the quality of germline assignments, thus increasing the confidence of downstream analysis dependent upon mutation profiles.
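The confusion between germline polymorphisms and somatic mutations described above can be made concrete with a toy Python sketch. This is not the TIgGER algorithm, whose published procedure is more involved; it is a deliberately simplified heuristic, and the function name, the 0.9 cut-off and the sequences are all assumptions. The idea shown is only that a mismatch shared by nearly every sequence assigned to an allele is more plausibly an unrecorded germline variant than an independent somatic mutation.

```python
from collections import Counter


def candidate_polymorphisms(germline, sequences, min_fraction=0.9):
    """Return (position, base, fraction) for mismatches shared by most sequences.

    germline: the closest known allele (string).
    sequences: aligned sequences assigned to that allele.
    min_fraction: hypothetical cut-off, chosen only for illustration.
    """
    hits = []
    for pos, ref in enumerate(germline):
        # Count the non-germline bases observed at this position.
        bases = Counter(seq[pos] for seq in sequences
                        if len(seq) > pos and seq[pos] != ref)
        if not bases:
            continue
        base, n = bases.most_common(1)[0]
        fraction = n / len(sequences)
        if fraction >= min_fraction:
            hits.append((pos, base, fraction))
    return hits


# Toy data: every sequence carries C at position 3, while the other
# mismatches are scattered, as expected for somatic mutations.
germline = "ACGTACGTAC"
sequences = ["ACGCACGTAC"] * 8 + ["ACGCACGTAA", "ACGCTCGTAC"]
hits = candidate_polymorphisms(germline, sequences)
print(hits)
```

A heuristic this simple would miss novel alleles in heterozygous individuals, where only a subset of the sequences assigned to the closest known allele carries the variant. That is one reason a dedicated method is needed.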

2.2 Partitioning sequences into clonally related groups

Identifying sequences that are descended from the same B cell (clonal groups) is important to virtually all Ig repertoire analyses. Clonal group sizes and lineage structures provide information on the underlying response, and clonally related sequences cannot be treated independently in statistical analyses and models. Change-O provides several methods for partitioning sequences into clones. Along with published methods based on hierarchical clustering (Ademokun et al., 2011; Chen et al., 2010; Glanville et al., 2009), users also have the option to employ several published somatic hypermutation (SHM) hot/cold-spot targeting models as distance metrics in the clustering methods (Smith et al., 1996; Yaari et al., 2013; Stern et al., 2014). Users may alter the clustering thresholds, and Change-O also includes tools to tune the thresholds based on distance patterns in the repertoire (Glanville et al., 2009).
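As an illustration of threshold-based hierarchical clustering, the following Python sketch performs single-linkage clustering of junction sequences using a length-normalized Hamming distance. It is a sketch under stated assumptions rather than Change-O's implementation: the record keys, the pre-grouping by V gene, J gene and junction length, and the 0.15 threshold are all hypothetical choices, and the plain Hamming distance stands in for the SHM targeting models mentioned above.

```python
from collections import defaultdict


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length, non-empty strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def assign_clones(records, threshold=0.15):
    """Single-linkage clustering within (V gene, J gene, junction length) groups.

    records: list of dicts with hypothetical keys 'id', 'v_gene', 'j_gene', 'junction'.
    Returns a dict mapping sequence id -> integer clone id.
    Cutting a single-linkage dendrogram at `threshold` is equivalent to taking
    connected components of the graph whose edges join pairs at distance <= threshold.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["v_gene"], rec["j_gene"], len(rec["junction"]))
        groups[key].append(rec)

    clone_of = {}
    next_clone = 1
    for members in groups.values():
        parent = list(range(len(members)))  # union-find forest for this group

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                d = hamming_fraction(members[i]["junction"], members[j]["junction"])
                if d <= threshold:
                    parent[find(i)] = find(j)

        labels = {}
        for i, rec in enumerate(members):
            root = find(i)
            if root not in labels:
                labels[root] = next_clone
                next_clone += 1
            clone_of[rec["id"]] = labels[root]
    return clone_of


# Made-up records for illustration only.
records = [
    {"id": "a", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "TGTGCGAGAGAT"},
    {"id": "b", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "TGTGCGAGAGAC"},
    {"id": "c", "v_gene": "IGHV1-2", "j_gene": "IGHJ4", "junction": "TGTAAATTTGGG"},
    {"id": "d", "v_gene": "IGHV3-23", "j_gene": "IGHJ4", "junction": "TGTGCGAGAGAT"},
]
clones = assign_clones(records)
print(clones)
```

The pairwise comparison is quadratic in group size, which is acceptable for a sketch but is the kind of step that benefits from the multiprocessing support mentioned earlier.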

2.3 Quantification of repertoire diversity

To assess repertoire diversity, Change-O provides an implementation of the general diversity index ($^{q}D$) proposed by Hill (1973), which encompasses a range of diversity measures as a smooth curve over a single varying parameter $q$. Special cases of this general index correspond to the most popular diversity measures: species richness ($q = 0$), the exponential of the Shannon-Wiener index (as $q \to 1$), the inverse of the Simpson index ($q = 2$), and the reciprocal abundance of the largest clone (as $q \to \infty$). Resampling strategies are also provided to perform significance tests and allow comparison across samples with varying sequencing depth (Wu et al., 2014; Stern et al., 2014).
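For reference, the Hill diversity index of order q for clones with relative abundances p_i is $^{q}D = \left(\sum_{i} p_i^{q}\right)^{1/(1-q)}$, with the limit $\exp\left(-\sum_{i} p_i \ln p_i\right)$ as $q \to 1$. The Python sketch below evaluates this formula directly. It is not the alakazam implementation and omits the resampling step; the function name and the clone sizes are assumptions made for illustration.

```python
import math


def hill_diversity(counts, q):
    """Hill diversity of order q for a list of clone sizes (counts)."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if math.isclose(q, 1.0):
        # Limit as q -> 1: exponential of the Shannon-Wiener index.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


# Made-up clone sizes: four clones, the largest holding half the sequences.
clone_sizes = [50, 30, 15, 5]
curve = {q: hill_diversity(clone_sizes, q) for q in (0, 1, 2, 4)}
for q, d in curve.items():
    print(f"q={q}: D={d:.3f}")
```

Evaluating the function over a grid of q values gives the smooth diversity curve described above. Low q weights rare clones heavily, whereas high q is dominated by the largest clones, so the curve is non-increasing in q.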

2.4 Generation of B cell lineage trees

Lineage trees provide a means to trace the ancestral relationships of cells within a clone. This information has been used to estimate mutation rates (Kleinstein et al., 2003), infer B cell trafficking patterns (Stern et al., 2014) and trace the accumulation of mutations that drive affinity maturation (Uduman et al., 2014; Wu et al., 2012). Change-O provides a tool for generating lineage trees using PHYLIP’s maximum parsimony algorithm (Felsenstein, 1989), with modifications to meet the requirements of an Ig lineage tree (Barak et al., 2008; Stern et al., 2014). Trees may be viewed and exported into different file formats using the igraph (Csardi and Nepusz, 2006) R package.

2.5 Somatic hypermutation hot/cold-spot motifs

SHM is a process that operates in activated B cells and introduces point mutations into the DNA coding for the Ig receptor at a very high rate ($\approx 10^{-3}$ per base-pair per division) (Kleinstein et al., 2003; McKean et al., 1984). Accurate background models of SHM are critical, since SHM displays intrinsic hot/cold-spot biases (Yaari et al., 2013). Change-O provides utilities for estimating the mutability and substitution rates of DNA motifs from large-scale Ig sequencing data to construct hot/cold-spot motif models. Furthermore, models may be generated based solely on silent mutations, thereby avoiding the confounding influence of selection pressures (Yaari et al., 2013). These tools can be used to build models of SHM targeting and gain insight into the relative contributions of different error-prone repair pathways in SHM.
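The notion of motif mutability can be sketched in Python as a simple count: for every DNA motif in the germline sequence, how often is its central base mutated in the observed sequence? This is an illustrative simplification, not the estimation procedure of the shm package. It assumes pre-aligned sequences of equal length, does not distinguish silent from replacement mutations, applies no normalization, and the function name, the default motif width of 5 and the sequences are assumptions.

```python
from collections import Counter


def motif_mutability(pairs, k=5):
    """Estimate per-motif mutability from (germline, observed) sequence pairs.

    pairs: iterable of aligned, equal-length (germline, observed) strings.
    k: odd motif width; the mutation is scored at the motif's central base.
    Returns a dict motif -> fraction of occurrences whose centre is mutated.
    """
    half = k // 2
    seen = Counter()
    mutated = Counter()
    for germline, observed in pairs:
        if len(germline) != len(observed):
            raise ValueError("sequences must be aligned to equal length")
        for i in range(half, len(germline) - half):
            motif = germline[i - half:i + half + 1]
            # Skip motifs or observed bases containing gaps or ambiguous characters.
            if set(motif) - set("ACGT") or observed[i] not in "ACGT":
                continue
            seen[motif] += 1
            if observed[i] != germline[i]:
                mutated[motif] += 1
    return {motif: mutated[motif] / n for motif, n in seen.items()}


# Made-up example: the motif AAGCT occurs twice per germline sequence and
# its central G is mutated once across the two pairs.
pairs = [
    ("AAGCTAAGCT", "AAACTAAGCT"),
    ("AAGCTAAGCT", "AAGCTAAGCT"),
]
rates = motif_mutability(pairs)
print(rates["AAGCT"])
```

Restricting the count to silent mutations, as described above, would require codon-aware classification of each mutation and is left out of this sketch.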

2.6 Analysis of selection pressure

For quantifying selection pressure in Ig sequences, Change-O includes the BASELINe (Yaari et al., 2012) method, which has been implemented as an R package for inclusion in the suite. BASELINe quantifies deviations in the frequency of replacement mutations compared with a background model of SHM. Users may choose between published background models (Smith et al., 1996; Yaari et al., 2013) or infer the background from their own data using the SHM model building tools described above.

3 Conclusion

Change-O is a suite of utilities implementing a wide range of B cell repertoire analysis methods. Together these tools allow researchers to quickly implement advanced analysis pipelines for large datasets generated by repertoire sequencing experiments. A simple tab-delimited file with standardized column names allows for communication between the utilities and can easily be viewed using any spreadsheet application. This format also allows research groups the flexibility to incorporate other analysis tools into their in-house analysis pipelines by simply adding additional columns of information to the central file. Change-O, along with pRESTO (Vander Heiden et al., 2014), provides key components of an analytical ecosystem that enables sophisticated analysis of high-throughput Ig repertoire sequencing datasets.

Acknowledgements

The authors thank the Yale University Biomedical High Performance Computing Center [funded by National Institutes of Health grants RR19895 and RR029676-01] for use of their computing resources. The authors also thank Chris Bolen, Moriah Cohen, Jingli Shan and Sonia Timberlake for testing Change-O and providing helpful feedback.

Funding

This work was supported by the National Institutes of Health [R01AI104739 to S.H.K.; T15LM07056 to N.T.G., T15LM07056 to J.A.V.H. from National Library of Medicine (NLM)] and by the United States-Israel Binational Science Foundation [2013395 to G.Y. and S.H.K.].

Conflict of Interest: none declared.

References

Ademokun et al. (2011) Vaccination-induced changes in human B-cell repertoire and pneumococcal IgM and IgA antibody at different ages. Aging Cell, 922–930.

Alamyar et al. (2012) IMGT(®) tools for the nucleotide analysis of immunoglobulin (IG) and T cell receptor (TR) V-(D)-J repertoires, polymorphisms, and IG mutations: IMGT/V-QUEST and IMGT/HighV-QUEST for NGS. Methods Mol. Biol., 882, 569–604.

Barak et al. (2008) IgTree: creating immunoglobulin variable region gene lineage trees. J. Immunol. Methods, 338.

Chen et al. (2010) Clustering-based identification of clonally-related immunoglobulin gene sequence sets. Immunome Res., (Suppl. 1).

Csardi and Nepusz (2006) The igraph software package for complex network research. InterJournal, Complex Systems, 1695.

Felsenstein (1989) PHYLIP - Phylogeny inference package (Version 3.2). Cladistics, 164–166.

Gadala-Maria et al. (2015) Automated analysis of high-throughput B-cell sequencing data reveals a high frequency of novel immunoglobulin V gene segment alleles. Proc. Natl. Acad. Sci. USA, 112, 201417683.

Gaëta et al. (2007) iHMMune-align: hidden Markov model-based alignment and identification of germline genes in rearranged immunoglobulin gene sequences. Bioinformatics, 1580–1587.

Glanville et al. (2009) Precise determination of the diversity of a combinatorial antibody library gives insight into the human immunoglobulin repertoire. Proc. Natl. Acad. Sci. USA, 106, 20216–20221.

Hill (1973) Diversity and evenness: a unifying notation and its consequences. Ecology, 427.

Kleinstein et al. (2003) Estimating hypermutation rates from clonal tree data. J. Immunol., 171, 4639–4649.

McKean et al. (1984) Generation of antibody diversity in the immune response of BALB/c mice to influenza virus hemagglutinin. Proc. Natl. Acad. Sci. USA, 3180–3184.

R Core Team (2015) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Robins (2013) Immunosequencing: applications of immune repertoire deep sequencing. Curr. Opin. Immunol., 646–652.

Smith et al. (1996) Di- and trinucleotide target preferences of somatic mutagenesis in normal and autoreactive B cells. J. Immunol., 156, 2642–2652.

Stern et al. (2014) B cells populating the multiple sclerosis brain mature in the draining cervical lymph nodes. Sci. Transl. Med., 248ra107.

Uduman et al. (2014) Integrating B cell lineage information into statistical tests for detecting selection in Ig sequences. J. Immunol., 192, 867–874.

Vander Heiden et al. (2014) pRESTO: a toolkit for processing high-throughput sequencing raw reads of lymphocyte receptor repertoires. Bioinformatics, 1930–1932.

Wu et al. (2012) Age-related changes in human peripheral blood IGH repertoire following vaccination. Front. Immunol., 193.

Wu et al. (2014) Influence of seasonal exposure to grass pollen on local and peripheral blood IgE repertoires in patients with allergic rhinitis. J. Allergy Clin. Immunol., 134, 604–612.

Yaari et al. (2012) Quantifying selection in high-throughput immunoglobulin sequencing data sets. Nucleic Acids Res., e134.

Yaari et al. (2013) Models of somatic hypermutation targeting and substitution based on synonymous mutations from high-throughput immunoglobulin sequencing data. Front. Immunol., 358.

Ye et al. (2013) IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res., (Web Server Issue), W34–W40.
