Decoding T- and B-cell receptor repertoires with ClustIRR (original) (raw)

Contents

library(knitr)
library(ClustIRR)
library(igraph)
library(ggplot2)
theme_set(new = theme_bw(base_size = 10))
library(ggrepel)
library(patchwork)
options(digits=2)

Introduction

Adaptive immunity employs diverse immune receptor repertoires (IRRs; B- and T-cell receptors) to combat evolving pathogens including viruses, bacteria, and cancers. Receptor diversity arises through V(D)J recombination - combinatorial assembly of germline genes generating unique sequences. Each naive lymphocyte consequently expresses a distinct receptor, enabling broad antigen recognition.

B cells engage antigens directly via BCR complementarity-determining regions (CDRs), while T cells recognize peptide-MHC complexes through TCR CDRs. Antigen recognition triggers clonal expansion, producing effector populations that mount targeted immune responses.

High-throughput sequencing (HT-seq) enables tracking of TCR/BCR community dynamics across biological conditions (e.g., pre-/post-treatment), offering insights into responses to immunotherapy and vaccination. However, two key challenges complicate this approach:

  1. Extreme diversity and privacy: TCRs/BCRs are highly personalized, with incomplete sampling particularly problematic in clinical settings where sample volumes are limited. Even comprehensive sampling reveals minimal repertoire overlap between individuals.
  2. Similar TCRs/BCRs recognize similar antigens

This vignette introduces ClustIRR, a computational method that addresses these challenges by: (1) Identifying specificity-associated receptor communities through sequence clustering, and (2) Applying Bayesian models to quantify differential community abundance across conditions.

Installation

ClustIRR is freely available as part of Bioconductor, filling the gap that currently exists in terms of software for quantitative analysis of IRRs.

To install ClustIRR please start R and enter:

if(!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("ClustIRR")

ClustIRR algorithm

Input

The main input of ClustIRR is a repertoire (s), which should be provided as data.frame. The rows in the data.frame correspond to clones (clone = group of cells derived from a common parent cell by clonal expansion). We use the following data from each clone:

In a typical scenario, the user will have more than one repertoire (see workflow). For instance, the user will analyze longitudinal repertoire data, i.e., two or three repertoires taken at different time points; or across different biological conditions.

Let’s look at dataset D1 that is provided within_ClustIRR_. D1 contains three TCR\(\alpha\beta\)repertoires: \(a\), \(b\), and \(c\) and their metadata: ma, mb and mc.

data("D1", package = "ClustIRR")
str(D1)
List of 6
 $ a :'data.frame': 500 obs. of  4 variables:
  ..$ CDR3a     : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
  ..$ CDR3b     : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
  ..$ sample    : chr [1:500] "a" "a" "a" "a" ...
  ..$ clone_size: num [1:500] 3 2 2 5 3 5 5 5 3 4 ...
 $ b :'data.frame': 500 obs. of  4 variables:
  ..$ CDR3a     : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
  ..$ CDR3b     : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
  ..$ sample    : chr [1:500] "b" "b" "b" "b" ...
  ..$ clone_size: num [1:500] 3 2 2 7 4 5 5 7 3 4 ...
 $ c :'data.frame': 500 obs. of  4 variables:
  ..$ CDR3a     : chr [1:500] "CASSLRGAHNEQFF" "CASGLRQGYGYTF" "CSAGGFRETQYF" "CGSSLSQGSEQYF" ...
  ..$ CDR3b     : chr [1:500] "CASTVTSGSNEKLFF" "CASSLTGTGYTF" "CSALTPGLIYNEQFF" "CSARASWGTNEKLFF" ...
  ..$ sample    : chr [1:500] "c" "c" "c" "c" ...
  ..$ clone_size: num [1:500] 3 2 2 9 5 5 5 9 3 4 ...
 $ ma:'data.frame': 500 obs. of  8 variables:
  ..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ cell    : chr [1:500] "CD8" "CD4" "CD8" "CD8" ...
  ..$ HLA_A   : chr [1:500] "HLA-A∗24" "HLA-A∗24" "HLA-A∗24" "HLA-A∗24" ...
  ..$ age     : num [1:500] 24 24 24 24 24 24 24 24 24 24 ...
  ..$ TRAV    : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
  ..$ TRAJ    : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
  ..$ TRBV    : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
  ..$ TRBJ    : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...
 $ mb:'data.frame': 500 obs. of  8 variables:
  ..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ cell    : chr [1:500] "CD8" "CD4" "CD4" "CD4" ...
  ..$ HLA_A   : chr [1:500] "HLA-A∗02" "HLA-A∗02" "HLA-A∗02" "HLA-A∗02" ...
  ..$ age     : num [1:500] 30 30 30 30 30 30 30 30 30 30 ...
  ..$ TRAV    : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
  ..$ TRAJ    : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
  ..$ TRBV    : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
  ..$ TRBJ    : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...
 $ mc:'data.frame': 500 obs. of  8 variables:
  ..$ clone_id: int [1:500] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ cell    : chr [1:500] "CD8" "CD8" "CD8" "CD8" ...
  ..$ HLA_A   : chr [1:500] "HLA-A∗11" "HLA-A∗11" "HLA-A∗11" "HLA-A∗11" ...
  ..$ age     : num [1:500] 40 40 40 40 40 40 40 40 40 40 ...
  ..$ TRAV    : chr [1:500] "TRAV27" "TRAV27" "TRAV20-1" "TRAV7-7" ...
  ..$ TRAJ    : chr [1:500] "TRAJ2-1" "TRAJ1-2" "TRAJ2-5" "TRAJ2-7" ...
  ..$ TRBV    : chr [1:500] "TRBV6-8" "TRBV7-3" "TRBV20-1" "TRBV20-1" ...
  ..$ TRBJ    : chr [1:500] "TRBJ1-4" "TRBJ1-2" "TRBJ2-1" "TRBJ1-4" ...

Extract the data.frames for each TCR repertoire and their metadata:

We will merge three TCR repertoires into the data.frame tcr_reps.

tcr_reps <- rbind(D1$a, D1$b, D1$c)

We will do the same for the metadata

meta <- rbind(D1$ma, D1$mb, D1$mc)

Algorithm

ClustIRR performs the following steps.

  1. Compute similarities between T-cell clones within each TCR repertoire
  2. Construct a graph from each TCR repertoire
  3. Construct a joint similarity graph (\(J\))
  4. Detect communities in \(J\)
  5. Analyze Differential Community Occupancy (DCO)
    a. Between individual TCR repertoires with model \(M\)
    b. Between groups of TCR repertoires from biological conditions with model \(M_h\)
  6. Inspect results

Step 1. Compute TCR clone similarities in a repertoire with cluster_irr

ClustIRR aims to quantify the similarity between pairs of TCR clones based on the similarities of their CDR3s sequences. For this it employs Basic Local-Alignment Search Tool (BLAST) via the R-package_blaster_. Briefly, a protein database is constructed from all CDR3 sequences, and each CDR3 sequence is used as a query. This enables fast sequence similarity searches. Furthermore, only CDR3 sequences matches with \(\geq\) 70% sequence identity to the query are retained. This step reduces the computational and memory requirements, without impacting downstream community analyses, as CDR3 sequences with lower typically yield low similarity scores.

For matched CDR3 pair, an alignment score (\(\omega\)) is computed using BLOSUM62 substitution scores with gap opening penalty of -10 and gap extension penalty of -4. \(\omega\) is the sum of substitution scores and gap penalties in the alignment. Identical or highly similar CDR3 sequence pairs receive large positive \(\omega\) scores, while dissimilar pairs receive low or negative \(\omega\). To normalize \(\omega\) for alignment length, ClustIRR computes\(\bar{\omega} = \omega/l\), where \(l\) is the alignment length yielding normalized alignment score \(\bar{\omega}\). This normalization, also used in iSMART (Zhang, 2020), ensures comparability across CDR3 pairs of varying lengths.

ClustIRR also computes alignment scores for the CDR3core regions (\(\omega^c\) and \(\bar{\omega}^c\)). The CDR3 core, representing the central loop region with high antigen-contact probability (Glanville, 2017), is generated by trimming three residues from each end of the CDR3 sequence. Comparing \(\bar{\omega}^c\) and\(\bar{\omega}\) allows assessment of whether sequence similarity is concentrated in the core or flanking regions.

Step 2-3. Construct TCR repertoire graphs and join them into \(J\)

Next, ClustIRR builds a graph for each TCR repertoire. The graphs have nodes and weighted edges:

Then the graphs are joined together: edges between TCR clones from different TCR repertoires are computed using the same procedure outlined in step 1. The joint graph \(J\) is stored as an igraph object.

Run steps 1-3 with clustirr

Step 1. involves calling the function clust_irr which returns an S4 object of class clust_irr.

cl <- clustirr(s = tcr_reps, meta = meta, control = list(gmi = 0.7))

The output complex list with three elements:

  1. graph = the joint graph \(J\)
  2. clust_irrs = list with S4 clust_irr objects one for each TCR repertoire
  3. multigraph = logical value which tell us whether \(J\) contains multiple or a single TCR repertoire

Inspect the content of clust_irrs

We can look at the similarity scores between CDR3 sequences of TCR repertoire a. Each row is a pair of CDR3 sequences from the repertoire annotated with the following metrics:

The results for CDR3\(\alpha\) sequence pairs from repertoire a:

kable(head(cl$clust_irrs[["a"]]@clust$CDR3a), digits = 2)

… the same table for CDR3\(\beta\) sequence pairs from repertoire a:

kable(head(cl$clust_irrs[["a"]]@clust$CDR3b), digits = 2)

Notice that very similar CDR3\(\alpha\) or CDR3\(\beta\) pairs have high normalized alignment scores (\(\bar{\omega}\)). We have similar tables for repertoire b and c. These are used internally to construct graphs in the next step.

Inspect graphs with plot_graph

We can use the function plot_graph for interactive visualization of relatively small graphs.

The function clust_irr performs automatic annotation of TCR clones based on multiple databases (DBs), including: VDJdb, TCR3d, McPAS-TCR. Each TCR clone recieves two types of annotations (per chain and per database):

Hence, in the plot we can highlight TCR clones that recognize certain antigenic species or genes (see dropdown menu), or use the hoovering function to look at the CDR3 sequences of nodes.

Do this now!

plot_graph(cl, select_by = "Ag_species", as_visnet = TRUE)

Notice that even in this small case study–where each TCR repertoire contains only 500 TCR clones–we observe that the joint graph is complex! This complexity makes qualitative analyses impractical. To address this, ClustIRR focuses on quantitative analyses (see the next steps).

You can evaluate \(J\) with igraph

We can use igraph functions to inspect various properties of the graph in cl$graph. For instance, below, we extract the edge attributes and visualize the distributions of the edge attributes ncweight andnweight for all CDR3\(\alpha\) and CDR3\(\beta\) sequence pairs.

# data.frame of edges and their attributes
e <- igraph::as_data_frame(x = cl$graph, what = "edges")
ggplot(data = e)+
  geom_density(aes(ncweight, col = chain))+
  geom_density(aes(nweight, col = chain), linetype = "dashed")+
  xlab(label = "edge weight (solid = ncweight, dashed = nweight)")+
  theme(legend.position = "top")

Can you guess why we observe trimodal distributions?

Step 6. Inspect results

Visualizing the distribution of \(\beta\) with get_beta_violins

beta_violins <- get_beta_violins(node_summary = gcd$node_summary,
                                 beta = d$posterior_summary$beta,
                                 ag_species = NULL,
                                 db = "vdjdb",
                                 db_dist = 0,
                                 chain = "both")
beta_violins$violins
[[1]]

Compare \(\beta\)s of clones specific for CMV, EBV, flu or MLANA?

beta_violins <- get_beta_violins(node_summary = gcd$node_summary,
                                 beta = d$posterior_summary$beta,
                                 ag_species = c("CMV", "EBV", "Influenza"),
                                 ag_genes = c("MLANA"),
                                 db = "vdjdb",
                                 db_dist = 0,
                                 chain = "both")
patchwork::wrap_plots(beta_violins$violins, ncol = 2)

Compare \(\beta\) between repertoires with get_beta_scatterplot

beta_scatterplots <- get_beta_scatterplot(node_summary = gcd$node_summary,
                                          beta = d$posterior_summary$beta,
                                          ag_species = c("CMV"),
                                          db = "vdjdb",
                                          db_dist = 0,
                                          chain = "both")
patchwork::wrap_plots(beta_scatterplots$scatterplots[[1]], ncol = 3)

Posterior predictive checks

Before we can start interpreting the model results, we have to make sure that the model is valid. One standard approach is to check whether our model can retrodict the observed data (community occupancy matrix) which was used to fit model parameters.

General idea of posterior predictive checks:

  1. fit model based on data \(y\)
  2. simulate new data \(\hat{y}\)
  3. compare \(y\) and \(\hat{y}\)

ClustIRR provides \(y\) and \(\hat{y}\) of each repertoire, which we can visualize with ggplot:

ggplot(data = d$posterior_summary$y_hat)+
  facet_wrap(facets = ~sample, nrow = 1, scales = "free")+
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", col = "gray")+
  geom_errorbar(aes(x = y_obs, y = mean, ymin = L95, ymax = H95),
                col = "darkgray", width=0)+
  geom_point(aes(x = y_obs, y = mean), size = 0.8)+
  xlab(label = "observed y")+
  ylab(label = "predicted y (and 95% HDI)")