Probing T-cell response by sequence-based probabilistic modeling - PubMed (original) (raw)

Probing T-cell response by sequence-based probabilistic modeling

Barbara Bravi et al. PLoS Comput Biol. 2021.

Abstract

With the increasing ability to use high-throughput next-generation sequencing to quantify the diversity of the human T cell receptor (TCR) repertoire, the ability to use TCR sequences to infer antigen-specificity could greatly aid potential diagnostics and therapeutics. Here, we use a machine-learning approach known as Restricted Boltzmann Machine to develop a sequence-based inference approach to identify antigen-specific TCRs. Our approach combines probabilistic models of TCR sequences with clone abundance information to extract TCR sequence motifs central to an antigen-specific response. We use this model to identify patient personalized TCR motifs that respond to individual tumor and infectious disease antigens, and to accurately discriminate specific from non-specific responses. Furthermore, the hidden structure of the model results in an interpretable representation space where TCRs responding to the same antigen cluster, correctly discriminating the response of TCR to different viral epitopes. The model can be used to identify condition specific responding TCRs. We focus on the examples of TCRs reactive to candidate neoantigens and selected epitopes in experiments of stimulated TCR clone expansion.

PubMed Disclaimer

Conflict of interest statement

I have read the journal’s policy and the authors of this manuscript have the following competing interests: B.G. has received honoraria for speaking engagements from Merck, Bristol-Meyers Squibb, and Chugai Pharmaceuticals, has received research funding from Bristol-Meyers Squibb, and has been a compensated consultant for Darwin health, PMV Pharma and Rome Therapeutics of which he is a cofounder.

Figures

Fig 1

Fig 1. Schematic of the approach.

A: Response to the antigen occurs via clonal expansion, where responding T cells proliferate by cell reproduction. B: Structure of count-weighted training datasets. Repertoire statistics before (left) and after (right) stimulation can be illustrated in terms of logos that convey information about position dependent residue single-site frequency, and correlation matrices. C: Distribution of CDR3 lengths without alignment (lower _x_-axis). The number of gaps inserted in the sequence center by the alignment (upper _x_-axis) is equal to the difference between the alignment length (N σ = 19) and the CDR3 original length. D: The RBM model graphical architecture. The ratio of the RBM model scores before and after stimulation highlights sequence motifs that are characteristic to antigen stimulation (dark red boxes). Data shown for stimulation by NA of patient Pt3 PBMC from Ref [16].

Fig 2

Fig 2. Inferred model parameters enable feature extraction.

A: Using count-weighted datasets, the RBM infers the probability P21NA(σ) of detecting a CDR3 sequence σ with high abundance at day 21 post stimulation by NA. P21NA(σ) is parametrized in terms of local biases g acting on CDR3 sites (left column) and weights connecting each CDR3 site to a hidden unit, effectively coupling the CDR3 sites (right column). Inset: example of weights entering hidden unit 8, w8. The logo representation of exp(g) normalized (left column) shows that single-site biases, by capturing the frequency of amino acids at each CDR3 site, reflect mainly the preferential usage of certain residues during the receptor generation process (e.g. Cysteine at the beginning of the CDR3, Phenylalanine at the end position, blue boxes). For a given hidden unit, there is a set of weights for each site, with different values for each amino acid appearing at that site. These values can be either positive or negative and the height of the letter reflects the weight’s magnitude for the respective amino acid. Symbols formula image give weight values for the gap. Weights carry information mainly about experiment-specific enrichment in sequence motifs (e.g. VVV or WSA at positions 3–5, dark red box). B: The probability P21NA(σ) inferred by SONIA is expressed in terms of selection or _q_-factors, one set for the left CDR3 alignment (qL) and one set for the right CDR3 alignment (qR), here represented as sequence logos. Inset: sequence logos at day 21 post-stimulation and for the baseline distribution _P_gen with left and right alignment. In contrast to RBM biases, _q_-factors quantify only the enrichment (dark red box) with respect to the TRB generation amino acid usage preferences described by _P_gen (blue box in the inset). The gaps appearing in the RBM motifs have been removed from the logo representations in all figures for clarity of presentation. The dataset used is the same as in Fig 1 (stimulation by NA of Pt3 PBMC from [16]).

Fig 3

Fig 3. Model scores of T-cell clonal expansion.

A: The RBM calculated response scores Srespp for each TRB clone in each antigen-stimulation condition p. The correlation between the Srespp and clone fold change is poor when considering all clones, as low-count clones are a source of noise. A significant correlation is recovered by progressively filtering out the low-abundance sequences from the testing set, shown on the example of stimulation by p = NA of Pt3 PBMC. For both RBM and SONIA (shown on the right), the correlation is robust to the choice of different background distributions: _P_0 (the probability learned on the same patient’s dataset at day 0), _P_gen (the probability of generating a given CDR3, described by OLGA [21]) and _P_post (the post-thymic selection CDR3 distribution, sampled here from a default human TRB model available in SONIA [15]). The small differences between the 3 curves are mainly due to sampling, see also Materials and methods. B: The correlation coefficient between the RBM Srespp and clone fold change for each antigen-stimulation experiment p (color-coded by antigen) when retaining the 125 most abundant clones (the golden dot in panel A). Data from [16] correspond to p = NA, CR, WT, MUC16, MUC16WT (Pt1, Pt2), p = NA, CR (Pt3, Pt5, Pt7), p = NA, CR, WT (Pt4), P = NA1, CR1, NA2, CR2 (Pt6), see also Table 1. C: Scatter plot comparing Srespp from the RBM and SONIA.

Fig 4

Fig 4. Model scores of T-cell response specificity.

A: Histogram of RBM specificity scores Sspecp showing their ability to discriminate specific responders to p = NA (assigned positive values) from unspecific ones (assigned negative values), for two examples: of less specific (Pt7 samples, top), and more specific expansion (Pt3 samples, bottom). ROC curves give the fraction of specific responders to p = NA (clones specifically expanded upon NA stimulation) predicted by SspecNA against the fraction of unspecific responders (clones responsive to the alternative stimulation, here CR) when varying the threshold of discrimination. The Area Under the ROC curve (AUROC) is taken as quantitative measure of specificity of expansion. B: Ref. [16] identified 78 clones that expanded under the stimulation by both NA and CR in Pt3. x and _y_-axis give their log fold change in the NA and CR experiments. The color code gives their specificity score SspecNA in absolute value (note that |SspecNA|=|SspecCR|). High |SspecNA| is assigned only to clones whose expansion was more significant in one experiment than in the other, while clones that are significantly expanded in both experiments (highly cross-reactive) are assigned |SspecNA| closer to zero. These clones do not contribute to discriminating the two repertoires. In A-B, SspecNA and log fold change are given in log base 10. C: ROC for Pt7 and Pt3 samples obtained by RBM-LR and SONIA, which take input training data in the Left+Right encoding. D: AUROC calculated from the RBM for all datasets under consideration from [16]. The dashed line gives the expected value (AUROC = 0.5) when comparing statistically indistinguishable samples (the “control”, gray bar), calculated from two replicates of Pt7 sample at day 0. E: Comparison of the AUROC obtained by the RBM, the RBM-LR model and SONIA for all datasets in D.

Fig 5

Fig 5. RBM-based dimensionality reduction of TRB response.

A: A lower-dimensional representation of RBM predicted TRB sequences of Pt3 responding to the NA antigen 21 days post-stimulation (same model as in Figs 1 and 2). We rank CDR3 sequences by likelihood and look at the top ones (8 clones). We select 2 sets of RBM weights displaying well-defined patterns of amino acid usage at the beginning and at the end of the CDR3 (respectively **w**8 and **w**12). The projection of top-likelihood clones onto these weights define the inputs to the corresponding hidden units (h8 and h12) in two dimensions. Depending on sequence patterns (see colored amino acids), responding clones have either a positive or negative projection onto the selected weights and end up occupying different portions of the space of inputs to hidden units. The color code (log base 10 of fold change) highlights that high model likelihood reflects high fold change. B: Same data representation by the RBM as in A, with the 12 top-likelihood clones for the Pt5 CR assay.

Fig 6

Fig 6. RBM retrieves TRB sequence motifs reported in Dash et al. 2017 [17] and Glanville et al. 2017 [22].

A: Sequence logos of aligned CDR3 of each human TCR repertoire specific to CMV, EBV and Flu epitopes based on the tetramer sorted data from Ref. [17]. Sequence motifs identified in Ref. [17] are marked by colored boxes. The sequence motifs visualized here from aligned CDR3 are similar but not exactly the same as the ones in Ref. [17]. B: We learned an RBM on the full list of CDR3 sequences from the three repertoires in A. The learned weights show the same sequence motifs as in A and the inputs to the corresponding hidden units identify two groups of clones: one specific to the M158 epitope (magenta dots) and the other to the BMLF1280 epitope (blue dots). C-D: Same representation as in A-B for the tetramer sorted data from Ref. [22] and the RBM trained on them. For the sake of comparison we have limited the logos in C to the same epitopes as in A. The pp65495 repertoires (orange dots both in B and D) is the most heterogeneous (see Fig 7A) and the characteristic sequence motifs reported in [17, 22] are not easily visible in A-C (see instead S7A–S7C Fig). In both cases, the RBM is able to learn a set of weights reflecting such motifs, see S7A–S7D Fig. E: Same as D, but where T-cell clones belonging to 4 of the 35 clusters found by GLIPH in [22] are circled in color. These clones are well separated in RBM space. The RBM representation space captures the similarity of sequences that have the same epitope specificity but are placed into different clusters by GLIPH (blue- and green-circled clusters).

Fig 7

Fig 7. Diversity among responding clones and model’s predictive power.

A: The sequence dissimilarity index (a modified TCRdiv diversity index, see Materials and methods) calculated for the pool of expanded clones in the Pt1,…,Pt7 assays from [16]. The dissimilarity index values lie in a range comparable only to the most diverse epitope-specific human repertoire of tetramer-sorted TCRs from Dash et al. [17] and Glanville et al. [22] (black bars), see also Fig 6. B: Probabilistic scores from the RBM (trained on the M1-specific repertoire from [17]) evaluated for a testing set of M1-specific CDR3 and generic CDR3 randomly drawn from the day 0 of Pt1,…,Pt7 samples. The model’s ability to predict M1-specific clones is quantified by the high AUROC value (C). D: AUROCs for models trained on lower-diversity repertoires (tetramer-sorted TCR data from [17, 22], black bar) are in average higher than for the Pt1,…,Pt7 datasets (gray bar). Bar plots give the average AUROC and its standard deviation. AUROC values are estimated by a leave-one-out 5-fold validation protocol (see Materials and methods). E: AUROC for the tetramer-sorted and Pt1,…,Pt7 datasets is inversely correlated with the sequence dissimilarity index, Pearson correlation r of magnitude |r| = 0.95, _p_-value for testing non-correlation = 2.53 × 10−14 (|r| = 0.87 and _p_-value = 1.9 × 10−7 considering only Pt1,…,Pt7 datasets). D-E show that the predictive power of sequence-based models is degraded for high-diversity repertoires. We discarded from the AUROC test of B-E the pools of responding clones consisting of fewer than 100 sequences (Pt6 NA1, BMLF1280 and pp65495 from [17], NP44 and pp65417 from [22]).

Fig 8

Fig 8. Comparison of the methods.

A: Schematic summary of the characteristics of the three sequence-based probabilistic modeling approaches used: RBM, RBM-LR, SONIA. B-C: Summary of the results we obtained for the samples from [16] using the approaches in A, in particular the correlation between the model’s response score Srespp and TRB clone fold change when retaining the 125 most abundant clones (B) and the AUROC-based measure of response specificity (C). Both these measures highlight a different degree of specific response to the stimulation, that overall stays high across the samples apart from Pt7, where response to the peptides tested was rather unspecific. For each sample, the results from the three methods are well in agreement (see also Figs 3C and 4E). High values of the AUROC and of the Srespp-fold change correlation reflect also how well the probability inferred by each method reproduces the clone abundance (see S3 Fig, Materials and methods), thus we report the averages over all samples (last column) for a comparison of the methods’ performance. This comparison shows that the general performance is higher for RBM-based approaches than for SONIA and is the same for RBM and RBM-LR.

Similar articles

Cited by

References

    1. Topalian SL, Drake CG, Pardoll DM. Immune Checkpoint Blockade: A Common Denominator Approach to Cancer Therapy. Cancer Cell. 2015;27(4):450–461. doi: 10.1016/j.ccell.2015.03.001 - DOI - PMC - PubMed
    1. Yarchoan M, Johnson BA, Lutz ER, Laheru DA, Jaffee EM. Targeting Neoantigens to Augment Antitumour Immunity. Nat Rev Cancer. 2017;17(4):209–222. doi: 10.1038/nrc.2016.154 - DOI - PMC - PubMed
    1. Wells DK, van Buuren MM, Dang KK, Hubbard-Lucey VM, Sheehan KCF, Campbell KM, et al.. Key Parameters of Tumor Epitope Immunogenicity Revealed Through a Consortium Approach Improve Neoantigen Prediction. Cell. 2020;183(3):818–834.e13. doi: 10.1016/j.cell.2020.09.015 - DOI - PMC - PubMed
    1. Dudley ME, Rosenberg SA. Adoptive-Cell-Transfer Therapy for the treatment of patients with cancer. Nat Rev Cancer. 2003;3(9):666–675. doi: 10.1038/nrc1167 - DOI - PMC - PubMed
    1. Song I, Gil A, Mishra R, Ghersi D, Selin LK, Stern LJ. Broad TCR Repertoire and Diverse Structural Solutions for Recognition of an Immunodominant CD8+ T Cell Epitope. Nat Struct Mol Biol. 2017;24(4):395–406. doi: 10.1038/nsmb.3383 - DOI - PMC - PubMed

Publication types

MeSH terms

Substances

LinkOut - more resources