Quantifying selection in immune receptor repertoires

Yuval Elhanati et al. Proc Natl Acad Sci U S A. 2014.

Abstract

The efficient recognition of pathogens by the adaptive immune system relies on the diversity of receptors displayed at the surface of immune cells. T-cell receptor diversity results from an initial random DNA editing process, called VDJ recombination, followed by functional selection of cells according to the interaction of their surface receptors with self and foreign antigenic peptides. Using high-throughput sequence data from the β-chain of human T-cell receptors, we infer factors that quantify the overall effect of selection on the elements of receptor sequence composition: the V and J gene choice and the length and amino acid composition of the variable region. We find a significant correlation between biases induced by VDJ recombination and our inferred selection factors together with a reduction of diversity during selection. Both effects suggest that natural selection acting on the recombination process has anticipated the selection pressures experienced during somatic evolution. The inferred selection factors differ little between donors or between naive and memory repertoires. The number of sequences shared between donors is well-predicted by our model, indicating a stochastic origin of such public sequences. Our approach is based on a probabilistic maximum likelihood method, which is necessary to disentangle the effects of selection from biases inherent in the recombination process.

Keywords: T cell; public repertoire; statistical inference; thymic selection.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.

Graphical representation of our method. (A) T-cell receptor β-chain sequences are formed during VDJ recombination. Sequences from this probability distribution, described by P_pre, are then selected with a factor Q defined for each sequence, resulting in the observed P_post distribution of receptor sequences. Selection is assumed to act independently on the V and J genes, the length of the CDR3 region, and each of the amino acids, a_i, therein. (B) A schematic of the fitting procedure: the parameters are set so that P_post fits the marginal frequencies of amino acids at each position, the distribution of CDR3 lengths, and VJ gene choices. Because the latter is not known unambiguously from the observed sequences, it is estimated probabilistically using the model itself in an iterative procedure.
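The factorized form of selection described in this caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary layout, the function names, the normalization constant `z`, and the placeholder gene names `V1` and `J1` are all hypothetical, and the numbers are toy values rather than inferred factors.

```python
from math import prod

def selection_factor(v_gene, j_gene, cdr3, q_len, q_vj, q_aa, z=1.0):
    """Sequence-wide factor Q as a product of independent factors.

    q_len[L]        : factor for CDR3 length L (hypothetical layout)
    q_vj[(V, J)]    : factor for the V/J gene pair
    q_aa[(i, L, a)] : factor for amino acid a at position i, given length L
    z               : normalization constant (assumed; 1.0 in this toy case)
    """
    length = len(cdr3)
    q = q_len[length] * q_vj[(v_gene, j_gene)]
    q *= prod(q_aa[(i, length, a)] for i, a in enumerate(cdr3))
    return q / z

def p_post(p_pre, q):
    """Postselection probability of a sequence: P_post = Q * P_pre."""
    return q * p_pre

# Toy parameters for a single three-residue CDR3 (not real data).
q_len = {3: 1.5}
q_vj = {("V1", "J1"): 2.0}
q_aa = {(0, 3, "C"): 1.0, (1, 3, "A"): 0.5, (2, 3, "F"): 2.0}

Q = selection_factor("V1", "J1", "CAF", q_len, q_vj, q_aa)
print(Q, p_post(0.01, Q))
```

In the fitting procedure of panel B, the entries of `q_len`, `q_vj`, and `q_aa` would be the parameters adjusted until the model's marginals match the observed ones. That iterative step is not shown here.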

Fig. 2.

Characteristics of selection. (A) CDR3 length distributions pre- and postselection and the length selection factor q_L (green). Selection makes the length distribution of CDR3 regions in the preselection repertoire more peaked for the naive and memory repertoires (overlapping). Error bars show standard deviation over nine individuals. (B) Comparison between data and the model of the connected pairwise correlation functions, which were not fitted by our model. The excellent agreement validates the inference procedure. As a control, the prediction from the preselection model (green) does not agree with the data as well. (C) Values of the inferred amino acid selection factors for each amino acid, ordered by length of the CDR3 region (ordinate) and position in the region (abscissa). (D) Values of the VJ gene selection factors.

Fig. 3.

Repertoire diversity. (A–C) Variability between repertoires. The scatter of the q_{i;L} selection factors between two sample individuals A and B for (A) naive and (B) memory repertoires, compared with that between (C) the memory and naive repertoires of the same individual, shows great similarity between them (SI Appendix, Fig. S4). (D) The entropy of the preselection repertoire (Upper) is reduced in the postselection repertoire (Lower). (E and F) Distribution of (E) VJ and (F) DJ insertions in the preselection and naive repertoires shows elimination of long insertions. Error bars show standard deviations over nine donors. The insertion distributions for the memory repertoire are the same as for the naive repertoire (scatter plots in Insets).
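The entropy reduction in panel D can be illustrated on a toy distribution. The sketch below assumes Shannon entropy in bits and a repertoire of only four hypothetical sequences. The function names and numbers are illustrative and are not taken from the paper.

```python
import math

def entropy_bits(p):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def apply_selection(p_pre, q):
    """Reweight each sequence by its factor Q and renormalize."""
    weights = [p * qi for p, qi in zip(p_pre, q)]
    total = sum(weights)
    return [w / total for w in weights]

# Toy preselection repertoire: four equally likely sequences.
p_pre = [0.25, 0.25, 0.25, 0.25]
# Toy selection factors: one sequence favored, one eliminated.
q = [2.0, 1.0, 1.0, 0.0]

p_sel = apply_selection(p_pre, q)
print(entropy_bits(p_pre))  # 2.0 bits
print(entropy_bits(p_sel))  # 1.5 bits
```

Reweighting does not lower entropy for every possible choice of factors. The reduction reported in the caption is an empirical property of the inferred factors, which this toy case only mimics.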

Fig. 4.

Probability of passing selection. (A and B) Ratio of the distributions of sequence-wide selection factors Q between the observed sequences and the preselection ensemble (red line), plotted as a function of Q for (A) naive and (B) memory repertoires. The model prediction P_post(Q)/P_pre(Q) = Q is shown in black, and the preselection and observed distributions of Q are shown in Insets. The selection ratio saturates at approximately seven, which may be interpreted as the maximum probability of being selected. Naive and memory repertoires show similar behaviors. (C) A cartoon of the effective selection landscape captured by our model (red line). Our method does not capture localized selection pressures (such as avoiding self) specific to each individual but captures general global properties.
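The model prediction quoted in this caption follows in one step from the definition of Q. In the sketch below, σ denotes a receptor sequence and δ selects the sequences whose factor equals Q. Both symbols are notational assumptions introduced here, and any normalization constant is taken to be absorbed into Q.

```latex
\[
\begin{aligned}
P_{\mathrm{post}}(Q)
  &= \sum_{\sigma} P_{\mathrm{post}}(\sigma)\,\delta\bigl(Q - Q(\sigma)\bigr)
   = \sum_{\sigma} Q(\sigma)\,P_{\mathrm{pre}}(\sigma)\,\delta\bigl(Q - Q(\sigma)\bigr) \\
  &= Q \sum_{\sigma} P_{\mathrm{pre}}(\sigma)\,\delta\bigl(Q - Q(\sigma)\bigr)
   = Q\,P_{\mathrm{pre}}(Q)
\end{aligned}
\]
```

Dividing by the preselection distribution of Q gives the black line, so a measured ratio that saturates instead of growing with Q marks where the data depart from this linear prediction.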

Fig. 5.

Correlations between the pre- and postselection repertoires. (A) A histogram of Spearman correlation coefficient (CC) values between the q_{i;L}(a) selection factors in the CDR3 region and their generation probabilities P_{i;L,pre}(a) for all i, L shows an abundance of positive correlations. (B) Heat map of the joint distribution of the preselection probability distribution P_pre and selection factors Q for each sequence shows that the two quantities are correlated. (C) Sequences in the observed selected repertoire (green line) had a higher probability to have been generated by recombination than unselected sequences (blue line). Agreement between the postselection model (red line) and data distribution (green line) is a validation of the model.
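As we read panel A, each entry of the histogram is one Spearman coefficient computed across amino acids a for a fixed position i and length L. A minimal pure-Python sketch of that rank correlation follows. The helper names and the three-point toy inputs are hypothetical, and ties are handled by average ranks, which is an assumption about the convention used.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            result[k] = mean_rank
        start = end + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy values for one (i, L): generation probabilities and selection factors.
p_gen = [0.1, 0.2, 0.7]
q_sel = [0.5, 1.0, 3.0]
print(spearman(p_gen, q_sel))
```

Repeating the call for every (i, L) pair and collecting the results would give the values that such a histogram summarizes.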

Fig. 6.

Shared sequences between individuals. (A) The mean number of shared sequences between any pair of individuals compared with the number expected by chance (model prediction) for one common model for all individuals (red crosses) and private models learned independently for each individual (blue crosses). Error bars are standard deviations from distributions over pairs. (B and C) The distribution of the number of sequences shared between (B) triplets and (C) quadruplets of individuals for the data (black histogram), compared with predictions from the common (red line) and private (blue line) models. (D) The shared sequences are most likely to be generated and selected: comparison of the P_post postselection distribution for sequences from the preselection (dotted line) and postselection repertoires (according to the model in gray and the data in black) as well as the sequences shared by at least two donors (model prediction in magenta and data in red).
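One simple way to picture the "expected by chance" number is to treat each individual's repertoire as independent draws from a common postselection distribution. The sketch below makes that assumption explicit. It is not necessarily the calculation used in the paper, and the function name, sample sizes, and two-sequence distribution are hypothetical.

```python
def expected_shared(p_post, sample_sizes):
    """Expected number of distinct sequences present in every sample.

    Assumes each sample consists of independent draws from p_post, so a
    sequence with probability p appears in a sample of size n with
    probability 1 - (1 - p) ** n.
    """
    total = 0.0
    for p_seq in p_post:
        present_in_all = 1.0
        for n in sample_sizes:
            present_in_all *= 1.0 - (1.0 - p_seq) ** n
        total += present_in_all
    return total

# Toy distribution over two sequences; one draw per individual.
p_toy = [0.5, 0.5]
print(expected_shared(p_toy, [1, 1]))     # pair of individuals
print(expected_shared(p_toy, [1, 1, 1]))  # triplet of individuals
```

The same function covers pairs, triplets, and quadruplets by changing the length of `sample_sizes`. For a real repertoire the sum over all possible sequences cannot be enumerated directly, so this form is useful only for intuition.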
