High-throughput immune repertoire analysis with IGoR
Quentin Marcou et al. Nat Commun. 2018.
Abstract
High-throughput immune repertoire sequencing promises to lead to new statistical diagnostic tools for medicine and biology. Successful implementations of these methods require a correct characterization, analysis, and interpretation of these data sets. We present IGoR (Inference and Generation Of Repertoires), a comprehensive tool that takes B or T cell receptor sequence reads and quantitatively characterizes the statistics of receptor generation from both cDNA and gDNA. It probabilistically annotates sequences, and its modular structure can be used to investigate models of increasing biological complexity for different organisms. For B cells, IGoR returns the hypermutation statistics, which we use to reveal co-localization of hypermutations along the sequence. We demonstrate that IGoR outperforms existing tools in accuracy and estimate the sample sizes needed for reliable repertoire characterization.
Conflict of interest statement
The authors declare no competing financial interests.
Figures
Fig. 1
IGoR’s pipeline for sequence analysis. a V(D)J recombination proceeds by joining randomly selected segments (V, D, and J segments in the case of TRB and IGH). Each segment gets trimmed at its ends (hashed areas), and a varying number of non-templated insertions are added between them (orange). Hypermutations (in the case of B cells) or sequencing errors (in red) further enhance diversity. IGoR lists putative recombination scenarios consistent with the observed sequence, and weighs them according to their likelihood. b The likelihood of each scenario is computed using a Bayesian network of dependencies between the recombination features (V, D, J segment choices, insertions, and deletions), as illustrated here for the human TRB locus. Architectures for TRA and IGH are described in Methods. c IGoR’s pipeline includes three modes. In the learning mode, IGoR learns recombination statistics from data sequences. In the analysis mode, IGoR outputs detailed recombination scenario statistics for each sequence. In the generation mode, IGoR produces synthetic sequences with specified recombination statistics
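As a rough illustration of how putative scenarios can be weighed by likelihood, the sketch below normalizes per-scenario log-likelihoods into posterior weights and ranks them. It is a minimal sketch, not IGoR's implementation: the function name `scenario_posteriors`, the scenario labels, and the input format are all assumptions made for this example.

```python
import math

def scenario_posteriors(log_likelihoods):
    """Turn per-scenario log-likelihoods into normalized weights.

    log_likelihoods: dict mapping a scenario label (hypothetical) to the
    log-likelihood of that scenario given the observed read.
    Returns a list of (label, weight) pairs sorted from most to least likely.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_likelihoods.values())
    unnormalized = {s: math.exp(ll - top) for s, ll in log_likelihoods.items()}
    total = sum(unnormalized.values())
    ranked = [(s, w / total) for s, w in unnormalized.items()]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

Each weight can be read as the probability of that scenario relative to the enumerated alternatives, so the weights sum to 1 over the scenarios supplied.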
Fig. 2
IGoR infers reproducible recombination statistics between individuals. a Distribution of the number of insertions at the junctions of recombined genes: IGH at the VD and DJ junctions from DNA data, TRB at the VD and DJ junctions from both DNA and mRNA data, and TRA at the VJ insertion site from mRNA data. The insertion profile is assumed to be universal for all genes and the distributions are also reproducible between TRA and TRB. b, c Average distribution over all genes of the number of deletions across b V and c J genes. The gene-by-gene distributions of the most frequent genes are reported in Supplementary Fig. 5. Negative deletions correspond to palindromic insertions (P-nucleotides), e.g., −2 means 2 P-nucleotides. The inferred distributions are robust to the choice of individuals, genetic material (mRNA or DNA), and sequencing technology. Error bars show 1 standard deviation across individuals
Fig. 3
Validation on synthetic data. Short synthetic reads of recombined TRB sequences were generated with known recombination statistics, and given to IGoR as input to reinfer these statistics. Inference with 10^5 TRB sequences and a typical sequencing error rate of 10^−3 gives excellent agreement for a gene usage and insertion statistics and b deletion statistics (Pearson’s r for deletions is calculated on the joint statistics of gene usage and deletion number; cross size scales with gene usage). c Discrepancy between true and inferred values of the recombination statistics for TRB, measured by the Kullback–Leibler divergence, as a function of the number of unique sequences in the sample, and decomposed according to the features of the recombination scenario. d Same as c, for increasing rates of sequencing errors
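The Kullback–Leibler divergence used to compare true and inferred statistics can be sketched for a single discrete feature as follows. This is a minimal sketch under stated assumptions: the function name `kl_divergence`, the dictionary representation of the distributions, and the choice of base-2 logarithms (bits) are illustrative and not taken from the paper.

```python
import math

def kl_divergence(p_true, p_inferred, base=2.0):
    """D_KL(p_true || p_inferred) for two discrete distributions.

    Both arguments are dicts mapping an outcome (e.g. a gene name or a
    deletion count) to its probability. Outcomes missing from p_inferred
    are treated as having probability zero.
    """
    total = 0.0
    for outcome, p in p_true.items():
        if p == 0.0:
            continue  # terms with p = 0 contribute nothing
        q = p_inferred.get(outcome, 0.0)
        if q == 0.0:
            return math.inf  # true outcome the inferred model rules out
        total += p * math.log(p / q, base)
    return total
```

A value of 0 means the inferred distribution matches the true one exactly; larger values indicate a larger discrepancy.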
Fig. 4
Probabilistic analysis of putative recombination scenarios and comparison to existing methods. Synthetic 130-bp reads of recombined hypermutation-free IGH sequences and 60-bp reads of TRB sequences were generated with a 5 × 10^−3 error rate, and processed for analysis by IGoR and two existing methods, MiXCR and Partis. IGoR ranks putative scenarios by descending order of likelihood. a Distribution of the rank of the true scenario as called by IGoR for both TRB and IGH. Note that the best-ranked (maximum-likelihood) scenario is the correct one in less than 30% of cases. b Distribution of the number of scenarios that need to be enumerated (from most to least likely) to include the true scenario with 50% (blue), 75% (green), 90% (red), or 95% (cyan) confidence for IGH (see Supplementary Fig. 10 for equivalent figure for TRB). c Frequency with which IGoR, MiXCR, and Partis call the correct scenario of recombination as the most likely one (“scenario”) in hypermutation-free IGH, as well as each separate feature of the scenario (“V gene,” etc.). “Failed” corresponds to sequences for which the algorithm did not output an assignment. d Usage frequency of TRB D gene conditioned on the J gene, inferred by IGoR and MiXCR (Partis does not handle TCR sequences). IGoR recovers the physiological exclusion between D2 and J1, while MiXCR does not
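The quantity in panel b can be illustrated with a short sketch: given the normalized weights of the scenarios for one sequence, count how many must be enumerated, from most to least likely, before their cumulative weight reaches a chosen confidence level. The function name `scenarios_needed` and the example weights are hypothetical, and reading "confidence" as cumulative posterior weight is an assumption of this sketch.

```python
def scenarios_needed(posteriors, confidence):
    """Smallest number of top-ranked scenarios whose weights sum to at
    least `confidence`.

    posteriors: iterable of normalized scenario weights for one sequence.
    """
    ranked = sorted(posteriors, reverse=True)
    cumulative = 0.0
    for n, weight in enumerate(ranked, start=1):
        cumulative += weight
        if cumulative >= confidence:
            return n
    # Confidence not reached (e.g. truncated enumeration): all are needed.
    return len(ranked)

# Made-up weights for illustration only.
example_weights = [0.25, 0.25, 0.2, 0.15, 0.1, 0.05]
```

With weights this flat, the top-ranked scenario carries only a quarter of the probability, which is the kind of situation in which reporting a single best assignment discards most of the information.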
Fig. 5
Hypermutation landscape. a Position weight matrix (PWM) model for predicting hypermutation hotspots in IGH. Each nucleotide σ at position i within ±m of the hypermutation site (in red) has an additive contribution e_i(σ) to the hypermutation log odds (Eq. (3)). The PWM is learned by expectation-maximization from the out-of-frame sequences of memory B cells. b Comparison between the observed mutation rate per nucleotide and its prediction by the PWM model, as a function of position along the V segment, for the four most frequent V genes. Pearson correlation coefficient ρ and gene usage are given for each. c PWMs inferred from the V, D, and J genes. d Distribution of the number of mutations in each sequence. Data sequences have a broader distribution than predicted by the model (as computed from generating synthetic sequences and mutations with a data-inferred 7-mer PWM model). e Spatial co-localization index g(r), measuring the overrepresentation of pairs of hypermutations at genomic distance r from each other. Synthetic sequences have g(r) ≈ 1 by construction (green)
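The additive PWM model of panel a can be sketched as below: the log odds of mutation at a site is the sum of per-position contributions from the surrounding nucleotides, converted to a probability with the logistic function. This is a sketch under assumptions, since Eq. (3) is not reproduced here: the function name `mutation_probability`, the list-of-dicts layout of the PWM, and the baseline term `log_mu` are all illustrative choices.

```python
import math

def mutation_probability(context, pwm, log_mu=0.0):
    """Mutation probability at the central site of `context`.

    context: string of 2m+1 nucleotides centred on the candidate site.
    pwm: list of 2m+1 dicts; pwm[i][nt] is the additive contribution of
         nucleotide nt at position i to the log odds.
    log_mu: assumed baseline log odds, independent of the context.
    """
    if len(context) != len(pwm):
        raise ValueError("context length must match the PWM width")
    log_odds = log_mu + sum(pwm[i][nt] for i, nt in enumerate(context))
    # Logistic transform from log odds to probability.
    return 1.0 / (1.0 + math.exp(-log_odds))
```

A 7-mer model, as mentioned in panel d, corresponds to m = 3, i.e. a `context` of seven nucleotides and a `pwm` with seven entries.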
Similar articles
- Immune repertoire analysis is gaining interest as a clinical NGS application.
  Brunstein J. MLO Med Lab Obs. 2016 Oct;48(10):26-27. PMID: 30047649 No abstract available.
- Inferring processes underlying B-cell repertoire diversity.
  Elhanati Y, Sethna Z, Marcou Q, Callan CG Jr, Mora T, Walczak AM. Philos Trans R Soc Lond B Biol Sci. 2015 Sep 5;370(1676):20140243. doi: 10.1098/rstb.2014.0243. PMID: 26194757 Free PMC article.
- T and B Cell Receptor Immune Repertoire Analysis using Next-generation Sequencing.
  Werner L, Dor C, Salamon N, Nagar M, Shouval DS. J Vis Exp. 2021 Jan 12;(167). doi: 10.3791/61792. PMID: 33522509
- Analyzing Immunoglobulin Repertoires.
  Chaudhary N, Wesemann DR. Front Immunol. 2018 Mar 14;9:462. doi: 10.3389/fimmu.2018.00462. eCollection 2018. PMID: 29593723 Free PMC article. Review.
- Immune repertoire: A potential biomarker and therapeutic for hepatocellular carcinoma.
  Han Y, Li H, Guan Y, Huang J. Cancer Lett. 2016 Sep 1;379(2):206-12. doi: 10.1016/j.canlet.2015.06.022. Epub 2015 Jul 15. PMID: 26188280 Review.
Cited by
- Learning antibody sequence constraints from allelic inclusion.
  Jagota M, Hsu C, Mazumder T, Sung K, DeWitt WS, Listgarten J, Matsen FA 4th, Ye CJ, Song YS. bioRxiv [Preprint]. 2024 Oct 25:2024.10.22.619760. doi: 10.1101/2024.10.22.619760. PMID: 39484623 Free PMC article. Preprint.
- Statistical analysis of repertoire data demonstrates the influence of microhomology in V(D)J recombination.
  Russell ML, Trofimov A, Bradley P, Matsen FA 4th. bioRxiv [Preprint]. 2024 Oct 18:2024.10.16.618753. doi: 10.1101/2024.10.16.618753. PMID: 39464162 Free PMC article. Preprint.
- Combining mutation and recombination statistics to infer clonal families in antibody repertoires.
  Spisak N, Athènes G, Dupic T, Mora T, Walczak AM. Elife. 2024 Aug 9;13:e86181. doi: 10.7554/eLife.86181. PMID: 39120133 Free PMC article.
- Bioinformatics tools and resources for cancer and application.
  Huang J, Mao L, Lei Q, Guo AY. Chin Med J (Engl). 2024 Sep 5;137(17):2052-2064. doi: 10.1097/CM9.0000000000003254. Epub 2024 Jul 30. PMID: 39075637 Free PMC article. Review.
- Benchmarking and integrating human B-cell receptor genomic and antibody proteomic profiling.
  Lê Quý K, Chernigovskaya M, Stensland M, Singh S, Leem J, Revale S, Yadin DA, Nice FL, Povall C, Minns DH, Galson JD, Nyman TA, Snapkow I, Greiff V. NPJ Syst Biol Appl. 2024 Jul 12;10(1):73. doi: 10.1038/s41540-024-00402-z. PMID: 38997321 Free PMC article.