High-throughput immune repertoire analysis with IGoR

Quentin Marcou et al. Nat Commun. 2018.

Abstract

High-throughput immune repertoire sequencing promises to lead to new statistical diagnostic tools for medicine and biology. Successful implementation of these methods requires correct characterization, analysis, and interpretation of these data sets. We present IGoR (Inference and Generation Of Repertoires), a comprehensive tool that takes B or T cell receptor sequence reads and quantitatively characterizes the statistics of receptor generation from both cDNA and gDNA. It probabilistically annotates sequences, and its modular structure can be used to investigate models of increasing biological complexity for different organisms. For B cells, IGoR returns the hypermutation statistics, which we use to reveal co-localization of hypermutations along the sequence. We demonstrate that IGoR outperforms existing tools in accuracy and estimate the sample sizes needed for reliable repertoire characterization.

Conflict of interest statement

The authors declare no competing financial interests.

Figures

Fig. 1

IGoR’s pipeline for sequence analysis. (a) V(D)J recombination proceeds by joining randomly selected segments (V, D, and J segments in the case of TRB and IGH). Each segment gets trimmed at its ends (hashed areas), and a varying number of non-templated insertions are added between them (orange). Hypermutations (in the case of B cells) or sequencing errors (in red) further enhance diversity. IGoR lists putative recombination scenarios consistent with the observed sequence, and weighs them according to their likelihood. (b) The likelihood of each scenario is computed using a Bayesian network of dependencies between the recombination features (V, D, J segment choices, insertions, and deletions), as illustrated here for the human TRB locus. Architectures for TRA and IGH are described in Methods. (c) IGoR’s pipeline includes three modes. In the learning mode, IGoR learns recombination statistics from data sequences. In the analysis mode, IGoR outputs detailed recombination scenario statistics for each sequence. In the generation mode, IGoR produces synthetic sequences with specified recombination statistics.
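
The scenario weighting described in panel b can be sketched in Python as follows. This is a toy illustration and not IGoR's implementation: the factorization, the probability tables, and every name (`P_V`, `scenario_prob`, `posterior_weights`) are hypothetical, and a realistic model would also include the D gene, the remaining insertion and deletion features, and an error model.

```python
# Toy illustration only: a hypothetical reduced model with made-up numbers.
# A scenario's probability is a product of conditional probabilities,
# one factor per recombination feature, following the network's dependencies.

P_V = {"V1": 0.7, "V2": 0.3}                        # P(V)
P_J_GIVEN_V = {"V1": {"J1": 0.6, "J2": 0.4},        # P(J | V)
               "V2": {"J1": 0.2, "J2": 0.8}}
P_DELV_GIVEN_V = {"V1": {0: 0.5, 1: 0.5},           # P(delV | V)
                  "V2": {0: 0.9, 1: 0.1}}
P_INS = {0: 0.25, 1: 0.5, 2: 0.25}                  # P(number of insertions)


def scenario_prob(v, j, del_v, n_ins):
    """Probability of one scenario under the toy factorization above."""
    return P_V[v] * P_J_GIVEN_V[v][j] * P_DELV_GIVEN_V[v][del_v] * P_INS[n_ins]


def posterior_weights(scenarios):
    """Normalize scenario probabilities over the candidate list.

    Assumes every candidate is already consistent with the observed read.
    """
    probs = [scenario_prob(*s) for s in scenarios]
    total = sum(probs)
    return [p / total for p in probs]


candidates = [("V1", "J1", 0, 1), ("V1", "J1", 1, 2), ("V2", "J1", 0, 1)]
weights = posterior_weights(candidates)
```

Normalizing over the candidate list gives each scenario a weight that can be used to rank scenarios or to average any recombination feature over them.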

Fig. 2

IGoR infers reproducible recombination statistics between individuals. (a) Distribution of the number of insertions at the junctions of recombined genes: IGH at the VD and DJ junctions from DNA data, TRB at the VD and DJ junctions from both DNA and mRNA data, and TRA at the VJ insertion site from mRNA data. The insertion profile is assumed to be universal for all genes and the distributions are also reproducible between TRA and TRB. (b, c) Average distribution over all genes of the number of deletions across (b) V and (c) J genes. The gene-by-gene distributions of the most frequent genes are reported in Supplementary Fig. 5. Negative deletions correspond to palindromic insertions (P-nucleotides), e.g., −2 means 2 P-nucleotides. The inferred distributions are robust to the choice of individuals, genetic material (mRNA or DNA), and sequencing technology. Error bars show 1 standard deviation across individuals.
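
The convention that negative deletion counts stand for P-nucleotides can be illustrated with a short Python sketch. The helper `process_v_end` and the example sequence are hypothetical. The sketch assumes that P-nucleotides at the 3' end of a V segment are the reverse complement of the segment's last nucleotides, which is what makes the junction palindromic.

```python
# Hypothetical helper illustrating the signed-deletion convention
# for the 3' end of a V segment.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def process_v_end(segment, n_del):
    """Apply a signed deletion count to the 3' end of `segment`.

    n_del > 0: trim n_del nucleotides.
    n_del < 0: append |n_del| P-nucleotides (reverse complement of the
               last |n_del| nucleotides, giving a palindromic junction).
    n_del == 0: leave the segment unchanged.
    """
    if n_del > 0:
        return segment[:-n_del]
    if n_del < 0:
        n_p = -n_del
        return segment + reverse_complement(segment[-n_p:])
    return segment


trimmed = process_v_end("TGTGCCAGCA", 2)    # two nucleotides removed
extended = process_v_end("TGTGCCAGCA", -2)  # two P-nucleotides added
```

With this convention a single integer axis covers both trimming and palindromic extension, which is how the deletion distributions in panels b and c can include negative values.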

Fig. 3

Validation on synthetic data. Short synthetic reads of recombined TRB sequences were generated with known recombination statistics, and given to IGoR as input to reinfer these statistics. Inference with 10⁵ TRB sequences and a typical sequencing error rate of 10⁻³ gives excellent agreement for (a) gene usage and insertion statistics and (b) deletion statistics (Pearson’s r for deletions is calculated on the joint statistics of gene usage and deletion number; cross size scales with gene usage). (c) Discrepancy between true and inferred values of the recombination statistics for TRB, measured by the Kullback–Leibler divergence, as a function of the number of unique sequences in the sample, and decomposed according to the features of the recombination scenario. (d) Same as (c), for increasing rates of sequencing errors.
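
As a reminder of how the discrepancy in panel c is defined, here is a minimal Python sketch of the Kullback–Leibler divergence between a true and an inferred discrete distribution. The direction (true relative to inferred), the use of bits, and the function name `kl_divergence` are assumptions, since the caption does not specify them.

```python
import math


def kl_divergence(p_true, p_inferred):
    """D_KL(p_true || p_inferred) in bits.

    Both arguments are dicts mapping outcomes (e.g. gene names or
    insertion counts) to probabilities. Outcomes with zero true
    probability contribute nothing; an outcome that is possible under
    p_true but impossible under p_inferred gives an infinite divergence.
    """
    total = 0.0
    for outcome, p in p_true.items():
        if p == 0.0:
            continue
        q = p_inferred.get(outcome, 0.0)
        if q == 0.0:
            return math.inf
        total += p * math.log2(p / q)
    return total


# Made-up insertion-count distributions, for illustration only.
true_ins = {0: 0.5, 1: 0.5}
inferred_ins = {0: 0.25, 1: 0.75}
discrepancy = kl_divergence(true_ins, inferred_ins)
```

The divergence is zero only when the two distributions agree, so it shrinks toward zero as the inferred statistics converge to the true ones.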

Fig. 4

Probabilistic analysis of putative recombination scenarios and comparison to existing methods. Synthetic 130-bp reads of recombined hypermutation-free IGH sequences and 60-bp reads of TRB sequences were generated with a 5 × 10⁻³ error rate, and processed for analysis by IGoR and two existing methods, MiXCR and Partis. IGoR ranks putative scenarios by descending order of likelihood. (a) Distribution of the rank of the true scenario as called by IGoR for both TRB and IGH. Note that the best-ranked (maximum-likelihood) scenario is the correct one in less than 30% of cases. (b) Distribution of the number of scenarios that need to be enumerated (from most to least likely) to include the true scenario with 50% (blue), 75% (green), 90% (red), or 95% (cyan) confidence for IGH (see Supplementary Fig. 10 for the equivalent figure for TRB). (c) Frequency with which IGoR, MiXCR, and Partis call the correct scenario of recombination as the most likely one (“scenario”) in hypermutation-free IGH, as well as each separate feature of the scenario (“V gene,” etc.). “Failed” corresponds to sequences for which the algorithm did not output an assignment. (d) Usage frequency of the TRB D gene conditioned on the J gene, inferred by IGoR and MiXCR (Partis does not handle TCR sequences). IGoR recovers the physiological exclusion between D2 and J1, while MiXCR does not.
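
One way to read the quantity in panel b is as the smallest number of top-ranked scenarios whose cumulative posterior probability reaches the chosen confidence level. The Python sketch below implements that reading. The function name `scenarios_needed` and the example posteriors are hypothetical, and this interpretation is an assumption rather than the authors' stated procedure.

```python
def scenarios_needed(posteriors, confidence):
    """Smallest number of top-ranked scenarios whose cumulative
    posterior probability reaches `confidence`.

    `posteriors` are the normalized scenario weights for one sequence,
    in any order. Returns None if the listed scenarios never reach the
    requested confidence (e.g. because the list was truncated).
    """
    cumulative = 0.0
    for rank, p in enumerate(sorted(posteriors, reverse=True), start=1):
        cumulative += p
        if cumulative >= confidence:
            return rank
    return None


# Made-up posterior weights for a single read, for illustration only.
example_posteriors = [0.1, 0.4, 0.2, 0.3]
n_for_half = scenarios_needed(example_posteriors, 0.5)
```

Collecting this count over many reads gives a distribution of the kind shown in the panel, one curve per confidence level.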

Fig. 5

Hypermutation landscape. (a) Position weight matrix (PWM) model for predicting hypermutation hotspots in IGH. Each nucleotide σ at position i within ±m of the hypermutation site (in red) has an additive contribution e_i(σ) to the hypermutation log odds (Eq. (3)). The PWM is learned by expectation-maximization from the out-of-frame sequences of memory B cells. (b) Comparison between the observed mutation rate per nucleotide and its prediction by the PWM model, as a function of position along the V segment, for the four most frequent V genes. Pearson correlation coefficient ρ and gene usage are given for each. (c) PWMs inferred from the V, D, and J genes. (d) Distribution of the number of mutations in each sequence. Data sequences have a broader distribution than predicted by the model (as computed from generating synthetic sequences and mutations with a data-inferred 7-mer PWM model). (e) Spatial co-localization index g(r), measuring the overrepresentation of pairs of hypermutations at genomic distance r from each other. Synthetic sequences have g(r) ≈ 1 by construction (green).
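
The additive log-odds model of panel a can be sketched in Python as follows. The baseline odds `mu`, the function name `mutation_probability`, and the numerical values are assumptions made for illustration. Eq. (3) is not reproduced on this page, so the exact form used by the authors may differ.

```python
import math


def mutation_probability(context, pwm, mu):
    """Mutation probability of the central site of `context`.

    context: string of length 2m + 1 centred on the candidate site.
    pwm: list of 2m + 1 dicts, pwm[k][nucleotide] = additive log-odds
         contribution e_i(sigma) of that nucleotide at that position.
    mu: baseline odds of mutation (an assumption of this sketch).
    """
    if len(context) != len(pwm):
        raise ValueError("context length must match PWM width")
    log_odds = math.log(mu) + sum(e_i[nt] for e_i, nt in zip(pwm, context))
    # Convert log odds to a probability with the logistic function.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up 3-mer PWM (m = 1): only the upstream position carries signal.
flat = {"A": 0.0, "C": 0.0, "G": 0.0, "T": 0.0}
upstream = {"A": 1.0, "C": -1.0, "G": 0.0, "T": 0.0}
toy_pwm = [upstream, flat, flat]

p_hot = mutation_probability("AGC", toy_pwm, mu=0.01)
p_cold = mutation_probability("CGC", toy_pwm, mu=0.01)
```

Because the contributions add in log-odds space, each position in the window multiplies the odds of mutation independently of the others.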
