scIGD: Typing and quantification of immune gene alleles in single-cell and spatial transcriptomic data

Biological background and motivation

Immune molecules such as B and T cell receptors, human leukocyte antigens (HLAs) or killer Ig-like receptors (KIRs) are encoded in the most genetically diverse loci of the human genome. Many of these immune genes are hyperpolymorphic, showing high allelic diversity across human populations. In addition, typical immune molecules are polygenic: multiple functionally similar genes encode the same protein subunit.

However, the integrative single-cell methods commonly used to analyze immune cells in large patient cohorts do not account for this diversity. This leads to erroneous quantification of important immune mediators and impaired inter-donor comparability, which ultimately obscures the immunological information contained in the data.

In response to these challenges, we introduce scIGD. This Snakemake workflow not only automates HLA allele-typing processes, but also enables allele-specific quantification from single-cell RNA-sequencing (scRNA-seq) data.

scIGD workflow

Automated genotyping and allele-specific quantification for immune genes from scRNA-seq data

scIGD (single-cell ImmunoGenomic Diversity) is a Snakemake workflow designed to automate and streamline the genotyping process for HLA genes and to enable allele-specific quantification from scRNA-seq data using donor-specific references.

The workflow is organized into three distinct stages, each addressing specific objectives (Figure 1):

  1. Demultiplexing multiplexed scRNA-seq datasets:
    • The initial stage demultiplexes scRNA-seq datasets that contain reads from multiple donors or samples; it is skipped if the input data consist of only a single donor. The primary goal of this step is to generate donor-specific FASTQ files, which are essential for subsequent genotyping and allele-specific quantification.
  2. Allele-typing procedures on immune genes:
    • Following demultiplexing, the workflow executes allele-typing procedures on HLAs. This process utilizes arcasHLA, which extracts reads mapped to chromosome 6 (housing the HLA genes), and uses the IMGT/HLA database and an Expectation-Maximization (EM) model to determine the most representative HLA alleles.
    • The output for each gene will consist of one or two alleles that best match the data.
  3. Quantification of genes and typed alleles:
    • The final stage involves the quantification of genes and typed alleles. kallisto, an alignment-free tool for rapid transcript quantification, is run against a donor-specific reference built from the alleles typed in the previous stage (see the illustrative commands after this list).
    • The resulting output is a count matrix, which provides expression levels for both genes and specifically typed alleles, enabling downstream analysis and exploration of gene and allele expression patterns.
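
To make the stages concrete, the commands below sketch the kind of calls the Snakefile issues internally. This is an illustration only: file names, thread counts and the 10xv3 technology string are placeholders, and the actual rules run automatically with parameters taken from config.yaml.

# Stage 2: HLA allele-typing with arcasHLA (illustrative invocation)
arcasHLA extract sample.bam -o typing/ --paired -t 8
arcasHLA genotype typing/sample.extracted.1.fq.gz typing/sample.extracted.2.fq.gz -o typing/ -t 8

# Stage 3: alignment-free quantification with kallisto against the donor-specific reference
kallisto index -i donor_ref.idx donor_reference.fa
kallisto bus -i donor_ref.idx -x 10xv3 -o quant/ sample_R1.fastq.gz sample_R2.fastq.gz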

This workflow is designed to support both 10x and BD Rhapsody data, encompassing amplicon/targeted sequencing as well as whole-transcriptome-based data, providing flexibility to users working with different experimental setups.

To maximize the analytical potential of the results, we have created an R/Bioconductor package, SingleCellAlleleExperiment. This package provides a comprehensive multi-layer data structure, enabling the representation of HLA genes at various levels, including alleles, genes, and functional classes.


Figure 1: Overview of the scIGD workflow for unraveling immunogenomic diversity in single-cell data, highlighting the integration of the SingleCellAlleleExperiment package for comprehensive data analysis.

Installation

Preparing a working directory

First, create a new directory scIGD at a place you can easily remember and change into that directory in your terminal:
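
mkdir scIGD
cd scIGD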

Then, download the workflow:

curl -L https://api.github.com/repos/AGImkeller/scIGD/tarball -o scIGD.tar.gz

Next, extract the data. On Linux, run:

tar --wildcards -xf scIGD.tar.gz --strip 1 "*/data" "*/demo" "*/scripts" "*/Snakefile" "*/config.yaml" "*/environment.yaml" && chmod +x ./scripts/*.sh

On macOS, run:

tar -xf scIGD.tar.gz --strip 1 "*/data" "*/demo" "*/scripts" "*/Snakefile" "*/config.yaml" "*/environment.yaml" && chmod +x ./scripts/*.sh

This will create three files: Snakefile, config.yaml and environment.yaml. In addition, three folders will be created: data, demo and scripts (Figure 2).


Figure 2: Overview of the folder structure.

Creating an environment with the required software

Assuming that you have a 64-bit system, on Linux, download and install Mambaforge with:

curl -L https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh -o Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh

If you are asked the question

Do you wish the installer to prepend the install location to PATH ...? [yes|no]

answer with yes. Along with a minimal Python 3 environment, Mambaforge contains the package manager Mamba. After closing your current terminal and opening a new terminal, you can use the new conda command to install software packages and create isolated environments to, for example, use different versions of the same package. We will now utilize this to create an isolated environment with all the required software for this workflow.

First, make sure to activate the conda base environment with:
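
conda activate base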

The environment.yaml file that you have obtained with the previous step (Preparing a working directory) can be used to install all required software into an isolated Conda environment with the name scIGD via:

mamba env create --name scIGD --file environment.yaml

Written in C++, Mamba is a faster and more robust reimplementation of Conda.

Once the environment has been created, activate it by executing:
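
conda activate scIGD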

and deactivate it by executing:
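
conda deactivate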

(but don't do this if you want to run the workflow now)

Input data

Upon preparation of the working directory, two essential folders were generated: data and scripts (refer to Figure 2).

The scripts directory houses essential scripts integral to the workflow's execution, requiring no user intervention.

On the other hand, the data directory holds metadata required by the workflow. Within its meta subfolder, two files are found. Both files are retrieved from the IMGT/HLA database and are necessary for arcasHLA allele-typing:

In addition, users are required to supply the following files. Some example files can be found in the demo folder:

Configuration

The configuration for this workflow is designed to offer flexibility and adaptability to different user requirements. Users can customize parameters in a dedicated configuration file (config.yaml) to suit their specific use cases and control various aspects of the workflow. Examples for populating these parameters are provided in the config.yaml file. Please ensure all paths defined in this configuration file are either absolute or relative to the location of the Snakefile!

Below are the essential parameters available for configuration:
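
As a purely hypothetical illustration (the key names below are invented for this sketch and are not the authoritative schema; consult the config.yaml shipped with the workflow), a populated configuration might look like:

# hypothetical sketch only -- key names are illustrative, not the real schema
seq_platform: 10x                    # or BD Rhapsody
seq_type: amplicon                   # or whole-transcriptome
multiplexed: true                    # whether to run the demultiplexing stage
fastq_r1: data/sample_R1.fastq.gz    # paths absolute or relative to the Snakefile
fastq_r2: data/sample_R2.fastq.gz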

Running the workflow

Begin by confirming that you are in the directory containing the Snakefile. Once confirmed, you only need to execute the following command:

snakemake --resources mem_gb=<your_memory_allocation> --cores <your_core_count>

Replace <your_memory_allocation> with the desired memory allocation value, and <your_core_count> with the desired number of CPU cores for parallel execution of the workflow. This flexibility allows you to tailor the resource allocation based on your system's capabilities and requirements. We recommend allocating a minimum of 12 GB of memory and utilizing 12 CPU cores for optimal performance.
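
For example, with the recommended minimum resources:

snakemake --resources mem_gb=12 --cores 12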

Additionally, it's worth noting that using amplicon-based data is significantly faster than whole-transcriptome-based data, as the latter requires the generation of a BAM file, which is computationally expensive.

This workflow has been rigorously tested. Here, we report two distinct scenarios:

  1. Amplicon-based data:
    • 8 donors, 280 million reads in 1 FASTQ file
    • Completed in 80 minutes utilizing 8 CPU cores and 8 GB of memory
  2. Whole-transcriptome-based data:
    • 1 donor, 50 million reads in 1 FASTQ file
    • Completed in ~3 hours utilizing 32 CPU cores and 40 GB of memory

When dealing with whole-transcriptome-based data, it is advisable to use the tool on a cluster. A Snakefile that was executed on a SLURM cluster can be found in the demo folder. It was run with the following parameters: nodes=1, ntasks=1, cpus-per-task=40 and mem-per-cpu=2000.
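
As a sketch only (the conda installation path and the mem_gb value are assumptions, not part of the workflow), an sbatch wrapper using those parameters could look like:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH --mem-per-cpu=2000

# make conda available inside the batch job; adjust the path to your installation
source ~/mambaforge/etc/profile.d/conda.sh
conda activate scIGD

# 40 CPUs x 2000 MB per CPU = 80 GB available to the workflow
snakemake --resources mem_gb=80 --cores 40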

Output

As a product of kallisto, the resulting output comprises a count matrix (cells_x_genes.mtx), a feature list encompassing genes and typed alleles (cells_x_genes.genes.txt), and a barcode list (cells_x_genes.barcodes.txt). The matrix serves as a source for downstream analysis and exploration, capturing the expression levels of genes and specifically typed alleles.

In addition, the output includes a lookup table (lookup_table_HLA.csv) to facilitate the creation of the relevant additional data layers during object generation for analysis.

Example datasets and outputs are available in our data package hosted on Bioconductor's ExperimentHub: scaeData.

To facilitate the analysis of this output, we offer a structured data representation in the form of an R/Bioconductor package named SingleCellAlleleExperiment. For detailed instructions on utilizing this package, please refer to its documentation.

References

Tools and packages

Data resources

Citation

To be added.

Authors