scPipe: a flexible data preprocessing pipeline for 3’ end scRNA-seq data (original) (raw)

Fastq reformatting

The pipeline starts with fastq file reformatting. We move the barcode and UMI sequences to the read name and leave the transcript sequence as is. This outputs a read name that looks like @[barcode_sequence]*[UMI_sequence]#[readname] … The read structure of read 2 in our example dataset has the 8 bp long cell barcode starting at position 6 and the 6 bp long UMI sequence starting at the first position. So the read structure will be : list(bs1=-1, bl1=0, bs2=6, bl2=8, us=0, ul=6). bs1=-1, bl1=0 means we don’t have an index in read 1 so we set a negative value as its start position and give it zero length. bs2=6, bl2=8 means we have an index in read two which starts at position 6 in the read and is 8 bases in length. us=0, ul=6 means we have a UMI at the start of read 2 which is 6 bases long. NOTE: we use a zero based index system, so the indexing of the sequence starts at zero.

sc_trim_barcode(file.path(data_dir, "combined.fastq.gz"),
                fq_R1,
                fq_R2,
                read_structure = list(bs1=-1, bl1=0, bs2=6, bl2=8, us=0, ul=6))

## trimming fastq file...
## pass QC: 46009
## removed_have_N: 0
## removed_low_qual: 0
## time elapsed: 96 milliseconds

Aligning reads to a reference genome

Next we align reads to the genome. This example uses Rsubread but any aligner that support RNA-seq alignment and gives standard BAM output can be used here. The Rsubread package is not available on Windows, so the following alignment step will only run on Mac OSX or Linux.

Rsubread::buildindex(basename=file.path(data_dir, "ERCC_index"), reference=ERCCfa_fn)

## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.22.0
## 
## //================================= setting ==================================\\
## ||                                                                            ||
## ||                Index name : ERCC_index                                     ||
## ||               Index space : base space                                     ||
## ||               Index split : no-split                                       ||
## ||          Repeat threshold : 100 repeats                                    ||
## ||              Gapped index : no                                             ||
## ||                                                                            ||
## ||       Free / total memory : 49.1GB / 125.4GB                               ||
## ||                                                                            ||
## ||               Input files : 1 file in total                                ||
## ||                             o ERCC92.fa                                    ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Check the integrity of provided reference sequences ...                    ||
## || There were 1 notes for reference sequences.                                ||
## || The notes can be found in the log file, '/tmp/RtmpWaJH3t/ERCC_index.log'.  ||
## || Scan uninformative subreads in reference sequences ...                     ||
## || 1 uninformative subreads were found.                                       ||
## || These subreads were excluded from index building.                          ||
## || Estimate the index size...                                                 ||
## ||    8%,   0 mins elapsed, rate=121.4k bps/s                                 ||
## ||   49%,   0 mins elapsed, rate=703.8k bps/s                                 ||
## ||   66%,   0 mins elapsed, rate=925.7k bps/s                                 ||
## || 3.0 GB of memory is needed for index building.                             ||
## || Build the index...                                                         ||
## ||    8%,   0 mins elapsed, rate=12.6k bps/s                                  ||
## ||   49%,   0 mins elapsed, rate=75.3k bps/s                                  ||
## ||   66%,   0 mins elapsed, rate=100.0k bps/s                                 ||
## || Save current index block...                                                ||
## ||  [ 0.0% finished ]                                                         ||
## ||  [ 10.0% finished ]                                                        ||
## ||  [ 20.0% finished ]                                                        ||
## ||  [ 30.0% finished ]                                                        ||
## ||  [ 40.0% finished ]                                                        ||
## ||  [ 50.0% finished ]                                                        ||
## ||  [ 60.0% finished ]                                                        ||
## ||  [ 70.0% finished ]                                                        ||
## ||  [ 80.0% finished ]                                                        ||
## ||  [ 90.0% finished ]                                                        ||
## ||  [ 100.0% finished ]                                                       ||
## ||                                                                            ||
## ||                      Total running time: 0.2 minutes.                      ||
## ||          Index /tmp/RtmpWaJH3t/ERCC_index was successfully built.          ||
## ||                                                                            ||
## \\============================================================================//

Rsubread::align(index=file.path(data_dir, "ERCC_index"),
  readfile1=file.path(data_dir, "combined.fastq.gz"),
  output_file=file.path(data_dir, "out.aln.bam"), phredOffset=64)

## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.22.0
## 
## //================================= setting ==================================\\
## ||                                                                            ||
## || Function      : Read alignment (RNA-Seq)                                   ||
## || Input file    : combined.fastq.gz                                          ||
## || Output file   : out.aln.bam (BAM)                                          ||
## || Index name    : ERCC_index                                                 ||
## ||                                                                            ||
## ||                    ------------------------------------                    ||
## ||                                                                            ||
## ||                               Threads : 1                                  ||
## ||                          Phred offset : 64                                 ||
## ||                             Min votes : 3 / 10                             ||
## ||                        Max mismatches : 3                                  ||
## ||                      Max indel length : 5                                  ||
## ||            Report multi-mapping reads : yes                                ||
## || Max alignments per multi-mapping read : 1                                  ||
## ||                                                                            ||
## \\============================================================================//
## 
## //=============== Running (15-Apr-2025 20:21:44, pid=2589009) ================\\
## ||                                                                            ||
## || Check the input reads.                                                     ||
## || The input file contains base space reads.                                  ||
## || Initialise the memory objects.                                             ||
## || Estimate the mean read length.                                             ||
## || The range of Phred scores observed in the data is [1,1]                    ||
## || Create the output BAM file.                                                ||
## || Check the index.                                                           ||
## || Init the voting space.                                                     ||
## || Global environment is initialised.                                         ||
## || Load the 1-th index block...                                               ||
## || The index block has been loaded.                                           ||
## || Start read mapping in chunk.                                               ||
## ||   23% completed, 0.6 mins elapsed, rate=123.7k reads per second            ||
## ||   43% completed, 0.6 mins elapsed, rate=148.9k reads per second            ||
## ||   62% completed, 0.6 mins elapsed, rate=162.6k reads per second            ||
## ||   82% completed, 0.6 mins elapsed, rate=169.1k reads per second            ||
## ||   72% completed, 0.6 mins elapsed, rate=0.6k reads per second              ||
## ||   78% completed, 0.6 mins elapsed, rate=0.7k reads per second              ||
## ||   83% completed, 0.6 mins elapsed, rate=0.7k reads per second              ||
## ||   88% completed, 0.6 mins elapsed, rate=0.8k reads per second              ||
## ||   92% completed, 0.6 mins elapsed, rate=0.8k reads per second              ||
## ||   97% completed, 0.6 mins elapsed, rate=0.9k reads per second              ||
## ||  102% completed, 0.6 mins elapsed, rate=0.9k reads per second              ||
## ||  107% completed, 0.6 mins elapsed, rate=1.0k reads per second              ||
## ||  113% completed, 0.6 mins elapsed, rate=1.0k reads per second              ||
## ||                                                                            ||
## ||                           Completed successfully.                          ||
## ||                                                                            ||
## \\====================================    ====================================//
## 
## //================================   Summary =================================\\
## ||                                                                            ||
## ||                 Total reads : 46,009                                       ||
## ||                      Mapped : 46,009 (100.0%)                              ||
## ||             Uniquely mapped : 46,009                                       ||
## ||               Multi-mapping : 0                                            ||
## ||                                                                            ||
## ||                    Unmapped : 0                                            ||
## ||                                                                            ||
## ||                      Indels : 0                                            ||
## ||                                                                            ||
## ||                Running time : 0.6 minutes                                  ||
## ||                                                                            ||
## \\============================================================================//

##                       out.aln.bam
## Total_reads                 46009
## Mapped_reads                46009
## Uniquely_mapped_reads       46009
## Multi_mapping_reads             0
## Unmapped_reads                  0
## Indels                          0

Assigning reads to annotated exons

After the read alignment, we assign reads to exons based on the annotation provided using the sc_exon_mapping function. Currently scPipe only supports annotation in gff3 or bed format.

sc_exon_mapping(file.path(data_dir, "out.aln.bam"),
            file.path(data_dir, "out.map.bam"),
            ERCCanno_fn)

## adding annotation files...
## time elapsed: 0 milliseconds
## 
## annotating exon features...
## updating progress every 3 minutes...
## 46009 reads processed, 46k reads/sec
## number of read processed: 46009
## unique map to exon: 46009 (100.00%)
## ambiguous map to multiple exon: 0 (0.00%)
## map to intron: 0 (0.00%)
## not mapped: 0 (0.00%)
## unaligned: 0 (0.00%)

De-multiplexing data and counting genes

Next we use the sc_demultiplex function to split the reads into separate .csv files for each cell in the /count subfolder. Each file contains three columns and each row corresponds to a distinct read. The first column contains the gene ID that the read maps to, the second column gives the UMI sequence for that read and the third column is the distance to the transcript end position (TES) in bp. This file will be used for UMI deduplication and to generate a gene count matrix by calling the sc_gene_counting function.

sc_demultiplex(file.path(data_dir, "out.map.bam"), data_dir, barcode_annotation_fn, has_UMI=FALSE)

## demultiplexing reads by barcode...
## Warning: mitochondrial chromosome not found using chromosome name `MT`.
## time elapsed: 92 milliseconds

sc_gene_counting(data_dir, barcode_annotation_fn)

## summarising gene counts...
## time elapsed: 63 milliseconds

We have now completed the preprocessing steps. The gene count matrix is available as a .csv file in data_dir/gene_count.csv and quality control statistics are saved in the data_dir/stat folder. These data are useful for later quality control (QC).