GitHub - liuwd15/scRIN: Scripts for mRNA integrity measurement.

scRIN Documentation


Introduction

This Python script measures mRNA integrity from single-cell sequencing data. The analysis is conducted at three levels: sample/cell, gene/transcript, and exon.
mRNA integrity is measured by 2 criteria, KS and TIN. KS measures the read coverage bias and TIN measures the read coverage uniformity on each gene model.
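
To make the two criteria more concrete, here is a minimal sketch that computes a TIN-style score and a KS-style distance from a list of per-position read coverage values. The TIN formula follows the entropy-based definition used by RSeQC (100 * exp(H) / k); the KS function compares cumulative coverage with a perfectly uniform profile. The function names are hypothetical, and the exact computation in scRIN.py (position sampling, sign convention for the bias) may differ.

```python
import math


def tin_score(coverage):
    """TIN-style uniformity score in [0, 100] for per-position coverage."""
    total = sum(coverage)
    if total == 0:
        return 0.0
    # Shannon entropy of the coverage distribution (zero positions add nothing).
    entropy = 0.0
    for depth in coverage:
        if depth > 0:
            p = float(depth) / total
            entropy -= p * math.log(p)
    # exp(entropy) is the "effective" number of evenly covered positions.
    return 100.0 * math.exp(entropy) / len(coverage)


def ks_distance(coverage):
    """Largest gap between cumulative coverage and a uniform profile."""
    total = sum(coverage)
    n = len(coverage)
    if total == 0:
        return 0.0
    running = 0
    largest = 0.0
    for i, depth in enumerate(coverage):
        running += depth
        gap = abs(float(running) / total - float(i + 1) / n)
        largest = max(largest, gap)
    return largest
```

Under these assumptions, evenly covered positions give a TIN close to 100 and a KS distance of 0, while coverage piled up at one end of the gene model lowers TIN and raises the KS distance.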

Requirements

Install or update the following Python packages before running this program.

python >= 2.7
matplotlib >= 2.2.0
pysam >= 0.9
RSeQC >= 2.6.4
numpy

Installation

This program is a Python script and runs on Unix-like operating systems. You can download it from GitHub.

wget https://raw.githubusercontent.com/liuwd15/scRIN/master/scRIN.py

Then run it with python.

Or you can make it executable and move it to your PATH.

chmod +x scRIN.py  
mv scRIN.py ~/bin  # "~/bin" can be replaced with any other directory in your PATH.  
scRIN.py

Usage

Required input

  1. Sorted and indexed .bam file(s). Samtools can be used to sort and index a .bam file.
 samtools sort -o example_sorted.bam example.bam  
 samtools index example_sorted.bam  
  2. Reference 12-column .bed file containing a list of gene models. Representative .bed files containing RefSeq transcripts of hg19 and mm10 are available here. For genes with multiple transcripts, only the longest transcript is included to avoid redundancy.
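
Because the .bam files must be indexed, a quick check before a long run can save time. The sketch below is not part of scRIN; it is a hypothetical helper that reports .bam files with no .bai index next to them, assuming the index is named either example_sorted.bam.bai (the samtools default) or example_sorted.bai.

```python
import os


def missing_index(bam_paths):
    """Return the .bam paths that have no .bai index file beside them."""
    missing = []
    for bam in bam_paths:
        # Accept both naming styles: "x.bam.bai" and "x.bai".
        candidates = [bam + ".bai", os.path.splitext(bam)[0] + ".bai"]
        if not any(os.path.exists(path) for path in candidates):
            missing.append(bam)
    return missing
```

Any path returned by this helper would need `samtools index` run on it first.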

The simplest usage is:

scRIN.py -r example_refseq.bed -i example_sorted.bam

Processing many .bam files together is recommended. You can pass comma-separated .bam files like this.

scRIN.py -r example_refseq.bed -i example1_sorted.bam,example2_sorted.bam,example3_sorted.bam

You can also create a text file containing the paths of all .bam files, one per line, like this.

cat samples.txt  
~/data/examples/example1/example1_sorted.bam  
~/data/examples/example2/example2_sorted.bam  
~/data/examples/example3/example3_sorted.bam

Then input the text file.

scRIN.py -r example_refseq.bed -i samples.txt
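
With many samples, the list file can be generated rather than typed. The following is a minimal sketch, assuming the layout shown above (one subdirectory per sample, each holding a `*_sorted.bam` file). The function name and its defaults are hypothetical. It writes absolute paths, which sidesteps the question of whether `~` is expanded when the list is read.

```python
import glob
import os


def write_sample_list(data_dir, list_path="samples.txt", pattern="*_sorted.bam"):
    """Write one absolute .bam path per line and return the paths written."""
    # Look one directory level below data_dir, e.g. data_dir/example1/example1_sorted.bam.
    found = sorted(glob.glob(os.path.join(data_dir, "*", pattern)))
    bam_paths = [os.path.abspath(path) for path in found]
    with open(list_path, "w") as handle:
        for path in bam_paths:
            handle.write(path + "\n")
    return bam_paths
```

For example, `write_sample_list(os.path.expanduser("~/data/examples"))` would write `samples.txt` in the current directory.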

Other options

To perform additional inter-exon analysis, use the -e option. This produces more output files (see Output).

scRIN.py -r example_refseq.bed -i example_sorted.bam -e

Transcripts with low expression are filtered out. The default threshold is average coverage (read length * mapped read number / gene model length) > 0.5; you can change it with the -d option.

scRIN.py -r example_refseq.bed -i example_sorted.bam -d 1
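
The expression filter can be written out as a small calculation. This is a sketch of the formula quoted above, not code taken from scRIN.py; the function names are hypothetical, and it assumes the gene model length is given in nucleotides.

```python
def average_coverage(read_length, mapped_reads, gene_model_length):
    """Average coverage = read length * mapped read number / gene model length."""
    return float(read_length) * mapped_reads / gene_model_length


def passes_filter(read_length, mapped_reads, gene_model_length, threshold=0.5):
    """True when the transcript's average coverage is strictly above the threshold."""
    return average_coverage(read_length, mapped_reads, gene_model_length) > threshold
```

For instance, 20 mapped reads of 100 nt on a 2000 nt gene model give an average coverage of 1.0, which passes the default threshold of 0.5 but not a threshold of 1, because the comparison is strict.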

By default, only transcripts/exons expressed in more than 3 samples are summarized in the summary_transcript.xls file (see Output); you can change this threshold with the -s option. Note that transcripts/exons are summarized only when the number of input files is greater than this threshold.

scRIN.py -r example_refseq.bed -i example_sorted.bam -s 5

If you want the rank of the TIN of each transcript across samples, use the -k option. This creates .xls files containing the TIN ranks of transcripts expressed in all samples.

scRIN.py -r example_refseq.bed -i example_sorted.bam -k

Output

Sample directory

For each sample (.bam file), the following files will be created in the same directory as the .bam file.

  1. A .metrics.xls file containing some mRNA integrity metrics of each transcript. Metrics include:
  2. A .KS_TIN.pdf file containing 3(4) scatter plots.
  3. [option -e enabled] A .exon.xls file containing metrics of each exon. Metrics include:

Current directory

The following files will be created in the current directory.

  1. A summary_sample.xls file containing the summary metrics of each sample. Metrics include:
  2. A summary_sample.pdf file containing 4 barplots.
  3. A summary_transcript.xls file containing the summary metrics of transcripts expressed in all samples. Metrics are similar to those in 1 but are calculated across samples.
  4. A summary_transcript.pdf file containing 3 scatter plots.
  5. [option -e enabled] A summary_exon.xls file containing the summary metrics of exons expressed in all samples. Metrics include:
  6. [option -k enabled] Three files, base_level_TIN_rank.xls, inter_exon_TIN_rank.xls and intra_exon_TIN_rank.xls, containing the ranks of TIN of each transcript across all samples.