Quick Start (m6anet documentation)


Dataprep

m6Anet dataprep requires eventalign.txt from nanopolish eventalign:

nanopolish eventalign --reads reads.fastq --bam reads.sorted.bam --genome transcript.fa --scale-events --signal-index --summary /path/to/summary.txt --threads 50 > /path/to/eventalign.txt

This function segments raw fast5 signals to each position within the transcriptome, allowing m6Anet to predict modification based on the segmented signals. In order to run eventalign, users will need:

* reads.fastq: fastq file generated from basecalling the raw .fast5 files
* reads.sorted.bam: sorted bam file obtained from aligning reads.fastq to the reference transcriptome file
* transcript.fa: reference transcriptome file
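As a minimal sketch, the eventalign call above can be assembled from Python after checking that the three input files exist. The helper names build_eventalign_cmd and missing_inputs are hypothetical and are not part of m6anet or nanopolish. Only the flags shown in the command above are used.

```python
from pathlib import Path


def build_eventalign_cmd(reads, bam, genome, summary, threads=4):
    """Return the nanopolish eventalign argument list used in this guide.

    Hypothetical helper: it only mirrors the flags shown in the command above.
    """
    return [
        "nanopolish", "eventalign",
        "--reads", str(reads),
        "--bam", str(bam),
        "--genome", str(genome),
        "--scale-events",
        "--signal-index",
        "--summary", str(summary),
        "--threads", str(threads),
    ]


def missing_inputs(*paths):
    """Return the paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]
```

One way to use it: call missing_inputs("reads.fastq", "reads.sorted.bam", "transcript.fa") first, and if the result is empty, pass the list from build_eventalign_cmd to subprocess.run with stdout redirected to the eventalign.txt file.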

We have also provided a demo eventalign.txt dataset in the repository under /path/to/m6anet/m6anet/tests/data/eventalign.txt. Please see Nanopolish for more information.

After running nanopolish eventalign, we need to preprocess the segmented raw signal file using ‘m6anet dataprep’:

m6anet dataprep --eventalign /path/to/m6anet/m6anet/tests/data/eventalign.txt --out_dir /path/to/output --n_processes 4

The output files are stored in /path/to/output.

Inference

Once m6anet dataprep finishes running, we can run m6anet inference on the dataprep output:

m6anet inference --input_dir path/to/output --out_dir path/to/output --n_processes 4 --num_iterations 1000

By default, m6anet inference runs the human model trained on the HCT116 cell line. To run the Arabidopsis-based model or the HEK293T-RNA004-based model, supply the --pretrained_model argument.

For the Arabidopsis-based model:

m6anet inference --input_dir path/to/output --out_dir path/to/output --pretrained_model arabidopsis_RNA002 --n_processes 4 --num_iterations 1000

For the HEK293T-RNA004-based model:

m6anet inference --input_dir path/to/output --out_dir path/to/output --pretrained_model HEK293T_RNA004 --n_processes 4 --num_iterations 1000

m6Anet will sample 20 reads from each candidate site and average the probability of modification across several rounds of sampling, as set by the --num_iterations parameter. The output file data.indiv_proba.csv contains the probability of modification for each read.
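The sample-and-average procedure can be illustrated with a small sketch. This is not m6Anet's implementation: the site-level score here is a noisy-OR over per-read probabilities, a stand-in for the trained model, and the function names, the seed argument, and the choice to sample without replacement are all assumptions made for illustration.

```python
import random


def noisy_or(probs):
    """Stand-in site score: probability that at least one read is modified.

    m6Anet uses a trained model here; noisy-OR is only a placeholder.
    """
    p_none = 1.0
    for p in probs:
        p_none *= 1.0 - p
    return 1.0 - p_none


def site_probability(read_probs, n_sampled=20, num_iterations=1000, seed=0):
    """Average the site score over repeated random subsamples of reads."""
    if len(read_probs) < n_sampled:
        # Assumption: this sketch simply rejects sites with too few reads.
        raise ValueError("need at least n_sampled reads at this site")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_iterations):
        sample = rng.sample(read_probs, n_sampled)  # without replacement
        total += noisy_or(sample)
    return total / num_iterations
```

A larger num_iterations reduces the variance introduced by the random choice of reads, at the cost of proportionally more computation.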

The output file data.site_proba.csv contains the probability of modification at each individual position for each transcript. The output file has 6 columns.

The mod_ratio column is calculated by thresholding probability_modified from data.indiv_proba.csv using the --read_proba_threshold parameter of m6anet inference. The default threshold is 0.033379376 for the default human model HCT116_RNA002 and 0.0032978046219796 for the arabidopsis_RNA002 model. We also recommend a threshold of 0.9 on the probability_modified column in data.site_proba.csv to select m6A sites. The total run time should not exceed 10 minutes on a normal laptop.
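The two thresholds can be sketched with the Python standard library. This is an illustration only, under several assumptions: the column names transcript_id and transcript_position are assumed (only probability_modified and mod_ratio are named above), the comparison is taken as "greater than or equal", and the rows below are toy values rather than real m6Anet output.

```python
import csv
import io
from collections import defaultdict


def mod_ratio_per_site(indiv_rows, read_proba_threshold=0.033379376):
    """Fraction of reads per site whose probability meets the read threshold."""
    counts = defaultdict(lambda: [0, 0])  # site -> [n_modified, n_reads]
    for row in indiv_rows:
        # Assumed key columns; adjust to the actual CSV header.
        site = (row["transcript_id"], row["transcript_position"])
        counts[site][1] += 1
        if float(row["probability_modified"]) >= read_proba_threshold:
            counts[site][0] += 1
    return {site: n_mod / n for site, (n_mod, n) in counts.items()}


def select_sites(site_rows, site_threshold=0.9):
    """Keep site-level rows whose probability_modified meets the threshold."""
    return [r for r in site_rows
            if float(r["probability_modified"]) >= site_threshold]


# Toy per-read table (made-up values, for illustration only).
indiv_csv = """transcript_id,transcript_position,read_index,probability_modified
tx1,100,0,0.50
tx1,100,1,0.01
tx1,100,2,0.20
tx1,100,3,0.001
"""
ratios = mod_ratio_per_site(csv.DictReader(io.StringIO(indiv_csv)))

# Toy site-level table (made-up values, for illustration only).
site_csv = """transcript_id,transcript_position,probability_modified
tx1,100,0.95
tx1,250,0.12
"""
selected = select_sites(csv.DictReader(io.StringIO(site_csv)))
```

With real output, the same functions can be fed csv.DictReader objects opened on data.indiv_proba.csv and data.site_proba.csv in the inference output directory.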

m6Anet also supports pooling over multiple replicates. To do this, simply input multiple folders containing m6anet-dataprep outputs:

m6anet inference --input_dir data_folder_1 data_folder_2 ... --out_dir output_folder --n_processes 4 --num_iterations 1000