GitHub - Genomon-Project/GenomonSV: structure detection program

Genomon SV

License: GPL v3

Introduction

Genomon SV is software for detecting somatic structural variations from cancer genome sequencing data. Its characteristics include, but are not limited to:

  1. Uses both breakpoint-containing junction read pairs and improperly aligned read pairs to maximize sensitivity
  2. Implements various types of filters (e.g., use of non-matched normal control panels) for higher accuracy
  3. Detects short tandem duplications and mid-range deletions (10 bp to 300 bp) as well as larger structural variations such as translocations

Dependency

Python

Python (>= 3.6), pysam (>= 0.8.1), numpy, scipy packages

Software

htslib, edlib, parasail. You can use blat instead of edlib or parasail.

Install

For the last command, you may need to add --user if using a shared computing cluster.

Preparation

Before using GenomonSV, complete the following preparation steps.

Install required software and add it to the PATH

GenomonSV calls tabix and bgzip (both part of the HTSlib project) and edlib internally, and assumes that they are installed and that their paths are already available in the running environment. Please install them and add them to the PATH.

Prepare bam files

Genomon SV accepts only BAM files aligned by bwa mem with the -T0 option. We do not guarantee the results for other cases. We also assume that the sequencing data are paired-end; all single-end reads are ignored by the program.
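As a rough illustration of the alignment requirement above, the following sketch assembles a bwa mem command line for a paired-end sample with the minimum output score set to 0. The helper name build_bwa_mem_command, the file names and the thread count are placeholders and not part of GenomonSV; converting, sorting and indexing the resulting alignments is left out.

```python
import shlex


def build_bwa_mem_command(reference, fastq1, fastq2, threads=4):
    """Return the argument list for a paired-end bwa mem run with -T 0."""
    return [
        "bwa", "mem",
        "-T", "0",            # keep low-scoring alignments (bwa mem drops scores < 30 by default)
        "-t", str(threads),   # number of threads
        reference,
        fastq1,               # first mates
        fastq2,               # second mates
    ]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = build_bwa_mem_command("reference.fa", "tumor_1.fastq", "tumor_2.fastq")
    # Print a shell-safe command line instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the command rather than executing it keeps the sketch free of side effects; the list can be passed to subprocess.run once the paths are real.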

Commands

parse

Parsing breakpoint-containing and improperly aligned read pairs

GenomonSV parse [-h] [--debug]
                [--junction_abnormal_insert_size JUNCTION_ABNORMAL_INSERT_SIZE]
                [--junction_min_major_clipping_size JUNCTION_MIN_MAJOR_CLIPPING_SIZE]
                [--junction_max_minor_clipping_size JUNCTION_MAX_MINOR_CLIPPING_SIZE]
                [--junction_check_margin_size JUNCTION_CHECK_MARGIN_SIZE]
                [--improper_abnormal_insert_size IMPROPER_ABNORMAL_INSERT_SIZE]
                [--improper_min_mapping_qual IMPROPER_MIN_MAPPING_QUAL]
                [--improper_max_clipping_size IMPROPER_MAX_CLIPPING_SIZE]
                [--junction_dist_margin JUNCTION_DIST_MARGIN]
                [--junction_opposite_dist_margin_margin JUNCTION_OPPOSITE_DIST_MARGIN_MARGIN]
                [--improper_check_margin_size IMPROPER_CHECK_MARGIN_SIZE]
                [--reference_genome REFERENCE_GENOME]
                input.bam output_prefix

After successful completion, you will find possible breakpoint regions supported by breakpoint-containing read pairs ({output_prefix}/junction.clustered.bedpe.gz(.tbi)) and by improperly aligned read pairs ({output_prefix}/improper.clustered.bedpe.gz(.tbi)).
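For a quick look at the clustered output, the sketch below counts records per chromosome pair in one of the .bedpe.gz files. It relies on two assumptions that are not stated above: that the file follows the usual BEDPE column order (first chromosome in column 1, second chromosome in column 4), and that lines starting with "#" are comments. The function name count_chromosome_pairs is hypothetical.

```python
import gzip
from collections import Counter


def count_chromosome_pairs(bedpe_gz_path):
    """Count records per (chrom1, chrom2) pair in a bgzip/gzip-compressed BEDPE file."""
    counts = Counter()
    # bgzip output is gzip-compatible, so the standard gzip module can read it.
    with gzip.open(bedpe_gz_path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # not enough columns to hold both chromosomes
            # Assumption: BEDPE convention, chrom1 in column 1 and chrom2 in column 4.
            counts[(fields[0], fields[3])] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Pass the path of a clustered .bedpe.gz file as the first argument.
    if len(sys.argv) > 1:
        pairs = count_chromosome_pairs(sys.argv[1])
        for (chrom1, chrom2), n in sorted(pairs.items()):
            print(chrom1, chrom2, n, sep="\t")
```

A large share of inter-chromosomal pairs at this stage is not alarming in itself, since these are unfiltered candidate regions.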

merge

Merging non-matched control panel breakpoint-containing read pairs. This step collects germline and artifact breakpoints (e.g., black-list breakpoints) for later filtering steps, typically using several control samples. We strongly believe this step is crucial for improving the accuracy of somatic structural variation calling.

GenomonSV merge [-h] [--debug]
                [--merge_check_margin_size MERGE_CHECK_MARGIN_SIZE]
                control_info.txt merge_output_file

filt

Filtering and annotating candidate somatic structural variations.

GenomonSV filt [-h] [--matched_control_bam matched_control.bam]
                      [--non_matched_control_junction merged.junction.control.bedpe.gz]
                      [--matched_control_label MATCHED_CONTROL_LABEL]
                      [--genome_id {hg19,hg38,mm10}] [--grc] [--debug]
                      [--thread_num THREAD_NUM] [--blat]
                      [--min_junc_num MIN_JUNC_NUM]
                      [--min_sv_size MIN_SV_SIZE]
                      [--min_inversion_size MIN_INVERSION_SIZE]
                      [--control_panel_num_thres CONTROL_PANEL_NUM_THRES]
                      [--control_panel_check_margin CONTROL_PANEL_CHECK_MARGIN]
                      [--min_support_num MIN_SUPPORT_NUM]
                      [--min_mapping_qual MIN_MAPPING_QUAL]
                      [--min_overhang_size MIN_OVERHANG_SIZE]
                      [--close_check_margin CLOSE_CHECK_MARGIN]
                      [--close_check_thres CLOSE_CHECK_THRES]
                      [--max_depth MAX_DEPTH] [--search_length SEARCH_LENGTH]
                      [--search_margin SEARCH_MARGIN]
                      [--split_refernece_thres SPLIT_REFERNECE_THRES]
                      [--validate_sequence_length VALIDATE_SEQUENCE_LENGTH]
                      [--short_tandem_reapeat_thres SHORT_TANDEM_REAPEAT_THRES]
                      [--blat_option BLAT_OPTION]
                      [--min_tumor_variant_read_pair MIN_TUMOR_VARIANT_READ_PAIR]
                      [--min_tumor_allele_freq MIN_TUMOR_ALLELE_FREQ]
                      [--max_control_variant_read_pair MAX_CONTROL_VARIANT_READ_PAIR]
                      [--max_control_allele_freq MAX_CONTROL_ALLELE_FREQ]
                      [--max_fisher_pvalue MAX_FISHER_PVALUE]
                      input.bam output_prefix reference.fa
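A call following the usage above could be assembled as in this sketch. Only option names that appear in the usage text are used; the helper name build_filt_command, the file names and the example value for min_junc_num are illustrative assumptions, not recommended settings.

```python
import shlex


def build_filt_command(tumor_bam, output_prefix, reference_fasta,
                       matched_control_bam=None,
                       non_matched_control_junction=None,
                       extra_options=None):
    """Assemble a GenomonSV filt argument list from the usage shown above."""
    cmd = ["GenomonSV", "filt"]
    if matched_control_bam is not None:
        cmd += ["--matched_control_bam", matched_control_bam]
    if non_matched_control_junction is not None:
        cmd += ["--non_matched_control_junction", non_matched_control_junction]
    # extra_options maps option names (without leading dashes) to values,
    # e.g. {"min_junc_num": 2}; the values are illustrative, not recommendations.
    for name, value in sorted((extra_options or {}).items()):
        cmd += ["--" + name, str(value)]
    # Positional arguments come last, in the order given by the usage text.
    cmd += [tumor_bam, output_prefix, reference_fasta]
    return cmd


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    cmd = build_filt_command(
        "tumor.bam", "output/tumor", "reference.fa",
        matched_control_bam="normal.bam",
        non_matched_control_junction="merged.junction.control.bedpe.gz",
        extra_options={"min_junc_num": 2},
    )
    print(" ".join(shlex.quote(arg) for arg in cmd))
```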

The following options are not mandatory, but we strongly believe they are necessary for improved results.

See the help (GenomonSV filt -h) for other options. You may want to tune min_junc_num, min_support_num, min_overhang_size, max_depth, min_tumor_variant_read_pair, min_tumor_allele_freq, max_control_variant_read_pair, max_control_allele_freq and max_fisher_pvalue depending on your sequencing depth and tumor purity.
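To get a feel for how the read-pair, allele-frequency and p-value thresholds interact with depth and purity, the sketch below applies them to a single candidate. This is an interpretation inferred from the option names only: how GenomonSV actually counts read pairs and computes its Fisher p-value is not described here and may differ. The function names and the threshold values in the example are hypothetical.

```python
from math import factorial


def _choose(n, k):
    """Binomial coefficient C(n, k); 0 when k is out of range."""
    if k < 0 or k > n:
        return 0
    return factorial(n) // (factorial(k) * factorial(n - k))


def fisher_one_sided(tumor_ref, tumor_var, control_ref, control_var):
    """One-sided Fisher exact p-value for variant read pairs enriched in the tumor."""
    tumor_total = tumor_ref + tumor_var
    var_total = tumor_var + control_var
    total = tumor_total + control_ref + control_var
    denom = _choose(total, tumor_total)
    # Hypergeometric upper tail: at least tumor_var variant pairs fall in the tumor.
    upper = min(var_total, tumor_total)
    tail = sum(_choose(var_total, k) * _choose(total - var_total, tumor_total - k)
               for k in range(tumor_var, upper + 1))
    return tail / denom


def passes_thresholds(tumor_ref, tumor_var, control_ref, control_var,
                      min_tumor_variant_read_pair, min_tumor_allele_freq,
                      max_control_variant_read_pair, max_control_allele_freq,
                      max_fisher_pvalue):
    """Apply the five thresholds (assumed semantics) to one candidate."""
    tumor_af = tumor_var / max(tumor_ref + tumor_var, 1)
    control_af = control_var / max(control_ref + control_var, 1)
    pvalue = fisher_one_sided(tumor_ref, tumor_var, control_ref, control_var)
    return (tumor_var >= min_tumor_variant_read_pair
            and tumor_af >= min_tumor_allele_freq
            and control_var <= max_control_variant_read_pair
            and control_af <= max_control_allele_freq
            and pvalue <= max_fisher_pvalue)


if __name__ == "__main__":
    # Illustrative numbers: 10 of 100 tumor pairs support the variant, none in the control.
    print(passes_thresholds(90, 10, 100, 0,
                            min_tumor_variant_read_pair=3,
                            min_tumor_allele_freq=0.05,
                            max_control_variant_read_pair=1,
                            max_control_allele_freq=0.02,
                            max_fisher_pvalue=0.05))
```

Under this reading, lower tumor purity or shallower sequencing reduces both the number of supporting read pairs and the tumor allele frequency, so stringent minimums that suit a deep, pure sample can discard true events in a diluted one.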

Results