1. Implementation
ReporTree modular organization
ReporTree is implemented in Python 3.8 and comprises six scripts, each also available in standalone mode, that are orchestrated by reportree.py:
- alignment_processing.py (only run when an alignment is provided)
This script transforms a multiple sequence alignment into a SNP matrix that can be used as input for clustering. Only samples that fulfil the user-specified metadata filters are included in the SNP matrix. Non-informative alignment positions, and positions that lack information in more than a user-defined percentage of samples, are removed from the matrix. Afterwards, samples that do not reach a minimum percentage of ATCG content are also discarded. The name of each column (position) of the final matrix can optionally correspond to either the alignment coordinates or the reference coordinates.
- partitioning_grapetree.py (only run when a SNP/allele matrix, an alignment, or a VCF/list of mutations is provided)
This script reconstructs the minimum spanning tree of a user-provided cg/wgMLST allele matrix using a modified version of GrapeTree, and cuts this tree at all (or any user-specified) thresholds. The resulting genetic clusters are reported in a single matrix file (partitions.tsv).
- partitioning_HC.py (only run when a SNP/allele matrix, an alignment, a VCF/list of mutations, or a pairwise distance matrix is provided)
This script uses cgmlst-dists to calculate pairwise Hamming distances, performs hierarchical clustering (several linkage options are available: single, complete, average, weighted, centroid, median, ward) and cuts the dendrogram at all (or any user-specified) thresholds. The resulting genetic clusters are reported in a single matrix file (partitions.tsv).
- partitioning_treecluster.py (only run when a Newick tree is provided)
This script uses TreeCluster to cut a Newick tree with each of the user-specified clustering methods and their respective thresholds. It then reports all the resulting genetic clusters in a partitions matrix file (partitions.tsv).
- comparing_partitions_v2.py (only run when the user requests “stability regions”)
This is a modified and automated version of comparing_partitions.py that analyzes the cluster congruence between subsequent partitions of a given clustering method, using the Adjusted Wallace coefficient (Carriço et al. 2006; Severiano et al. 2011), and identifies the stability regions based on a previously described approach (Llarena et al. 2018; Barker et al. 2018). ReporTree then includes the minimum threshold of each “stability region” in the summary report.
- metadata_report.py
This script takes a metadata table (.tsv) as input and provides a separate summary report for each variable specified by the user. If a partitions matrix file is also provided, the script adds the genetic clusters to a new metadata file (metadata_w_partitions.tsv) and creates a summary report of the metadata associated with each genetic cluster. Frequency and/or count matrices can also be produced.
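The filtering logic described for alignment_processing.py can be illustrated with a minimal sketch. It is not the ReporTree code: the function name, the parameter names (`site_inclusion`, `min_atcg`) and the choice to measure ATCG content over the retained positions are assumptions made for illustration, and columns are named after 1-based alignment coordinates.

```python
VALID = set("ACGT")


def alignment_to_snp_matrix(alignment, site_inclusion=0.9, min_atcg=0.9):
    """Illustrative sketch: alignment is a dict of sample -> aligned sequence."""
    samples = list(alignment)
    length = len(next(iter(alignment.values())))

    # Step 1: keep positions that are variable among called bases (informative)
    # and that are called (A/C/G/T) in at least `site_inclusion` of the samples.
    kept = []
    for pos in range(length):
        column = [alignment[s][pos].upper() for s in samples]
        called = [base for base in column if base in VALID]
        informative = len(set(called)) > 1
        covered = len(called) / len(samples) >= site_inclusion
        if informative and covered:
            kept.append(pos)

    # Step 2: discard samples whose ATCG content over the kept positions
    # falls below `min_atcg` (assumption: content is measured after step 1).
    matrix = {}
    for s in samples:
        row = [alignment[s][p].upper() for p in kept]
        atcg = sum(base in VALID for base in row) / len(row) if row else 0.0
        if atcg >= min_atcg:
            matrix[s] = row

    columns = [p + 1 for p in kept]  # 1-based alignment coordinates
    return columns, matrix


example_alignment = {
    "s1": "ACGTA",
    "s2": "ACCTA",
    "s3": "ANCTG",
    "s4": "NNNNN",
}
columns, snp_matrix = alignment_to_snp_matrix(
    example_alignment, site_inclusion=0.75, min_atcg=0.5
)
```

In this toy alignment only positions 3 and 5 are both variable and sufficiently covered, and sample s4 is dropped because it has no called bases at those positions.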
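The idea behind partitioning_HC.py (pairwise Hamming distances, followed by cutting a hierarchical clustering at every threshold) can be sketched for the single-linkage case, where a cut at threshold t is equivalent to joining every pair of samples whose distance is at most t. The sketch below is a pure-Python illustration under stated assumptions, not the ReporTree implementation: it does not call cgmlst-dists, it assumes that a missing allele is coded as "0" and ignored in the comparison, that a pair is linked when its distance is less than or equal to the threshold, and the cluster labels are hypothetical.

```python
from itertools import combinations


def hamming(a, b, missing="0"):
    """Number of loci with different alleles, skipping loci missing in either profile."""
    return sum(1 for x, y in zip(a, b) if x != y and missing not in (x, y))


def single_linkage_partitions(profiles, thresholds=None):
    """Return {threshold: {sample: cluster_label}} for single-linkage cuts."""
    samples = sorted(profiles)
    dists = {
        (s, t): hamming(profiles[s], profiles[t])
        for s, t in combinations(samples, 2)
    }
    if thresholds is None:
        # "All thresholds": every integer from 0 to the largest observed distance.
        thresholds = range(max(dists.values(), default=0) + 1)

    partitions = {}
    for thr in thresholds:
        parent = {s: s for s in samples}

        def find(s):
            # Union-find lookup with path halving.
            while parent[s] != s:
                parent[s] = parent[parent[s]]
                s = parent[s]
            return s

        # Merge every pair of samples at distance <= thr.
        for (s, t), d in dists.items():
            if d <= thr:
                parent[find(s)] = find(t)

        labels = {}
        partitions[thr] = {
            s: labels.setdefault(find(s), f"cluster_{len(labels) + 1}")
            for s in samples
        }
    return partitions


example_profiles = {
    "A": ["1", "1", "1", "1"],
    "B": ["1", "1", "1", "2"],
    "C": ["1", "1", "3", "3"],
    "D": ["5", "5", "5", "5"],
}
partitions = single_linkage_partitions(example_profiles)
```

Each key of `partitions` corresponds to one column of a partitions matrix: at threshold 0 every sample is a singleton, A and B merge at threshold 1, C joins them at threshold 2, and all four samples form one cluster at threshold 4. The other linkage methods listed above require a full agglomerative procedure and are not covered by this shortcut.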
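The congruence measure used by comparing_partitions_v2.py can also be written down compactly. The sketch below computes the Adjusted Wallace coefficient of partition A with respect to partition B as (W - Wi) / (1 - Wi), where W is the Wallace coefficient and Wi is its expected value under independence, taken as one minus Simpson's index of diversity of B. The function name and the handling of undefined cases (returning None) are assumptions made for illustration; the direction in which ReporTree compares subsequent partitions and its criteria for calling a stability region are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def adjusted_wallace(part_a, part_b):
    """Adjusted Wallace coefficient A -> B for two {sample: cluster} partitions."""
    samples = sorted(part_a)
    n = len(samples)

    together_both = 0    # pairs co-clustered in A and in B
    together_a_only = 0  # pairs co-clustered in A but split in B
    for s, t in combinations(samples, 2):
        if part_a[s] == part_a[t]:
            if part_b[s] == part_b[t]:
                together_both += 1
            else:
                together_a_only += 1

    if together_both + together_a_only == 0:
        return None  # undefined: A has no co-clustered pairs

    wallace = together_both / (together_both + together_a_only)

    # Expected Wallace under independence: 1 - Simpson's index of diversity of B.
    sizes = Counter(part_b[s] for s in samples)
    expected = sum(k * (k - 1) for k in sizes.values()) / (n * (n - 1))
    if expected == 1:
        return None  # undefined: B places all samples in one cluster

    return (wallace - expected) / (1 - expected)


part_x = {"s1": 1, "s2": 1, "s3": 2, "s4": 2}
part_y = {"s1": 1, "s2": 2, "s3": 1, "s4": 2}
aw_same = adjusted_wallace(part_x, part_x)
aw_diff = adjusted_wallace(part_x, part_y)
```

Identical partitions give a coefficient of 1, whereas the two discordant partitions above give a negative value, i.e. less agreement than expected by chance.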