Standard PICRUSt2 output (original) (raw)

This page describes the output that you can expect from running the entire pipeline as well as from running each of the steps individually. It is for PICRUSt2-v2.6.0 onwards, but most of this applies to earlier versions also. They are separated by the step within the pipeline that they are produced. For this, it is important to understand what exactly each part of the PICRUSt2 pipeline is doing, so I have included some details on that also.

Sequence placement

Uses: runs the function place_seqs_pipeline within picrust2/place_seqs.py, either by running scripts/place_seqs.py or within scripts/picrust2_pipeline.py.

Aim: place study sequences into tree containing reference genomes based on alignment to multiple sequence alignment (MSA) of reference sequences.

Note that these steps will be run for both bacteria and archaea by default.

Input: fasta file containing representative nucleotide sequences for study.

Key output:

You can see details of the intermediate files created below, but arc.tre and bac.tre are the only files that are needed to continue with the PICRUSt2 pipeline. Note that if you have run each of the commands separately, you could have named this file something different, but the default is for them to be named arc.tre and bac.tre.

Other intermediate output:

Note that these are described below in Output obtained from each step.

Output obtained from each step:

  1. Run hmmalign to align study sequences with reference MSA - this is necessary for EPA-NG and is used as a check for both EPA-NG and SEPP pipelines to exclude sequences that align very little
    Output: intermediate/place_seqs_[arc/bac]/query_align.stockholm - multiple sequence alignment in stockholm format
  2. Filter this file to include only sequences that align to the reference (default >0.8 alignment)
  3. EPA-NG: Write fasta files of reference and study sequences that align
    Output: intermediate/place_seqs_[arc/bac]/ref_seqs_hmmalign.fasta and intermediate/place_seqs_[arc/bac]/study_seqs_hmmalign.fasta
  4. EPA-NG: Run EPA-NG using the fasta files of reference and study sequences that align and the reference .model file
    Output: intermediate/place_seqs_[arc/bac]/epa_out/epa_info.log, intermediate/place_seqs_[arc/bac]/epa_out/epa_result.jplace and intermediate/place_seqs_[arc/bac]/epa_out/epa_result_parsed.jplace (these are the standard output files from EPA-NG)
  5. SEPP: Write fasta containing study sequences only
    Output: intermediate/place_seqs_[arc/bac]/study_seqs_filtered.fasta
  6. SEPP: Run SEPP using the fasta file containing the study sequences that align to the reference and the reference .model file
    Output: intermediate/place_seqs_[arc/bac]/sepp_out/
  7. Convert jplace output file to newick tree formatOutput: [arc/bac].tre

Hidden-state prediction for 16S copy number and NSTI calculation

Uses: runs the function castor_hsp_workflow within picrust2/wrap_hsp.py, either by running scripts/hsp.py or within scripts/picrust2_pipeline.py.

Aim: perform hidden state prediction on tips in the input tree with unknown trait values. Note that it is used here for prediction of 16S copy number per sequence in the input and will be used again for functional abundance prediction. It will also by default calculate the Nearest Sequenced Taxon Index (unless specified otherwise).

Note that these steps will be run for both bacteria and archaea by default.

Input: newick-formatted tree file output from the sequence placement step that contains reference genomes as well as study sequences.

Key output:

You can see details of the steps carried out below, but these files - arc_marker_predicted_and_nsti.tsv.gz and bac_marker_predicted_and_nsti.tsv.gz are the only output from this step. Note that scripts/hsp.py would have to be run twice to replicate the default behaviour within scripts/picrust2_pipeline.py.

Output obtained from each step:

  1. Calculate NSTI values using the R package Castor (unless specified otherwise)
  2. Run Hidden State Prediction using the R package Castor
    Output: [arc/bac]_marker_predicted_and_nsti.tsv.gz

Determine the best domain for each sequence

Uses: runs the function get_lowest_nsti within picrust2/split_domains.py, either by running scripts/pick_best_domain.py or within scripts/picrust2_pipeline.py.

Aim: determine whether the best domain is bacteria or archaea for each input sequence.

Input:

Key output:

You can see details of the steps that are run to create these files below. They will all be used in the next steps.

Output obtained from each step:

  1. For each sequence, determine which of the domains has the lowest NSTI and then filter the input tables to include only the sequences that fit best in each domain
    Output: combined_marker_nsti_predicted.tsv.gz, arc_reduced_marker_predicted_and_nsti.tsv.gz and bac_reduced_marker_predicted_and_nsti.tsv.gz
  2. Prune the trees to include only the sequences that fit best in each domain
    Output: arc_reduced.tre and bac_reduced.tre

Hidden-state prediction for EC number and KO abundances (or other traits given as input)

Uses: runs the function castor_hsp_workflow within picrust2/wrap_hsp.py, either by running scripts/hsp.py or within scripts/picrust2_pipeline.py.

Aim: perform hidden state prediction on tips in the input tree with unknown trait values. Note that it is used here for prediction of EC number and KO abundances (or other functional category) per sequence in the input. This time it will not be used for NSTI prediction (unless specified otherwise).

Note that these steps will be run for both bacteria and archaea by default.

Input: reduced newick-formatted tree file output from determining which domain is the best fit for each sequence. As above, it contains reference genomes as well as study sequences.

Key output:

You can see details of the steps carried out below, but these files - arc_EC_predicted.tsv.gz, arc_KO_predicted.tsv.gz, bac_EC_predicted.tsv.gz, bac_KO_predicted.tsv.gz or any other functional category given is the only output from this step. Note that scripts/hsp.py would need to be run four separate times to replicate the default behaviour within scripts/picrust2_pipeline.py.

Output obtained from each step:

  1. Run Hidden State Prediction using the R package Castor
    Output: [arc/bac]_[EC/KO/other]_predicted.tsv.gz

Combine the files from hidden-state prediction for each functional category

Uses: runs the function combine_domain_predictions within picrust2/split_domains.py, either by running scripts/combine_domains.py or within scripts/picrust2_pipeline.py.

Aim: combine the hidden-state predictions obtained separately for functional abundances within the bacteria and archaea above.

Input:

Note that it should be the same functional trait used for each of these files!

Key output:

The input files are really just combined to give the output, with a few checks performed for overlaps in sequence names or traits between the two, but any of these output files will be needed to construct the predicted metagenome for each sample.

Predict functional abundances within samples (metagenome prediction)

Uses: runs the function run_metagenome_pipeline within picrust2/metagenome_pipeline.py, either by running scripts/metagenome_pipeline.py or within scripts/picrust2_pipeline.py.

Aim: determine the predicted functional abundance profiles per sample after normalising the input sequence abundances by the predicted marker gene copy number.

Input:

Key output:

Other output:

  1. Drop any sequences if the NSTI is above the maximum NSTI allowed (by default this is 2)
  2. Drop sequence IDs that don't overlap between the study sequences table, marker gene copy number table and predicted functional abundance table
  3. Normalise the abundances of sequences within samples by the marker gene copy number
    Output: [EC/KO/other]_metagenome_out/seqtab_norm.tsv.gz
  4. Calculate the weighted NSTIs for all samples (divide the per-sequence NSTIs by the abundance of the sequences within the sample)
    Output: [EC/KO/other]_metagenome_out/weighted_nsti.tsv.gz
  5. Get unstratified function abundances for each sample - multiply the (normalised - if applicaple) sequence abundance within a sample by the predicted abundance of the function for the sequence
    Output: [EC/KO/other]_metagenome_out/pred_metagenome_unstrat.tsv.gz
  6. Get stratified function abundances for each sample - similar to above, but this is carried out separately for each sequence within each sample, with the intermediate predicted values retained in the output
    Output: [EC/KO/other]_metagenome_out/pred_metagenome_contrib.tsv.gz

Infer pathway abundances

Uses: runs the function pathway_pipeline within picrust2/pathway_pipeline.py, either by running scripts/pathway_pipeline.py or within scripts/picrust2_pipeline.py.

Aim: predict pathway abundance and coverage within metagenome samples. Note that by default, the only option for predicting pathway abundance/coverage is that of MetaCyc pathways using EC numbers.

If you have a KEGG database subscription, you may obtain these files for KEGG pathways.

Input:

Key output:

This can optionally be saved by giving the directory for it to be saved to. By default, it will be saved within intermediate/pathways/

  1. If a mapping file is provided, regroup the functions in the input table to these (by default, the EC numbers are regrouped to MetaCyc reactions using pathway_mapfiles/ec_level4_to_metacyc_rxn_new.tsv)
    Output: intermediate/pathways/regrouped_infile.tsv
  2. Read in the pathways and if Minpath is being used, get the pathways that are present in the samples
    Output: intermediate/pathways/parsed_mapfile.tsv
  3. Run Minpath for each sample to get predictions of the pathways present within each sample
    Output: minpath_running/[sample_name]_minpath_details.txt and minpath_running/[sample_name]_minpath_in.txt (for each sample)
  4. Combine the Minpath output for each sample (for coverage as well as abundance if coverage is set)
    Output: pathways_out/path_abun_[unstrat/contrib/strat].tsv.gz and pathways_out/path_cov_[unstrat/contrib/strat].tsv.gz