Full pipeline script (original) (raw)

The entire PICRUSt2 pipeline can be run using a single script, called picrust2_pipeline.py. This script will run each of the 4 key steps outlined on this wiki: (1) sequence placement, (2) hidden-state prediction of genomes, (3) metagenome prediction, (4) pathway-level predictions.

The option of this program are the same as for each individual scripts overall.

The standard pipeline will generate metagenome predictions for 16S rRNA gene data. The input files should be a FASTA of amplicon sequences variants (ASVs; i.e. your representative sequences, not your raw reads, which is study_seqs.fna below) and a BIOM table of the abundance of each ASV across each sample (study_seqs.biom below). Note that a tab-delimited table with ASV ids as the first column and sample abundances as all subsequent columns will also work.

The below command will run the full default pipeline on the two input files. EC number and KO metagenomes are predicted as well as MetaCyc pathway abundances and coverages predicted based on the predicted EC number abundances. The nearest-sequenced taxon index (NSTI) will be calculated for each input ASV and by default any ASVs with NSTI > 2 will be excluded from the output by default. Stratified output will only be calculated when the --stratified option is set, which can greatly increase run-time.

picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline -p 1

Note on using SEPP: if you use SEPP for sequence placement, you may need to first run export LC_ALL=C

Please note that before PICRUSt2-v2.6.0 the default running of this command was with the PICRUSt2-oldIMG database. As of PICRUSt2-v2.6.0 the default database will be the PICRUSt2-SC database. See here for further details on this new database. If you are using v2.6.0 you can still use the PICRUSt2-oldIMG database, but you will need to use the picrust2_pipeline_oldIMG.py command. The options/output will then be the same as v2.5.3 below.

All of the output files produced by the pipeline (including intermediate files, which can be useful for troubleshooting issues), will be output in picrust2_out_pipeline. Note that these are the default outputs, but if you specify different functional databases or custom reference databases (e.g. non-16S amplicon reference data) then the output will differ. See About the output for full details on all files produced.

v2.6.0 onwards

The key output files are:

EC_metagenome_out - Folder containing unstratified EC number metagenome predictions (pred_metagenome_unstrat.tsv.gz), sequence table normalized by predicted 16S copy number abundances (seqtab_norm.tsv.gz), and the per-sample NSTI values weighted by the abundance of each ASV (weighted_nsti.tsv.gz).
KO_metagenome_out - As EC_metagenome_out above, but for KO metagenomes.
pathways_out - Folder containing predicted pathway abundances and coverages per-sample, based on predicted EC number abundances.

Additional output files, which can be useful for advanced users are:

combined_EC_predicted.tsv.gz - Predicted EC number copy numbers per ASV.
intermediate - Folder containing intermediate MinPath files and files used for sequence placement pipeline (including JPLACE files: intermediate/place_seqs_bac/epa_out/epa_result.jplace and intermediate/place_seqs_arc/epa_out/epa_result.jplace).
combined_KO_predicted.tsv.gz - As combined_EC_predicted.tsv.gz above, but for KO predictions.
combined_marker_nsti_predicted.tsv.gz - Predicted 16S copy numbers and NSTI per ASV. This file now includes
bac_reduced_out.tre and arc_reduced_out.tre - Trees of reference and study 16S sequences placed into the bacterial and archaeal trees. Note that each study sequence should be placed into only one of these trees.
bac_out.tre and arc_out.tre - Trees of reference and study 16S sequences placed into the bacterial and archaeal trees. These differ from the reduced files above in that each study sequence should be placed into both of these trees.

See About the output for full details on all files produced.

Options:

-s PATH - FASTA of unaligned study sequences
-i PATH - Input table of sequence abundances (BIOM, TSV, or mothur shared file format)
-o PATH - Output folder
-p INT: Number of processes to run in parallel.
-t epa-ng|sepp: Placement tool to use when placing sequences into reference tree. One of "epa-ng" or "sepp" must be input (default: epa-ng)
--ref_dir1 DIRECTORY: Argument specifying non-default reference files to use for sequence placement. There are four expected files in this directory (see below). By default, this will be the bacterial reference files. If you wish to use only one domain, I recommend using the previous picrust2_pipeline.py script (full details here) for running with one of the databases, but if you wish to add two different references or replace one of the bacterial and archaeal references, then you could use this in this way.
--ref_dir2 DIRECTORY: As above ref_dir1 but for a second domain.
--in_traits IN_TRAITS - Comma-delimited list (with no spaces) of which gene families to predict from this set: BIGG, CAZY, EC, GENE_NAMES, GO, KO, PFAM. Note that EC numbers will always be predicted unless --no_pathways is set (default: EC,KO).
--custom_trait_tables_ref1 PATH - Optional path to custom trait tables with gene families as columns and genomes as rows for the first reference domain (default: bacteria; overrides --in_traits setting) to be used for hidden-state prediction. Multiple tables can be specified by delimiting filenames by commas. Importantly, the first custom table specified will be used for inferring pathway abundances. Typically this command would be used with a custom marker gene table (--marker_gene_table) as well
--custom_trait_tables_ref2 As above custom_trait_tables_ref1 but for the second reference group (default: archaea)
--marker_gene_table_ref1 PATH - Path to marker gene copy number table for the first domain (default: bacteria, 16S copy numbers by default).
--marker_gene_table_ref2 PATH - As above marker_gene_table_ref1 for the second domain (default: archaea, 16S copy numbers by default).
--pathway_map MAP - MinPath mapfile. The default mapfile maps MetaCyc reactions to prokaryotic pathways (default: picrust2/default_files/pathway_mapfiles/metacyc_path2rxn_struc_filt_pro.txt).
--reaction_func STR - Functional database to use as reactions for inferring pathway abundances (default: "EC", which corresponds to the prokaryotic EC number table). This should be either the short-form of the database as specified in --in_traits, or the path to the file as would be specified for --custom_trait_tables. Note that when functions besides the default prokaryotic EC numbers are used typically the --no_regroup option would also be set.
--no_pathways - Flag to indicate that pathways should NOT be inferred (otherwise they will be inferred by default). Predicted E.C. number abundances are used to infer pathways when default reference files are used.
--regroup_map ID_MAP - Mapfile of ids to regroup gene families to before running MinPath. The default mapfile is for regrouping E. C. numbers to MetaCyc reactions (default: picrust2/default_files/pathway_mapfiles/ec_level4_to_metacyc_rxn.tsv).
--no_regroup - Do not regroup input gene families to reactions as specified in the regrouping mapfile. This option should only be used if you are using custom reference and/or mapping files.
--stratified - Flag to indicate that stratified tables should be generated at all steps (will increase run-time).
-a {hmmalign,papara} - Which program to use for aligning query sequences to reference MSA prior to EPA-NG step (default: hmmalign).
--max_nsti INT - Sequences with NSTI values above this value will be excluded (default: 2).
--min_reads INT - Minimum number of reads across all samples for each input ASV. ASVs below this cut-off will be counted as part of the "RARE" category in the stratified output (default: 1).
--min_samples INT - Minimum number of samples that an ASV needs to be identfied within. ASVs below this cut-off will be counted as part of the "RARE" category in the stratified output (default: 1).
-m {mp,emp_prob,pic,scp,subtree_average} - HSP method to use."mp": predict discrete traits using max parsimony. "emp_prob": predict discrete traits based on empirical state probabilities across tips. "subtree_average": predict continuous traits using subtree averaging. "pic": predict continuous traits with phylogentic independent contrast. "scp": reconstruct continuous traits using squared-change parsimony (default: mp).
-e float - Setting for maximum parisomony hidden-state prediction. Specifies weighting transition costs by the inverse length of edge lengths. If 0, then edge lengths do not influence predictions. Must be a non-negative real-valued number (default: 0.5).
--skip_nsti - Do not calculate nearest-sequenced taxon index (NSTI), which is added as a column in the predicted marker-gene copy-number table by default (marker_nsti_predicted.tsv.gz).
--no_gap_fill - Do not perform gap filling before predicting pathway abundances (Gap filling is on otherwise by default).
--coverage - Calculate pathway coverages as well as abundances, which are an alternative way to identify which pathway are present. Note that these values are experimental and only useful for advanced users. Coverage is also calculated using the same approach as HUMAnN2.
--skip_minpath -Do not run MinPath to identify which pathways are present as a first pass (MinPath is run by default).
--per_sequence_contrib - Option to specify that stratified abundances should be reported in terms of the contribution by each predicted genome rather than how much each genome is contributing to the overall community abundance. In other words, pathway abundances will be calculated for each individual predicted genome. Stratified coverages will only be reported when this option is used (and --coverage is set). As of v2.2.0-b, unstratified pathway abundances based on the community-wide pathway abundances and also based on the per-sequence pathway abundances will be output when this option is used.
--wide_table -Output wide-format stratified table of metagenome predictions when --strat_out is set. This is the deprecated method of generating stratified tables since it is extremely memory intensive. The stratified outfile is named pred_metagenome_strat.tsv.gz when this option is set (added in v2.2.0-b).
--skip_norm - Skip normalizing sequence abundances by predicted marker gene copy numbers (typically 16S rRNA genes). This step will be performed automatically unless this option is specified (added in v2.2.0-b).
--remove_intermediate - Remove the intermediate outfiles of the sequence placement and pathway inference steps.
--verbose - If specified, print out wrapped commands to screen.

v2.5.3 and earlier

Note that the behaviour of v2.5.3 and earlier can be replicated using picrust2_pipeline_oldIMG.py.

The key output files are:

EC_metagenome_out - Folder containing unstratified EC number metagenome predictions (pred_metagenome_unstrat.tsv.gz), sequence table normalized by predicted 16S copy number abundances (seqtab_norm.tsv.gz), and the per-sample NSTI values weighted by the abundance of each ASV (weighted_nsti.tsv.gz).
KO_metagenome_out - As EC_metagenome_out above, but for KO metagenomes.
pathways_out - Folder containing predicted pathway abundances and coverages per-sample, based on predicted EC number abundances.

Additional output files, which can be useful for advanced users are:

EC_predicted.tsv.gz - Predicted EC number copy numbers per ASV.
intermediate - Folder containing intermediate MinPath files and files used for sequence placement pipeline (including JPLACE file: intermediate/place_seqs/epa_out/epa_result.jplace).
KO_predicted.tsv.gz - As EC_predicted.tsv.gz above, but for KO predictions.
marker_nsti_predicted.tsv.gz - Predicted 16S copy numbers and NSTI per ASV.
out.tre - Tree of reference and study 16S sequences.

See About the output for full details on all files produced.

Options:

-s PATH - FASTA of unaligned study sequences
-i PATH - Input table of sequence abundances (BIOM, TSV, or mothur shared file format)
-o PATH - Output folder
-p INT: Number of processes to run in parallel.
-t epa-ng|sepp: Placement tool to use when placing sequences into reference tree. One of "epa-ng" or "sepp" must be input (default: epa-ng)
--ref_dir DIRECTORY: Argument specifying non-default reference files to use for sequence placement. There are four expected files in this directory (see below).
--in_traits IN_TRAITS - Comma-delimited list (with no spaces) of which gene families to predict from this set: EC, KO, GO, PFAM, BIGG, CAZY, GENE_NAMES. Note that EC numbers will always be predicted unless --no_pathways is set (default: EC,KO).
--custom_trait_tables PATH - Optional path to custom trait tables with gene families as columns and genomes as rows (overrides --in_traits setting) to be used for hidden-state prediction. Multiple tables can be specified by delimiting filenames by commas. Importantly, the first custom table specified will be used for inferring pathway abundances. Typically this command would be used with a custom marker gene table (--marker_gene_table) as well
--marker_gene_table PATH - Path to marker gene copy number table (16S copy numbers by default).
--pathway_map MAP - MinPath mapfile. The default mapfile maps MetaCyc reactions to prokaryotic pathways (default: picrust2/default_files/pathway_mapfiles/metacyc_path2rxn_struc_filt_pro.txt).
--reaction_func STR - Functional database to use as reactions for inferring pathway abundances (default: "EC", which corresponds to the prokaryotic EC number table). This should be either the short-form of the database as specified in --in_traits, or the path to the file as would be specified for --custom_trait_tables. Note that when functions besides the default prokaryotic EC numbers are used typically the --no_regroup option would also be set.
--no_pathways - Flag to indicate that pathways should NOT be inferred (otherwise they will be inferred by default). Predicted E.C. number abundances are used to infer pathways when default reference files are used.
--regroup_map ID_MAP - Mapfile of ids to regroup gene families to before running MinPath. The default mapfile is for regrouping E. C. numbers to MetaCyc reactions (default: picrust2/default_files/pathway_mapfiles/ec_level4_to_metacyc_rxn.tsv).
--no_regroup - Do not regroup input gene families to reactions as specified in the regrouping mapfile. This option should only be used if you are using custom reference and/or mapping files.
--stratified - Flag to indicate that stratified tables should be generated at all steps (will increase run-time).
-a {hmmalign,papara} - Which program to use for aligning query sequences to reference MSA prior to EPA-NG step (default: hmmalign).
--max_nsti INT - Sequences with NSTI values above this value will be excluded (default: 2).
--min_reads INT - Minimum number of reads across all samples for each input ASV. ASVs below this cut-off will be counted as part of the "RARE" category in the stratified output (default: 1).
--min_samples INT - Minimum number of samples that an ASV needs to be identfied within. ASVs below this cut-off will be counted as part of the "RARE" category in the stratified output (default: 1).
-m {mp,emp_prob,pic,scp,subtree_average} - HSP method to use."mp": predict discrete traits using max parsimony. "emp_prob": predict discrete traits based on empirical state probabilities across tips. "subtree_average": predict continuous traits using subtree averaging. "pic": predict continuous traits with phylogentic independent contrast. "scp": reconstruct continuous traits using squared-change parsimony (default: mp).
-e float - Setting for maximum parisomony hidden-state prediction. Specifies weighting transition costs by the inverse length of edge lengths. If 0, then edge lengths do not influence predictions. Must be a non-negative real-valued number (default: 0.5).
--skip_nsti - Do not calculate nearest-sequenced taxon index (NSTI), which is added as a column in the predicted marker-gene copy-number table by default (marker_nsti_predicted.tsv.gz).
--no_gap_fill - Do not perform gap filling before predicting pathway abundances (Gap filling is on otherwise by default).
--coverage - Calculate pathway coverages as well as abundances, which are an alternative way to identify which pathway are present. Note that these values are experimental and only useful for advanced users. Coverage is also calculated using the same approach as HUMAnN2.
--skip_minpath -Do not run MinPath to identify which pathways are present as a first pass (MinPath is run by default).
--per_sequence_contrib - Option to specify that stratified abundances should be reported in terms of the contribution by each predicted genome rather than how much each genome is contributing to the overall community abundance. In other words, pathway abundances will be calculated for each individual predicted genome. Stratified coverages will only be reported when this option is used (and --coverage is set). As of v2.2.0-b, unstratified pathway abundances based on the community-wide pathway abundances and also based on the per-sequence pathway abundances will be output when this option is used.
--wide_table -Output wide-format stratified table of metagenome predictions when --strat_out is set. This is the deprecated method of generating stratified tables since it is extremely memory intensive. The stratified outfile is named pred_metagenome_strat.tsv.gz when this option is set (added in v2.2.0-b).
--skip_norm - Skip normalizing sequence abundances by predicted marker gene copy numbers (typically 16S rRNA genes). This step will be performed automatically unless this option is specified (added in v2.2.0-b).
--remove_intermediate - Remove the intermediate outfiles of the sequence placement and pathway inference steps.
--verbose - If specified, print out wrapped commands to screen.