macでインフォマティクス

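Metagenomics has revolutionized the study of environmental and human-associated microbiomes. However, only a limited fraction of proteins have a known biological process or molecular function, and this is a major bottleneck. In prokaryotes and viruses, evolution favors keeping genes that take part in the same biological process co-localized as conserved gene clusters. Conversely, conservation of gene neighborhood indicates functional association. Spacedust is a tool for the systematic, de novo discovery of conserved gene clusters. To find homologous protein matches, it uses fast and sensitive structure comparison with Foldseek. Partially conserved clusters are detected using novel clustering and order conservation P-values. The authors demonstrate Spacedust's sensitivity with an all-vs-all analysis of 1,308 bacterial genomes, identifying 72,843 conserved gene clusters that contain 58% of the 4.2 million genes. It also recovered 95% of the antiviral defense system clusters annotated by a specialized tool. Spacedust's high sensitivity and speed should facilitate large-scale annotation of the huge number of sequenced bacterial, archaeal, and viral genomes.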

Installation

I used the published static binary on Ubuntu 22.04 LTS (CPU: TR3990X).

GitHub

static Linux AVX2 build

wget https://mmseqs.com/spacedust/spacedust-linux-avx2.tar.gz; tar xvzf spacedust-linux-avx2.tar.gz; export PATH=$(pwd)/spacedust/bin/:$PATH

static Linux SSE4.1 build

wget https://mmseqs.com/spacedust/spacedust-linux-sse41.tar.gz; tar xvzf spacedust-linux-sse41.tar.gz; export PATH=$(pwd)/spacedust/bin/:$PATH

static macOS build (universal binary with SSE4.1/AVX2/M1 NEON)

wget https://mmseqs.com/spacedust/spacedust-osx-universal.tar.gz; tar xvzf spacedust-osx-universal.tar.gz; export PATH=$(pwd)/spacedust/bin/:$PATH
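If you are unsure which of the three builds applies to a machine, the choice can be scripted. The following is a minimal sketch, not part of Spacedust: the function name `pick_spacedust_url` is hypothetical, the URLs are the ones listed above, and it assumes that on Linux the CPU capabilities appear as `avx2` and `sse4_1` tokens in the `flags` line of `/proc/cpuinfo`.

```shell
# Print the download URL of the static build that matches a machine.
# $1: OS name as reported by `uname -s`
# $2: CPU flags line (e.g. the first "flags" line of /proc/cpuinfo)
pick_spacedust_url() {
  os=$1
  cpu_flags=$2
  base="https://mmseqs.com/spacedust"
  case "$os" in
    Darwin)
      # The macOS build is a universal binary (SSE4.1/AVX2/M1 NEON).
      echo "$base/spacedust-osx-universal.tar.gz" ;;
    Linux)
      # Pad with spaces so that only whole flag names match.
      case " $cpu_flags " in
        *" avx2 "*)   echo "$base/spacedust-linux-avx2.tar.gz" ;;
        *" sse4_1 "*) echo "$base/spacedust-linux-sse41.tar.gz" ;;
        *) echo "no matching static build for these CPU flags" >&2; return 1 ;;
      esac ;;
    *)
      echo "unsupported OS: $os" >&2; return 1 ;;
  esac
}

# Query the current machine; /proc/cpuinfo does not exist on macOS, hence the fallback.
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null || true)
pick_spacedust_url "$(uname -s)" "$flags" || true
```

The printed URL can then be passed to `wget` as in the commands above. AVX2 is tested first because a CPU that supports AVX2 also reports SSE4.1.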

> spacedust -h

$ spacedust -h

Spacedust is a tool to discover conserved gene clusters between any pairs of contig/genomes

spacedust Version: 16b020301be952232d6eb2eaa2cd2ad0933d68b0

usage: spacedust <command> [<args>]

Main workflows for database input/output

createsetdb Create sequence set database from FASTA (and GFF3) input of contigs/genomes

aa2foldseek Map a sequence DB to reference foldseek DB

clusterdb Build a searchable cluster database from sequence DB or foldseek structure DB

clustersearch Find clusters of colocalized hits between any query-target sequence/profile set database

Special-purpose utilities

besthitbyset For each set of sequences compute the best element and update p-value

combinehits Group hits and compute a combined E-value for each query-target set pair

summarizeresults Summarize results on clustered hits

clusterhits Find clusters of hits by agglomerative hierarchical clustering and compute their clustering and ordering P-values
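After unpacking the archive, it can be worth confirming that the binary on `PATH` really exposes the main workflows listed in this help text. The sketch below is only an illustration: the helper name `missing_subcommands` is hypothetical, the four subcommand names are copied from the output above, and it assumes each subcommand starts its own line in the help text.

```shell
# Main workflow names, copied from the help output above.
expected="createsetdb aa2foldseek clusterdb clustersearch"

# Read help text on stdin and print every expected subcommand
# that does not appear at the start of a line.
missing_subcommands() {
  help_text=$(cat)
  for name in $expected; do
    printf '%s\n' "$help_text" \
      | grep -q "^[[:space:]]*${name}[[:space:]]" \
      || printf '%s\n' "$name"
  done
}

if command -v spacedust >/dev/null 2>&1; then
  # No output means all four workflows were found.
  spacedust -h | missing_subcommands
else
  echo "spacedust not found on PATH" >&2
fi
```

If the second branch is taken, the `export PATH=...` part of the install command has probably not been run in the current shell.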

> spacedust createsetdb -h

$ spacedust createsetdb -h

usage: spacedust createsetdb <i:fastaFile1[.gz|bz2]> ... <i:fastaFileN[.gz|bz2]> <o:setDB> [options]

By Ruoshi Zhang ruoshi.zhang@mpinat.mpg.de & Milot Mirdita milot@mirdita.de

options: misc:

--dbtype INT Database type 0: auto, 1: amino acid 2: nucleotides [0]

--shuffle BOOL Shuffle input database [0]

--createdb-mode INT Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]

--id-offset INT Numeric ids in index file are offset by this value [0]

--min-length INT Minimum codon number in open reading frames [30]

--max-length INT Maximum codon number in open reading frames [32734]

--max-gaps INT Maximum number of codons with gaps or unknown residues before an open reading frame is rejected [2147483647]

--contig-start-mode INT Contig start can be 0: incomplete, 1: complete, 2: both [2]

--contig-end-mode INT Contig end can be 0: incomplete, 1: complete, 2: both [2]

--orf-start-mode INT Orf fragment can be 0: from start to stop, 1: from any to stop, 2: from last encountered start to stop (no start in the middle) [1]

--forward-frames STR Comma-separated list of frames on the forward strand to be extracted [1,2,3]

--reverse-frames STR Comma-separated list of frames on the reverse strand to be extracted [1,2,3]

--translation-table INT 1) CANONICAL, 2) VERT_MITOCHONDRIAL, 3) YEAST_MITOCHONDRIAL, 4) MOLD_MITOCHONDRIAL, 5) INVERT_MITOCHONDRIAL, 6) CILIATE
9) FLATWORM_MITOCHONDRIAL, 10) EUPLOTID, 11) PROKARYOTE, 12) ALT_YEAST, 13) ASCIDIAN_MITOCHONDRIAL, 14) ALT_FLATWORM_MITOCHONDRIAL
15) BLEPHARISMA, 16) CHLOROPHYCEAN_MITOCHONDRIAL, 21) TREMATODE_MITOCHONDRIAL, 22) SCENEDESMUS_MITOCHONDRIAL
23) THRAUSTOCHYTRIUM_MITOCHONDRIAL, 24) PTEROBRANCHIA_MITOCHONDRIAL, 25) GRACILIBACTERIA, 26) PACHYSOLEN, 27) KARYORELICT, 28) CONDYLOSTOMA
29) MESODINIUM, 30) PERTRICH, 31) BLASTOCRITHIDIA [1]

--translate INT Translate ORF to amino acid [0]

--use-all-table-starts BOOL Use all alternatives for a start codon in the genetic table, if false - only ATG (AUG) [0]

--add-orf-stop BOOL Add stop codon '*' at complete start and end [0]

--gff-type STR Comma separated list of feature types in the GFF file to select

--stat STR One of: linecount, mean, min, max, doolittle, charges, seqlen, firstline

--tsv BOOL Return output in TSV format [0]

--gff-dir STR Path to gff dir file

common:

--compressed INT Write compressed output [0]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--threads INT Number of CPU-cores used (all by default) [128]

expert:

--write-lookup INT write .lookup file containing mapping from internal id, fasta id and file number [1]

--create-lookup INT Create database lookup file (can be very large) [0]

references:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)
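As an illustration of how the options above combine, the following sketch assembles a `createsetdb` command line and prints it without running it. It follows the usage line shown above (FASTA files first, then the output set database); check `spacedust createsetdb -h` on your own build for any further positional arguments. The helper name `build_createsetdb_cmd` and all file names are hypothetical, and treating the `--gff-dir` argument as a text file that lists the GFF3 files is an assumption based on the one-line description "Path to gff dir file".

```shell
# Print (do not execute) a createsetdb command line.
# $1: output set database name
# $2: file given to --gff-dir (assumed here to list the GFF3 files)
# $3: number of threads
# remaining arguments: input FASTA files
build_createsetdb_cmd() {
  set_db=$1
  gff_list=$2
  threads=$3
  shift 3
  printf 'spacedust createsetdb'
  for fasta in "$@"; do
    printf ' %s' "$fasta"
  done
  # --gff-type takes a comma-separated list of GFF feature types; CDS is used as an example.
  printf ' %s --gff-dir %s --gff-type CDS --threads %s\n' "$set_db" "$gff_list" "$threads"
}

build_createsetdb_cmd setDB gff_paths.txt 8 genome1.fna genome2.fna
```

Printing the command first makes it easy to review the argument order before committing to a long run. Setting `--threads` explicitly can also be useful on a shared machine, since the help text shows that all cores (128 here) are used by default.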

> spacedust aa2foldseek -h

$ spacedust aa2foldseek -h

usage: spacedust aa2foldseek <i:inputDB> <i:targetDB> [options]

By Ruoshi Zhang ruoshi.zhang@mpinat.mpg.de & Milot Mirdita milot@mirdita.de

options: prefilter:

--seed-sub-mat TWIN Substitution matrix file for k-mer generation [aa:VTML80.out,nucl:nucleotide.out]

-s FLOAT Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]

-k INT k-mer length (0: automatically set to optimum) [0]

--k-score TWIN k-mer threshold for generating similar k-mer lists [seq:2147483647,prof:2147483647]

--alph-size TWIN Alphabet size (range 2-21) [aa:21,nucl:5]

--max-seqs INT Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [10]

--split INT Split input into N equally distributed chunks. 0: set the best split automatically [0]

--split-mode INT 0: split target db; 1: split query db; 2: auto, depending on main memory [2]

--split-memory-limit BYTE Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

--comp-bias-corr INT Correct for locally biased amino acid composition (range 0-1) [1]

--comp-bias-corr-scale FLOAT Correct for locally biased amino acid composition (range 0-1) [1.000]

--diag-score BOOL Use ungapped diagonal scoring during prefilter [1]

--exact-kmer-matching INT Extract only exact k-mers for matching (range 0-1) [1]

--mask INT Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [1]

--mask-prob FLOAT Mask sequences is probablity is above threshold [0.900]

--mask-lower-case INT Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]

--min-ungapped-score INT Accept only matches with ungapped alignment score above threshold [15]

--add-self-matches BOOL Artificially add entries of queries with themselves (for clustering) [0]

--spaced-kmer-mode INT 0: use consecutive positions in k-mers; 1: use spaced k-mers [1]

--spaced-kmer-pattern STR User-specified spaced k-mer pattern

--local-tmp STR Path where some of the temporary files will be created

align:

-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.900]

--cov-mode INT 0: coverage of query and target

1: coverage of target

2: coverage of query

3: target seq. length has to be at least x% of query length

4: query seq. length has to be at least x% of target length

5: short seq. needs to be at least x% of the other seq. length [0]

-a BOOL Add backtrace string (convert to alignments with mmseqs convertalis module) [0]

--alignment-mode INT How to compute the alignment:

0: automatic

1: only score and end_pos

2: also start_pos and cov

3: also seq.id

4: only ungapped alignment [0]

--alignment-output-mode INT How to compute the alignment:

0: automatic

1: only score and end_pos

2: also start_pos and cov

3: also seq.id

4: only ungapped alignment

5: score only (output) cluster format [0]

--wrapped-scoring BOOL Double the (nucleotide) query sequence during the scoring process to allow wrapped diagonal scoring around end and start [0]

-e DOUBLE List matches below this E-value (range 0.0-inf) [1.000E-03]

--min-seq-id FLOAT List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.900]

--min-aln-len INT Minimum alignment length (range 0-INT_MAX) [0]

--seq-id-mode INT 0: alignment length 1: shorter, 2: longer sequence [0]

--alt-ali INT Show up to this many alternative alignments [0]

--max-rejected INT Maximum rejected alignments before alignment calculation for a query is stopped [2147483647]

--max-accept INT Maximum accepted alignments before alignment calculation for a query is stopped [2147483647]

--score-bias FLOAT Score bias when computing SW alignment (in bits) [0.000]

--realign BOOL Compute more conservative, shorter alignments (scores and E-values not changed) [0]

--realign-score-bias FLOAT Additional bias when computing realignment [-0.200]

--realign-max-seqs INT Maximum number of results to return in realignment [2147483647]

--corr-score-weight FLOAT Weight of backtrace correlation score that is added to the alignment score [0.000]

--gap-open TWIN Gap open cost [aa:11,nucl:5]

--gap-extend TWIN Gap extension cost [aa:1,nucl:2]

--zdrop INT Maximal allowed difference between score values before alignment is truncated (nucleotide alignment only) [40]

profile:

--pca Pseudo count admixture strength

--pcb Pseudo counts: Neff at half of maximum admixture (range 0.0-inf)

misc:

--taxon-list STR Taxonomy ID, possibly multiple values separated by ','

--stat STR One of: linecount, mean, min, max, doolittle, charges, seqlen, firstline

--tsv BOOL Return output in TSV format [0]

common:

--sub-mat TWIN Substitution matrix file [aa:blosum62.out,nucl:nucleotide.out]

--max-seq-len INT Maximum sequence length [65535]

--db-load-mode INT Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]

--threads INT Number of CPU-cores used (all by default) [128]

--compressed INT Write compressed output [0]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--remove-tmp-files BOOL Delete temporary files [0]

references:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)
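The `-c` and `--cov-mode` options in the align section are easier to follow with numbers. The sketch below is a reading of the help text only, not the tool's implementation: the function name `passes_cov` is hypothetical, coverage is taken to be aligned residues divided by sequence length, and modes 3 to 5 are interpreted as plain length ratios.

```shell
# Return success if a hit would pass the coverage filter under the given mode.
# $1: cov-mode (0-5)   $2: threshold c
# $3: aligned residues in query   $4: aligned residues in target
# $5: query length                $6: target length
passes_cov() {
  awk -v m="$1" -v c="$2" -v qa="$3" -v ta="$4" -v ql="$5" -v tl="$6" 'BEGIN {
    qcov = qa / ql                 # fraction of the query that is aligned
    tcov = ta / tl                 # fraction of the target that is aligned
    if      (m == 0) ok = (qcov >= c && tcov >= c)
    else if (m == 1) ok = (tcov >= c)
    else if (m == 2) ok = (qcov >= c)
    else if (m == 3) ok = (tl / ql >= c)
    else if (m == 4) ok = (ql / tl >= c)
    else if (m == 5) ok = ((ql < tl ? ql / tl : tl / ql) >= c)
    else             ok = 0
    exit (ok ? 0 : 1)
  }'
}

# A 200-residue query aligned over 95 residues to a 100-residue target.
for mode in 0 1 2; do
  if passes_cov "$mode" 0.9 95 95 200 100; then
    echo "cov-mode $mode: keep"
  else
    echo "cov-mode $mode: drop"
  fi
done
```

In this example the target is 95% covered but the query only 47.5%, so with the default `-c 0.9` the hit is kept under mode 1 and dropped under modes 0 and 2. Since mode 0 is the default according to the help text, a short domain match inside a long protein would be filtered out unless `--cov-mode` or `-c` is changed.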

> spacedust clusterdb -h

$ spacedust clusterdb -h

usage: spacedust clusterdb <i:inputDB> [options]

By Ruoshi Zhang ruoshi.zhang@mpinat.mpg.de & Milot Mirdita milot@mirdita.de

options: prefilter:

--seed-sub-mat TWIN Substitution matrix file for k-mer generation [aa:VTML80.out,nucl:nucleotide.out]

-s FLOAT Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]

-k INT k-mer length (0: automatically set to optimum) [0]

--k-score TWIN k-mer threshold for generating similar k-mer lists [seq:2147483647,prof:2147483647]

--alph-size TWIN Alphabet size (range 2-21) [aa:21,nucl:5]

--max-seqs INT Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [300]

--split INT Split input into N equally distributed chunks. 0: set the best split automatically [0]

--split-mode INT 0: split target db; 1: split query db; 2: auto, depending on main memory [2]

--split-memory-limit BYTE Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

--comp-bias-corr INT Correct for locally biased amino acid composition (range 0-1) [1]

--comp-bias-corr-scale FLOAT Correct for locally biased amino acid composition (range 0-1) [1.000]

--diag-score BOOL Use ungapped diagonal scoring during prefilter [1]

--exact-kmer-matching INT Extract only exact k-mers for matching (range 0-1) [0]