MiST is a rapid, accurate and flexible (core-genome) multi-locus sequence typing (MLST) allele caller.

Getting started

1. Installation

Start by installing the tool and its dependencies using the installation instructions.

To check the installation:

2. Downloading a scheme

MiST does not include built-in (cg)MLST schemes. However, schemes can be downloaded from widely used resources such as PubMLST.org, EnteroBase (https://enterobase.warwick.ac.uk/) or cgMLST.org. When using these schemes, make sure to cite the corresponding source in your research.

Instructions on how to download the schemes are provided here.

3. Creating the index

Next, create an index from the locus FASTA files (one file per locus) and, optionally, a profiles TSV file.

mist index abcZ.fasta adk.fasta aroE.fasta fumC.fasta gdh.fasta -o mlst_neisseria

For details, see Indexing schemes.

4. Calling alleles

Once the scheme is indexed, you can query assemblies (FASTA format) against it. Minimal example:

mist call --db mlst_neisseria --fasta input_contigs.fasta

With TSV output, JSON, log, and intermediate Minimap2 output:

mist call --db mlst_neisseria \
    --fasta input_contigs.fasta \
    --out-tsv results/results.tsv \
    --out-dir results/ \
    --log results/log.txt \
    --keep-minimap2 \
    --threads 8
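
To run the same call over several assemblies, the command can be assembled programmatically. The following is a minimal Python sketch, not part of MiST itself: it assumes `mist` is on the `PATH`, uses only the flags shown above, and the helper names `build_call_command` and `call_all` as well as the one-directory-per-sample layout are hypothetical.

```python
from pathlib import Path
import subprocess


def build_call_command(db: str, fasta: Path, out_root: Path, threads: int = 8) -> list[str]:
    """Build a `mist call` command using only the flags shown above."""
    # Hypothetical layout: one output directory per assembly, named after the file.
    out_dir = out_root / fasta.stem
    return [
        "mist", "call",
        "--db", db,
        "--fasta", str(fasta),
        "--out-tsv", str(out_dir / "results.tsv"),
        "--out-dir", str(out_dir),
        "--log", str(out_dir / "log.txt"),
        "--threads", str(threads),
    ]


def call_all(db: str, assemblies_dir: Path, out_root: Path) -> None:
    """Run `mist call` once per *.fasta file found in assemblies_dir."""
    for fasta in sorted(assemblies_dir.glob("*.fasta")):
        cmd = build_call_command(db, fasta, out_root)
        # Create the output directory up front; it is an assumption that this is needed.
        (out_root / fasta.stem).mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if mist exits with a non-zero status.
        subprocess.run(cmd, check=True)
```

Calling `call_all("mlst_neisseria", Path("assemblies"), Path("results"))` would then write one result set per assembly under `results/`. Passing the arguments as a list avoids shell quoting problems with unusual file names.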

For details, see Running MiST.

5. Understanding outputs

MiST produces results in JSON by default, with optional TSV and additional files.

Typical output directory:

results/
├── mist.json              # Main JSON output
├── results.tsv            # (optional) tabular results
├── mist.log               # (optional) log file
├── minimap2_parsed.tsv    # (optional) alignments
└── novel_alleles/         # novel allele FASTAs (if detected)

JSON: contains allele calls, best profile match, and metadata
TSV: simple tabular format (locus, allele, is_novel)
Logs: useful for debugging and traceability
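
The TSV output lends itself to quick downstream filtering. The sketch below lists loci flagged as novel. It assumes the file is tab-separated with a header row naming the three columns mentioned above (`locus`, `allele`, `is_novel`); how `is_novel` is encoded is not stated here, so a few common truthy spellings are accepted. The function name `novel_loci` and the example rows are hypothetical.

```python
import csv
import io

# Assumption: the exact encoding of is_novel is unknown, so accept common spellings.
TRUTHY = {"true", "1", "yes"}


def novel_loci(tsv_text: str) -> list[str]:
    """Return the loci whose is_novel column holds a truthy value."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["locus"]
        for row in reader
        if row["is_novel"].strip().lower() in TRUTHY
    ]


# Made-up example content, not real MiST output.
example = "locus\tallele\tis_novel\nabcZ\t2\tFalse\nadk\t17\tTrue\n"
print(novel_loci(example))  # prints ['adk']
```

For a real run, read the file first, for example with `Path("results/results.tsv").read_text()`, and check the header against the actual output before relying on the column names.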

For details, see Running MiST or follow the Tutorial.

Graphical overview

Database construction

Each FASTA file containing allele sequences for a locus is first clustered by sequence identity using CD-HIT. Sequences of different lengths are forced into separate clusters, regardless of identity. In the overview figure, the resulting clusters are labelled C1, C2, and C3.
Alleles with frameshifts relative to other cluster members are detected using nucmer and split into separate clusters (e.g., C1 is split into C1a and C1b).
One representative per cluster (typing allele) is retained in the final FASTA file.
FASTA files for all loci are combined, and a Minimap2 index is built.
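
The length constraint in the first step can be illustrated in isolation. The sketch below only partitions the alleles of one locus by sequence length, so that alleles of different lengths can never share a cluster. It does not perform the identity clustering (done by CD-HIT) or the frameshift detection (done by nucmer), and the function name `partition_by_length` and the example sequences are hypothetical.

```python
from collections import defaultdict


def partition_by_length(alleles: dict[str, str]) -> dict[int, list[str]]:
    """Group allele IDs by sequence length, preserving input order."""
    groups: dict[int, list[str]] = defaultdict(list)
    for allele_id, seq in alleles.items():
        groups[len(seq)].append(allele_id)
    return dict(groups)


# Made-up sequences: two alleles of length 6 and one of length 9.
alleles = {"abcZ_1": "ATGACC", "abcZ_2": "ATGACT", "abcZ_3": "ATGACCTAA"}
print(partition_by_length(alleles))
# prints {6: ['abcZ_1', 'abcZ_2'], 9: ['abcZ_3']}
```

Identity clustering would then run within each length group, which is one way to guarantee that no cluster mixes lengths.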

Allele calling

Input contigs are aligned to the combined typing alleles using Minimap2.
Corresponding sequences are extracted based on their location in the input contigs.
Extracted sequences are hashed and compared against a database of pre-computed allele hashes.
If one or more exact matches are found, they are reported; otherwise, the best-matching allele is identified.
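
The extraction and hash lookup steps can be sketched as follows. This is an illustration under stated assumptions, not MiST's implementation: the hash function (SHA-256), the upper-casing before hashing, the 0-based half-open coordinates, and all function names are hypothetical. The fallback to the best-matching allele is not shown.

```python
import hashlib

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_region(contig: str, start: int, end: int, reverse: bool = False) -> str:
    """Slice contig[start:end]; reverse-complement it for minus-strand hits."""
    region = contig[start:end]
    return region.translate(_COMPLEMENT)[::-1] if reverse else region


def seq_hash(seq: str) -> str:
    """Hash a sequence case-insensitively (SHA-256 is an assumption)."""
    return hashlib.sha256(seq.upper().encode("ascii")).hexdigest()


def build_hash_index(alleles: dict[str, str]) -> dict[str, list[str]]:
    """Pre-compute a hash -> allele IDs lookup table."""
    index: dict[str, list[str]] = {}
    for allele_id, seq in alleles.items():
        index.setdefault(seq_hash(seq), []).append(allele_id)
    return index


def exact_matches(extracted: str, index: dict[str, list[str]]) -> list[str]:
    """Return all alleles identical to the extracted sequence (may be empty)."""
    return index.get(seq_hash(extracted), [])


# Made-up alleles and contig.
index = build_hash_index({"abcZ_1": "ATGACC", "abcZ_2": "ATGACT"})
hit = extract_region("GGATGACTCC", 2, 8)
print(hit, exact_matches(hit, index))  # prints ATGACT ['abcZ_2']
```

An empty list from `exact_matches` corresponds to the case where no exact match exists and the best-matching allele has to be identified by other means. Comparing hashes instead of full sequences keeps the lookup constant-time regardless of how many alleles the scheme contains.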