a multiple sequence alignment program

Acceptable symbols in the default mode

If the input data contains any other letters, the calculation stops with an error message. Non-alphabetical characters (except the period) are removed by default. To keep these characters, use the --anysymbol option described below.

Sequence type (protein / nucleotide) is automatically recognized based on the frequency of a, t, g, c, and u, unless the --amino or --nuc flag is given.
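The idea of frequency-based recognition can be sketched as follows. This is only an illustration, not MAFFT's actual code: the function name guess_sequence_type and the 0.75 cutoff are assumptions made for this example, and the real program may count characters and choose its threshold differently.

```python
def guess_sequence_type(sequences, threshold=0.75):
    """Guess 'nucleotide' or 'protein' from the share of a, t, g, c and u.

    The threshold is a hypothetical value chosen for illustration;
    it is not taken from MAFFT.
    """
    nucleotide_letters = set("atgcu")
    total = 0
    hits = 0
    for seq in sequences:
        for ch in seq.lower():
            if ch.isalpha():  # ignore digits, gaps and other symbols
                total += 1
                if ch in nucleotide_letters:
                    hits += 1
    if total == 0:
        raise ValueError("no alphabetical characters found")
    return "nucleotide" if hits / total >= threshold else "protein"
```

When such a guess would be unreliable (for example, very short or compositionally biased sequences), giving --amino or --nuc explicitly avoids the question altogether.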

--anysymbol

To use unusual characters (e.g., U as selenocysteine in protein sequence; i as inosine in nucleotide sequence), use the --anysymbol option:

% mafft --anysymbol input > output

It accepts any printable characters (U, O, #, $, %, etc.; 0x21-0x7e in the ASCII code), except for > (0x3e) and ( (0x28). Unusual characters are scored as unknown (not considered in the calculation), unlike in the --text mode.
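Before running a large job, it can help to check the input against the rule just stated. The following is a minimal sketch: the helper name unacceptable_characters is hypothetical, and it encodes only the range and the two exceptions given above, not any other processing MAFFT may apply to whitespace or gap characters.

```python
def unacceptable_characters(sequence):
    """Return the characters of `sequence` that fall outside the stated rule:
    printable ASCII 0x21-0x7e, excluding '>' (0x3e) and '(' (0x28)."""
    excluded = {0x3E, 0x28}
    bad = []
    for ch in sequence:
        code = ord(ch)
        if not (0x21 <= code <= 0x7E) or code in excluded:
            bad.append(ch)
    return bad
```

An empty list means every character of the sequence lies in the accepted range.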

When the input data is:

>
SampleSequenceWithUnusualCharacter
>
Sample#Sequence_With%Various^Unusual*Characters
>
SAMPLESEQUENCE

The result will be:

>
Sample-Sequence-With---------Unusual-Character-
>
Sample#Sequence_With%Various^Unusual*Characters
>
SAMPLE-SEQUENCE--------------------------------

Upper/lower case is preserved. The --anysymbol option is internally equivalent to the --preservecase option.
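The properties of the example above (all rows have the same length, and removing the gaps gives back each input with its case and unusual characters intact) can be checked with a short script. This is a sketch; check_alignment is a hypothetical helper, and it assumes that "-" is the gap character and does not occur in the input sequences themselves.

```python
def check_alignment(inputs, aligned, gap="-"):
    """Return True if all aligned rows have equal length and each row,
    with gaps removed, is identical to the corresponding input."""
    if len(inputs) != len(aligned):
        return False
    same_length = len({len(row) for row in aligned}) == 1
    unchanged = all(
        row.replace(gap, "") == seq for seq, row in zip(inputs, aligned)
    )
    return same_length and unchanged


# Sequences from the example above.
inputs = [
    "SampleSequenceWithUnusualCharacter",
    "Sample#Sequence_With%Various^Unusual*Characters",
    "SAMPLESEQUENCE",
]
aligned = [
    "Sample-Sequence-With" + "-" * 9 + "Unusual-Character-",
    "Sample#Sequence_With%Various^Unusual*Characters",
    "SAMPLE-SEQUENCE" + "-" * 32,  # padded with trailing gaps
]
ok = check_alignment(inputs, aligned)
```

The comparison is case-sensitive, so a row that had been converted to lower case would fail the check.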

For aligning non-biological sequences, use the --text mode, in which unusual characters are also considered in the alignment calculation.