MSTMap -- an efficient tool for constructing genetic linkage maps
Welcome to MSTMap
MSTMap is a software tool for constructing genetic linkage maps efficiently and accurately. It can handle various mapping populations, including BC1, DH, Hap, and RIL, among others. The tool builds the genetic linkage map by first constructing a Minimum Spanning Tree (MST), hence the name MSTMap. The algorithm implemented in MSTMap is very efficient and can handle ultra-dense maps of 10,000 to 100,000 markers. According to our experimental studies, when the data quality is high, the maps produced by our tool are as accurate as those produced by the best tools available in the literature. When the data are noisy, however, the maps generated by our algorithm are significantly better than those of the other tools.
How to use MSTMap
You will need to download the source code and compile it on a Linux machine. MSTMap takes two command-line arguments: input_file and output_file. The following command line demonstrates the usage of MSTMap:
MSTMap example.txt example_map.txt
The input file format
Each input file consists of two parts, the header and the body. Please refer to example.txt for an example.
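- The header always contains the following lines, where <population_type>, ..., <number_of_individual> are the places for you to specify various parameters.
population_type <population_type>
population_name <population_name>
distance_function <distance_function>
cut_off_p_value <cut_off_p_value>
no_map_dist <no_map_dist>
no_map_size <no_map_size>
missing_threshold <missing_threshold>
estimation_before_clustering <estimation_before_clustering>
detect_bad_data <detect_bad_data>
objective_function <objective_function>
number_of_loci <number_of_loci>
number_of_individual <number_of_individual>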
population_type specifies the type of mapping population being used. Possible values are DH and RILd, where d is any natural number. For example, RIL6 means a RIL population at generation 6. You should use DH for any population that involves only two distinct genotype states (even if it is not a DH population), which includes BC1, DH, Hap, and advanced RIL.
population_name gives a name for the mapping population. It can be any string of letters (a-z, A-Z) or digits (0-9).
distance_function specifies the distance function to be used. Possible choices are kosambi and haldane, which refer to the commonly used Kosambi and Haldane distance functions respectively.
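For reference, both choices convert a recombination fraction r (0 <= r < 0.5) into a map distance. The sketch below uses the standard textbook forms of the two mapping functions, expressed in centimorgans. The function names are illustrative and are not taken from the MSTMap source code.

```python
import math


def _check_fraction(r: float) -> None:
    # Both mapping functions are only defined for 0 <= r < 0.5.
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")


def haldane_cm(r: float) -> float:
    """Haldane distance in cM: d = -50 * ln(1 - 2r). Assumes no interference."""
    _check_fraction(r)
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r: float) -> float:
    """Kosambi distance in cM: d = 25 * ln((1 + 2r) / (1 - 2r))."""
    _check_fraction(r)
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    for r in (0.01, 0.1, 0.3):
        print(r, round(haldane_cm(r), 3), round(kosambi_cm(r), 3))
```

For small r the two functions nearly agree; for larger r the Kosambi distance is the smaller of the two because it allows for crossover interference.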
cut_off_p_value specifies the threshold to be used for clustering the markers into linkage groups (LGs). A reasonable choice is 0.000001. Alternatively, the user can turn off this feature by setting cut_off_p_value to any number larger than 1. In that case, MSTMap assumes that all markers belong to a single linkage group.
no_map_dist and no_map_size together allow one to detect bad markers. In high-density genetic linkage mapping, bad markers appear to be isolated from the others. MSTMap detects isolated marker groups and places them in separate LGs. An isolated marker group is a set of at most no_map_size markers that is more than no_map_dist centimorgans away from the rest of the markers. A reasonable choice for no_map_size is 1 or 2. To disable this feature, simply set no_map_size to 0.
For example, if no_map_dist=15 and no_map_size=2, then any group of at most 2 markers that is more than 15 centimorgans away from the rest of the markers will be placed in a linkage group of its own.
Occasionally there are markers with an excessive number of missing observations. Those markers can be eliminated by setting missing_threshold to a proper value. For example, if missing_threshold=0.25, then any marker with more than 25% missing observations will be removed completely without being mapped.
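The filtering rule can be illustrated with a short Python sketch. It is not MSTMap's own code: the function name and the dictionary layout are assumptions made for the example, and '-' and 'U' are treated as missing calls, as in the input format.

```python
from typing import Dict, List

# Genotype codes that mark a missing call in the input format.
MISSING_CODES = {"-", "U"}


def drop_sparse_markers(
    genotypes: Dict[str, List[str]], missing_threshold: float
) -> Dict[str, List[str]]:
    """Keep markers whose fraction of missing calls is at most the threshold.

    genotypes maps a marker id to its list of calls, one per mapping line.
    """
    kept = {}
    for marker, calls in genotypes.items():
        if not calls:
            continue  # a marker without any calls carries no information
        missing = sum(1 for call in calls if call in MISSING_CODES)
        # "More than 25% missing" is removed, so exactly 25% is kept.
        if missing / len(calls) <= missing_threshold:
            kept[marker] = calls
    return kept
```

With a threshold of 0.25, a marker with one missing call out of four is kept, while a marker with three missing calls out of four is dropped.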
estimation_before_clustering is a binary flag which can be set to yes or no. If estimation_before_clustering is set to yes, then MSTMap will try to estimate missing data before clustering the markers into linkage groups.
detect_bad_data is a binary flag which can be set to yes or no. If detect_bad_data is set to yes, then MSTMap will try to detect bad data during the map construction process. The suspicious genotype data will be printed to the console for user inspection. The error detection feature can be turned off by setting detect_bad_data to no.
objective_function specifies the objective function to be used. Possible choices are COUNT and ML. COUNT refers to the commonly used sum of recombination events objective function, and ML refers to the commonly used maximum likelihood objective function.
number_of_loci specifies the total number of markers in the data set.
number_of_individual specifies the total number of mapping lines in the data set.
- The body of the input file contains a table of dimension (m+1)*(n+1), where m is the total number of markers (equal to the number_of_loci value) and n is the total number of mapping lines (equal to the number_of_individual value). The first row gives the ids of the mapping lines, while the first column gives the ids of the genetic markers. Each id is a string of letters (a-z, A-Z) or digits (0-9). No space is allowed within an id. Each cell in the table gives the genotype state of a particular mapping line at a particular marker locus. The genotype states can be specified with the letters 'A', 'a', 'B', 'b', '-', 'U' or 'X'. 'A' and 'a' are equivalent, 'B' and 'b' are equivalent, and so are '-' and 'U'. 'U' and '-' indicate a missing genotype call. If the data set is from a RIL population, you can use 'X' to indicate that the corresponding genotype is a heterozygous 'AB'.
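The format described above can be produced with a few lines of Python. This is a minimal sketch under stated assumptions, not a reference writer: the function name is hypothetical, the single space after each header key, the blank line between header and body, the tab separator between cells, and the top-left cell label locus_name are all assumptions, so compare the result with example.txt before relying on it.

```python
import io
import re

# Header keys in the order listed above; the two counts are derived from the data.
HEADER_KEYS = [
    "population_type", "population_name", "distance_function",
    "cut_off_p_value", "no_map_dist", "no_map_size", "missing_threshold",
    "estimation_before_clustering", "detect_bad_data", "objective_function",
]
VALID_CALLS = set("AaBb-UX")
ID_PATTERN = re.compile(r"^[A-Za-z0-9]+$")  # letters and digits only, no spaces


def format_mstmap_input(params, line_ids, genotypes):
    """Return input-file text.

    params: dict with a (string) value for every key in HEADER_KEYS
    line_ids: ids of the mapping lines, in column order
    genotypes: dict mapping marker id -> list of calls, one per mapping line
    """
    for ident in list(line_ids) + list(genotypes):
        if not ID_PATTERN.match(ident):
            raise ValueError(f"invalid id: {ident!r}")
    out = io.StringIO()
    for key in HEADER_KEYS:
        out.write(f"{key} {params[key]}\n")
    out.write(f"number_of_loci {len(genotypes)}\n")
    out.write(f"number_of_individual {len(line_ids)}\n\n")
    # Assumption: the page does not name the top-left cell of the table.
    out.write("\t".join(["locus_name"] + list(line_ids)) + "\n")
    for marker, calls in genotypes.items():
        if len(calls) != len(line_ids):
            raise ValueError(f"{marker}: expected {len(line_ids)} calls")
        bad = set(calls) - VALID_CALLS
        if bad:
            raise ValueError(f"{marker}: invalid genotype codes {sorted(bad)}")
        out.write("\t".join([marker] + list(calls)) + "\n")
    return out.getvalue()


if __name__ == "__main__":
    params = {
        "population_type": "DH", "population_name": "demo",
        "distance_function": "kosambi", "cut_off_p_value": "0.000001",
        "no_map_dist": "15", "no_map_size": "2", "missing_threshold": "0.25",
        "estimation_before_clustering": "no", "detect_bad_data": "yes",
        "objective_function": "COUNT",
    }
    print(format_mstmap_input(params, ["i0", "i1"], {"m0": ["A", "B"], "m1": ["-", "a"]}))
```

Header values are passed as strings so that a small number such as 0.000001 is written exactly as typed instead of in scientific notation.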
How to interpret the output file
The output file is self-explanatory and easy to understand. It simply lists the markers in each linkage group. The genetic distances between markers are also available from the output file. Please refer to example_map.txt for an example.
Downloads
- Source
- Sample Input
- Sample Output
Sample Input is a synthetic data set with 100 mapping lines and 100 markers. The markers are spaced at an average distance of 2 cM. Missing genotype calls (1%) and erroneous genotype calls (1%) are introduced into the data set on purpose to mimic real data. The true order of the markers is m0,m1,m2,...,m99 (or its reverse).
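Because the true order of the sample markers is known, a recovered order can be checked against it. The Python sketch below assumes you have already extracted the marker names of one linkage group, in map order, into a list. Parsing the output file is left out because its exact layout is not described here, and both function names are hypothetical.

```python
def order_matches_truth(recovered, n_markers=100):
    """True if recovered equals m0..m(n-1) or its reverse."""
    truth = [f"m{i}" for i in range(n_markers)]
    recovered = list(recovered)
    return recovered == truth or recovered == truth[::-1]


def adjacent_mismatches(recovered):
    """Count neighbouring pairs that are not neighbours in the true order.

    Marker names are expected to look like 'm17'. The count is 0 for a
    perfectly ordered (or perfectly reversed) contiguous run of markers.
    """
    indices = [int(name[1:]) for name in recovered]
    return sum(1 for a, b in zip(indices, indices[1:]) if abs(a - b) != 1)
```

The second function gives a rough error count when the recovered order is close to, but not exactly, the true one.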
Copyright
MSTMap is free for academic use only. For questions about the tool, please contact yonghui@cs.ucr.edu.