GitHub - jstjohn/SimSeq: An illumina paired-end and mate-pair short read simulator. This project attempts to model as many of the quirks that exist in Illumina data as possible. Some of these quirks include the potential for chimeric reads, and non-biotinylated fragment pull down in mate-pair libraries . Additionally the program provides the ability to model both site and base specific error, and scripts are provided to train this error model on real datasets. My hope in creating this program is to generate as realistic data as possible to assist in assessing the accuracy of genome assembly tools. (original) (raw)

SimSeqNBProject currently contains the net beans project with the initial source

getErrorProfile contains the c source for the program that generates the error model for the sequence simulator. This program utilizes a subset of the kent source library (http://hgwdev.cse.ucsc.edu/~kent/src/) which is freely available for both commercial and non-commercial use (see README in "ThirdParty/kent" for more information)

examples contains some things to run this on. Right now there is an example mitochondrial sequence, and an example error profile from a 100 bp illumina GAIIx dataset. You can simulate any read less than or equal to 100bp in length using that error profile.

usage: java -jar -Xmx2048m SimSeq.jar [required options] [options] Last Updated: 4.12.2011 -1,--read1_length Integer length of first read. Default: 100 -2,--read2_length Integer length of second read. Default: 100 --adapter1 The first read illumina adapter sequence to use when the insert size is less than the read length. Currently only works on paired end simulation. Defalt:AGATCGGAAGAGCGGTTCAGCAGGAATGCCG AGACCG --adapter2 The second read illumina adapter sequence to use when the insert size is less than the read length. Currently only works on paired end simulation. Defalt:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGT GT -d,--dip If diploid data desired, path to diploid file. (format: chrom pos(0 based) altChar --debug Write debug info to stderr. -e,--error If simulated read error desired, path to read error file. --error2 If you desire a seperate error distribution to be applied to the second read, then provide a path to that error profile with this option -h,--help Print this usage message. --inf_id Output location information in the read ID. Currently this is only informative for paired-end data and note mate paired. -l,--insert_size mean library insert size for either mate-paired or paired-end. Default: 200 -m,--mate_pair Perform mate-pair rather than paired end run. --mate_frag If using a mate-pair library, what is your desired loop fragmentation size? Default: 500 --mate_frag_stdev If using a mate-pair library, what is your desired loop fragmentation size standard deviation? Default: 50 --mate_pulldown_error_p If using a mate-pair library, what is the probability that a read does not include the biotin marker? Default: 0.3 -n,--read_number Integer number of reads you would like to sample. Default: 1000000 -o,--out Filename for output sam file (REQUIRED) -p,--read_prefix Prefix for simulated reads. Default: SimSeq_ --phred64 Output phred+64 quality string rather than phred+33 -r,--reference Reference genome sequence in uncompressed fasta format (REQUIRED) -s,--insert_stdev mean library insert stdev for either mate-paired or paired-end. Default: 20 -u,--duplicate_probability probability of generating a duplicate. Default: 0.0

===Quick Start Example:=== ==Prerequisits== 0.0. Download and install samtools: http://samtools.sourceforge.net/ and add executable to your $PATH 0.1. Download latest Picard jar file: http://sourceforge.net/projects/picard/files/, and extract .jar files somewhere that makes sense like HOME/jars/,fortherestofthisIwillassumethatalljarfilesarelocatedinHOME/jars/, for the rest of this I will assume that all jar files are located in HOME/jars/,fortherestofthisIwillassumethatalljarfilesarelocatedinHOME/jars/ ==Generate and check simulated reads==

  1. Download SimSeq.jar, error_profile.txt, and AlligatorMito.fa
  2. put SImSeq.jar in your jar directory ($HOME/jars/ or something) and put error_profile.txt and AlligatorMito.fa in a folder where you want to work
  3. Run java -jar -Xmx2048m $HOME/jars/SimSeq.jar -1 100 -2 100 \ --error hiseq_mito_default_bwa_mapping_mq10_1.txt \ --error2 hiseq_mito_default_bwa_mapping_mq10_2.txt \ --insert_size 3000 --insert_stdev 300 \ --mate_pair --mate_frag 500 --mate_frag_stdev 50 \ --mate_pulldown_error_p 0.3 --read_number 10000 \ --read_prefix AMito_ --reference AlligatorMito.fa \ --duplicate_probability 0.01 --out out.sam
  4. Convert sam output to bam: samtools view -bS -T AlligatorMito.fa -t AlligatorMito.size -o out.bam out.sam
  5. Sort the bam file: samtools sort out.bam out.sorted
  6. Index the sorted bam file: samtools index out.sorted.bam
  7. Check the alignment in tview: samtools tview out.sorted.bam AlligatorMito.fa ==Convert Sam output to Fastq== run java -jar -Xmx2048 $HOME/jars/SamToFastq.jar INPUT=out.sorted.bam FASTQ=library.1.fastq SECOND_END_FASTQ=library.2.fastq INCLUDE_NON_PF_READS=true VALIDATION_STRINGENCY=SILENT

Note: setting the VALIDATION_STRINGENCY=SILENT is probably not ideal, but for now it is necessary because samtools isn't putting the length of the reference sequence into the header, which is expected by Picard. You get an error about the first read having a coordinate outside of the range of the reference sequence which has a length of 0. If you use Picard to somehow fix the header given the reference sequence this should work without telling it not to validate your sequence.

Also Note: The current sam alignment for a chimeric read shows the fragment starting at the left-most coordinate. Since the rest of the sequence occurs before the left most coordinate I use a soft clipping for the remainder of the sequence. This does not effect the output fastq file generated by the above method, but when you are looking at the alignment in samtools tview you will see some sequences as short as 1 nucleotide. This is normal, and doesn't mean that the actual sam records are that short.

When producing chimeric reads, samtools doesn't currently allow you to view the backwards jumps that occur. Doing so would involve a '-xN' operation on a cigar string. I added a custom tag ('YC:Z:[CIGAR]') at the end of each sam string that shows either a normal cigar string if the read can be represented with one, or a custom cigar string that allows for negative jumps needed to represent chimeric reads if the read is chimeric. Here is an example of what a chimeric read's tag would look like YC:Z:97M-4996N3M while a non-chimeric read would have a normal cigar string in its tag like: YC:Z:100M. To reconstruct a chimeric read, you start at the start coordinate for the read which corresponds to the left most coordinate, then you proceed for the number of matches 'M', then you jump the specified number of positions from your current location, in this case -4996, and finally proceed with the remaining CIGAR string. Right now since I don't add indels to the data, this string will either be a match or one of these chimeric jumps.