2011-Fairclothetal-SysBiol-Notes (original) (raw)
Star () You must be signed in to star a gist
Fork () You must be signed in to fork a gist
Clone this repository at <script src="https://gist.github.com/brantfaircloth/47e03463db0573c4252f.js"></script>
Save brantfaircloth/47e03463db0573c4252f to your computer and use it in GitHub Desktop.
Clone this repository at <script src="https://gist.github.com/brantfaircloth/47e03463db0573c4252f.js"></script>
Save brantfaircloth/47e03463db0573c4252f to your computer and use it in GitHub Desktop.
2011-Fairclothetal-SysBiol-Notes
Identifying UCEs in other organisms
- get count of all UCEs identified that were non-dupes:
sqlite> select count() from cons, blast where blast.id = cons.id and blast.matches = 1 and duplicate = 0;
count()
5599
* get distance btw. UCE loci in set above w/:
python other-organisms/get_distances_in_chicken.py
* create a new table in the probe dbase for the ids and count of all "conservation" sureselect probes:
sqlite> create table sureselect_all_probe_counts as select seq, id, count(seq) as cnt,
sqlite> data_source from sureselect where data_source = 'conservation' group by seq order by id;
sqlite> select count() from sureselect_all_probe_source;
count()
5138
- get the probe sequences from the dbase for all the conserved probes:
python assembly/get_sureselect_probes_or_cons.py --output all-probes.fasta --all --probes probe.sqlite - move that into MS folder in Dropbox:
mv all-probes.fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers - [lastz] this file against several genomes (used --huge option on anoCar2 because of chromo/scaffold number). The output files are spread btw. snp/lastz and matches/lastz directories:
- mm9
- hg19
- anoCar2
- taeGut1
- galGal3
- melGal2
- croPor1 (really 0.1)
- monDom5
- ornAna1
- xenTro2
python share/run_lastz.py \
--target /Volumes/Data/Genomes/taeGut1/taeGut1.2bit \
--query all-probes.fasta --nprocs 8 \
--output lastz/all-probes-to-taeGut1.lastz - check probes for dupes:
python share/easy_lastz.py \
--target all-probes.fasta --query all-probes.fasta \
--identity 85 --output all-to-all.lastz - generate probe counts for critters in matches/lastz:
python other-organisms/get_probes_matches_from_lastz.py ./ \
all-probes.fasta test.sqlite \
--dupefile all-to-all.lastz - data will reside in data/. split files into appropriate directories - some into snp/ and some into matches/
Primate tree test
- update genomes by dloading new build, if avaialble
- run lastz of probes against genomes:
python align/un_multiple_lastzs.py \
--output=/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates \
--probefile=all-probes.fasta \
--chromolist="hg19","gorGor3","ponAbe2","rheMac2","mm9" \
--readlist="nomLeu1","panTro3","calJac3","papAnu1","otoGar3" - rename all the files to similar names used previously. In ipython:
[shutil.copy(i,"all-probes-to-{0}".format(os.path.basename(i).split('_')[-1])) for i in glob.glob('*.lastz')] - first move SNP-related and initial db-related stuff to its own SNP folder. Put these data in "primates" folder.
- Then (ran w/ 200 AND 150), build "fake" data sets from the extant genomes that are just like the data we collect from UCE loci:
for i in {"hg19","gorGor3","ponAbe2","rheMac2","mm9","nomLeu1","calJac3","otoGar3"};
do \
python share/get_fake_velvet_contigs_from_genomes.py \
/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-$i.lastz \
/Volumes/Data/Genomes/$i/$i.2bit \
--dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
--fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/$i-150.fasta --flank 150;
done; - I originally ran the above w/ panTro3 and papAnu1, but it failed for panTro3 (some chromo names have 3 parts) and papAnu1 (chromo names have many parts) because they have crazy-ass name structures. Updated code and ran, instead, w/:
python share/get_fake_velvet_contigs_from_genomes.py \
/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-panTro3.lastz \
/Volumes/Data/Genomes/panTro3/panTro3.2bit \
--dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
--fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/panTro3-150.fasta \
--flank 150 --name-components 3
python share/get_fake_velvet_contigs_from_genomes.py \
/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-papHam1.lastz \
/Volumes/Data/Genomes/papHam1/papHam1.2bit \
--dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
--fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/papHam1-150.fasta \
--flank 150 --splitchar None
- make symlinks to fastas in contigs for each of {"hg19","gorGor3","ponAbe2","rheMac2","mm9","nomLeu1","panTro3","calJac3","papAnu1","otoGar3"}:
ln -s ../fake-velvet-fasta/mm9-200.fasta mus-musculus.fasta - re-lastz probes against "contigs" and store group matches in db:
python assembly/match_contigs_to_probes.py \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/contigs \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-probes.fasta \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db \
--dupefile ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz - run get_match_counts on reptile data to build format file:
python assembly/get_match_counts.py \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db/primate-probes-matches.sqlite \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-match-count.config \
'Primates' \
--output ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-loci-for-fasta.out - add asterisks to primate-loci-for-fasta.out for each organism
- turn those data into a giant fasta for alignment:
python assembly/get_fastas_from_match_counts.py \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/contigs \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db/primate-probes-matches.sqlite \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-loci-for-fasta.out \
--output ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/alignment/primates.fasta - align those puppies:
python align/eqcap_align.py \
--input primates.fasta --output nexus \
--species 10 --trim-running --multiprocessing --processors 7 - within ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/alignment/ convert these to phylip so we can run mrAIC on them:
python align/convert_nexus_to_phylip.py \
--input nexus/ --output phylip/ --shorten-name --positions '-2,-1' - estimate subs models:
python align/run_mraic.py --input phylip --output aicc - remove the locus name from each nexus file:
python align/remove_locus_name_from_nexus_lines.py \
--input nexus --output nexus-rename - prep the mrbayes file:
python align/nexus_to_concat.py \
--models ../output_models.txt \
--aligns ../nexus-rename/ \
--concat primate-each-part.nex \
--mr-bayes --interleave --unlink - crank up some ec2. setup mrbayes per: https://gist.github.com/950010, then:
ec2-run-instances --key ec2-keypair ami-349d645d \
--instance-type=c1.xlarge --availability-zone=us-east-1b \
--block-device-mapping '/dev/sda2=ephemeral0' \
--block-device-mapping '/dev/sda3=ephemeral1' - sync up nexus file:
rsync -avz -e "ssh -i /Users/bcf/.ec2/ec2-keypair" \
./ ec2-user@ec2-67-202-26-236.compute-1.amazonaws.com:/mnt/data/primate-rate-variation/ - after starting in mpi-mode, set ngen = 5,000,000 and run w/ mcmc. run. wait.
- connect to instance:
ssh -i ec2-keypair ec2-user@xx-yy-zz.compute-1.amazonaws.com - get summary data across nexus (trimmed alignments):
python align/get_cons_region_align_stats.py \
--input nexus-rename --output primate-align-stats.csv - open up R, then:
this gives U-shaped figure
primates = read.table('/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/stats/primate-trimmed-align-stats.csv',header=TRUE, sep = ',')
nobase = primates[primates$greaterthanonediff >0.0,]
nooutlier = nobase[nobase$greaterthanonediff <= 0.10,]
ggplot(nooutlier, aes(x=bp,y=greaterthanonediff,color=greaterthanonediff)) geom_point(size=3) scale_colour_gradientn(colour = rainbow(5)) scale_y_continuous('Frequency of variant bases\n') scale_x_continuous('\nDistance from center of alignment') theme_bw() opts(legend.position = "none")
pdf('name.pdf')
remove some gridlines
p
g = ggplotGrob(p)
grid.ls(g)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
dev.off()
Bird deep tree from in vitro data
- unzip data from sequencer
- clean adapter contamination and trim using sliding window:
python assembly/process_reads.py \
--sample-map Illumina_Data/SamplesDirectories.csv Bin_012 - several tags were parsed incorrectly (not at all). go ahead and re-parse the unknown file to see if we can pull out necessary reads:
python assembly/reparse_tags.py --input s_5_sequence.txt --tag 'TCTGCC' --output anser-erythropus.fastq
python assembly/reparse_tags.py --input s_6_sequence.txt --tag 'GCATCC' --output zanclostomus-javanicus.fastq
python assembly/reparse_tags.py --input s_6_sequence.txt --tag 'GTAATC' --output nyctibius-grandis.fastq - put these in Bin_unk and run through trimming/assembly as above. Then get coverage and read lengths, etc. as above.
- upload data on individual basis to ec2 (need a machine w/ shitloads of RAM) - ZONE us-east-1:
ec2-run-instances --key ec2-keypair ami-349d645d --region us-east-1\
--instance-type=m2.2xlarge \
--block-device-mapping '/dev/sda2=ephemeral0' \
--block-device-mapping '/dev/sda3=ephemeral1' - built velvet (max_kmer = 96) and assoc. code (bioperl, etc.), then assemble using VelvetOptimiser.pl (see stats folder for each Bin for actual run code):
~/src/velvet_1.1.04/contrib/VelvetOptimiser-2.1.7/VelvetOptimiser.pl \
-s 59 -e 79 -f '-short -fastq genus-species.fastq' \
-o '-amos_file yes' && tar cvjf ../done/genus-species.tar.bz2 auto_*/ - assembled reads and zipped up assemblies and contigs as above, downloaded those. generated checksums and checked those and uploaded to webserver
- for several species, had to tweak contig generation. See readme.md in retry folder
- determine size of contigs (unzipped):
for i in contigs/*; do echo i;faSizei; faSize i;faSizei; done - determine kmer coverage of contig, given knowledge of kmer and contig output:
python assembly/coverage_from_kmers.py --csv True \
--input Bin_005/contigs/mus-musculus.assembly-retry.contigs.fasta - align probes that we used to themselves to make a dupefile:
python ~alignment/easy_lastz.py
--target=probe-subset-2560-synthesized.fasta --query=probe-subset-2560-synthesized.fasta --coverage=80 --identity=80 --output=probe-subset-2560-to-self.fasta - match the contigs that we have to the probes that we used using lastz and store the matching results in a database. also write output (lastz files) to a folder:
python assembly/match_contigs_to_probes.py \
/Volumes/Data3/lsu/birds/contigs \
probe-subset-2560-synthesized.fasta \
/Volumes/Data3/lsu/birds/lastz \
--dupefile probe-subset-2560-to-self.fasta
anser_erythropus: 1596 (69.63%) uniques of 2292 contigs, 0 dupe probes, 0 dupe probe matches, 45 dupe node matches
dromaius_novaehollandiae: 1518 (76.40%) uniques of 1987 contigs, 0 dupe probes, 2 dupe probe matches, 24 dupe node matches
eudromia_elegans: 1484 (77.29%) uniques of 1920 contigs, 0 dupe probes, 1 dupe probe matches, 52 dupe node matches
gallus_gallus: 1438 (77.65%) uniques of 1852 contigs, 0 dupe probes, 1 dupe probe matches, 30 dupe node matches
megalaima_virens: 1174 (85.69%) uniques of 1370 contigs, 0 dupe probes, 0 dupe probe matches, 10 dupe node matches
phalacrocorax_carbo: 1384 (86.45%) uniques of 1601 contigs, 0 dupe probes, 3 dupe probe matches, 10 dupe node matches
pitta_guajana: 1572 (66.36%) uniques of 2369 contigs, 0 dupe probes, 2 dupe probe matches, 32 dupe node matches
struthio_camelus: 1591 (75.19%) uniques of 2116 contigs, 0 dupe probes, 2 dupe probe matches, 37 dupe node matches
urocolius_indicus: 1495 (67.71%) uniques of 2208 contigs, 0 dupe probes, 2 dupe probe matches, 43 dupe node matches - screen various groups and/or optimize individuals in groups (in terms of UCEs located) to max the number of UCEs shared across members of the group:
python assembly/get_match_counts.py \
/Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
'Birds - UCE paper' - to extend capture data with genomic/outgroup data and build probe data set (genomic outgroup data are indicated in the config file using asterisks):
[Birds - UCE paper]
dromaius_novaehollandiae
struthio_camelus
eudromia_elegans
anser_erythropus
gallus_gallus
phalacrocorax_carbo
megalaima_virens
urocolius_indicus
pitta_guajana
anolis_carolinensis*
python assembly/get_match_counts.py \
/Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
--extend /Users/bcf/Dropbox/research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
'Birds - UCE paper' \
--output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.out - turn those data into a giant fasta for alignment:
python assembly/get_fastas_from_match_counts.py \
/Volumes/Data3/lsu-seqcap/contig/ \
/Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.out \
--extend-db /Users/bcf/Dropbox/research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
--extend-dir /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/snp/contig/ \
--output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.fasta - align:
python align/seqcap_align.py \
--input /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.fasta \
--output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/nexus/ \
--species 10 --trim-running --multiprocessing --processors=7 - convert these to phylip so we can run mrAIC on them:
python align/convert_nexus_to_phylip.py \
--input nexus/ --output phylip/ --shorten-name --positions '-2,-1' - get subs models:
python align/run_mraic.py --input phylip --output aicc - remove the locus name from each nexus file:
python align/remove_locus_name_from_nexus_lines.py \
--input nexus --output nexus-rename - prep the mrbayes file:
python align/nexus_to_concat.py \
--models output_models.txt --aligns nexus-rename/ \
--concat mrbayes/deep-bird-each-part.nex --mr-bayes \
--interleave --unlink - crank up some AWS:
ec2-run-instances --key keypair ami-349d645d \
--instance-type=c1.xlarge --availability-zone=us-east-1b \
--block-device-mapping '/dev/sda2=ephemeral0' \
--block-device-mapping '/dev/sda3=ephemeral1' - login and setup mrbayes:
ssh -i ec2-keypair ec2-user@ec2-174-129-57-11.compute-1.amazonaws.com - use rsync to up/dload data:
rsync -avz -e "ssh -i /Users/bcf/.ec2/ec2-keypair" \
ec2-user@ec2-174-129-57-11.compute-1.amazonaws.com:/mnt/data/birds/ ./ - get variant stats for birds:
python align/get_cons_region_align_stats.py --input nexus-rename --output bird-align-stats.csv - make figures:
this gives U-shaped figure
birds = read.table('/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/stats/bird-trimmed-align-stats.csv', header = TRUE, sep = ',')
birdnobase = birds[birds$greaterthanonediff >0.0,]
birdnooutlier = birdnobase[birdnobase$greaterthanonediff <= 0.10,]
ggplot(birdnooutlier, aes(x=bp,y=greaterthanonediff,color=greaterthanonediff)) geom_point(size=3, alpha = 5/10) scale_colour_gradient(low = "red", high = "blue") scale_y_continuous('Frequency of variable positions\n') scale_x_continuous('\nDistance from center of alignment') theme_bw() opts(legend.position = "none")
pdf('bird-variable-positions.pdf')
remove some gridlines
p
g = ggplotGrob(p)
grid.ls(g)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
dev.off()
Enrichment stats (for birds)
- extract AMOS file from velvet tarball and store in tmp:
tar -jtvf /Volumes/Data3/lsu-seqcap/Bin_008/assembled/urocolius-indicus.assembly.tar.bz2
tar -xjvf /Volumes/Data3/lsu-seqcap/Bin_008/assembled/urocolius-indicus.assembly.tar.bz2 auto_data_73/velvet_asm.afg - convert AMOS afg file to AMOS bnk file:
~/Source/amos-3.0.0/src/Bank/bank-transact -m megalaima-virens.afg -b megalaima-virens.bnk -c - output coverage file, so we can get iids:
~/Source/amos-3.0.0/src/Utils/cvgStat -b megalaima-virens.bnk > megalaima-virens.cvg - get loci from query of sqlite data:
sqlite> .output megalaima-virens.loci
sqlite> select megalaima_virens from match_map where megalaima_virens IS NOT NULL;
sqlite> .output stdout - parse that bad-boy and the cvg file to get iids:
mv /Volumes/Data3/lsu-seqcap/lastz/megalaima-virens.loci ~/Tmp/bird-coverage/megalaima-virens
python assembly/get_iids_from_cvg_stat.py
\ megalaima-virens.cvg megalaima-virens.loci --output megalaima-virens.iid - now get depth of coverage of locus-specific reads:
~/Source/amos-3.0.0/src/Validation/analyze-read-depth megalaima-virens.bnk -d -I megalaima-virens.iid - get depth of coverage of all reads in contigs:
~/Source/amos-3.0.0/src/Validation/analyze-read-depth megalaima-virens.bnk -d - in order to derive the average contig length from loci matching UCEs (there is another script for only UCE loci present in all orgs - it parses the fasta we input to alignment rather than the assembled contigs from velvet):
python assembly/get_contig_lengths_for_all_uce_loci.py \
/Volumes/Data3/lsu-seqcap/contig \
/Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
"Birds - UCE paper"
Additional SNP Analyses
- screen lastz files for dupes, starts and ends, and generate velvet-like contigs, so we can insert these data into the same pipeline we have for the seqcap data:
python share/get_fake_velvet_contigs_from_genomes.py \
all-probes-to-hg19.lastz \
/Volumes/Data/Genomes/hg19/hg19.2bit \
fake-velvet-fasta/taeGut1-all-probe-fake-velvet.fasta
--flank 50 - Also get BED files for flanks w/ flank dist @ 50, 100, 250:
python share/get_fake_velvet_contigs_from_genomes.py \
all-probes-to-hg19.lastz /Volumes/Data/Genomes/hg19/hg19.2bit \
--bed fake-velvet-bed/fake-velvet-all-probe-hg19-50.bed \
--flank 50 - Or, do both at once:
python share/get_fake_velvet_contigs_from_genomes.py \
data/lastz/all-probes-to-hg19.lastz \
/Volumes/Data/Genomes/hg19/hg19.2bit \
--fasta data/fake-velvet-fasta/fake-velvet-all-probe-hg19-500.fasta \
--bed data/fake-velvet-bed/fake-velvet-all-probe-hg19-500.bed \
--flank 500 - using UCSC browser, intersect those with dbSNP128 results (for mm9 and hg19) and bgiSNP results (for galGal3):
results => snp/ - associate snps w/ the correct UCEs (using coordinate overlaps, since UCSC doesn't return this):
python snps/find_snps_in_bed_interval.py \
data/fake-velvet-bed/fake-velvet-all-probe-hg19-500.bed \
data/snp/dbSNP132-intersect-all-probe-hg19-500.bed \
--output data/snp/dbSNP132-to-hg19-uce-500.csv - upload SNP RSs to dbSNP to pull down heterozygosity data
- from hg19 snps, get heterozygosity data:
python snps/parse_dbsnp_xml.py \
hg19-dbSNP-data > variation-data-hg19-dbSNP.csv - do same in mouse:
python snps/parse_dbsnp_xml.py \
mm9-dbSNP-data > variation-data-mm9-dbSNP.csv
Plotting additional figures
- R code (didn't use these boxplots):
data = read.table('contig-lengths-by-probe.csv', header = TRUE, sep = ',')
ggplot(data, aes(factor(probes), contiglength)) geom_boxplot()
ggplot(data, aes(factor(probes), coverage)) geom_boxplot()
p1 = ggplot(data, aes(factor(probes), coverage)) geom_boxplot()
ggsave('coverage-by-probes.pdf')
p2 = ggplot(data, aes(factor(probes), contiglength)) geom_boxplot()
ggsave('contiglength-by-probes.pdf')
f1 = ggplot(data, aes(factor(probes), contiglength)) geom_boxplot() facet_wrap(~ organism, ncol = 6)
ggsave('contiglength-by-probes-facet.pdf')
f2 = ggplot(data, aes(factor(probes), coverage)) geom_boxplot() facet_wrap(~ organism, ncol = 6)
ggsave('coverage-by-probes-facet.pdf') - using intersection BED data, prep CSV file associating SNPs w/ UCE names and coords associate variation data with figure with:
python snps/get_dbsnp_variability_for_all_uces.py \
dbSNP132-to-hg19-uce-200.csv \
~/Tmp/dbsnp/hg19/hg19-dbSNP-data > dbSNP132-to-hg19-uce-200-ready-to-plot.csv - get the SNP variability data for SNPS within 200 bp of UCEs:
python snps/get_dbsnp_freq_stats.py dbSNP132-to-hg19-uce-200.csv \
~/Tmp/dbsnp/hg19/hg19-dbSNP-data \
--dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
--output test1.csv --output2 test2.csv - get that into R and:
raw = read.table('test2.csv', sep = ',', header = TRUE) - convert freqs to floats:
raw$freq = as.numeric(as.character(raw$freq))
add pos column
raw$pos = raw$snp.start - (raw$uce.start raw$uce.end)/2
- plot out just the allele freqs for the loci:
raw1 = subset(raw,freq > 0.0)
raw1 = subset(raw1, X1000gvalidated == 'true') #4653 loci
p = ggplot(raw1, aes(x=pos,y=freq,color=freq)) + geom_point(size=3, alpha = 5/10)- scale_colour_gradient(low = "red", high = "blue") \
- scale_y_continuous('Minor allele frequency (MAF)\n') \
- scale_x_continuous('\nDistance from center of UCE locus') \
- theme_bw() + opts(legend.position = "none")
np = read.table('test1.csv', sep = ',', header = TRUE)
- running average (window 25) of the number of SNPs per base across all UCEs:
p = ggplot(subset(np, datatype == 'running'), aes(x=pos,y=avg,color=avg)) \- geom_point(size=3, alpha = 5/10) \
- scale_colour_gradient(low = "red", high = "blue") \
- scale_y_continuous('Average number (25 bp window) of 1000 Genomes Project validated SNPs\n') \
- scale_x_continuous('\nDistance from center of UCE locus') \
- theme_bw() \
- opts(legend.position = "none")
- running average (window 25) of SNP heterozygosity across all UCEs:
p = ggplot(subset(np, datatype == 'running_hetero'), aes(x=pos,y=avg,color=avg)) \- geom_point(size=3, alpha = 5/10)
- scale_colour_gradient(low = "red", high = "blue") \
- scale_y_continuous('Mean number of 1000 Genomes Project validated SNPs\n') \
- scale_x_continuous('\nDistance from center of UCE locus') \
- theme_bw() \
- opts(legend.position = "none")
- running average (window 25) of SNP heterozygosity across all UCEs:
p = ggplot(subset(np, datatype == 'mean_hetero'), aes(x=pos,y=avg,color=avg)) \- geom_point(size=3, alpha = 5/10) \
- scale_colour_gradient(low = "red", high = "blue") \
- scale_y_continuous('Mean Minor Allele Frequency (MAF) of 1000 Genomes Project validated SNPs\n') \
- scale_x_continuous('\nDistance from center of UCE locus') theme_bw() opts(legend.position = "none")
pdf('uce-running-average-1000G-snps-trunc-at-225.pdf')
- remove some gridlines:
p
g = ggplotGrob(p)
grid.ls(g)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
dev.off()
UCE Capture Data Figures
- summarize data:
python get_hits_misses_length_and_gc.py - import data:
d = read.table('gc-length-species-matches.csv', sep = ',', header = TRUE) - order True/False correctly:
d$detected = factor(d$detected, c("TRUE", "FALSE")) - uce_gc_content_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = gc)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') + scale_y_continuous('UCE Locus GC Content') +
opts(legend.position = "none") - uce_tm_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = tm)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') + scale_y_continuous('Average (by Locus) Probe Tm') +
opts(legend.position = "none") - uce_length_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = length)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') + scale_y_continuous('UCE Locus Length') +
opts(legend.position = "none") - uce_bases_added_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = added)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') +
scale_y_continuous('Bases Added to Probes Targeting UCE') +
opts(legend.position = "none") - uce_gc_v_target_uce_length.pdf:
ggplot(d, aes(y = gc, x = length)) +
geom_point(aes(colour = detected), alpha=0.5, size = 2) +
facet_wrap(~taxon) + scale_x_continuous('Target UCE Length') +
scale_y_continuous('Target UCE GC Content') +
scale_colour_manual(values = c("#377EB8","#E41A1C")) - uce_average_probe_tm_v_target_length.pdf:
ggplot(d, aes(y = tm, x = length)) +
geom_point(aes(colour = detected), alpha=0.5, size = 2) +
facet_wrap(~taxon) + scale_x_continuous('Target UCE Length') +
scale_y_continuous('Average (by Locus) Probe Tm') +
scale_colour_manual(values = c("#377EB8","#E41A1C")) - uce_target_length_v_probes_targeting.pdf:
ggplot(d, aes(x = count, y = length)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_continuous('Number of Probes Targeting UCE') +
scale_y_continuous('Target UCE Length') - import contig data:
d2 = read.table('../archive/birds-contig-lengths-by-probe-filtered-birds.csv', header = TRUE, sep = ',') - uce_contig_length_by_probe_count.pdf:
ggplot(d2, aes(x = factor(probes), y = contiglength)) +
geom_jitter(aes(colour = factor(probes))) +
geom_boxplot(alpha = 0.3) +
scale_colour_brewer(palette="Set1") +
opts(legend.position = "none") +
scale_x_discrete('Number of Probes Targeting UCE') +
scale_y_continuous('Contig Length of UCE-associated Locus)') - uce_sequencing_depth_by_probe_count.pdf:
ggplot(d2, aes(x = factor(probes), y = coverage)) +
geom_jitter(aes(colour = factor(probes))) +
geom_boxplot(alpha = 0.3) +
scale_colour_brewer(palette="Set1") +
opts(legend.position = "none") +
scale_x_discrete('Number of Probes Targeting UCE') +
scale_y_continuous('Sequencing Depth of UCE-associated Locus') - uce_contig_length_by_probe_count_by_taxon.pdf:
ggplot(d2, aes(x = factor(probes), y = contiglength)) +
geom_jitter(aes(colour = factor(probes))) +
geom_boxplot(alpha = 0.3) +
facet_wrap(~organism) +
scale_colour_brewer(palette="Set1") +
opts(legend.position = "none") +
scale_x_discrete('Number of Probes Targeting UCE') +
scale_y_continuous('Contig Length of UCE-associated Locus') - uce_loss_of_loci_with_randomly_selected_groups.pdf:
for i in {1..7}; do python assembly/get_match_counts.py \
../archive/birds-probe-matches.sqlite \
~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run1/faircloth/match-count.config \
--extend /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
'Birds - UCE paper' --optimize --random --sample-size $i \
--samples 1001 --keep-counts --output sample$i-sim1000.txt; \
echo "\n";done
for i in {1..7}; do cat sample$i-sim1000.txt >> all-sample-sim1000.txt; done
ggplot(a, aes(x = factor(taxa), y = probes)) +
geom_jitter(aes(colour = factor(taxa))) +
geom_boxplot(alpha = 0.3) +
scale_colour_brewer(palette="Set1") +
scale_x_discrete('Number of Randomly Selected Group Members') +
scale_y_continuous('Number of UCE Loci Recovered From All Group Members', limits = c(1000,2300)) +
opts(legend.position = "none") - uce_loci_missed_by_taxon_count.pdf:
new = d$uce[d$detected == FALSE]
test = table(new)
test2 = as.data.frame(test)
ggplot(test2, aes(factor(Freq), fill = factor(Freq))) +
geom_bar(colour="black") +
scale_fill_manual(values = c("#A6CEE3", "#1F78B4", "#B2DF8A",
"#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6",
"#6A3D9A")) + scale_x_discrete('Number of Taxa') +
scale_y_continuous('Failures to Detect Targeted UCEs')
ggsave('uce_loci_missed_by_taxon_count.pdf') - uce_coverage_by_taxon.pdf:
ggplot(data = d, aes(x = uce, y = coverage, group = organism)) +
geom_line(aes(colour = organism)) + facet_wrap(~organism) +
scale_x_discrete(breaks=NA) +
scale_fill_manual(values = c("#A6CEE3", "#1F78B4", "#B2DF8A",
"#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A")) +
scale_x_discrete(breaks=NA, 'UCE-associated Locus (Name Not Shown)') +
scale_y_continuous('Coverage') + opts(legend.position = "none")