2011-Fairclothetal-SysBiol-Notes (original) (raw)

Star (1) You must be signed in to star a gist
Fork (2) You must be signed in to fork a gist
Clone this repository at <script src="https://gist.github.com/brantfaircloth/47e03463db0573c4252f.js"></script>
Save brantfaircloth/47e03463db0573c4252f to your computer and use it in GitHub Desktop.

Clone this repository at <script src="https://gist.github.com/brantfaircloth/47e03463db0573c4252f.js"></script>

Save brantfaircloth/47e03463db0573c4252f to your computer and use it in GitHub Desktop.

2011-Fairclothetal-SysBiol-Notes

Identifying UCEs in other organisms

get count of all UCEs identified that were non-dupes:
sqlite> select count() from cons, blast where blast.id = cons.id and blast.matches = 1 and duplicate = 0;
count()

5599 * get distance btw. UCE loci in set above w/:
python other-organisms/get_distances_in_chicken.py * create a new table in the probe dbase for the ids and count of all "conservation" sureselect probes:
sqlite> create table sureselect_all_probe_counts as select seq, id, count(seq) as cnt,
sqlite> data_source from sureselect where data_source = 'conservation' group by seq order by id;
sqlite> select count() from sureselect_all_probe_source;
count()

5138

get the probe sequences from the dbase for all the conserved probes:
python assembly/get_sureselect_probes_or_cons.py --output all-probes.fasta --all --probes probe.sqlite
move that into MS folder in Dropbox:
mv all-probes.fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers
[lastz] this file against several genomes (used --huge option on anoCar2 because of chromo/scaffold number). The output files are spread btw. snp/lastz and matches/lastz directories:
mm9
hg19
anoCar2
taeGut1
galGal3
melGal2
croPor1 (really 0.1)
monDom5
ornAna1
xenTro2
python share/run_lastz.py \
--target /Volumes/Data/Genomes/taeGut1/taeGut1.2bit \
--query all-probes.fasta --nprocs 8 \
--output lastz/all-probes-to-taeGut1.lastz
check probes for dupes:
python share/easy_lastz.py \
--target all-probes.fasta --query all-probes.fasta \
--identity 85 --output all-to-all.lastz
generate probe counts for critters in matches/lastz:
python other-organisms/get_probes_matches_from_lastz.py ./ \
all-probes.fasta test.sqlite \
--dupefile all-to-all.lastz
data will reside in data/. split files into appropriate directories - some into snp/ and some into matches/

Primate tree test

update genomes by dloading new build, if avaialble
run lastz of probes against genomes:
python align/un_multiple_lastzs.py \
--output=/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates \
--probefile=all-probes.fasta \
--chromolist="hg19","gorGor3","ponAbe2","rheMac2","mm9" \
--readlist="nomLeu1","panTro3","calJac3","papAnu1","otoGar3"
rename all the files to similar names used previously. In ipython:
[shutil.copy(i,"all-probes-to-{0}".format(os.path.basename(i).split('_')[-1])) for i in glob.glob('*.lastz')]
first move SNP-related and initial db-related stuff to its own SNP folder. Put these data in "primates" folder.
Then (ran w/ 200 AND 150), build "fake" data sets from the extant genomes that are just like the data we collect from UCE loci:
for i in {"hg19","gorGor3","ponAbe2","rheMac2","mm9","nomLeu1","calJac3","otoGar3"};
do \
python share/get_fake_velvet_contigs_from_genomes.py \
/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-$i.lastz \
/Volumes/Data/Genomes/$i/$i.2bit \
--dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
--fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/$i-150.fasta --flank 150;
done;
I originally ran the above w/ panTro3 and papAnu1, but it failed for panTro3 (some chromo names have 3 parts) and papAnu1 (chromo names have many parts) because they have crazy-ass name structures. Updated code and ran, instead, w/:
python share/get_fake_velvet_contigs_from_genomes.py \
/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-panTro3.lastz \
/Volumes/Data/Genomes/panTro3/panTro3.2bit \
--dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
--fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/panTro3-150.fasta \
--flank 150 --name-components 3

python share/get_fake_velvet_contigs_from_genomes.py \
/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/lastz/all-probes-to-papHam1.lastz \
/Volumes/Data/Genomes/papHam1/papHam1.2bit \
--dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
--fasta /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/fake-velvet-fasta/papHam1-150.fasta \
--flank 150 --splitchar None

make symlinks to fastas in contigs for each of {"hg19","gorGor3","ponAbe2","rheMac2","mm9","nomLeu1","panTro3","calJac3","papAnu1","otoGar3"}:
ln -s ../fake-velvet-fasta/mm9-200.fasta mus-musculus.fasta
re-lastz probes against "contigs" and store group matches in db:
python assembly/match_contigs_to_probes.py \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/contigs \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-probes.fasta \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db \
--dupefile ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz
run get_match_counts on reptile data to build format file:
python assembly/get_match_counts.py \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db/primate-probes-matches.sqlite \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-match-count.config \
'Primates' \
--output ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-loci-for-fasta.out
add asterisks to primate-loci-for-fasta.out for each organism
turn those data into a giant fasta for alignment:
python assembly/get_fastas_from_match_counts.py \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/contigs \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/probe-match-for-db/primate-probes-matches.sqlite \
~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/config/primate-loci-for-fasta.out \
--output ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/alignment/primates.fasta
align those puppies:
python align/eqcap_align.py \
--input primates.fasta --output nexus \
--species 10 --trim-running --multiprocessing --processors 7
within ~/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/alignment/ convert these to phylip so we can run mrAIC on them:
python align/convert_nexus_to_phylip.py \
--input nexus/ --output phylip/ --shorten-name --positions '-2,-1'
estimate subs models:
python align/run_mraic.py --input phylip --output aicc
remove the locus name from each nexus file:
python align/remove_locus_name_from_nexus_lines.py \
--input nexus --output nexus-rename
prep the mrbayes file:
python align/nexus_to_concat.py \
--models ../output_models.txt \
--aligns ../nexus-rename/ \
--concat primate-each-part.nex \
--mr-bayes --interleave --unlink
crank up some ec2. setup mrbayes per: https://gist.github.com/950010, then:
ec2-run-instances --key ec2-keypair ami-349d645d \
--instance-type=c1.xlarge --availability-zone=us-east-1b \
--block-device-mapping '/dev/sda2=ephemeral0' \
--block-device-mapping '/dev/sda3=ephemeral1'
sync up nexus file:
rsync -avz -e "ssh -i /Users/bcf/.ec2/ec2-keypair" \
./ ec2-user@ec2-67-202-26-236.compute-1.amazonaws.com:/mnt/data/primate-rate-variation/
after starting in mpi-mode, set ngen = 5,000,000 and run w/ mcmc. run. wait.
connect to instance:
ssh -i ec2-keypair ec2-user@xx-yy-zz.compute-1.amazonaws.com
get summary data across nexus (trimmed alignments):
python align/get_cons_region_align_stats.py \
--input nexus-rename --output primate-align-stats.csv
open up R, then:

this gives U-shaped figure

primates = read.table('/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/primates/stats/primate-trimmed-align-stats.csv',header=TRUE, sep = ',')
nobase = primates[primates$greaterthanonediff >0.0,]
nooutlier = nobase[nobase$greaterthanonediff <= 0.10,]
ggplot(nooutlier, aes(x=bp,y=greaterthanonediff,color=greaterthanonediff)) geom_point(size=3) scale_colour_gradientn(colour = rainbow(5)) scale_y_continuous('Frequency of variant bases\n') scale_x_continuous('\nDistance from center of alignment') theme_bw() opts(legend.position = "none")
pdf('name.pdf')

remove some gridlines

p
g = ggplotGrob(p)
grid.ls(g)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
dev.off()

Bird deep tree from in vitro data

unzip data from sequencer
clean adapter contamination and trim using sliding window:
python assembly/process_reads.py \
--sample-map Illumina_Data/SamplesDirectories.csv Bin_012
several tags were parsed incorrectly (not at all). go ahead and re-parse the unknown file to see if we can pull out necessary reads:
python assembly/reparse_tags.py --input s_5_sequence.txt --tag 'TCTGCC' --output anser-erythropus.fastq
python assembly/reparse_tags.py --input s_6_sequence.txt --tag 'GCATCC' --output zanclostomus-javanicus.fastq
python assembly/reparse_tags.py --input s_6_sequence.txt --tag 'GTAATC' --output nyctibius-grandis.fastq
put these in Bin_unk and run through trimming/assembly as above. Then get coverage and read lengths, etc. as above.
upload data on individual basis to ec2 (need a machine w/ shitloads of RAM) - ZONE us-east-1:
ec2-run-instances --key ec2-keypair ami-349d645d --region us-east-1\
--instance-type=m2.2xlarge \
--block-device-mapping '/dev/sda2=ephemeral0' \
--block-device-mapping '/dev/sda3=ephemeral1'
built velvet (max_kmer = 96) and assoc. code (bioperl, etc.), then assemble using VelvetOptimiser.pl (see stats folder for each Bin for actual run code):
~/src/velvet_1.1.04/contrib/VelvetOptimiser-2.1.7/VelvetOptimiser.pl \
-s 59 -e 79 -f '-short -fastq genus-species.fastq' \
-o '-amos_file yes' && tar cvjf ../done/genus-species.tar.bz2 auto_*/
assembled reads and zipped up assemblies and contigs as above, downloaded those. generated checksums and checked those and uploaded to webserver
for several species, had to tweak contig generation. See readme.md in retry folder
determine size of contigs (unzipped):
for i in contigs/*; do echo i;faSizei; faSize i;faSizei; done
determine kmer coverage of contig, given knowledge of kmer and contig output:
python assembly/coverage_from_kmers.py --csv True \
--input Bin_005/contigs/mus-musculus.assembly-retry.contigs.fasta
align probes that we used to themselves to make a dupefile:
python ~alignment/easy_lastz.py
--target=probe-subset-2560-synthesized.fasta --query=probe-subset-2560-synthesized.fasta --coverage=80 --identity=80 --output=probe-subset-2560-to-self.fasta
match the contigs that we have to the probes that we used using lastz and store the matching results in a database. also write output (lastz files) to a folder:
python assembly/match_contigs_to_probes.py \
/Volumes/Data3/lsu/birds/contigs \
probe-subset-2560-synthesized.fasta \
/Volumes/Data3/lsu/birds/lastz \
--dupefile probe-subset-2560-to-self.fasta
anser_erythropus: 1596 (69.63%) uniques of 2292 contigs, 0 dupe probes, 0 dupe probe matches, 45 dupe node matches
dromaius_novaehollandiae: 1518 (76.40%) uniques of 1987 contigs, 0 dupe probes, 2 dupe probe matches, 24 dupe node matches
eudromia_elegans: 1484 (77.29%) uniques of 1920 contigs, 0 dupe probes, 1 dupe probe matches, 52 dupe node matches
gallus_gallus: 1438 (77.65%) uniques of 1852 contigs, 0 dupe probes, 1 dupe probe matches, 30 dupe node matches
megalaima_virens: 1174 (85.69%) uniques of 1370 contigs, 0 dupe probes, 0 dupe probe matches, 10 dupe node matches
phalacrocorax_carbo: 1384 (86.45%) uniques of 1601 contigs, 0 dupe probes, 3 dupe probe matches, 10 dupe node matches
pitta_guajana: 1572 (66.36%) uniques of 2369 contigs, 0 dupe probes, 2 dupe probe matches, 32 dupe node matches
struthio_camelus: 1591 (75.19%) uniques of 2116 contigs, 0 dupe probes, 2 dupe probe matches, 37 dupe node matches
urocolius_indicus: 1495 (67.71%) uniques of 2208 contigs, 0 dupe probes, 2 dupe probe matches, 43 dupe node matches
screen various groups and/or optimize individuals in groups (in terms of UCEs located) to max the number of UCEs shared across members of the group:
python assembly/get_match_counts.py \
/Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
'Birds - UCE paper'
to extend capture data with genomic/outgroup data and build probe data set (genomic outgroup data are indicated in the config file using asterisks):
[Birds - UCE paper]
dromaius_novaehollandiae
struthio_camelus
eudromia_elegans
anser_erythropus
gallus_gallus
phalacrocorax_carbo
megalaima_virens
urocolius_indicus
pitta_guajana
anolis_carolinensis*
python assembly/get_match_counts.py \
/Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
--extend /Users/bcf/Dropbox/research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
'Birds - UCE paper' \
--output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.out
turn those data into a giant fasta for alignment:
python assembly/get_fastas_from_match_counts.py \
/Volumes/Data3/lsu-seqcap/contig/ \
/Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.out \
--extend-db /Users/bcf/Dropbox/research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
--extend-dir /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/snp/contig/ \
--output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.fasta
align:
python align/seqcap_align.py \
--input /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/deep-bird-phylo-with-anolis-outgroup.fasta \
--output /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/nexus/ \
--species 10 --trim-running --multiprocessing --processors=7
convert these to phylip so we can run mrAIC on them:
python align/convert_nexus_to_phylip.py \
--input nexus/ --output phylip/ --shorten-name --positions '-2,-1'
get subs models:
python align/run_mraic.py --input phylip --output aicc
remove the locus name from each nexus file:
python align/remove_locus_name_from_nexus_lines.py \
--input nexus --output nexus-rename
prep the mrbayes file:
python align/nexus_to_concat.py \
--models output_models.txt --aligns nexus-rename/ \
--concat mrbayes/deep-bird-each-part.nex --mr-bayes \
--interleave --unlink
crank up some AWS:
ec2-run-instances --key keypair ami-349d645d \
--instance-type=c1.xlarge --availability-zone=us-east-1b \
--block-device-mapping '/dev/sda2=ephemeral0' \
--block-device-mapping '/dev/sda3=ephemeral1'
login and setup mrbayes:
ssh -i ec2-keypair ec2-user@ec2-174-129-57-11.compute-1.amazonaws.com
use rsync to up/dload data:
rsync -avz -e "ssh -i /Users/bcf/.ec2/ec2-keypair" \
ec2-user@ec2-174-129-57-11.compute-1.amazonaws.com:/mnt/data/birds/ ./
get variant stats for birds:
python align/get_cons_region_align_stats.py --input nexus-rename --output bird-align-stats.csv
make figures:

this gives U-shaped figure

birds = read.table('/Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/birds/stats/bird-trimmed-align-stats.csv', header = TRUE, sep = ',')
birdnobase = birds[birds$greaterthanonediff >0.0,]
birdnooutlier = birdnobase[birdnobase$greaterthanonediff <= 0.10,]
ggplot(birdnooutlier, aes(x=bp,y=greaterthanonediff,color=greaterthanonediff)) geom_point(size=3, alpha = 5/10) scale_colour_gradient(low = "red", high = "blue") scale_y_continuous('Frequency of variable positions\n') scale_x_continuous('\nDistance from center of alignment') theme_bw() opts(legend.position = "none")
pdf('bird-variable-positions.pdf')

remove some gridlines

Enrichment stats (for birds)

extract AMOS file from velvet tarball and store in tmp:
tar -jtvf /Volumes/Data3/lsu-seqcap/Bin_008/assembled/urocolius-indicus.assembly.tar.bz2
tar -xjvf /Volumes/Data3/lsu-seqcap/Bin_008/assembled/urocolius-indicus.assembly.tar.bz2 auto_data_73/velvet_asm.afg
convert AMOS afg file to AMOS bnk file:
~/Source/amos-3.0.0/src/Bank/bank-transact -m megalaima-virens.afg -b megalaima-virens.bnk -c
output coverage file, so we can get iids:
~/Source/amos-3.0.0/src/Utils/cvgStat -b megalaima-virens.bnk > megalaima-virens.cvg
get loci from query of sqlite data:
sqlite> .output megalaima-virens.loci
sqlite> select megalaima_virens from match_map where megalaima_virens IS NOT NULL;
sqlite> .output stdout
parse that bad-boy and the cvg file to get iids:
mv /Volumes/Data3/lsu-seqcap/lastz/megalaima-virens.loci ~/Tmp/bird-coverage/megalaima-virens
python assembly/get_iids_from_cvg_stat.py
\ megalaima-virens.cvg megalaima-virens.loci --output megalaima-virens.iid
now get depth of coverage of locus-specific reads:
~/Source/amos-3.0.0/src/Validation/analyze-read-depth megalaima-virens.bnk -d -I megalaima-virens.iid
get depth of coverage of all reads in contigs:
~/Source/amos-3.0.0/src/Validation/analyze-read-depth megalaima-virens.bnk -d
in order to derive the average contig length from loci matching UCEs (there is another script for only UCE loci present in all orgs - it parses the fasta we input to alignment rather than the assembled contigs from velvet):
python assembly/get_contig_lengths_for_all_uce_loci.py \
/Volumes/Data3/lsu-seqcap/contig \
/Volumes/Data3/lsu-seqcap/lastz/probe.matches.sqlite \
~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run/faircloth/match-count.config \
"Birds - UCE paper"

Additional SNP Analyses

screen lastz files for dupes, starts and ends, and generate velvet-like contigs, so we can insert these data into the same pipeline we have for the seqcap data:
python share/get_fake_velvet_contigs_from_genomes.py \
all-probes-to-hg19.lastz \
/Volumes/Data/Genomes/hg19/hg19.2bit \
fake-velvet-fasta/taeGut1-all-probe-fake-velvet.fasta
--flank 50
Also get BED files for flanks w/ flank dist @ 50, 100, 250:
python share/get_fake_velvet_contigs_from_genomes.py \
all-probes-to-hg19.lastz /Volumes/Data/Genomes/hg19/hg19.2bit \
--bed fake-velvet-bed/fake-velvet-all-probe-hg19-50.bed \
--flank 50
Or, do both at once:
python share/get_fake_velvet_contigs_from_genomes.py \
data/lastz/all-probes-to-hg19.lastz \
/Volumes/Data/Genomes/hg19/hg19.2bit \
--fasta data/fake-velvet-fasta/fake-velvet-all-probe-hg19-500.fasta \
--bed data/fake-velvet-bed/fake-velvet-all-probe-hg19-500.bed \
--flank 500
using UCSC browser, intersect those with dbSNP128 results (for mm9 and hg19) and bgiSNP results (for galGal3):
results => snp/
associate snps w/ the correct UCEs (using coordinate overlaps, since UCSC doesn't return this):
python snps/find_snps_in_bed_interval.py \
data/fake-velvet-bed/fake-velvet-all-probe-hg19-500.bed \
data/snp/dbSNP132-intersect-all-probe-hg19-500.bed \
--output data/snp/dbSNP132-to-hg19-uce-500.csv
upload SNP RSs to dbSNP to pull down heterozygosity data
from hg19 snps, get heterozygosity data:
python snps/parse_dbsnp_xml.py \
hg19-dbSNP-data > variation-data-hg19-dbSNP.csv
do same in mouse:
python snps/parse_dbsnp_xml.py \
mm9-dbSNP-data > variation-data-mm9-dbSNP.csv

Plotting additional figures

R code (didn't use these boxplots):

data = read.table('contig-lengths-by-probe.csv', header = TRUE, sep = ',')
ggplot(data, aes(factor(probes), contiglength)) geom_boxplot()
ggplot(data, aes(factor(probes), coverage)) geom_boxplot()
p1 = ggplot(data, aes(factor(probes), coverage)) geom_boxplot()
ggsave('coverage-by-probes.pdf')
p2 = ggplot(data, aes(factor(probes), contiglength)) geom_boxplot()
ggsave('contiglength-by-probes.pdf')
f1 = ggplot(data, aes(factor(probes), contiglength)) geom_boxplot() facet_wrap(~ organism, ncol = 6)
ggsave('contiglength-by-probes-facet.pdf')
f2 = ggplot(data, aes(factor(probes), coverage)) geom_boxplot() facet_wrap(~ organism, ncol = 6)
ggsave('coverage-by-probes-facet.pdf')
using intersection BED data, prep CSV file associating SNPs w/ UCE names and coords associate variation data with figure with:
python snps/get_dbsnp_variability_for_all_uces.py \
dbSNP132-to-hg19-uce-200.csv \
~/Tmp/dbsnp/hg19/hg19-dbSNP-data > dbSNP132-to-hg19-uce-200-ready-to-plot.csv
get the SNP variability data for SNPS within 200 bp of UCEs:
python snps/get_dbsnp_freq_stats.py dbSNP132-to-hg19-uce-200.csv \
~/Tmp/dbsnp/hg19/hg19-dbSNP-data \
--dupefile /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/probes/all-to-all.lastz \
--output test1.csv --output2 test2.csv
get that into R and:
raw = read.table('test2.csv', sep = ',', header = TRUE)
convert freqs to floats:
raw$freq = as.numeric(as.character(raw$freq))

add pos column

raw$pos = raw$snp.start - (raw$uce.start raw$uce.end)/2

plot out just the allele freqs for the loci:
raw1 = subset(raw,freq > 0.0)
raw1 = subset(raw1, X1000gvalidated == 'true') #4653 loci
p = ggplot(raw1, aes(x=pos,y=freq,color=freq)) + geom_point(size=3, alpha = 5/10)
- scale_colour_gradient(low = "red", high = "blue") \
- scale_y_continuous('Minor allele frequency (MAF)\n') \
- scale_x_continuous('\nDistance from center of UCE locus') \
- theme_bw() + opts(legend.position = "none")
  np = read.table('test1.csv', sep = ',', header = TRUE)
running average (window 25) of the number of SNPs per base across all UCEs:
p = ggplot(subset(np, datatype == 'running'), aes(x=pos,y=avg,color=avg)) \
- geom_point(size=3, alpha = 5/10) \
- scale_colour_gradient(low = "red", high = "blue") \
- scale_y_continuous('Average number (25 bp window) of 1000 Genomes Project validated SNPs\n') \
- scale_x_continuous('\nDistance from center of UCE locus') \
- theme_bw() \
- opts(legend.position = "none")
running average (window 25) of SNP heterozygosity across all UCEs:
p = ggplot(subset(np, datatype == 'running_hetero'), aes(x=pos,y=avg,color=avg)) \
- geom_point(size=3, alpha = 5/10)
- scale_colour_gradient(low = "red", high = "blue") \
- scale_y_continuous('Mean number of 1000 Genomes Project validated SNPs\n') \
- scale_x_continuous('\nDistance from center of UCE locus') \
- theme_bw() \
- opts(legend.position = "none")
running average (window 25) of SNP heterozygosity across all UCEs:
p = ggplot(subset(np, datatype == 'mean_hetero'), aes(x=pos,y=avg,color=avg)) \
- geom_point(size=3, alpha = 5/10) \
- scale_colour_gradient(low = "red", high = "blue") \
- scale_y_continuous('Mean Minor Allele Frequency (MAF) of 1000 Genomes Project validated SNPs\n') \
- scale_x_continuous('\nDistance from center of UCE locus') theme_bw() opts(legend.position = "none")
  pdf('uce-running-average-1000G-snps-trunc-at-225.pdf')
remove some gridlines:
p
g = ggplotGrob(p)
grid.ls(g)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.major.y.polyline"),grep=T)
grid.remove(gPath("GRID.gTree","layout","panel","grill.gTree","panel.grid.minor.y.polyline"),grep=T)
dev.off()

UCE Capture Data Figures

summarize data:
python get_hits_misses_length_and_gc.py
import data:
d = read.table('gc-length-species-matches.csv', sep = ',', header = TRUE)
order True/False correctly:
d$detected = factor(d$detected, c("TRUE", "FALSE"))
uce_gc_content_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = gc)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') + scale_y_continuous('UCE Locus GC Content') +
opts(legend.position = "none")
uce_tm_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = tm)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') + scale_y_continuous('Average (by Locus) Probe Tm') +
opts(legend.position = "none")
uce_length_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = length)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') + scale_y_continuous('UCE Locus Length') +
opts(legend.position = "none")
uce_bases_added_v_detection_facet_by_taxon.pdf:
ggplot(d, aes(x = detected, y = added)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
geom_boxplot(alpha=0.3) + facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_discrete('UCE Locus Detected') +
scale_y_continuous('Bases Added to Probes Targeting UCE') +
opts(legend.position = "none")
uce_gc_v_target_uce_length.pdf:
ggplot(d, aes(y = gc, x = length)) +
geom_point(aes(colour = detected), alpha=0.5, size = 2) +
facet_wrap(~taxon) + scale_x_continuous('Target UCE Length') +
scale_y_continuous('Target UCE GC Content') +
scale_colour_manual(values = c("#377EB8","#E41A1C"))
uce_average_probe_tm_v_target_length.pdf:
ggplot(d, aes(y = tm, x = length)) +
geom_point(aes(colour = detected), alpha=0.5, size = 2) +
facet_wrap(~taxon) + scale_x_continuous('Target UCE Length') +
scale_y_continuous('Average (by Locus) Probe Tm') +
scale_colour_manual(values = c("#377EB8","#E41A1C"))
uce_target_length_v_probes_targeting.pdf:
ggplot(d, aes(x = count, y = length)) +
geom_jitter(aes(colour = detected), alpha = 0.4) +
facet_wrap(~taxon) +
scale_colour_manual(values = c("#377EB8","#E41A1C")) +
scale_x_continuous('Number of Probes Targeting UCE') +
scale_y_continuous('Target UCE Length')
import contig data:
d2 = read.table('../archive/birds-contig-lengths-by-probe-filtered-birds.csv', header = TRUE, sep = ',')
uce_contig_length_by_probe_count.pdf:
ggplot(d2, aes(x = factor(probes), y = contiglength)) +
geom_jitter(aes(colour = factor(probes))) +
geom_boxplot(alpha = 0.3) +
scale_colour_brewer(palette="Set1") +
opts(legend.position = "none") +
scale_x_discrete('Number of Probes Targeting UCE') +
scale_y_continuous('Contig Length of UCE-associated Locus)')
uce_sequencing_depth_by_probe_count.pdf:
ggplot(d2, aes(x = factor(probes), y = coverage)) +
geom_jitter(aes(colour = factor(probes))) +
geom_boxplot(alpha = 0.3) +
scale_colour_brewer(palette="Set1") +
opts(legend.position = "none") +
scale_x_discrete('Number of Probes Targeting UCE') +
scale_y_continuous('Sequencing Depth of UCE-associated Locus')
uce_contig_length_by_probe_count_by_taxon.pdf:
ggplot(d2, aes(x = factor(probes), y = contiglength)) +
geom_jitter(aes(colour = factor(probes))) +
geom_boxplot(alpha = 0.3) +
facet_wrap(~organism) +
scale_colour_brewer(palette="Set1") +
opts(legend.position = "none") +
scale_x_discrete('Number of Probes Targeting UCE') +
scale_y_continuous('Contig Length of UCE-associated Locus')
uce_loss_of_loci_with_randomly_selected_groups.pdf:
for i in {1..7}; do python assembly/get_match_counts.py \
../archive/birds-probe-matches.sqlite \
~/Dropbox/Research/brumfield/LSU_Illumina_May2011_run1/faircloth/match-count.config \
--extend /Users/bcf/Dropbox/Research/faircloth/UCEs-as-genetic-markers/data/snp/probe-match-for-db/genome.matches.sqlite \
'Birds - UCE paper' --optimize --random --sample-size $i \
--samples 1001 --keep-counts --output sample$i-sim1000.txt; \
echo "\n";done
for i in {1..7}; do cat sample$i-sim1000.txt >> all-sample-sim1000.txt; done
ggplot(a, aes(x = factor(taxa), y = probes)) +
geom_jitter(aes(colour = factor(taxa))) +
geom_boxplot(alpha = 0.3) +
scale_colour_brewer(palette="Set1") +
scale_x_discrete('Number of Randomly Selected Group Members') +
scale_y_continuous('Number of UCE Loci Recovered From All Group Members', limits = c(1000,2300)) +
opts(legend.position = "none")
uce_loci_missed_by_taxon_count.pdf:
new = d$uce[d$detected == FALSE]
test = table(new)
test2 = as.data.frame(test)
ggplot(test2, aes(factor(Freq), fill = factor(Freq))) +
geom_bar(colour="black") +
scale_fill_manual(values = c("#A6CEE3", "#1F78B4", "#B2DF8A",
"#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6",
"#6A3D9A")) + scale_x_discrete('Number of Taxa') +
scale_y_continuous('Failures to Detect Targeted UCEs')
ggsave('uce_loci_missed_by_taxon_count.pdf')
uce_coverage_by_taxon.pdf:
ggplot(data = d, aes(x = uce, y = coverage, group = organism)) +
geom_line(aes(colour = organism)) + facet_wrap(~organism) +
scale_x_discrete(breaks=NA) +
scale_fill_manual(values = c("#A6CEE3", "#1F78B4", "#B2DF8A",
"#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A")) +
scale_x_discrete(breaks=NA, 'UCE-associated Locus (Name Not Shown)') +
scale_y_continuous('Coverage') + opts(legend.position = "none")