140127 rtg vcfeval vcf comparison tool
1. Comparing Variant Calls. Genome in a Bottle Workshop. Francisco M. De La Vega, D.Sc., Visiting Scholar, Department of Genetics, Stanford University School of Medicine. In collaboration with Real Time Genomics, Inc.
2. rtgTools v1.0: a toolkit to compare and analyze VCFs
• vcfeval – comparison of VCFs for ROC curves
• rocplot – draw ROC curves from vcfeval output
• mendelian – counts of Mendelian inheritance errors in pedigrees
• vcfstats – basic statistics of VCF files
• vcffilter – filtering of VCFs by scores, etc.
• vcfannotate – annotation of VCF files
• vcfmerge – merge VCF files
Java compiled code freely available at GiaB repository: ftp://ftp-trace.ncbi.nih.gov/giab/ftp/tools/RTG/
3. Issues in representation of complex calls
Indel in homopolymer
    Reference  CAAAAAAG
    Baseline   C..AAAAG
    Called     CAAAA..G
  After replay:
    Baseline   CAAAAG
    Called     CAAAAG
MNPs
    Reference  CAACGTAAG
    Baseline   CAATGTCAG
    Called     CAATGTCAG
4. Issues in representation of complex calls
Dinucleotide repeat
    Reference  ACGTACCAGATATCACAACATATATATA
    Baseline   ACGGACCAG..ATCACAACATATATATATA
    Called     ACGGACCAGAT..CACAACATATATATATA
  After replay:
    Baseline   ACGGACCAGATCACAACATATATATATA
    Called     ACGGACCAGATCACAACATATATATATA
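The replay step shown on these two slides can be sketched in Python. This is a minimal illustration, not the vcfeval implementation: it assumes variants are given as hypothetical (pos, ref, alt) tuples with 0-based positions, and it uses the homopolymer example from slide 3.

```python
def replay(reference, variants):
    """Apply non-overlapping variants to a reference and return the haplotype.

    Each variant is a hypothetical (pos, ref, alt) tuple with a 0-based
    position; this layout is an assumption made for the example.
    """
    pieces = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError("ref allele does not match the reference")
        pieces.append(reference[cursor:pos])  # unchanged reference bases
        pieces.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


reference = "CAAAAAAG"
baseline = [(1, "AA", "")]  # deletion drawn as C..AAAAG
called = [(5, "AA", "")]    # deletion drawn as CAAAA..G

# Different representations, same haplotype after replay.
print(replay(reference, baseline) == replay(reference, called))
```

Comparing replayed haplotypes rather than VCF records is what lets two differently placed deletions in a repeat count as the same call.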
5. Comparison of variant call set with baseline set
Basic rules
• Match the baseline and called sequences so as to maximize true positives and minimize false positives and false negatives.
• True positives + false negatives = total calls in the baseline.
• Heterozygous calls match only when both calls are heterozygous and their alleles agree.
Path creation
• A path is a selection of a subset of calls.
• Best path: the path that maximizes true positives and minimizes errors.
• In theory there is an exponential number of paths; in practice the problem can be solved by dynamic programming.
[Diagram: Best path, Link mutations, ROC]
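To make the idea of a path concrete, here is a brute-force sketch in Python that tries every subset of baseline and called variants and keeps the pair that replays to the same haplotype while including the most variants. It is only an illustration of the exponential search mentioned above, restricted to the homozygous case; it is not the dynamic programming used by vcfeval, and the (pos, ref, alt) tuples and function names are hypothetical.

```python
from itertools import combinations


def replay(reference, variants):
    """Return the haplotype after applying variants, or None if they conflict."""
    pieces, cursor = [], 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor or reference[pos:pos + len(ref)] != ref:
            return None
        pieces.append(reference[cursor:pos])
        pieces.append(alt)
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def subsets(items):
    """Yield every subset of items, largest first."""
    for size in range(len(items), -1, -1):
        yield from combinations(items, size)


def best_path(reference, baseline, called):
    """Exhaustively pick the subsets that agree and exclude the fewest variants."""
    best = None
    for b in subsets(baseline):
        hap = replay(reference, b)
        if hap is None:
            continue
        for c in subsets(called):
            if replay(reference, c) == hap:
                score = len(b) + len(c)  # included variants on both sides
                if best is None or score > best[0]:
                    best = (score, b, c)
    return best


reference = "CAAAAAAGTTC"
baseline = [(1, "AA", ""), (9, "T", "G")]
called = [(5, "AA", "")]
score, kept_baseline, kept_called = best_path(reference, baseline, called)
false_negatives = [v for v in baseline if v not in kept_baseline]
print(kept_baseline, kept_called, false_negatives)
```

In this toy input the two deletions are matched even though they are placed differently, and the baseline SNP with no counterpart is left out of the path as a false negative.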
6. Path creation - simple homozygous case [Figure: reference track with baseline and called variants labeled a to h]
7. Path creation - simple homozygous case [Figure: reference track with baseline and called variants a to h, and the resulting best path; one baseline variant is excluded as a false negative and one called variant is excluded as a false positive]
8. Path creation - simple heterozygous case (non-phased) [Figure: reference track with baseline and called variants labeled a to f]
9. Path creation - simple heterozygous case (non-phased) [Figure: reference track with baseline and called variants a to f, and the resulting best path; a false negative is excluded from the baseline and a false positive is excluded from the called set]
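For non-phased calls, the rule that heterozygous calls match only when the alleles agree can be sketched as an order-insensitive comparison of the two alleles. This is a minimal Python illustration; the function name and the representation of a genotype as a pair of allele strings are assumptions for the example, not part of rtgTools.

```python
def genotypes_match(baseline_gt, called_gt):
    """Compare two unphased diploid genotypes, ignoring allele order.

    Each genotype is a hypothetical pair of allele strings, e.g. ("A", "T").
    """
    return sorted(baseline_gt) == sorted(called_gt)


print(genotypes_match(("A", "T"), ("T", "A")))  # same alleles, different order
print(genotypes_match(("A", "T"), ("T", "T")))  # heterozygous vs homozygous
```

Sorting the pair removes any dependence on the order in which the alleles were written, which is the appropriate comparison when no phase information is available.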
10. Why is weighting needed?
TP + FN = Total_baseline
    Reference  CAACAACTATCCTC....ATCT....GC
    Baseline   CAACAACTATCCTCATCTATCTATCTGC
    Called     CAACAACTATCCTCATCTATCTATCTGC
12. Weighting [the formula shown on the slide is not preserved in this transcript], where B is the number of baseline variants between the current sync point (S_n) and the previous sync point (S_n-1), and C is the number of called variants between the same two sync points.
13. Simple homozygous weighting
[Figure: best path with sync points; baseline weights a = 1, b = 1, c = 1, d = 1, e = 1, f = 1; excluded false negative, weight 1; excluded false positive, weight 1]
    Type  Weighted total
    TP    6
    FP    1
    FN    1
14. Simple heterozygous case (non-phased) weighting
[Figure: best path with sync points; baseline weights a = 1, b = 1, c = 1, d = 1, with e and f shown without weights; excluded false negative, weight 2; excluded false positive, weight 1]
    Type  Weighted total
    TP    4
    FP    1
    FN    2
15. Complex weighting
[Figure: best path with sync points; weights a = 1, b = 1, c = 1, d = 1, e = 0.5, f = 0.5]
    Type  Weighted total
    TP    5
    FP    0
    FN    0
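The weighted totals on the complex weighting slide can be reproduced with a short Python sketch. It assumes each called variant between two sync points receives the weight B / C, with B and C as defined on the weighting slide; that ratio is inferred from the 0.5 weights shown for e and f, not quoted from the slides. The function name and the (B, C) segment tuples are hypothetical.

```python
from fractions import Fraction


def called_weights(segments):
    """Return one weight per called true-positive variant.

    Each segment is a hypothetical (B, C) pair: the numbers of baseline and
    called variants between two consecutive sync points. The B / C weight
    is an assumption inferred from the slide example.
    """
    weights = []
    for b, c in segments:
        if c > 0:
            weights.extend([Fraction(b, c)] * c)
    return weights


# Slide 15: four one-to-one matches, then one baseline variant that is
# represented by two called variants (e and f).
segments = [(1, 1), (1, 1), (1, 1), (1, 1), (1, 2)]
weights = called_weights(segments)
print([str(w) for w in weights], "weighted TP =", sum(weights))
```

Under this assumption the weights in each segment sum to the number of baseline variants in that segment, which keeps TP + FN equal to the baseline total even when the two call sets split a change into different numbers of records.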
16. ROC Plot
18. Acknowledgements
RTG, Hamilton, New Zealand: John Cleary, Len Trigg, Mehul Rathoud
Data and tools to compare with phased standard released publicly at NIST Genome-in-a-Bottle repository (s3://giab)
This work was done while the presenter was employed by Real Time Genomics Inc., San Bruno, CA. © 2014 Real Time Genomics, Inc. All rights reserved.