single-sperm-co-analysis (original) (raw)
Contents
- 1 Introduction
- 2 Locate file path
- 3 File information
- 4 Diagnostic functions
- 5 Input parsing
- 6 Construct RangedSummarizedExpriment object
- 7 Count crossovers
- 8 Count crossovers for SNP intervals
- 9 Calculate genetic distances
- 10 Calculate genetic distances
- 11 Plot whole genome genetic distances
- 12 Group differences
- 13 Session info
suppressPackageStartupMessages({
library(comapr)
library(SummarizedExperiment)
})
Introduction
In the previous vignette, we demonstrated how to calculate genetic distances using genotyped markers from a group of samples.
Genetic distance calculate from genotype shifts of markers
In this document, we will focus on building individualized genetic maps from output files of sscocaller
which is available athere.
Locate file path
The comapr
package includes a list of toy example output files from the sscocaller
command line tool. The follow code will get the file path that we will use later.
demo_path <-paste0(system.file("extdata",package = "comapr"),"/")
We can see that we have two samples (individual donors) and each of them has haplotype states inferred for chr1 to chr5.
list.files(demo_path)
#> [1] "s1_barcodes.txt" "s1_chr1_altCount.mtx" "s1_chr1_snpAnnot.txt"
#> [4] "s1_chr1_totalCount.mtx" "s1_chr1_vi.mtx" "s1_chr1_viSegInfo.txt"
#> [7] "s1_chr2_altCount.mtx" "s1_chr2_snpAnnot.txt" "s1_chr2_totalCount.mtx"
#> [10] "s1_chr2_vi.mtx" "s1_chr2_viSegInfo.txt" "s1_chr3_altCount.mtx"
#> [13] "s1_chr3_snpAnnot.txt" "s1_chr3_totalCount.mtx" "s1_chr3_vi.mtx"
#> [16] "s1_chr3_viSegInfo.txt" "s1_chr4_altCount.mtx" "s1_chr4_snpAnnot.txt"
#> [19] "s1_chr4_totalCount.mtx" "s1_chr4_vi.mtx" "s1_chr4_viSegInfo.txt"
#> [22] "s1_chr5_altCount.mtx" "s1_chr5_snpAnnot.txt" "s1_chr5_totalCount.mtx"
#> [25] "s1_chr5_vi.mtx" "s1_chr5_viSegInfo.txt" "s2_barcodes.txt"
#> [28] "s2_chr1_snpAnnot.txt" "s2_chr1_vi.mtx" "s2_chr1_viSegInfo.txt"
#> [31] "s2_chr2_snpAnnot.txt" "s2_chr2_vi.mtx" "s2_chr2_viSegInfo.txt"
#> [34] "s2_chr3_snpAnnot.txt" "s2_chr3_vi.mtx" "s2_chr3_viSegInfo.txt"
#> [37] "s2_chr4_snpAnnot.txt" "s2_chr4_vi.mtx" "s2_chr4_viSegInfo.txt"
#> [40] "s2_chr5_snpAnnot.txt" "s2_chr5_vi.mtx" "s2_chr5_viSegInfo.txt"
File information
- *.mtx
- sparse matrix with columns corresponding to the list of sperm cell barcodes and rows corresponding to the list of SNP positions in VCF file
- {sample}_chr1_altCount.mtx, a sparse mtx with entries representing alternative allele counts
- {sample}_chr1_totalCount.mtx, a sparse mtx with entries representing total allele counts
- {sample}_chr1_vi.mtx, a sparse mtx with entries representing inferred viterbi state (haplotype state)
- {sample}_chr1_snpAnnot.txt, the SNP positions and allele
- {sample}_chr1_SegInfo.txt, statistics of viterbi state segments in text file format. It contains consecutive viterbi states for each chromosome with statistics including, starting SNP position, ending SNP position, the number of SNPs supporting the segment, the log likelihood ratio of the viterbi segment and the inferred hidden state.
Diagnostic functions
comapr
provides quality-check functions for examining SNP coverage per chr and per cell, chromosome segregation pattern checks, and summary statics for filtering low confidence crossovers.
perCellChrQC
pcQC <- perCellChrQC(sampleName="s1",chroms=c("chr1"),
path=paste0(demo_path,"/"),
barcodeFile=NULL)
Input parsing
Input-parsing functions are implemented in comapr
to constructRangedSummarizedExpriment
object that parses files generated from sscocaller
and filter out low-confidence COs at the same time. For the demo dataset, these filters do not have any effects:
minSNP = 0,
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1
Construct RangedSummarizedExpriment
object
We first construct RangedSummarizedExpriment
object from parsing output files from sscocaller
and filter out low-confidence crossovers.
s1_rse_state <- readHapState("s1",chroms=c("chr1","chr2"),
path=demo_path,barcodeFile=NULL,
minSNP = 0,
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1)
s2_rse_state <- readHapState("s2",chroms=c("chr1","chr2"),
path=demo_path,
barcodeFile=NULL,
minSNP = 0,
minlogllRatio = 50,
bpDist = 100,
maxRawCO=10,
minCellSNP = 1)
s1_rse_state
#> class: RangedSummarizedExperiment
#> dim: 12 5
#> metadata(10): ithSperm Seg_start ... bp_dist barcode
#> assays(1): vi_state
#> rownames: NULL
#> rowData names(0):
#> colnames(5): BC1 BC2 BC3 BC4 BC5
#> colData names(1): barcodes
The Viterbi states for SNP markers are stored in the “assay” slot:
assay(s1_rse_state)
#> 12 x 5 sparse Matrix of class "dgCMatrix"
#> BC1 BC2 BC3 BC4 BC5
#> [1,] 1 2 1 2 1
#> [2,] 1 . . . .
#> [3,] . 2 . . .
#> [4,] . . . . 1
#> [5,] 2 1 . . .
#> [6,] 2 1 1 2 .
#> [7,] 1 2 1 2 1
#> [8,] 1 . . . .
#> [9,] . 2 . . .
#> [10,] . . . . 1
#> [11,] 2 1 . . .
#> [12,] 2 1 1 2 .
The rowRanges
contains the SNP’s positions:
rowRanges(s1_rse_state)
#> GRanges object with 12 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 3000 *
#> [2] chr1 3200 *
#> [3] chr1 4000 *
#> [4] chr1 4500 *
#> [5] chr1 5500 *
#> ... ... ... ...
#> [8] chr2 3200 *
#> [9] chr2 4000 *
#> [10] chr2 4500 *
#> [11] chr2 5500 *
#> [12] chr2 6000 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
Formate sample group factor
We have read in the Viterbi states for cells from two samples: s1 and s2. We now combine them into one data object for downstream analysis.
Add sample group factor
colData(s1_rse_state)
#> DataFrame with 5 rows and 1 column
#> barcodes
#> <character>
#> BC1 BC1
#> BC2 BC2
#> BC3 BC3
#> BC4 BC4
#> BC5 BC5
colData(s1_rse_state)$sampleGroup <- "s1"
colData(s2_rse_state)$sampleGroup <- "s2"
Combine two groups
We now call combineHapState
function to combine sample s1 and sample s2 into one RangedSummarizedExperiment
object.
twoSample <- combineHapState(s1_rse_state,
s2_rse_state)
Now the assay
data slot contains the Viterbi states across SNP positions for the combined samples.
twoSample <- combineHapState(s1_rse_state,s2_rse_state)
Now the twoSample
object contains the cells from both samples with assay
slot contains the Viterbi states and rowRanges
contains position annotaitons for the list SNP markers.
assay(twoSample)
#> 12 x 10 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 10 column names 'BC1', 'BC2', 'BC3' ... ]]
#>
#> [1,] 1 2 1 2 1 1 2 1 2 1
#> [2,] 1 . . . . . . . . .
#> [3,] . 2 . . . . 2 . . .
#> [4,] . . . . 1 . . . . 1
#> [5,] 2 1 . . . . 1 . . .
#> [6,] 2 1 1 2 . 1 1 1 2 .
#> [7,] 1 2 1 2 1 1 2 1 2 1
#> [8,] 1 . . . . . . . . .
#> [9,] . 2 . . . . 2 . . .
#> [10,] . . . . 1 . . . . 1
#> [11,] 2 1 . . . . 1 . . .
#> [12,] 2 1 1 2 . 1 1 1 2 .
Count crossovers
The countCOs
function can then find out the crossover intervals according to the Viterbi states for each cell and the number of crossovers per cell is then calculated.
Count crossovers for SNP intervals
countCOs
function will find the crossover intervals for each samples and attribute the observed crossovers from each sample to the corresponding intervals.
cocounts <- countCOs(twoSample)
The rowRanges
from the resulting data object now denotes the crossover interval and the assay
slot now contains the number of crossovers in each cell.
Now rowRanges
contains the intervals
rowRanges(cocounts)
#> GRanges object with 6 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 3201-3999 *
#> [2] chr1 4001-4499 *
#> [3] chr1 4501-5499 *
#> [4] chr2 3201-3999 *
#> [5] chr2 4001-4499 *
#> [6] chr2 4501-5499 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome; no seqlengths
The colData slot still contains the annotations of each cell.
colData(cocounts)
#> DataFrame with 10 rows and 2 columns
#> barcodes sampleGroup
#> <character> <character>
#> BC1 BC1 s1
#> BC2 BC2 s1
#> BC3 BC3 s1
#> BC4 BC4 s1
#> BC5 BC5 s1
#> BCa BCa s2
#> BCb BCb s2
#> BCc BCc s2
#> BCd BCd s2
#> BCe BCe s2
Calculate genetic distances
The genetic distances can be calculated by using mapping functions such as the Kosambi or Haldane and assay
slot contains the number of crossovers found in each sample across these intervals.
assay(cocounts)
#> DataFrame with 6 rows and 10 columns
#> BC1 BC2 BC3 BC4 BC5 BCa BCb
#> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
#> 1 0.347826 0.000000 0 0 0 0 0.000000
#> 2 0.217391 0.333333 0 0 0 0 0.333333
#> 3 0.434783 0.666667 0 0 0 0 0.666667
#> 4 0.347826 0.000000 0 0 0 0 0.000000
#> 5 0.217391 0.333333 0 0 0 0 0.333333
#> 6 0.434783 0.666667 0 0 0 0 0.666667
#> BCc BCd BCe
#> <numeric> <numeric> <numeric>
#> 1 0 0 0
#> 2 0 0 0
#> 3 0 0 0
#> 4 0 0 0
#> 5 0 0 0
#> 6 0 0 0
Calculate genetic distances
calGeneticDist
function will convert the raw crossover frequencies to genetic distances via selected mapping function (ie. kosambi or haldane).
dist_gr <- calGeneticDist(co_count = cocounts, mapping_fun = "k")
dist_gr
#> class: RangedSummarizedExperiment
#> dim: 6 10
#> metadata(0):
#> assays(1): co_count
#> rownames: NULL
#> rowData names(2): raw_rate kosambi
#> colnames(10): BC1 BC2 ... BCd BCe
#> colData names(2): barcodes sampleGroup
The genetic distances for each interval are stored in rowData
.
rowData(dist_gr)
#> DataFrame with 6 rows and 2 columns
#> raw_rate kosambi
#> <numeric> <numeric>
#> 1 0.0347826 3.48389
#> 2 0.0884058 8.93447
#> 3 0.1768116 18.47894
#> 4 0.0347826 3.48389
#> 5 0.0884058 8.93447
#> 6 0.1768116 18.47894
The above genetic distances have been calculated using all samples. We can also specify the group factor so that we can calculate genetic distances for different sample groups:
## sampleGroup is a column in the colData slot
dist_gr <- calGeneticDist(co_count = cocounts,
group_by = "sampleGroup",
mapping_fun = "k")
Now the group/Sample specific distances are stored in rowData
rowData(dist_gr)$kosambi
#> s1 s2
#> [1,] 7.001937 0.00000
#> [2,] 11.198036 6.70660
#> [3,] 23.647496 13.66359
#> [4,] 7.001937 0.00000
#> [5,] 11.198036 6.70660
#> [6,] 23.647496 13.66359
Plot whole genome genetic distances
We construct a GRange
object from the dist_gr
first.
p_gr <- granges(dist_gr)
mcols(p_gr) <- rowData(dist_gr)$kosambi
We can plot whole-genome genetic distances
plotWholeGenome(p_gr)
We can also do per chromosome plot
plotGeneticDist(p_gr,chr = "chr1",cumulative = TRUE)
Group differences
bootstrapDist
function generates the sampling distribution of the difference in total genetic distances for the two groups under comparisons.
set.seed(100)
bootsDiff <- bootstrapDist(co_gr = cocounts,B=10,
group_by = "sampleGroup")
hist(bootsDiff)
From the bootsDiff
data object, we can find a 95% confidence interval to test whether the difference is statistically significant.
quantile(bootsDiff,c(0.025,0.975),na.rm =TRUE)
#> 2.5% 97.5%
#> -63.57753 185.35635
An alternative re-sampling method, permuteDist
, can be applied to generate a null distribution for the group difference by reshuffling the group labels across the samples.
set.seed(100)
perms <- permuteDist(cocounts,B=1000,group_by = "sampleGroup")
A p-value can be calculated using the statmod::permp
function (Phipson and Smyth 2010).
statmod::permp(x = sum(perms$permutes>=perms$observed_diff,
na.rm = TRUE),
nperm = sum(!is.na(perms$permutes)),
n1 = perms$nSample[1],
n2 = perms$nSample[2])
#> [1] 0.5045233
Session info
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SummarizedExperiment_1.36.0 Biobase_2.66.0
#> [3] MatrixGenerics_1.18.0 matrixStats_1.4.1
#> [5] BiocParallel_1.40.0 GenomicRanges_1.58.0
#> [7] GenomeInfoDb_1.42.0 IRanges_2.40.0
#> [9] S4Vectors_0.44.0 BiocGenerics_0.52.0
#> [11] comapr_1.10.0 BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 shape_1.4.6.1 rstudioapi_0.17.1
#> [4] jsonlite_1.8.9 magrittr_2.0.3 magick_2.8.5
#> [7] GenomicFeatures_1.58.0 farver_2.1.2 rmarkdown_2.28
#> [10] GlobalOptions_0.1.2 BiocIO_1.16.0 zlibbioc_1.52.0
#> [13] vctrs_0.6.5 memoise_2.0.1 Rsamtools_2.22.0
#> [16] RCurl_1.98-1.16 base64enc_0.1-3 tinytex_0.53
#> [19] htmltools_0.5.8.1 S4Arrays_1.6.0 progress_1.2.3
#> [22] curl_5.2.3 SparseArray_1.6.0 Formula_1.2-5
#> [25] sass_0.4.9 bslib_0.8.0 htmlwidgets_1.6.4
#> [28] plyr_1.8.9 Gviz_1.50.0 httr2_1.0.5
#> [31] plotly_4.10.4 cachem_1.1.0 GenomicAlignments_1.42.0
#> [34] iterators_1.0.14 lifecycle_1.0.4 pkgconfig_2.0.3
#> [37] Matrix_1.7-1 R6_2.5.1 fastmap_1.2.0
#> [40] GenomeInfoDbData_1.2.13 digest_0.6.37 colorspace_2.1-1
#> [43] AnnotationDbi_1.68.0 Hmisc_5.2-0 RSQLite_2.3.7
#> [46] labeling_0.4.3 filelock_1.0.3 fansi_1.0.6
#> [49] httr_1.4.7 abind_1.4-8 compiler_4.4.1
#> [52] withr_3.0.2 bit64_4.5.2 htmlTable_2.4.3
#> [55] backports_1.5.0 DBI_1.2.3 highr_0.11
#> [58] biomaRt_2.62.0 rappdirs_0.3.3 DelayedArray_0.32.0
#> [61] rjson_0.2.23 tools_4.4.1 foreign_0.8-87
#> [64] nnet_7.3-19 glue_1.8.0 restfulr_0.0.15
#> [67] grid_4.4.1 checkmate_2.3.2 reshape2_1.4.4
#> [70] cluster_2.1.6 generics_0.1.3 gtable_0.3.6
#> [73] BSgenome_1.74.0 tidyr_1.3.1 ensembldb_2.30.0
#> [76] data.table_1.16.2 hms_1.1.3 xml2_1.3.6
#> [79] utf8_1.2.4 XVector_0.46.0 foreach_1.5.2
#> [82] pillar_1.9.0 stringr_1.5.1 circlize_0.4.16
#> [85] dplyr_1.1.4 BiocFileCache_2.14.0 lattice_0.22-6
#> [88] rtracklayer_1.66.0 bit_4.5.0 deldir_2.0-4
#> [91] biovizBase_1.54.0 tidyselect_1.2.1 Biostrings_2.74.0
#> [94] knitr_1.48 gridExtra_2.3 bookdown_0.41
#> [97] ProtGenerics_1.38.0 xfun_0.48 statmod_1.5.0
#> [100] stringi_1.8.4 UCSC.utils_1.2.0 lazyeval_0.2.2
#> [103] yaml_2.3.10 evaluate_1.0.1 codetools_0.2-20
#> [106] interp_1.1-6 tibble_3.2.1 BiocManager_1.30.25
#> [109] cli_3.6.3 rpart_4.1.23 munsell_0.5.1
#> [112] jquerylib_0.1.4 dichromat_2.0-0.1 Rcpp_1.0.13
#> [115] dbplyr_2.5.0 png_0.1-8 XML_3.99-0.17
#> [118] parallel_4.4.1 ggplot2_3.5.1 blob_1.2.4
#> [121] prettyunits_1.2.0 latticeExtra_0.6-30 jpeg_0.1-10
#> [124] AnnotationFilter_1.30.0 bitops_1.0-9 viridisLite_0.4.2
#> [127] VariantAnnotation_1.52.0 scales_1.3.0 purrr_1.0.2
#> [130] crayon_1.5.3 rlang_1.1.4 KEGGREST_1.46.0
Phipson, Belinda, and Gordon K Smyth. 2010. “Permutation P-Values Should Never Be Zero: Calculating Exact P-Values When Permutations Are Randomly Drawn.” Stat. Appl. Genet. Mol. Biol. 9 (October): Article39.