Get barcode from 10X Genomics scRNASeq data
library(data.table)
library(ggplot2)
library(CellBarcode)
Preprocess the CellRanger output bam file
The information in the bam file. The bam file contains the RNA sequence, the cell barcode, and the UMI of each read. We need to extract the barcode found in the RNA sequence, together with the cell barcode and the UMI.
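As a rough illustration of where these three pieces of information sit once the bam file is converted to text: in a SAM record the read sequence is the 10th tab-separated column, and CellRanger stores the raw cell barcode and raw UMI in the optional CR:Z: and UR:Z: tags. The record below is made up, and the awk command is only a sketch for inspecting a record; it is not part of CellBarcode.

```shell
# One made-up SAM record (tab-separated); the sequence, cell barcode,
# and UMI are invented for illustration.
record=$(printf 'read1\t4\t*\t0\t0\t*\t*\t0\t0\tGGATCGAAACCC\tFFFFFFFFFFFF\tCR:Z:AAACCTGAGAAACCAT\tUR:Z:GGTTCCAAGG')

# Print the read sequence (column 10) and the values of the CR and UR tags.
# Optional tags start at column 12 and may appear in any order.
fields=$(printf '%s\n' "$record" | awk -F '\t' '{
  cell = ""; umi = ""
  for (i = 12; i <= NF; i++) {
    if ($i ~ /^CR:Z:/) cell = substr($i, 6)
    if ($i ~ /^UR:Z:/) umi  = substr($i, 6)
  }
  print $10, cell, umi
}')
echo "$fields"
# prints: GGATCGAAACCC AAACCTGAGAAACCAT GGTTCCAAGG
```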
Where to find the bam file. Usually, the bam file is in the following location of the CellRanger output.
CellRanger Output folder/outs/possorted_genome_bam.bam
Why preprocess. We need a sam file as input. In some cases, we can also filter the reads to make the input file smaller and reduce the running time.
Example 1: get the sam file
samtools view possorted_genome_bam.bam > scRNASeq_10X.sam
Example 2: get a sam file that contains only unmapped reads
In most cases, the barcodes are designed not to overlap with the genome sequence, so reads carrying barcode sequences are not mapped to the reference genome. By adding one simple parameter, we can keep only the unmapped reads, which significantly reduces the running time of the barcode extraction procedure. In the following example, the scRNASeq_10X.sam file contains only the unmapped reads.
samtools view -f 4 possorted_genome_bam.bam > scRNASeq_10X.sam
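To make the effect of -f 4 concrete: the second column of a SAM record is a bitwise FLAG, and bit 0x4 marks a read as unmapped. The sketch below imitates that selection with awk on a made-up three-record SAM file, so it runs without samtools. The temporary file, read names, and sequences are invented for illustration.

```shell
# Build a tiny, made-up SAM file (tab-separated, no header). Columns:
# QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL.
toy_sam=$(mktemp)
printf 'read1\t0\tchr1\t100\t255\t8M\t*\t0\t0\tACGTACGT\tFFFFFFFF\n'  > "$toy_sam"
printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCATGC\tFFFFFFFF\n'         >> "$toy_sam"
printf 'read3\t16\tchr2\t50\t255\t8M\t*\t0\t0\tGGGTTTAA\tFFFFFFFF\n' >> "$toy_sam"

# Keep records whose FLAG has bit 0x4 set (read unmapped). This mirrors the
# selection made by `samtools view -f 4`; int($2 / 4) % 2 tests that bit.
unmapped=$(awk -F '\t' 'int($2 / 4) % 2 == 1 { print $1 }' "$toy_sam")
echo "$unmapped"
# prints: read2
rm -f "$toy_sam"
```

Only read2 survives, because reads with FLAG 0 and FLAG 16 (mapped, forward and reverse strand) do not have the unmapped bit set.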
More about the pattern
The pattern is a regular expression that tells the function where to find the barcode. In the pattern, we define the barcode backbone and mark the barcode sequence with parentheses (). For example, the pattern ATCG(.{21})TCGG says that the barcode is 21 bases long and is surrounded by the constant sequences ATCG and TCGG. The following examples show how to define the constant region and the barcode sequence.
Example 1
ATCG(.{21})
A 21-base barcode after the constant sequence “ATCG”.
Example 2
(.{15})TCGA
A 15-base barcode before the constant sequence “TCGA”.
Example 3
ATCG(.*)TCGA
A flexible-length barcode between the constant regions “ATCG” and “TCGA”.
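A quick way to check what a pattern captures before running the extraction is to try it on a single read at the command line. The sketch below applies the three example patterns to made-up sequences with sed; extract_barcode is a hypothetical helper written for this illustration, not a CellBarcode function, and it assumes the pattern contains exactly one pair of parentheses.

```shell
# Hypothetical helper: print the text captured by the parentheses in an
# extended regular expression, or nothing if the sequence does not match.
extract_barcode() {
  pattern=$1
  sequence=$2
  printf '%s\n' "$sequence" | sed -nE "s/.*${pattern}.*/\1/p"
}

# Example 1: 21 bases after the constant sequence ATCG.
extract_barcode 'ATCG(.{21})' 'GGATCGAAACCCGGGTTTAAACCCGGGCC'
# prints: AAACCCGGGTTTAAACCCGGG

# Example 2: 15 bases before the constant sequence TCGA.
extract_barcode '(.{15})TCGA' 'CCAAAAACCCCCGGGGGTCGATT'
# prints: AAAAACCCCCGGGGG

# Example 3: flexible-length barcode between ATCG and TCGA.
extract_barcode 'ATCG(.*)TCGA' 'TTATCGCCGGAATTTCGAGG'
# prints: CCGGAATT
```

The toy reads contain each constant sequence only once. If a constant sequence occurs several times in a read, sed and the regular expression engine used during barcode extraction may pick different matches, so treat this only as a sanity check of the pattern's shape.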
Need more help? For more complex barcode patterns, please ask the package author; the example will then be added here.
Session Info
sessionInfo()
#> R version 4.5.0 RC (2025-04-04 r88126)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0 LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] CellBarcode_1.14.0 ggplot2_3.5.2 data.table_1.17.0 BiocStyle_2.36.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.38.0 gtable_0.3.6
#> [3] xfun_0.52 bslib_0.9.0
#> [5] hwriter_1.3.2.1 latticeExtra_0.6-30
#> [7] Biobase_2.68.0 lattice_0.22-7
#> [9] Rdpack_2.6.4 vctrs_0.6.5
#> [11] tools_4.5.0 bitops_1.0-9
#> [13] generics_0.1.3 stats4_4.5.0
#> [15] parallel_4.5.0 tibble_3.2.1
#> [17] pkgconfig_2.0.3 Matrix_1.7-3
#> [19] RColorBrewer_1.1-3 S4Vectors_0.46.0
#> [21] lifecycle_1.0.4 GenomeInfoDbData_1.2.14
#> [23] stringr_1.5.1 deldir_2.0-4
#> [25] compiler_4.5.0 egg_0.4.5
#> [27] Rsamtools_2.24.0 Biostrings_2.76.0
#> [29] Ckmeans.1d.dp_4.3.5 munsell_0.5.1
#> [31] codetools_0.2-20 GenomeInfoDb_1.44.0
#> [33] htmltools_0.5.8.1 sass_0.4.10
#> [35] yaml_2.3.10 pillar_1.10.2
#> [37] crayon_1.5.3 jquerylib_0.1.4
#> [39] BiocParallel_1.42.0 cachem_1.1.0
#> [41] DelayedArray_0.34.0 ShortRead_1.66.0
#> [43] abind_1.4-8 tidyselect_1.2.1
#> [45] digest_0.6.37 stringi_1.8.7
#> [47] dplyr_1.1.4 bookdown_0.43
#> [49] fastmap_1.2.0 grid_4.5.0
#> [51] colorspace_2.1-1 cli_3.6.4
#> [53] SparseArray_1.8.0 magrittr_2.0.3
#> [55] S4Arrays_1.8.0 withr_3.0.2
#> [57] scales_1.3.0 UCSC.utils_1.4.0
#> [59] rmarkdown_2.29 pwalign_1.4.0
#> [61] XVector_0.48.0 httr_1.4.7
#> [63] jpeg_0.1-11 matrixStats_1.5.0
#> [65] interp_1.1-6 gridExtra_2.3
#> [67] png_0.1-8 evaluate_1.0.3
#> [69] knitr_1.50 rbibutils_2.3
#> [71] GenomicRanges_1.60.0 IRanges_2.42.0
#> [73] rlang_1.1.6 Rcpp_1.0.14
#> [75] glue_1.8.0 BiocManager_1.30.25
#> [77] BiocGenerics_0.54.0 jsonlite_2.0.0
#> [79] plyr_1.8.9 R6_2.6.1
#> [81] MatrixGenerics_1.20.0 GenomicAlignments_1.44.0