Generating reference files for spliced and unspliced abundance estimation with alignment-free methods (original) (raw)

Introduction

In this vignette, we show how to prepare reference files for estimating abundances of spliced and unspliced abundances with alignment-free methods (e.g., Salmon,alevin orkallisto). Such abundances are used as input, e.g., for estimation of RNA velocity in single-cell data.

library(eisaR)

Generate feature ranges

The first step is to generate a GRangesList object containing the genomic ranges for the features of interest (spliced transcripts, and either unspliced transcripts or intron sequences). This is done using the getFeatureRanges() function, based on a reference GTF file. Here, we exemplify this with a small subset of a GTF file fromGencode release 28. We extract genomic ranges for spliced transcript and introns, where the latter are defined for each transcript separately (using the same terminology as in the BUSpaRse package). For each intron, a flanking sequence of 50 nt is added on each side to accommodate reads mapping across an exon/intron boundary.

gtf <- system.file("extdata/gencode.v28.annotation.sub.gtf", 
                   package = "eisaR")
grl <- getFeatureRanges(
  gtf = gtf,
  featureType = c("spliced", "intron"), 
  intronType = "separate", 
  flankLength = 50L, 
  joinOverlappingIntrons = FALSE, 
  verbose = TRUE
)
#> Import genomic features from the file as a GRanges object ... OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ...
#> Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of
#>   type stop_codon. This information was ignored.
#> Warning in .makeTxDb_normarg_chrominfo(chrominfo): genome version
#> information is not available for this TxDb object
#> OK
#> 'select()' returned 1:1 mapping between keys and columns
#> Extracting spliced transcript features
#> Extracting introns using the separate approach

The output from getFeatureRanges() is a GRangesList object, with all the features of interest (both spliced transcripts and introns):

grl
#> GRangesList object of length 895:
#> $ENST00000456328.2
#> GRanges object with 3 ranges and 5 metadata columns:
#>       seqnames      ranges strand |           exon_id exon_rank
#>          <Rle>   <IRanges>  <Rle> |       <character> <integer>
#>   [1]     chr1 11869-12227      + | ENSE00002234944.1         1
#>   [2]     chr1 12613-12721      + | ENSE00003582793.1         2
#>   [3]     chr1 13221-14409      + | ENSE00002312635.1         3
#>           transcript_id           gene_id        type
#>             <character>       <character> <character>
#>   [1] ENST00000456328.2 ENSG00000223972.5        exon
#>   [2] ENST00000456328.2 ENSG00000223972.5        exon
#>   [3] ENST00000456328.2 ENSG00000223972.5        exon
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $ENST00000450305.2
#> GRanges object with 6 ranges and 5 metadata columns:
#>       seqnames      ranges strand |           exon_id exon_rank
#>          <Rle>   <IRanges>  <Rle> |       <character> <integer>
#>   [1]     chr1 12010-12057      + | ENSE00001948541.1         1
#>   [2]     chr1 12179-12227      + | ENSE00001671638.2         2
#>   [3]     chr1 12613-12697      + | ENSE00001758273.2         3
#>   [4]     chr1 12975-13052      + | ENSE00001799933.2         4
#>   [5]     chr1 13221-13374      + | ENSE00001746346.2         5
#>   [6]     chr1 13453-13670      + | ENSE00001863096.1         6
#>           transcript_id           gene_id        type
#>             <character>       <character> <character>
#>   [1] ENST00000450305.2 ENSG00000223972.5        exon
#>   [2] ENST00000450305.2 ENSG00000223972.5        exon
#>   [3] ENST00000450305.2 ENSG00000223972.5        exon
#>   [4] ENST00000450305.2 ENSG00000223972.5        exon
#>   [5] ENST00000450305.2 ENSG00000223972.5        exon
#>   [6] ENST00000450305.2 ENSG00000223972.5        exon
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> $ENST00000473358.1
#> GRanges object with 3 ranges and 5 metadata columns:
#>       seqnames      ranges strand |           exon_id exon_rank
#>          <Rle>   <IRanges>  <Rle> |       <character> <integer>
#>   [1]     chr1 29554-30039      + | ENSE00001947070.1         1
#>   [2]     chr1 30564-30667      + | ENSE00001922571.1         2
#>   [3]     chr1 30976-31097      + | ENSE00001827679.1         3
#>           transcript_id           gene_id        type
#>             <character>       <character> <character>
#>   [1] ENST00000473358.1 ENSG00000243485.5        exon
#>   [2] ENST00000473358.1 ENSG00000243485.5        exon
#>   [3] ENST00000473358.1 ENSG00000243485.5        exon
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
#> 
#> ...
#> <892 more elements>

The metadata slot of the GRangesList object contains a list of the feature IDs for each type of feature:

lapply(S4Vectors::metadata(grl)$featurelist, head)
#> $spliced
#> [1] "ENST00000456328.2" "ENST00000450305.2" "ENST00000473358.1"
#> [4] "ENST00000469289.1" "ENST00000607096.1" "ENST00000606857.1"
#> 
#> $intron
#> [1] "ENST00000456328.2-I"  "ENST00000456328.2-I1"
#> [3] "ENST00000450305.2-I"  "ENST00000450305.2-I1"
#> [5] "ENST00000450305.2-I2" "ENST00000450305.2-I3"

As we can see, the intron IDs are identified by a -I suffix. Each feature is further annotated to a gene ID. For the intronic features, the corresponding gene ID also bears the -Isuffix appended to the original gene ID. Having separate gene IDs for spliced transcripts and introns allows, for example, joint quantification of spliced and unspliced versions of a gene with alevin. Adding a suffix rather than defining a completely new gene ID allows us to easily match corresponding spliced and unspliced feature IDs. To further simplify the latter, the metadata of the GRangesList object returned by getFeatureRanges() contains data.frames matching the corresponding gene IDs (as well as transcript IDs, if unspliced transcripts are extracted):

head(S4Vectors::metadata(grl)$corrgene)
#>             spliced              intron
#> 1 ENSG00000223972.5 ENSG00000223972.5-I
#> 2 ENSG00000243485.5 ENSG00000243485.5-I
#> 3 ENSG00000284332.1 ENSG00000284332.1-I
#> 4 ENSG00000268020.3 ENSG00000268020.3-I
#> 5 ENSG00000240361.2 ENSG00000240361.2-I
#> 6 ENSG00000186092.6 ENSG00000186092.6-I

Generate an expanded GTF file

In addition to extracting feature sequences, we can also export a GTF file with the full set of features. This is useful, for example, in order to generate a linked transcriptome for later import of estimated abundances with tximeta.

exportToGtf(
  grl, 
  filepath = file.path(tempdir(), "exported.gtf")
)

In the exported GTF file, each entry of grl will correspond to a “transcript” feature, and each individual range corresponds to an “exon” feature. In addition, each gene is represented as a “gene” feature.

Generate a transcript-to-gene mapping

Finally, we can export a data.frame (or a tab-separated test file) providing a conversion table between “transcript” and “gene” identifiers. This is needed to aggregate the transcript-level abundance estimates from alignment-free methods such as Salmon and kallisto to the gene level.

df <- getTx2Gene(grl)
head(df)
#>                       transcript_id           gene_id
#> ENST00000456328.2 ENST00000456328.2 ENSG00000223972.5
#> ENST00000450305.2 ENST00000450305.2 ENSG00000223972.5
#> ENST00000473358.1 ENST00000473358.1 ENSG00000243485.5
#> ENST00000469289.1 ENST00000469289.1 ENSG00000243485.5
#> ENST00000607096.1 ENST00000607096.1 ENSG00000284332.1
#> ENST00000606857.1 ENST00000606857.1 ENSG00000268020.3
tail(df)
#>                               transcript_id              gene_id
#> ENST00000304952.10-I1 ENST00000304952.10-I1 ENSG00000188290.10-I
#> ENST00000304952.10-I2 ENST00000304952.10-I2 ENSG00000188290.10-I
#> ENST00000481869.1-I     ENST00000481869.1-I ENSG00000188290.10-I
#> ENST00000484667.2-I     ENST00000484667.2-I ENSG00000188290.10-I
#> ENST00000484667.2-I1   ENST00000484667.2-I1 ENSG00000188290.10-I
#> ENST00000458555.1-I     ENST00000458555.1-I  ENSG00000224969.1-I

Session info

sessionInfo()
#> R version 4.5.0 beta (2025-04-02 r88102)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.2 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.22-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils    
#> [7] datasets  methods   base     
#> 
#> other attached packages:
#>  [1] BSgenome.Hsapiens.UCSC.hg38_1.4.5
#>  [2] BSgenome_1.77.0                  
#>  [3] BiocIO_1.19.0                    
#>  [4] Biostrings_2.77.0                
#>  [5] XVector_0.49.0                   
#>  [6] edgeR_4.7.0                      
#>  [7] limma_3.65.0                     
#>  [8] QuasR_1.49.0                     
#>  [9] Rbowtie_1.49.0                   
#> [10] rtracklayer_1.69.0               
#> [11] GenomicFeatures_1.61.0           
#> [12] AnnotationDbi_1.71.0             
#> [13] Biobase_2.69.0                   
#> [14] GenomicRanges_1.61.0             
#> [15] GenomeInfoDb_1.45.0              
#> [16] IRanges_2.43.0                   
#> [17] S4Vectors_0.47.0                 
#> [18] BiocGenerics_0.55.0              
#> [19] generics_0.1.3                   
#> [20] eisaR_1.21.0                     
#> [21] BiocStyle_2.37.0                 
#> 
#> loaded via a namespace (and not attached):
#>  [1] DBI_1.2.3                   bitops_1.0-9               
#>  [3] deldir_2.0-4                httr2_1.1.2                
#>  [5] biomaRt_2.65.0              rlang_1.1.6                
#>  [7] magrittr_2.0.3              Rhisat2_1.25.0             
#>  [9] matrixStats_1.5.0           compiler_4.5.0             
#> [11] RSQLite_2.3.9               png_0.1-8                  
#> [13] vctrs_0.6.5                 txdbmaker_1.5.0            
#> [15] stringr_1.5.1               pwalign_1.5.0              
#> [17] pkgconfig_2.0.3             crayon_1.5.3               
#> [19] fastmap_1.2.0               magick_2.8.6               
#> [21] dbplyr_2.5.0                Rsamtools_2.25.0           
#> [23] rmarkdown_2.29              UCSC.utils_1.5.0           
#> [25] tinytex_0.57                bit_4.6.0                  
#> [27] xfun_0.52                   cachem_1.1.0               
#> [29] jsonlite_2.0.0              progress_1.2.3             
#> [31] blob_1.2.4                  DelayedArray_0.35.0        
#> [33] BiocParallel_1.43.0         jpeg_0.1-11                
#> [35] prettyunits_1.2.0           R6_2.6.1                   
#> [37] VariantAnnotation_1.55.0    bslib_0.9.0                
#> [39] stringi_1.8.7               RColorBrewer_1.1-3         
#> [41] jquerylib_0.1.4             Rcpp_1.0.14                
#> [43] bookdown_0.43               SummarizedExperiment_1.39.0
#> [45] knitr_1.50                  SGSeq_1.43.0               
#> [47] igraph_2.1.4                Matrix_1.7-3               
#> [49] tidyselect_1.2.1            abind_1.4-8                
#> [51] yaml_2.3.10                 codetools_0.2-20           
#> [53] RUnit_0.4.33                hwriter_1.3.2.1            
#> [55] curl_6.2.2                  lattice_0.22-7             
#> [57] tibble_3.2.1                ShortRead_1.67.0           
#> [59] KEGGREST_1.49.0             evaluate_1.0.3             
#> [61] BiocFileCache_2.17.0        xml2_1.3.8                 
#> [63] pillar_1.10.2               BiocManager_1.30.25        
#> [65] filelock_1.0.3              MatrixGenerics_1.21.0      
#> [67] RCurl_1.98-1.17             hms_1.1.3                  
#> [69] GenomicFiles_1.45.0         glue_1.8.0                 
#> [71] tools_4.5.0                 interp_1.1-6               
#> [73] locfit_1.5-9.12             GenomicAlignments_1.45.0   
#> [75] XML_3.99-0.18               grid_4.5.0                 
#> [77] latticeExtra_0.6-30         GenomeInfoDbData_1.2.14    
#> [79] restfulr_0.0.15             cli_3.6.4                  
#> [81] rappdirs_0.3.3              S4Arrays_1.9.0             
#> [83] dplyr_1.1.4                 sass_0.4.10                
#> [85] digest_0.6.37               SparseArray_1.9.0          
#> [87] rjson_0.2.23                memoise_2.0.1              
#> [89] htmltools_0.5.8.1           lifecycle_1.0.4            
#> [91] httr_1.4.7                  statmod_1.5.0              
#> [93] bit64_4.6.0-1