A. Introduction (original) (raw)

This section illustrates how AlphaMissense data can be integrated with other Bioconductor workflows, including the GenomicRanges infrastructure and the ensembldb package / AnnotationHub resources.

Genomic ranges

The GenomicRanges infrastructure provides standard data structures that allow range-based operations across Bioconductor packages. The GPos data structure provides a convenient and memory-efficient representation of single-nucleotide variants like those in the hg19 and hg38 AlphaMissense resources. Start by installing (if necessary) the GenomicRanges package.

if (!requireNamespace("GenomicRanges", quietly = TRUE))
    BiocManager::install("GenomicRanges")

Select the hg38 data resource, and filter to a subset of variants; GPos is an in-memory data structure and can easily manage 10’s of millions of variants.

tbl <-
    am_data("hg38") |>
    filter(CHROM == "chr2", POS < 10000000, REF == "G")

Use to_GPos() to coerce to a GPos object.

gpos <-
    tbl |>
    to_GPos()
gpos
#> UnstitchedGPos object with 26751 positions and 7 metadata columns:
#>           seqnames       pos strand |         REF         ALT  uniprot_id
#>              <Rle> <integer>  <Rle> | <character> <character> <character>
#>       [1]     chr2     41613      * |           G           C      Q1W6H9
#>       [2]     chr2     41615      * |           G           C      Q1W6H9
#>       [3]     chr2     41615      * |           G           A      Q1W6H9
#>       [4]     chr2     41615      * |           G           T      Q1W6H9
#>       [5]     chr2     41618      * |           G           C      Q1W6H9
#>       ...      ...       ...    ... .         ...         ...         ...
#>   [26747]     chr2   9999028      * |           G           A      Q9NZI5
#>   [26748]     chr2   9999028      * |           G           T      Q9NZI5
#>   [26749]     chr2   9999029      * |           G           C      Q9NZI5
#>   [26750]     chr2   9999029      * |           G           A      Q9NZI5
#>   [26751]     chr2   9999029      * |           G           T      Q9NZI5
#>                transcript_id protein_variant am_pathogenicity          am_class
#>                  <character>     <character>        <numeric>       <character>
#>       [1]  ENST00000327669.5           R321G           0.0848     likely_benign
#>       [2]  ENST00000327669.5           S320C           0.0946     likely_benign
#>       [3]  ENST00000327669.5           S320F           0.1593     likely_benign
#>       [4]  ENST00000327669.5           S320Y           0.1429     likely_benign
#>       [5]  ENST00000327669.5           P319R           0.0765     likely_benign
#>       ...                ...             ...              ...               ...
#>   [26747] ENST00000324907.14           G581R           0.9997 likely_pathogenic
#>   [26748] ENST00000324907.14           G581W           0.9996 likely_pathogenic
#>   [26749] ENST00000324907.14           G581A           0.9970 likely_pathogenic
#>   [26750] ENST00000324907.14           G581E           0.9995 likely_pathogenic
#>   [26751] ENST00000324907.14           G581V           0.9994 likely_pathogenic
#>   -------
#>   seqinfo: 25 sequences (1 circular) from hg38 genome

Vignettes in the GenomicRanges package illustrate use of these objects.

utils::browseVignettes("GenomicRanges")

GRCh38 annotation resources

Start by identifying the most recent EnsDb resource for Homo Sapiens.

hub <- AnnotationHub::AnnotationHub()
AnnotationHub::query(hub, c("EnsDb", "Homo sapiens"))
#> AnnotationHub with 28 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens
#> # $rdataclass: EnsDb
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH53211"]]' 
#> 
#>              title                             
#>   AH53211  | Ensembl 87 EnsDb for Homo Sapiens 
#>   AH53715  | Ensembl 88 EnsDb for Homo Sapiens 
#>   AH56681  | Ensembl 89 EnsDb for Homo Sapiens 
#>   AH57757  | Ensembl 90 EnsDb for Homo Sapiens 
#>   AH60773  | Ensembl 91 EnsDb for Homo Sapiens 
#>   ...        ...                               
#>   AH109606 | Ensembl 109 EnsDb for Homo sapiens
#>   AH113665 | Ensembl 110 EnsDb for Homo sapiens
#>   AH116291 | Ensembl 111 EnsDb for Homo sapiens
#>   AH116860 | Ensembl 112 EnsDb for Homo sapiens
#>   AH119325 | Ensembl 113 EnsDb for Homo sapiens

AnnotationHub::AnnotationHub()["AH113665"]
#> AnnotationHub with 1 record
#> # snapshotDate(): 2025-04-08
#> # names(): AH113665
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens
#> # $rdataclass: EnsDb
#> # $rdatadateadded: 2023-04-25
#> # $title: Ensembl 110 EnsDb for Homo sapiens
#> # $description: Gene and protein annotations for Homo sapiens based on Ensem...
#> # $taxonomyid: 9606
#> # $genome: GRCh38
#> # $sourcetype: ensembl
#> # $sourceurl: http://www.ensembl.org
#> # $sourcesize: NA
#> # $tags: c("110", "Annotation", "AnnotationHubSoftware", "Coverage",
#> #   "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#> #   "Transcript") 
#> # retrieve record with 'object[["AH113665"]]'

Load the ensembldb library and retrieve the record. Unfortunately, there are many conflicts between function names in different packages, so it becomes necessary to fully resolve functions to the package where they are defined.

library(ensembldb)
edb <- AnnotationHub::AnnotationHub()[["AH113665"]]
edb
#> EnsDb for Ensembl:
#> |Backend: SQLite
#> |Db type: EnsDb
#> |Type of Gene ID: Ensembl Gene ID
#> |Supporting package: ensembldb
#> |Db created by: ensembldb package from Bioconductor
#> |script_version: 0.3.10
#> |Creation time: Mon Aug  7 09:02:07 2023
#> |ensembl_version: 110
#> |ensembl_host: 127.0.0.1
#> |Organism: Homo sapiens
#> |taxonomy_id: 9606
#> |genome_build: GRCh38
#> |DBSCHEMAVERSION: 2.2
#> |common_name: human
#> |species: homo_sapiens
#> | No. of genes: 71440.
#> | No. of transcripts: 278545.
#> |Protein data available.

From Bioconductor to DuckDB

In this section we will work more directly with the database, including writing temporary tables. For illustration purposes, we use a connection that can read and write; in practice, read-only permissions are sufficient for creating temporary tables.

db_rw <- db_connect(read_only = FALSE)

As an illustration, use ensembldb to identify the exons of the canonical transcript of a particular gene.

bcl2l11 <-
    edb |>
    ensembldb::filter(
        ~ symbol == "BCL2L11" &
            tx_biotype == "protein_coding" &
            tx_is_canonical == TRUE
    ) |>
    exonsBy("tx")
bcl2l11
#> GRangesList object of length 1:
#> $ENST00000393256
#> GRanges object with 4 ranges and 2 metadata columns:
#>       seqnames              ranges strand |         exon_id exon_rank
#>          <Rle>           <IRanges>  <Rle> |     <character> <integer>
#>   [1]        2 111120914-111121188      + | ENSE00004011574         1
#>   [2]        2 111123733-111124139      + | ENSE00001008808         2
#>   [3]        2 111150044-111150147      + | ENSE00003483971         3
#>   [4]        2 111164133-111168444      + | ENSE00001924953         4
#>   -------
#>   seqinfo: 1 sequence from GRCh38 genome

Munge the data to a tibble, updating the seqnames to a column CHROM with identifiers such as "chr1" (as in the AlphaMissense data). Write the tibble to a temporary table (it will be deleted when db is disconnected from the database) so that it can be used in ‘lazy’ SQL queries with the AlphaMissense data.

bcl2l11_tbl <-
    bcl2l11 |>
    dplyr::as_tibble() |>
    dplyr::mutate(CHROM = paste0("chr", seqnames)) |>
    dplyr::select(CHROM, everything(), -seqnames)

db_temporary_table(db_rw, bcl2l11_tbl, "bcl2l11")
#> # Source:   table<bcl2l11> [?? x 9]
#> # Database: DuckDB v1.2.1 [jwokaty@Linux 6.8.0-57-generic:R 4.5.0//home/biocbuild/.cache/R/BiocFileCache/16c8e66d5cf4d1_16c8e66d5cf4d1]
#>   CHROM group group_name          start       end width strand exon_id exon_rank
#>   <chr> <int> <chr>               <int>     <int> <int> <fct>  <chr>       <int>
#> 1 chr2      1 ENST00000393256 111120914 111121188   275 +      ENSE00…         1
#> 2 chr2      1 ENST00000393256 111123733 111124139   407 +      ENSE00…         2
#> 3 chr2      1 ENST00000393256 111150044 111150147   104 +      ENSE00…         3
#> 4 chr2      1 ENST00000393256 111164133 111168444  4312 +      ENSE00…         4

The temporary table is now available on the db_rw connection; the tables will be removed on disconnect, db_disconnect(db_rw).

"bcl2l11" %in% db_tables(db_rw)
#> [1] TRUE

Use db_range_join() to join the AlphaMissense data with the ranges defining the exons in our gene of interest. The arguments are the database connection, the AlphaMissense table of interest (this table must have columns CHROM and POS), the table containing ranges of interest (with columns CHROM, start, end), and the temporary table to contain the results. A range join is like a standard database join, expect that the constraints can be relations, in our case that POS >= start and POS <= end for each range of interest; implementation details are in this DuckDB blog. The range join uses closed intervals (the start and end positions are included in the query), following Bioconductor convention. Writing to a temporary table avoids bringing potentially large datasets into R memory, and makes the table available for subsequent manipulation in the current session.

rng <- db_range_join(db_rw, "hg38", "bcl2l11", "bcl2l11_overlaps")
rng
#> # Source:   table<bcl2l11_overlaps> [?? x 18]
#> # Database: DuckDB v1.2.1 [jwokaty@Linux 6.8.0-57-generic:R 4.5.0//home/biocbuild/.cache/R/BiocFileCache/16c8e66d5cf4d1_16c8e66d5cf4d1]
#>    CHROM       POS REF   ALT   genome uniprot_id transcript_id   protein_variant
#>    <chr>     <dbl> <chr> <chr> <chr>  <chr>      <chr>           <chr>          
#>  1 chr2  111164133 G     A     hg38   O43521     ENST0000039325… V167I          
#>  2 chr2  111164133 G     T     hg38   O43521     ENST0000039325… V167L          
#>  3 chr2  111164133 G     C     hg38   O43521     ENST0000039325… V167L          
#>  4 chr2  111164134 T     C     hg38   O43521     ENST0000039325… V167A          
#>  5 chr2  111164134 T     A     hg38   O43521     ENST0000039325… V167E          
#>  6 chr2  111164134 T     G     hg38   O43521     ENST0000039325… V167G          
#>  7 chr2  111164136 T     A     hg38   O43521     ENST0000039325… F168I          
#>  8 chr2  111164136 T     C     hg38   O43521     ENST0000039325… F168L          
#>  9 chr2  111164136 T     G     hg38   O43521     ENST0000039325… F168V          
#> 10 chr2  111164137 T     G     hg38   O43521     ENST0000039325… F168C          
#> # ℹ more rows
#> # ℹ 10 more variables: am_pathogenicity <dbl>, am_class <chr>, group <int>,
#> #   group_name <chr>, start <int>, end <int>, width <int>, strand <fct>,
#> #   exon_id <chr>, exon_rank <int>

This query takes place almost instantly. A larger query of 71M variants against 1000 ranges took about 20 seconds.

The usual database and dplyr verbs can be used to summarize the results, e.g., the number of variants in each pathogenecity class in each exon.

rng |>
    dplyr::count(exon_id, am_class) |>
    tidyr::pivot_wider(names_from = "am_class", values_from = "n")
#> # Source:   SQL [?? x 4]
#> # Database: DuckDB v1.2.1 [jwokaty@Linux 6.8.0-57-generic:R 4.5.0//home/biocbuild/.cache/R/BiocFileCache/16c8e66d5cf4d1_16c8e66d5cf4d1]
#>   exon_id         likely_pathogenic likely_benign ambiguous
#>   <chr>                       <dbl>         <dbl>     <dbl>
#> 1 ENSE00003483971                71           128        33
#> 2 ENSE00001008808               257           494       108
#> 3 ENSE00001924953                16           172        19

It is perhaps instructive to review the range join as a source of inspiration for other computations that might be of interest.

-- range join of 'hg38' with 'bcl2l11'; overwrite any existing table
-- 'bcl2l11_overlaps'
DROP TABLE IF EXISTS bcl2l11_overlaps;
CREATE TEMP TABLE bcl2l11_overlaps AS
SELECT
    hg38.*,
    bcl2l11.* EXCLUDE (CHROM)
FROM hg38
JOIN bcl2l11
    ON bcl2l11.CHROM = hg38.CHROM
    AND bcl2l11.start <= hg38.POS
    AND bcl2l11.end >= hg38.POS;

As best practice, disconnect from the writable database connection when work is complete.

From DuckDB to Bioconductor

There are likely more straight-forward ways of performing the query in the previous section, e.g., by filtering hg38 on the relevant transcript id(s), retrieving to R, and working with edb to classify variants by exon. The transcript we are interested in is "ENST00000393256".

Select the relevant variants and, because there are not too many, load into R.

variants_of_interest <-
    am_data("hg38") |>
    dplyr::filter(transcript_id %like% "ENST00000393256%")

Coerce the AlphaMissense data to a GRanges::GPos object.

gpos <-
    variants_of_interest |>
    to_GPos()
## make gpos 'genome' and 'seqlevels' like bcl2l11
GenomeInfoDb::genome(gpos) <- "GRCh38"
GenomeInfoDb::seqlevelsStyle(gpos) <- "Ensembl"
gpos
#> UnstitchedGPos object with 1298 positions and 7 metadata columns:
#>          seqnames       pos strand |         REF         ALT  uniprot_id
#>             <Rle> <integer>  <Rle> | <character> <character> <character>
#>      [1]        2 111123749      * |           G           C      O43521
#>      [2]        2 111123749      * |           G           T      O43521
#>      [3]        2 111123749      * |           G           A      O43521
#>      [4]        2 111123750      * |           C           A      O43521
#>      [5]        2 111123750      * |           C           G      O43521
#>      ...      ...       ...    ... .         ...         ...         ...
#>   [1294]        2 111164227      * |           A           G      O43521
#>   [1295]        2 111164227      * |           A           T      O43521
#>   [1296]        2 111164227      * |           A           C      O43521
#>   [1297]        2 111164228      * |           T           A      O43521
#>   [1298]        2 111164228      * |           T           G      O43521
#>              transcript_id protein_variant am_pathogenicity          am_class
#>                <character>     <character>        <numeric>       <character>
#>      [1] ENST00000393256.8             A2P           0.6610 likely_pathogenic
#>      [2] ENST00000393256.8             A2S           0.3398     likely_benign
#>      [3] ENST00000393256.8             A2T           0.7219 likely_pathogenic
#>      [4] ENST00000393256.8             A2E           0.7283 likely_pathogenic
#>      [5] ENST00000393256.8             A2G           0.3883         ambiguous
#>      ...               ...             ...              ...               ...
#>   [1294] ENST00000393256.8           H198R           0.1226     likely_benign
#>   [1295] ENST00000393256.8           H198L           0.2526     likely_benign
#>   [1296] ENST00000393256.8           H198P           0.1323     likely_benign
#>   [1297] ENST00000393256.8           H198Q           0.1781     likely_benign
#>   [1298] ENST00000393256.8           H198Q           0.1781     likely_benign
#>   -------
#>   seqinfo: 25 sequences (1 circular) from GRCh38 genome

One can then use GenomicRanges functionality, e.g., to count the number of variants in each exon.

countOverlaps(unlist(bcl2l11), gpos)
#> ENST00000393256 ENST00000393256 ENST00000393256 ENST00000393256 
#>               0             859             232             207