A. Introduction (original) (raw)
This section illustrates how AlphaMissense data can be integrated with other Bioconductor workflows, including the GenomicRanges infrastructure and the ensembldb package / AnnotationHub resources.
Genomic ranges
The GenomicRanges infrastructure provides standard data structures that allow range-based operations across Bioconductor packages. The GPos
data structure provides a convenient and memory-efficient representation of single-nucleotide variants like those in the hg19
and hg38
AlphaMissense resources. Start by installing (if necessary) the GenomicRanges package.
if (!requireNamespace("GenomicRanges", quietly = TRUE))
BiocManager::install("GenomicRanges")
Select the hg38
data resource, and filter to a subset of variants; GPos
is an in-memory data structure and can easily manage 10’s of millions of variants.
tbl <-
am_data("hg38") |>
filter(CHROM == "chr2", POS < 10000000, REF == "G")
Use to_GPos()
to coerce to a GPos
object.
gpos <-
tbl |>
to_GPos()
gpos
#> UnstitchedGPos object with 26751 positions and 7 metadata columns:
#> seqnames pos strand | REF ALT uniprot_id
#> <Rle> <integer> <Rle> | <character> <character> <character>
#> [1] chr2 41613 * | G C Q1W6H9
#> [2] chr2 41615 * | G C Q1W6H9
#> [3] chr2 41615 * | G A Q1W6H9
#> [4] chr2 41615 * | G T Q1W6H9
#> [5] chr2 41618 * | G C Q1W6H9
#> ... ... ... ... . ... ... ...
#> [26747] chr2 9999028 * | G A Q9NZI5
#> [26748] chr2 9999028 * | G T Q9NZI5
#> [26749] chr2 9999029 * | G C Q9NZI5
#> [26750] chr2 9999029 * | G A Q9NZI5
#> [26751] chr2 9999029 * | G T Q9NZI5
#> transcript_id protein_variant am_pathogenicity am_class
#> <character> <character> <numeric> <character>
#> [1] ENST00000327669.5 R321G 0.0848 likely_benign
#> [2] ENST00000327669.5 S320C 0.0946 likely_benign
#> [3] ENST00000327669.5 S320F 0.1593 likely_benign
#> [4] ENST00000327669.5 S320Y 0.1429 likely_benign
#> [5] ENST00000327669.5 P319R 0.0765 likely_benign
#> ... ... ... ... ...
#> [26747] ENST00000324907.14 G581R 0.9997 likely_pathogenic
#> [26748] ENST00000324907.14 G581W 0.9996 likely_pathogenic
#> [26749] ENST00000324907.14 G581A 0.9970 likely_pathogenic
#> [26750] ENST00000324907.14 G581E 0.9995 likely_pathogenic
#> [26751] ENST00000324907.14 G581V 0.9994 likely_pathogenic
#> -------
#> seqinfo: 25 sequences (1 circular) from hg38 genome
Vignettes in the GenomicRanges package illustrate use of these objects.
utils::browseVignettes("GenomicRanges")
GRCh38 annotation resources
Start by identifying the most recent EnsDb resource for Homo Sapiens.
hub <- AnnotationHub::AnnotationHub()
AnnotationHub::query(hub, c("EnsDb", "Homo sapiens"))
#> AnnotationHub with 28 records
#> # snapshotDate(): 2025-04-08
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens
#> # $rdataclass: EnsDb
#> # additional mcols(): taxonomyid, genome, description,
#> # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> # rdatapath, sourceurl, sourcetype
#> # retrieve records with, e.g., 'object[["AH53211"]]'
#>
#> title
#> AH53211 | Ensembl 87 EnsDb for Homo Sapiens
#> AH53715 | Ensembl 88 EnsDb for Homo Sapiens
#> AH56681 | Ensembl 89 EnsDb for Homo Sapiens
#> AH57757 | Ensembl 90 EnsDb for Homo Sapiens
#> AH60773 | Ensembl 91 EnsDb for Homo Sapiens
#> ... ...
#> AH109606 | Ensembl 109 EnsDb for Homo sapiens
#> AH113665 | Ensembl 110 EnsDb for Homo sapiens
#> AH116291 | Ensembl 111 EnsDb for Homo sapiens
#> AH116860 | Ensembl 112 EnsDb for Homo sapiens
#> AH119325 | Ensembl 113 EnsDb for Homo sapiens
AnnotationHub::AnnotationHub()["AH113665"]
#> AnnotationHub with 1 record
#> # snapshotDate(): 2025-04-08
#> # names(): AH113665
#> # $dataprovider: Ensembl
#> # $species: Homo sapiens
#> # $rdataclass: EnsDb
#> # $rdatadateadded: 2023-04-25
#> # $title: Ensembl 110 EnsDb for Homo sapiens
#> # $description: Gene and protein annotations for Homo sapiens based on Ensem...
#> # $taxonomyid: 9606
#> # $genome: GRCh38
#> # $sourcetype: ensembl
#> # $sourceurl: http://www.ensembl.org
#> # $sourcesize: NA
#> # $tags: c("110", "Annotation", "AnnotationHubSoftware", "Coverage",
#> # "DataImport", "EnsDb", "Ensembl", "Gene", "Protein", "Sequencing",
#> # "Transcript")
#> # retrieve record with 'object[["AH113665"]]'
Load the ensembldb library and retrieve the record. Unfortunately, there are many conflicts between function names in different packages, so it becomes necessary to fully resolve functions to the package where they are defined.
library(ensembldb)
edb <- AnnotationHub::AnnotationHub()[["AH113665"]]
edb
#> EnsDb for Ensembl:
#> |Backend: SQLite
#> |Db type: EnsDb
#> |Type of Gene ID: Ensembl Gene ID
#> |Supporting package: ensembldb
#> |Db created by: ensembldb package from Bioconductor
#> |script_version: 0.3.10
#> |Creation time: Mon Aug 7 09:02:07 2023
#> |ensembl_version: 110
#> |ensembl_host: 127.0.0.1
#> |Organism: Homo sapiens
#> |taxonomy_id: 9606
#> |genome_build: GRCh38
#> |DBSCHEMAVERSION: 2.2
#> |common_name: human
#> |species: homo_sapiens
#> | No. of genes: 71440.
#> | No. of transcripts: 278545.
#> |Protein data available.
From Bioconductor to DuckDB
In this section we will work more directly with the database, including writing temporary tables. For illustration purposes, we use a connection that can read and write; in practice, read-only permissions are sufficient for creating temporary tables.
db_rw <- db_connect(read_only = FALSE)
As an illustration, use ensembldb to identify the exons of the canonical transcript of a particular gene.
bcl2l11 <-
edb |>
ensembldb::filter(
~ symbol == "BCL2L11" &
tx_biotype == "protein_coding" &
tx_is_canonical == TRUE
) |>
exonsBy("tx")
bcl2l11
#> GRangesList object of length 1:
#> $ENST00000393256
#> GRanges object with 4 ranges and 2 metadata columns:
#> seqnames ranges strand | exon_id exon_rank
#> <Rle> <IRanges> <Rle> | <character> <integer>
#> [1] 2 111120914-111121188 + | ENSE00004011574 1
#> [2] 2 111123733-111124139 + | ENSE00001008808 2
#> [3] 2 111150044-111150147 + | ENSE00003483971 3
#> [4] 2 111164133-111168444 + | ENSE00001924953 4
#> -------
#> seqinfo: 1 sequence from GRCh38 genome
Munge the data to a tibble, updating the seqnames
to a column CHROM
with identifiers such as "chr1"
(as in the AlphaMissense data). Write the tibble to a temporary table (it will be deleted when db
is disconnected from the database) so that it can be used in ‘lazy’ SQL queries with the AlphaMissense data.
bcl2l11_tbl <-
bcl2l11 |>
dplyr::as_tibble() |>
dplyr::mutate(CHROM = paste0("chr", seqnames)) |>
dplyr::select(CHROM, everything(), -seqnames)
db_temporary_table(db_rw, bcl2l11_tbl, "bcl2l11")
#> # Source: table<bcl2l11> [?? x 9]
#> # Database: DuckDB v1.2.1 [jwokaty@Linux 6.8.0-57-generic:R 4.5.0//home/biocbuild/.cache/R/BiocFileCache/16c8e66d5cf4d1_16c8e66d5cf4d1]
#> CHROM group group_name start end width strand exon_id exon_rank
#> <chr> <int> <chr> <int> <int> <int> <fct> <chr> <int>
#> 1 chr2 1 ENST00000393256 111120914 111121188 275 + ENSE00… 1
#> 2 chr2 1 ENST00000393256 111123733 111124139 407 + ENSE00… 2
#> 3 chr2 1 ENST00000393256 111150044 111150147 104 + ENSE00… 3
#> 4 chr2 1 ENST00000393256 111164133 111168444 4312 + ENSE00… 4
The temporary table is now available on the db_rw
connection; the tables will be removed on disconnect, db_disconnect(db_rw)
.
"bcl2l11" %in% db_tables(db_rw)
#> [1] TRUE
Use db_range_join()
to join the AlphaMissense data with the ranges defining the exons in our gene of interest. The arguments are the database connection, the AlphaMissense table of interest (this table must have columns CHROM
and POS
), the table containing ranges of interest (with columns CHROM
, start
, end
), and the temporary table to contain the results. A range join is like a standard database join, expect that the constraints can be relations, in our case that POS >= start
and POS <= end
for each range of interest; implementation details are in this DuckDB blog. The range join uses closed intervals (the start and end positions are included in the query), following Bioconductor convention. Writing to a temporary table avoids bringing potentially large datasets into R memory, and makes the table available for subsequent manipulation in the current session.
rng <- db_range_join(db_rw, "hg38", "bcl2l11", "bcl2l11_overlaps")
rng
#> # Source: table<bcl2l11_overlaps> [?? x 18]
#> # Database: DuckDB v1.2.1 [jwokaty@Linux 6.8.0-57-generic:R 4.5.0//home/biocbuild/.cache/R/BiocFileCache/16c8e66d5cf4d1_16c8e66d5cf4d1]
#> CHROM POS REF ALT genome uniprot_id transcript_id protein_variant
#> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 chr2 111164133 G A hg38 O43521 ENST0000039325… V167I
#> 2 chr2 111164133 G T hg38 O43521 ENST0000039325… V167L
#> 3 chr2 111164133 G C hg38 O43521 ENST0000039325… V167L
#> 4 chr2 111164134 T C hg38 O43521 ENST0000039325… V167A
#> 5 chr2 111164134 T A hg38 O43521 ENST0000039325… V167E
#> 6 chr2 111164134 T G hg38 O43521 ENST0000039325… V167G
#> 7 chr2 111164136 T A hg38 O43521 ENST0000039325… F168I
#> 8 chr2 111164136 T C hg38 O43521 ENST0000039325… F168L
#> 9 chr2 111164136 T G hg38 O43521 ENST0000039325… F168V
#> 10 chr2 111164137 T G hg38 O43521 ENST0000039325… F168C
#> # ℹ more rows
#> # ℹ 10 more variables: am_pathogenicity <dbl>, am_class <chr>, group <int>,
#> # group_name <chr>, start <int>, end <int>, width <int>, strand <fct>,
#> # exon_id <chr>, exon_rank <int>
This query takes place almost instantly. A larger query of 71M variants against 1000 ranges took about 20 seconds.
The usual database and dplyr verbs can be used to summarize the results, e.g., the number of variants in each pathogenecity class in each exon.
rng |>
dplyr::count(exon_id, am_class) |>
tidyr::pivot_wider(names_from = "am_class", values_from = "n")
#> # Source: SQL [?? x 4]
#> # Database: DuckDB v1.2.1 [jwokaty@Linux 6.8.0-57-generic:R 4.5.0//home/biocbuild/.cache/R/BiocFileCache/16c8e66d5cf4d1_16c8e66d5cf4d1]
#> exon_id likely_pathogenic likely_benign ambiguous
#> <chr> <dbl> <dbl> <dbl>
#> 1 ENSE00003483971 71 128 33
#> 2 ENSE00001008808 257 494 108
#> 3 ENSE00001924953 16 172 19
It is perhaps instructive to review the range join as a source of inspiration for other computations that might be of interest.
-- range join of 'hg38' with 'bcl2l11'; overwrite any existing table
-- 'bcl2l11_overlaps'
DROP TABLE IF EXISTS bcl2l11_overlaps;
CREATE TEMP TABLE bcl2l11_overlaps AS
SELECT
hg38.*,
bcl2l11.* EXCLUDE (CHROM)
FROM hg38
JOIN bcl2l11
ON bcl2l11.CHROM = hg38.CHROM
AND bcl2l11.start <= hg38.POS
AND bcl2l11.end >= hg38.POS;
As best practice, disconnect from the writable database connection when work is complete.
From DuckDB to Bioconductor
There are likely more straight-forward ways of performing the query in the previous section, e.g., by filtering hg38
on the relevant transcript id(s), retrieving to R, and working with edb
to classify variants by exon. The transcript we are interested in is "ENST00000393256"
.
Select the relevant variants and, because there are not too many, load into R.
variants_of_interest <-
am_data("hg38") |>
dplyr::filter(transcript_id %like% "ENST00000393256%")
Coerce the AlphaMissense data to a GRanges::GPos
object.
gpos <-
variants_of_interest |>
to_GPos()
## make gpos 'genome' and 'seqlevels' like bcl2l11
GenomeInfoDb::genome(gpos) <- "GRCh38"
GenomeInfoDb::seqlevelsStyle(gpos) <- "Ensembl"
gpos
#> UnstitchedGPos object with 1298 positions and 7 metadata columns:
#> seqnames pos strand | REF ALT uniprot_id
#> <Rle> <integer> <Rle> | <character> <character> <character>
#> [1] 2 111123749 * | G C O43521
#> [2] 2 111123749 * | G T O43521
#> [3] 2 111123749 * | G A O43521
#> [4] 2 111123750 * | C A O43521
#> [5] 2 111123750 * | C G O43521
#> ... ... ... ... . ... ... ...
#> [1294] 2 111164227 * | A G O43521
#> [1295] 2 111164227 * | A T O43521
#> [1296] 2 111164227 * | A C O43521
#> [1297] 2 111164228 * | T A O43521
#> [1298] 2 111164228 * | T G O43521
#> transcript_id protein_variant am_pathogenicity am_class
#> <character> <character> <numeric> <character>
#> [1] ENST00000393256.8 A2P 0.6610 likely_pathogenic
#> [2] ENST00000393256.8 A2S 0.3398 likely_benign
#> [3] ENST00000393256.8 A2T 0.7219 likely_pathogenic
#> [4] ENST00000393256.8 A2E 0.7283 likely_pathogenic
#> [5] ENST00000393256.8 A2G 0.3883 ambiguous
#> ... ... ... ... ...
#> [1294] ENST00000393256.8 H198R 0.1226 likely_benign
#> [1295] ENST00000393256.8 H198L 0.2526 likely_benign
#> [1296] ENST00000393256.8 H198P 0.1323 likely_benign
#> [1297] ENST00000393256.8 H198Q 0.1781 likely_benign
#> [1298] ENST00000393256.8 H198Q 0.1781 likely_benign
#> -------
#> seqinfo: 25 sequences (1 circular) from GRCh38 genome
One can then use GenomicRanges functionality, e.g., to count the number of variants in each exon.
countOverlaps(unlist(bcl2l11), gpos)
#> ENST00000393256 ENST00000393256 ENST00000393256 ENST00000393256
#> 0 859 232 207