Some thoughts about improving the current implementation · Issue #8 · Bioconductor/SQLDataFrame
Problem
Currently, the SQLDataFrame does not support all aspects of the DataFrame API, which makes it difficult to plug and play into existing workflows. For example, if I add a new column, it currently fails. (In fact, the messages suggest that it's trying to save the entire data frame into a new SQLite file, which could be very inefficient for large tables.)
library(SQLDataFrame)
example(SQLDataFrame, echo=FALSE)
obj$foo <- runif(nrow(obj))
Error in coerce2(c(x, value), x) :
coercion of DFrame object to SQLDataFrame didn't preserve its
dimensions
The same goes for other operations not currently/correctly supported by SQLDataFrame:

# Other operations also don't work as expected
colnames(obj) <- LETTERS[1:5]    # does nothing
rownames(obj) <- obj$state       # does nothing
cbind(obj, obj)                  # realizes everything into memory
obj$state <- tolower(obj$state)  # fails
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘replaceCOLS’ for signature ‘"SQLDataFrame"’
Proposed solution
After briefly working on https://github.com/LTLA/ParquetDataFrame, I can start to see some common design patterns for out-of-memory DataFrame representations. So far, the most useful one has been to define a column vector class based on DelayedArray, where column access returns a delayed vector instead of realizing the column's contents in memory.

This is mildly useful for end users as it defers any realization in limited-memory scenarios. However, its true value lies in providing a coherent response when the user does something to a ParquetDataFrame that causes it to no longer be consistent with the underlying Parquet file, e.g., adding a new column, replacing a column's contents, binding it with another DF... In such cases, the ParquetDataFrame collapses to a DFrame of ParquetColumnVector objects, providing users with the full DFrame functionality while preserving the file-backed nature of each surviving column.
The same philosophy can be applied to SQLDataFrames. For example, if I want to add a new column that wasn't present in the SQL table, we would collapse the SQLDataFrame to a DFrame of SQLColumnVectors plus the new column. This ensures that the full DataFrame API is supported, avoids any realization of data from the SQL file, and avoids any writes to the file system. We could also handle all the other operations, e.g., row/colname setting, cbinding, column mutation, and so on, some of which would manifest as DelayedArray-implemented operations on the SQLColumnVector objects.
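As a rough sketch of what this collapse might look like, consider the hypothetical helper below (collapseToDFrame is my own name, not part of any package; it leans on the SqliteColumnSeed prototype and the test.sqlite mock-up from the sections that follow):

library(S4Vectors)

# Hypothetical helper: rebuild the table as a DFrame of file-backed
# delayed columns, after which DFrame's own methods handle mutation.
collapseToDFrame <- function(path, table, columns) {
    cols <- lapply(columns, function(col)
        DelayedArray(SqliteColumnSeed(path, table, col)))
    names(cols) <- columns
    do.call(DataFrame, cols)
}

# Adding a column no longer touches the file: the existing columns stay
# file-backed and only the new one lives in memory.
df <- collapseToDFrame("test.sqlite", "mtcars", c("mpg", "carb"))
df$foo <- runif(nrow(df))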
Implementation
So, on that note, I wrote a prototype of a SQL-based DelayedArray vector:
library(RSQLite)
library(DelayedArray)
setClass("SqliteColumnSeed", slots=c( path="character", table="character", column="character", length="integer", type="character" ))
SqliteColumnSeed <- function(path, table, column, length=NULL, type=NULL) {
    if (is.null(length) || is.null(type)) {
        con <- dbConnect(RSQLite::SQLite(), path)
        on.exit(dbDisconnect(con))
        if (is.null(type)) {
            # Peek at a single value to infer the column's R type.
            out <- dbGetQuery(con, paste0("SELECT ", column, " FROM ", table, " LIMIT 1"))
            type <- typeof(out[,column])
        }
        if (is.null(length)) {
            # COUNT(*) rather than COUNT(column), so that NULLs still
            # count towards the column length.
            length <- dbGetQuery(con, paste0("SELECT COUNT(*) FROM ", table))[,1]
        }
    }
    new("SqliteColumnSeed", path=path, table=table, column=column,
        length=as.integer(length), type=type)
}
setMethod("dim", "SqliteColumnSeed", function(x) x@length)
setMethod("type", "SqliteColumnSeed", function(x) x@type)
setMethod("extract_array", "SqliteColumnSeed", function(x, index) { con <- dbConnect(RSQLite::SQLite(), x@path)
i <- index[[1]]
if (is.null(i)) {
res <- dbGetQuery(con, paste0("SELECT ", column, " FROM ", table))
} else {
dbWriteTable(con, "tmp_indices", data.frame(indices=i), temporary=TRUE)
res <- dbGetQuery(con, sprintf(
"SELECT x.%s FROM
(SELECT %s, ROW_NUMBER () OVER (ORDER BY 1) AS row FROM %s) x
INNER JOIN tmp_indices ON tmp_indices.indices = x.row",
x@column, x@column, x@table)
)
}
array(res[,x@column])
})
setClass("SqliteColumnVector", contains="DelayedArray", slots=c(seed="SqliteColumnSeed"))
setMethod("DelayedArray", "SqliteColumnSeed", function(seed) new("SqliteColumnVector", seed=seed))
There are a couple of rough corners in the code above, but hopefully you get the main idea.
Usage
This is as simple as:
# Mocking up a read-only DB file
con <- dbConnect(RSQLite::SQLite(), "test.sqlite")
dbWriteTable(con, "mtcars", mtcars)
dbDisconnect(con)
system("chmod 444 test.sqlite")
# Creating an instance.
seed <- SqliteColumnSeed("test.sqlite", "mtcars", "carb")
col <- DelayedArray(seed)
col
<32> SqliteColumnVector object of type "double":
[1] [2] [3] . [31] [32]
4 4 1 . 8 2
We could make a whole DFrame:
df <- DataFrame(carb = col)
df
DataFrame with 32 rows and 1 column
carb
1 4
2 4
3 1
4 1
5 2
... ...
28 2
29 4
30 6
31 8
32 2
Then, the SQLDataFrame would just be a collection of SqliteColumnVector objects. No need to reproduce the delayed machinery; we just let DelayedArray take care of all of that when operations are applied to each column, e.g., row subsetting:
df[1:10,,drop=FALSE]
DataFrame with 10 rows and 1 column
carb
1 4
2 4
3 1
4 1
5 2
6 1
7 4
8 2
9 2
10 4
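Column mutation from the Problem section would also just work here: the assignment below is captured as a delayed operation rather than a write to test.sqlite, so the data stays file-backed (a small sketch continuing the example above):

df$carb <- df$carb * 2   # recorded as a delayed op, nothing is realized
showtree(df$carb)        # DelayedArray's showtree() reveals the pending operation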
More ambitious developers could even specialize [ when x is a SQLDataFrame and i is a SQLColumnVector, to perform some predicate pushdown for greater efficiency. Though this may be a fragile performance boost, as most Bioconductor data structures will probably normalize the subscripts to a regular vector before passing them on to components.
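To make that concrete, here is a minimal, hypothetical illustration of the underlying idea; a real [ method would also need identifier quoting, ordering guarantees, and multi-column predicates, none of which is handled here:

# Hypothetical: translate a simple comparison into a WHERE clause so
# the filtering happens inside SQLite instead of in R.
filterRowsSQL <- function(seed, op, value) {
    con <- dbConnect(RSQLite::SQLite(), seed@path)
    on.exit(dbDisconnect(con))
    query <- sprintf("SELECT * FROM %s WHERE %s %s ?", seed@table, seed@column, op)
    dbGetQuery(con, query, params=list(value))
}

# e.g., rows where carb > 4, evaluated entirely in the database:
filterRowsSQL(seed, ">", 4)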