Welcome to the SQL-based analysis for large surveys project!
Analysing large complex surveys, such as the American Community Survey, using R to generate SQL code for MonetDB.
This project used to have two packages; it now has just one: sqlsurvey, for analysis of large surveys. You also need MonetDB.R, which replaces my earlier JDBC-based interface. There are other MonetDB-to-R interfaces, but they won't work with the sqlsurvey package, because we've had to extend the R DBI interface to handle concurrency problems arising from garbage collection.
Both packages require MonetDB, so installation is more complicated than just installing an R package. Under Windows it is important to use a 64-bit version of MonetDB to allow creation of large databases.
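Once MonetDB and the two packages are installed, a quick way to check the setup is to connect from R. A minimal sketch, assuming a MonetDB server is already running and serving a database called "demo" (a hypothetical name):

```r
library(DBI)
library(MonetDB.R)

## Connect to a locally running MonetDB server.
## "demo" is a hypothetical database name; substitute your own.
con <- dbConnect(MonetDB.R(), dbname = "demo")

dbListTables(con)   # sanity check: the tables the server knows about
dbDisconnect(con)
```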
Examples (small enough to play with):
- ACS 3-year data for Alabama (47k records, replicate weights): CSV data, R script
- Some blood pressure data from NHANES (18k records, linearisation variances): .RDA file of data, R script. A design-construction sketch for data like this follows the list.
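For the NHANES example the variances come from linearisation, so the design is constructed with sqlsurvey() rather than sqlrepsurvey(). A rough sketch, assuming the data are already in a MonetDB table; the database, table, key, and variable names are hypothetical, and the argument names should be checked against the package help:

```r
library(sqlsurvey)
library(MonetDB.R)

## Hypothetical setup: table "bp" in database "nhanes", with cluster
## (sdmvpsu), stratum (sdmvstra), and weight (wtmec2yr) columns, plus a
## unique integer key column "idkey" identifying rows.
nh <- sqlsurvey(id = "sdmvpsu", strata = "sdmvstra", weights = "wtmec2yr",
                database = "nhanes", table.name = "bp", key = "idkey",
                driver = MonetDB.R())

## Mean systolic blood pressure, with a linearisation standard error.
svymean(~bpxsar, nh)
```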
Useful notes:
- Some installation notes. If anyone wants to write better ones, I will put them up.
- While it is possible to read data into R and then save it into MonetDB using dbWriteTable, this is very inefficient for large files; it is better to construct the database table and read the data in directly using the MonetDB console client. Here is a script that reads the whole-US ACS 3-yr person data, which comes in four CSV files, into a table in MonetDB (a sketch of this kind of bulk load follows the list).
- Examples of setting up and using an ACS 3-yr person file.
- Here is an R transcript that reproduces some of the Census Bureau totals for the 2008-2010 ACS.
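The sketch below shows what such a bulk load looks like, driven from R over the database connection; COPY INTO is MonetDB's server-side bulk loader, so the data never pass through R, and the same SQL can be pasted into the mclient console instead. The file, table, and column names are hypothetical, and dbSendUpdate is assumed to be MonetDB.R's call for statements that return no result set:

```r
library(MonetDB.R)

con <- dbConnect(MonetDB.R(), dbname = "acs3yr")   # hypothetical database

## Create a table matching the CSV layout (only a few of the several
## hundred ACS person-file columns are shown here).
dbSendUpdate(con, "CREATE TABLE acs_person (idkey INT, agep INT, pwgtp INT)")

## One COPY INTO per CSV file; MonetDB reads the file directly on the
## server side. Repeat for the remaining three files.
dbSendUpdate(con,
  "COPY INTO acs_person FROM '/data/acs/persons1.csv'
   USING DELIMITERS ',','\\n','\"' NULL AS ''")

dbDisconnect(con)
```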
Supported analyses (a usage sketch follows the list):
For surveys using either linearisation or replicate weights:
- Means (svymean) and totals (svytotal) with standard errors, by grouping variables
- Quantiles (svyquantile)
- Thinned or hexagonally binned scatterplots (svyplot)
- Smoothers and density estimators (svysmooth)
- Linear regression models (with standard errors) (svylm)
- Contingency tables (svytable)
- Subpopulation estimates (subset)
Only with replicate weights:
- Quantiles with standard errors and confidence intervals (svyquantile)
- Loglinear models, with Rao-Scott tests, including tests for independence in 2x2 tables (svyloglin, svychisq)
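A sketch tying a few of these analyses together for a replicate-weight ACS design. The database, table, key, and analysis variables are hypothetical; the 80 replicate weights with scale 4/80 follow the Census Bureau's published ACS replication setup, and the exact argument names should be checked against the package help:

```r
library(sqlsurvey)
library(MonetDB.R)

## Replicate-weight design: full-sample weight pwgtp plus 80 replicate
## weights pwgtp1-pwgtp80, with the ACS variance scaling factor 4/80.
acs <- sqlrepsurvey(weights = "pwgtp",
                    repweights = paste0("pwgtp", 1:80),
                    scale = 4/80, rscales = rep(1, 80), mse = TRUE,
                    database = "acs3yr", table.name = "acs_person",
                    key = "idkey", driver = MonetDB.R())

svymean(~agep, acs)                       # mean age, with standard error
svyquantile(~agep, acs, quantiles = 0.5)  # median age
svylm(pincp ~ agep, design = acs)         # linear regression with SEs
```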
None of the analyses will modify any existing database table, and the R survey design objects behave as if they are passed by value, like ordinary R objects. Temporary tables are automatically dropped when the R objects referring to them are garbage-collected. The basic design ideas for the package were described in a presentation at useR! 2007, but were not developed further because of lack of demand. The American Community Survey and some medical-record surveys such as the Nationwide Inpatient Sample do represent a real need, so the project has been restarted. MonetDB turns out to be much faster than SQLite for this sort of analysis, and interactive analysis of millions of records on an ordinary desktop is quite feasible.
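As an illustration of the pass-by-value behaviour, subsetting the hypothetical acs design from the sketch above creates a new object backed by its own temporary table, leaving the original design and its database table untouched:

```r
## subset() restricts the design to a subpopulation; estimates from the
## new object use only records with agep >= 65.
elderly <- subset(acs, agep >= 65)
svymean(~pincp, elderly)

## The original design still sees every record.
svymean(~pincp, acs)

## Dropping the last reference lets garbage collection remove the
## temporary table that backed the subset.
rm(elderly)
invisible(gc())
```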
The project summary page is here.