Survey weighting and regression
In the words of Andrew Gelman, "survey weighting is a mess." Full weighting (that is, creating weights based on cell counts in the population contingency table) on all covariates is straightforward enough for estimating simple quantities like means. However, weights can be hard to calculate, and even with known weights it gets tricky to estimate more complicated quantities, such as regression coefficients, and to calculate their standard errors. Moreover, as more and more covariates are used to calculate weights, the weights can become unstable, and it may make more sense to use raking weights based on the covariates' marginal distributions rather than the full contingency table. Finally, the population covariate contingency table may not be available, in which case we are forced to use raking weights based on the covariates' marginal distributions.

An alternative to weighting is regression modeling. If you include as predictors in the regression model all covariates (including interactions) that go into making the weights, and then poststratify the regression estimates using the population contingency table distribution, you'll get the same answer (at least for mean estimates) as you would have from doing the full weighting.

But sometimes we don't want to, or can't, do the full weighting. Using raking weights, or not poststratifying regression results on the full population contingency table, can cause weighting and regression to give different estimates. Here we try to figure out when and why these two estimates differ most by working through an example.

2 The Data

The data we'll be using for this investigation are from the New York City Social Indicators Survey, a biennial survey of families conducted by Columbia University's School of Social Work (references).
The data are from two years, with 1752 responses from 1999 and 1722 responses from 2001. The quantity of interest is the proportion of respondents who consider themselves in good/excellent health and, specifically, the change in that proportion between 1999 and 2001. Other survey variables include race, gender, age, education, marital status, etc.

For exploration, we'll treat the survey data as the entire population and sample from it to compare weighting and regression results. This will allow us to compare the weighting and regression estimates with each other as well as with the true population quantities. The population proportions in 1999 and 2001 are .836 and .780, respectively, so the population change is −.056. We'll generate samples of size 100 from each of the two time periods in the data, with sampling probabilities depending on covariates.

3 Trivial Case: Weighting on One Variable

We start by examining weighting and regression results for the simple case where sampling probabilities depend on only a single categorical covariate, race. Race takes on four levels in the data. The population proportions of the four race categories for the two time periods are shown below. There is little difference in the race distribution between the two years.
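
As a rough illustration of the single-covariate setup, the following Python sketch draws a sample of 100 whose selection probabilities depend only on a four-level race variable, then compares the fully weighted mean with the poststratified estimate from a regression on race indicators. Everything numeric here is an assumption made for illustration: the population is synthetic (it is not the Social Indicators Survey data), and the race shares, health rates, and relative selection probabilities are made up. It is a minimal sketch of the equivalence for means, not the analysis carried out in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in population (NOT the Social Indicators Survey data).
N, K = 2000, 4                                   # population size, race levels
race = rng.choice(K, size=N, p=[0.40, 0.25, 0.25, 0.10])
p_good = np.array([0.88, 0.78, 0.75, 0.85])      # made-up health rates by race
health = rng.binomial(1, p_good[race])           # 1 = good/excellent health
pop_share = np.bincount(race, minlength=K) / N   # population race distribution

# Draw 100 units with selection probability depending only on race.
rel_prob = np.array([1.0, 2.0, 2.0, 3.0])[race]  # made-up relative probabilities
idx = rng.choice(N, size=100, replace=False, p=rel_prob / rel_prob.sum())
y = health[idx].astype(float)
cell = race[idx]

n_k = np.bincount(cell, minlength=K)             # sample count per race level
if np.any(n_k == 0):
    raise ValueError("empty race cell in the sample; cannot weight on it")

# Full weighting: weight = population share / sample share of the unit's cell.
w = (pop_share / (n_k / len(y)))[cell]
w_est = np.sum(w * y) / np.sum(w)

# Regression on race indicators (a saturated model), then poststratify
# the fitted cell means to the population race distribution.
X = np.eye(K)[cell]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ps_est = float(beta @ pop_share)

print(f"population mean          : {health.mean():.3f}")
print(f"unweighted sample mean   : {y.mean():.3f}")
print(f"weighted estimate        : {w_est:.3f}")
print(f"regression + poststratify: {ps_est:.3f}")
```

With a single categorical covariate the regression is saturated, so its coefficients are the sample cell means and the two estimates agree up to floating-point error. They differ from the population mean only through sampling variability.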