The Use of Survey Weights in Regression Analysis
Related papers
Clarifying Some Issues in the Regression Analysis of Survey Data
The literature offers two distinct reasons for incorporating sample weights into the estimation of linear regression coefficients from a model-based point of view. Either the sample selection is nonignorable or the model is incomplete. The traditional sample-weighted least-squares estimator can be improved upon when the sample selection is nonignorable, but not when the standard linear model fails and needs to be extended. Conceptually, it can be helpful to view the realized sample as the result of a two-phase process. In the first phase, the finite population is drawn from a hypothetical superpopulation via simple random (cluster) sampling. In the second phase, the actual sample is drawn from the finite population. In the extended model, the parameters of this superpopulation are vague. Mean-squared-error estimation can become problematic when the primary sampling units are drawn within strata using unequal probability sampling without replacement. This remains true even under the ...
Are Survey Weights Needed? A Review of Diagnostic Tests in Regression Analysis
Annual Review of Statistics and Its Application, 2016
Researchers apply sampling weights to take account of unequal sample selection probabilities and to frame coverage errors and nonresponses. If researchers do not weight when appropriate, they risk having biased estimates. Alternatively, when they unnecessarily apply weights, they can create an inefficient estimator without reducing bias. Yet in practice researchers rarely test the necessity of weighting and are sometimes guided more by the current practice in their field than by scientific evidence. In addition, statistical tests for weighting are not widely known or available. This article reviews empirical tests to determine whether weighted analyses are justified. We focus on regression models, though the review's implications extend beyond regression. We find that nearly all weighting tests fall into two categories: difference in coefficients tests and weight association tests. We describe the distinguishing features of each category, present their properties, and explain th...
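As a rough illustration of the quantity that difference-in-coefficients tests examine, the sketch below fits the same simple linear regression with and without weights and reports the gap between the two sets of coefficients. It is a minimal sketch and not a formal test from the review: the helper names `wls_simple` and `coefficient_difference` and the toy data are assumptions made for this example, and no standard error or test statistic is computed.

```python
def wls_simple(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ybar - b * xbar, b


def coefficient_difference(x, y, w):
    """Weighted minus unweighted (intercept, slope) estimates."""
    a_u, b_u = wls_simple(x, y, [1.0] * len(x))  # unweighted fit
    a_w, b_w = wls_simple(x, y, w)               # weighted fit
    return a_w - a_u, b_w - b_u


# Hypothetical toy data: weights increase with x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
w = [1.0, 2.0, 3.0, 4.0, 5.0]

# Case 1: the linear model holds exactly, so weighting changes nothing.
y_linear = [1.0 + 2.0 * xi for xi in x]
diff_linear = coefficient_difference(x, y_linear, w)

# Case 2: the true relationship is curved, so the weighted and
# unweighted lines differ because they emphasize different x regions.
y_curved = [xi ** 2 for xi in x]
diff_curved = coefficient_difference(x, y_curved, w)

print("difference when model holds:", diff_linear)
print("difference under misspecification:", diff_curved)
```

A formal test would additionally need a variance estimate for the coefficient difference that reflects the sample design.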
Role of Weights in Descriptive and Analytical Inferences from Survey Data: An Overview
Statistical agencies generally collect data from samples drawn from well-defined finite populations using complex sampling procedures that may include stratification, clustering, multi-stage sampling, and unequal probabilities of selection. Sample design weights, defined from the sampling procedures, are often adjusted to account for non-responding units and to calibrate to known population totals of auxiliary variables. Once adjusted, these ‘final’ weights are included on the survey datasets. There has been some discussion of the necessity of using these weights in the estimation of descriptive statistics and in the analysis of data from these surveys. In this paper, we discuss the role of weights in descriptive and analytical inference. Keywords: Calibration, Design weights, Re-sampling methods, Multi-level models, Survey data analysis
Comment: Struggles with Survey Weighting and Regression Modeling
Statistical Science, 2007
Andrew Gelman's article "Struggles with survey weighting and regression modeling" addresses the question of what approach analysts should use to produce estimates (and associated estimates of variability) based on sample survey data. Gelman starts by asserting that survey weighting is a "mess." While we agree that incorporation of the survey design for regression remains challenging, with important open questions, many recent contributions to the literature have greatly clarified the situation. Examples include relatively recent contributions by Pfeffermann and Sverchkov (1999), Graubard and Korn (2002) and Little (2004). Gelman's paper is a very welcome addition to that literature. There are some understandable reasons for the current lack of resolution. First, U.S. federal statistical agencies have been historically limited by their mission statements to producing statistical summaries, primarily means, percentages, ratios and cross-classified tables of counts. This is one explanation for why Cochran (1977) and Kish (1965) devote the great majority of their classical texts to these estimates. As a result, the job of using regression and other more complex models to learn about any causal structure underlying these summary statistics was generally left to sister policy agencies and outside users. However, things are changing. The federal statistical system (whether it likes it or not) is becoming more involved with complex modeling. This includes small-area estimation (e.g., unemployment estimates and census net undercoverage estimates) and research into models combining information from surveys with administrative data. (There will also likely be increased demands to use data mining procedures on federal statistical data.) This relatively new development has
Weighting in Regression for Use in Survey Methodology
From InterStat, April 1997, http://interstat.statjournals.net/ - Weighted linear regression models have been developed for use in the estimation of totals and variances for survey data. (Consider works by Brewer, and by Royall and Cumberland, etc.) Weighted linear regression models have also been developed for prediction and variance studies in analyses of physical and biological data. (Consider works by Carroll and Ruppert, etc.) There are similarities and differences between these approaches. This paper considers this and deduces further implications for survey methodology. Application guidelines are discussed.
Update: One should use predicted-y as the size measure, not any individual predictor in multiple regression. See https://www.academia.edu/35049395/Essential_Heteroscedasticity, commenting on a reference by Ken Brewer. For more information, see https://www.researchgate.net/publication/354854317_WHEN_WOULD_HETEROSCEDASTICITY_IN_REGRESSION_OCCUR.
Note: Early in the paper, gamma = "w" was not a good choice for notation. Sorry.
A design-sensitive approach to fitting regression models with complex survey data
Statistics Surveys
Fitting complex survey data to regression equations is explored under a design-sensitive model-based framework. A robust version of the standard model assumes that the expected value of the difference between the dependent variable and its model-based prediction is zero no matter what the values of the explanatory variables. The extended model assumes only that the difference is uncorrelated with the covariates. Little is assumed about the error structure of this difference under either model other than independence across primary sampling units. The standard model often fails in practice, but the extended model very rarely does. Under this framework some of the methods developed in the conventional design-based, pseudo-maximum-likelihood framework, such as fitting weighted estimating equations and sandwich mean-squared-error estimation, are retained but their interpretations change. Few of the ideas here are new to the refereed literature. The goal instead is to collect those ideas and put them into a unified conceptual framework.
Chapter 19 Statistical analysis of survey data
The fact that survey data are obtained from units selected with complex sample designs needs to be taken into account in the survey analysis: weights need to be used in analyzing survey data and variances of survey estimates need to be computed in a manner that reflects the complex sample design. This chapter outlines the development of weights and their use in computing survey estimates and provides a general discussion of variance estimation for survey data. It deals first with what are termed "descriptive" estimates, such as the totals, means, and proportions that are widely used in survey reports. It then discusses three forms of "analytic" uses of survey data that can be used to examine relationships between survey variables, namely multiple linear regression models, logistic regression models and multi-level models. These models form a set of valuable tools for analyzing the relationships between a key response variable and a number of other factors. In this chapter we give examples to illustrate the use of these modeling techniques and also provide guidance on the interpretation of the results.
Survey weighting and regression
2006
In the words of Andrew Gelman, "survey weighting is a mess." Full weighting (that is, creating weights based on cell counts in the population contingency table) on all covariates is straightforward enough for estimating simple quantities like means. However, weights can be hard to calculate, and even with known weights it gets tricky to estimate more complicated quantities, such as regression coefficients, and to calculate their standard errors. Moreover, as more and more covariates are used to calculate weights, the weights can become unstable, and it may make more sense to use raking weights based on the covariates' marginal distributions rather than the full contingency table. Finally, the population covariate contingency table may not be available, so we may be forced to use raking weights based on the covariates' marginal distributions. An alternative to weighting is to use regression modeling. If you include as predictors in the regression model all covariates (including interactions) that go into making the weights and then poststratify the regression estimates using the population contingency table distribution, you'll get the same answer (at least for mean estimates) as you would have from doing the full weighting. But sometimes we don't want to or can't do the full weighting. Using raking weights, or not poststratifying regression results based on the full population contingency table, can cause weighting and regression to give different estimates. Here we try to figure out when and why these two estimates are most different by working through an example.
2 The Data
The data we'll be using for investigation are from the New York City Social Indicators Survey, a biennial survey of families that is conducted by Columbia University's School of Social Work (references). The data are from two years, with 1752 responses from 1999 and 1722 responses from 2001, and the quantity of interest is the proportion of respondents who consider themselves in good/excellent health, specifically, the change in those proportions between 1999 and 2001. Other survey variables include race, gender, age, education, marital status, etc. For exploration, we'll treat the survey data as the entire population and sample from it to compare weighting and regression results. This will allow us to compare the weighting and regression estimates with each other as well as with the true population quantities. The population proportions in 1999 and 2001 are .836 and .780, respectively, so the population change is −.056. We'll generate samples of size 100 from each of the two time periods in the data, with sampling probabilities depending on covariates.
3 Trivial Case: Weighting on One Variable
We start by examining weighting and regression results for the simple case where sampling probabilities depend on only a single categorical covariate, race. Race takes on four levels in the data. The population proportions of the four race categories for the two time periods are shown below. There is little difference in the race distribution between the two years.
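The claim that full weighting and poststratified regression agree for mean estimates can be checked on a toy example with one four-level covariate. The sketch below is illustrative only: the cell labels, population counts, sample sizes, and outcome rates are all made-up assumptions, not values from the Social Indicators Survey. The regression uses one indicator per cell and no intercept, so the normal equations are diagonal and each coefficient is simply that cell's sample mean.

```python
import random

random.seed(0)

# Hypothetical population counts N_j for a four-level covariate.
pop_counts = {"A": 500, "B": 300, "C": 150, "D": 50}
# Hypothetical unequal sample sizes n_j (total sample size 100).
sample_sizes = {"A": 20, "B": 30, "C": 25, "D": 25}
# Hypothetical probability of reporting good/excellent health per cell.
p_good = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6}

# Simulated sample: one (cell, binary outcome) pair per respondent.
sample = [
    (cell, 1 if random.random() < p_good[cell] else 0)
    for cell, n in sample_sizes.items()
    for _ in range(n)
]

# Full weighting: each respondent in cell j gets weight N_j / n_j.
weights = [pop_counts[cell] / sample_sizes[cell] for cell, _ in sample]
weighted_mean = sum(w * y for w, (_, y) in zip(weights, sample)) / sum(weights)

# Regression on cell indicators (no intercept). X'X is diagonal with
# entries n_j and X'y holds the cell sums, so beta_j = cell sum / n_j.
cell_sums = {cell: 0 for cell in pop_counts}
for cell, y in sample:
    cell_sums[cell] += y
beta = {cell: cell_sums[cell] / sample_sizes[cell] for cell in pop_counts}

# Poststratify: average the fitted cell values using population shares.
total_pop = sum(pop_counts.values())
poststratified_mean = sum(
    pop_counts[cell] / total_pop * beta[cell] for cell in pop_counts
)

print("weighted mean:       ", weighted_mean)
print("poststratified mean: ", poststratified_mean)
```

The two estimates coincide here because the regression is saturated in the same cells that define the weights; with raking weights or a regression that omits some cells or interactions, they would generally differ.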