Some Issues in the Analysis of Complex Survey Data

Model choices for complex survey analysis

2016

Survey data are an important source of information for modern society. However, the complex structures of modern populations require sampling designs for surveys that are more complex than simple random sampling in order to be effective. With large national population surveys, the sample data collected via these designs typically include sample weights that allow analysis to take account of these complex population structures. As a consequence, these sample weights need to be taken into consideration when modelling the sample data, e.g. when the target of estimation is the coefficients of a regression model for the target population. In this situation, it is important to know whether these weights should be used when identifying an appropriate model specification and also whether they should be used when fitting this model to the survey data. Given the complexity of both model choice and model fitting and the limited literature on this issue, there is clearly scope for theoretical a...
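The question raised above, whether sample weights should enter the fit, can be made concrete with a minimal sketch. It compares unweighted least squares with weighted (pseudo-likelihood style) least squares on toy data. The function name `fit_linear` and all data values are hypothetical, chosen only for illustration; the sketch gives point estimates only and says nothing about variance estimation or model selection.

```python
import numpy as np

def fit_linear(X, y, w=None):
    """Least-squares coefficients; if w is given, each unit i is weighted by w[i]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones(len(y)) if w is None else np.asarray(w, dtype=float)
    XtW = X.T * w  # scales the column of X.T belonging to unit i by w[i]
    # Solve the weighted normal equations (X' W X) beta = X' W y.
    return np.linalg.solve(XtW @ X, XtW @ y)

# Toy data (hypothetical): the last unit has a large sample weight,
# i.e. it represents many population units.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # intercept and slope columns
y = np.array([1.0, 2.0, 2.0, 5.0])
w = np.array([1.0, 1.0, 1.0, 10.0])

beta_unweighted = fit_linear(X, y)
beta_weighted = fit_linear(X, y, w)
print("unweighted:", beta_unweighted)
print("weighted:  ", beta_weighted)
```

When the model holds exactly for every unit, the two fits coincide; they diverge when the weights are related to the response given the covariates, which is the situation in which the choice matters.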

Fitting Regression Models to Complex Survey Data: Gelman's Estimator Revisited

2011

Survey data typically have certain characteristics which must be accounted for in their analysis. These characteristics include unequal selection probabilities, clustering, and missing data (including non-response). In this paper, we focus only on unequal selection probabilities and assume full response. For a more complete discussion see, for example, Pfeffermann & Sverchkov (2009). Consider a model of interest with a response (dependent) variable $y$ and a vector of explanatory (independent) variables $x$, and let $y_i$ and $x_i$ be the values associated with unit $i$. The sample conditional distribution $f_s(y_i \mid x_i)$, defined in (1) below, is usually different from the population conditional distribution $f_p(y_i \mid x_i)$. Indeed, by Bayes' rule,
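The excerpt breaks off before equation (1). For orientation only, the relation usually stated in this literature is sketched below. The sample-inclusion indicator $I_i$ (equal to 1 when unit $i$ is in the sample) is notation assumed here, not taken from the excerpt, and the paper's own equation (1) may be written differently.

```latex
f_s(y_i \mid x_i)
  = f_p(y_i \mid x_i, I_i = 1)
  = \frac{\Pr(I_i = 1 \mid y_i, x_i)\, f_p(y_i \mid x_i)}
         {\Pr(I_i = 1 \mid x_i)}
```

Under this form the two distributions agree when $\Pr(I_i = 1 \mid y_i, x_i) = \Pr(I_i = 1 \mid x_i)$, that is, when selection does not depend on the response given the covariates.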

Applications of quasi-Monte Carlo methods in inference for complex survey data

This paper proposes a new method for estimating variances of complex survey estimators based on the recent developments in quasi-Monte Carlo methods. The method can be effectively used to create replication schemes in complex surveys with designs more complex than 2 PSU/stratum, while other methods such as the survey bootstrap carry with them a substantial computational burden, as well as somewhat larger instability.

A Closer Examination of Subpopulation Analysis of Complex-Sample Survey Data

The Stata Journal: Promoting communications on statistics and Stata, 2008

In recent years, general-purpose statistical software packages have incorporated new procedures that feature several useful options for design-based analysis of complex-sample survey data. A common and frequently desired technique for analysis of survey data in practice is the restriction of estimation to a subpopulation of interest. These subpopulations are often referred to interchangeably in a variety of fields as subclasses, subgroups, and domains. In this article, we consider two approaches that analysts of complex-sample survey data can follow when analyzing subpopulations; we also consider the implications of each approach for estimation and inference. We then present examples of both approaches, using selected procedures in Stata to analyze data from the National Hospital Ambulatory Medical Care Survey (NHAMCS). We conclude with important considerations for subpopulation analyses and a summary of suggestions for practice.
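The difference between the two approaches to subpopulation analysis can be illustrated with a minimal sketch. It uses a standard textbook approximation, the with-replacement "ultimate cluster" variance of a weighted total, which is an assumption of this example and not necessarily the procedure used in the article or in Stata. The function name and the toy data are hypothetical. One approach keeps the full design and zeroes out units outside the domain; the other subsets the data first.

```python
import numpy as np

def total_and_variance(y, w, strata, psu):
    """Weighted total and ultimate-cluster (with-replacement) variance estimate.

    Strata left with a single PSU contribute zero variance in this sketch.
    """
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    strata, psu = np.asarray(strata), np.asarray(psu)
    z = w * y
    var = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        # Weighted total of each PSU observed in stratum h.
        totals = np.array([z[in_h & (psu == j)].sum() for j in np.unique(psu[in_h])])
        n_h = len(totals)
        if n_h > 1:
            var += n_h / (n_h - 1) * np.sum((totals - totals.mean()) ** 2)
    return z.sum(), var

# Toy design (hypothetical): 2 strata, 2 PSUs per stratum, 2 units per PSU.
strata = np.array([1, 1, 1, 1, 2, 2, 2, 2])
psu = np.array([1, 1, 2, 2, 1, 1, 2, 2])
w = np.full(8, 10.0)
y = np.array([3.0, 5.0, 4.0, 6.0, 2.0, 4.0, 7.0, 1.0])
d = np.array([1, 1, 0, 0, 1, 0, 0, 1])  # 1 = unit belongs to the subpopulation

# Approach A: keep every PSU, set the analysis variable to zero outside the domain.
total_full, var_full = total_and_variance(y * d, w, strata, psu)

# Approach B: drop non-domain rows first; stratum 1 is left with one PSU.
keep = d == 1
total_sub, var_sub = total_and_variance(y[keep], w[keep], strata[keep], psu[keep])

print(total_full, var_full)
print(total_sub, var_sub)
```

In this toy case the point estimates agree, but the subset approach loses a PSU that contains no domain members and so reports a much smaller variance. This is one reason the two approaches can have different implications for inference.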

Pseudo Empirical Likelihood Confidence Intervals for Complex Sample Survey Data

2018

Complex sample survey data are obtained through multistage sampling designs that involve clustering, stratification, and non-response adjustments. Standard statistical methods such as empirical likelihood are typically not applicable to complex samples because independent, identically distributed observations seldom result from such data. Hence, we derive pseudo empirical likelihood confidence intervals for stratified single-stage and stratified multistage sampling designs. Uses of such designs include national health data sets.

Some Inferential Problems in Finite Population Sampling

2004

We review some results on estimating a finite population total (mean) through a sample survey. Section 2 considers inference under a fixed population model, and Section 3 addresses the same problem when the finite population is regarded as a sample from a superpopulation and techniques from prediction theory are used. Because the probability density function of data obtained from a sample survey equals the selection probability of the sample, the likelihood function is 'flat'; when a prior is assumed for the finite population parameters and a non-informative sampling design (s.d.) is used, use of the likelihood therefore restricts one to model-based inference. The data obtained through a set (sample) are minimal sufficient (though not complete sufficient) for inference, and hence the use of Rao-Blackwellization provides improved estimators. Noting the non-existence of a uniformly minimum variance unbiased estimator of the population total in general, we review results on the admissibility of estimators for a fixed s.d. in the relevant classes. If, however, the survey population is regarded as a sample from a superpopulation, optimum strategies are available in certain classes. Under the prediction-theoretic approach, a purposive sampling design becomes optimal under a wide class of superpopulation models. This is in direct conflict with the classical theory based on probability sampling. However, these model-dependent optimal strategies fail (incur large bias or large mean square error (mse)) if the assumed models turn out to be wrong. Use of probability sampling salvages the situation. A class of strategies that depend on both the superpopulation model and the sampling design has been suggested. Finally, the problem of asymptotically unbiased estimation of the design variance of these strategies under multiple regression superpopulation models is reviewed.

Applications of quasi-Monte Carlo methods in survey inference

This work aims at proposing a new method for estimating variances of complex survey estimators based on the recent developments in quasi-Monte Carlo methods. It can be effectively used to create replication schemes in complex surveys where the mathematically elegant schemes such as balanced repeated replications break down due to design complexities, while other methods such as the survey bootstrap carry with them a substantial computational burden, as well as somewhat larger instability.
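For context, the balanced repeated replication scheme that the abstract uses as a point of comparison can be sketched for the simple case of exactly 2 PSUs per stratum. This is a minimal illustration of classical BRR for an estimated total, not of the quasi-Monte Carlo method proposed in the paper. The function names and the PSU totals are hypothetical, and the Hadamard matrix is built with the Sylvester construction, so its order is a power of two.

```python
import numpy as np

def sylvester_hadamard(order):
    """Hadamard matrix whose order is a power of two (Sylvester construction)."""
    H = np.array([[1]])
    while H.shape[0] < order:
        H = np.kron(H, np.array([[1, 1], [1, -1]]))
    return H

def brr_variance_of_total(psu_totals):
    """BRR variance of an estimated total for a 2-PSU-per-stratum design.

    psu_totals has shape (n_strata, 2) and holds the weighted total of each PSU.
    """
    t = np.asarray(psu_totals, dtype=float)
    n_strata = t.shape[0]
    order = 1
    while order < n_strata + 1:  # need at least n_strata non-constant columns
        order *= 2
    # Rows are replicates, columns are strata; the constant first column is skipped.
    H = sylvester_hadamard(order)[:, 1:n_strata + 1]
    full_estimate = t.sum()
    # +1 keeps the first PSU of a stratum, -1 the second; the kept PSU's weight is doubled.
    picked = np.where(H == 1, t[:, 0], t[:, 1])
    replicate_estimates = 2.0 * picked.sum(axis=1)
    return np.mean((replicate_estimates - full_estimate) ** 2)

# Hypothetical weighted PSU totals for three strata.
psu_totals = np.array([[80.0, 0.0], [20.0, 10.0], [5.0, 15.0]])
brr_var = brr_variance_of_total(psu_totals)
# For a total, full balance reproduces the sum of squared within-stratum differences.
direct_var = np.sum((psu_totals[:, 0] - psu_totals[:, 1]) ** 2)
print(brr_var, direct_var)
```

The scheme depends on pairing exactly two PSUs in every stratum, which is why it stops being directly applicable once the design departs from that structure.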

Estimation of the Distribution Function with a Complex Survey

2002

Estimation of the cumulative distribution function and related statistics, such as the median and interquartile range, is considered. Large sample properties of estimators constructed from stratified cluster samples are presented. The PC CARP computer algorithm is discussed.

I. Introduction

There is an extensive literature on quantile estimation, most of it for simple random sampling. Extension of results derived under an assumption of simple random sampling from an absolutely continuous distribution to the complex sampling designs used in finite population sampling has met with limited success. Woodruff (1952) proposed using a weighted sample median to estimate the population median, where the weight assigned to each observation is proportional to the inverse of its selection probability. Using the approach taken by Maritz and Jarrett (1978), Gross (1980) derived a small-sample estimator of the variance of the weighted sample median estimator for stratified sampling wit...
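The weighted sample median described above, with weights proportional to inverse selection probabilities, can be sketched as follows. This is a minimal illustration: the function name `weighted_quantile`, the convention of taking the smallest observed value at which the weighted empirical distribution function reaches the target level, and the toy data are assumptions of this example, not details taken from the paper or from PC CARP.

```python
import numpy as np

def weighted_quantile(y, pi, q=0.5):
    """Smallest observed y at which the weighted empirical CDF reaches q.

    y  : observed values
    pi : selection probabilities; each unit is weighted by 1 / pi
    """
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)
    order = np.argsort(y)
    y_sorted, w_sorted = y[order], w[order]
    cum_w = np.cumsum(w_sorted)
    cdf = cum_w / cum_w[-1]  # weighted empirical CDF at each sorted value
    idx = min(np.searchsorted(cdf, q, side="left"), len(y_sorted) - 1)
    return y_sorted[idx]

# Toy sample (hypothetical): the largest value was selected with low probability,
# so it stands in for many population units.
y = [1.0, 2.0, 3.0, 4.0]
pi = [0.5, 0.5, 0.5, 0.1]
weighted_median = weighted_quantile(y, pi)
equal_prob_median = weighted_quantile([1.0, 2.0, 3.0, 4.0, 5.0], [0.5] * 5)
print(weighted_median, equal_prob_median)
```

With equal selection probabilities the function reduces to an ordinary sample quantile under the same convention; with unequal probabilities the estimate shifts toward the values that represent more of the population.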