Modeling Nonnegative Data with Clumping at Zero: A Survey

Statistical Analysis of Zero-Inflated Nonnegative Continuous Data: A Review

Statistical Science, 2019

Zero-inflated nonnegative continuous (or semicontinuous) data arise frequently in biomedical, economic, and ecological studies. Examples include substance abuse, medical costs, medical care utilization, biomarkers (e.g., CD4 cell counts, coronary artery calcium scores), single cell gene expression rates, and (relative) abundance of microbiome. Such data are often characterized by a large proportion of zero values and by positive continuous values that are skewed to the right and heteroscedastic. Both of these features suggest that no simple parametric distribution may be suitable for modeling such outcomes. In this paper, we review statistical methods for analyzing zero-inflated nonnegative outcome data. We will start with the cross-sectional setting, discussing ways to separate zero and positive values and introducing flexible models to characterize right skewness and heteroscedasticity in the positive values. We will then present models of correlated zero-inflated nonnegative continuous data, using random effects to handle the correlation among repeated measures from the same subject and the correlation across different parts of the model. We will also discuss extensions to related topics, for example, zero-inflated count and survival data, nonlinear covariate effects, and joint models of longitudinal zero-inflated nonnegative continuous data and survival. Finally, we will present applications to three real datasets (i.e., microbiome, medical costs, and alcohol drinking) to illustrate these methods. Example code will be provided to facilitate applications of these methods.
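
As a rough illustration of separating zero and positive values in the cross-sectional setting, the sketch below fits an intercept-only two-part model. It is not the review's own code: the function name, the toy cost values, and the choice of a log-normal distribution for the positive part are all assumptions made for illustration.

```python
import math


def fit_two_part_lognormal(y):
    """Fit an intercept-only two-part model to nonnegative data.

    Part 1: Bernoulli probability that an observation is positive.
    Part 2: log-normal distribution for the positive values
            (an illustrative assumption, not the only possible choice).
    """
    if any(v < 0 for v in y):
        raise ValueError("outcomes must be nonnegative")
    positives = [v for v in y if v > 0]
    if not positives:
        raise ValueError("need at least one positive value")

    # Part 1: maximum likelihood estimate of P(Y > 0).
    p_pos = len(positives) / len(y)

    # Part 2: maximum likelihood estimates on the log scale.
    logs = [math.log(v) for v in positives]
    mu = sum(logs) / len(logs)
    sigma2 = sum((v - mu) ** 2 for v in logs) / len(logs)

    # Mean of a log-normal variable is exp(mu + sigma^2 / 2).
    mean_pos = math.exp(mu + sigma2 / 2)

    return {
        "p_pos": p_pos,
        "mu": mu,
        "sigma2": sigma2,
        "mean_pos": mean_pos,
        # Overall mean combines both parts: E[Y] = P(Y > 0) * E[Y | Y > 0].
        "overall_mean": p_pos * mean_pos,
    }


# Hypothetical medical-cost values, half of them zero.
costs = [0, 0, 0, 120.0, 45.5, 0, 980.0, 0, 60.0, 310.0]
print(fit_two_part_lognormal(costs))
```

In practice each part would carry covariates (for example a logistic regression for part 1 and a regression on the log scale for part 2), and the log-normal assumption would need checking against the skewness and heteroscedasticity of the positive values.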

Random effect models for repeated measures of zero-inflated count data

Statistical Modelling, 2005

For count responses, the situation of excess zeros (relative to what standard models allow) often occurs in biomedical and sociological applications. Modeling repeated measures of zero-inflated count data presents special challenges. This is because in addition to the problem of extra zeros, the correlation between measurements upon the same subject at different occasions needs to be taken into account. This article discusses random effect models for repeated measurements on this type of response variable. A useful model is the hurdle model with random effects, which separately handles the zero observations and the positive counts. In maximum likelihood model fitting, we consider both a normal distribution and a nonparametric approach for the random effects. A special case of the hurdle model can be used to test for zero inflation. Random effects can also be introduced in a zero-inflated Poisson or negative binomial model, but such a model may encounter fitting problems if there is ...

The analysis of zero-inflated count data: Beyond zero-inflated Poisson regression

British Journal of Mathematical and Statistical Psychology, 2012

Infrequent count data in psychological research are commonly modelled using zero-inflated Poisson regression. This model can be viewed as a latent mixture of an "always-zero" component and a Poisson component. Hurdle models are an alternative class of two-component models that are seldom used in psychological research, but clearly separate the zero counts and the non-zero counts by using a left-truncated count model for the latter. In this tutorial we revisit both classes of models, and discuss model comparisons and the interpretation of their parameters. As illustrated with an example from relational psychology, both types of models can easily be fitted using the R-package pscl.
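
The difference between the two model classes can be seen directly in their probability mass functions. The following is a minimal sketch without covariates; the function names and parameter values are assumptions chosen for illustration and are unrelated to the pscl interface.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing count k with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, pi, lam):
    """Zero-inflated Poisson: mixture of an always-zero component
    (weight pi) and a Poisson(lam) component (weight 1 - pi).
    Zeros can come from either component."""
    base = (1 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def hurdle_pmf(k, p0, lam):
    """Hurdle Poisson: zeros occur with probability p0; positive
    counts follow a Poisson(lam) left-truncated at zero."""
    if k == 0:
        return p0
    return (1 - p0) * poisson_pmf(k, lam) / (1 - math.exp(-lam))


for k in range(4):
    print(k, round(zip_pmf(k, 0.3, 2.0), 4), round(hurdle_pmf(k, 0.3, 2.0), 4))
```

In the mixture form, the probability of a zero can never fall below the Poisson value exp(-lam), whereas the hurdle form sets the zero probability freely, so it can also describe fewer zeros than a Poisson model would predict.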

Zero-inflated count regression models with applications to some examples

Quality & Quantity, 2012

In this paper, we employed SAS PROC NLMIXED (Nonlinear mixed model procedure) to analyze three example data sets having inflated zeros. The examples include data with and without covariates. The covariates utilized in this article are binary, to simplify our analysis. Of course the analysis can readily be extended to situations with several covariates having multiple levels. Models fitted include the Poisson (P), the negative binomial (NB), the generalized Poisson (GP), and their zero-inflated variants, namely the ZIP, the ZINB and the ZIGP models respectively. Parameter estimates as well as the appropriate goodness-of-fit statistic (the deviance D) are computed and, in some cases, Pearson's X² statistic, which is based on the variance of the relevant model distribution, is also computed. Also obtained are the expected frequencies for the models, and GOF tests are conducted based on the rule established by Lawal (Appl Stat 29:292-298, 1980). Our results extend previous results on the analysis of the chosen data in this example. Further, results obtained are very consistent with previous analyses on the data sets chosen for this article. We also present a hierarchical figure relating all the models employed in this paper. While we do not pretend that the results obtained are entirely new, the analyses give researchers in the field a much-needed means of implementing these models in SAS without having to resort to S-PLUS, R or Stata.
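
The two goodness-of-fit statistics named above can be sketched in a few lines. This is a generic per-observation illustration in Python, not the paper's SAS code; the function names are assumptions, and the negative binomial variance uses the common NB2 form mu + alpha * mu^2, which may differ from the paper's parameterization.

```python
import math


def poisson_deviance(y, mu):
    """Poisson deviance D = 2 * sum[y * log(y / mu) - (y - mu)],
    with the convention that y * log(y / mu) is 0 when y is 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total


def pearson_chi2(y, mu, variance):
    """Pearson X^2 = sum (y - mu)^2 / Var(mu), where `variance`
    maps a fitted mean to the variance implied by the model."""
    return sum((yi - mi) ** 2 / variance(mi) for yi, mi in zip(y, mu))


def poisson_variance(m):
    """Poisson: variance equals the mean."""
    return m


def nb2_variance(alpha):
    """NB2 (assumed parameterization): variance mu + alpha * mu^2."""
    return lambda m: m + alpha * m * m


# Hypothetical observed counts and fitted means.
y = [0, 0, 1, 4, 7]
mu = [0.8, 1.1, 1.5, 3.0, 5.6]
print(poisson_deviance(y, mu))
print(pearson_chi2(y, mu, poisson_variance))
print(pearson_chi2(y, mu, nb2_variance(0.5)))
```

Because the NB2 variance exceeds the Poisson variance for any positive alpha, the same residuals yield a smaller Pearson statistic under the negative binomial model.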

A comparison of different methods of zero-inflated data analysis and an application in health surveys

Journal of Modern Applied Statistical Methods, 2017

The performance of several models under different conditions of zero-inflation and dispersion is evaluated. Results from simulated and real data showed that the zero-altered or zero-inflated negative binomial models were preferred over others (e.g., ordinary least-squares regression with log-transformed outcome, Poisson model) when data have excessive zeros and over-dispersion.
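
Before choosing among such models, it can help to check informally whether a sample shows excess zeros and over-dispersion relative to a Poisson model. The sketch below is a simple descriptive check, not a method from the paper; the function name and the toy counts are assumptions.

```python
import math


def zero_and_dispersion_summary(counts):
    """Compare a count sample with what a Poisson model of the
    same mean would imply for zeros and for the variance."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        raise ValueError("need at least one positive count")
    # Unbiased sample variance.
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return {
        "mean": mean,
        # Values well above 1 suggest over-dispersion.
        "dispersion_index": var / mean,
        "observed_zero": sum(1 for c in counts if c == 0) / n,
        # Zero probability of a Poisson with the same mean.
        "poisson_zero": math.exp(-mean),
    }


# Hypothetical counts with many zeros.
visits = [0, 0, 0, 0, 0, 0, 3, 5, 8, 4]
print(zero_and_dispersion_summary(visits))
```

An observed zero fraction far above the Poisson value, together with a dispersion index well above 1, is the situation in which the abstract reports zero-altered or zero-inflated negative binomial models being preferred. Formal model comparison would still be needed.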

Two-part regression models for longitudinal zero-inflated count data

Canadian Journal of Statistics, 2000

Two-part models are quite well established in the economic literature, since they closely resemble a principal-agent type model, where homogeneous, observable, counted outcomes are subject to a (prior, exogenous) selection choice. The first decision can be represented by a binary choice model, using a probit or a logit link; the second can be analyzed through a truncated discrete distribution such as a truncated Poisson, negative binomial, and so on. Only recently has particular attention been devoted to the extension of two-part models to handle longitudinal data. The authors discuss a semi-parametric estimation method for dynamic two-part models and propose a comparison with other, well-established alternatives. Heterogeneity sources that influence the first level decision process, that is, the decision to use a certain service, are assumed to influence also the (truncated) distribution of the positive outcomes. Estimation is carried out through an EM algorithm without parametric assumptions on the random effects distribution. Furthermore, the authors investigate the extension of the finite mixture representation to allow for unobservable transition between components in each of these parts. The proposed models are discussed using empirical as well as simulated data.
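
For orientation, the basic cross-sectional two-part structure described above can be written out as follows. This is a sketch under stated assumptions: a logit link for the binary part, a zero-truncated Poisson for the positive counts, and no random effects. The symbols p_i, mu_i, x_i, z_i, beta, and gamma are notation introduced here, not taken from the paper.

```latex
% p_i: probability of a positive count; f(.; mu_i): Poisson pmf with mean mu_i
\begin{align*}
\Pr(Y_i = 0) &= 1 - p_i, \qquad
\Pr(Y_i = y) = p_i \, \frac{f(y;\mu_i)}{1 - f(0;\mu_i)}, \quad y = 1, 2, \dots \\
\operatorname{logit}(p_i) &= x_i^\top \beta, \qquad \log \mu_i = z_i^\top \gamma \\
\ell(\beta, \gamma) &= \underbrace{\sum_i \Bigl[ \mathbf{1}(y_i > 0) \log p_i
      + \mathbf{1}(y_i = 0) \log (1 - p_i) \Bigr]}_{\ell_1(\beta)}
  + \underbrace{\sum_{i:\, y_i > 0} \Bigl[ \log f(y_i;\mu_i)
      - \log\bigl(1 - f(0;\mu_i)\bigr) \Bigr]}_{\ell_2(\gamma)}
\end{align*}
```

Without random effects the log-likelihood splits into a part for the binary choice and a part for the positive counts, so each can be maximized separately. Once shared or correlated heterogeneity enters both parts, as in the longitudinal models discussed above, this separation no longer holds and the two parts must be estimated jointly.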

Poisson and negative binomial regression models for zero-inflated data: an experimental study

Communications Faculty of Science University of Ankara Series A1 Mathematics and Statistics

Count data regression has been widely used in various disciplines, particularly in the health field. Classical models like Poisson and negative binomial regression may not perform reasonably in the presence of excessive zeros and overdispersion. Zero-inflated and hurdle variants of these models can be a remedy for these problems. Besides zero-inflated and hurdle models, alternatives based on biased estimators such as ridge and Liu may improve performance when multicollinearity is present in addition to excessive zeros and overdispersion. In this study, ten different regression models, including classical Poisson and negative binomial regression with their variants based on zero-inflated, hurdle, ridge and Liu approaches, have been compared using a health data set. Criteria including the Akaike information criterion, log-likelihood value, mean squared error and mean absolute error have been used to investigate the performance of the models. The results show th...
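
To make the comparison criteria concrete, the sketch below contrasts an intercept-only Poisson fit with an intercept-only zero-inflated Poisson fit using log-likelihood and AIC. It is not the study's procedure: the toy counts, the function names, and the use of a simple EM algorithm for the zero-inflated fit are assumptions made for illustration.

```python
import math


def poisson_loglik(counts, lam):
    """Log-likelihood of i.i.d. Poisson(lam) counts."""
    return sum(-lam + c * math.log(lam) - math.lgamma(c + 1) for c in counts)


def zip_loglik(counts, pi, lam):
    """Log-likelihood of i.i.d. zero-inflated Poisson counts."""
    ll = 0.0
    for c in counts:
        pois = math.exp(-lam + c * math.log(lam) - math.lgamma(c + 1))
        ll += math.log(pi + (1 - pi) * pois) if c == 0 else math.log((1 - pi) * pois)
    return ll


def fit_zip_em(counts, n_iter=500):
    """Estimate (pi, lam) of an intercept-only ZIP model by EM."""
    n = len(counts)
    n_zero = sum(1 for c in counts if c == 0)
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one positive count")
    pi, lam = 0.5 * n_zero / n, total / n
    for _ in range(n_iter):
        # E-step: probability that an observed zero is a structural zero.
        w = pi / (pi + (1 - pi) * math.exp(-lam)) if pi > 0 else 0.0
        # M-step: update the mixing weight and the Poisson mean.
        pi = n_zero * w / n
        lam = total / (n - n_zero * w)
    return pi, lam


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * loglik


# Hypothetical counts with many zeros.
counts = [0, 0, 0, 0, 0, 0, 3, 5, 8, 4]
lam_pois = sum(counts) / len(counts)  # Poisson MLE is the sample mean
pi_zip, lam_zip = fit_zip_em(counts)
print("Poisson AIC:", aic(poisson_loglik(counts, lam_pois), 1))
print("ZIP AIC:    ", aic(zip_loglik(counts, pi_zip, lam_zip), 2))
```

A lower AIC favors the model; here the zero-inflated fit pays for one extra parameter but gains a much higher log-likelihood on the zero-heavy toy data. With covariates, the same criteria apply to regression versions of each model.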

Modeling Zero-inflated Clustered Count Data: A Semiparametric Approach

2014

This paper proposes to use an additive semiparametric Poisson regression in modeling zero-inflated clustered data. Two estimation methods based on de Vera (2010) are exploited in this paper. The first simultaneously estimates both the parametric and nonparametric parts of the model. The second utilizes the backfitting algorithm by smoothing the nonparametric function of the covariates and then estimating the parametric parts of the postulated model. The predictive accuracy of the proposed methods, measured in terms of root mean square error (RMSE), is compared to that of the ordinary Zero-Inflated Poisson (ZIP) regression model. Through a simulation study, the average RMSE of the ordinary ZIP regression model is at most 81% and 27% higher for equal and unequal cluster sizes, respectively, than that of the proposed model whose parametric and nonparametric parts are simultaneously estimated.

A Zero-Inflated Regression Model for Grouped Data

Oxford Bulletin of Economics and Statistics, 2014

We introduce the (panel) zero-inflated interval regression (ZIIR) model, which is ideally suited when data are in the form of groups, which is commonly the case in survey data, and there is an 'excess' of zero observations. We apply our new modelling framework to the analysis of visits to general practitioners (GPs) using individual-level panel data from the British Household Panel Survey (BHPS). The ZIIR model simultaneously estimates the probability of visiting the GP and the frequency of visits (defined by given numerical intervals in the data). The results show that different socio-economic factors influence the probability of visiting the GP and the frequency of visits, thereby providing potentially valuable information to policy-makers concerned with health care allocation.

A Method for Analyzing Longitudinal Outcomes with Many Zeros

Mental Health Services Research, 2000

Health care utilization and cost data have challenged analysts because they are often correlated over time, highly skewed, and clumped at 0. Traditional approaches do not address all these problems, and evaluators of mental health and substance abuse interventions often grapple with the problem of how to analyze these data in a way that accurately represents program impact. Recently, the traditional 2-part model has been extended to a mixed-effects mixed-distribution model with correlated random effects to deal simultaneously with excess zeros, skewness, and correlated observations. We introduce and demonstrate this new method to mental health services researchers and evaluators by analyzing the data from a study of assertive community treatment (ACT). The response variable is the number of days of hospitalization, collected every 6 months over 3 years. The explanatory variable is group: ACT vs. standard case management. Diagnosis (schizophrenia vs. bipolar disorder), time, and the baseline values of hospital days are covariates. Results indicate that clients in the ACT group have a higher probability of hospital admission, but tend to have shorter lengths of stay. The mixed-distribution model provides greater specification of a model to fit these data and leads to more refined interpretation of the results.