(original) (raw)

--- title: "Modeling Rates/Proportions using Beta Regression with rstanarm" author: "Imad Ali, Jonah Gabry and Ben Goodrich" date: "`r Sys.Date()`" output: html_vignette: toc: yes ---```{r, child="children/SETTINGS-knitr.txt"} ``` ```{r, child="children/SETTINGS-gg.txt"} ``` # Introduction This vignette explains how to model continuous outcomes on the open unit interval using the `stan_betareg` function in the __rstanarm__ package. ```{r, child="children/four_steps.txt"} ``` Steps 3 and 4 are covered in more depth by the vignette entitled ["How to Use the __rstanarm__ Package"](rstanarm.html). This vignette focuses on Step 1 when the likelihood is the product of beta distributions. # Likelihood Beta regression uses the beta distribution as the likelihood for the data, f(yi∣a,b)=fracyi(a−1)(1−yi)(b−1)B(a,b)f(y_i | a, b) = \frac{y_i^{(a-1)}(1-y_i)^{(b-1)}}{B(a,b)}f(yi∣a,b)=fracyi(a−1)(1−yi)(b−1)B(a,b) where B(cdot)B(\cdot)B(cdot) is the beta function. The shape parameters for the distribution are aaa and bbb and enter into the model according to the following transformations, a=mucdotphib=(1−mu)cdotphia = \mu\cdot\phi \\ b = (1-\mu)\cdot\phia=mucdotphib=(1−mu)cdotphi Let g1(cdot)g_1(\cdot)g1(cdot) be some link function. Then, in the specification of the shape parameters above, mu=g1−1(mathbfXboldsymbolbeta)\mu = g_1^{-1}(\mathbf{X}\boldsymbol{\beta})mu=g1−1(mathbfXboldsymbolbeta), where boldsymbolX\boldsymbol{X}boldsymbolX is a NtimesKN\times KNtimesK dimensional matrix of predictors, and boldsymbolbeta\boldsymbol{\beta}boldsymbolbeta is a KKK dimensional vector of parameters associated with each predictor. In the simplest case (with only one set of regressors), phi\phiphi is a scalar parameter. Alternatively, it is possible to model phi\phiphi using a second set of regressors mathbfZ\mathbf{Z}mathbfZ. In this context let g2(cdot)g_2(\cdot)g2(cdot) be some link function that is not necessarily identical to g1(cdot)g_1(\cdot)g1(cdot). Then phi=g2−1(mathbfZboldsymbolgamma)\phi = g_2^{-1}(\mathbf{Z}\boldsymbol{\gamma})phi=g2−1(mathbfZboldsymbolgamma), where boldsymbolgamma\boldsymbol{\gamma}boldsymbolgamma is a JJJ dimensional vector of parameters associated with the NtimesJN\times JNtimesJ dimensional matrix of predictors mathbfZ\mathbf{Z}mathbfZ. After substituting the shape parameter values in, the likelihood used in beta regression takes the following form, f(yi∣mu,phi)=fracyi(muphi−1)(1−yi)((1−mu)phi−1)B(muphi,(1−mu)phi)f(y_i | \mu, \phi) = \frac{y_i^{(\mu\phi-1)}(1-y_i)^{((1-\mu)\phi-1)}}{B(\mu\phi,(1-\mu)\phi)}f(yi∣mu,phi)=fracyi(muphi−1)(1−yi)((1−mu)phi−1)B(muphi,(1−mu)phi) # Priors A full Bayesian analysis requires specifying prior distributions f(boldsymbolbeta)f(\boldsymbol{\beta})f(boldsymbolbeta) and f(phi)f(\phi)f(phi) for the vector of regression coefficients and phi\phiphi. When using `stan_betareg`, these distributions can be set using the `prior_intercept`, `prior`, and `prior_phi` arguments. The `stan_betareg` function supports a variety of prior distributions, which are explained in the __rstanarm__ documentation (`help(priors, package = 'rstanarm')`). When modeling phi\phiphi with a linear predictor a full Bayesian analysis requires specifying the prior distributions f(boldsymbolbeta)f(\boldsymbol{\beta})f(boldsymbolbeta) and f(boldsymbolgamma)f(\boldsymbol{\gamma})f(boldsymbolgamma). In `stan_betareg` the prior distributions on boldsymbolgamma\boldsymbol{\gamma}boldsymbolgamma can be set using the `prior_intercept_z` and `prior_z` arguments. As an example, suppose we have KKK predictors and believe --- prior to seeing the data --- that beta1,dots,betaK\beta_1, \dots, \beta_Kbeta1,dots,betaK and phi\phiphi are as likely to be positive as they are to be negative, but are highly unlikely to be far from zero. These beliefs can be represented by normal distributions with mean zero and a small scale (standard deviation). To give phi\phiphi and each of the beta\betabetas this prior (with a scale of 1, say), in the call to `stan_betareg` we would include the arguments `prior_intercept = normal(0,1)`, `prior = normal(0,1)`, and `prior_phi = normal(0,1)`. If, on the other hand, we have less a priori confidence that the parameters will be close to zero then we could use a larger scale for the normal distribution and/or a distribution with heavier tails than the normal like the Student t distribution. __Step 1__ in the "How to Use the __rstanarm__ Package" vignette discusses one such example. After fitting the model we can use the `prior_summary` function to print information about the prior distributions used when fitting the model. # Posterior When using only a *single set of regressors*, the posterior distribution of boldsymbolbeta\boldsymbol{\beta}boldsymbolbeta and phi\phiphi is proportional to the product of the likelihood contributions, the KKK priors on the betak\beta_kbetak parameters, and phi\phiphi, f(boldsymbolbeta,phi∣mathbfy,mathbfX)proptoprodi=1Nf(yi∣a,b)timesprodk=1Kf(betak)timesf(phi)f(\boldsymbol{\beta},\phi|\mathbf{y},\mathbf{X}) \propto \prod_{i=1}^N f(y_i | a, b) \times \prod_{k=1}^K f(\beta_k) \times f(\phi)f(boldsymbolbeta,phi∣mathbfy,mathbfX)proptoprodi=1Nf(yi∣a,b)timesprodk=1Kf(betak)timesf(phi) When using *two sets of regressors*, the posterior distribution of boldsymbolbeta\boldsymbol{\beta}boldsymbolbeta and boldsymbolgamma\boldsymbol{\gamma}boldsymbolgamma is proportional to the product of the likelihood contribution, the KKK priors on the betak\beta_kbetak parameters, and the JJJ priors on the gammaj\gamma_jgammaj parameters, f(boldsymbolbeta,boldsymbolgamma∣mathbfy,mathbfX)proptoprodi=1Nf(yi∣a,b)timesprodk=1Kf(betak)timesprodj=1Jf(gammaj)f(\boldsymbol{\beta},\boldsymbol{\gamma}|\mathbf{y},\mathbf{X}) \propto \prod_{i=1}^N f(y_i | a, b) \times \prod_{k=1}^K f(\beta_k) \times \prod_{j=1}^J f(\gamma_j)f(boldsymbolbeta,boldsymbolgamma∣mathbfy,mathbfX)proptoprodi=1Nf(yi∣a,b)timesprodk=1Kf(betak)timesprodj=1Jf(gammaj) # An Example Using Simulated Data In this example the outcome variable mathbfy\mathbf{y}mathbfy is simulated in a way that warrants the use of beta regression. It is worth mentioning that the data generation process is quite convoluted, which is apparent in the identification of the likelihood above. The data simulated below uses the logistic link function on the first set of regressors and the log link function on the second set of regressors. ```{r simulated-data, fig.height=5} SEED <- 1234 set.seed(SEED) eta <- c(1, -0.2) gamma <- c(1.8, 0.4) N <- 200 x <- rnorm(N, 2, 2) z <- rnorm(N, 0, 2) mu <- binomial(link = logit)$linkinv(eta[1] + eta[2]*x) phi <- binomial(link = log)$linkinv(gamma[1] + gamma[2]*z) y <- rbeta(N, mu * phi, (1 - mu) * phi) dat <- data.frame(cbind(y, x, z)) hist(dat$y, col = "darkgrey", border = F, main = "Distribution of Outcome Variable", xlab = "y", breaks = 20, freq = F) ``` The model can be fit by calling `stan_betareg`, using the appropriate link functions. ```{r simulated-fit, results = "hide"} library(rstanarm) fit1 <- stan_betareg(y ~ x | z, data = dat, link = "logit", link.phi = "log", cores = 2, seed = 12345) fit2 <- stan_betareg(y ~ -1 + x , data = dat, link = "logit", link.phi = "log", cores = 2, seed = 12345) round(coef(fit1), 2) round(coef(fit2), 2) ``` ``` {r simulated-fit-print, echo=FALSE} round(coef(fit1), 2) round(coef(fit2), 2) ``` For clarity we can use `prior_summary` to print the information about the prior distributions used to fit the models. The priors used in `fit1` are provided below. ``` {r print-priors} prior_summary(fit1) ``` The usual posterior analyses are available in **rstanarm**. The plots below illustrate simulated values of the outcome variable. The incorrect model noticeably fails to capture the top of the distribution consistently in comparison to the true model. ```{r simulated-analysis, fig.height=5} library(ggplot2) library(bayesplot) bayesplot_grid( pp_check(fit1), pp_check(fit2), xlim = c(0,1), ylim = c(0,4), titles = c("True Model: y ~ x | z", "False Model: y ~ x - 1"), grid_args = list(ncol = 2) ) ``` We can also compare models by evaluating the expected log pointwise predictive density (`elpd`), which can be calculated using the `loo` method, which provides an interface for __rstanarm__ models to the functionality in the __loo__ package. ``` {r simulated-loo} loo1 <- loo(fit1) loo2 <- loo(fit2) loo_compare(loo1, loo2) ``` The difference in `elpd` is negative indicating that the expected predictive accuracy for the first model is higher. # An Example Using Gasoline Data In some applied contexts it may be necessary to work with an outcome variable that is a proportion. If the proportion is bound on the open unit interval then beta regression can be considered a reasonable estimation method. The `betareg` package provides a dataset on the proportion of crude oil converted to gasoline after distillation and fractionation. This variable is defined as yield. Below `stan_betareg` is used to model yield as a function of temperature, pressure, and the batch of conditions. ```{r, gas-fit, results="hide"} library(rstanarm) data("GasolineYield", package = "betareg") gas_fit1 <- stan_betareg(yield ~ temp + batch, data = GasolineYield, link = "logit", seed = 12345) gas_fit2 <- stan_betareg(yield ~ temp + batch | pressure, data = GasolineYield, link = "logit", seed = 12345) round(coef(gas_fit1), 2) round(coef(gas_fit2), 2) ``` ``` {r, gas-print, echo=FALSE} round(coef(gas_fit1), 2) round(coef(gas_fit2), 2) ``` The plots below illustrate simulated values of gasoline yield. While the first model accounts for variation in batch conditions its predictions looks somewhat uniform rather than resembling the peaked and right-skewed behavior of the true data. The second model does a somewhat better job at capturing the shape of the distribution, however its location is off as it is centered around 0.50 rather than 0.20. ```{r gas-analysis, fig.height=5} library(ggplot2) bayesplot_grid( pp_check(gas_fit1), pp_check(gas_fit2), xlim = c(0,1), ylim = c(0,5), titles = c("gas_fit1", "gas_fit2"), grid_args = list(ncol = 2) ) ``` ``` {r, gas-loo} gas_loo1 <- loo(gas_fit1) gas_loo2 <- loo(gas_fit2) loo_compare(gas_loo1, gas_loo2) ``` Evaluating the expected log predictive distribution using `loo` reveals that the second of the two models is preferred. # References Ferrari, SLP and Cribari-Neto, F (2004) "Beta Regression for Modeling Rates and Proportions". _Journal of Applied Statistics._ Vol. 31, No. 07, p799-815.