Coverage estimates (original) (raw)

Coverage relates to how well a proposed confidence interval captures the population parameters. For instance, given some desired nominal error rate, say, \(\alpha = .05\), does the confidence interval mean based on a t-test capture the population mean across 95% of the time across random samples? Intervals that are too narrow (i.e., less than the nominal coverage rate) are generally termed ‘liberal’, while those that are too wide (i.e., greater than the nominal coverage rate) are considered ‘conservative’. In the example below we evaluate whether the z or t-distribution is a suitable candidate to capture the population mean across different sample sizes.

Define the functions

library(SimDesign)
#SimFunctions(comments = FALSE)

# z-test confidence intervals function
z.CI <- function(dat, alpha = .05){
    xbar <- mean(dat)
    SE <- sd(dat) / sqrt(length(dat))
    z <- abs(qnorm(alpha/2))
    CI <- c(xbar - z * SE, xbar + z * SE)
    CI
}

### Define design conditions. Here we want a finer range of values for plotting
Design <- createDesign(mu = c(0, 5, 10), 
                       sample_size = c(10, 20, 30))

#---------------------------------------------------------------------

Generate <- function(condition, fixed_objects = NULL) {
    ret <- with(condition, rnorm(sample_size, mu))
    ret
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
    t_test <- t.test(dat)
    t_CI <- t_test$conf.int
    z_CI <- z.CI(dat)
    ret <- c(t_CI = ECR(t_CI, parameter = condition$mu), 
             z_CI = ECR(z_CI, parameter = condition$mu)) 
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL) {
    ret <- colMeans(results)
    ret
}

#---------------------------------------------------------------------

### Run the simulation
res <- runSimulation(Design, replications=10000, verbose=FALSE, parallel = TRUE,
                     generate=Generate, analyse=Analyse, summarise=Summarise)

# A tibble: 9 × 8
     mu sample_size   t_CI   z_CI REPLICATIONS SIM_TIME       SEED COMPLETED    
  <dbl>       <dbl>  <dbl>  <dbl>        <dbl> <chr>         <int> <chr>        
1     0          10 0.9531 0.9221        10000 0.52s     488349520 Thu Apr 17 2…
2     5          10 0.9466 0.9169        10000 0.34s     469344150 Thu Apr 17 2…
3    10          10 0.9503 0.9173        10000 0.37s    1550099436 Thu Apr 17 2…
4     0          20 0.9481 0.9325        10000 0.45s      40778633 Thu Apr 17 2…
5     5          20 0.9527 0.9373        10000 0.37s     713327526 Thu Apr 17 2…
6    10          20 0.9523 0.9375        10000 0.40s     831490948 Thu Apr 17 2…
7     0          30 0.9494 0.9408        10000 0.47s    1214377063 Thu Apr 17 2…
8     5          30 0.9476 0.9381        10000 0.36s    1255528026 Thu Apr 17 2…
9    10          30 0.9488 0.9392        10000 0.39s    1229276239 Thu Apr 17 2…

From these results it’s clear to see that the t-distribution gives very good coverage rates across all conditions. However, the z-distribution CI is too liberal in the smaller sample sizes, and gives an interval that does not contain the population mean as often as desired. Around a sample size of 30 we can see that the two distributions behave reasonably well.

Note that the ECR() function was used within the Analyse() function rather than in Summarise(). When only the three values are supplied (CIs + population value) then ECR() will return either a 1 if the population value was within the CI or 0 if it was not. Hence, in the Summarise() function all that is required to obtain the coverage estimate is to find the mean of these 1’s and 0’s. The next section demonstrates how to use ECR() in the Summarise() instead, and it’s really a matter of preference which style you prefer.

An Alternative Form

The above simulation called the ECR() function within analyse, however it is entirely possible to use this function within the summarise block by simply supplying matrices of confidence intervals instead.

library(SimDesign)
#SimFunctions(comments = FALSE)

### Define design conditions. Here we want a finer range of values for plotting
Design <- createDesign(mu = c(0, 5, 10), 
                       sample_size = c(10, 20, 30))

z.CI <- function(dat, alpha = .05){
    xbar <- mean(dat)
    SE <- sd(dat) / sqrt(length(dat))
    z <- abs(qnorm(alpha/2))
    CI <- c(xbar - z * SE, xbar + z * SE)
    CI
} 

#-----------------------------------------------------------------

Generate <- function(condition, fixed_objects = NULL) {
    ret <- with(condition, rnorm(sample_size, mu))
    ret
}

Analyse <- function(condition, dat, fixed_objects = NULL) {
    t_test <- t.test(dat)
    t_CI <- t_test$conf.int
    z_CI <- z.CI(dat)
    ret <- c(t_CI = t_CI, z_CI = z_CI) 
    ret
}

Summarise <- function(condition, results, fixed_objects = NULL) {
    ret <- ECR(results, parameter = condition$mu, 
               names = c('t_CI', 'z_CI'))
    ret
}

#-----------------------------------------------------------------

### Run the simulation
res <- runSimulation(Design, replications=10000, verbose=FALSE, parallel = FALSE,
                     generate=Generate, analyse=Analyse, summarise=Summarise)

Many secondary function in SimDesign have this type of flexibility, therefore it’s generally useful to refer to the documentation to help think about easier ways to summarise your results.