Fitting regressions with Armadillo

Design matrix and response vector

The starting point to fit a linear regression in R without using the lm() function is to create a design matrix and a response vector. The design matrix has one row per observation and one column per predictor, and the response vector has one entry per observation.

For example, with the mtcars dataset, a design matrix can be created to estimate the coefficients of the model:

\[ \text{mpg}_i = \beta_0 + \beta_1 \times \text{weight}_i + e_i \]

To estimate \(\beta_0\) and \(\beta_1\), the design matrix and the response vector can be created as follows:

x <- cbind(1, mtcars$wt)
y <- mtcars$mpg

head(x)
#>      [,1]  [,2]
#> [1,]    1 2.620
#> [2,]    1 2.875
#> [3,]    1 2.320
#> [4,]    1 3.215
#> [5,]    1 3.440
#> [6,]    1 3.460

head(y)
#> [1] 21.0 21.0 22.8 21.4 18.7 18.1

dim(x)
#> [1] 32  2

length(y)
#> [1] 32

There is a more convenient way to create the design matrix and the response vector: the model.matrix function builds the design matrix from a formula, and the model.response function extracts the response vector from the corresponding model frame:

x <- model.matrix(mpg ~ wt, data = mtcars)
y <- model.response(model.frame(mpg ~ wt, data = mtcars))

The advantage of using these functions is that they handle factor variables automatically. For example, if a factor variable appears in the formula, model.matrix will create a 0/1 indicator column for each level of the factor except the reference level, which is absorbed into the intercept under the default treatment contrasts, as shown below.
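
As a minimal illustration (treating cyl as a factor purely for demonstration), the call below adds indicator columns for the 6- and 8-cylinder levels, with 4 cylinders as the reference level:

x_cyl <- model.matrix(mpg ~ wt + factor(cyl), data = mtcars)
colnames(x_cyl)
#> [1] "(Intercept)"  "wt"           "factor(cyl)6" "factor(cyl)8"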

Estimating the regression coefficients in R

The coefficients can be computed directly from the closed-form ordinary least squares solution
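
\[ \hat{\beta} = (X^\top X)^{-1} X^\top y \]

In R, this is a single expression using the solve function: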

solve(t(x) %*% x) %*% t(x) %*% y
#>                  [,1]
#> (Intercept) 37.285126
#> wt          -5.344472

It can be verified that the coefficients are the same as the ones estimated by the lm function:

lm(mpg ~ wt, data = mtcars)$coefficients
#> (Intercept)          wt 
#>   37.285126   -5.344472

However, the lm() function does not estimate the coefficients this way. Instead, it uses the QR decomposition, implemented in internal routines written in C and Fortran, which is numerically more stable than explicitly inverting \(X^\top X\).
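
The idea is to factor the design matrix as \(X = QR\), where \(Q\) has orthonormal columns and \(R\) is upper triangular, and then solve the triangular system

\[ R \hat{\beta} = Q^\top y \]

by back-substitution, avoiding the explicit inversion of \(X^\top X\). The same approach can be reproduced with base R's QR helpers (a minimal sketch using the x and y objects created above):

qr.coef(qr(x), y)
#> (Intercept)          wt 
#>   37.285126   -5.344472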

Estimating the regression coefficients in Armadillo

Using the ‘cpp11armadillo’ package, the same QR-based estimation can be written as follows:

// headers and namespaces assumed to follow the cpp11armadillo package template
#include <cpp11.hpp>
#include <cpp11armadillo.hpp>

using namespace arma;
using namespace cpp11;

vec ols_fit(const Mat<double>& X, const Col<double>& Y) {
  // thin (economical) QR decomposition: X = Q * R
  mat Q, R;
  qr_econ(Q, R, X);

  // solve the triangular system R * betas = Q' * Y
  vec betas = solve(trimatu(R), Q.t() * Y);

  return betas;
}

[[cpp11::register]] doubles ols_(const doubles_matrix<>& x, const doubles& y) {
  // convert R objects to Armadillo objects, fit, and convert the result back
  mat X = as_Mat(x);
  vec Y = as_Col(y);
  return as_doubles(ols_fit(X, Y));
}
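
Here qr_econ() computes the economical (thin) QR decomposition, and wrapping R in trimatu() tells Armadillo's solve() that the system is upper triangular, so it can be solved by back-substitution rather than with a general solver. This mirrors the QR-based approach that lm() uses internally.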

Verify the equivalence:

all.equal(ols_(x, y), unname(coef(lm(mpg ~ wt, data = mtcars))))
#> [1] TRUE