General linear model (original) (raw)

From Wikipedia, the free encyclopedia

Statistical linear model

The general linear model or general multivariate regression model is a compact way of simultaneously writing several multiple linear regression models. In that sense it is not a separate statistical linear model. The various multiple linear regression models may be compactly written as[1]

Y = X B + U , {\displaystyle \mathbf {Y} =\mathbf {X} \mathbf {B} +\mathbf {U} ,} $\mathbf {Y} =\mathbf {X} \mathbf {B} +\mathbf {U} ,$

where Y is a matrix with series of multivariate measurements (each column being a set of measurements on one of the dependent variables), X is a matrix of observations on independent variables that might be a design matrix (each column being a set of observations on one of the independent variables), B is a matrix containing parameters that are usually to be estimated and U is a matrix containing errors (noise). The errors are usually assumed to be uncorrelated across measurements, and follow a multivariate normal distribution. If the errors do not follow a multivariate normal distribution, generalized linear models may be used to relax assumptions about Y and U.

The general linear model incorporates a number of different statistical models: ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, _t_-test and _F_-test. The general linear model is a generalization of multiple linear regression to the case of more than one dependent variable. If Y, B, and U were column vectors, the matrix equation above would represent multiple linear regression.

Hypothesis tests with the general linear model can be made in two ways: multivariate or as several independent univariate tests. In multivariate tests the columns of Y are tested together, whereas in univariate tests the columns of Y are tested independently, i.e., as multiple univariate tests with the same design matrix.

Comparison to multiple linear regression

[edit]

Multiple linear regression is a generalization of simple linear regression to the case of more than one independent variable, and a special case of general linear models, restricted to one dependent variable. The basic model for multiple linear regression is

Y i = β 0 + β 1 X i 1 + β 2 X i 2 + … + β p X i p + ϵ i {\displaystyle Y_{i}=\beta _{0}+\beta _{1}X_{i1}+\beta _{2}X_{i2}+\ldots +\beta _{p}X_{ip}+\epsilon _{i}} $Y_{i}=\beta _{0}+\beta _{1}X_{i1}+\beta _{2}X_{i2}+\ldots +\beta _{p}X_{ip}+\epsilon _{i}$ or more compactly Y i = β 0 + ∑ k = 1 p β k X i k + ϵ i {\displaystyle Y_{i}=\beta _{0}+\sum \limits _{k=1}^{p}{\beta _{k}X_{ik}}+\epsilon _{i}} $Y_{i}=\beta _{0}+\sum \limits _{k=1}^{p}{\beta _{k}X_{ik}}+\epsilon _{i}$

for each observation i = 1, ... , n.

In the formula above we consider n observations of one dependent variable and p independent variables. Thus, Y i is the _i_th observation of the dependent variable, X ik is _k_th observation of the _k_th independent variable, j = 1, 2, ..., p. The values β j represent parameters to be estimated, and ε i is the _i_th independent identically distributed normal error.

In the more general multivariate linear regression, there is one equation of the above form for each of m > 1 dependent variables that share the same set of explanatory variables and hence are estimated simultaneously with each other:

Y i j = β 0 j + β 1 j X i 1 + β 2 j X i 2 + … + β p j X i p + ϵ i j {\displaystyle Y_{ij}=\beta _{0j}+\beta _{1j}X_{i1}+\beta _{2j}X_{i2}+\ldots +\beta _{pj}X_{ip}+\epsilon _{ij}} $Y_{ij}=\beta _{0j}+\beta _{1j}X_{i1}+\beta _{2j}X_{i2}+\ldots +\beta _{pj}X_{ip}+\epsilon _{ij}$ or more compactly Y i j = β 0 j + ∑ k = 1 p β k j X i k + ϵ i j {\displaystyle Y_{ij}=\beta _{0j}+\sum \limits _{k=1}^{p}{\beta _{kj}X_{ik}}+\epsilon _{ij}} $Y_{ij}=\beta _{0j}+\sum \limits _{k=1}^{p}{\beta _{kj}X_{ik}}+\epsilon _{ij}$

for all observations indexed as i = 1, ... , n and for all dependent variables indexed as j = 1, ... , m.

Note that, since each dependent variable has its own set of regression parameters to be fitted, from a computational point of view the general multivariate regression is simply a sequence of standard multiple linear regressions using the same explanatory variables.

Comparison to generalized linear model

[edit]

The general linear model and the generalized linear model (GLM)[2][3] are two commonly used families of statistical methods to relate some number of continuous and/or categorical predictors to a single outcome variable.

The main difference between the two approaches is that the general linear model strictly assumes that the residuals will follow a conditionally normal distribution,[4] while the GLM loosens this assumption and allows for a variety of other distributions from the exponential family for the residuals.[2] The general linear model is a special case of the GLM in which the distribution of the residuals follow a conditionally normal distribution.

The distribution of the residuals largely depends on the type and distribution of the outcome variable; different types of outcome variables lead to the variety of models within the GLM family. Commonly used models in the GLM family include binary logistic regression [5] for binary or dichotomous outcomes, Poisson regression [6] for count outcomes, and linear regression for continuous, normally distributed outcomes. This means that GLM may be spoken of as a general family of statistical models or as specific models for specific outcome types.

| | General linear model | Generalized linear model | | | ------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Typical estimation method | Least squares, best linear unbiased prediction | Maximum likelihood or Bayesian | | Examples | ANOVA, ANCOVA, linear regression | linear regression, logistic regression, Poisson regression, gamma regression,[7] general linear model | | Extensions and related methods | MANOVA, MANCOVA, linear mixed model | generalized linear mixed model (GLMM), generalized estimating equations (GEE) | | R package and function | lm() in stats package (base R) | glm() in stats package (base R) | | MATLAB function | mvregress() | glmfit() | | SAS procedures | PROC GLM, PROC REG | PROC GENMOD, PROC LOGISTIC (for binary & ordered or unordered categorical outcomes) | | Stata command | regress | glm | | SPSS command | regression, glm | genlin, logistic | | Wolfram Language & Mathematica function | LinearModelFit[][8] | GeneralizedLinearModelFit[][9] | | EViews command | ls[10] | glm[11] | | statsmodels Python Package | regression-and-linear-models | GLM |

An application of the general linear model appears in the analysis of multiple brain scans in scientific experiments where Y contains data from brain scanners, X contains experimental design variables and confounds. It is usually tested in a univariate way (usually referred to a mass-univariate in this setting) and is often referred to as statistical parametric mapping.[12]

^ Mardia, K. V.; Kent, J. T.; Bibby, J. M. (1979). Multivariate Analysis. Academic Press. ISBN 0-12-471252-5.
^ a b McCullagh, P.; Nelder, J. A. (January 1, 1983). "An outline of generalized linear models". Generalized Linear Models. Springer US. pp. 21–47. doi:10.1007/978-1-4899-3242-6_2 (inactive 13 December 2024). ISBN 9780412317606.{{[cite book](/wiki/Template:Cite%5Fbook "Template:Cite book")}}: CS1 maint: DOI inactive as of December 2024 (link)
^ Fox, J. (2015). Applied regression analysis and generalized linear models. Sage Publications.
^ Cohen, J.; Cohen, P.; West, S. G.; Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (Report).
^ Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (Vol. 398). John Wiley & Sons.
^ Gardner, W.; Mulvey, E. P.; Shaw, E. C. (1995). "Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models". Psychological Bulletin. 118 (3): 392–404. doi:10.1037/0033-2909.118.3.392. PMID 7501743.
^ McCullagh, Peter; Nelder, John (1989). Generalized Linear Models (2nd ed.). Boca Raton: Chapman and Hall/CRC. ISBN 978-0-412-31760-6.
^ LinearModelFit, Wolfram Language Documentation Center.
^ GeneralizedLinearModelFit, Wolfram Language Documentation Center.
^ ls, EViews Help.
^ glm, EViews Help.
^ Friston, K.J.; Holmes, A.P.; Worsley, K.J.; Poline, J.-B.; Frith, C.D.; Frackowiak, R.S.J. (1995). "Statistical Parametric Maps in functional imaging: A general linear approach". Human Brain Mapping. 2 (4): 189–210. doi:10.1002/hbm.460020402. S2CID 9898609.

Christensen, Ronald (2020). Plane Answers to Complex Questions: The Theory of Linear Models (5th ed.). New York: Springer. ISBN 978-3-030-32096-6.
Wichura, Michael J. (2006). The coordinate-free approach to linear models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press. pp. xiv+199. ISBN 978-0-521-86842-6. MR 2283455.
Rawlings, John O.; Pantula, Sastry G.; Dickey, David A., eds. (1998). Applied Regression Analysis. Springer Texts in Statistics. doi:10.1007/b98890. ISBN 0-387-98454-2.