Principal Component Analysis with R Programming

Last Updated : 30 Apr, 2026

Principal Component Analysis (PCA) is a machine learning technique used to reduce the dimensionality of large datasets while preserving as much information as possible. It transforms correlated variables into a smaller set of uncorrelated variables called principal components, making complex datasets easier to understand and visualize. PCA is widely used in Exploratory Data Analysis (EDA) and as a preprocessing step in predictive modeling.

[Figure: Principal Component Analysis (PCA)]

PCA transforms the original variables into new components, where each component captures the maximum remaining variance in the data.

How Principal Component Analysis (PCA) Works in R

PCA converts correlated numerical variables into a smaller set of uncorrelated components using linear algebra. In R, it is performed with prcomp() or princomp(). prcomp() computes the components from a singular value decomposition (SVD) of the centered (and optionally scaled) data matrix, while princomp() uses an eigendecomposition of the covariance or correlation matrix.
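As a rough check that the two functions describe the same components, the sketch below runs both on the built-in mtcars data (used later in this article). The object names p1 and p2 are illustrative only, and cor = TRUE is chosen here so that princomp() works on the correlation matrix, matching scale. = TRUE in prcomp().

```r
# SVD-based PCA on centered, unit-variance columns
p1 <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Eigendecomposition-based PCA on the correlation matrix
p2 <- princomp(mtcars, cor = TRUE)

# Component standard deviations should agree up to numerical tolerance
all.equal(unname(p1$sdev), unname(p2$sdev))

# Loadings agree up to the sign of each column, so compare absolute values
load1 <- abs(unname(p1$rotation[, 1]))
load2 <- abs(unname(unclass(p2$loadings)[, 1]))
all.equal(load1, load2)
```

Without cor = TRUE, princomp() works on the covariance matrix with divisor n rather than n - 1, so its standard deviations would differ from those of an unscaled prcomp() by a constant factor.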

Step 1: Standardize the Data

PCA is sensitive to scale. If variables are measured in different units (e.g., income and age), features with larger scales may dominate the results. Therefore, data is usually standardized before applying PCA.

Standardization is done using:

Z=\frac{X-\mu}{\sigma}

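where X is the original value of a variable, \mu is the mean of that variable and \sigma is its standard deviation.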
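The standardization formula can be sketched in R as follows. This is a minimal illustration on the built-in mtcars data; the names X, Z and mpg_z are chosen for this example only.

```r
# Standardize every column of mtcars: (x - mean) / sd
X <- as.matrix(mtcars)
Z <- scale(X, center = TRUE, scale = TRUE)

# The same computation done by hand for one column, mirroring the formula
mpg_z <- (X[, "mpg"] - mean(X[, "mpg"])) / sd(X[, "mpg"])

round(colMeans(Z), 10)   # column means are numerically zero
apply(Z, 2, sd)          # column standard deviations are one
all.equal(unname(Z[, "mpg"]), unname(mpg_z))
```

scale() uses the sample standard deviation (divisor n - 1), the same convention as sd().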

Step 2: Compute the Covariance (or Correlation) Matrix

After scaling, PCA examines relationships between variables by computing the covariance matrix. For standardized data, this covariance matrix is identical to the correlation matrix of the original variables.

Covariance between two variables x1 and x2 is:

cov(x_1, x_2) = \frac{\sum_{i=1}^{n}(x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{n - 1}

The covariance matrix is symmetric and shows how each pair of variables varies together. Its diagonal holds the variance of each variable.
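A small sketch of this step, assuming three mtcars columns picked only to keep the matrix readable (the names X, Z, S and cov_manual are illustrative):

```r
X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])  # three columns, for readability
Z <- scale(X)                                   # standardized copy

# Covariance of mpg and hp computed directly from the formula
n <- nrow(X)
cov_manual <- sum((X[, "mpg"] - mean(X[, "mpg"])) *
                  (X[, "hp"]  - mean(X[, "hp"]))) / (n - 1)
all.equal(cov_manual, cov(X[, "mpg"], X[, "hp"]))

S <- cov(Z)            # covariance matrix of the standardized data
isSymmetric(S)         # symmetric by construction
all.equal(S, cor(X))   # matches the correlation matrix of the raw data
```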

Step 3: Compute Eigenvectors and Eigenvalues

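PCA then calculates the eigenvectors and eigenvalues of the covariance matrix by solving:

AX=\lambda X

where A is the covariance matrix, X is an eigenvector (the direction of a principal component; here X is not the data matrix) and \lambda is the corresponding eigenvalue (the variance captured along that direction).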
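The eigenvalue equation can be checked numerically with base R's eigen(). The sketch below uses the correlation matrix of mtcars as A; the names e, lambda1 and v1 are illustrative.

```r
A <- cor(mtcars)                  # 11 x 11 correlation matrix
e <- eigen(A, symmetric = TRUE)   # eigenvalues and eigenvectors

lambda1 <- e$values[1]            # largest eigenvalue
v1 <- e$vectors[, 1]              # its eigenvector

# Residual of the defining equation A v = lambda v (should be near zero)
residual <- max(abs(A %*% v1 - lambda1 * v1))
residual

# eigen() returns the eigenvalues sorted in decreasing order
e$values
```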

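Step 4: Select the Principal Components

The eigenvalues are sorted in decreasing order, and each eigenvalue divided by their sum gives the proportion of variance explained by its component. We keep the first k components whose cumulative proportion reaches a chosen threshold (for example, 90%).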

Step 5: Transform the Data

Finally, the original data is projected onto the selected principal components. This creates a new dataset in a lower-dimensional space while preserving most of the important information.

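Z = X W

Where X is the standardized data matrix, W is the matrix whose columns are the selected eigenvectors, and Z is the transformed data, i.e. the principal component scores (not the standardized values of Step 1).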
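The projection can be sketched by hand and compared with prcomp(). This is an illustration under the assumption that two components are kept; the names X, W, Z and P are chosen for this example.

```r
X <- scale(as.matrix(mtcars))                              # standardized data
W <- eigen(cor(mtcars), symmetric = TRUE)$vectors[, 1:2]   # first two eigenvectors
Z <- X %*% W                                               # projected data (scores)

# prcomp() should return the same scores, up to the sign of each column
P <- prcomp(mtcars, center = TRUE, scale. = TRUE)$x[, 1:2]
max(abs(abs(Z) - abs(P)))
dim(Z)
```

The sign of an eigenvector is arbitrary, which is why the comparison uses absolute values.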

Step-by-Step Implementation

We will perform Principal Component Analysis (PCA) on the mtcars dataset to reduce dimensionality, visualize the variance and explore the relationships between different car attributes.

Step 1: Installing and Loading the Required Packages

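We install and load the dplyr package. The PCA functions used below, prcomp() and princomp(), are part of the stats package that ships with R and need no installation.

install.packages("dplyr")
library(dplyr)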

Step 2: Loading the Dataset

The mtcars dataset is a built-in dataset in R. It contains data on fuel consumption and various performance and design aspects of 32 cars. The dataset has 11 numeric variables, including miles per gallon (mpg), horsepower (hp) and weight (wt).

str(mtcars)

Output:

[Figure: mtcars dataset]

Step 3: Performing PCA

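To perform PCA, we use the prcomp() function. Setting center = TRUE and scale. = TRUE centers each variable to mean 0 and scales it to unit variance before the decomposition. Because PCA is driven by variance, scaling keeps variables measured in large units (such as disp or hp) from dominating the components. retx = TRUE returns the rotated data (the scores) in the x element.

my_pca <- prcomp(mtcars, scale. = TRUE, center = TRUE, retx = TRUE)
names(my_pca)

Output:

'sdev' 'rotation' 'center' 'scale' 'x'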

Step 4: Summary of PCA Results

We will summarize the PCA model to see how much variance is captured by each principal component.

summary(my_pca)

Output:

[Figure: Summary of PCA Results]

Step 5: Principal Component Loadings

We will see the weights (loadings) of each variable in the principal components.

my_pca$rotation[1:5, 1:4]

Output:

[Figure: Principal Component Loadings]
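The loadings form an orthonormal set of directions, which the following sketch checks. It refits the model so that it runs on its own; the name R for the rotation matrix is illustrative.

```r
my_pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
R <- my_pca$rotation   # 11 x 11 matrix, one column of loadings per component

# Each column has unit length
colSums(R^2)

# Columns are mutually orthogonal: R'R is (numerically) the identity matrix
max(abs(crossprod(R) - diag(ncol(R))))

# Variables with the largest absolute weight on the first component
head(sort(abs(R[, "PC1"]), decreasing = TRUE), 3)
```

Since each column has unit length, the squared loadings in a column sum to one and can be read as each variable's share of that component.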

Step 6: Principal Components Scores

We will now inspect the scores (values of the observations on each principal component).

head(my_pca$x)

Output:

[Figure: Principal Components Scores]

Step 7: Visualizing the Principal Components

We will use a biplot to show the observations (scores) and the variable loadings together on the first two principal components.

biplot(my_pca, main = "Biplot of Principal Components", scale = 0)

Output:

[Figure: Visualizing the Principal Components]

Step 8: Computing Standard Deviation and Variance

The sdev element stores the standard deviation of each principal component. Squaring it gives the variance (the eigenvalue) of each component.

my_pca.var <- my_pca$sdev^2

cat("Standard Deviation :", my_pca$sdev, "\n")
cat("Variance :", my_pca.var, "\n")

Output:

Standard Deviation : 2.570681 1.628026 0.7919579 0.5192277 0.4727061 0.4599958 0.3677798 0.350573 0.2775728 0.2281128 0.1484736
Variance : 6.6084 2.650468 0.6271973 0.2695974 0.2234511 0.2115961 0.135262 0.1229014 0.07704665 0.05203544 0.02204441
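These variances are the eigenvalues of the correlation matrix, and for standardized data they add up to the number of variables (11 here). A short sketch of that check, with pc_var and eig as illustrative names:

```r
my_pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
pc_var <- my_pca$sdev^2

# Eigenvalues of the correlation matrix, computed independently of prcomp()
eig <- eigen(cor(mtcars), symmetric = TRUE, only.values = TRUE)$values

all.equal(pc_var, eig)   # same values up to numerical tolerance
sum(pc_var)              # total variance of the standardized data
```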

Step 9: Proportion of Variance Explained

We calculate the proportion of variance for each component and visualize it using a scree plot to see how much variance each principal component explains.

propve <- my_pca.var / sum(my_pca.var)

plot(propve, xlab = "Principal Component", ylab = "Proportion of Variance Explained", ylim = c(0, 1), type = "b", main = "Scree Plot")

Output:

[Figure: Proportion of Variance Explained]

Step 10: Cumulative Proportion of Variance

We can see the cumulative variance explained by the components.

plot(cumsum(propve), xlab = "Principal Component", ylab = "Cumulative Proportion of Variance Explained", ylim = c(0, 1), type = "b")

Output:

[Figure: Cumulative Proportion of Variance]

Step 11: Choosing Top Principal Components

We can identify the smallest number of principal components that explain at least 90% of the variance.

which(cumsum(propve) >= 0.9)[1]

Output:

4
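The same selection rule can be wrapped in a small helper so that other thresholds are easy to try. The function name n_components and its threshold argument are hypothetical, introduced only for this sketch.

```r
# Smallest number of components whose cumulative variance share reaches `threshold`
n_components <- function(pca, threshold = 0.9) {
  prop <- pca$sdev^2 / sum(pca$sdev^2)   # proportion of variance per component
  which(cumsum(prop) >= threshold)[1]
}

my_pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)
k90 <- n_components(my_pca, 0.90)
k80 <- n_components(my_pca, 0.80)
c(k90 = k90, k80 = k80)
```

The threshold is a judgment call; 90% is used in this article, and a scree plot can help decide whether a lower cutoff is acceptable.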

Step 12: Predicting with Principal Components

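We can now use the first few principal components as predictors for another variable. For example, we predict disp (displacement) from the top 4 principal components. Because disp was itself one of the variables passed to prcomp(), this only illustrates the workflow; in a real predictive model the target should be excluded from the PCA input.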

train.data <- data.frame(disp = mtcars$disp, my_pca$x[, 1:4])

head(train.data)

Output:

[Figure: Predicting with Principal Components]

Step 13: Building a Decision Tree

Next, we use the rpart package to build a regression tree (method = "anova") that predicts disp from the first four principal components.

install.packages("rpart")
install.packages("rpart.plot")

library(rpart)
library(rpart.plot)

rpart.model <- rpart(disp ~ ., data = train.data, method = "anova")

rpart.plot(rpart.model)

Output:

[Figure: Building a Decision Tree]
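Since disp was included in the PCA above, the components already contain information about the target. The sketch below shows one possible variant that runs PCA on the predictors only and then scores new rows with predict(). The names predictors, pca_x, k, fit, new_rows and new_scores are illustrative, and the first three rows are reused as "new" data purely for demonstration.

```r
library(rpart)

# PCA on the predictors only, so the target (disp) does not leak into the components
predictors <- mtcars[, setdiff(names(mtcars), "disp")]
pca_x <- prcomp(predictors, center = TRUE, scale. = TRUE)

k <- 4  # number of components kept; chosen only to mirror the steps above
train.data <- data.frame(disp = mtcars$disp, pca_x$x[, 1:k])
fit <- rpart(disp ~ ., data = train.data, method = "anova")

# Project rows with the stored centering, scaling and rotation, then predict
new_rows <- predictors[1:3, ]
new_scores <- as.data.frame(predict(pca_x, newdata = new_rows)[, 1:k])
pred <- predict(fit, newdata = new_scores)
pred
```

predict() on a prcomp object applies the centering and scaling learned from the training data, which is what keeps new observations on the same axes as the fitted components.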