Exploratory Data Analysis in R Programming

Last Updated : 30 Apr, 2026

Exploratory Data Analysis (EDA) is a process for analyzing and summarizing the key characteristics of a dataset, often using visual methods. It helps to understand the structure, relationships and potential issues in data before conducting formal modeling.

Key Aspects of EDA

EDA is an iterative process that involves:

  1. Generating questions about the data.
  2. Searching for answers using visualization, transformation and modeling.
  3. Refining the questions or generating new ones based on what has been learned.

In R, we perform EDA through two primary approaches:

  1. **Descriptive Statistics:** Summarizing data using numerical methods such as mean, median and standard deviation.
  2. **Graphical Methods:** Visualizing data through plots like histograms, box plots and scatter plots.

In this example, we will use the built-in iris dataset in R to show EDA techniques.

```r
# Load the built-in iris dataset and show its first six rows.
data("iris")
head(iris)
```

**Output:**

[Figure: the first six rows of the iris dataset]
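Before computing any statistics, it usually helps to check the structure of the data and look for obvious issues such as missing values. The following is a minimal sketch using only base R functions; the variable names `missing_per_column` and `species_counts` are introduced here for illustration and are not part of the original example.

```r
# Load the built-in dataset (it ships with base R's 'datasets' package).
data("iris")

# Dimensions: number of rows (observations) and columns (variables).
dim(iris)

# Column types: four numeric measurements and one factor (Species).
str(iris)

# Per-column summary: quartiles for numeric columns, counts for the factor.
summary(iris)

# Count missing values in each column; all zeros means nothing is missing.
missing_per_column <- colSums(is.na(iris))
missing_per_column

# Number of observations per species.
species_counts <- table(iris$Species)
species_counts
```

If `missing_per_column` showed non-zero counts on another dataset, that would be a natural first question to investigate before moving on.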

Descriptive Statistics for EDA

Descriptive statistics involve summarizing and describing the main features of a dataset through numerical measures like mean, median, mode, standard deviation, variance and range. These statistics help in understanding the central tendency, dispersion and overall distribution of the data.

Measures of Central Tendency

To summarize the data, we begin with measures of central tendency: the mean, median and mode of the numeric variables.

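```r
# R has no built-in function for the statistical mode, so define one.
# It returns the most frequent value (the first one found if there is a tie).
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

cat("\n Mean Sepal Length: ", mean(iris$Sepal.Length))
cat("\n Median Sepal Length: ", median(iris$Sepal.Length))
cat("\n Mode Sepal Length: ", getmode(iris$Sepal.Length))
```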

**Output:**

Mean Sepal Length: 5.843333
Median Sepal Length: 5.8
Mode Sepal Length: 5
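The example above covers only Sepal Length. As a sketch of how the same three measures could be computed for every numeric column at once, the code below applies them with `sapply()`. The names `numeric_cols` and `central` are illustrative, and the `getmode` helper is repeated so the block runs on its own.

```r
data("iris")

# Same mode helper as in the example above: returns the most frequent value
# (the first one encountered if there is a tie).
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Keep only the numeric columns (this drops the Species factor).
numeric_cols <- iris[, sapply(iris, is.numeric)]

# One row per statistic, one column per variable.
central <- rbind(
  mean   = sapply(numeric_cols, mean),
  median = sapply(numeric_cols, median),
  mode   = sapply(numeric_cols, getmode)
)
central
```

Comparing the mean and median column by column is a quick way to spot skew: a large gap between the two suggests an asymmetric distribution.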

Measures of Dispersion

To understand the spread of the data, we calculate the variance, standard deviation, range and interquartile range (IQR).

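```r
# var() and sd() use the sample (n - 1) denominator.
cat("\n Variance: ", var(iris$Sepal.Length))
cat("\n Standard Deviation: ", sd(iris$Sepal.Length))
# range() returns two values: the minimum and the maximum.
cat("\n Range: ", range(iris$Sepal.Length))
cat("\n Interquartile Range (IQR): ", IQR(iris$Sepal.Length))
```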

**Output:**

Variance: 0.6856935
Standard Deviation: 0.8280661
Range: 4.3 7.9
Interquartile Range (IQR): 1.3
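To make the relationships between these measures concrete, the sketch below recomputes them by hand: the width of the range from `range()`, the IQR from the 25th and 75th percentiles, and the standard deviation as the square root of the variance. The variable names are illustrative only.

```r
data("iris")
x <- iris$Sepal.Length

# range() returns c(min, max); diff() turns that into a single width.
range_width <- diff(range(x))

# IQR() is the distance between the 75th and 25th percentiles.
q <- quantile(x, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])

# sd() is the square root of var() (both use the n - 1 denominator).
sd_manual <- sqrt(var(x))

cat("Range width:", range_width, "\n")
cat("IQR from quantiles:", iqr_manual, "\n")
cat("sqrt(variance):", sd_manual, "\n")
```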

Correlation

Next, we examine the relationships between numerical variables by computing the correlation matrix.

```r
# Pearson correlation matrix of the four numeric columns.
cor(iris[, 1:4])
```

**Output:**

[Figure: correlation matrix of the four numeric iris variables]
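A correlation matrix can be hard to scan by eye, so one option is to pull out the most strongly related pair programmatically. The following is a rough sketch of that idea; `pairs_only`, `strongest` and `strongest_r` are names made up for this illustration.

```r
data("iris")
corr_matrix <- cor(iris[, 1:4])

# Blank out the diagonal and the upper triangle so each pair appears once.
pairs_only <- corr_matrix
pairs_only[upper.tri(pairs_only, diag = TRUE)] <- NA

# Locate the pair with the largest absolute correlation.
idx <- which(abs(pairs_only) == max(abs(pairs_only), na.rm = TRUE),
             arr.ind = TRUE)
strongest <- c(rownames(pairs_only)[idx[1, "row"]],
               colnames(pairs_only)[idx[1, "col"]])
strongest_r <- pairs_only[idx[1, "row"], idx[1, "col"]]

cat("Strongest pair:", strongest, "\n")
cat("Correlation:", strongest_r, "\n")
```

Using the absolute value means a strong negative correlation would be reported as well, which is usually what is wanted when screening for related variables.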

Graphical Methods for EDA

Graphical methods involve visualizing the data using plots such as histograms, box plots, scatter plots and bar charts. These visualizations help in identifying patterns, trends, outliers and the distribution of data, making it easier to interpret and communicate insights. We will use the ggplot2 package for this purpose.

Distribution Histograms and Density Plots

We begin by plotting histograms to visualize the distribution of variables like Sepal Length.

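```r
# Install ggplot2 only if it is not already available, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)

# Histogram of Sepal.Length with bins 0.2 wide.
ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.2, fill = "blue", color = "white", alpha = 0.7) +
  labs(title = "Histogram of Sepal Length", x = "Sepal Length", y = "Frequency") +
  theme_minimal()
```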

**Output:**

[Figure: Histogram of Sepal Length]

Next, we can plot the density curve for Sepal Length:

```r
# Kernel density estimate of Sepal.Length.
ggplot(iris, aes(x = Sepal.Length)) +
  geom_density(fill = "blue", alpha = 0.7) +
  labs(title = "Density Curve for Sepal Length", x = "Sepal Length", y = "Density") +
  theme_minimal()
```

**Output:**

[Figure: Density Curve for Sepal Length]
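It can be useful to look at the numbers behind these two plots. The sketch below uses base R rather than ggplot2: `hist()` with `plot = FALSE` returns the bin counts, and `density()` returns the kernel density estimate. The bin edges chosen here are an assumption for illustration and may not line up exactly with the bins ggplot2 draws for `binwidth = 0.2`.

```r
data("iris")
x <- iris$Sepal.Length

# Bin edges 0.2 wide that span the observed range (4.3 to 7.9).
breaks <- seq(4.2, 8.0, by = 0.2)

# plot = FALSE returns the bin counts instead of drawing the histogram.
h <- hist(x, breaks = breaks, plot = FALSE)
data.frame(bin_mid = h$mids, count = h$counts)

# density() returns the kernel density estimate on a grid of x values.
d <- density(x)

# The x value where the estimated density is highest.
peak_location <- d$x[which.max(d$y)]
peak_location
```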

Box Plot

A box plot is useful to visualize the spread and potential outliers in the data.

```r
# One box per species, filled by species.
ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  labs(title = "Box Plot of Sepal Length by Species", x = "Species", y = "Sepal Length") +
  theme_minimal()
```

**Output:**

[Figure: Box Plot of Sepal Length by Species]
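To list the points a box plot would flag as outliers, one possible approach is base R's `boxplot.stats()`, which applies the usual 1.5 × IQR whisker rule. This is a sketch under that assumption: `boxplot.stats()` computes the box from hinges, so its result can differ slightly from what ggplot2 draws. The names `by_species` and `outliers` are illustrative.

```r
data("iris")

# Split Sepal.Length into one numeric vector per species.
by_species <- split(iris$Sepal.Length, iris$Species)

# boxplot.stats() applies the 1.5 * IQR whisker rule used by box plots;
# its $out element holds the points that fall beyond the whiskers.
outliers <- lapply(by_species, function(v) boxplot.stats(v)$out)
outliers
```

An empty result for a species means none of its values fall outside the whiskers.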

Scatter Plot

We can also examine the relationships between two numerical variables with scatter plots. For example, we’ll plot Sepal Length against Sepal Width.

```r
# Sepal Length vs Sepal Width, with points colored by species.
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(title = "Scatter Plot of Sepal Length vs Sepal Width", x = "Sepal Length", y = "Sepal Width") +
  theme_minimal()
```

**Output:**

[Figure: Scatter Plot of Sepal Length vs Sepal Width]
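Coloring the points by species matters here because the pooled relationship and the within-species relationships point in different directions. The sketch below puts numbers on this: on iris, the correlation across all flowers is weakly negative, while the correlation inside each species is positive. The names `overall_r` and `within_r` are illustrative.

```r
data("iris")

# Correlation across all 150 flowers, ignoring species.
overall_r <- cor(iris$Sepal.Length, iris$Sepal.Width)

# Correlation computed separately inside each species.
within_r <- sapply(
  split(iris, iris$Species),
  function(d) cor(d$Sepal.Length, d$Sepal.Width)
)

overall_r
within_r
```

This is a reminder to check grouped views of the data before drawing conclusions from a single summary number.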

Pairwise Plot

For more comprehensive visualization, a pairwise scatter plot (or pairs plot) can help us see all pairwise relationships between the numerical variables in the dataset.

```r
# Scatter plot matrix of the four numeric columns, colored by species.
pairs(iris[, 1:4], col = iris$Species, pch = 21)
```

**Output:**

[Figure: Pairwise Plot of the numeric iris variables]
