Data analysis using R

Last Updated : 29 Apr, 2026

Data analysis is the process of examining and interpreting data to extract meaningful insights that can guide decision-making. Using R, a statistical programming language, makes this process efficient and reproducible.

[Figure: Data Analysis]

Importance of Data Analysis

Data analysis helps convert raw data into meaningful information that can support better understanding and decision-making. It allows individuals and organizations to examine data carefully and draw logical conclusions based on facts.

Steps for Data Analysis

Data analysis in R follows a structured approach to transform raw data into meaningful insights. Each step plays an important role in ensuring accurate and reliable results.

[Figure: Steps of Data Analysis]

1. Define the Problem Statement

The first step is to clearly identify the objective of the analysis. A well-defined problem helps determine what data is needed and what type of analysis should be performed.

2. Data Collection

After defining the problem, relevant data must be gathered from appropriate sources. Only data related to the objective should be collected.

3. Data Inspection

Before cleaning, it is important to understand the structure and content of the dataset. This helps identify potential issues.

4. Data Preprocessing

Raw data often contains missing values, duplicates or inconsistencies. Preprocessing prepares the data for accurate analysis.

5. Exploratory Data Analysis (EDA)

After cleaning the data, analysis can begin to uncover patterns, trends and relationships. This step helps to understand the dataset and extract meaningful insights.

6. Data Visualization

Visualization helps present findings in a clear and understandable manner. Graphs make patterns and trends easier to interpret.

7. Interpretation and Reporting

The final step is to interpret the results and communicate findings effectively. Clear reporting supports informed decision-making.
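The inspection, preprocessing, exploration and visualization steps above can be sketched end to end on a small, made-up data frame. This is only an illustration under stated assumptions: the `sales` data, its `region` and `units` columns, and the decision to drop incomplete rows are all invented for the example.

```r
# Hypothetical toy data standing in for a collected dataset
sales <- data.frame(
  region = c("North", "South", "North", "East", "South", "South"),
  units  = c(12, NA, 15, 7, 9, 9),
  stringsAsFactors = FALSE
)

# Data inspection: structure and size
str(sales)
dim(sales)

# Preprocessing: drop incomplete rows, then exact duplicate rows
clean <- sales[complete.cases(sales), ]
clean <- clean[!duplicated(clean), ]

# Exploratory analysis: mean units per region
units_by_region <- tapply(clean$units, clean$region, mean)
print(units_by_region)

# Visualization: bar plot of the grouped summary
barplot(units_by_region, xlab = "Region", ylab = "Mean units",
        main = "Mean units by region (toy data)")
```

Whether dropping rows is appropriate depends on the problem statement; imputing values is the usual alternative when too many rows would be lost.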

Performing Data Analysis Using Titanic Dataset

Here we will explore a real-world example of data analysis using the Titanic dataset. The Titanic dataset contains information about passengers aboard the RMS Titanic, including whether they survived, their age, gender, ticket class and more.

Download the dataset and save it as train.csv in your R working directory; the code below reads it from there.

Step 1: Importing the Dataset

We load the dataset into R with the read.csv() function and examine the first few rows with head().

```r
# Read the CSV file from the working directory
titanic <- read.csv("train.csv")

# Show the first six rows
head(titanic)
```

**Output:**

[Figure: Dataset]
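One detail of read.csv() is worth checking at import time: by default, blank fields become NA in numeric columns but stay as empty strings in character columns. The sketch below shows the difference on a hypothetical three-row CSV held in a string, so it runs without any file. The column names mimic the Titanic data, but the rows are made up.

```r
# Hypothetical CSV text; the trailing commas are blank Cabin fields
csv_text <- "PassengerId,Sex,Age,Cabin
1,male,22,
2,female,,C85
3,female,26,"

# Default behaviour: only the string "NA" and blank numeric fields become NA
default_read <- read.csv(text = csv_text)

# Treat empty strings as missing in every column
strict_read <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Missing values per column under each setting
print(colSums(is.na(default_read)))
print(colSums(is.na(strict_read)))
```

If blank text fields should count as missing in the analysis, passing `na.strings = c("", "NA")` when loading the real file is one way to get that behaviour.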

Step 2: Checking Data Types

Next, we can check the class (data type) of each column using the sapply() function. This will help us understand how each column is represented in R.

```r
# Collect the class of every column
cls <- sapply(titanic, class)

# Show the result as a one-column data frame
cls <- as.data.frame(cls)
cls
```

**Output:**

[Figure: Data Types]
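As a possible alternative, str() reports the class of each column together with a preview of its values, and vapply() returns a plain character vector even when a column carries more than one class. The sketch below uses a hypothetical three-row `passengers` data frame whose column names follow the article; the values are invented.

```r
# Hypothetical miniature of the passenger table
passengers <- data.frame(
  Survived = c(0L, 1L, 1L),
  Sex      = c("male", "female", "female"),
  Age      = c(22, 38, NA),
  stringsAsFactors = FALSE
)

# First class of each column, always returned as a character vector
col_classes <- vapply(passengers, function(col) class(col)[1], character(1))
print(col_classes)

# Class plus a preview of the values, in one call
str(passengers)
```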

Step 3: Converting Categorical Data

Columns like Survived and Sex are categorical, so we can convert them to factors for better analysis.

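```r
# Convert the 0/1 survival flag and the Sex column to factors
titanic$Survived <- factor(titanic$Survived, levels = c(0, 1),
                           labels = c("Not Survived", "Survived"))
titanic$Sex <- as.factor(titanic$Sex)

# Re-check the column classes after conversion
cls <- sapply(titanic, class)
cls <- as.data.frame(cls)
cls
```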

**Output:**

[Figure: Converting Categorical Data]
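To see what the conversion does, it can help to try factor() on a short vector first. The sketch below uses a hypothetical `codes` vector with the same 0/1 coding; nothing in it comes from the real dataset.

```r
# Hypothetical survival codes using the article's 0/1 coding
codes <- c(0, 1, 1, 0, 1)

survived <- factor(codes, levels = c(0, 1),
                   labels = c("Not Survived", "Survived"))

# The labels, in the order given by `levels`
print(levels(survived))

# Counts per label
print(table(survived))

# Internal integer codes start at 1, not at the original 0
print(as.integer(survived))
```

Because the internal codes no longer match the original 0/1 values, comparisons after the conversion should use the labels, as in `survived == "Survived"`.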

Step 4: Summary Statistics

To get an overview of the data, we can use the summary() function. For numeric columns it reports the minimum, maximum, mean, median and quartiles, plus the number of missing values; for factor columns it reports the count of each level.

```r
summary(titanic)
```

**Output:**

[Figure: Summary Statistics]

Step 5: Handling Missing Values

The dataset contains missing values (NA). To identify how many missing values are present, we can use the following code:

```r
sum(is.na(titanic))
```

**Output:**

87

This indicates that there are 87 missing values in the dataset. We can either remove the rows containing missing values or fill them with the mean (for numerical columns) or mode (for categorical columns).

```r
# Keep only the rows that contain no missing values
dropnull_titanic <- titanic[rowSums(is.na(titanic)) == 0, ]
```

This will remove the rows with missing values, leaving us with a cleaner dataset.
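The text mentions filling missing values with the mean as an alternative to dropping rows. A minimal sketch of that option, along with a per-column count of missing values, is shown below. The `passengers` data frame and its values are hypothetical; only the column names `Age` and `Fare` echo the Titanic data.

```r
# Hypothetical toy data with gaps
passengers <- data.frame(
  Age  = c(22, NA, 26, 35, NA),
  Fare = c(7.25, 71.28, NA, 53.10, 8.05)
)

# Missing values per column, rather than one total for the whole table
na_per_column <- colSums(is.na(passengers))
print(na_per_column)

# Mean imputation: replace each NA in Age with the mean of the observed ages
imputed <- passengers
age_mean <- mean(imputed$Age, na.rm = TRUE)
imputed$Age[is.na(imputed$Age)] <- age_mean

print(imputed$Age)
```

Imputation keeps every row but shrinks the variance of the filled column, so the choice between dropping and filling is a judgment call that depends on how much data is missing.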

Step 6: Analyzing Survival Rate

Here we divide the data into two groups, those who survived and those who did not.

```r
# Passengers who survived
survivedlist <- dropnull_titanic[dropnull_titanic$Survived == "Survived", ]

# Passengers who did not survive
notsurvivedlist <- dropnull_titanic[dropnull_titanic$Survived == "Not Survived", ]
```

We can now analyze the number of survivors and non-survivors using a pie chart:

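```r
# Count survivors and non-survivors
mytable <- table(dropnull_titanic$Survived)

# Label each slice with its name and count on separate lines
lbls <- paste(names(mytable), "\n", mytable, sep = "")

pie(mytable, labels = lbls,
    main = "Pie Chart of Survived Column Data (with sample sizes)")
```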

**Output:**

[Figure: Survival Rate]

This pie chart will show the distribution of survivors versus non-survivors, highlighting the imbalance in the dataset.
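Since the section is about a rate, the counts can also be turned into percentages with prop.table(). The sketch below does this on a hypothetical `outcome` factor standing in for the Survived column, and draws a bar plot as one possible alternative to the pie chart.

```r
# Hypothetical outcomes standing in for dropnull_titanic$Survived
outcome <- factor(c("Not Survived", "Survived", "Not Survived",
                    "Not Survived", "Survived"))

counts <- table(outcome)

# Percentages that sum to 100
shares <- round(100 * prop.table(counts), 1)

# Labels that combine name, count and percentage
lbls <- paste0(names(counts), ": ", counts, " (", shares, "%)")
print(lbls)

barplot(counts, names.arg = lbls, ylab = "Passengers",
        main = "Survival counts (toy data)")
```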

Step 7: Visualizing Age Distribution of Survivors

We can also visualize the age distribution of the survivors:

```r
hist(survivedlist$Age, xlab = "Age", ylab = "Frequency",
     main = "Age Distribution of Survivors")
```

**Output:**

[Figure: Age Distribution of Survivors]
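A histogram of survivors alone gives no baseline for comparison. One way to compare ages across both groups is a per-group median plus a box plot, sketched below on a hypothetical `toy` data frame. The column names follow the article; the ages are invented.

```r
# Hypothetical toy passengers
toy <- data.frame(
  Age = c(4, 22, 35, 58, 27, 41, 19, 63),
  Survived = factor(c("Survived", "Survived", "Survived", "Not Survived",
                      "Survived", "Not Survived", "Not Survived",
                      "Not Survived"))
)

# Median age within each outcome group
age_median <- tapply(toy$Age, toy$Survived, median)
print(age_median)

# Side-by-side box plots of age for the two groups
boxplot(Age ~ Survived, data = toy, xlab = "", ylab = "Age",
        main = "Age by survival outcome (toy data)")
```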

Step 8: Analyzing Gender Distribution

A bar plot of gender counts shows how gender relates to survival. The plot below counts males and females among the passengers who did not survive; repeating it with survivedlist gives the matching counts for survivors, and comparing the two indicates how gender might have influenced survival on the Titanic.

```r
barplot(table(notsurvivedlist$Sex), xlab = "Gender", ylab = "Frequency",
        main = "Gender Distribution of Non-Survivors")
```

**Output:**

[Figure: Analyzing Gender Distribution]
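To look at survivors and non-survivors by gender in a single view, a two-way table can be built with table() and passed to barplot(). The sketch below assumes a hypothetical `toy` data frame with the article's `Sex` and `Survived` columns; the rows are made up.

```r
# Hypothetical toy passengers
toy <- data.frame(
  Sex      = factor(c("male", "female", "male", "female", "male", "male")),
  Survived = factor(c("Not Survived", "Survived", "Not Survived",
                      "Survived", "Survived", "Not Survived"))
)

# Counts for every Sex x Survived combination
sex_by_survival <- table(toy$Sex, toy$Survived)
print(sex_by_survival)

# Row-wise proportions: within each sex, the share in each outcome
print(prop.table(sex_by_survival, margin = 1))

# Grouped bars: one cluster per sex, one bar per outcome
barplot(t(sex_by_survival), beside = TRUE, legend.text = TRUE,
        xlab = "Gender", ylab = "Frequency",
        main = "Survival by gender (toy data)")
```

The row-wise proportions are usually easier to compare than raw counts when the two genders have very different group sizes.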

Step 9: Analyzing Class vs Survived

A grouped bar plot of passenger class against survival shows how many passengers in each class survived or did not survive, giving insight into how class might have influenced survival on the Titanic.

```r
# install.packages("ggplot2")  # run once if the package is not installed
library(ggplot2)

# Grouped bars: one cluster per class, one bar per survival outcome
ggplot(dropnull_titanic, aes(x = factor(Pclass), fill = Survived)) +
  geom_bar(position = "dodge") +
  labs(title = "Pclass vs Survived",
       x = "Pclass (1 = First, 2 = Second, 3 = Third)",
       y = "Count") +
  scale_fill_manual(values = c("red", "green"),
                    labels = c("Not Survived", "Survived")) +
  theme_minimal() +
  theme(legend.title = element_blank())
```

**Output:**

[Figure: Class vs Survived]
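The bar plot shows counts; the survival rate per class can be computed directly with tapply(), since the mean of a logical vector is the proportion of TRUE values. The sketch below uses a hypothetical `toy` data frame with the article's `Pclass` and `Survived` columns, so its numbers say nothing about the real dataset.

```r
# Hypothetical toy passengers
toy <- data.frame(
  Pclass   = c(1, 1, 2, 2, 3, 3, 3, 3),
  Survived = factor(c("Survived", "Survived", "Survived", "Not Survived",
                      "Not Survived", "Not Survived", "Survived",
                      "Not Survived"))
)

# Share of survivors within each class
rate_by_class <- tapply(toy$Survived == "Survived", toy$Pclass, mean)
print(round(rate_by_class, 2))
```

Running the same call on dropnull_titanic would give the per-class rates for the actual data.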

From our analysis, we can conclude that:

Applications

Limitations