R Programming for Data Science

Last Updated : 22 Nov, 2025

R is an open-source programming language used for statistical computing and data analysis. It is an important tool for Data Science and is the first choice of many statisticians and data scientists.

R includes tools for creating aesthetic and insightful visualizations.
Helps in extracting, cleaning, transforming and loading data from multiple sources, including SQL databases, spreadsheets and even unstructured data through NoSQL interfaces.
Enables the use of predictive models to forecast future outcomes.

Syntax and Variables in R

In R, we use the <- operator to assign values to variables, though = is also commonly used. You can add comments to explain what the code does using the # symbol; everything after # on a line is ignored. It’s good practice to comment your code so that it’s easier to understand later.

R `

x <- 5 # Assigns the value 5 to x
y <- 3 # Assigns the value 3 to y
sum_result <- x + y
product_result <- x * y

print(paste('Sum of x and y: ', sum_result))
print(paste('Product of x and y: ', product_result))

Output

[1] "Sum of x and y:  8"
[1] "Product of x and y:  15"

Data Types and Structure in R

In R, data is stored in various structures, such as vectors, matrices, lists and data frames. Let’s break each one down.

1. **Vectors:** Vectors are like simple arrays that hold multiple values of the same type. You can create a vector using the c() function:

R `

vector <- c(1, 2, 3, 4, 5)
print(vector)
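Because every element of a vector shares one type, it may help to see what happens when types are mixed. The following is a minimal sketch in base R; the variable names (mixed, numbers, doubled, large_values) are illustrative and not part of the original example.

```r
# Mixing types in c() coerces every element to one common type
mixed <- c(1, "two", TRUE)
print(class(mixed))  # all elements become character strings

# Arithmetic on a vector is applied element by element
numbers <- c(1, 2, 3, 4, 5)
doubled <- numbers * 2
print(doubled)

# Indexing starts at 1; a logical condition selects matching elements
print(numbers[2])
large_values <- numbers[numbers > 3]
print(large_values)
```

Element-wise arithmetic is the reason explicit loops are often unnecessary in R.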

2. **Matrices:** Matrices are two-dimensional arrays where each element has the same data type. You create a matrix using the matrix() function:

R `

matrix_data <- matrix(1:9, nrow = 3, ncol = 3)
print(matrix_data)

Output

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

3. **Lists:** Lists can contain elements of different types, including numbers, strings, vectors and even other lists. Lists are created using the list() function:

R `

list_data <- list("Red", 20, TRUE, 1:5)
print(list_data)

Output

[[1]]
[1] "Red"

[[2]]
[1] 20

[[3]]
[1] TRUE

[[4]]
[1] 1 2 3 4 5

4. **Data Frames:** Data frames are the most commonly used data structure in R. They’re like tables, where each column can contain a different data type. Use data.frame() to create one:

R `

# Creating DataFrame in R
data_frame <- data.frame(Name = c("Alice", "Bob"), Age = c(24, 28))
print(data_frame)

Output

   Name Age
1 Alice  24
2   Bob  28
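Once a data frame exists, the usual next step is to inspect it and pull out columns or rows. The sketch below uses only base R; the Senior column and the age threshold of 25 are made up for illustration.

```r
data_frame <- data.frame(Name = c("Alice", "Bob"), Age = c(24, 28))

# Inspect column names and types
str(data_frame)

# Extract one column as a vector
ages <- data_frame$Age
print(ages)

# Keep only the rows that satisfy a condition
older <- data_frame[data_frame$Age > 25, ]
print(older)

# Add a new column (Senior is a hypothetical name)
data_frame$Senior <- data_frame$Age > 25
print(data_frame)
```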

These foundational concepts are a great starting point for your journey into data science. To dive deeper, consider exploring the following tutorial: R Programming Tutorial

R libraries for Data Science

In R, several libraries support data science tasks ranging from data manipulation and statistical modeling to visualization and machine learning. The sections below use key libraries such as dplyr, tidyr, rpart, randomForest and forecast.

Data Manipulation with R Programming

R Libraries are effective for data manipulation, enabling analysts to clean, transform and summarize datasets efficiently.

Using dplyr for Data Manipulation

The dplyr package provides a set of functions that make it easy to manipulate data frames in a clean and readable manner. Some of the key functions in dplyr include:

**filter():** Filters rows based on conditions.
**select():** Selects specific columns.
**mutate():** Adds or modifies columns.
**arrange():** Orders rows by specified columns.
**summarize():** Summarizes data by applying functions (e.g., mean, sum).

Let's perform data manipulation with some of the above functions on a sample dataset:

R `

install.packages("dplyr")
library(dplyr)

data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age = c(24, 28, 35, 40, 22),
  Salary = c(50000, 60000, 70000, 80000, 45000)
)

# Filters rows based on conditions
filtered_data <- filter(data, Age > 25)
print("Filtered Data (Age > 25):")
print(filtered_data)

# Selects specific columns
selected_data <- select(data, Name, Salary)
print("Selected Data (Name and Salary columns):")
print(selected_data)

**Output:**

[1] "Filtered Data (Age > 25):"
Name Age Salary
1 Bob 28 60000
2 Charlie 35 70000
3 David 40 80000

[1] "Selected Data (Name and Salary columns):"
Name Salary
1 Alice 50000
2 Bob 60000
3 Charlie 70000
4 David 80000
5 Eve 45000
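The example above covers filter() and select(). The remaining verbs from the list (mutate(), arrange() and summarize()) can be sketched the same way. This assumes dplyr is already installed; the Bonus column and its 10% rate are hypothetical.

```r
library(dplyr)

data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age = c(24, 28, 35, 40, 22),
  Salary = c(50000, 60000, 70000, 80000, 45000)
)

# Add a derived column, then sort by age from oldest to youngest
result <- data %>%
  mutate(Bonus = Salary * 0.10) %>%  # Bonus is a made-up column
  arrange(desc(Age))
print(result)

# Collapse the whole table into summary values
summary_stats <- data %>%
  summarize(Mean_Salary = mean(Salary), Total_Salary = sum(Salary))
print(summary_stats)
```

The pipe operator %>% passes the result of one step to the next, which keeps multi-step transformations readable.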

Data Cleaning and Transformation

Data cleaning involves correcting or removing errors and transforming data into a usable format. Key transformations include:

**rename():** to rename columns
**as.character():** to change the data type
**mutate():** to create derived variables

Now, we will be using the previous dataset to perform data transformation:

R `

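# Renaming columns
data_renamed <- rename(data, Employee_Name = Name, Employee_Age = Age)

# Adding the derived column shown in the output (its values equal Salary / 12)
data_renamed <- mutate(data_renamed, Salary_per_year = Salary / 12)

print("Renamed Data (Name to Employee_Name, Age to Employee_Age):")
print(data_renamed)

**Output:**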

[1] "Renamed Data (Name to Employee_Name, Age to Employee_Age):"
Employee_Name Employee_Age Salary Salary_per_year
1 Alice 24 50000 4166.667
2 Bob 28 60000 5000.000
3 Charlie 35 70000 5833.333
4 David 40 80000 6666.667
5 Eve 22 45000 3750.000
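Type conversion with as.character() is listed above but not shown. A minimal base R sketch follows; the two-row data frame and the "unknown" string are illustrative values only.

```r
data <- data.frame(
  Name = c("Alice", "Bob"),
  Age = c(24, 28),
  Salary = c(50000, 60000)
)

# Convert a numeric column to text
data$Age <- as.character(data$Age)
print(class(data$Age))

# Convert it back to numbers
data$Age <- as.numeric(data$Age)
print(class(data$Age))

# Text that is not a number becomes NA (R also emits a warning)
bad <- suppressWarnings(as.numeric("unknown"))
print(bad)
```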

Handling Missing Values

Dealing with missing values is an essential part of data preparation. R provides several functions to identify, handle and replace missing values in datasets. Key functions include:

**is.na():** To identify missing values in the data.
**na.omit():** To remove rows with missing values.
**ifelse():** To replace missing values with a specific value or calculated result.
**tidyr::fill():** To fill missing values using the previous or next non-missing value in the column.

R `

data_missing <- data.frame(
  Name = c("Alice", "Bob", "Charlie", NA, "Eve"),
  Age = c(24, 28, 35, NA, 22),
  Salary = c(50000, NA, 70000, 80000, 45000)
)

# Identifying missing values
missing_data <- is.na(data_missing)
print("Identifying Missing Values:")
print(missing_data)

# Fill missing values
install.packages("tidyr")
library(tidyr)
data_filled <- fill(data_missing, Age, .direction = "down")
print("Data After Filling Missing Values in Age (Downward Direction):")
print(data_filled)

**Output:**

[1] "Identifying Missing Values:"
Name Age Salary
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE TRUE
[3,] FALSE FALSE FALSE
[4,] TRUE TRUE FALSE
[5,] FALSE FALSE FALSE

[1] "Data After Filling Missing Values in Age (Downward Direction):"
Name Age Salary
1 Alice 24 50000
2 Bob 28 NA
3 Charlie 35 70000
4 <NA> 35 80000
5 Eve 22 45000
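The other two functions from the list, na.omit() and ifelse(), need no extra packages. The sketch below reuses the same sample data; replacing a missing salary with the column mean is one possible choice, not a recommendation from the original text.

```r
data_missing <- data.frame(
  Name = c("Alice", "Bob", "Charlie", NA, "Eve"),
  Age = c(24, 28, 35, NA, 22),
  Salary = c(50000, NA, 70000, 80000, 45000)
)

# Count missing values per column
missing_counts <- colSums(is.na(data_missing))
print(missing_counts)

# Drop every row that contains at least one NA
complete_rows <- na.omit(data_missing)
print(complete_rows)

# Replace missing salaries with the mean of the observed salaries
mean_salary <- mean(data_missing$Salary, na.rm = TRUE)
data_imputed <- data_missing
data_imputed$Salary <- ifelse(is.na(data_imputed$Salary),
                              mean_salary,
                              data_imputed$Salary)
print(data_imputed)
```

Dropping rows is simple but discards data, while imputation keeps every row at the cost of introducing estimated values.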

Statistical Analysis in R

R provides tools for performing both descriptive and inferential statistical analysis, making it a preferred choice for statisticians and data scientists.

Descriptive Statistics

Descriptive statistics provide a summary of the data's key characteristics using measures like mean, median, variance and standard deviation.

**mean():** Calculates the average of a dataset.
**median():** Identifies the middle value in a dataset.
**sd():** Computes the standard deviation.
**summary():** Provides a summary of key descriptive statistics.

R `

vector <- c(10, 20, 30, 40, 50)

mean_value <- mean(vector)
median_value <- median(vector)
total_sum <- sum(vector)

print(paste("Mean:", mean_value))
print(paste("Median:", median_value))
print(paste("Sum:", total_sum))

Output

[1] "Mean: 30"
[1] "Median: 30"
[1] "Sum: 150"
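The example above does not exercise sd() or summary(). A short base R sketch on the same values covers them, along with var(), which the text mentions as a measure but does not list as a function.

```r
vector <- c(10, 20, 30, 40, 50)

# Sample variance and standard deviation (both divide by n - 1)
variance_value <- var(vector)
sd_value <- sd(vector)
print(variance_value)
print(sd_value)

# Minimum, quartiles, median, mean and maximum in one call
five_number <- summary(vector)
print(five_number)
```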

Inferential Statistics

Inferential statistics allow you to make predictions or generalizations about a population based on sample data.

**1. Hypothesis Testing**

Hypothesis Testing evaluates assumptions (hypotheses) about population parameters. In R, common hypothesis tests include:

**t.test():** Performs t-tests to compare means between two groups.
**aov():** Conducts Analysis of Variance (ANOVA) to compare means among three or more groups.
**chisq.test():** Performs Chi-Square tests for independence or goodness of fit.
**wilcox.test():** A non-parametric test that compares two independent samples (Wilcoxon rank-sum test).
**ks.test():** The Kolmogorov-Smirnov test compares two distributions to see if they are the same.
**fisher.test():** Fisher's exact test is used for small sample sizes in contingency tables.

R `

# T-test to compare means between two groups
group1 <- c(1, 2, 3, 4, 5)
group2 <- c(6, 7, 8, 9, 10)
t_test_result <- t.test(group1, group2)
print("T-test Result:")
print(t_test_result)

# Chi-Square test for independence
data_chisq <- matrix(c(10, 20, 20, 40), nrow = 2, byrow = TRUE)
chisq_result <- chisq.test(data_chisq)
print("Chi-Square Test Result:")
print(chisq_result)

**Output:**

[1] "T-test Result:"

Welch Two Sample t-test

data: group1 and group2
t = -5, df = 8, p-value = 0.001053
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.306004 -2.693996
sample estimates:
mean of x mean of y
3 8

[1] "Chi-Square Test Result:"

Pearson's Chi-squared test

data: data_chisq
X-squared = 0, df = 1, p-value = 1
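Two of the other tests from the list, aov() and wilcox.test(), can be sketched in base R as follows. The three-group scores data frame is made up for illustration; the two groups for the rank-sum test repeat the values from the t-test example.

```r
# One-way ANOVA: compare the means of three groups (illustrative data)
scores <- data.frame(
  value = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
  group = factor(rep(c("A", "B", "C"), each = 5))
)
anova_fit <- aov(value ~ group, data = scores)
print(summary(anova_fit))

# Extract the p-value for the group effect from the ANOVA table
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Wilcoxon rank-sum test: non-parametric comparison of two samples
group1 <- c(1, 2, 3, 4, 5)
group2 <- c(6, 7, 8, 9, 10)
wilcox_result <- wilcox.test(group1, group2)
print(wilcox_result)
```

In both tests, a small p-value suggests that the groups differ; the threshold (commonly 0.05) is a convention chosen by the analyst.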

**2. Correlation and Regression Analysis**

These techniques explore relationships between variables:

**Correlation Analysis:** Measures the strength and direction of relationships using cor().
**Regression Analysis:** Models relationships using lm() (linear regression).

R `

# Correlation Analysis using cor(): Measure the strength and direction of a linear relationship
x <- c(1, 2, 3, 4, 5)
y <- c(5, 4, 3, 2, 1)
correlation_result <- cor(x, y)
print("Correlation Between x and y:")
print(correlation_result)

**Output:**

[1] "Correlation Between x and y:"
[1] -1
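The regression half of this section can be sketched with lm() on a small made-up dataset. The y values below are invented so that they lie close to, but not exactly on, a straight line.

```r
# Illustrative data: y is roughly 2 * x with small deviations
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Fit a simple linear regression of y on x
model <- lm(y ~ x)
print(coef(model))  # intercept and slope

# Predict y for new x values
new_points <- data.frame(x = c(6, 7))
predictions <- predict(model, newdata = new_points)
print(predictions)
```

Calling summary(model) additionally reports standard errors, R-squared and p-values for the coefficients.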

Machine Learning with R

Machine learning in R enables analysts to build predictive models, perform classification and uncover patterns in data.

Supervised Learning

1. **Linear Regression:** Linear regression predicts continuous numeric outcomes from one or more predictors. In R, it is fitted using lm().

R `

# Sample Dataset
set.seed(123)
train_data <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  target = rnorm(100, mean = 100, sd = 15)
)

model_lr <- lm(target ~ predictor1 + predictor2, data = train_data)
pred_lr <- predict(model_lr, newdata = train_data)
head(pred_lr)
mse <- mean((train_data$target - pred_lr)^2) # Mean squared error on the training data
mse

**Output:**

197.509197666493

2. **Logistic Regression:** Logistic regression is used for binary classification tasks where the outcome variable is categorical (e.g., 0 or 1). In R, it is performed using the glm() function.

R `

set.seed(123)
train_data_logistic <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  target = sample(0:1, 100, replace = TRUE)
)

# Fit Logistic Regression model
model_logistic <- glm(target ~ predictor1 + predictor2, family = binomial, data = train_data_logistic)
pred_logistic <- predict(model_logistic, newdata = train_data_logistic, type = "response")
pred_logistic_class <- ifelse(pred_logistic > 0.5, 1, 0) # Convert probabilities to binary predictions

accuracy_logistic <- mean(pred_logistic_class == train_data_logistic$target)
accuracy_logistic

**Output:**

0.63

3. **Decision Trees:** Decision trees are used for both classification and regression tasks. In this example, we perform classification using the rpart() function:

R `

install.packages("rpart")
library(rpart)

set.seed(123)
train_data_tree <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  target = sample(0:1, 100, replace = TRUE)
)

# Fit Decision Tree model
model_tree <- rpart(target ~ predictor1 + predictor2, data = train_data_tree, method = "class")
pred_tree <- predict(model_tree, newdata = train_data_tree, type = "class")
accuracy_tree <- mean(pred_tree == train_data_tree$target)
accuracy_tree

**Output:**

0.72

4. **Random Forest:** Random Forest is an ensemble learning technique that performs classification and regression using randomForest().

R `

install.packages("randomForest")
library(randomForest)

set.seed(123)
train_data_rf <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  target = sample(0:1, 100, replace = TRUE)
)

# Convert the target to a factor so that randomForest() performs classification
train_data_rf$target <- factor(train_data_rf$target, levels = c(0, 1))

# Random Forest model
model_rf <- randomForest(target ~ predictor1 + predictor2, data = train_data_rf)
pred_rf <- predict(model_rf, newdata = train_data_rf)
accuracy_rf <- mean(pred_rf == train_data_rf$target)
print(paste("Random Forest Accuracy: ", accuracy_rf))

**Output:**

Random Forest Accuracy: 1

Unsupervised Learning

Unsupervised learning involves learning patterns in data without labeled outputs. Common techniques include clustering and dimensionality reduction.

1. **K-means Clustering:** K-means partitions the data into K clusters based on the distance between data points. In R, the kmeans() function is used to perform clustering.

R `

set.seed(123)
data <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  target = sample(0:1, 100, replace = TRUE)
)

# Perform K-means clustering (the third column, target, is excluded)
model_kmeans <- kmeans(data[, -3], centers = 3)
cluster_centers <- model_kmeans$centers
cluster_assignments <- model_kmeans$cluster
withinss <- model_kmeans$tot.withinss

print("Cluster Centers:")
print(cluster_centers)

print("Cluster Assignments:")
print(cluster_assignments)

print("Total Within-Cluster Sum of Squares:")
print(withinss)

**Output:**

[1] "Cluster Centers:"
predictor1 predictor2
1 62.48318 27.73121
2 51.24186 30.80630
3 41.05266 29.10471

[1] "Cluster Assignments:"
[1] 3 2 1 2 2 1 2 3 3 2 1 2 2 2 3 1 2 3 1 3 3 2 3 3 3 3 1 2 3 1 2 2 1 1 1 2 1
[38] 2 2 3 3 2 3 1 1 3 3 3 2 2 2 2 2 1 2 1 3 2 2 2 2 3 3 3 3 2 2 2 1 1 3 3 1 3
[75] 3 1 2 3 2 2 2 2 3 1 2 2 1 2 2 1 1 2 2 3 1 3 1 1 2 3

[1] "Total Within-Cluster Sum of Squares:"
[1] 3809.048
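The example fixes K at 3, but the text does not say how to choose it. One common heuristic is the elbow method: run kmeans() for several values of K and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below is an illustration under assumptions not found in the original: the features are standardized with scale(), K ranges from 1 to 6, and nstart = 10 restarts are used.

```r
set.seed(123)
features <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5)
)

# Standardize so that both predictors contribute equally to distances
scaled <- scale(features)

# Total within-cluster sum of squares for each candidate K
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(scaled, centers = k, nstart = 10)$tot.withinss
})

print(data.frame(k = k_values, wss = wss))
```

Plotting wss against k_values (for example with plot(k_values, wss, type = "b")) makes the elbow easier to spot.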

2. **Principal Component Analysis (PCA):** PCA transforms the data into a new coordinate system whose axes represent the directions of maximum variance. In R, PCA is performed using the prcomp() function.

R `

set.seed(123)
data_pca <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  predictor3 = rnorm(100, mean = 60, sd = 15)
)

# Perform PCA
pca_result <- prcomp(data_pca, center = TRUE, scale. = TRUE)
summary(pca_result)

**Output:**

Importance of components:
PC1 PC2 PC3
Standard deviation 1.0726 0.9900 0.9324
Proportion of Variance 0.3835 0.3267 0.2898
Cumulative Proportion 0.3835 0.7102 1.0000
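Beyond the summary table, the object returned by prcomp() holds the pieces needed for dimensionality reduction. The sketch below rebuilds the same data and reads those components; keeping the first two components is an arbitrary choice for illustration.

```r
set.seed(123)
data_pca <- data.frame(
  predictor1 = rnorm(100, mean = 50, sd = 10),
  predictor2 = rnorm(100, mean = 30, sd = 5),
  predictor3 = rnorm(100, mean = 60, sd = 15)
)
pca_result <- prcomp(data_pca, center = TRUE, scale. = TRUE)

# Loadings: how each original variable contributes to each component
loadings <- pca_result$rotation
print(loadings)

# Proportion of variance explained, computed from the standard deviations
variance_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
print(variance_explained)

# Scores: the data expressed in the new coordinate system
scores <- pca_result$x

# Reduce dimensionality by keeping only the first two components
reduced <- scores[, 1:2]
print(head(reduced))
```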

Model Evaluation

After building a model, it’s essential to evaluate its performance. We can evaluate models using the following metrics:

**1. Classification Evaluation Metrics**

**2. Regression Evaluation Metrics**
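The specific metrics are not listed in this section, so the sketch below picks widely used ones as an assumption: accuracy, precision, recall and F1 score for classification, and MSE, RMSE, MAE and R-squared for regression. All vectors are small made-up examples, and everything is computed in base R.

```r
# Classification: compare predicted labels with actual labels
actual <- c(1, 0, 1, 1, 0, 1, 0, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 0)

confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives

accuracy <- mean(predicted == actual)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
f1 <- 2 * precision * recall / (precision + recall)
print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))

# Regression: measure how far predictions fall from actual values
actual_values <- c(3, 5, 7, 9)
predicted_values <- c(2.5, 5.5, 7, 10)

errors <- actual_values - predicted_values
mse <- mean(errors^2)
rmse <- sqrt(mse)
mae <- mean(abs(errors))
r_squared <- 1 - sum(errors^2) / sum((actual_values - mean(actual_values))^2)
print(c(mse = mse, rmse = rmse, mae = mae, r_squared = r_squared))
```

These metrics are most informative when computed on data the model did not see during training.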

Time Series Analysis in R

R provides multiple functions for creating, manipulating and analyzing time series data.

**ts() function in R:** The ts() function is used to convert a numeric vector into a time series object, where you can specify the start date and the frequency of the data (e.g., monthly, quarterly).
**Decomposition of Time Series:** In R, the decompose() function is used for decomposing time series into trend, seasonal and residual components.

For more advanced decomposition, you can use STL (Seasonal and Trend decomposition using Loess), which is more robust for irregular seasonality. It is implemented using stl() function.
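The three functions described here can be sketched on AirPassengers, a monthly dataset that ships with R. The choice of dataset, the multiplicative decomposition type and the log transform before stl() are assumptions made for illustration.

```r
# Build a monthly time series from a plain numeric vector
monthly_values <- as.numeric(AirPassengers)
series <- ts(monthly_values, start = c(1949, 1), frequency = 12)
print(frequency(series))

# Classical decomposition into trend, seasonal and random components
classical <- decompose(series, type = "multiplicative")
print(round(classical$figure, 3))  # one seasonal factor per month

# STL decomposition; stl() is additive, so the series is log-transformed
# first because its seasonal swings grow with the level
stl_fit <- stl(log(series), s.window = "periodic")
print(head(stl_fit$time.series))
```

Calling plot(classical) or plot(stl_fit) draws the individual components in stacked panels.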

Time Series Forecasting using R

**ARIMA Model:** The auto.arima() function from the forecast package can automatically select the best ARIMA model for the given time series data based on criteria like AIC (Akaike Information Criterion).
**SARIMA Model:** The auto.arima() function in R can also be used to fit a SARIMA model by automatically selecting the seasonal components.
**Exponential Smoothing (ETS):** Another popular forecasting technique is Exponential Smoothing, which is available in R through the ets() function from the forecast package.
**Prophet:** For handling seasonality and holidays, Facebook's Prophet model can be used. The function used to perform forecasting is prophet(). It is particularly useful for forecasting time series data with strong seasonal effects and missing data.
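The functions named above come from add-on packages. To keep the example dependency-free, the sketch below swaps in base R counterparts from the stats package: arima() with a hand-picked seasonal order in place of auto.arima(), and HoltWinters() as an exponential smoothing method in place of ets(). The dataset (AirPassengers), the model order and the one-year hold-out split are all assumptions for illustration.

```r
series <- AirPassengers

# Hold out the final year to check forecast quality
train <- window(series, end = c(1959, 12))
test <- window(series, start = c(1960, 1))

# Seasonal ARIMA(0,1,1)(0,1,1)[12] on the log scale; the order is fixed
# by hand here, whereas auto.arima() would search for it
fit_arima <- arima(log(train), order = c(0, 1, 1),
                   seasonal = list(order = c(0, 1, 1), period = 12))
forecast_arima <- exp(predict(fit_arima, n.ahead = 12)$pred)

# Holt-Winters exponential smoothing with multiplicative seasonality
fit_hw <- HoltWinters(train, seasonal = "multiplicative")
forecast_hw <- predict(fit_hw, n.ahead = 12)

# Root mean squared error against the held-out year
rmse <- function(actual, predicted) {
  sqrt(mean((as.numeric(actual) - as.numeric(predicted))^2))
}
rmse_arima <- rmse(test, forecast_arima)
rmse_hw <- rmse(test, forecast_hw)
print(c(arima = rmse_arima, holt_winters = rmse_hw))
```

With the forecast package installed, auto.arima(train) and ets(train) would replace the two fitting calls, and forecast() would replace predict().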

R Programming vs. Python Programming

| Feature | R | Python |
| --- | --- | --- |
| Introduction | R is a language and environment designed for statistical programming, computing and graphics. | Python is a general-purpose programming language used for data analysis and scientific computing. |
| Objective | Focuses on statistical analysis and data visualization. | Supports a wide range of applications, including GUI development, web development and embedded systems. |
| Workability | Offers numerous easy-to-use packages for statistical tasks. | Excels in matrix computation, optimization and general-purpose tasks. |
| Integrated Development Environment (IDE) | Popular IDEs include RStudio, RKWard and R Commander. | Common IDEs are Spyder, Eclipse+PyDev, Atom and more. |
| Libraries and Packages | Includes packages like ggplot2 for visualization and caret for machine learning. | Features libraries like Pandas, NumPy and SciPy for data manipulation and analysis. |
| Scope | Primarily used for complex statistical analysis and data science projects. | Offers a streamlined approach for data science, along with versatility in other domains. |

R is ideal for statistical computing and visualization, while Python provides a more versatile platform for diverse applications, including data science.

Top Companies Using R for Data Science

**Google:** Utilizes R for analytical operations, including the Google Flu Trends project, which analyzed flu-related search trends.
**Facebook:** Leverages R for social network analytics, gaining user insights and analyzing user relationships.
**IBM:** A major investor in R, IBM uses it for developing analytical solutions, including in IBM Watson.
**Uber:** Employs R’s Shiny package for interactive web applications and embedding dynamic visual graphics.

To get a detailed overview of R Programming for Data Science, you can refer to: Data Science Tutorial with R