Movie and TV Show Recommendation Engine in R

Last Updated : 23 Jul, 2025

A movie recommendation system, powered by machine learning, can create a personalized viewing experience that keeps viewers satisfied and engaged. Building a good one matters because it directly affects user retention and platform popularity. It's a mix of technology and creativity, with techniques ranging from content-based filtering to collaborative filtering. In this article we build one in the R programming language.

Understanding Recommendation Systems

So, let's talk about recommendation systems. These nifty systems use fancy algorithms and machine learning to predict what you might like and suggest stuff you'll be interested in. How do they do it? Well, they go through a bunch of steps like collecting, storing, analyzing, and filtering data to give you personalized recommendations.

Now, there are two main types of recommendation systems.

**Collaborative Filtering:** Collaborative filtering uses data from user-item interactions to create suggestions. This approach splits into two main types: user-based and item-based. User-based filtering spots folks with matching tastes and pushes items these similar users enjoy. It uses tools like Pearson correlation or cosine similarity to figure out how alike users are. Item-based filtering, however, looks at how similar items are to each other. It suggests stuff that's close to what you've already checked out. This method often scales better than its counterpart.
**Content-Based Filtering:** Content-based filtering guesses and proposes items that match what a user liked before. This approach banks on item traits, such as movie genres or actors. The system looks at these features and suggests new stuff that fits what the user digs. For content-based filtering to work well, you need a ton of details about each item. You also need a full picture of the user: their clicks, ratings, likes, the works.
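To make the two similarity measures mentioned above concrete, here is a minimal base R sketch. The rating vectors are made-up numbers for two hypothetical users, and `cosine_similarity` is a helper written for this illustration, not a function from any package.

```r
# Ratings of two hypothetical users on the same five movies (made-up numbers)
user_a <- c(5, 4, 1, 2, 5)
user_b <- c(4, 5, 2, 1, 4)

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cos_ab <- cosine_similarity(user_a, user_b)

# Pearson correlation is available in base R through cor()
pearson_ab <- cor(user_a, user_b, method = "pearson")

print(c(cosine = cos_ab, pearson = pearson_ab))
```

Both values are close to 1 here because the two users rank the movies in a similar way. In a real rating matrix most entries are missing, so similarity is usually computed only over the items both users have rated.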

Alright, now let's dive into how movie and TV show recommendation engines work. These engines are all about analyzing your behavior, preferences, and even your demographic info to suggest content that's right up your alley. They use collaborative filtering to see what you and other users have in common, and content-based filtering to focus on the specific attributes of the movies and shows.

Setting Up Your Environment in R

To begin setting up the environment in R for a movie recommendation system, one needs to install several key libraries.

**recommenderlab:** R package for building recommendation systems.
**ggplot2:** R package for data visualization with a grammar of graphics approach.
**data.table:** R package for efficient data manipulation, especially for large datasets.
**reshape2:** R package for data reshaping and aggregation, facilitating analysis and visualization.
**dplyr:** R package for data manipulation using the %>% operator.
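One possible way to check which of these packages still need installing is sketched below. `install_if_missing` is a hypothetical helper written for this article, not part of R; it only installs when asked to, so running it with the default is safe.

```r
# Packages used in this article
required_pkgs <- c("recommenderlab", "ggplot2", "data.table", "reshape2", "dplyr")

# Hypothetical helper: report (and optionally install) packages that are missing
install_if_missing <- function(pkgs, install = FALSE) {
  missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
  if (install && length(missing_pkgs) > 0) {
    install.packages(missing_pkgs)  # Downloads from the configured CRAN mirror
  }
  missing_pkgs
}

# Dry run: only list what is missing
missing_pkgs <- install_if_missing(required_pkgs, install = FALSE)
print(missing_pkgs)
```

Calling `install_if_missing(required_pkgs, install = TRUE)` would then install whatever the dry run listed.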

Step 1: Data Collection

To build our recommendation system, we use the MovieLens dataset. The movies.csv and ratings.csv files used in this project are available here:

**Dataset Link:** Movies Data, Rating Data

This data consists of 105339 ratings applied over 10329 movies.

Step 2: Install and Load the Required Libraries

First we load the required libraries (install any that are missing beforehand) and read the two CSV files.

```r
# Load necessary libraries
library(recommenderlab) # For building recommendation systems
library(ggplot2)        # For data visualization
library(data.table)     # For efficient data manipulation
library(reshape2)       # For reshaping data
library(dplyr)          # For data manipulation using the %>% operator

# Load movie and rating data
movie_data <- read.csv("movies.csv", stringsAsFactors = FALSE)
rating_data <- read.csv("ratings.csv")
head(movie_data)
head(rating_data)
```

**Output:**

```
  userId movieId rating  timestamp
1      1      16    4.0 1217897793
2      1      24    1.5 1217895807
3      1      32    4.0 1217896246
4      1      47    4.0 1217896556
5      1      50    4.0 1217896523
6      1     110    4.0 1217896150
```

Step 3: Data Preprocessing

Now we will preprocess the data.

```r
# Extract genres from movie_data into a data frame
movie_genre <- as.data.frame(movie_data$genres, stringsAsFactors = FALSE)

# Split genres into separate columns using the '|' delimiter
movie_genre2 <- as.data.frame(tstrsplit(movie_genre[, 1], '[|]', type.convert = TRUE),
                              stringsAsFactors = FALSE)

# Number the split genre columns (10 columns for this dataset)
colnames(movie_genre2) <- seq_len(ncol(movie_genre2))

# Define list of genres
list_genre <- c("Action", "Adventure", "Animation", "Children", "Comedy", "Crime",
                "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical",
                "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western")

# Initialize genre matrix with zeros
genre_mat <- matrix(0, nrow(movie_data), length(list_genre))

# Assign column names to the genre matrix
colnames(genre_mat) <- list_genre

# Iterate through each movie and its genres
for (i in 1:nrow(movie_genre2)) {
  for (j in 1:ncol(movie_genre2)) {
    # Find the column index for the genre (empty for NA or unlisted genres)
    genre_col <- which(colnames(genre_mat) == movie_genre2[i, j])
    # Mark the corresponding genre as 1 in the genre matrix
    genre_mat[i, genre_col] <- 1
  }
}

# Convert genre matrix to data frame and ensure integer type
genre_mat <- as.data.frame(genre_mat, stringsAsFactors = FALSE)
genre_mat <- sapply(genre_mat, as.integer)

# Print structure of genre matrix
str(genre_mat)

# Combine movieId, title, and genre flags into SearchMatrix
SearchMatrix <- cbind(movie_data[, 1:2], genre_mat)

# Display the first few rows of SearchMatrix
head(SearchMatrix)
```

**Output:**

```
  movieId                              title Action Adventure Animation Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
1       1                   Toy Story (1995)      0         1         1        1      1     0           0     0       1         0      0       0       0       0      0        0   0       0
2       2                     Jumanji (1995)      0         1         0        1      0     0           0     0       1         0      0       0       0       0      0        0   0       0
3       3            Grumpier Old Men (1995)      0         0         0        0      1     0           0     0       0         0      0       0       0       1      0        0   0       0
4       4           Waiting to Exhale (1995)      0         0         0        0      1     0           0     1       0         0      0       0       0       1      0        0   0       0
5       5 Father of the Bride Part II (1995)      0         0         0        0      1     0           0     0       0         0      0       0       0       0      0        0   0       0
6       6                        Heat (1995)      1         0         0        0      0     1           0     0       0         0      0       0       0       0      0        1   0       0
```

This step involves preprocessing the genre information from the movie dataset. We split the genre strings into individual genres using tstrsplit. We then create a binary matrix genre_mat where each row represents a movie and each column represents a genre, with 1 indicating the presence of a genre for a movie. This matrix is then combined with the movie data to form the SearchMatrix.
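The same idea can be seen on a tiny scale with base R only. The sketch below uses three made-up movies and `strsplit` instead of `tstrsplit`; the titles and genre strings are illustrative, not taken from the dataset.

```r
# Three hypothetical movies with pipe-separated genre strings
titles <- c("Movie A", "Movie B", "Movie C")
genres <- c("Adventure|Animation|Children", "Comedy|Romance", "Action|Crime|Thriller")

# Split each genre string on the literal '|' character
genre_list <- strsplit(genres, "|", fixed = TRUE)

# Collect every genre that occurs at least once
all_genres <- sort(unique(unlist(genre_list)))

# One row per movie, one column per genre, 1 where the movie has that genre
genre_flags <- t(sapply(genre_list, function(g) as.integer(all_genres %in% g)))
dimnames(genre_flags) <- list(titles, all_genres)

print(genre_flags)
```

Here the genre columns are derived from the data, whereas the article's code fixes them in advance with list_genre, which keeps the column order stable across datasets.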

Step 4: Visualizing the Data

Now we will visualize the data.

```r
# Create a histogram of rating distribution using ggplot2
ggplot(rating_data, aes(x = rating)) +
  geom_histogram(binwidth = 0.5) +   # Histogram layer; 0.5 matches the half-star rating steps
  ggtitle("Rating Distribution") +   # Add plot title
  xlab("Rating") +                   # Label for x-axis
  ylab("Count")                      # Label for y-axis
```

**Output:**

[Figure: Rating Distribution histogram]

This plot shows the distribution of ratings in the dataset. It helps us understand the overall rating behavior of users.

Top Rated Movies

Now we will visualize the top-rated movies.

```r
# Calculate average rating and count of ratings for each movieId
top_rated_movies <- rating_data %>%
  group_by(movieId) %>%
  summarize(avg_rating = mean(rating), count = n()) %>%
  # Filter movies with more than 50 ratings
  filter(count > 50) %>%
  # Arrange movies by average rating in descending order
  arrange(desc(avg_rating)) %>%
  # Select top 10 movies by average rating
  top_n(10, wt = avg_rating)

# Merge top_rated_movies with movie_data to get movie titles
top_rated_movies <- merge(top_rated_movies, movie_data, by = "movieId")

# Create a bar plot of top rated movies
ggplot(top_rated_movies, aes(x = reorder(title, avg_rating), y = avg_rating)) +
  geom_bar(stat = "identity", fill = "lightgreen", color = "black") +
  coord_flip() +                     # Flip coordinates to make horizontal bar plot
  ggtitle("Top 10 Rated Movies") +   # Add plot title
  xlab("Movie Title") +              # Label for x-axis
  ylab("Average Rating")             # Label for y-axis
```

**Output:**

[Figure: Top 10 Rated Movies bar plot]

This plot shows the top 10 movies based on average ratings, considering only movies with more than 50 ratings. It provides insights into the highest-rated movies in the dataset.

Step 5: Create Rating Matrix

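Now we will create the rating matrix for the recommendation engine.

```r
# Reshape the long ratings table into a wide user x movie table
ratingMatrix <- dcast(rating_data, userId ~ movieId, value.var = "rating", na.rm = FALSE)
# Drop the userId column and convert to a plain matrix
ratingMatrix <- as.matrix(ratingMatrix[, -1])
# Convert to recommenderlab's realRatingMatrix class
ratingMatrix <- as(ratingMatrix, "realRatingMatrix")
ratingMatrix
```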

**Output:**

```
668 x 10325 rating matrix of class ‘realRatingMatrix’ with 105339 ratings.
```

We transform the ratings data into a matrix format suitable for the recommendation engine. Using dcast, we create a user-item matrix where rows represent users and columns represent movies, with the values being the ratings. This matrix is then converted into a realRatingMatrix object, which is the required format for the recommenderlab package.
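To see what this reshaping does, here is a small base R sketch on a made-up ratings table. It uses `tapply` as a stand-in for `dcast`, so it runs without reshape2; the toy data frame is an assumption for illustration only.

```r
# Hypothetical long-format ratings: one row per (user, movie) rating
toy_ratings <- data.frame(
  userId  = c(1, 1, 2, 2, 3),
  movieId = c(10, 20, 10, 30, 20),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

# Wide user x movie matrix; combinations with no rating become NA
user_item <- with(
  toy_ratings,
  tapply(rating, list(userId = userId, movieId = movieId), mean)
)

print(user_item)
```

Only 5 of the 9 cells are filled. The real matrix is far sparser (105339 ratings in a 668 x 10325 grid), which is why a sparse class such as realRatingMatrix is used to hold it.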

Step 6: Build Recommendation Engine

Now we will build our recommendation engine in R.

```r
# Build Item-Based Collaborative Filtering (IBCF) model using ratingMatrix data
recommen_model <- Recommender(data = ratingMatrix, method = "IBCF", parameter = list(k = 30))

# Get model information
model_info <- getModel(recommen_model)

# Display heatmap of similarity matrix for the first 20 rows and columns
image(model_info$sim[1:20, 1:20], main = "Heatmap of the first rows and columns")
```

**Output:**

The listing below shows the recommender methods that recommenderlab registers for realRatingMatrix data, together with their descriptions:

```
'HYBRID_realRatingMatrix' 'ALS_realRatingMatrix' 'ALS_implicit_realRatingMatrix' 'IBCF_realRatingMatrix' ...
$HYBRID_realRatingMatrix
'Hybrid recommender that aggegates several recommendation strategies using weighted averages.'
$ALS_realRatingMatrix
'Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm.'
$ALS_implicit_realRatingMatrix
'Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm.'
$IBCF_realRatingMatrix
'Recommender based on item-based collaborative filtering.'
$LIBMF_realRatingMatrix
'Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/
$POPULAR_realRatingMatrix
'Recommender based on item popularity.'
$RANDOM_realRatingMatrix
'Produce random recommendations (real ratings).'
$RERECOMMEND_realRatingMatrix
'Re-recommends highly rated items (real ratings).'
$SVD_realRatingMatrix
'Recommender based on SVD approximation with column-mean imputation.'
$SVDF_realRatingMatrix
'Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html).'
$UBCF_realRatingMatrix
'Recommender based on user-based collaborative filtering.'
```
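A listing like the one above can be obtained from recommenderlab's method registry. The snippet below is a sketch that assumes the package is installed; it is guarded with `requireNamespace` so it does nothing harmful when the package is absent, and the variable names are chosen for this illustration.

```r
# Query the registry only if recommenderlab is available
if (requireNamespace("recommenderlab", quietly = TRUE)) {
  entries <- recommenderlab::recommenderRegistry$get_entries(dataType = "realRatingMatrix")
  method_names <- names(entries)                     # e.g. "IBCF_realRatingMatrix"
  descriptions <- lapply(entries, "[[", "description")
  print(method_names)
} else {
  method_names <- character(0)
  descriptions <- list()
}
```

The set of registered methods depends on the installed package version, so the names printed on another machine may differ from the listing shown here.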

We build the recommendation model using the Item-Based Collaborative Filtering (IBCF) method. The k parameter sets how many of the most similar items are kept for each item (here 30). We visualize the similarity matrix using a heatmap to understand the similarity between the first 20 movies.

Step 7: Build the IBCF Model

IBCF's Inner Workings:

**Similarity Measurement:** IBCF figures out how alike items are based on user ratings. It uses tools like cosine similarity or Pearson correlation to crunch these numbers.
**Making Picks:** For each user, IBCF spots stuff they've given thumbs up to. Then it hunts down other items that match up. These lookalikes end up as suggestions for the user.
**Pros and Things to Ponder:** IBCF scales better than User-Based CF. Its similarity matrix takes up less space and has fewer gaps than the user-item matrix. IBCF softens the "cold start" issue for new users, since a few ratings are enough to look up similar items. A brand-new item with no ratings is still a problem, because item similarities are themselves computed from user ratings. To get the best out of IBCF, you need to tweak things like k (the number of neighbors). This can change how well it works and how good its suggestions are.
**Fitting into the Big Picture:** IBCF doesn't work alone. It's part of a bigger system that cleans data, trains models, checks how good they are, and puts them to use. IBCF is just one tool in the box. Recommendation engines use it along with other methods (like different CF types, content-based filtering, and mix-and-match approaches) to give each user tailored suggestions based on what they do and what items are like.

```r
# Build IBCF model
recommen_model <- Recommender(data = ratingMatrix, method = "IBCF", parameter = list(k = 30))
recommen_model

# Inspect model
model_info <- getModel(recommen_model)
class(model_info$sim)
dim(model_info$sim)

# Heatmap of similarities
top_items <- 20
image(model_info$sim[1:top_items, 1:top_items], main = "Heatmap of the first rows and columns")
```

**Output:**

```
Recommender of type ‘IBCF’ for ‘realRatingMatrix’
learned using 668 users.
'dgCMatrix'
10325 10325
```

[Figure: Heatmap of first rows and columns]

We build the recommendation model using Item-Based Collaborative Filtering (IBCF) with 30 nearest neighbors. We inspect the similarity matrix and visualize the similarities between the first 20 items using a heatmap.
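The mechanics of item-based filtering can be sketched in base R on a toy matrix. This is a simplified illustration under stated assumptions (cosine similarity over co-rating users, no rating normalization, made-up ratings), not a reproduction of what recommenderlab does internally. The names `item_cosine`, `sim_k` and `predict_rating` are hypothetical.

```r
# Toy user-item matrix (rows = users, columns = movies); NA = not rated
ratings <- matrix(
  c( 5,  4, NA,  1,
     4,  5,  1, NA,
     1, NA,  5,  4,
    NA,  1,  4,  5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("u", 1:4), paste0("m", 1:4))
)

# Cosine similarity between two item columns, using only users who rated both
item_cosine <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  if (sum(both) == 0) return(0)
  sum(a[both] * b[both]) / (sqrt(sum(a[both]^2)) * sqrt(sum(b[both]^2)))
}

# Item-item similarity matrix (diagonal left at 0 so an item is not its own neighbour)
n_items <- ncol(ratings)
sim <- matrix(0, n_items, n_items,
              dimnames = list(colnames(ratings), colnames(ratings)))
for (i in seq_len(n_items)) {
  for (j in seq_len(n_items)) {
    if (i != j) sim[i, j] <- item_cosine(ratings[, i], ratings[, j])
  }
}

# Keep only the k most similar items in each row, zero out the rest
k <- 2
sim_k <- sim
for (i in seq_len(n_items)) {
  weakest <- order(sim[i, ], decreasing = TRUE)[-seq_len(k)]
  sim_k[i, weakest] <- 0
}

# Predicted rating: similarity-weighted average of the user's own ratings
# on the neighbours of the target item
predict_rating <- function(user, item) {
  rated <- !is.na(ratings[user, ])
  w <- sim_k[item, ] * rated
  if (sum(w) == 0) return(NA_real_)
  sum(w * ratings[user, ], na.rm = TRUE) / sum(w)
}

pred_u1_m3 <- predict_rating("u1", "m3")
print(round(sim, 3))
print(pred_u1_m3)
```

With k = 30 on the real data the same pruning step keeps at most 30 neighbours per movie, which is what makes the 10325 x 10325 similarity matrix sparse.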

Step 8: Predict Recommendations

Now we will predict recommendations.

```r
# Set seed for reproducibility
set.seed(123)

# Sample data for training and testing
sampled_data <- sample(x = c(TRUE, FALSE), size = nrow(ratingMatrix), replace = TRUE, prob = c(0.8, 0.2))

# Split ratingMatrix into training and testing data
training_data <- ratingMatrix[sampled_data, ]
testing_data <- ratingMatrix[!sampled_data, ]

# Define the number of top recommendations to predict
top_recommendations <- 10

# Predict recommendations for testing data using the recommen_model
predicted_recommendations <- predict(object = recommen_model, newdata = testing_data, n = top_recommendations)

# Extract recommendations for the first user in the testing set
user1_recommendations <- predicted_recommendations@items[[1]]
user1_movies <- predicted_recommendations@itemLabels[user1_recommendations]

# Retrieve movie titles for the recommended movies
user1_movie_titles <- sapply(user1_movies, function(x) as.character(subset(movie_data, movieId == x)$title))

# Print recommended movie titles for the first user
user1_movie_titles
```

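**Output:**

The result is a named character vector: each movieId label is printed on the line above its title (the label of the first entry is missing from the captured output).

```
"Now and Then (1995)"
72
"Kicking and Screaming (1995)"
84
"Last Summer in the Hamptons (1995)"
90
"Journey of August King, The (1995)"
131
"Frankie Starlight (1995)"
271
"Losing Isaiah (1995)"
279
"My Family (1995)"
309
"Red Firecracker, Green Firecracker (Pao Da Shuang Deng) (1994)"
330
"Tales from the Hood (1995)"
352
"Crooklyn (1994)"
```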

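We split the users into training and testing sets (roughly 80% training, 20% testing). We then predict the top 10 movie recommendations for the users in the testing set. For the first user in the testing set, we extract the recommended movie IDs and look up their titles. Note that recommen_model was trained on the full ratingMatrix in the previous step, so training_data is not used here; for a true hold-out test, train the model on training_data instead.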
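The logical-mask split used above can be tried on its own with base R. The sketch below assumes 20 hypothetical users; because each user is assigned independently with probability 0.8, the split is only approximately 80/20. An exact-size alternative is shown for comparison.

```r
set.seed(123)
n_users <- 20  # Hypothetical number of users

# Mask-based split: each user lands in the training set with probability 0.8
in_train <- sample(c(TRUE, FALSE), size = n_users, replace = TRUE, prob = c(0.8, 0.2))
train_ids <- which(in_train)
test_ids  <- which(!in_train)

# Exact-size alternative: draw exactly 80% of the user indices
exact_train_ids <- sample(seq_len(n_users), size = floor(0.8 * n_users))
exact_test_ids  <- setdiff(seq_len(n_users), exact_train_ids)

print(c(mask_train = length(train_ids), exact_train = length(exact_train_ids)))
```

Either way the two sets are disjoint and together cover every user, which is the property the evaluation relies on.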

Step 9: Evaluate the Recommendation Engine

Now we will evaluate our recommendation engine.

```r
# Create an evaluation scheme with given parameters
scheme <- evaluationScheme(ratingMatrix, method = "split", train = 0.8, given = 15, goodRating = 4)

# Train the model on the training portion of the evaluation scheme
model <- Recommender(getData(scheme, "train"), method = "IBCF", parameter = list(k = 30))

# Predict ratings for known data in the evaluation scheme
pred <- predict(model, getData(scheme, "known"), type = "ratings")

# Calculate prediction accuracy using unknown data in the scheme
error <- calcPredictionAccuracy(pred, getData(scheme, "unknown"))

# Print prediction accuracy error
error
```

**Output:**

```
RMSE 1.49711559147115
MSE  2.24135509422602
MAE  1.14430992655367
```

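We evaluate the recommendation engine using a split method (80% train, 20% test). For each test user, 15 ratings are given to the model (given = 15) and the remaining ratings are withheld and compared with the predictions. Prediction accuracy is reported as RMSE, MSE and MAE, which measure how close the predicted ratings are to the actual ratings. Since no seed is set before the split, the exact numbers vary from run to run.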
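The three error measures are simple to compute by hand. The sketch below uses four made-up pairs of actual and predicted ratings, purely to illustrate the arithmetic behind the reported numbers.

```r
# Hypothetical withheld ratings and the model's predictions for them
actual    <- c(4.0, 3.0, 5.0, 2.5)
predicted <- c(3.5, 3.0, 4.0, 4.0)

errors <- predicted - actual   # -0.5, 0.0, -1.0, 1.5

mse  <- mean(errors^2)         # Mean squared error: 0.875
rmse <- sqrt(mse)              # Root mean squared error
mae  <- mean(abs(errors))      # Mean absolute error: 0.75

print(c(RMSE = rmse, MSE = mse, MAE = mae))
```

RMSE penalizes large misses more heavily than MAE, which is why the RMSE in the output above (about 1.50) is larger than the MAE (about 1.14).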

Conclusion

In this article, we started by getting the data ready. We extracted movie genres and created a rating matrix. Then, we used the recommenderlab package in R to train an item-based collaborative filtering (IBCF) recommendation model.

Once the model was trained, we made top-10 recommendations for users in a testing set and measured prediction accuracy with RMSE, MSE and MAE. To understand how items are related in the recommendation system, we inspected the similarity matrix with a heatmap. Additionally, we visualized the distribution of ratings and the top-rated movies.