Amazon Product Reviews Sentiment Analysis in Python

Last Updated : 23 Jul, 2025

Amazon gives small businesses and companies with modest resources a platform to grow. Because of its popularity, customers take the time to write detailed reviews of brands and products. Analyzing that data can tell companies a lot about their products and about ways to improve their quality, but the volume of reviews is too large for a person to analyze by hand.


This is where machine learning comes in: Natural Language Processing (NLP) lets us analyze large text datasets automatically. Our task is to predict whether a given review is positive or negative. The real dataset obtained by scraping the website might include millions of reviews, so we have preprocessed the data for you.

Before starting the code, download the dataset by clicking the link.

Steps to be followed

  1. Importing Libraries and Datasets
  2. Preprocessing and cleaning the reviews
  3. Analysis of the Dataset
  4. Converting text into Vectors
  5. Model training, Evaluation, and Prediction

Let's start with the code now.

Importing Libraries and Datasets

The libraries used are:

Python `

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
from wordcloud import WordCloud

`

For the NLP part, we will use the NLTK library. From it we need the stopwords and punkt resources, so let's download and import them using the commands below.

Python `

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

`

After that import the downloaded dataset using the below code.

Python `

data = pd.read_csv('AmazonReview.csv')
data.head()

`

**Output:**

Preprocessing and cleaning the reviews

Python `

data.info()

`

**Output:**

Data columns (total 2 columns):

     #   Column     Non-Null Count  Dtype
     0   Review     24999 non-null  object
     1   Sentiment  25000 non-null  int64

The Review column has one missing value (24999 non-null entries out of 25000 rows). To drop any rows with null values, run the command below.

Python `

data.dropna(inplace=True)

`

To predict the sentiment as positive (numerical value = 1) or negative (numerical value = 0), we need to convert the values in the Sentiment column into those two categories. The rule is: if the sentiment value is less than or equal to 3, it is negative (0); otherwise it is positive (1). For a better understanding, refer to the code below.

Python `

# 1, 2, 3 -> negative (i.e. 0)
data.loc[data['Sentiment'] <= 3, 'Sentiment'] = 0

# 4, 5 -> positive (i.e. 1)
data.loc[data['Sentiment'] > 3, 'Sentiment'] = 1

`
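Because both assignments write into the same column, their order matters. The following is a minimal sketch on made-up ratings (the `ratings` Series is hypothetical and not taken from the dataset) that shows a one-step alternative and what would go wrong if the two steps were swapped:

```python
import pandas as pd

# Hypothetical star ratings, for illustration only.
ratings = pd.Series([1, 2, 3, 4, 5])

# One-step mapping: the comparison gives True/False, cast to 1/0.
labels = (ratings > 3).astype(int)
print(labels.tolist())  # [0, 0, 0, 1, 1]

# Swapped two-step mapping: once 4 and 5 have become 1, the second
# assignment sees 1 <= 3 and overwrites them with 0.
wrong = ratings.copy()
wrong.loc[wrong > 3] = 1
wrong.loc[wrong <= 3] = 0
print(wrong.tolist())  # [0, 0, 0, 0, 0]
```

The order used in the article (negative first, then positive) avoids this, since the remaining values 4 and 5 are still greater than 3 when the second line runs.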

Now that the labels are ready, we will clean the Review column by removing the stopwords. The code for that is given below.

Python `

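# NLTK's list of English stopwords
stp_words = stopwords.words('english')

def clean_review(review):
    # Keep only the whitespace-separated words that are not stopwords
    cleanreview = " ".join(word for word in review.split()
                           if word not in stp_words)
    return cleanreview

data['Review'] = data['Review'].apply(clean_review)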

`
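One detail worth knowing: NLTK's English stopword list is lowercase, and the comparison above is case-sensitive, so a capitalized "The" or "This" is kept. The sketch below illustrates the effect with a tiny hand-picked stopword set standing in for the NLTK list; the `remove_stopwords` helper and its `lowercase` flag are assumptions made for this example, not part of the article's code.

```python
# A tiny hand-picked stopword set, standing in for NLTK's English list.
stop_words = {"the", "is", "a", "this", "and"}

def remove_stopwords(text, lowercase=False):
    """Drop stopwords from whitespace-separated text."""
    words = text.split()
    if lowercase:
        words = [w.lower() for w in words]
    return " ".join(w for w in words if w not in stop_words)

sample = "This phone is great and The battery is a joke"
print(remove_stopwords(sample))                  # This phone great The battery joke
print(remove_stopwords(sample, lowercase=True))  # phone great battery joke
```

Whether to lowercase first is a design choice. Note that `TfidfVectorizer` lowercases text by default later in the pipeline, so capitalized stopwords that survive cleaning may end up as features.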

Once the preprocessing is done, let's look at the top 5 rows of the cleaned dataset.

Python `

data.head()

`

**Output:**

Analysis of the Dataset

Let's check how many reviews there are for each sentiment.

Python `

data['Sentiment'].value_counts()

`

**Output:**

    0    15000
    1     9999
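The classes are imbalanced, which gives a useful reference point for later: a model that always predicts the majority class would already be right about 60% of the time. A quick sketch of that arithmetic, using the counts above (computed on the full dataset, so the share in any particular test split may differ slightly):

```python
# Class counts reported by value_counts() above.
counts = {0: 15000, 1: 9999}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class)               # 0
print(round(baseline_accuracy, 3))  # 0.6
```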

To get a better picture of which words matter, let's create a word cloud of all the words in reviews with sentiment = 0, i.e. negative.

Python `

consolidated = ' '.join(
    word for word in data['Review'][data['Sentiment'] == 0].astype(str))
wordCloud = WordCloud(width=1600, height=800,
                      random_state=21, max_font_size=110)
plt.figure(figsize=(15, 10))
plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
plt.axis('off')
plt.show()

`

**Output:**

[Figure: word cloud of the negative reviews]

Let's do the same for all the words in reviews with sentiment = 1, i.e. positive.

Python `

consolidated = ' '.join(
    word for word in data['Review'][data['Sentiment'] == 1].astype(str))
wordCloud = WordCloud(width=1600, height=800,
                      random_state=21, max_font_size=110)
plt.figure(figsize=(15, 10))
plt.imshow(wordCloud.generate(consolidated), interpolation='bilinear')
plt.axis('off')
plt.show()

`

**Output:**

Now we have a clear picture of the words that appear in each category.

Let's create the vectors.

Converting text into Vectors

TF-IDF (term frequency-inverse document frequency) measures how relevant a word is to a document within a corpus. A word's weight increases in proportion to the number of times it appears in the document, but is offset by how frequently the word occurs across the whole corpus (dataset). We will implement this with the code below, keeping the 2500 most frequent terms as features.

Python `

cv = TfidfVectorizer(max_features=2500)
X = cv.fit_transform(data['Review']).toarray()

`
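As a rough illustration of the weighting described above, the sketch below computes TF-IDF by hand on a toy corpus. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, which is how scikit-learn documents the default behaviour of `TfidfVectorizer`; the corpus and the helper names `idf` and `tfidf` are made up for this example.

```python
import math
from collections import Counter

# Toy corpus, for illustration only.
docs = [
    "good phone good battery",
    "bad phone",
    "good price",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    # Raw term count times IDF, then scaled to unit Euclidean length.
    tf = Counter(tokens)
    weights = {t: count * idf(t) for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items()):
    print(f"{term}: {w:.3f}")
```

In the first document, "good" gets the largest weight because it occurs twice, while "battery" outranks "phone" because it appears in fewer documents of the corpus.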

Model training, Evaluation, and Prediction

Once analysis and vectorization are done, we can train any machine learning model on the data. But first, perform the train-test split.

Python `

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    X, data['Sentiment'], test_size=0.25, random_state=42)

`
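Since the two classes are not equally frequent, one optional variation is to pass `stratify` so that both splits keep roughly the same class ratio. The article's split does not do this; the sketch below uses made-up labels only to show the effect of the parameter.

```python
from sklearn.model_selection import train_test_split

# Toy labels with a 60/40 imbalance; the features are just row indices.
y = [0] * 60 + [1] * 40
X = [[i] for i in range(len(y))]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Share of positive labels in each split.
print(sum(y_train) / len(y_train))
print(sum(y_test) / len(y_test))
```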

Now we can train any model. Let's try Logistic Regression.

Python `

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()

# Model fitting
model.fit(x_train, y_train)

# Testing the model
pred = model.predict(x_test)

# Model accuracy
print(accuracy_score(y_test, pred))

# This code is modified by Susobhan Akhuli

`

**Output:**

0.81632
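To predict the sentiment of a new review, the text has to go through the same fitted vectorizer, using `transform` rather than `fit_transform`, before it reaches the model. With the article's variables that would look something like `model.predict(cv.transform([clean_review(text)]).toarray())`. The sketch below shows the same idea in a self-contained form on made-up reviews, bundling the two steps in a scikit-learn pipeline; the training texts and labels are invented for illustration, so the predictions themselves are not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training reviews and labels, for illustration only.
train_texts = [
    "great product love it",
    "excellent quality works great",
    "love this excellent purchase",
    "terrible product broke quickly",
    "awful quality waste of money",
    "broke after a day terrible",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline fits the vectorizer and the classifier together, and
# reuses the fitted vocabulary when predicting on new text.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

new_reviews = ["excellent battery, love it", "broke quickly, waste of money"]
preds = clf.predict(new_reviews)
print(preds)
```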

Let's see the confusion matrix for the results.

Python `

from sklearn import metrics
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred)

cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=cm,
                                            display_labels=[False, True])

cm_display.plot()
plt.show()

# This code is modified by Susobhan Akhuli

`

**Output:**
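A confusion matrix also lets us read off metrics beyond accuracy. In scikit-learn's layout, rows are true labels and columns are predicted labels, so for binary labels 0 and 1 the entries are [[TN, FP], [FN, TP]]. The sketch below works through the arithmetic on hypothetical counts; they are not the article's results.

```python
# Hypothetical confusion matrix in scikit-learn's layout:
# rows are true labels, columns are predicted labels.
cm = [[50, 10],   # true 0: 50 predicted 0 (TN), 10 predicted 1 (FP)
      [5, 35]]    # true 1: 5 predicted 0 (FN), 35 predicted 1 (TP)

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many are found

print(accuracy)             # 0.85
print(round(precision, 3))  # 0.778
print(recall)               # 0.875
```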

Get the complete notebook:

**Notebook:** click here.

**For Dataset:** click here.