Sentiment Analysis on Amazon musical instrument reviews

Aug 10, 2020

Every day we come across countless products, and online we swipe through hundreds of choices within a single category. Picking one can be tedious for the customer. This is where reviews come in: customers who have already bought the product leave a rating and briefly describe their experience. Ratings are easy to sort and judge, but with written reviews we have to read every line to tell whether the review is positive or negative. In the era of artificial intelligence, Natural Language Processing (NLP) has made that much easier.

Acknowledgements:

Ngram visualization analysis: Ratan Rohith
ROC AUC curve: Scikit learn documentation
Polarity and orange plots: Susan Li

What is sentiment analysis?

Sentiment analysis is one of the most common text classification tasks: it analyses an incoming message and tells whether the underlying sentiment is positive, negative, or neutral. Understanding people’s emotions is essential for businesses, since customers are able to express their thoughts and feelings more openly than ever before. It is quite hard for a human to go through every single line and identify the emotion behind the user experience. With technology, we can automatically analyze customer feedback, from survey responses to social media conversations, so brands can listen attentively to their customers and tailor products and services to meet their needs. To understand more about the concept, read this article.

Problem statement

This is the problem statement given by ISRO for classifying customer comments. Solving it helps the organization understand customer feedback.

Web portals like Bhuvan get a vast amount of feedback from users, and going through all of it is a tedious job. The task is to categorize the opinions expressed in feedback forums so that they can be used in a feedback management system. We classify individual comments/reviews and also determine an overall rating based on them. That way the company gets a complete picture of the feedback provided by customers and can act on the specific areas mentioned, which leads to more loyal customers and growth in business, reputation, brand value, and profits.

Objectives of Project

Reviews pre-processing and Cleaning
Story Generation and Visualization from reviews
Extracting Features from Cleaned reviews
Model Building: Sentiment Analysis

Import Libraries

Let’s import all the libraries needed for the analysis and, along with that, bring in our dataset.

Importing the dataset

Let’s welcome our dataset and see what’s inside the box

import pandas as pd

raw_reviews = pd.read_csv('../input/amazon-music-reviews/Musical_instruments_reviews.csv')
## print shape of dataset with rows and columns and information
print("The shape of the data is (row, column):" + str(raw_reviews.shape))
raw_reviews.info()

Dataset Details

This file contains the reviewer ID, product ID, reviewer name, review text, helpfulness, summary (obtained from the review text), overall rating on a scale of 5, and review time.

Description of columns in the file:

reviewerID — ID of the reviewer, e.g. A2SUAM1J3GNN3B
asin — ID of the product, e.g. 0000013714
reviewerName — name of the reviewer
helpful — helpfulness rating of the review, e.g. 2/3
reviewText — text of the review
overall — rating of the product
summary — summary of the review
unixReviewTime — time of the review (unix time)
reviewTime — time of the review (raw)

Pre-processing and cleaning

We need to do a lot of pre-processing before sending the reviews to the model. Let’s go step by step.

Handling NaN values

Let’s check for null values

#Creating a copy
process_reviews=raw_reviews.copy()

#Checking for null values
process_reviews.isnull().sum()

We have null values in reviewer names and review text. Reviewer names don’t add any value to the objective of the project (we have IDs instead), so let’s focus on review text. Dropping the rows wouldn’t be a problem, as there are only 7 null values, but instead I will impute them as ‘Missing’ and explore why those reviewers didn’t leave any text. Could it be related to the ratings?

#Filling NaN with 'Missing'
process_reviews['reviewText']=process_reviews['reviewText'].fillna('Missing')

Concatenating review text and summary

Let’s combine the review text and summary columns. Their sentiments should not contradict each other.

#A space keeps the last word of the text from fusing with the first word of the summary
process_reviews['reviews']=process_reviews['reviewText']+' '+process_reviews['summary']
process_reviews=process_reviews.drop(['reviewText', 'summary'], axis=1)
process_reviews.head()

Creating ‘sentiment’ column

This is an important pre-processing step: we derive the outcome column (sentiment of the review) from the overall score. If the score is greater than 3 we take it as positive, if it is less than 3 as negative, and if it is equal to 3 as neutral.

#Figuring out the distribution of categories
process_reviews['overall'].value_counts()
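The rule above can be written as a small row-wise function. This is only a minimal sketch: the name f matches the call that follows, but the label strings 'Positive', 'Neutral' and 'Negative' are my assumption, picked to line up with the class names used later for the confusion matrix.

```python
def f(row):
    """Map the 'overall' star rating of one review to a sentiment label."""
    # Label strings are assumptions; only the >3 / ==3 / <3 rule comes from the text.
    if row['overall'] == 3.0:
        return 'Neutral'
    if row['overall'] > 3.0:
        return 'Positive'
    return 'Negative'


# Quick check on plain dicts; a pandas row supports the same row['overall'] lookup.
labels = [f({'overall': score}) for score in (1.0, 3.0, 5.0)]
print(labels)  # ['Negative', 'Neutral', 'Positive']
```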

#Applying the above function in our new column
process_reviews['sentiment'] = process_reviews.apply(f, axis=1)
process_reviews.head()

The target feature is created as ‘sentiment’

#Checking the count of values
process_reviews['sentiment'].value_counts()

Handling time column

Here we have an unusual review time column that holds the date and the year together. Once we separate the two, we will split the date further into month and day.

Splitting the time column into day, month and year.
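A rough sketch of that split is shown here, assuming the raw reviewTime strings look like '02 28, 2014' (month, day, then year after the comma). The helper name split_review_time is hypothetical, and the layout should be checked against the actual column before relying on it.

```python
def split_review_time(review_time):
    """Split a string such as '02 28, 2014' into (day, month, year) integers."""
    # Assumed layout: 'MM DD, YYYY'. Adjust if the raw column differs.
    date_part, year = review_time.split(',')
    month, day = date_part.split()
    return int(day), int(month), int(year)


print(split_review_time('02 28, 2014'))  # (28, 2, 2014)

# With pandas this could be applied along the lines of:
# process_reviews[['day', 'month', 'year']] = process_reviews['reviewTime'].apply(
#     lambda s: pd.Series(split_review_time(s)))
```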

Finding the helpfulness of the review

In the main dataframe, the helpful feature holds values in a list format [a,b], meaning that a out of b people found the review helpful. In that format it adds little value to a machine learning model, because the model cannot easily interpret its meaning. So I create a helpful_rate feature that returns the a/b value from [a,b]. The following code block contains the complete processing step, with comments on what’s happening at each stage.
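A minimal sketch of that processing step could look like this. It assumes the helpful column arrives either as a string such as '[2, 3]' or as an actual list, rounds to two decimals, and treats reviews with no votes (b = 0) as 0.0; those last two choices are my assumptions, not something the dataset dictates.

```python
import ast


def helpful_rate(helpful):
    """Turn '[a, b]' (a of b readers found the review helpful) into a / b."""
    # The column may arrive as a string or as a list; handle both.
    a, b = ast.literal_eval(helpful) if isinstance(helpful, str) else helpful
    # Reviews nobody voted on have b == 0; mapping them to 0.0 is an assumption.
    return round(a / b, 2) if b else 0.0


print(helpful_rate('[2, 3]'))  # 0.67
print(helpful_rate('[0, 0]'))  # 0.0

# With pandas: process_reviews['helpful_rate'] = process_reviews['helpful'].apply(helpful_rate)
```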

We have successfully created the helpful_rate column through processing steps. Let’s look at the values

process_reviews['helpful_rate'].value_counts()

0.00 indicates that the review was not found helpful, and 1.00 indicates that it was found very helpful.

Review text-Punctuation Cleaning

Let’s begin our text processing by removing the punctuation.
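The cleaning function applied below is called review_cleaning. Here is one plausible version of it, as a sketch: besides punctuation it also lowercases the text and drops URLs and tokens containing digits, which are my assumptions about what a reasonable cleaner would do.

```python
import re
import string


def review_cleaning(text):
    """Lowercase the text and strip URLs, punctuation, and digit-bearing tokens."""
    text = str(text).lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)                # URLs
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', ' ', text)                            # tokens with digits
    return ' '.join(text.split())                                    # tidy whitespace


print(review_cleaning('Great guitar!!! See http://example.com, 10/10'))  # great guitar see
```

Note that apostrophes are removed as well, so a word like "it's" becomes "its".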

#Removing unnecessary columns
process_reviews=process_reviews.drop(['reviewerName','unixReviewTime'], axis=1)
#Creating a copy 
clean_reviews=process_reviews.copy()

process_reviews['reviews']=process_reviews['reviews'].apply(lambda x:review_cleaning(x))
process_reviews.head()

If you look at the reviews column, you can see that all punctuation has been removed.

Review text-Stop words

Coming to stop words, the general NLTK stop word list contains words like not, hasn’t, and wouldn’t, which actually convey a negative sentiment. If we remove them, the text will end up contradicting the target variable (sentiment). So I have curated a stop word list that contains no negative words or negative contractions. Read more about stop words in this article.
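To make the idea concrete, here is a tiny illustrative version of such a list. The words shown are only a sample I picked for the sketch, not the full curated list, and remove_stop_words is a hypothetical helper that mirrors the one-liner used below.

```python
# A short illustrative list; the curated list used in the analysis is longer.
stop_words = ['the', 'a', 'an', 'is', 'it', 'this', 'that', 'to', 'of', 'and',
              'for', 'in', 'on', 'with', 'my', 'i', 'was', 'are', 'be']

# Negations are left out of the list on purpose so they survive the filter.
assert 'not' not in stop_words


def remove_stop_words(text):
    """Drop every word that appears in stop_words, keep the rest in order."""
    return ' '.join(word for word in text.split() if word not in stop_words)


print(remove_stop_words('this is not a good guitar'))  # not good guitar
```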

#Removing stopwords
process_reviews['reviews'] = process_reviews['reviews'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
process_reviews.head()

If you look at the first record, the word ‘not’ has not been removed. Had we gone with the NLTK stop word list, which includes ‘not’, it would have been removed and the sentiment of the text would contradict its label. Since we curated our own list of stop words, we have bypassed the issue. For this reason, removing stop words wholesale is generally not advised in sentiment analysis.

Story Generation and Visualization from reviews

In this section we will do a complete exploratory data analysis on the texts as well as other factors, to understand which features contribute to the sentiment.

Prior analysis assumptions:

The higher the helpful rate, the more positive the sentiment
There will be many negative sentiment reviews in 2013 and 2014
There will be more reviews at the start of a month

We will verify these assumptions with our plots, and we will also do text analysis.

Sentiments vs Helpful rate

First, let’s look at whether there is any relationship between the sentiment of a review and its helpfulness.

pd.DataFrame(process_reviews.groupby('sentiment')['helpful_rate'].mean())

From the table we can see that the mean helpful rate is higher for negative reviews than for neutral and positive reviews. These mean values might have been influenced by the 0 values in helpful rates. Let’s check how they are distributed with a violin plot.

Note: I don’t wish to post all the code here and make this article lengthy. You can get all the code for these visualizations here.

Insights:

From the plot we can say that more of the positive reviews have a high helpful rate. We got deceived by the mean value; in such situations it’s better to look at a plot than to rely on a measure of central tendency. Our first assumption is correct!

Year vs Sentiment count

In this block we will see how many reviews were posted for each sentiment in each year from 2004 to 2014.

Insights:
From the plot we can clearly see the rise in positive reviews from 2010, reaching a peak around 2013, followed by a dip in 2014 when review counts dropped for every sentiment. Negative and neutral reviews are very low compared to the positive reviews. Our second assumption is wrong!

Day of month vs Reviews count

Let’s check if there is any relationship between review counts and the day of the month.

Insights:
The review counts are more or less uniformly distributed, without much variance between the days, but there is a huge drop at the end of the month. Our third assumption is wrong! Never trust your instincts unless you do EDA.

Creating few more features for text analysis

Now, let’s create polarity, review length and word count features.

Polarity: We use TextBlob to figure out the degree of sentiment. It lies in [-1,1], where -1 is negative and 1 is positive polarity.

Review length: The length of the review, counting every letter and space.

Word count: The number of words in the review.
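The two length features are simple enough to sketch in plain Python. The helper name text_features is hypothetical; polarity is left as a comment because it needs the textblob package.

```python
def text_features(review):
    """Return (review_len, word_count) for one review string."""
    review_len = len(review)           # every character, spaces included
    word_count = len(review.split())   # whitespace-separated tokens
    return review_len, word_count


print(text_features('not good guitar'))  # (15, 3)

# Polarity would come from TextBlob, roughly:
# from textblob import TextBlob
# polarity = TextBlob(review).sentiment.polarity
```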

Visualizations based on new features

N-gram analysis

Welcome to the deep text analysis. Here we will use n-grams to analyse the text based on its sentiment. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus. Read more in this article.
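As a small illustration of word n-grams, the sketch below counts the most frequent bigrams in a handful of made-up reviews. The function name top_ngrams and the sample sentences are hypothetical.

```python
from collections import Counter


def top_ngrams(texts, n=2, k=3):
    """Count word n-grams across texts and return the k most common."""
    counts = Counter()
    for text in texts:
        words = text.split()
        # zip over n shifted copies of the word list yields contiguous n-grams
        counts.update(' '.join(gram) for gram in zip(*(words[i:] for i in range(n))))
    return counts.most_common(k)


reviews = ['works great guitar', 'works great price', 'not good guitar']
print(top_ngrams(reviews, n=2, k=1))  # [('works great', 2)]
```

Running the same function separately on positive, neutral and negative reviews gives the per-sentiment comparison.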

Wordcloud-Positive reviews

Let’s look at the word clouds of positive, neutral and negative reviews, respectively.

Extracting Features from Cleaned reviews

Before we build the model for our sentiment analysis, we need to convert the review texts into vectors, as a computer cannot understand words and their sentiment. In this project, we are going to use the TF-IDF method to convert the texts.

Encoding target variable-sentiment

Let’s encode our target variable with Label encoder
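A minimal sketch with scikit-learn’s LabelEncoder, using the label strings assumed earlier. The encoder sorts the classes alphabetically, so Negative, Neutral and Positive become 0, 1 and 2.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(['Positive', 'Negative', 'Neutral', 'Positive'])

print(list(encoder.classes_))  # ['Negative', 'Neutral', 'Positive']
print(list(encoded))           # [2, 0, 1, 2]

# With pandas:
# process_reviews['sentiment'] = encoder.fit_transform(process_reviews['sentiment'])
```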

#Checking for counts
process_reviews['sentiment'].value_counts()

We have successfully encoded our target feature. But we can see that it is imbalanced.

Stemming the reviews

Stemming is a method of deriving the root word from an inflected word. Here we take the reviews and convert the words in them to their root words. For example:

Going->go
Finally->fina

If you notice, the root words don’t need to carry a semantic meaning. There is another technique known as lemmatization, which converts words into root words that do have a semantic meaning. Since lemmatization takes more time, I’m using stemming. Read more about stemming and lemmatization here.

#Extracting 'reviews' for processing
review_features=process_reviews.copy()
review_features=review_features[['reviews']].reset_index(drop=True)
review_features.head()

This is what a line looks like now. As a computer cannot understand words and their sentiment, we need to convert these words into numbers. To encode them we use TF-IDF.

TFIDF(Term Frequency — Inverse Document Frequency)

TF-IDF stands for “Term Frequency-Inverse Document Frequency”. It is a technique for quantifying a word in documents: we compute a weight for each word that signifies its importance in the document and in the corpus. It is widely used in information retrieval and text mining. More about TF-IDF in this article.

Here we split the text into bigrams (pairs of consecutive words) and consider their combined weight. We also keep only the top 5000 features from the reviews.
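A sketch of that step with scikit-learn’s TfidfVectorizer is shown below on a three-line toy corpus. The variable names and the toy sentences are mine; on the real data the corpus would be the stemmed reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['works great guitar', 'works great price', 'not good guitar']

# ngram_range=(2, 2) keeps bigrams only; max_features caps the vocabulary size.
tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(2, 2))
X = tfidf_vectorizer.fit_transform(corpus)

print(X.shape)                               # (3, 5): 3 reviews, 5 distinct bigrams
print(sorted(tfidf_vectorizer.vocabulary_))
```

The toy corpus only has 5 distinct bigrams, so the cap of 5000 is never reached here; on the full review set it limits the matrix to 5000 columns.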

Since we kept 5000 features, the shape confirms that we have 5000 columns.

#Getting the target variable(encoded)
y=process_reviews['sentiment']

Handling Imbalance target feature-SMOTE

In our target feature, we noticed far more positive sentiments than negative and neutral ones, so it is crucial to balance the classes. Here I use SMOTE (Synthetic Minority Oversampling Technique) to balance the dataset. It balances the class distribution by adding synthetic minority class examples rather than simply replicating existing ones.

SMOTE synthesizes new minority instances between existing minority instances. It generates the virtual training records by linear interpolation for the minority class. These synthetic training records are generated by randomly selecting one or more of the k-nearest neighbors for each example in the minority class. After the oversampling process, the data is reconstructed and several classification models can be applied for the processed data. Read more about SMOTE here.
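The interpolation idea can be shown in a few lines of plain Python. This is a toy sketch of a single synthetic point, not the imbalanced-learn implementation: the helper smote_point is hypothetical, and for brevity the neighbor is picked at random from the other minority points instead of from the k nearest ones.

```python
import random


def smote_point(x, neighbor, rng):
    """Create one synthetic sample on the segment between x and a neighbor."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(0)
minority = [[1.0, 1.0], [2.0, 2.0], [1.5, 3.0]]

# Real SMOTE restricts this choice to the k nearest minority neighbors.
x = minority[0]
neighbor = rng.choice(minority[1:])
synthetic = smote_point(x, neighbor, rng)
print(synthetic)

# On the real data, imbalanced-learn does the whole job, roughly:
# from imblearn.over_sampling import SMOTE
# X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
```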

Great, as you can see the resampled data has equally distributed classes

Train-test split(75:25)

Using train test split function we are splitting the dataset into 75:25 ratio for train and test set respectively.

from sklearn.model_selection import train_test_split

## Divide the dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.25, random_state=0)

Model Building: Sentiment Analysis

Now that we have successfully processed the text data, it is just a normal machine learning problem, where we predict the classes of the target feature from the sparse matrix.

Model selection

First we select the best performing model using cross-validation. Let’s consider a range of classification algorithms and run the model selection process.
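A sketch of such a comparison with cross_val_score follows. The three candidate models and the synthetic data from make_classification are stand-ins I chose for illustration; on the real problem the inputs would be X_train and y_train.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the TF-IDF features and encoded sentiment labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, n_classes=3, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'Bernoulli NB': BernoulliNB(),
}

# Mean 5-fold accuracy per model, best first.
cv_scores = {name: cross_val_score(model, X_demo, y_demo, cv=5,
                                   scoring='accuracy').mean()
             for name, model in models.items()}
for name, score in sorted(cv_scores.items(), key=lambda item: -item[1]):
    print(f'{name}: {score:.3f}')
```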

From the results, we can see that logistic regression outperformed the rest of the algorithms, and all the accuracies are above 80%. That’s great. So let’s go with logistic regression and tune its hyperparameters.

Logistic Regression with Hyperparameter tuning

We tune the regularization parameter and the penalty. Let’s see which values to plug in.
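One way to run such a search is GridSearchCV. The sketch below tunes only C on synthetic data; the grid bounds are my assumption (the C=10000 used later would sit at the top of this range), and a penalty entry could be added to the grid together with a solver that supports it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the resampled training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, n_classes=3, random_state=0)

# Five values of C spread evenly on a log scale from 1e-4 to 1e4.
param_grid = {'C': np.logspace(-4, 4, 5)}
search = GridSearchCV(LogisticRegression(max_iter=2000, random_state=0),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```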

With the selected parameters in hand, let’s plug and chug and see what accuracy we get.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(C=10000.0, random_state=0)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

We got 94% accuracy. That ain’t bad. But for classification problems we should also look at the confusion matrix and the f1 score rather than accuracy alone.

Classification metrics

Here we plot the confusion matrix with ROC and check our f1 score

Function for confusion matrix:

from sklearn import metrics

#Confusion matrix of true vs. predicted labels on the test set
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, classes=['Negative','Neutral','Positive'])

Check out the diagonal elements (2326+2195+1854): they are the correctly predicted records, and the rest were incorrectly classified by the algorithm.

Precision, recall & f1 score:

Since predicting positive, negative and neutral reviews are all important, we consider precision, recall and f1 score for every class. We got a pretty good f1 score, and it holds up across all the classes.

ROC-AUC curve

This is a very important curve, which we use to decide on a threshold based on the objective criteria. Since we are dealing with multiclass classification, we need to binarize the target feature and use a one-vs-all classifier. You can read more about binarizing in this article. Here we plot the ROC curve for each class, which helps us understand which class was classified better. We also plot the micro and macro averages on the ROC curve.
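The binarize-then-score step can be sketched as follows, leaving out the plotting. It runs on synthetic three-class data, and every variable name here is illustrative; the real version would use the train and test split from above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=5, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0)

# One binary column per class, so each class gets its own ROC curve.
y_te_bin = label_binarize(y_te, classes=[0, 1, 2])

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
y_score = clf.fit(X_tr, y_tr).decision_function(X_te)  # shape (n_samples, 3)

roc_auc = {}
for i in range(3):
    fpr, tpr, _ = roc_curve(y_te_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr, tpr)

# Micro-average: pool every (sample, class) decision into one curve.
fpr_micro, tpr_micro, _ = roc_curve(y_te_bin.ravel(), y_score.ravel())
roc_auc['micro'] = auc(fpr_micro, tpr_micro)
print(roc_auc)
```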

Insights:

Considering the ROC curves for the classes, classes 2 and 0 have been classified pretty well, as their area under the curve is high. We can choose any threshold between 0.6 and 0.8 to get a good balance of TPR and FPR.
Coming to the micro and macro averages, the micro average performs really well, while the macro average scores noticeably lower.
If you don’t understand what micro and macro averages are, just remember the following: ‘A macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally), whereas a micro-average will aggregate the contributions of all classes to compute the average metric. In a multi-class classification setup, micro-average is preferable if you suspect there might be class imbalance.’
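A tiny worked example makes the difference between the two averages visible. The counts below are made up for illustration and are not taken from our confusion matrix.

```python
# Hypothetical per-class counts for an imbalanced 3-class problem.
true_pos = {'Negative': 5, 'Neutral': 10, 'Positive': 90}
false_pos = {'Negative': 5, 'Neutral': 10, 'Positive': 10}

# Macro: compute precision per class, then take the plain mean.
per_class = {c: true_pos[c] / (true_pos[c] + false_pos[c]) for c in true_pos}
macro_precision = sum(per_class.values()) / len(per_class)

# Micro: pool the counts first, then compute a single precision.
total_tp = sum(true_pos.values())
total_fp = sum(false_pos.values())
micro_precision = total_tp / (total_tp + total_fp)

print(round(macro_precision, 3), round(micro_precision, 3))  # 0.633 0.808
```

The large Positive class dominates the micro average, while the macro average is pulled down by the two small classes.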

Conclusion

We have done a pretty neat job of classifying all the classes: splitting the sentiments based on overall score, cleaning the text, customizing the stop word list to our needs, and finally handling the imbalance with SMOTE. Here are a few insights from the notebook.

Consider using n-grams in sentiment analysis, as single words can’t give us proper results. Stop word lists have to be checked manually, as they contain negative words; it is advisable to avoid removing stop words blindly in sentiment analysis.
Most of our neutral reviews were actually critiques of the product from buyers, so Amazon can treat these as feedback and pass them on to sellers to help them improve their products.
Most of the reviews in this dataset were about string instruments such as guitars.
Balancing the dataset got me a very fruitful accuracy score. Without balancing, I got good precision but very bad recall, which in turn hurt my f1 score. So balancing the target feature is important.
In sentiment analysis we should concentrate on the f1 score, where we got an average of 94%, so we did a pretty good job.