Anime Recommendation Engine | Content & Collaborative Filtering

Published in

Analytics Vidhya

10 min read · Jul 26, 2020

Introduction

This article follows a boy who stumbled upon this amazing dataset by accident. With the power of machine learning, he explores the dataset in search of an answer to ‘what to watch next?’ using content-based and collaborative filtering.

Problem statement

Every piece of streaming content has its own viewers and its own rating. Viewers leave good ratings for content they like. But where is that put to use? Viewers can spend hours scrolling through hundreds, sometimes thousands, of anime without ever finding something they like. Businesses need to provide suggestions based on viewers’ likes and needs in order to create a better streaming environment that boosts revenue and increases the time spent on the website.

What is a recommendation engine?

It is an unsupervised learning algorithm (one that does not have a target variable to measure accuracy against) mostly used to aid consumer decision making. I’m sure you have seen them while shopping online. They also appear in streaming apps (such as Netflix and Hulu) to help you select a TV show or movie to watch next, and on journalism/media websites like Medium to suggest other articles you may like to read, among many other uses. Many e-retailers like Amazon have been using recommender algorithms for quite some time, but many smaller or newer sites still need them. There are different varieties of recommenders that base their predictions on different features. Read more at Analytics Vidhya

About dataset

This dataset contains user preference data from 73,516 users on 12,294 anime. Each user can add anime to their completed list and give it a rating, and this dataset is a compilation of those ratings. Credits: Anime Recommendations Database

Objectives of the project

Meet and greet data - Konnichiwa (こんにちは)
Analyze the data - Byakugan (白眼)
Preparing data for consumption - Sonaeru (備える)
Recommendation building phase - Tsukuru (作る)

1. Meet and Greet data - Konnichiwa (こんにちは)

Konnichiwa (こんにちは or in kanji 今日は) is a Japanese greeting, typically used around mid-day. It is also used as an informal “hello”. Here our guests are the data, so let’s welcome our dataset guests!

“I’m not gonna run away, I never go back on my word! That’s my nindo: my ninja way.” — Naruto Uzumaki

Dataset Details

1) anime_data:

anime_id — myanimelist.net’s unique id identifying an anime.
name — full name of anime.
genre — comma separated list of genres for this anime.
type — movie, TV, OVA, etc.
episodes — how many episodes in this show. (1 if movie).
rating — average rating out of 10 for this anime.
members — number of community members that are in this anime’s “group”.

2) rating_data:

user_id — non identifiable randomly generated user id.
anime_id — the anime that this user has rated.
rating — rating out of 10 this user has assigned (-1 if the user watched it but didn’t assign a rating)

Merging dataframes — Fusion(融合)

In this section we fuse our CSV guests to make the recommendation engine more powerful. FUSION HA!

We have successfully merged (fused) the CSVs into one more powerful dataframe. Let’s get into action.
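For readers who want to follow along, the fusion step can be sketched roughly like this. It is a minimal sketch on toy in-memory frames rather than the real anime.csv and rating.csv; the sample values and the names avg_rating and anime_fulldata are assumptions made for illustration, while anime_title and user_rating follow the column names used later in this article.

```python
import pandas as pd

# Toy stand-ins for anime.csv and rating.csv (all values are made up for illustration).
anime_data = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["Anime A", "Anime B", "Anime C"],
    "genre": ["Action, Comedy", "Mystery, Thriller", "Action, Adventure"],
    "rating": [8.0, 7.5, 9.0],
    "members": [1000, 700, 500],
})
rating_data = pd.DataFrame({
    "user_id": [10, 10, 11, 12],
    "anime_id": [1, 2, 1, 3],
    "rating": [9, -1, 10, 8],
})

# Both files carry a "rating" column, so rename them first to keep them apart.
# "avg_rating" is a hypothetical name chosen for this sketch.
anime_data = anime_data.rename(columns={"name": "anime_title", "rating": "avg_rating"})
rating_data = rating_data.rename(columns={"rating": "user_rating"})

# Inner join on anime_id: one row per (user, anime) pair with the anime metadata attached.
anime_fulldata = pd.merge(anime_data, rating_data, on="anime_id", how="inner")
print(anime_fulldata[["user_id", "anime_title", "user_rating"]])
```

An inner join keeps only anime that appear in both files, which is usually what we want before counting ratings.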

“I want to see and understand the world outside. I don’t want to die inside these walls without knowing what’s out there.” — Eren Jaeger ( Attack on Titan | Shingeki no Kyojin)

2. Analyze the data - Byakugan (白眼)

In order to build a recommendation engine, we have to understand our dataset. So, let’s take an overview of the dataset. BYAKUGAAN!!

Top 10 Anime based on rating counts

I’m sensing the top anime based on the number of ratings given by users. Let’s see who takes the throne.

Results:

Death Note wears the crown for rating count, followed by Sword Art Online and Attack on Titan.
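A ranking like this can be computed with a simple groupby. The snippet below is only a sketch on made-up rows (the titles and ratings are placeholders, not the real results), assuming the merged frame has anime_title and user_rating columns.

```python
import pandas as pd

# Made-up ratings; in the article this would be the merged dataframe.
ratings = pd.DataFrame({
    "anime_title": ["Anime A", "Anime A", "Anime A", "Anime B", "Anime B", "Anime C"],
    "user_rating": [9, 10, 8, 7, 8, 9],
})

# Count how many ratings each title received and keep the 10 most rated.
top_by_count = (
    ratings.groupby("anime_title")["user_rating"]
    .count()
    .sort_values(ascending=False)
    .head(10)
)
print(top_by_count)
```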

“Look around you, and all you will see are people
the world would be better off without.”-Light Yagami (Death Note)

Top 10 Anime based on Community size

I’m now sensing the top anime based on their community size (member count). Let’s see who takes the throne.

Results:

Death Note captures the crown again. “I want to tell you I’M L”

“There Is No Heaven Or Hell. No Matter What You Do While You’re Alive,
Everybody Goes To The Same Place Once You Die. Death Is Equal.”-L (Death Note)

Distribution of ratings

I’ll now sense the distribution of ratings in both datasets. I believe the rating in anime.csv comes from review websites, while user_rating in rating.csv comes from the individual users.

Insights:

Most of the ratings are spread between 6–10
The mode of the distribution is around 7.5–8.0
Both distributions are left-skewed
A rating of -1 shows up as an outlier in the user ratings (it marks anime the user watched but did not rate), and it can be converted to NaN
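Turning the -1 placeholder into NaN takes a single line in pandas. This is a small sketch on toy data, assuming the user ratings live in a column called user_rating.

```python
import numpy as np
import pandas as pd

# Toy ratings: -1 means "watched but not rated".
ratings = pd.DataFrame({"user_id": [1, 1, 2], "user_rating": [8, -1, 10]})

# Replace the -1 placeholder with NaN so it no longer distorts the distribution.
ratings["user_rating"] = ratings["user_rating"].replace(-1, np.nan)
print(ratings)
```

Once the placeholder is NaN, pandas statistics such as mean() skip it automatically.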

Medium of streaming

Byakugan! I’m now seeing where these powerful anime come from.

Insights:

67.6% of the anime aired on TV, followed by 13.5% released as movies
10.2% of the anime were released as OVA, which is greater than ONA (1.18%)
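Percentages like these can be obtained with value_counts. The sketch below uses a handful of made-up rows, so its numbers are not the ones reported above.

```python
import pandas as pd

# Made-up "type" column; the real one comes from anime.csv.
anime = pd.DataFrame({"type": ["TV", "TV", "TV", "Movie", "OVA"]})

# normalize=True returns fractions, which we scale to percentages.
type_share = anime["type"].value_counts(normalize=True).mul(100).round(1)
print(type_share)
```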

Genre Word Cloud

Look up, witness the genre cloud!

We can sense that there are many Comedy anime in our dataset, followed by Action, Romance and Drama.

“I’ve set myself to become the King of the Pirates…and if I die trying…then at least I tried!”-Monkey D. Luffy (One Piece)

3. Preparing data for consumption - Sonaeru (備える)

Before handing our data guests to the recommendation engine, we have to fine-tune them, sculpt them and train them to face the boss!

a) Handling NaN values

First we have to take care of the NaN values. Since this engine revolves around ratings, a user who hasn’t given a rating adds no value to it. So let’s drop and crush those NaN values.
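Dropping the unrated rows might look like the following. It is a sketch on toy data, assuming the ratings column is named user_rating and the -1 placeholders have already been turned into NaN.

```python
import numpy as np
import pandas as pd

# Toy merged data with two missing user ratings.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "anime_title": ["Anime A", "Anime B", "Anime A", "Anime C"],
    "user_rating": [8, np.nan, 10, np.nan],
})

# Keep only rows where the user actually gave a rating.
clean = ratings.dropna(subset=["user_rating"])
print(clean)
```

Using subset= limits the check to the rating column, so rows are not thrown away because of a gap in some unrelated column.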

b) Filtering user_id

Let’s check the number of ratings per user id and filter based on it.

There are users who have rated only once; even if they rated that one anime 5, it can’t be considered a valuable record for recommendation. So I have taken a minimum of 200 ratings per user as the threshold. You can play around with the threshold to get better results, but this worked fine.
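The threshold filter can be sketched as below. The constant MIN_RATINGS is a hypothetical name, and it is set to 2 here only so that the toy data survives; the article uses 200.

```python
import pandas as pd

# Toy ratings: user 1 rated three anime, user 2 one, user 3 two.
ratings = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "anime_title": ["Anime A", "Anime B", "Anime C", "Anime A", "Anime B", "Anime C"],
    "user_rating": [9, 8, 7, 5, 10, 6],
})

MIN_RATINGS = 2  # the article uses 200; lowered for this toy example

# Count ratings per user and keep only the users at or above the threshold.
counts = ratings["user_id"].value_counts()
active_users = counts[counts >= MIN_RATINGS].index
filtered = ratings[ratings["user_id"].isin(active_users)]
print(filtered)
```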

c) Pivot Dojo

This pivot table has anime titles as rows and user ids as columns. It helps us create the sparse matrix that is used for finding the cosine similarity! Don’t know what cosine similarity is? Don’t worry, we will reveal that in the next section.
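Here is a rough sketch of the pivot dojo on toy data. The names anime_pivot and anime_matrix are assumptions for illustration, and filling missing ratings with 0 is one common choice rather than the only one.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy ratings for three anime and three users.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "anime_title": ["Anime A", "Anime B", "Anime A", "Anime C", "Anime B"],
    "user_rating": [9, 7, 8, 6, 10],
})

# Rows are titles, columns are user ids; unrated cells become 0.
anime_pivot = ratings.pivot_table(
    index="anime_title", columns="user_id", values="user_rating"
).fillna(0)

# Most cells are 0, so a compressed sparse row matrix stores it efficiently.
anime_matrix = csr_matrix(anime_pivot.values)
print(anime_pivot)
print(anime_matrix.shape, anime_matrix.nnz)
```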

“To defeat evil, I shall become an even GREATER evil.” — Lelouch Lamperouge (Code Geass)

4. Recommendation building phase - Tsukuru (作る)

Collaborative Filtering

Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users. It works by searching a large group of people and finding a smaller set of users with tastes similar to a particular user. Learn more from Analytics Vidhya

Cosine Similarity using KNN

Cosine similarity is a metric used to measure how similar two documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (due to the size of the documents), they may still be oriented close together. The smaller the angle, the higher the cosine similarity. Learn more about similarity metrics from Analytics Vidhya
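In formula form, the cosine similarity of two vectors A and B is (A · B) / (‖A‖ ‖B‖). A tiny NumPy sketch shows the "irrespective of size" part: the vectors below are made up, and the helper name cosine_similarity is just for this example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]  # same direction as short_doc, ten times longer
other_doc = [0, 0, 5]   # shares no terms with short_doc

same_direction = cosine_similarity(short_doc, long_doc)
orthogonal = cosine_similarity(short_doc, other_doc)
print(same_direction, orthogonal)
```

The first pair is far apart by Euclidean distance yet points the same way, so its similarity is essentially 1; the second pair shares nothing, so its similarity is 0.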

We have fitted the sparse matrix; let’s pick a random anime title and find recommendations for it.

Here we return the distances and indices of the 6 nearest neighbours found by KNN for the randomly chosen index (anime_title). Those neighbours will be our recommended anime.
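The fit-and-query step could look roughly like this with scikit-learn’s NearestNeighbors. The pivot values are made up, and the names model_knn and query_index are assumptions for this sketch. Note that the closest neighbour is normally the query title itself, so asking for 6 neighbours yields 5 actual recommendations.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Toy title-by-user pivot table (rows: titles, columns: user ids, 0 = not rated).
anime_pivot = pd.DataFrame(
    [[9, 8, 0, 7],
     [8, 9, 0, 6],
     [0, 0, 10, 0],
     [7, 0, 9, 8]],
    index=["Anime A", "Anime B", "Anime C", "Anime D"],
    columns=[1, 2, 3, 4],
)
anime_matrix = csr_matrix(anime_pivot.values)

# Brute-force search with cosine distance (1 - cosine similarity).
model_knn = NearestNeighbors(metric="cosine", algorithm="brute")
model_knn.fit(anime_matrix)

# Pick a random title and ask for its nearest neighbours.
rng = np.random.default_rng(0)
query_index = int(rng.integers(len(anime_pivot)))
n_neighbors = min(6, len(anime_pivot))  # the article asks for 6
distances, indices = model_knn.kneighbors(
    anime_pivot.iloc[[query_index]].values, n_neighbors=n_neighbors
)

print(f"Recommendations for {anime_pivot.index[query_index]}:")
# Skip the first hit, which is the query title itself.
for dist, idx in zip(distances.ravel()[1:], indices.ravel()[1:]):
    print(f"  {anime_pivot.index[idx]} (cosine distance {dist:.3f})")
```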

Testing collaborative recommendation

As we can see, these are the recommended anime for the randomly chosen title. But this code doesn’t give us much flexibility to choose. I advise you to check out Indra Lin’s notebook, where he created an awesome function for this collaborative filtering. I used my sharingan eyes to capture two code sections from him. LOL

“Human beings are strong because we have the ability to change ourselves.” -Saitama (One Punch Man)

Content based filtering

Content-based filtering, also referred to as cognitive filtering, recommends items based on a comparison between the content of the items and a user profile. The content of each item is represented as a set of descriptors or terms, typically the words that occur in a document. A content-based recommender works with data that the user provides, either explicitly (rating) or implicitly (clicking on a link). Based on that data, a user profile is generated, which is then used to make suggestions to the user. As the user provides more inputs or takes actions on the recommendations, the engine becomes more and more accurate. Learn more about content-based filtering in Analytics Vidhya

a) Cleaning anime_title

We found many symbols in anime_title. Let’s remove them with a cleaning function.

We have got the titles cleaned and neat. Now it’s time for the ultimate TF-IDF to recommend us the next anime.
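A cleaning function along these lines would do the job. This sketch assumes the stray symbols are HTML entities such as &quot;, &#039; and &amp;; the function name clean_title and the sample titles are hypothetical.

```python
import html
import re

def clean_title(title: str) -> str:
    """Decode HTML entities and tidy up a raw anime title."""
    text = html.unescape(title)        # &quot; -> ", &#039; -> ', &amp; -> &
    text = text.replace('"', "")       # drop leftover double quotes
    text = text.replace("&", "and")    # spell out ampersands
    text = re.sub(r"\s+", " ", text)   # collapse repeated whitespace
    return text.strip()

print(clean_title("&quot;Some Title&quot; Movie"))
print(clean_title("Rock &amp;  Roll"))
print(clean_title("Title&#039;"))
```

With pandas, the function can be applied to the whole column with anime["anime_title"].apply(clean_title).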

b) Term Frequency (TF) and Inverse Document Frequency (IDF)

TF is simply the frequency of a word in a document. IDF is the inverse of the document frequency across the whole corpus of documents. TF-IDF is used mainly for the following reason: suppose we search for “the rise of analytics” on Google. It is certain that “the” will occur more frequently than “analytics”, but the relative importance of “analytics” is higher from the search query’s point of view. In such cases, TF-IDF weighting negates the effect of high-frequency words in determining the importance of an item (document). Learn more about TFIDF in Analytics Vidhya

Here we are going to use it on the genre column so that we can recommend anime to users based on genre content.

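Scikit-learn already provides pairwise metrics (a.k.a. kernels in machine learning parlance) that work for both dense and sparse representations of vector collections. Here we want a score for every pair of anime that tells us how strongly one should be recommended alongside the other, so we use the sigmoid kernel. Note that scikit-learn’s sigmoid_kernel returns the tanh of a scaled dot product, so the scores are continuous values used for ranking rather than hard 1/0 labels.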
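Put together, the TF-IDF and kernel steps might be sketched as follows. The data is made up, the default TfidfVectorizer settings are an assumption (the original settings are not shown here), and the names genre_matrix, sig and indices are chosen for this example.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Toy anime table with comma-separated genres.
anime = pd.DataFrame({
    "anime_title": ["Anime A", "Anime B", "Anime C"],
    "genre": ["Action, Adventure", "Action, Comedy", "Romance, Drama"],
})

# Turn each genre string into a TF-IDF weighted vector (missing genres become "").
tfidf = TfidfVectorizer()
genre_matrix = tfidf.fit_transform(anime["genre"].fillna(""))

# Pairwise scores between all anime: tanh(gamma * <x, y> + coef0).
sig = sigmoid_kernel(genre_matrix, genre_matrix)

# Lookup from title to row position in the score matrix.
indices = pd.Series(anime.index, index=anime["anime_title"]).drop_duplicates()
print(sig.round(3))
```

Anime A and Anime B share the Action genre, so their score comes out higher than the score between Anime A and Anime C, which share nothing.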

We have got the indices for the anime titles; now let’s figure out the recommended anime.

“You should enjoy the little detours to the fullest. Because that’s where you’ll find the things more important than what you want.” — Ging Freecss (HUNTER X HUNTER)

c) Content based Recommendation function

Here we create the function that returns the recommendations for an anime. We turn the similarity scores into a list of (index, score) pairs using the enumerate function, sort the list by score and select the top 10 scores for recommendation.
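The steps above can be sketched as a small function. This is one possible version on toy data, not necessarily the exact function from the notebook; the name recommend_by_genre and the top_n parameter are assumptions, and the queried title is left out of its own results.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Toy anime table (made-up titles and genres).
anime = pd.DataFrame({
    "anime_title": ["Anime A", "Anime B", "Anime C", "Anime D"],
    "genre": ["Action, Adventure", "Action, Adventure, Comedy",
              "Romance, Drama", "Action, Comedy"],
})

genre_matrix = TfidfVectorizer().fit_transform(anime["genre"])
sig = sigmoid_kernel(genre_matrix, genre_matrix)
indices = pd.Series(anime.index, index=anime["anime_title"])

def recommend_by_genre(title, top_n=10):
    """Return the top_n anime whose genres score highest against `title`."""
    idx = indices[title]
    # Pair every row position with its score against the chosen title.
    scores = list(enumerate(sig[idx]))
    # Highest score first.
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    # Drop the title itself, then keep the best top_n.
    scores = [pair for pair in scores if pair[0] != idx][:top_n]
    anime_rows = [i for i, _ in scores]
    return anime.loc[anime_rows, ["anime_title", "genre"]].reset_index(drop=True)

recs = recommend_by_genre("Anime A")
print(recs)
```

On this toy table the title sharing both genres with Anime A ranks first and the one sharing none ranks last.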