Fake news classifier on US Election News📰 | LSTM 🈚

Ben Roshan
14 min read · Dec 11, 2020

Introduction

News media have become a channel for passing on information about what is happening in the world. People often take whatever the news conveys to be true, even though there have been cases where news channels themselves acknowledged that what they reported was inaccurate. Some stories have a significant impact not only on people and governments but also on the economy: a single piece of news can push the curves up or down depending on public emotion and the political situation. It is therefore important to separate fake news from real news. This problem can be tackled with Natural Language Processing tools, which help us identify fake or true news based on historical data. The news is now in safe hands!

Problem statement

The authenticity of information has become a longstanding issue affecting businesses and society, for both printed and digital media. On social networks, information spreads at such a fast pace and with such amplification that distorted, inaccurate or false information acquires a tremendous potential to cause real-world impact, within minutes, for millions of users. Recently, several public concerns about this problem were raised, along with some approaches to mitigate it. The sensationalism of not-so-accurate, eye-catching and intriguing headlines, aimed at retaining the attention of audiences in order to sell information, has persisted throughout the history of all kinds of information broadcast; social networking websites only amplify its reach and speed.

Objective

  1. Our main objective is to classify the news in the dataset as fake or true
  2. Perform extensive EDA on the news
  3. Select and build a powerful classification model

Import Libraries

Let’s import all the necessary libraries for the analysis and, along with that, bring in our dataset
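The exact import cell lives in the notebook; a minimal sketch of the libraries the rest of this walkthrough relies on would look like this (the tensorflow.keras paths are my assumption about where the Keras utilities come from):

#Data handling and text utilities
import re
import string
import numpy as np
import pandas as pd

#NLP helpers
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')

#Deep learning (Keras / TensorFlow)
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Bidirectional, Dense

#Evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn import metrics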

Importing the dataset

Let’s welcome our dataset and see what’s inside the box

#reading the fake and true datasets
fake_news = pd.read_csv('../input/fake-and-real-news-dataset/Fake.csv')
true_news = pd.read_csv('../input/fake-and-real-news-dataset/True.csv')

# print shape of fake dataset with rows and columns and information
print ("The shape of the data is (row, column):"+ str(fake_news.shape))
print (fake_news.info())
print("\n --------------------------------------- \n")

# print shape of true dataset with rows and columns and information
print ("The shape of the data is (row, column):"+ str(true_news.shape))
print (true_news.info())

Dataset Details

This dataset comes as two csv files, one containing fake news and the other true/real news, with roughly 23481 fake news articles and 21417 true news articles

Description of columns in the file:

  • title - contains the news headline
  • text - contains the news content/article
  • subject - type of news
  • date - date the news was published

Preprocessing and Cleaning

We have to perform certain pre-processing steps before doing EDA and feeding the data to the model. Let’s begin with creating the output column

Creating the target column

Let’s create the target column for both fake and true news. Here we denote the target value as ‘0’ in the case of fake news and ‘1’ in the case of true news

#Target variable for fake news
fake_news['output']=0

#Target variable for true news
true_news['output']=1

Concatenating title and text of news

News has to be classified based on the title and text jointly; treating the title and content separately doesn’t give us any benefit. So, let’s concatenate both columns in both datasets (adding a space between them so the last word of the title doesn’t merge with the first word of the article)

#Concatenating (with a space separator) and dropping for fake news
fake_news['news']=fake_news['title']+' '+fake_news['text']
fake_news=fake_news.drop(['title', 'text'], axis=1)

#Concatenating (with a space separator) and dropping for true news
true_news['news']=true_news['title']+' '+true_news['text']
true_news=true_news.drop(['title', 'text'], axis=1)

#Rearranging the columns
fake_news = fake_news[['subject', 'date', 'news','output']]
true_news = true_news[['subject', 'date', 'news','output']]

Converting the date columns to datetime format

We can use pd.to_datetime to convert our date columns to the format we desire. But there is a problem, especially in the fake_news date column. Let’s check value_counts() to see what lies inside

fake_news['date'].value_counts()

If you notice, we have links and news headlines inside the date column, which will give us trouble when converting to datetime format. So let’s remove those records from the column.

Only the fake news dataset had an issue with the date column. One way to drop the bad records is sketched below; after that we can proceed with converting the date column to datetime format
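This is a minimal sketch of the cleanup, assuming any value that fails to parse as a date is one of the stray links/headlines (the notebook may instead drop specific rows by index):

#Rows whose 'date' cannot be parsed are links/headlines, so discard them
bad_dates = pd.to_datetime(fake_news['date'], errors='coerce').isna()
fake_news = fake_news[~bad_dates]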

#Converting the date to datetime format
fake_news['date'] = pd.to_datetime(fake_news['date'])
true_news['date'] = pd.to_datetime(true_news['date'])

Appending two datasets

When we provide a dataset to the model, we have to provide it as a single file. So it’s better to append the true and fake news data and then continue with preprocessing and EDA on the combined set

frames = [fake_news, true_news]
news_dataset = pd.concat(frames)
news_dataset

Text Processing

This is an important phase of any text analysis application. There is a lot of unhelpful content in the news which becomes an obstacle when feeding it to a machine learning model; unless we remove it, the model doesn’t work efficiently. Let’s go step by step.

News-Punctuation Cleaning

Let’s begin our text processing by removing the punctuations

#Creating a copy
clean_news=news_dataset.copy()

def review_cleaning(text):
    '''Make text lowercase, remove text in square brackets, remove links, remove punctuation
    and remove words containing numbers.'''
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)                              #text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                #links
    text = re.sub(r'<.*?>+', '', text)                               #HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  #punctuation
    text = re.sub(r'\n', '', text)                                   #newlines
    text = re.sub(r'\w*\d\w*', '', text)                             #words containing numbers
    return text

clean_news['news']=clean_news['news'].apply(lambda x:review_cleaning(x))
clean_news.head()

We have removed all punctuation in our news column

News-Stop words

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We would not want these words to take up space in our database or to take up valuable processing time. For this, we can remove them easily by storing a list of words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python has lists of stopwords stored for 16 different languages. Source: Geeks for Geeks

For our project, we are considering the English stop words and removing those words

stop = stopwords.words('english')
clean_news['news'] = clean_news['news'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
clean_news.head()

We have removed all the stop words from the news column

Story Generation and Visualization from news

In this section we will perform exploratory data analysis on the news, such as n-gram analysis, and understand which words and contexts are most likely to be found in fake news.

Important note: Please check my kaggle notebook to find the coding part of the plots

Count of news subject

Let’s start by looking at the count of news types in our dataset
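The plots themselves are in the notebook; a simple countplot along these lines (seaborn is my assumption about the plotting library used) produces the same view:

import matplotlib.pyplot as plt
import seaborn as sns

#Count of articles per news subject
plt.figure(figsize=(10, 5))
sns.countplot(x='subject', data=clean_news)
#Adding hue='output' splits the same plot by fake/true for the next chart
plt.xticks(rotation=45)
plt.title('Count of news subject')
plt.show()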

Insights:

  • Our dataset has more political news than any other type, followed by world news
  • There are some repeated class names which express the same meaning, such as news, politics, government news etc., which are similar to one another

Count of news subject based on true or fake

Lets look at the count based on the fake/true outcome.

Insights:

  • Fake news is spread across all the categories except politics and world news
  • True news is present only in politics and world news, and its count there is high
  • THIS IS A HIGHLY BIASED DATASET, so we can expect a high accuracy which doesn’t signify a good model, considering the poor quality of the dataset

Count of fake news and true news

Let’s check the count of fake and true news and confirm whether our data is balanced or not

Insights:

  • We have pretty much balanced data
  • The count of fake news is higher than that of true news, but not to a great extent

Deriving new features from the news

Let’s extract more features from the news, such as the following (a sketch of how to derive them appears after the list):

  1. Polarity: a measure which signifies the sentiment of the news
  2. Review length: length of the news (number of letters and spaces)
  3. Word count: number of words in the news

Polarity, review length and word count
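A minimal sketch of how these three features could be derived, assuming TextBlob is used for the sentiment polarity (the notebook may use a different sentiment tool):

from textblob import TextBlob

#Polarity of each article, from -1 (negative) to +1 (positive)
clean_news['polarity'] = clean_news['news'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)

#Length of the news in characters and the number of words
clean_news['review_len'] = clean_news['news'].astype(str).apply(len)
clean_news['word_count'] = clean_news['news'].apply(lambda x: len(str(x).split()))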

Insights:

  • Most of the polarity values are neutral; the news is neither particularly negative nor particularly positive
  • The word count is between 0–1000, and the length of the news is mostly between 0–5000, with a few near 10000, which could be full articles

N-gram analysis

Top 20 words in News

Let’s look at the top 20 words from the news, which could give us a brief idea of which topics are popular in our dataset
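One simple way to get these counts (the notebook’s exact helper may differ) is to tally word frequencies over the cleaned corpus; the bigram and trigram views below follow the same idea using sklearn’s CountVectorizer with ngram_range=(2, 2) and (3, 3):

from collections import Counter

#Count word frequencies across the cleaned news and keep the top 20
all_words = ' '.join(clean_news['news']).split()
top_20 = Counter(all_words).most_common(20)
print(top_20)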

Insights:

  • All of the top 20 words are about the US government
  • In particular they are about Trump and the US, followed by Obama
  • We can understand that much of the news comes from Reuters.

Top 2-word phrases (bigrams) in the news

Now let’s expand our search to the top 2-word phrases from the news

Insights:

  • As feared, I think the model will be biased in its results considering the amount of Trump news
  • We can see the North Korea news as well; I think it concerns the dispute between the US and NK
  • There are also a few items from Fox News

Top 3-word phrases (trigrams) in the news

Now let’s expand our search to the top 3-word phrases from the news

Insights:

  • There is an important story which dominated US media: ‘Black Lives Matter’, following the death of Floyd. We can see that this news is covered in our data, and a lot of fake news revolved around the death.
  • The rest of the news is about US politics

WordCloud of Fake and True News

Let’s look at the word cloud for both fake and true news
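A quick sketch of the fake-news cloud using the wordcloud package (swap the filter to output == 1 for the true-news cloud); the notebook’s styling may differ:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

#Word cloud for fake news (output == 0)
fake_text = ' '.join(clean_news[clean_news['output'] == 0]['news'])
wc = WordCloud(width=800, height=400, background_color='white').generate(fake_text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()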

Fake news

True news

Insights:

  • True news doesn’t involve Trump as much; instead it focuses on the Republican Party and Russia
  • There is news about the budget and the military, which comes under government news

Time series analysis- Fake/True news

Let’s look at the timeline of true and fake news that were circulated in the media.
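A minimal sketch of that timeline, assuming a simple daily count per class (the notebook’s plot may be smoothed or styled differently):

import matplotlib.pyplot as plt

#Daily article counts for fake (0) and true (1) news
daily = news_dataset.groupby(['date', 'output']).size().unstack(fill_value=0)
daily.columns = ['Fake', 'True']
daily.plot(figsize=(12, 5), title='Fake vs True news over time')
plt.show()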

Insights:

  • True news gained dominance from Aug 2017 onwards, as it is seen at much higher rates. That is a good sign
  • There are a few outliers where true news was higher than fake news (Nov 9, 2016 and Apr 7, 2017)
  • Our dataset has more fake news than true news, and we don’t even have true news data for the whole of 2015, so fake news will be classified more accurately than true news

Stemming the news

Stemming is a method of deriving the root word from an inflected word. Here we take the news text and convert each word to its root form. For example,

  • Going->go
  • Studies->studi

If you notice, the root word doesn’t need to carry a semantic meaning. There is another technique known as lemmatization which converts words into root words that do have semantic meaning, but since it takes more time, I’m using stemming

#Extracting the news text for processing
news_features=clean_news.copy()
news_features=news_features[['news']].reset_index(drop=True)
news_features.head()

stop_words = set(stopwords.words("english"))

#Performing stemming on the news dataframe
ps = PorterStemmer()

#splitting and adding the stemmed words except stopwords
corpus = []
for i in range(0, len(news_features)):
    news = re.sub('[^a-zA-Z]', ' ', news_features['news'][i])  #keep letters only
    news = news.lower()
    news = news.split()
    news = [ps.stem(word) for word in news if not word in stop_words]
    news = ' '.join(news)
    corpus.append(news)

#Getting the target variable
y=clean_news['output']

Deep learning-LSTM

Please refer to this amazing article to know more about LSTM

Here in this part we use neural network to predict whether the given news is fake or not.

We aren’t going to use a normal neural network like an ANN for classification, but an LSTM (long short-term memory) network, which helps retain sequence information. Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. This is a behavior required in complex problem domains like machine translation, speech recognition, and more.

One hot for Embedding layers

Before jumping into creating a layer, let’s fix a vocabulary size. Why a vocabulary size? Because we will be one-hot encoding the sentences in the corpus for the embedding layer, and one-hot encoding maps each word in a sentence to an index within the vocabulary size. Let’s fix the vocabulary size to 10000

#Setting up vocabulary size
voc_size=10000

#One hot encoding
onehot_repr=[one_hot(words,voc_size)for words in corpus]

Padding embedded documents

All neural networks require inputs that have the same shape and size. However, when we pre-process the texts and use them as inputs for our LSTM model, not all sentences have the same length; naturally, some of them are longer or shorter. We need inputs of the same size, and this is where padding is necessary. Here we take the common length as 5000 and perform padding using the pad_sequences() function. We also use ‘pre’ padding, so zeros are added before the sentences to make them all equal in length

#Setting sentence length
sent_length=5000

#Padding the sentences
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

We can see that all the sentences are now of equal length (5000), with zeros added in front of the shorter ones

LSTM Model

At first we are going to develop the base model and compile it. The first layer is the embedding layer, which takes the vocabulary size, the embedding vector dimension and the sentence length as inputs. Next we add a 30% dropout layer to prevent overfitting, followed by an LSTM layer with 100 units. The final layer uses a sigmoid activation function. We then compile the model using the Adam optimizer and binary cross-entropy as the loss function, since we have only two output classes.
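A sketch of that base model as described above; the embedding dimension of 40 is my assumption, as the notebook fixes its own value:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

#Embedding feature dimension (assumed value; the notebook sets its own)
embedding_vector_features=40

model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(Dropout(0.3))                    #30% dropout to reduce overfitting
model.add(LSTM(100))                       #LSTM layer with 100 units
model.add(Dense(1,activation='sigmoid'))   #binary output: fake (0) or true (1)

model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())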

To understand how LSTM works please check this link. To give a small overview of how LSTM works: it remembers only the important parts of the word sequence and forgets the insignificant words which don’t add value to the prediction

Fitting the LSTM Model

Before fitting the model, let’s take the padded embedded documents as X, keep y as it is, and convert both into arrays.

# Converting the X and y as array
X_final=np.array(embedded_docs)
y_final=np.array(y)

#Check shape of X and y final
X_final.shape,y_final.shape

Let’s split our new X and y variable into train and test and proceed with fitting the model to the data. We have considered 10 epochs and 64 as batch size. It can be varied to get better results.

# Train test split of the X and y final
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=42)

# Fitting with 10 epochs and 64 batch size
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)

Last 4 epochs

Evaluation of model

Now, let’s predict the output for our test data and evaluate the predicted values with y_test. Check my kaggle notebook to find the function for confusion matrix
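The plotting helper lives in the notebook; a minimal stand-in (my own sketch of a plot_confusion_matrix function with the same signature) would be:

import itertools
import numpy as np
import matplotlib.pyplot as plt

#A minimal stand-in for the notebook's confusion matrix plotting helper
def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=None):
    cmap = cmap or plt.cm.Blues
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes)
    plt.yticks(tick_marks, classes)
    #Write the count into each cell of the matrix
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], 'd'), horizontalalignment='center')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()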

# Predicting from test data
# (on newer TensorFlow versions predict_classes is removed; use (model.predict(X_test) > 0.5).astype("int32") instead)
y_pred=model.predict_classes(X_test)

#Creating confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm,classes=['Fake','True'])

#Checking for accuracy
accuracy_score(y_test,y_pred)

We have got an accuracy of 96%. That’s awesome !

# Creating classification report 
print(classification_report(y_test,y_pred))

From the classification report we can see that the accuracy is around 96%. We should also concentrate on the precision score, and it is 96%, which is great.

Bidirectional LSTM

Bi-LSTM is an extension of the normal LSTM which runs two independent RNNs together. A normal LSTM is unidirectional, so it cannot see future words, whereas in a Bi-LSTM information is also passed backwards from a second RNN layer that reads the sequence in reverse.

There is only one change in the code compared to the LSTM: here we use the Bidirectional() wrapper and call LSTM inside it, as sketched below.
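A sketch of that single change, mirroring the assumed base model above:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Bidirectional, Dense

#Same architecture as before, with the LSTM wrapped in Bidirectional()
model1=Sequential()
model1.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model1.add(Dropout(0.3))
model1.add(Bidirectional(LSTM(100)))
model1.add(Dense(1,activation='sigmoid'))

model1.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model1.summary())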

Fitting and Evaluation of Model

Let’s now fit the bidirectional LSTM model to the data we have with the same parameters we had before

# Fitting the model
model1.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)
# Predicting from test dataset
y_pred1=model1.predict_classes(X_test)

#Confusion matrix
cm = metrics.confusion_matrix(y_test, y_pred1)
plot_confusion_matrix(cm,classes=['Fake','True'])
#Calculating Accuracy score
accuracy_score(y_test,y_pred1)

We have got an accuracy of 99%. That’s better than LSTM !

# Creating classification report 
print(classification_report(y_test,y_pred1))

From the classification report we can see that the accuracy is around 99%. The precision score, which we should concentrate on, is also 99%

Conclusion

We have done a fairly mainstream job of processing the data and building the model. We could have experimented with different n-grams while vectorizing the text data; here we looked at two-word phrases in the analysis. You can check Shreta’s work on the same dataset, where she got better results by considering both one- and two-word n-grams, and also far better results with the help of LSTM and Bi-LSTM networks. Let’s discuss the general insights from the dataset.

  • Most of the fake news revolves around election news and Trump. Considering the US elections of 2020, there are plenty of chances for fake news to spread, and applying this kind of technology will be heavily required.
  • Fake news is currently rooted in the pandemic situation, used to play politics, scare people and push them into buying goods
  • Most of the news comes from Reuters. We don’t know whether this news outlet is politically influenced, so we should always consider the source of a story when judging whether it is fake or true.

You can check Josué Nascimento’s work, where he has explained why this dataset is biased

You can also check out my other articles here.
