Fake News Detector

Raditya Fahritama, MPS
7 min read · Feb 19, 2023


Image source: https://pyxis.nymag.com/v1/imgs/689/d1d/8e79df9f90f987efcc0992a8f591dbd65e-15-fake-news.2x.h473.w710.jpg

The Internet is an incredible resource for news and information, but unfortunately not everything online is trustworthy. Fake news is any article or video containing untrue information disguised as a credible news source. While fake news is not unique to the Internet, it has recently become a big problem in today’s digital world.

The problem starts when fake news goes viral and unsettles readers with its content. Fake news is generally designed to shock and fool people, and readers often re-share it without checking whether it is real. Given that, we want to know which news is real and which is fake, and what distinguishes the two.

The data used to train our classifier comes from Kaggle. It contains 5 columns and 20,800 rows: the title of the news article, the author, the article text itself, and a label that marks the article as reliable or unreliable (0 for reliable, 1 for unreliable). There are 10,387 reliable articles and 10,413 unreliable ones. The total size of this training dataset is 99 MB.
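These figures are easy to verify once the CSV is downloaded. A minimal sanity check might look like this (the file path here is illustrative):

import pandas as pd

# Quick check of the dataset shape and class balance
# (the path is illustrative; point it to wherever train.csv is saved)
data = pd.read_csv('train.csv')
print(data.shape)                    # (number of rows, number of columns)
print(data['label'].value_counts())  # 0 = reliable, 1 = unreliable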

Here are the 20 words with the highest importance, computed from the TF-IDF vectors of 175 sample texts from the dataset.

Word Relevance Rank in the news text.

From this chart, we can see that the articles in the dataset are mainly about political issues in the United States.
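As a rough sketch, a ranking like this can be produced by fitting a TF-IDF vectorizer on a sample of the articles and averaging the weights per word. This assumes the dataset is already loaded into a dataframe df with a text column, as done later in the article; the exact code used for the chart above may differ.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumes `df` is the dataframe loaded later in the article, with a 'text' column
sample_texts = df['text'].dropna().sample(175, random_state=7).astype('U')
vec = TfidfVectorizer(stop_words='english')
tfidf = vec.fit_transform(sample_texts)

# Average TF-IDF weight of each word over the sample, then take the top 20
mean_weights = np.asarray(tfidf.mean(axis=0)).ravel()
top20 = pd.Series(mean_weights, index=vec.get_feature_names_out()).nlargest(20)
top20.sort_values().plot(kind='barh', title='Top 20 words by mean TF-IDF weight')
plt.show()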

Here are some snapshots of the n-grams in the dataset.

Frequent Unigram words of the Dataset
Frequent Bigram Words of the Dataset
Frequent Trigram words of the Dataset

From the n-grams, we can clearly see that these posts are mainly about political news in the United States.
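For reference, here is a minimal sketch of how frequency counts like these can be computed with scikit-learn's CountVectorizer (again assuming the dataframe df with a text column is already loaded; the original charts may have been produced differently).

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, n, k=20):
    # Count n-grams of a single size n across all texts and return the k most frequent
    cv = CountVectorizer(ngram_range=(n, n), stop_words='english')
    counts = cv.fit_transform(texts)
    freqs = np.asarray(counts.sum(axis=0)).ravel()
    return pd.Series(freqs, index=cv.get_feature_names_out()).nlargest(k)

texts = df['text'].dropna().astype('U')
print(top_ngrams(texts, 1))  # frequent unigrams
print(top_ngrams(texts, 2))  # frequent bigrams
print(top_ngrams(texts, 3))  # frequent trigrams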

We start the modeling part by importing libraries that will be needed for the model.

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import re
from sklearn.linear_model import PassiveAggressiveClassifier
import string
import urllib.parse
from newspaper import Article
from wordcloud import WordCloud

After that, we load the dataset into a dataframe. We also drop the rows that have null values and reset the index of the dataframe.

data=pd.read_csv(r'E:\Downloads\Kaggle Datasets\fake-news\train.csv')
data=data.dropna()
#since null value rows are dropped indexes needs to be reset
df=data.copy()
df.reset_index(inplace=True)

We also convert the labels to ‘Real’ and ‘Fake’. This is just to improve readability and make the labels easier to understand.

conversion_dict = {0: 'Real', 1: 'Fake'}
df['label'] = df['label'].replace(conversion_dict)
df

We can see that the dataframe looks clean. We don’t need to worry about the other columns because we only need the text and label columns.

After that, we move on to the data preprocessing step. In this step we do a couple of things: remove symbols with a regular expression, convert the text to lowercase to prevent ambiguity between terms, and split the text into a list of terms for the stemming process. Then we apply stemming. Stemming is the process of reducing morphological variants of a word to their root/base form. Stemming programs are commonly referred to as stemming algorithms or stemmers.

ps = PorterStemmer()

def stemming(corpus):
    # Keep only alphabetic characters (lowercase and uppercase);
    # numbers and punctuation are replaced by whitespace
    stemmed_corpus = re.sub('[^a-zA-Z]',' ',corpus)

    # Converting all letters to lowercase
    stemmed_corpus = stemmed_corpus.lower()

    # Splitting the text into a list of words
    stemmed_corpus = stemmed_corpus.split()

    # Applying stemming, so we get the root words wherever possible
    # (stopword removal could also be enabled here)
    stemmed_corpus = [ps.stem(word) for word in stemmed_corpus] #if not word in stopwords.words('english')]

    # Join all the words in the final content
    stemmed_corpus = ' '.join(stemmed_corpus)
    return stemmed_corpus

We now apply the stemming function to the texts that we have.

df['text'] = df['text'].apply(stemming)
df['text']

We can see that the texts become somewhat unreadable because of the stemming process. We now move on to the data splitting and TF-IDF step. The preprocessed texts are fed to a vectorizer, which converts the list of texts into an array of TF-IDF vectors. TF-IDF is a statistical measure that assesses a word’s relevance to a document within a collection of documents.

A word’s frequency in a document and its inverse document frequency over a group of documents are multiplied in order to achieve this.

TF-IDF for a word in a document is calculated by multiplying two different metrics:

  • Term frequency: the number of times a word appears in a document. The simplest way to determine this is to count the raw occurrences of the word; the count can also be adjusted by the length of the document or by the frequency of the most frequent word in the document.
  • Inverse document frequency: how common or rare the word is across the whole group of documents. The closer the value is to 0, the more common the word. It is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm.
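To make the two factors concrete, here is a toy, hand-rolled version of the calculation. This is for illustration only; scikit-learn’s TfidfVectorizer, used below, applies a smoothed IDF and normalizes the resulting vectors.

import math

docs = [
    "the president signed the bill",
    "the senate debated the bill",
    "breaking news about the election",
]

def tf(word, doc):
    # term frequency: occurrences of the word divided by the document length
    words = doc.split()
    return words.count(word) / len(words)

def idf(word, docs):
    # inverse document frequency: log of (total docs / docs containing the word)
    containing = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / containing)

# TF-IDF of "bill" in the first document: (1/5) * log(3/2) ≈ 0.08
print(tf("bill", docs[0]) * idf("bill", docs))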

Then, we split the vectors into train and test sets using train_test_split. The train vectors are used to teach the machine what counts as fake news and what counts as real news based on the labels. The test vectors are then used for prediction: the machine predicts a label for each test vector and compares it to the true label. Accuracy is measured by how many of these predictions it gets right.

x_train,x_test,y_train,y_test=train_test_split(df['text'], df['label'], test_size=0.20, random_state=7, shuffle=True)
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.75)
vec_train=tfidf_vectorizer.fit_transform(x_train.values.astype('U'))
vec_test=tfidf_vectorizer.transform(x_test.values.astype('U'))

Now, we come to the training process. For this model, I use the Passive Aggressive Classifier. Passive-aggressive algorithms resemble the Perceptron in that they do not require a learning rate; they do, however, include a regularization parameter. Passive-aggressive algorithms are called so because:

Passive: If the prediction is correct, keep the model and do not make any changes. i.e., the data in the example is not enough to cause any changes in the model.

Aggressive: If the prediction is incorrect, make changes to the model. i.e., some change to the model may correct it.

It belongs to the small family of “online-learning algorithms.” As opposed to batch learning, where the full training dataset is used at once, online machine learning techniques consume input data sequentially and update the model one step at a time.
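The model below is trained with a single fit call on the full training matrix, but the online-learning property means the same classifier could also be updated incrementally via partial_fit. A rough sketch (the batch size here is an arbitrary choice):

from sklearn.linear_model import PassiveAggressiveClassifier

# Sketch only: feed the training vectors to the classifier in chunks
online_pac = PassiveAggressiveClassifier()
classes = ['Real', 'Fake']          # all possible labels must be declared up front
batch_size = 1000
for start in range(0, vec_train.shape[0], batch_size):
    X_batch = vec_train[start:start + batch_size]
    y_batch = y_train.iloc[start:start + batch_size]
    online_pac.partial_fit(X_batch, y_batch, classes=classes)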

pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(vec_train,y_train)
y_predpac=pac.predict(vec_test)
scorepac=accuracy_score(y_test,y_predpac)
print(f'Passive Aggressive Accuracy: {round(scorepac*100,2)}%')

We can see that we achieve ~97% accuracy with this algorithm. So far, the model looks good. We now create a function for practical predictions.

def findlabelpac(newtext):
    vec_newtestpac=tfidf_vectorizer.transform([newtext])
    y_pred1pac=pac.predict(vec_newtestpac)
    return y_pred1pac[0]

We first test the model on an article from palmerreport.com, a site that Wikipedia lists as a fake news source. We take the URL of one of its articles and scrape the article text from it.

url = "https://www.palmerreport.com/analysis/so-much-for-matthew-mcconaughey-2/42801/"
url = urllib.parse.unquote(url)
article = Article(str(url))
article.download()
article.parse()
article.nlp()
title = article.title
news = article.text
keywords = article.keywords

We then do the same preprocessing steps before we feed the text into our prediction function.

news = stemming(news)
news

After the stemming process, we apply the prediction function to the text.

labeltestpac = findlabelpac(news)
print("Passive Aggressive Prediction:",labeltestpac)

We can see that the model detects this article as Fake. To avoid a biased conclusion, we now test the model on an article from The New York Times, following all the same steps as before.

url = "https://www.nytimes.com/2023/02/19/us/chris-hinds-wheelchair-denver-city-council.html"
url = urllib.parse.unquote(url)
article = Article(str(url))
article.download()
article.parse()
article.nlp()
title = article.title
news = article.text
keywords = article.keywords
news = stemming(news)
news
labeltestpac = findlabelpac(news)
print("Passive Aggressive Prediction:",labeltestpac)

The model detects the article from The New York Times as Real. While more testing is needed to draw an unbiased conclusion, I believe this model can distinguish between real and fake news reasonably well.
