Analytic – Click on Advertisement Prediction

12 min readFeb 10, 2023

imgsource: https://www.sparkminute.com/wp-content/uploads/2011/02/ClickOnAd.jpg

Please check here for the complete code.

Problem Statement

The problem states to answer following questions:

Question 1: Data Understanding Explore the basic information of the datasets.

Question 2: Data Cleaning and Preprocessing Clean and preprocess the datasets (such as missing values, outliers, dummy, merging etc.).

Question 3: Model Generation and Evaluation Please split the data into train and test sets with a ratio of 0.7:0.3. Build and optimize classification models you learned in this course.

Question 4: Which model has the best performance? What have you learned from the models you built?

context:

A website sent advertisements by emails to users who were interested in their product. the task is to find a good model to predict if an advertisement will be clicked with given data sets. my objective is to compare two ensemble models (Random Forest and Adaboost), Decision Tree, plus Artificial Neural Network to see which of these models perform best to solve this problem.

input:

user_features.csv = features describing our users
product_features.csv = features describing products shown in the advertisements.
click_history.csv = contains which products users had previously seen and whether that user ordered products in this website before.

output:

EDA of data
Models and Evaluation on the dataset by various models

Solution Design

Data Preparation

create EDA of data
check missing values on the data. if found, fill it with appropriate value
check outlier of data. if found, fill with mean of the column
encode the column that contains list
after the preparation steps are done, merge all of the datasets by appropriate column.
encode the categorical columns and drop some of columns that are useless for the prediction.
check the class distribution ratio. if not balanced, perform upsampling for minority class.
check correlation of the columns.

Data Modeling

split the data 70/30
normalize the data
create the confusion matrix and classification report for evaluation
define some of the functions
perform cross validation for the parameter
fit the data to the models
evaluate the models in term of accuracy, classification report, and confusion matrix
compare all models in ROC AUC Curve

Models that I used:

Random Forest
Adaboost
Decision Tree
Artificial Neural Network


"""this code section is to load the data"""

import pandas as pd
pd.options.mode.chained_assignment = None 
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer, OneHotEncoder, LabelBinarizer
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score  

#import the csv file
df1 = pd.read_csv(r"D:/COURSE MATERIALS/Analytic Python/final exam/click_history.csv")
df2 = pd.read_csv(r"D:/COURSE MATERIALS/Analytic Python/final exam/product_features.csv")
df3 = pd.read_csv(r"D:/COURSE MATERIALS/Analytic Python/final exam/user_features-1.csv")

categories = ['False','True']

"""this code section is to see and preprocess the first dataset"""
#get snapshot of the first data
df1.head()

#plot the target variable to check data imbalance
sns.countplot(x=df1["clicked"])

We can see that the classification is not balanced at all. almost half in difference on the minority class. I will balance the data later.

#define categories of the target variable
categories = ['False','True']
#check the missing values in the data
df1.isnull().sum()

No missing values found. so, we are good here. next, we are moving on to second data.

"""this code section is to see and preprocess the second dataset"""
#get snapshot of the second data
df2.head()

#see the categories contained in the data
df2['category'].drop_duplicates()

we can see that there are 11 categories in the data. the categories seem to be related to beauty product.

#check the missing values
df2.isnull().sum()

No missing values found. again, we are good here.

#check outlier on number review
sns.boxplot(x=df2['number_of_reviews'])

we can see there is one data point that stands out far beyond the boxplot. as the boxplot presented as only a line. we will normalize this datapoint.

"""this part is to normalize the outlier found in the data"""

#get the column with outlier
outdf = df2['number_of_reviews']

#set the quartile and IQR
Q1 = outdf.quantile(0.25)
Q3 = outdf.quantile(0.75)
IQR = Q3 - Q1

#set the datapoint which located beyond the quartile bound to NaN
outdf[(outdf < (Q1 - 1.5 * IQR)) |(outdf > (Q3 + 1.5 * IQR))] = np.NaN   

#fill the NaN with mean of the column
outdf.fillna(outdf.mean(), inplace=True)

#set the column back to original dataframe
df2['number_of_reviews'] = outdf

#check the boxplot again
sns.boxplot(x=df2['number_of_reviews'])

we can see that the boxplot is now in normal state. and all datapoints are located within the range of the boxplot. we move on to third data.

"""this code section is to see and preprocess the second dataset"""
#get snapshot of third dataset 
df3.head()

we can see that there is column that contains list. we have to break it down and encode each of the content.

#check missing values
df3.isnull().sum()

we can see that there are 500 missing values for number of clicks. for this matter, I will just simply put 0 in the missing values. as i assume that unknown number of clicks is equal to 0.

#fill the NaN with 0
df3 = df3.fillna("0")

"""this part is to breakdown the list in the personal interest column"""

#strip the values from string to readable list
for i in range(len(df3)):
    df3['personal_interests'][i] = df3['personal_interests'][i].strip('][').split(', ')

#assign multilabelbinarizer
mlb = MultiLabelBinarizer()

#join the encoded column to main dataset. and pop the original column.
df3 = df3.join(pd.DataFrame(mlb.fit_transform(df3.pop('personal_interests')),
                          columns=mlb.classes_,
                          index=df3.index))

#rename the list that contains nothing to no interest column
df3 = df3.rename(columns={"": "no_interest"})

df3

we can see that the third data is now looking good. as the list are all broken down to individual columns.

"""this section is to get the final dataframe and perform last data preprocessing"""

#merge all of the dataframe
df = pd.merge(df1, df3, on='user_id', how='inner')
df = pd.merge(df, df2, on='product_id', how='inner')

#drop some of unused columns
df = df.drop(['user_id','product_id'], axis = 1)

#encode boolean to 0/1
df["clicked"] = df["clicked"].astype(int)
df["ordered_before"] = df["ordered_before"].astype(int)
df["on_sale"] = df["on_sale"].astype(int)

#encode categorical data to numeric form
le = LabelEncoder()

le.fit(df['category'])
df['category'] = le.transform(df['category']) + 1
le.fit(df['number_of_clicks_before'])
df['number_of_clicks_before'] = le.transform(df['number_of_clicks_before']) + 1

df

we can see that the final data looks nice and looks ready to be used for next steps.

#getting numeric description of data
df.describe()

we can see that the standard deviation for number of reviews column is pretty high compared to the mean of the data. we will investigate this column.

#create histogram for the number of reviews
sns.histplot(data=df, x="number_of_reviews", kde=True)
plt.title('Number of Reviews Histogram with Density')
plt.show()

we can see that the distribution for this column is approximately unifrom. as the plot tend to be flat. with the central data points being the highest in number. we can conclude that the distribution is not so normal.

"""this function is to get the countplot of the target variable"""
def countplot(nn):
    sns.set(rc={'figure.figsize':(12,8)})

    ax = sns.countplot(x=nn,
                       order=nn.value_counts(ascending=False).index)

    abs_values = nn.value_counts(ascending=False)
    rel_values = nn.value_counts(ascending=False, normalize=True).values * 100
    lbls = [f'{p[0]} ({p[1]:.0f}%)' for p in zip(abs_values, rel_values)]

    ax.bar_label(container=ax.containers[0], labels=lbls)
    plt.title("Variable Count Plot")
    plt.show()
    
countplot(df['clicked'])

we can see as previously mentioned, the sample for the target variable is not balanced. data upsampling will be performed.

"""this section is to balance target variable"""
from sklearn.utils import resample

#assign major and minor class
df_majority = df[df.clicked== False]
df_minority = df[df.clicked== True]

#apply random resample based on major class count
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     
                                 n_samples=len(df_majority),    
                                 random_state=123) 

#concat the result into df
df = pd.concat([df_majority, df_minority_upsampled])
df = df.reset_index(drop = True)
countplot(df['clicked'])

we can see that now the data is balanced in count

#import modeling module

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_curve, auc

#split the data to train and test with 70/30 ratio

x_train, x_test, y_train, y_test = train_test_split(df.drop(['clicked'], axis=1), 
                                                    df['clicked'],
                                                    test_size = 0.3,
                                                    shuffle =True,random_state =42)

#assign scaler for train data
ss = StandardScaler()

#assign different train set for scaled data
x_train= pd.DataFrame(ss.fit_transform(x_train), columns = x_train.columns)
x_test= pd.DataFrame(ss.fit_transform(x_test), columns = x_train.columns)

"""this function is to tune the parameter with randomized search cross validation"""

def partuning(x):
    rscv_fit = x.fit(x_train, y_train)
    best_parameters = rscv_fit.best_params_
    best_estimator = rscv_fit.best_estimator_
    print(best_parameters)
    print(best_estimator)

"""this function is to display confusion matrix"""

def confmat(cm):
    ax = sns.heatmap(cm, annot=True, cmap='YlOrBr', fmt='g')
    ax.set_title('Confusion Matrix\n\n')
    ax.set_xlabel('\nPredicted Values')
    ax.set_ylabel('Actual Values ')
    ax.xaxis.set_ticklabels(categories)
    ax.yaxis.set_ticklabels(categories)

    plt.show()

"""this function is to display evaluation of classification"""

def simpleeval(x,y,m):
    pred = m.predict(x)
    score=accuracy_score(y,pred)
    print('Accuracy : %.5f'%score)
    print(classification_report(y, pred, target_names = categories))

Adaboost Approach

#assign classifier
abt = AdaBoostClassifier()

#assign tuning parameter
grid_ab = {
    "n_estimators":[5,50,250,500],
    "learning_rate":[0.01,0.1,1,10,100]
}

#assign RandomizedSearch cross validation with 5 iteration
rscv = RandomizedSearchCV(estimator = abt, 
                          param_distributions = grid_ab, 
                          cv = 5, 
                          n_jobs=-1, 
                          verbose=2, 
                          n_iter=5)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
{'n_estimators': 500, 'learning_rate': 0.1}
AdaBoostClassifier(learning_rate=0.1, n_estimators=500)

"""this part is to fit the classifier to the data with best hyperparameter"""
#assign classifier
adcl = AdaBoostClassifier(learning_rate=0.1, n_estimators=500)
#fit the data
adcl.fit(x_train, y_train)
#print the score of classifier
print("%.2f"%adcl.score(x_test,y_test))

#call the fucnction for adaboost model evaluation
simpleeval(x_test,y_test,adcl)

Accuracy : 0.74188
                 precision    recall  f1-score   support        False      0.76      0.71      0.74      7058
        True       0.72      0.78      0.75      6924    accuracy                           0.74     13982
   macro avg       0.74      0.74      0.74     13982
weighted avg       0.74      0.74      0.74     13982

we can see that the accuracy is not so good but also not so bad. with 74% accuracy. and 74% in macro average F1 score. we can conclude that this model is decent enough.

#call the confusion matrix function 
pred = adcl.predict(x_test) 
confm = confusion_matrix(y_test,pred)
confmat(confm)

we can see that the error is more on the false negative side. in this case, our model predicted true class as false class more than vise versa. that’s why the recall for false class is the lowest.

Neural Network Approach

In [284]:

#import required modules
import tensorflow as tf
import keras
from keras.models import Sequential 
from keras.layers import Dense,Activation,Dropout 
from keras.utils import np_utils

#reshape the y train and test to categorical encoding
YNN_train = np_utils.to_categorical(y_train, 2)
YNN_test = np_utils.to_categorical(y_test, 2)
YNN_train.shape

#define Multi Layer Perceptron model in keras sequential
tf.keras.backend.clear_session()
model=Sequential()
model.add(Dense(256,input_dim=18,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(2,activation='softmax'))
#assign optimizer and compile the model
opt = tf.keras.optimizers.Adam(learning_rate=0.001) 
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
#get the model architecture
model.summary()

we can see that there are 9 layers with ~48000 trainable parameters

#train the neural network
train = model.fit(x_train, YNN_train, epochs=20, batch_size=64, validation_split = 0.3)

"""this section is to plot the accuracy and loss"""
#plotting accuracy and loss
acc = train.history['accuracy']
val_acc = train.history['val_accuracy']
loss = train.history['loss']
val_loss = train.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

We can see that the accuracies and losses for each epoch seem to be flat and stuck in ~74%. we can see that there is no significant improvement in the training process.

"""this section is to evaluate the neural network model"""

#get test score and accuracy
score, acc = model.evaluate(x_test, YNN_test,batch_size=32)
print('Test score: %.2f'%(score*100),'%')
print('Test accuracy: %.2f'%(acc*100),'%')

#get classification report
nnpred = model.predict(x_test)
nnpredmax = np.argmax(nnpred,axis = 1)
print(classification_report(y_test, nnpredmax, target_names = categories))

437/437 [==============================] - 1s 3ms/step - loss: 0.5253 - accuracy: 0.7422
Test score: 52.53 %
Test accuracy: 74.22 %
437/437 [==============================] - 0s 1ms/step
              precision    recall  f1-score   support

       False       0.76      0.72      0.74      7058
        True       0.73      0.77      0.75      6924

    accuracy                           0.74     13982
   macro avg       0.74      0.74      0.74     13982
weighted avg       0.74      0.74      0.74     13982

we can see that the exact accuracy of the model is 74.22% with loss score of 52.53%. our model is neither good or bad. we can also see from the classification report. that for all score, all is ~74%. we can expect that the errors that the model makes are all quite same in number.

#get the confusion matrix
confm = confusion_matrix(y_test,nnpredmax)
confmat(confm)

we can see that the errors are more on false negative side. we can conclude that the neural network performs equally on errors from all the models that we tested. that yield more error on false negative side.

"""this section is to get and plot the ROC AUC"""

r_probs = [0 for _ in range(len(y_test))]
rf_probs = rdcl.predict_proba(x_test)
ad_probs = adcl.predict_proba(x_test)
dt_probs = dtcl.predict_proba(x_test)

rf_probs = rf_probs[:, 1]
ad_probs = ad_probs[:, 1]
dt_probs = dt_probs[:, 1]
nn_probs = nnpred[:, 1]

r_auc = roc_auc_score(y_test, r_probs)
rf_auc = roc_auc_score(y_test, rf_probs)
ad_auc = roc_auc_score(y_test, ad_probs)
dt_auc = roc_auc_score(y_test, dt_probs)
nn_auc = roc_auc_score(y_test, nn_probs)

r_fpr, r_tpr, _ = roc_curve(y_test, r_probs)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
ad_fpr, ad_tpr, _ = roc_curve(y_test, ad_probs)
dt_fpr, dt_tpr, _ = roc_curve(y_test, dt_probs)
nn_fpr, nn_tpr, _ = roc_curve(y_test, nn_probs)

plt.plot(r_fpr, r_tpr, linestyle='--')
plt.plot(rf_fpr, rf_tpr, linestyle='-', label='Random Forest (AUROC = %0.3f)' % rf_auc)
plt.plot(ad_fpr, ad_tpr, linestyle='-', label='Adaboost (AUROC = %0.3f)' % ad_auc)
plt.plot(dt_fpr, dt_tpr, linestyle='-', label='Decision Tree (AUROC = %0.3f)' % dt_auc)
plt.plot(nn_fpr, nn_tpr, linestyle='-', label='Artificial Neural Network (AUROC = %0.3f)' % nn_auc)

# Title
plt.title('ROC Plot')
# Axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# Show legend
plt.legend() 
# Show plot
plt.show()

we can see from the plot, the random forest model outshines all models which are adaboost, Decision Tree, and ANN model. with AUC score of 0.87. while both of adaboost and ANN AUC scores are ~0.82. and decision tree AUC is ~0.83. we can also see that the curves for adaboost and ANN model aligns almost perfectly. meaning that these two models yield results that are likely to be equal. from this plot, we can conclude that random forest gives best result.

Testing & Exception Handling

testing:

run the program as it is (normal case)

exception handling:

no exception handling performed

Discussion

Question 1: Data Understanding Explore the basic information of the datasets.

Answer: EDA has been performed.

Question 2: Data Cleaning and Preprocessing Clean and preprocess the datasets (such as missing values, outliers, dummy, merging etc.).

Answer: Data Cleaning has been performed.

Question 3: Model Generation and Evaluation Please split the data into train and test sets with a ratio of 0.7:0.3. Build and optimize classification models you learned in this course.

Answer: Data Modeling has been performed. all models are trained with 70/30 ratio

Question number 4 : Which model has the best performance? What have you learned from the models you built?

Answer: The model that give best result out of all models that i tested, apparently, is Random Forest model. the lesson that i learned is that random forest can even beat neural network when tuned correctly.

Conclusion¶

All errors are mainly on the false negative side. meaning that the models are struggling to classify True class.
The accuracies for all models are likely equal. the differences are not to extreme.
Random Forest outshine all models with 78% accuracy and AUC score of 0.87. this conclude that for ensemble methods, random forest is better than adaboost for the data.
Random Forest also outshine decision tree. this means that it is better to use more trees than just relying on 1 decision tree.
Neural Network performed equally likely with traditional machine learning classifier. but maybe, with different architecture and technique, the accuracy can be increased higher than current accuracy.