Analytic – Click on Advertisement Prediction
Please check here for the complete code.
Problem Statement
The problem states to answer following questions:
Question 1: Data Understanding Explore the basic information of the datasets.
Question 2: Data Cleaning and Preprocessing Clean and preprocess the datasets (such as missing values, outliers, dummy, merging etc.).
Question 3: Model Generation and Evaluation Please split the data into train and test sets with a ratio of 0.7:0.3. Build and optimize classification models you learned in this course.
Question 4: Which model has the best performance? What have you learned from the models you built?
context:
A website sent advertisements by emails to users who were interested in their product. the task is to find a good model to predict if an advertisement will be clicked with given data sets. my objective is to compare two ensemble models (Random Forest and Adaboost), Decision Tree, plus Artificial Neural Network to see which of these models perform best to solve this problem.
input:
- user_features.csv = features describing our users
- product_features.csv = features describing products shown in the advertisements.
- click_history.csv = contains which products users had previously seen and whether that user ordered products in this website before.
output:
- EDA of data
- Models and Evaluation on the dataset by various models
Solution Design
Data Preparation
- create EDA of data
- check missing values on the data. if found, fill it with appropriate value
- check outlier of data. if found, fill with mean of the column
- encode the column that contains list
- after the preparation steps are done, merge all of the datasets by appropriate column.
- encode the categorical columns and drop some of columns that are useless for the prediction.
- check the class distribution ratio. if not balanced, perform upsampling for minority class.
- check correlation of the columns.
Data Modeling
- split the data 70/30
- normalize the data
- create the confusion matrix and classification report for evaluation
- define some of the functions
- perform cross validation for the parameter
- fit the data to the models
- evaluate the models in term of accuracy, classification report, and confusion matrix
- compare all models in ROC AUC Curve
Models that I used:
- Random Forest
- Adaboost
- Decision Tree
- Artificial Neural Network
"""this code section is to load the data"""
import pandas as pd
pd.options.mode.chained_assignment = None
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer, OneHotEncoder, LabelBinarizer
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
#import the csv file
df1 = pd.read_csv(r"D:/COURSE MATERIALS/Analytic Python/final exam/click_history.csv")
df2 = pd.read_csv(r"D:/COURSE MATERIALS/Analytic Python/final exam/product_features.csv")
df3 = pd.read_csv(r"D:/COURSE MATERIALS/Analytic Python/final exam/user_features-1.csv")
categories = ['False','True']
"""this code section is to see and preprocess the first dataset"""
#get snapshot of the first data
df1.head()
#plot the target variable to check data imbalance
sns.countplot(x=df1["clicked"])
We can see that the classification is not balanced at all. almost half in difference on the minority class. I will balance the data later.
#define categories of the target variable
categories = ['False','True']
#check the missing values in the data
df1.isnull().sum()
No missing values found. so, we are good here. next, we are moving on to second data.
"""this code section is to see and preprocess the second dataset"""
#get snapshot of the second data
df2.head()
#see the categories contained in the data
df2['category'].drop_duplicates()
we can see that there are 11 categories in the data. the categories seem to be related to beauty product.
#check the missing values
df2.isnull().sum()
No missing values found. again, we are good here.
#check outlier on number review
sns.boxplot(x=df2['number_of_reviews'])
we can see there is one data point that stands out far beyond the boxplot. as the boxplot presented as only a line. we will normalize this datapoint.
"""this part is to normalize the outlier found in the data"""
#get the column with outlier
outdf = df2['number_of_reviews']
#set the quartile and IQR
Q1 = outdf.quantile(0.25)
Q3 = outdf.quantile(0.75)
IQR = Q3 - Q1
#set the datapoint which located beyond the quartile bound to NaN
outdf[(outdf < (Q1 - 1.5 * IQR)) |(outdf > (Q3 + 1.5 * IQR))] = np.NaN
#fill the NaN with mean of the column
outdf.fillna(outdf.mean(), inplace=True)
#set the column back to original dataframe
df2['number_of_reviews'] = outdf
#check the boxplot again
sns.boxplot(x=df2['number_of_reviews'])
we can see that the boxplot is now in normal state. and all datapoints are located within the range of the boxplot. we move on to third data.
"""this code section is to see and preprocess the second dataset"""
#get snapshot of third dataset
df3.head()
we can see that there is column that contains list. we have to break it down and encode each of the content.
#check missing values
df3.isnull().sum()
we can see that there are 500 missing values for number of clicks. for this matter, I will just simply put 0 in the missing values. as i assume that unknown number of clicks is equal to 0.
#fill the NaN with 0
df3 = df3.fillna("0")
"""this part is to breakdown the list in the personal interest column"""
#strip the values from string to readable list
for i in range(len(df3)):
df3['personal_interests'][i] = df3['personal_interests'][i].strip('][').split(', ')
#assign multilabelbinarizer
mlb = MultiLabelBinarizer()
#join the encoded column to main dataset. and pop the original column.
df3 = df3.join(pd.DataFrame(mlb.fit_transform(df3.pop('personal_interests')),
columns=mlb.classes_,
index=df3.index))
#rename the list that contains nothing to no interest column
df3 = df3.rename(columns={"": "no_interest"})
df3
we can see that the third data is now looking good. as the list are all broken down to individual columns.
"""this section is to get the final dataframe and perform last data preprocessing"""
#merge all of the dataframe
df = pd.merge(df1, df3, on='user_id', how='inner')
df = pd.merge(df, df2, on='product_id', how='inner')
#drop some of unused columns
df = df.drop(['user_id','product_id'], axis = 1)
#encode boolean to 0/1
df["clicked"] = df["clicked"].astype(int)
df["ordered_before"] = df["ordered_before"].astype(int)
df["on_sale"] = df["on_sale"].astype(int)
#encode categorical data to numeric form
le = LabelEncoder()
le.fit(df['category'])
df['category'] = le.transform(df['category']) + 1
le.fit(df['number_of_clicks_before'])
df['number_of_clicks_before'] = le.transform(df['number_of_clicks_before']) + 1
df
we can see that the final data looks nice and looks ready to be used for next steps.
#getting numeric description of data
df.describe()
we can see that the standard deviation for number of reviews column is pretty high compared to the mean of the data. we will investigate this column.
#create histogram for the number of reviews
sns.histplot(data=df, x="number_of_reviews", kde=True)
plt.title('Number of Reviews Histogram with Density')
plt.show()
we can see that the distribution for this column is approximately unifrom. as the plot tend to be flat. with the central data points being the highest in number. we can conclude that the distribution is not so normal.
"""this function is to get the countplot of the target variable"""
def countplot(nn):
sns.set(rc={'figure.figsize':(12,8)})
ax = sns.countplot(x=nn,
order=nn.value_counts(ascending=False).index)
abs_values = nn.value_counts(ascending=False)
rel_values = nn.value_counts(ascending=False, normalize=True).values * 100
lbls = [f'{p[0]} ({p[1]:.0f}%)' for p in zip(abs_values, rel_values)]
ax.bar_label(container=ax.containers[0], labels=lbls)
plt.title("Variable Count Plot")
plt.show()
countplot(df['clicked'])
we can see as previously mentioned, the sample for the target variable is not balanced. data upsampling will be performed.
"""this section is to balance target variable"""
from sklearn.utils import resample
#assign major and minor class
df_majority = df[df.clicked== False]
df_minority = df[df.clicked== True]
#apply random resample based on major class count
df_minority_upsampled = resample(df_minority,
replace=True,
n_samples=len(df_majority),
random_state=123)
#concat the result into df
df = pd.concat([df_majority, df_minority_upsampled])
df = df.reset_index(drop = True)
countplot(df['clicked'])
we can see that now the data is balanced in count
#import modeling module
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, classification_report, roc_curve, auc
#split the data to train and test with 70/30 ratio
x_train, x_test, y_train, y_test = train_test_split(df.drop(['clicked'], axis=1),
df['clicked'],
test_size = 0.3,
shuffle =True,random_state =42)
#assign scaler for train data
ss = StandardScaler()
#assign different train set for scaled data
x_train= pd.DataFrame(ss.fit_transform(x_train), columns = x_train.columns)
x_test= pd.DataFrame(ss.fit_transform(x_test), columns = x_train.columns)
"""this function is to tune the parameter with randomized search cross validation"""
def partuning(x):
rscv_fit = x.fit(x_train, y_train)
best_parameters = rscv_fit.best_params_
best_estimator = rscv_fit.best_estimator_
print(best_parameters)
print(best_estimator)
"""this function is to display confusion matrix"""
def confmat(cm):
ax = sns.heatmap(cm, annot=True, cmap='YlOrBr', fmt='g')
ax.set_title('Confusion Matrix\n\n')
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ')
ax.xaxis.set_ticklabels(categories)
ax.yaxis.set_ticklabels(categories)
plt.show()
"""this function is to display evaluation of classification"""
def simpleeval(x,y,m):
pred = m.predict(x)
score=accuracy_score(y,pred)
print('Accuracy : %.5f'%score)
print(classification_report(y, pred, target_names = categories))
Adaboost Approach
#assign classifier
abt = AdaBoostClassifier()
#assign tuning parameter
grid_ab = {
"n_estimators":[5,50,250,500],
"learning_rate":[0.01,0.1,1,10,100]
}
#assign RandomizedSearch cross validation with 5 iteration
rscv = RandomizedSearchCV(estimator = abt,
param_distributions = grid_ab,
cv = 5,
n_jobs=-1,
verbose=2,
n_iter=5)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
{'n_estimators': 500, 'learning_rate': 0.1}
AdaBoostClassifier(learning_rate=0.1, n_estimators=500)
"""this part is to fit the classifier to the data with best hyperparameter"""
#assign classifier
adcl = AdaBoostClassifier(learning_rate=0.1, n_estimators=500)
#fit the data
adcl.fit(x_train, y_train)
#print the score of classifier
print("%.2f"%adcl.score(x_test,y_test))
#call the fucnction for adaboost model evaluation
simpleeval(x_test,y_test,adcl)
Accuracy : 0.74188
precision recall f1-score support False 0.76 0.71 0.74 7058
True 0.72 0.78 0.75 6924 accuracy 0.74 13982
macro avg 0.74 0.74 0.74 13982
weighted avg 0.74 0.74 0.74 13982
we can see that the accuracy is not so good but also not so bad. with 74% accuracy. and 74% in macro average F1 score. we can conclude that this model is decent enough.
#call the confusion matrix function
pred = adcl.predict(x_test)
confm = confusion_matrix(y_test,pred)
confmat(confm)
we can see that the error is more on the false negative side. in this case, our model predicted true class as false class more than vise versa. that’s why the recall for false class is the lowest.
Neural Network Approach
In [284]:
#import required modules
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense,Activation,Dropout
from keras.utils import np_utils
#reshape the y train and test to categorical encoding
YNN_train = np_utils.to_categorical(y_train, 2)
YNN_test = np_utils.to_categorical(y_test, 2)
YNN_train.shape
#define Multi Layer Perceptron model in keras sequential
tf.keras.backend.clear_session()
model=Sequential()
model.add(Dense(256,input_dim=18,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(2,activation='softmax'))
#assign optimizer and compile the model
opt = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
#get the model architecture
model.summary()
we can see that there are 9 layers with ~48000 trainable parameters
#train the neural network
train = model.fit(x_train, YNN_train, epochs=20, batch_size=64, validation_split = 0.3)
"""this section is to plot the accuracy and loss"""
#plotting accuracy and loss
acc = train.history['accuracy']
val_acc = train.history['val_accuracy']
loss = train.history['loss']
val_loss = train.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
We can see that the accuracies and losses for each epoch seem to be flat and stuck in ~74%. we can see that there is no significant improvement in the training process.
"""this section is to evaluate the neural network model"""
#get test score and accuracy
score, acc = model.evaluate(x_test, YNN_test,batch_size=32)
print('Test score: %.2f'%(score*100),'%')
print('Test accuracy: %.2f'%(acc*100),'%')
#get classification report
nnpred = model.predict(x_test)
nnpredmax = np.argmax(nnpred,axis = 1)
print(classification_report(y_test, nnpredmax, target_names = categories))
437/437 [==============================] - 1s 3ms/step - loss: 0.5253 - accuracy: 0.7422
Test score: 52.53 %
Test accuracy: 74.22 %
437/437 [==============================] - 0s 1ms/step
precision recall f1-score support
False 0.76 0.72 0.74 7058
True 0.73 0.77 0.75 6924
accuracy 0.74 13982
macro avg 0.74 0.74 0.74 13982
weighted avg 0.74 0.74 0.74 13982
we can see that the exact accuracy of the model is 74.22% with loss score of 52.53%. our model is neither good or bad. we can also see from the classification report. that for all score, all is ~74%. we can expect that the errors that the model makes are all quite same in number.
#get the confusion matrix
confm = confusion_matrix(y_test,nnpredmax)
confmat(confm)
we can see that the errors are more on false negative side. we can conclude that the neural network performs equally on errors from all the models that we tested. that yield more error on false negative side.
"""this section is to get and plot the ROC AUC"""
r_probs = [0 for _ in range(len(y_test))]
rf_probs = rdcl.predict_proba(x_test)
ad_probs = adcl.predict_proba(x_test)
dt_probs = dtcl.predict_proba(x_test)
rf_probs = rf_probs[:, 1]
ad_probs = ad_probs[:, 1]
dt_probs = dt_probs[:, 1]
nn_probs = nnpred[:, 1]
r_auc = roc_auc_score(y_test, r_probs)
rf_auc = roc_auc_score(y_test, rf_probs)
ad_auc = roc_auc_score(y_test, ad_probs)
dt_auc = roc_auc_score(y_test, dt_probs)
nn_auc = roc_auc_score(y_test, nn_probs)
r_fpr, r_tpr, _ = roc_curve(y_test, r_probs)
rf_fpr, rf_tpr, _ = roc_curve(y_test, rf_probs)
ad_fpr, ad_tpr, _ = roc_curve(y_test, ad_probs)
dt_fpr, dt_tpr, _ = roc_curve(y_test, dt_probs)
nn_fpr, nn_tpr, _ = roc_curve(y_test, nn_probs)
plt.plot(r_fpr, r_tpr, linestyle='--')
plt.plot(rf_fpr, rf_tpr, linestyle='-', label='Random Forest (AUROC = %0.3f)' % rf_auc)
plt.plot(ad_fpr, ad_tpr, linestyle='-', label='Adaboost (AUROC = %0.3f)' % ad_auc)
plt.plot(dt_fpr, dt_tpr, linestyle='-', label='Decision Tree (AUROC = %0.3f)' % dt_auc)
plt.plot(nn_fpr, nn_tpr, linestyle='-', label='Artificial Neural Network (AUROC = %0.3f)' % nn_auc)
# Title
plt.title('ROC Plot')
# Axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# Show legend
plt.legend()
# Show plot
plt.show()
we can see from the plot, the random forest model outshines all models which are adaboost, Decision Tree, and ANN model. with AUC score of 0.87. while both of adaboost and ANN AUC scores are ~0.82. and decision tree AUC is ~0.83. we can also see that the curves for adaboost and ANN model aligns almost perfectly. meaning that these two models yield results that are likely to be equal. from this plot, we can conclude that random forest gives best result.
Testing & Exception Handling
testing:
- run the program as it is (normal case)
exception handling:
- no exception handling performed
Discussion
Question 1: Data Understanding Explore the basic information of the datasets.
Answer: EDA has been performed.
Question 2: Data Cleaning and Preprocessing Clean and preprocess the datasets (such as missing values, outliers, dummy, merging etc.).
Answer: Data Cleaning has been performed.
Question 3: Model Generation and Evaluation Please split the data into train and test sets with a ratio of 0.7:0.3. Build and optimize classification models you learned in this course.
Answer: Data Modeling has been performed. all models are trained with 70/30 ratio
Question number 4 : Which model has the best performance? What have you learned from the models you built?
Answer: The model that give best result out of all models that i tested, apparently, is Random Forest model. the lesson that i learned is that random forest can even beat neural network when tuned correctly.
Conclusion¶
- All errors are mainly on the false negative side. meaning that the models are struggling to classify True class.
- The accuracies for all models are likely equal. the differences are not to extreme.
- Random Forest outshine all models with 78% accuracy and AUC score of 0.87. this conclude that for ensemble methods, random forest is better than adaboost for the data.
- Random Forest also outshine decision tree. this means that it is better to use more trees than just relying on 1 decision tree.
- Neural Network performed equally likely with traditional machine learning classifier. but maybe, with different architecture and technique, the accuracy can be increased higher than current accuracy.