Click-through rate (CTR) is the key metric defined as the number of clicks an ad receives divided by the number of views (impressions). Accurately predicting CTR, the primary metric, is critical to media-buying decisions because it ensures that ads are targeted to the right users. This workspace predicts CTR from categorical and numeric features. It also calculates the cost, return, and return on investment (ROI) based on predicted clicks.
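As a quick illustration of this arithmetic, the sketch below computes CTR, cost, return, and ROI; the per-impression cost and per-click revenue are assumed placeholder values, not figures from this dataset:
# Sketch: CTR and ROI arithmetic with assumed unit economics
impressions = 10_000
predicted_clicks = 170

ctr = predicted_clicks / impressions        # clicks divided by impressions
cost = impressions * 0.002                  # assumed $0.002 cost per impression
revenue = predicted_clicks * 0.75           # assumed $0.75 revenue per click
roi = (revenue - cost) / cost               # return on investment relative to spend

print(f"CTR: {ctr:.2%}, cost: ${cost:.2f}, return: ${revenue:.2f}, ROI: {roi:.1%}")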
The step-by-step analysis below implements a basic model in Python to better optimize ads with machine learning, using web browser data (ctr_data.csv) recorded over one day.
An overview of the dataset:
- It is web browser data stored in ctr_data.csv.
- click is the binary target variable: non-click (0) or click (1).
- 3 numerical features have already been created: "search_engine_type_count", "product_type_count", and "advertiser_type_count".
- There are no NaN/NA values.
Step 0: Import Libraries
# Basic operations
import pandas as pd
import numpy as np
# Data visualizations
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
# sklearn for predictive analytics
## Data engineering
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, OneHotEncoder
## Model preparation
# Data split
from sklearn.model_selection import train_test_split, RandomizedSearchCV
# Scaling numerical features after the data split
from sklearn.preprocessing import StandardScaler
# Feature selection
from sklearn.feature_selection import mutual_info_classif
# Model building
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
# Model evaluation metrics
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_curve, auc
from imblearn.metrics import classification_report_imbalanced, geometric_mean_score
# Set model SEED for consistent results
SEED = 123
Step 1: Get to know the dataset
1.1. Read the data
# Read the data
df = pd.read_csv("data/ctr_data.csv")
# Preview the data
df.head(5)
1.2. Observations for pre-processing investigation:
- The data summary reports the data types and the number of non-missing values; it confirms there is no missing data. All variables are coded as numeric, and based on the descriptive statistics and the feature descriptions, the categorical features are also coded as integers. For feature engineering, we need to define them explicitly before preparing the data for modeling.
- The bar chart below shows the class imbalance inherent in the data. Over/under-sampling techniques, splitting the data into train and test sets while preserving the class ratio, and tracking imbalance-aware metrics such as balanced accuracy and the geometric mean can help handle this issue and further refine the machine learning model (see the sketch after this list).
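As a minimal sketch of these ideas, assuming the feature matrix is built by dropping the target from df (all columns are numeric, per Step 1) and reusing SEED from Step 0; RandomOverSampler is just one of several resampling options:
# Sketch: stratified split, over-sampling, and imbalance-aware metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.metrics import geometric_mean_score

X = df.drop(columns=["click"])  # assumes all remaining columns are usable features
y = df["click"]

# Stratify so train and test keep the same click/non-click ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# Over-sample the minority class in the training set only
X_res, y_res = RandomOverSampler(random_state=SEED).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=SEED).fit(X_res, y_res)
y_pred = clf.predict(X_test)

# Imbalance-aware evaluation
print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("Geometric mean:", geometric_mean_score(y_test, y_pred))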
1.2.1. Data summary:
df.info()
df.describe()
1.2.2. Target variable:
# Calculating the percentage representation of each class
class_percentages = df['click'].value_counts(normalize=True) * 100
# Converting the series to a dataframe for plotting
class_percentages_df = class_percentages.reset_index()
class_percentages_df.columns = ['click', 'Percentage']
# Creating the bar plot
fig = px.bar(class_percentages_df, x='click', y='Percentage',
             title='Percentage Representation of Click-through Rate',
             labels={'click': 'CTR', 'Percentage': 'Percentage'})
fig.update_layout(xaxis_title='CTR', yaxis_title='Percentage', xaxis={'categoryorder': 'total descending'})
fig.show()
Step 2: Pre-processing the data
2.1. Before data processing
Observations for data engineering:
- Categorical Variables: There are 7 categorical variables, and their unique value counts range from 4 to 2473. device_model_int has too many unique values; therefore, all categorical variables other than this one will be processed.
- Numerical Variables: The variables are heavily skewed, so they will be transformed in the next section. Although the heatmap shows a strong positive correlation between "product_type_count" and "advertiser_type_count", we will keep both in the base model. However, it is reasonable to fine-tune the model without the one carrying less mutual information (please refer to 'Feature Selection'); a sketch of the transform and mutual-information ranking follows this list.
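As a minimal sketch of those two steps, assuming df from Step 1 and SEED from Step 0 (Yeo-Johnson is one reasonable PowerTransformer method, since it also handles zero values):
# Sketch: reduce skewness, then rank features by mutual information
from sklearn.preprocessing import PowerTransformer
from sklearn.feature_selection import mutual_info_classif

numeric_features = ["search_engine_type_count", "product_type_count", "advertiser_type_count"]

# Yeo-Johnson works with zero/negative values and reduces skewness
X_num = PowerTransformer(method="yeo-johnson").fit_transform(df[numeric_features])

# Mutual information of each transformed feature with the click target
mi = mutual_info_classif(X_num, df["click"], random_state=SEED)
for name, score in sorted(zip(numeric_features, mi), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.4f}")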
# How many unique values per each categorical variable?
categorical_features = ["search_engine_type", "banner_pos", "device_type", "device_conn_type", "product_type", "advertiser_type", "device_model_int"]
print(df[categorical_features].nunique())
catcol_values = {}
for col in categorical_features:
    catcol_values[col] = df[col].unique()
catcol_values
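Looking ahead, the pieces imported in Step 0 could be combined into a single modeling pipeline along these lines. This is only a sketch: it reuses categorical_features from the cell above, drops device_model_int for its high cardinality, and leaves the random forest at untuned defaults:
# Sketch: preprocessing + model in one pipeline, using the Step 0 imports
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer
from sklearn.ensemble import RandomForestClassifier

onehot_features = [c for c in categorical_features if c != "device_model_int"]
numeric_features = ["search_engine_type_count", "product_type_count", "advertiser_type_count"]

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), onehot_features),
    ("num", PowerTransformer(method="yeo-johnson"), numeric_features),
])  # remaining columns are dropped by default

model = Pipeline([
    ("prep", preprocessor),
    ("clf", RandomForestClassifier(random_state=SEED)),
])
# model.fit(X_train, y_train) would then train on the stratified split above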