Digital Marketing Campaign Analysis: Predicting Conversion and Clustering Customers
Abstract
This project analyzes a synthetic dataset of customer interactions with a digital marketing campaign. Using hierarchical clustering and the tidymodels package, it simulates the kind of exploratory data analysis and strategic recommendations an analyst would make in a real-world marketing or advertising context.
The first half of the project covers customer segmentation, building buyer personas from demographics, purchase behavior, and engagement levels. The second half uses logistic regression and hyperparameter-tuned decision tree models from the tidymodels package to predict customer Conversion.
Load Dataset and Libraries
library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(broom)
#install.packages('naniar')
library(naniar)
library(corrplot)
library(dendextend)
library(forcats)
library(tidyr)
library(caret)
library(caTools)
library(tidymodels)
library(xgboost)
# Load dataset
marketing_df <- read_csv("~/Downloads/digital_marketing_campaign_dataset.csv")
str(marketing_df)
The data had 8,000 observations across 20 variables. Two of those variables (AdvertisingPlatform and AdvertisingTool) weren't of use because their values were labeled as confidential. However, the other 18 can be split into five distinct groups:
Demographic Information
- CustomerID: Unique identifier for each customer.
- Age: Age of the customer.
- Gender: Gender of the customer (Male/Female were the only options in the dataset).
- Income: Annual income of the customer in USD.
Marketing Variables
- CampaignChannel: The channel through which the marketing campaign is delivered (Email, Social Media, SEO, PPC, Referral).
- CampaignType: Type of the marketing campaign (Awareness, Consideration, Conversion, Retention).
- AdSpend: Amount spent on the marketing campaign in USD.
- ClickThroughRate: Rate at which customers click on the marketing content.
- ConversionRate: Rate at which clicks convert to desired actions (e.g., purchases).
- AdvertisingPlatform: Confidential (not included in final analysis).
- AdvertisingTool: Confidential (not included in final analysis).
Customer Engagement Variables
- WebsiteVisits: Number of visits to the website.
- PagesPerVisit: Average number of pages visited per session.
- TimeOnSite: Average time spent on the website per visit (in minutes).
- SocialShares: Number of times the marketing content was shared on social media.
- EmailOpens: Number of times marketing emails were opened.
- EmailClicks: Number of times links in marketing emails were clicked.
Historical Customer Data
- PreviousPurchases: Number of previous purchases made by the customer.
- LoyaltyPoints: Number of loyalty points accumulated by the customer.
Target Variable
- Conversion: Binary variable indicating whether the customer converted (1) or not (0).
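One way to confirm that the two confidential columns carry no usable information before dropping them is to check which columns hold a single distinct value. The sketch below is a minimal illustration on a toy data frame; `toy_df` and its values (including the "Confidential" placeholder text) are assumptions for demonstration, not the real data.

```r
# Toy data frame; the column values are placeholders, not the real dataset
toy_df <- data.frame(
  CustomerID = 1:4,
  AdSpend = c(120.5, 89.0, 310.2, 45.7),
  AdvertisingPlatform = rep("Confidential", 4),
  AdvertisingTool = rep("Confidential", 4)
)

# A column with a single distinct value cannot help distinguish customers
is_constant <- vapply(toy_df, function(col) length(unique(col)) == 1, logical(1))
constant_cols <- names(toy_df)[is_constant]
print(constant_cols)

# Drop the constant columns and keep everything else
toy_clean <- toy_df[, !is_constant, drop = FALSE]
print(names(toy_clean))
```

The same check could be run on marketing_df itself, assuming the confidential columns really do contain one repeated value.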
Data Cleaning
# Check structure and missing values
str(marketing_df)
miss_var_summary(marketing_df)
No variables had missing values, which reduced the amount of data cleaning needed before exploratory data analysis. For ease of analysis later, however, Gender was converted to a factor variable.
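Because factor() with explicit levels silently turns any value outside those levels into NA, it can be worth confirming that a column contains only the expected values before converting it. The sketch below uses made-up values; the "Other" entry is invented purely to show the failure mode and does not appear in the dataset as described.

```r
# Hypothetical Gender values; "Other" is invented to show the failure mode
gender_raw <- c("Male", "Female", "Female", "Other")
expected_levels <- c("Male", "Female")

# Values present in the data but absent from the intended levels
unexpected <- setdiff(unique(gender_raw), expected_levels)
print(unexpected)

# factor() does not warn about unlisted values; it converts them to NA
gender_factor <- factor(gender_raw, levels = expected_levels)
print(sum(is.na(gender_factor)))
```

If `unexpected` is empty for the real column, the conversion cannot introduce new missing values.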
# Convert Gender from Character to Factor
marketing_df <- marketing_df %>%
  mutate(Gender = factor(Gender, levels = c("Male", "Female")))
str(marketing_df)
Exploratory Data Analysis
There were two primary goals of exploratory data analysis in this project:
- Visualize the distributions of, and relationships between, the numeric predictors in this dataset. This was done through both visualization and clustering.
- Explore the relationship between the target variable Conversion and the categorical predictors (CampaignChannel, CampaignType, Gender).
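For the second goal, a simple starting point is the conversion rate within each category, since the mean of a 0/1 variable is the share of 1s. The sketch below shows the idea on a toy data frame with hypothetical values; `toy` stands in for marketing_df and is not real data.

```r
# Toy data with hypothetical values, standing in for marketing_df
toy <- data.frame(
  CampaignChannel = c("Email", "Email", "SEO", "SEO", "PPC", "PPC"),
  Conversion = c(1, 0, 1, 1, 0, 0)
)

# The mean of a 0/1 variable is the share of 1s, i.e. the conversion rate
rates <- aggregate(Conversion ~ CampaignChannel, data = toy, FUN = mean)
print(rates)
```

The same formula could be repeated with CampaignType or Gender on the left of `data =` to compare the other categorical predictors.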
# Histograms of Numeric Variables
marketing_long <- pivot_longer(marketing_df, cols = where(is.numeric))
ggplot(marketing_long, aes(x = value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~name, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Numeric Variables",
       subtitle = "Digital Marketing Campaign Dataset",
       x = "",
       y = "Frequency")
Most of the numeric variables appear to have relatively uniform distributions. The exceptions are SocialShares and WebsiteVisits, which show several distinct peaks. That may hint at a relationship between the two variables, both of which reflect customers' digital activity (i.e., a certain number of website visits may correspond with a related range of social media shares about the marketing campaign), although the histograms alone cannot confirm it.
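One thing that may be worth ruling out before reading much into the peaks is a binning artifact. When an integer-valued count is split into a number of bins that does not divide its range evenly, some bins cover more integer values than others, and even uniform data looks spiky. The sketch below illustrates this under stated assumptions: the 0 to 49 range is hypothetical, the data are simulated, and ggplot2 places its bins somewhat differently from the simple equal-width breaks used here.

```r
set.seed(42)
# Hypothetical stand-in for a count variable: integers 0 to 49, drawn uniformly
visits <- sample(0:49, size = 8000, replace = TRUE)

# 30 equal-width bins across the range
breaks <- seq(-0.5, 49.5, length.out = 31)

# How many distinct integer values land in each bin
ints_per_bin <- as.integer(table(cut(0:49, breaks = breaks)))
print(table(ints_per_bin))

# Observations per bin: bins holding two integers are roughly twice as tall
obs_per_bin <- as.integer(table(cut(visits, breaks = breaks)))
print(obs_per_bin)
```

If the peaks in SocialShares and WebsiteVisits disappear when the histogram uses one bin per integer value, they are more likely an effect of binning than a sign of structure in the data.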
After exploring each individual numeric predictor's distribution, it's also worth looking at the relationship between the potential predictors.
Visualizing Correlation Between Numeric Variables
# Select only numeric columns
numeric_variables <- marketing_df[sapply(marketing_df, is.numeric)]
# Compute the correlation matrix
corr_mat <- round(cor(numeric_variables, use = "pairwise.complete.obs"), 2)
# Correlation heatmap
corrplot(corr_mat, method = "color", col = rev(heat.colors(50)))
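A heatmap gives an overall picture, but it can be easier to read the strongest relationships from a ranked list of variable pairs. The sketch below shows one way to flatten a correlation matrix into such a list; `toy`, `toy_corr`, and `ranked_pairs` are illustrative names, and the data are simulated so that x and y are strongly related by construction.

```r
# Toy data with simulated values; x and y are built to be strongly related
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- 2 * toy$x + rnorm(200, sd = 0.1)
toy$z <- rnorm(200)

toy_corr <- cor(toy)

# Keep each pair once by looking only at the upper triangle
idx <- which(upper.tri(toy_corr), arr.ind = TRUE)
ranked_pairs <- data.frame(
  var1 = rownames(toy_corr)[idx[, "row"]],
  var2 = colnames(toy_corr)[idx[, "col"]],
  r = toy_corr[idx]
)

# Strongest relationships first, regardless of sign
ranked_pairs <- ranked_pairs[order(-abs(ranked_pairs$r)), ]
print(ranked_pairs)
```

Applied to corr_mat, the same steps would list the most correlated numeric variables in the marketing data, though identifier columns such as CustomerID would likely need to be excluded first.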