Project Introduction: Using Machine Learning to Improve Marketing Insights for AZ Watch
Machine learning offers a powerful way to uncover deeper insights in marketing analytics. It moves beyond basic metrics like monthly customer counts, helping businesses understand user behavior and identify the underlying factors driving engagement and churn.
This project demonstrates the key steps involved in applying machine learning to marketing analytics, using AZ Watch—a leading video streaming platform focused on educational content—as a real-world example. With its wide range of tutorials, AZ Watch aims to boost user acquisition and retention through data-informed strategies.
Project Objectives:
- Predict Customer Churn: Build a supervised machine learning model to identify subscribers likely to cancel their subscriptions.
- Customer Segmentation: Apply unsupervised learning to discover distinct customer groups based on behavior and engagement.
These insights are intended to support targeted marketing campaigns, improve user retention, and inform strategic decision-making.
Data Validation
The first step in any data analytics task is data validation. This involves understanding the structure and quality of the dataset, determining which columns will serve as features, identifying the target variable, and deciding which columns to drop if they do not contribute meaningful information to the model.
The data/AZWatch_subscribers.csv dataset contains information about subscribers and their status over the last year:
| Column name | Description | 
|---|---|
| subscriber_id | The unique identifier of each subscriber user | 
| age_group | The subscriber's age group | 
| engagement_time | Average time (in minutes) spent by the subscriber per session | 
| engagement_frequency | Average weekly number of times the subscriber logged in to the platform (sessions) over a one-year period | 
| subscription_status | Whether the user remained subscribed to the platform at the end of the one-year period (subscribed), or unsubscribed and terminated their services (churned) | 
Below is a summary of the validation process:
- subscriber_id: Contains 1,000 unique values. No cleaning required, but dropped from the analysis as it does not provide predictive value.
- age_group: A categorical column with 3 distinct categories. Requires encoding for machine learning models.
- engagement_time: A numeric column with values within a reasonable range. No cleaning necessary.
- engagement_frequency: A numeric column. Eight negative values were removed because a login frequency cannot be negative, so these rows are invalid. One extreme outlier was also removed so that it would not distort the distribution.
- subscription_status: A binary categorical column. Requires encoding to be used as the target variable in modeling.
After the validation process, the cleaned dataset consists of 991 rows ready for further analysis.
The code for loading data and validation is shown below.
# Importing the necessary modules
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.cluster import KMeans
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
#Setting option to display all columns in output
pd.set_option('display.max_columns', None)
# Specifying the file path of CSV file
file_path = "data/AZWatch_subscribers.csv"
# Reading the CSV file into a DataFrame
df = pd.read_csv(file_path)
# Data validation
# engagement_frequency contains negative values, which are invalid for a login count
# Keep only strictly positive frequencies
df = df[df['engagement_frequency'] > 0.001]  # drops 8 rows, leaving 992
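The checks summarized in the validation list above could be scripted along the following lines. This is a minimal sketch on a hypothetical toy DataFrame: the rows, the age group labels, and the `validation_report` helper are illustrative assumptions, not the real dataset or the author's code.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's columns (not the real data).
toy = pd.DataFrame({
    "subscriber_id": [1, 2, 3, 4, 5],
    "age_group": ["18-34", "35 and over", "under 18", "18-34", "35 and over"],
    "engagement_time": [5.5, 3.2, 8.1, 4.0, 6.3],
    "engagement_frequency": [10, -3, 7, 2, 45],
    "subscription_status": ["subscribed", "churned", "subscribed",
                            "churned", "subscribed"],
})


def validation_report(df):
    """Collect a few basic structure and quality checks (hypothetical helper)."""
    return {
        "n_rows": len(df),
        "unique_ids": df["subscriber_id"].nunique(),
        "missing_values": int(df.isna().sum().sum()),
        "age_groups": df["age_group"].nunique(),
        "negative_frequency_rows": int((df["engagement_frequency"] < 0).sum()),
    }


report = validation_report(toy)
print(report)

# Keep only rows with a strictly positive login frequency.
cleaned = toy[toy["engagement_frequency"] > 0]
print(len(cleaned))
```

Running the report before and after each filter makes it easy to confirm how many rows a cleaning step removed.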
Exploratory Data Analysis (EDA)
Following data cleaning and validation, the next step is Exploratory Data Analysis (EDA). The purpose of EDA is to gain a deeper understanding of the dataset before applying machine learning models.
EDA was performed to:
- Examine the distribution of numeric variables, as skewed or bimodal distributions can impact model performance and may require transformation.
- Identify any class imbalances within categorical variables, particularly the target variable (subscription_status), so that rebalancing techniques such as oversampling or undersampling are applied only when they are actually needed.
The analysis revealed that the numeric columns exhibited a bimodal distribution, indicating the presence of two distinct behavioral patterns among users. On the other hand, the categorical variables appeared to be fairly evenly distributed, suggesting no immediate need for rebalancing.
This step is crucial for detecting underlying patterns, anomalies, and relationships that could influence model outcomes and guide feature engineering.
# EDA
# Visualizing the distribution of engagement frequency
sns.boxplot(x=df['engagement_frequency'])  # right-skewed
plt.show()
# Inspecting the outlier rows
outliers = df[df['engagement_frequency'] > 40]
print(outliers)
# Removing outliers
df = df[df['engagement_frequency'] < 40]
# Engagement frequency (histplot replaces the deprecated distplot)
sns.histplot(df['engagement_frequency'], kde=True)
plt.show()
# Engagement time
sns.histplot(df['engagement_time'], kde=True)
plt.show()
# Subscription status (countplot shows the number of rows per category)
sns.countplot(x=df['subscription_status'])
plt.show()
# Age group
sns.countplot(x=df['age_group'])
plt.show()
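The plots above can be complemented with numeric checks for class balance and skewness. The sketch below uses a small hypothetical sample rather than the real data, so the values are assumptions made purely for illustration.

```python
import pandas as pd

# Hypothetical sample (not the real data): one long-tailed numeric column
# and a perfectly balanced target column.
toy = pd.DataFrame({
    "engagement_frequency": [1, 2, 2, 3, 3, 3, 4, 20],
    "subscription_status": ["subscribed", "churned", "subscribed", "churned",
                            "subscribed", "churned", "subscribed", "churned"],
})

# Share of each class in the target: values near 0.5 suggest no rebalancing is needed.
class_share = toy["subscription_status"].value_counts(normalize=True)
print(class_share)

# Sample skewness: a positive value indicates a right-skewed distribution.
skewness = toy["engagement_frequency"].skew()
print(skewness)
```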
Data Preprocessing for Machine Learning
The next step in the pipeline is data preprocessing, which involves structuring the data in a format that machine learning algorithms can interpret effectively. For supervised learning, preprocessing typically follows these key steps:
- Separate the Target Variable (y) from Features (X): The target variable is what we want the model to predict, in this case subscription_status. Since this is a categorical variable, it needs to be converted into a numeric format using encoding. The result is typically binary (e.g., 0 and 1). Although some libraries assign these values automatically (often in alphabetical order), it is best practice to manually assign class labels to avoid ambiguity and maintain consistency.
- Drop the Target and Irrelevant Columns from the Feature Set (X): Once the target is isolated, we remove it along with any other columns that do not contribute meaningful information to the model. In this case, subscriber_id is also dropped from the feature set.
- Encode Categorical Features: The age_group column is also categorical and must be encoded using the appropriate method (e.g., one-hot encoding or label encoding), depending on the model requirements.
- Scale Numerical Features: To ensure that numerical variables are on the same scale and do not disproportionately influence the model, standardization is applied. This mainly benefits scale-sensitive models such as logistic regression; tree-based models such as decision trees and random forests are largely unaffected by feature scaling.
- Split the Dataset into Training and Testing Sets: The final step is to divide the data into training and testing sets. The training set is used to train the model, while the testing set evaluates its performance. In this case, a test size of 0.3 was used, meaning 30% of the data was reserved for testing.
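To illustrate why the target labels are assigned manually, the sketch below compares scikit-learn's `LabelEncoder`, which numbers classes in sorted order, with an explicit mapping. The four-row series is a hypothetical example, not the real data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values (not the real data).
status = pd.Series(["subscribed", "churned", "subscribed", "churned"])

# LabelEncoder sorts the class names, so 'churned' becomes 0 and 'subscribed' becomes 1.
encoder = LabelEncoder()
auto_codes = encoder.fit_transform(status)
print(list(encoder.classes_))
print(auto_codes.tolist())

# An explicit mapping makes churn the positive class (1).
manual_codes = status.map({"churned": 1, "subscribed": 0})
print(manual_codes.tolist())
```

With the explicit mapping, metrics such as recall and precision are reported for the churned class, which is the class of interest here.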
# Data preprocessing
# Defining the target variable
y_raw = df['subscription_status']
# Encoding the target variable manually so that churn is the positive class (1)
y = y_raw.map({
    'churned': 1,
    'subscribed': 0})
# Separating the target from the features and removing unnecessary columns
X = df.drop(['subscriber_id', 'subscription_status'], axis=1)
# Encoding age_group with one-hot encoding
X = pd.get_dummies(X, columns=['age_group'], drop_first=False)
# Scaling numeric features (the scaler is fitted on the full dataset, before the split)
X[['engagement_time', 'engagement_frequency']] = StandardScaler().fit_transform(
    X[['engagement_time', 'engagement_frequency']]
)
# Splitting data into train and test sets (no random_state is set, so the split differs between runs)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Validating the test size proportion
print(len(X_test) / len(X))
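One possible variation, shown here only as a sketch, is to split first and let a scikit-learn `Pipeline` fit the scaler and encoder on the training rows alone, so that no test-set statistics influence preprocessing. This is not the approach used above. The synthetic data, the age group labels, the fixed `random_state`, and the stratified split are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the subscriber data (not the real dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age_group": rng.choice(["18-34", "35 and over", "under 18"], size=n),
    "engagement_time": rng.normal(6, 2, size=n),
    "engagement_frequency": rng.integers(1, 30, size=n),
})
# Synthetic label: low-frequency users are marked as churned (1).
y = (toy["engagement_frequency"] < 10).astype(int)

# Split first; stratify keeps the churn share similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    toy, y, test_size=0.3, stratify=y, random_state=42
)

# Scaling and encoding are fitted inside the pipeline, on training rows only.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["engagement_time", "engagement_frequency"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["age_group"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(len(X_test) / len(toy))
print(test_accuracy)
```

Because the preprocessing lives inside the pipeline, the same fitted transformations are applied automatically whenever the model scores or predicts on new data.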
Model Selection
Once the data is properly prepared, the next step is to select a machine learning model. The choice of model depends on the type of the target variable and the objective of the analysis.
Since the goal of this task is to predict a binary outcome—whether a subscriber will churn or not—the problem falls under supervised classification. In such cases, it's common practice to experiment with multiple models and select the one that delivers the best performance based on evaluation metrics.
For this project, the following classification models were used:
- Logistic Regression
- Random Forest Classifier
- Decision Tree Classifier
These models were selected for their interpretability, performance with classification problems, and ability to handle both categorical and numerical variables. Each model was trained and tested on the prepared dataset, and their performance was evaluated using the following metrics:
Accuracy: The percentage of correct predictions out of all predictions made.
Precision: The proportion of positive predictions that were actually correct (useful when the cost of false positives is high).
Recall: The proportion of actual positives that were correctly identified (important when missing positives is costly).
ROC Curve (Receiver Operating Characteristic Curve): The ROC curve is a visual tool used to evaluate the performance of a classification model. It plots the True Positive Rate (Recall) against the False Positive Rate at various threshold settings. The curve shows how well the model distinguishes between the two classes (e.g., churn vs. no churn). A model with good classification ability will have a curve that bows towards the top-left corner of the plot.
Why it’s useful: The ROC curve helps compare different models and choose the best one, especially when class distributions are imbalanced. It gives a complete picture of a model’s performance across all classification thresholds, not just one.
AUC (Area Under the Curve): Measures the ability of the model to distinguish between classes. An AUC of 1.0 means perfect classification, while 0.5 suggests no better than random guessing. The higher the AUC, the better the model is at predicting positives as positives and negatives as negatives.
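As a worked example of the definitions above, the sketch below computes accuracy, precision, and recall by hand from confusion-matrix counts and checks them against scikit-learn. The ten labels are made up for illustration, with 1 standing for a churned subscriber.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels for illustration: 1 = churned, 0 = subscribed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # correct positives / predicted positives
recall = tp / (tp + fn)                      # correct positives / actual positives

print(accuracy, precision, recall)
print(accuracy_score(y_true, y_pred),
      precision_score(y_true, y_pred),
      recall_score(y_true, y_pred))
```

Here one churned subscriber is missed (a false negative) and one loyal subscriber is flagged by mistake (a false positive), which gives an accuracy of 0.8 and a precision and recall of 0.75 each.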
Since the goal is to combat customer churn, the most important thing is to identify subscribers who are at risk of cancelling so that the company can take action (e.g., send re-engagement emails or offers). The metric to prioritize is therefore recall (true positive rate): the aim is to catch as many at-risk subscribers as possible, even at the cost of some false positives.
The logistic regression model performed best, with a recall score of 0.903, and was therefore chosen for prediction.
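A recall-based comparison of the three candidate models can also be written as a loop. The sketch below swaps in 5-fold cross-validation in place of the single train/test split used in this project and runs on synthetic data from `make_classification`, so its scores say nothing about the real results. The dictionary keys and `random_state` values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the AZ Watch dataset).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean recall across 5 folds for each candidate model.
mean_recall = {
    name: cross_val_score(model, X, y, cv=5, scoring="recall").mean()
    for name, model in candidates.items()
}

best_model = max(mean_recall, key=mean_recall.get)
print(mean_recall)
print(best_model)
```

Averaging over several folds makes the comparison less dependent on one particular random split.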
#Model selection 
#Logistic regression 
from sklearn.linear_model  import LogisticRegression
#Instantiating the model 
model1 = LogisticRegression()
#Training model 
model1.fit(X_train, y_train)
#Predicting test data 
y_pred1 = model1.predict(X_test)
#Assessing model performance 
#Accuracy 
model1_accuracy = model1.score(X_test, y_test)
#Recall score
from sklearn.metrics import recall_score
model1_recall = recall_score(y_test, y_pred1)
#Precision 
from sklearn.metrics import precision_score
model1_precision = precision_score(y_test,y_pred1)
#Confusion matrix 
from sklearn.metrics import confusion_matrix
model1_confusion = confusion_matrix(y_test,y_pred1)
#Plotting ROC 
y_pred1_probabilities = model1.predict_proba(X_test)[:,1]
from sklearn.metrics import roc_curve 
fpr, tpr, thresholds = roc_curve(y_test,y_pred1_probabilities)
plt.plot(fpr, tpr)
#AUC
from sklearn.metrics import roc_auc_score
model1_auc =roc_auc_score(y_test,y_pred1_probabilities)
print(model1_accuracy)
print(model1_recall)
print(model1_precision)
print(model1_auc)
print("\nLogistic regression accuracy score: ", model1_accuracy)
#Model 2 : Random Forest Classifier 
from sklearn.ensemble import RandomForestClassifier
#Instantiating model 
model2 = RandomForestClassifier()
#Training model 
model2.fit(X_train,y_train)
#predicting using test values 
y_pred2 = model2.predict(X_test)
#Assessing model performance 
#Accuracy
model2_accuracy = model2.score(X_test,y_test)
#Recall score 
model2_recall = recall_score(y_test, y_pred2)
#Precision
model2_precision = precision_score(y_test,y_pred2)
#Confusion matrix 
y_pred2_probabilities = model2.predict_proba(X_test)[:,1]
model2_confusion = confusion_matrix(y_test, y_pred2)
fpr, tpr, thresholds = roc_curve(y_test,y_pred2_probabilities)
#ROC plot 
plt.plot(fpr,tpr)
#AUC 
model2_auc = roc_auc_score(y_test, y_pred2_probabilities)
print(model2_accuracy)
print(model2_recall)
print(model2_precision)
print(model2_auc)
print("\nRandom Forest accuracy score: ", model2_accuracy)
#Model 3: Decision Tree Classifier 
from sklearn.tree import DecisionTreeClassifier
#Instantiating model 
model3 = DecisionTreeClassifier()
#Fitting model 
model3.fit(X_train, y_train)
#Predictions 
y_pred3 = model3.predict(X_test)
#Assessing model performance 
#Accuracy
model3_accuracy = model3.score(X_test, y_test)
#Recall 
model3_recall = recall_score(y_test, y_pred3)
#Precision
model3_precision = precision_score(y_test,y_pred3)
#Confusion matrix 
y_pred3_probabilities = model3.predict_proba(X_test)[:,1]
model3_confusion = confusion_matrix(y_test, y_pred3)
fpr, tpr, thresholds = roc_curve(y_test,y_pred3_probabilities)
#ROC plot 
plt.plot(fpr,tpr)
#AUC 
model3_auc = roc_auc_score(y_test, y_pred3_probabilities)
print(model3_accuracy)
print(model3_recall)
print(model3_precision)
print(model3_auc)
#Model with highest accuracy: model 1: Logistic Regression 
print("\nDecision tree accuracy score: ", model3_accuracy)
Customer Segmentation with Unsupervised Learning
The second goal of this analysis is to segment customers for targeted marketing. This is achieved using unsupervised machine learning, which helps uncover patterns in the data without using labeled outcomes.
The overall process shares similarities with supervised learning, especially in the data preprocessing stage. However, one key difference is that unsupervised models like clustering algorithms (e.g., K-Means) rely heavily on the shape and distribution of the data. Therefore, an additional transformation step is introduced to make the data more suitable for clustering.
Key Steps:
- Drop unnecessary columns: Just like in supervised learning, non-informative columns like subscriber_id are removed.
- Encode categorical variables: Variables like age_group are encoded into numerical form so the model can interpret them. In the code below, age_group is instead excluded from clustering and used afterwards to profile the resulting clusters.
- Transform numerical variables (if needed): Since clustering algorithms are sensitive to skewed data, numeric features are transformed to reduce skewness and move them closer to a normal distribution. This could involve techniques like log transformation, Box-Cox, or Yeo-Johnson.
- Standardize the features: After transformation, all numeric features are scaled to ensure they contribute equally to the clustering process.
- Apply clustering algorithm: A clustering algorithm like K-Means is applied to segment the customers into distinct groups based on their behavior and characteristics.
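The transform, standardize, and cluster steps listed above can be chained in one scikit-learn pipeline. The sketch below uses a Yeo-Johnson `PowerTransformer`, one of the options named in the list, with `standardize=True` so that the transformer also performs the scaling step. The two synthetic right-skewed features, `n_init=10`, and the `random_state` values are assumptions for illustration, not the project's data or settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Two synthetic right-skewed features standing in for engagement time and frequency.
rng = np.random.default_rng(0)
features = np.column_stack([
    rng.lognormal(mean=1.5, sigma=0.5, size=300),
    rng.lognormal(mean=2.0, sigma=0.7, size=300),
])

# Yeo-Johnson reduces skewness; standardize=True also rescales each
# transformed feature to zero mean and unit variance before clustering.
pipeline = make_pipeline(
    PowerTransformer(method="yeo-johnson", standardize=True),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(features)

# Number of points assigned to each of the three clusters.
print(np.bincount(labels))
```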
Why K-Means? K-Means was chosen for customer segmentation because it's simple, fast, and effective for grouping similar customers based on numerical features like engagement time and frequency. It works well with standardized, continuous data and is widely used in marketing to identify distinct customer profiles. It's easy to interpret and explain to non-technical stakeholders, making it ideal for business applications like targeted marketing. The number of clusters can be optimized using methods like the elbow method, ensuring meaningful segmentation.
K-Means requires a predefined number of clusters, known as K. To choose the optimal K, the elbow method was used. This involves plotting the model's inertia (the within-cluster sum of squared distances) against different values of K. The point where the curve starts to "bend" or form an elbow indicates a suitable number of clusters, balancing goodness of fit with simplicity.
Inertia is a measure of how tightly the data points in each cluster are grouped around the cluster center (centroid). It is the sum of squared distances between each point and its assigned centroid, and it is used in K-Means to evaluate how well the model fits the data. Lower inertia means tighter, more compact clusters, which is desirable. However, inertia always decreases as the number of clusters increases, so the elbow method looks for the point where adding more clusters no longer reduces inertia significantly, which indicates the optimal K.
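To make the definition of inertia concrete, the sketch below recomputes it by hand as the sum of squared distances between each point and its assigned centroid, then compares the result with the `inertia_` attribute of a fitted `KMeans` model. The three synthetic blobs and the `n_init` and `random_state` settings are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs (not the real data).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)

# Look up each point's centroid, then sum the squared distances.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = ((points - assigned_centroids) ** 2).sum()
print(manual_inertia, kmeans.inertia_)

# Inertia for several values of k, as used by the elbow method.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(points).inertia_
    for k in range(1, 7)
]
print(inertias)
```

On data like this, the drop in inertia is steep up to k = 3 and flattens afterwards, which is the elbow shape described above.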
After segmenting the customers, the cluster labels are joined back to the original dataset. This allows for aggregation and analysis of each cluster to identify key characteristics—such as engagement patterns and age group distribution. These insights help define distinct customer segments for more effective, targeted marketing.
#Apply a standard clustering method to the numerical features, then analyze the average values by cluster_id and store them in analysis, rounded to the nearest whole number
#Manually encoding target column
df['subscription_status'] = df['subscription_status'].map({
    'churned': 1,
    'subscribed': 0
})
#Dropping unnecessary columns
segmentation = df.drop(['subscriber_id', 'subscription_status', 'age_group'], axis=1)
#Applying transformation
segmentation['engagement_time'] = np.sqrt(segmentation['engagement_time'])
segmentation['engagement_frequency'] = np.log(segmentation['engagement_frequency']) 
#Standardize the features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(segmentation)
#Elbow method to choose the number of clusters (k)
inertia = []
K = range(1, 10)
for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    inertia.append(kmeans.inertia_)
#Plot the elbow curve
plt.figure(figsize=(8, 4))
plt.plot(K, inertia, 'bo-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()
#Fit KMeans with k = 3
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit_predict(scaled_data)
segmentation['cluster_id'] = kmeans.labels_
#Analyze average values by cluster
analysis = segmentation.groupby('cluster_id').agg({
    'engagement_frequency': 'mean',
    'engagement_time': 'mean'
}).round()
df['cluster_id'] = kmeans.labels_
#Getting more information about each cluster
cluster_info = df.groupby('cluster_id').agg({
    'engagement_frequency': 'median',
    'engagement_time': 'mean',
    'subscription_status': lambda x: x.value_counts(normalize=True).get(1, 0),  # share of churned users (0 to 1)
    'age_group': lambda x: x.value_counts().idxmax()  # most common age group
}).round(2)  # two decimals, so the churn share is not rounded to 0 or 1
print(cluster_info)
Clustering Results
The K-Means clustering algorithm revealed three distinct customer segments based on engagement behavior, which were then profiled by age group and churn rate.
Two clusters (Clusters 0 and 1) consist mostly of younger users (18–34) with moderate to high engagement and a low likelihood of churn.
Cluster 2 includes users aged 35 and over with low engagement and a higher likelihood of churn.
These findings can help AZ Watch tailor its marketing and retention strategies—for example, by re-engaging low-activity users or reinforcing loyalty among highly engaged ones. Segmenting customers this way provides a data-driven foundation for targeted marketing and personalized user experiences.
Final Conclusion
This project demonstrated how machine learning can enhance marketing analytics by providing actionable insights into customer behavior.
Using supervised learning, a logistic regression model was selected for predicting subscriber churn due to its simplicity, interpretability, and solid performance. This model can help AZ Watch identify subscribers at risk of cancellation, allowing the company to take proactive steps to retain them through personalized engagement strategies.
In parallel, unsupervised learning through K-Means clustering segmented the subscriber base into three distinct groups based on engagement patterns, with each group then profiled by age group. These segments provide a foundation for targeted marketing, enabling AZ Watch to optimize campaigns for different customer types, whether that means nurturing high-potential users or reactivating low-engagement segments.
Together, these approaches offer a data-driven framework for improving both customer retention and marketing effectiveness, helping AZ Watch grow its platform in a more strategic and personalized way.