
AZ Watch is a popular video streaming platform specializing in educational content, where creators publish online video tutorials and lessons on any topic, from speaking a new language to cooking to playing a musical instrument.

Their next goal is to use AI-driven solutions to analyze and make predictions about their subscribers, and to improve their marketing strategy for attracting new subscribers and retaining current ones. This project uses machine learning to predict which subscribers are likely to churn and to find customer segments. This may help AZ Watch find interesting usage patterns to build subscriber personas in future marketing plans!

The data/AZWatch_subscribers.csv dataset contains information about subscribers and their status over the last year:

| Column name | Description |
| --- | --- |
| subscriber_id | The unique identifier of each subscriber |
| age_group | The subscriber's age group |
| engagement_time | Average time (in minutes) spent by the subscriber per session |
| engagement_frequency | Average weekly number of times the subscriber logged in to the platform (sessions) over a one-year period |
| subscription_status | Whether the user remained subscribed to the platform by the end of the year (subscribed), or unsubscribed and terminated their service (churned) |

Carefully observe and analyze the features in the dataset, and ask yourself whether any categorical attributes require pre-processing.
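One way to answer that question is to list the non-numeric columns and one-hot encode them. The snippet below is a minimal sketch on a made-up toy frame; the age group labels and values are hypothetical and only mimic the column names from the table above.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's columns (all values are made up)
toy = pd.DataFrame({
    "age_group": ["18-34", "35 and over", "under 18", "18-34"],
    "engagement_time": [5.5, 3.2, 8.1, 6.0],
    "engagement_frequency": [10, 4, 12, 7],
})

# Non-numeric columns are the candidates for encoding
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One-hot encode them; drop_first removes one redundant dummy column per feature
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True)

print(categorical_cols)
print(encoded.columns.tolist())
```

On the real data, the same two calls could be applied to the predictor frame instead of `toy`.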

The subscribers dataset in data/AZWatch_subscribers.csv is already loaded and split into training and test sets for you:

# Import the necessary modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.cluster import KMeans
import seaborn as sns
from matplotlib import pyplot as plt

# Specify the file path of your CSV file
file_path = "data/AZWatch_subscribers.csv"

# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Separate predictor variables from class label
X = df.drop(['subscriber_id','subscription_status'], axis=1)
y = df.subscription_status

# Split into training and test sets (20% test)
X_train, X_test, y_train, y_test = train_test_split(
                        X, y, test_size=.2, random_state=42)
# Start your code here! Use as many cells as you like!
X_train.info()
X_train['age_group'].value_counts()
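def create_dummies(df, col):
    """Return a copy of df with col replaced by one-hot (dummy) columns."""
    dummies = pd.get_dummies(df[col], drop_first=True)
    return pd.concat([df.drop(col, axis=1), dummies], axis=1)
# Encode age_group separately in the training and test sets
X_train = create_dummies(X_train, 'age_group')
X_test = create_dummies(X_test, 'age_group')
# Encoding each set on its own only works if both end up with the same columns
assert list(X_train.columns) == list(X_test.columns)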
X_train.head()
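# Train three candidate classifiers (fixed seeds keep the tree-based results reproducible)
model1 = LogisticRegression(max_iter=1000)
model2 = DecisionTreeClassifier(random_state=42)
model3 = RandomForestClassifier(random_state=42)
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)
# Accuracy of each model on the held-out test set
model1_score = model1.score(X_test, y_test)
model2_score = model2.score(X_test, y_test)
model3_score = model3.score(X_test, y_test)
print(f'LogisticRegression score: {model1_score:.3f}')
print(f'DecisionTreeClassifier score: {model2_score:.3f}')
print(f'RandomForestClassifier score: {model3_score:.3f}')
# Report the logistic regression accuracy as the final score
score = model1_score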
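Accuracy alone does not show how many churners a model misses. A confusion matrix breaks the predictions down by class; the sketch below uses small made-up label lists (hypothetical, not taken from the dataset) so it runs on its own.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (made up for illustration)
y_true = ["churned", "subscribed", "churned", "subscribed", "subscribed", "churned"]
y_pred = ["churned", "subscribed", "subscribed", "subscribed", "churned", "churned"]

# Fix the label order so rows and columns are unambiguous
labels = ["churned", "subscribed"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true classes, columns are predicted classes
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{name}" for name in labels],
    columns=[f"pred_{name}" for name in labels],
)
print(cm_df)

# Share of actual churners that the predictions caught (recall for "churned")
churn_recall = cm[0, 0] / cm[0].sum()
print(round(churn_recall, 2))
```

With the notebook's variables, `y_true` would be `y_test` and `y_pred` would be the output of `model1.predict(X_test)`.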
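# Segment on the numeric engagement features only
segmentation = X.drop('age_group', axis=1)
# Standardize so both features contribute equally to the distance metric
scaler = StandardScaler()
segmentation_scaled = scaler.fit_transform(segmentation)
# Record the inertia (within-cluster sum of squares) for k = 1..10
sse = {}
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=5)
    kmeans.fit(segmentation_scaled)
    sse[k] = kmeans.inertia_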
plt.plot(list(sse.keys()), list(sse.values()), marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
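The elbow in an inertia curve can be hard to read by eye. As a complementary check that is not part of the original notebook, the silhouette score can be compared across values of k. The sketch below runs on synthetic, well-separated points (hypothetical data generated with a fixed seed), not on the subscriber data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs (hypothetical, for illustration only)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
points = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# The silhouette score needs at least 2 clusters, so start at k = 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Higher is better: pick the k with the largest average silhouette
best_k = max(scores, key=scores.get)
print(best_k)
```

On the notebook's data, `points` would be replaced by `segmentation_scaled`; the two criteria do not always agree, so the result is best treated as one more hint rather than a verdict.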
# Fit the final model with 4 clusters and label every subscriber
kmeans = KMeans(n_clusters=4, n_init=10, random_state=5)
segmentation['cluster_id'] = kmeans.fit_predict(segmentation_scaled)
# Average engagement per cluster, rounded to whole numbers
analysis = segmentation.groupby('cluster_id')[['engagement_time', 'engagement_frequency']].mean().round(0)
analysis
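Because the clustering runs on standardized features, the fitted centroids are in scaled units. A minimal sketch of mapping them back to minutes and weekly sessions, and of adding cluster sizes, is shown below on made-up engagement values (hypothetical data with two obvious groups).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical engagement data (made up): two clearly separated groups
toy = pd.DataFrame({
    "engagement_time": [2.0, 3.0, 2.5, 9.0, 10.0, 9.5],
    "engagement_frequency": [1, 2, 1, 14, 15, 16],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Undo the scaling so centroids read in the original units
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=toy.columns,
)

# Number of rows assigned to each cluster, ordered by cluster id
centers["size"] = pd.Series(kmeans.labels_).value_counts().sort_index().values
print(centers.round(1))
```

For k-means, each centroid is the mean of its members, so the unscaled centroids should match the per-cluster means from a groupby; the size column adds how many subscribers each persona would cover.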