Introduction

Telecom is running a marketing campaign to promote a new data service. The goal is to contact selected customers and offer them this service. Each contact costs 5 units, while a successful sale brings an incremental discounted Customer Lifetime Value (CLV) of 100 units. Due to call center capacity limits, only a quarter of the total customer base can be contacted.

The main business objective is to improve the effectiveness of this campaign and maximize profit by carefully selecting which customers to contact. The profit formula is defined as:

PROFIT(decision₁, decision₂, …, decisionₙ) = ∑ᵢ (100 × purchaseᵢ(decisionᵢ) − 5 × decisionᵢ)

The analytical goal is to support profit maximization by developing predictive models to estimate purchase probabilities for each customer. Based on these probabilities, the profit formula, and capacity constraints, the task is to recommend which customers should be contacted.
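A useful consequence of the profit formula is the break-even purchase probability: contacting a customer is worthwhile in expectation only if 100 × p − 5 > 0, i.e. if the purchase probability p exceeds 0.05. A minimal sketch of this arithmetic (the variable names here are illustrative, not taken from the provided code):

# Break-even purchase probability for a single contact
contact_cost = 5   # cost per contact
clv = 100          # incremental CLV of a successful sale

break_even_proba = contact_cost / clv
print(f'Contacting is profitable in expectation when p > {break_even_proba:.2f}')

# Expected profit at a few example purchase probabilities
for p in [0.03, 0.05, 0.10]:
    print(f'p = {p:.2f}: expected profit = {clv * p - contact_cost:+.1f} units')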

Data description and expected outputs

Telecom provided two datasets:

  • pilot.csv contains results from a randomized campaign (A/B test). Each row represents one customer, with 20 characteristics (V1 to V20). It also includes a treatment column showing if the customer was contacted, and a purchase column showing if the customer bought the service.
  • CustomerBase.csv contains the full customer base with no labels. Based on the analysis, a binary decision must be made for each customer — 1 to contact, 0 to skip.

Data exploration and cleaning

The first step was loading all datasets: pilot.csv, CustomerBase.csv, and decisions.csv (an example of the expected output format). A quick check of the structure and sample rows confirmed that the files were read correctly and the data matched expectations.

Basic checks for missing values and duplicates showed that both datasets were complete, with no missing data or fully duplicated rows.

V2 and V19 were identified as categorical columns, so they were converted to the category type to save memory and improve processing speed.

Next, categories in V2 and V19 were compared between the pilot and customer base datasets. V2 contained no unexpected values, but V19 in the customer base had six categories that were not present in the pilot data. These categories were rare, each making up between 0.001% and 0.034% of the dataset.

Rare categories can cause issues in modeling, especially for tree-based models, which may assign unnecessary importance to categories with very few samples, leading to overfitting. To prevent this, the six rare categories were grouped into a single category called ‘Other’. This helped stabilize the model’s learning process by reducing noise and preventing rare categories from distorting predictions.

This process ensured the datasets were clean, consistent, and optimized for further steps.

# Step 1. Exploring and cleaning datasets

# Import necessary libraries
import pandas as pd

# Load datasets
pilot = pd.read_csv('pilot.csv')
customers = pd.read_csv('CustomerBase.csv')
decisions = pd.read_csv('decisions.csv')

# Quick overview
print('Pilot data:')
print(pilot.head())
print(pilot.info())

print('\nCustomers data:')
print(customers.head())
print(customers.info())

print('\nDecisions example:')
print(decisions.head())
print(decisions.info())

# Check for missing values
missing_pilot = pilot.isnull().sum().sum()
missing_customers = customers.isnull().sum().sum()
print(f'\nNumber of missing values:')
print(f'pilot: {missing_pilot}')
print(f'customers: {missing_customers}')

# Check for duplicates
duplicates_pilot = pilot.duplicated().sum()
duplicates_customers = customers.duplicated().sum()
print(f'\nNumber of fully duplicated rows:')
print(f'pilot: {duplicates_pilot}')
print(f'customers: {duplicates_customers}')

# Convert V2 and V19 to category dtype
for col in ['V2', 'V19']:
    pilot[col] = pilot[col].astype('category')
    customers[col] = customers[col].astype('category')
    
# Compare categories and handle new ones
for col in ['V2', 'V19']:
    print(f'\nUnique values in {col}:')
    print(f'pilot: {sorted(pilot[col].unique())}')
    print(f'customers: {sorted(customers[col].unique())}')
    
    # Find new categories in customers (not present in pilot)
    new_categories = set(customers[col].unique()) - set(pilot[col].unique())
    
    if new_categories:
        print(f'New categories found in {col}: {sorted(new_categories)}')

        # Calculate number and percentages of customers within new categories
        new_category_customers = customers[customers[col].isin(new_categories)]

        print(f'\nDistribution of customers among new categories in {col} (with % of total {len(customers)}):')
        for category in sorted(new_categories):
            count = (new_category_customers[col] == category).sum()
            percent = (count / len(customers)) * 100
            print(f'{category}: {count} customers ({percent:.3f}%)')
    else:
        print(f'No new categories found in {col}.')
        
# Group new categories into 'Other' for each categorical column
# (an explicit loop, so the grouping is not applied only to the last column checked above)
for col in ['V2', 'V19']:
    known_categories = set(pilot[col].unique())
    customers[col] = customers[col].apply(lambda x: x if x in known_categories else 'Other').astype('category')

Diagnosing datasets and defining preprocessing approaches

After cleaning the data, the pilot dataset was examined to detect potential issues that could influence modeling.

First, all columns were checked for constant values. No constant columns were found, meaning every column showed some variation, so no columns were removed at this stage.

Next, numeric columns were tested for high correlation. No pairs of numeric columns showed correlation above the chosen threshold of 0.85. This meant no columns had to be removed due to redundancy.

Outlier analysis was performed for numeric columns. For columns naturally bounded between 0 and 1, outliers were not relevant and were skipped. For the rest, outliers were identified using the 99th percentile cutoff. Outliers were found in several columns, indicating the presence of extreme values that could influence model training.

Skewness was checked for the numeric columns; several features showed high skewness. Heavy skew can be a risk for linear models, whose coefficients are sensitive to extreme values, so these columns were noted for possible transformation later if required by a model.

The effectiveness of the A/B test was also evaluated. About 20% of customers were contacted, and the purchase rate in this group was around 10.2%, compared to 7.9% in the control group. The uplift of about 2.36 percentage points showed that contact helped, but the sizeable purchase rate in the control group meant that contact alone did not explain purchase decisions. This led to the decision to train models on the full pilot dataset, regardless of whether customers were contacted or not, using 'treatment' as one of the features (a quick significance check on the uplift is sketched at the end of the Step 2 code).

This diagnostic step provided important information about the structure and quality of the data. It also highlighted possible preprocessing needs.

# Step 2. Diagnose datasets to define preprocessing approaches

# Constant columns
# Check for columns where all values are the same
constant_columns = [col for col in pilot.columns if pilot[col].nunique() == 1]
if constant_columns:
    print(f'Found constant columns in pilot: {constant_columns}')
    print('These columns provide no useful information and could be dropped.')
else:
    print('No constant columns found in pilot, all columns show some variation.')

# Correlation
# Check for highly correlated numeric columns
numeric_columns = pilot.select_dtypes(include=['number']).columns
correlation_matrix = pilot[numeric_columns].corr()

# Threshold for high correlation
correlation_threshold = 0.85

# Find pairs of numeric columns with correlation above the threshold (each pair checked once)
highly_correlated_pairs = []
for i, col1 in enumerate(numeric_columns):
    for col2 in numeric_columns[i + 1:]:
        if abs(correlation_matrix.loc[col1, col2]) > correlation_threshold:
            highly_correlated_pairs.append((col1, col2, correlation_matrix.loc[col1, col2]))

if highly_correlated_pairs:
    print('\nHighly correlated column pairs detected in pilot:')
    for col1, col2, corr in highly_correlated_pairs:
        print(f'{col1} and {col2} (correlation = {corr:.2f})')
    print('Consider removing one column from each highly correlated pair to avoid redundancy.')
else:
    print('\nNo strongly correlated numeric columns found in pilot.')

# Outliers
# Detect outliers for numeric columns, skipping columns that are bounded between 0 and 1
outlier_report = []

for col in numeric_columns:
    if pilot[col].between(0, 1).all():
        continue

    # Identify the 99th percentile cutoff
    q99 = pilot[col].quantile(0.99)

    # Find all values above the cutoff
    outliers = pilot[col][pilot[col] > q99]

    if not outliers.empty:
        outlier_report.append((col, len(outliers), outliers.max(), q99))

if outlier_report:
    print('\nPotential outliers detected in pilot:')
    for col, count, max_value, q99 in outlier_report:
        print(f'{col}: {count} outliers (max = {max_value:.2f}, 99th percentile = {q99:.2f})')
else:
    print('\nNo extreme outliers detected in pilot.')

# Skewness
# Check for skewness in numeric columns
def interpret_skewness(value):
    if abs(value) < 0.5:
        return 'symmetric'
    elif abs(value) < 1:
        return 'moderate skew'
    else:
        return 'highly skewed'

skewness_report = pilot[numeric_columns].skew()

print('\nSkewness analysis for numeric columns in pilot:')
for col, skew in skewness_report.items():
    interpretation = interpret_skewness(skew)
    print(f'{col}: skewness = {skew:.2f} ({interpretation})')
    
# A/B test effectiveness evaluation
# Treatment distribution (1 = treatment group, 0 = control group)
print('\nTreatment distribution in pilot:')
print(pilot['treatment'].value_counts(normalize=True))

# Buying decision distribution in pilot (1 = purchased, 0 = not purchased)
print('\nBuying decision distribution in pilot:')
print(pilot['purchase'].value_counts(normalize=True))

# Purchase rates by treatment group
purchase_by_treatment = pilot.groupby('treatment')['purchase'].mean()

print('\nPurchase rates by group:')
print(purchase_by_treatment)

# Overall summary by group
print('\nSummary statistics by treatment group:')
print(pilot.groupby('treatment').describe())

# Purchase rate difference
uplift = purchase_by_treatment[1] - purchase_by_treatment[0]
print(f'\nTreatment effect: {uplift:.4f}')
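
# Optional: a quick two-proportion z-test on the uplift (a hedged sketch, assuming
# statsmodels is available; this check was not part of the original pipeline)
from statsmodels.stats.proportion import proportions_ztest

successes = pilot.groupby('treatment')['purchase'].sum()
totals = pilot.groupby('treatment')['purchase'].count()
z_stat, p_value = proportions_ztest([successes[1], successes[0]], [totals[1], totals[0]])
print(f'Two-proportion z-test on uplift: z = {z_stat:.2f}, p-value = {p_value:.4f}')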

Model choice and data preprocessing

Based on the business goal, data structure, and results of exploratory analysis, two models were selected: Logistic Regression and Random Forest.

Logistic Regression was chosen because:

  • It is simple, interpretable, and works well for binary classification problems, and is commonly used in marketing campaigns to predict customer response.
  • The training dataset has only 5000 rows and 20 features, so a simple model could perform well if relationships between features and the target are relatively linear.

Random Forest was chosen because:

  • It naturally handles both numeric and categorical features, even when some features are skewed or contain outliers, which is the case in this dataset.
  • It captures non-linear relationships and interactions between features, which are likely present in customer behavior.

After selecting the models, the pilot dataset was split into training (75%) and test (25%) sets, and stratified sampling was used to ensure that the purchase rate was preserved in both sets. This was done early to avoid data leakage and to simulate a real-world scenario, where the model is applied to unseen data after training.

Since Logistic Regression and Random Forest have different preprocessing needs, separate copies of the dataset were prepared for each model.

For Logistic Regression, preprocessing included:

  • One-hot encoding for categorical features V2 and V19, with columns aligned between train and test to handle cases where a category appears only in one set.
  • Log transformation for the most skewed columns with extreme skewness (V5, V12, V14, V18) to reduce the impact of extreme values.
  • Capping outliers at the 99th percentile in previously identified columns to limit the influence of extreme values.
  • Standard scaling for continuous numeric columns to fit logistic regression’s needs.

As tree-based models do not require scaling or transformations, and are less sensitive to outliers, preprocessing for Random Forest was simpler:

  • One-hot encoding was applied to categorical features V2 and V19, with columns aligned between train and test if needed. This was chosen for consistency with the logistic regression preprocessing pipeline and was sufficient in this case given the small number of categories in both features.

The final result was two preprocessed, ready-to-use train and test sets with 37 features each, prepared specifically for the chosen models.

# Step 3. Preprocessing pipelines

# Import necessary libraries
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Define features and target columns
X = pilot.drop(columns=['purchase'])
y = pilot['purchase']

# Split and stratify to ensure purchase ratio is preserved in both train & test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Verify resulting split
print(f'Train set size: {len(X_train)}')
print(f'Test set size: {len(X_test)}')

print('\nPurchase rate in train:')
print(y_train.value_counts(normalize=True))

print('\nPurchase rate in test:')
print(y_test.value_counts(normalize=True))

# Define types of columns
categorical_columns = ['V2', 'V19']
numeric_columns = [col for col in X_train.columns if col not in categorical_columns]

# Create separate copies for each model type
X_train_logistic = X_train.copy()
X_test_logistic = X_test.copy()

X_train_tree = X_train.copy()
X_test_tree = X_test.copy()

# Logistic Regression model preprocessing
# One-hot encode categorical features (V2, V19)
X_train_logistic = pd.get_dummies(X_train_logistic, columns=['V2', 'V19'], drop_first=True)
X_test_logistic = pd.get_dummies(X_test_logistic, columns=['V2', 'V19'], drop_first=True)

# Align columns if train/test have different categories
X_train_logistic, X_test_logistic = X_train_logistic.align(X_test_logistic, join='outer', axis=1, fill_value=0)

# Log transform highly skewed numeric columns
log_transform_columns = ['V5', 'V12', 'V14', 'V18']

for col in log_transform_columns:
    X_train_logistic[col] = np.log1p(X_train_logistic[col])
    X_test_logistic[col] = np.log1p(X_test_logistic[col])

# Cap numeric outliers at 99th percentile
outlier_columns = ['V1', 'V5', 'V6', 'V7', 'V9', 'V10', 'V11', 'V12', 'V14', 'V15', 'V17', 'V20']

for col in outlier_columns:
    cap = X_train[col].quantile(0.99)
    X_train_logistic[col] = np.clip(X_train_logistic[col], None, cap)
    X_test_logistic[col] = np.clip(X_test_logistic[col], None, cap)

# Scale numeric continuous features
continuous_columns_logistic = X_train_logistic.select_dtypes(include='float64').columns

scaler = StandardScaler()
X_train_logistic[continuous_columns_logistic] = scaler.fit_transform(X_train_logistic[continuous_columns_logistic])
X_test_logistic[continuous_columns_logistic] = scaler.transform(X_test_logistic[continuous_columns_logistic])

# Confirm final shape after preprocessing for Logistic Regression
print(f'\nLogistic regression train set shape: {X_train_logistic.shape}')
print(f'Logistic regression test set shape: {X_test_logistic.shape}')

# Random Forest model preprocessing
# One-hot encode categorical features (V2, V19)
X_train_tree = pd.get_dummies(X_train_tree, columns=['V2', 'V19'], drop_first=True)
X_test_tree = pd.get_dummies(X_test_tree, columns=['V2', 'V19'], drop_first=True)

# Align columns if train/test have different categories
X_train_tree, X_test_tree = X_train_tree.align(X_test_tree, join='outer', axis=1, fill_value=0)

# Confirm final shape after preprocessing for Random Forest model
print(f'\nTree-based train set shape: {X_train_tree.shape}')
print(f'Tree-based test set shape: {X_test_tree.shape}')

Model training and evaluation

Both models were trained on the preprocessed training data and evaluated on the test set.

For evaluation, both models were assessed using:

  • AUC to check how well each model ranks customers by purchase probability, which fits the goal of selecting the best customers for contact.
  • Classification report to check precision, recall, and f1-score, to see how well each model balances finding buyers and avoiding unnecessary contacts.
  • Both models used a 0.3 classification threshold instead of the default 0.5: since the target class (purchase) was rare, lowering the threshold helped capture more positive cases.

Metric                       Logistic Regression   Random Forest
AUC                          0.839                 0.780
Precision (positive class)   0.40                  0.53
Recall (positive class)      0.34                  0.17
Overall accuracy             90%                   92%

Logistic Regression performed better in terms of AUC, showing that it separated buyers from non-buyers more effectively across all thresholds. This suggests that the relationship between features and purchase probability was reasonably linear. The model also balanced precision and recall more evenly for the positive class: its recall was higher than Random Forest’s (34% vs 17%), meaning it caught more buyers, even at the cost of lower precision.

Random Forest achieved higher precision for buyers (53%), but this came at the expense of low recall (only 17%), meaning it missed most actual buyers. This suggests Random Forest struggled to generalize to unseen data and handle the small number of buyers.

The ROC curves show that Logistic Regression separated buyers from non-buyers more cleanly than Random Forest, whose curve sat closer to the random-guess diagonal.

Overall, Logistic Regression was preferred for the final decision process, as its better AUC and recall aligned better with the business goal - maximizing profit by correctly identifying buyers to target, even if some non-buyers are mistakenly contacted.

# Step 4. Train and evaluate models

# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, classification_report, roc_curve, auc
import matplotlib.pyplot as plt

# Models training
# Initialize models
logreg = LogisticRegression(random_state=42, max_iter=500)
ramfor = RandomForestClassifier(random_state=42, n_estimators=100, class_weight='balanced')

# Train Logistic Regression model
logreg.fit(X_train_logistic, y_train)

# Train Random Forest model
ramfor.fit(X_train_tree, y_train)

# Models evaluation
# Predict probabilities
y_test_proba_logistic = logreg.predict_proba(X_test_logistic)[:, 1]
y_test_proba_tree = ramfor.predict_proba(X_test_tree)[:, 1]

# Predict class labels with threshold 0.3
y_test_pred_logistic = (y_test_proba_logistic > 0.3).astype(int)
y_test_pred_tree = (y_test_proba_tree > 0.3).astype(int)

# Evaluate Logistic Regression model
print('Logistic Regression - test set evaluation')
print(f'AUC: {roc_auc_score(y_test, y_test_proba_logistic):.4f}')
print(classification_report(y_test, y_test_pred_logistic))

# Evaluate Random Forest model
print('\nRandom Forest - test set evaluation')
print(f'AUC: {roc_auc_score(y_test, y_test_proba_tree):.4f}')
print(classification_report(y_test, y_test_pred_tree))

# Plot ROC curves for both models
fpr_logreg, tpr_logreg, _ = roc_curve(y_test, y_test_proba_logistic)
fpr_tree, tpr_tree, _ = roc_curve(y_test, y_test_proba_tree)

plt.figure(figsize=(8, 6))
plt.plot(fpr_logreg, tpr_logreg, label=f'Logistic Regression (AUC = {roc_auc_score(y_test, y_test_proba_logistic):.3f})')
plt.plot(fpr_tree, tpr_tree, label=f'Random Forest (AUC = {roc_auc_score(y_test, y_test_proba_tree):.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
plt.xlabel('False positive rate', fontweight='bold')
plt.ylabel('True positive rate', fontweight='bold')
plt.title('ROC curve comparison', fontweight='bold')
plt.legend()
plt.grid(True)
plt.show()

Final model application and decision-making

As the main goal of this analysis was to improve marketing campaign effectiveness by maximizing profit while respecting the call center’s limited capacity, the approach focused on selecting the customers who generate the highest expected profit.

After evaluating both models, the Logistic Regression model was selected for final decision-making because it provided better purchase ranking and better recall, which fit this profit-driven goal.

The full customer base was preprocessed using the same steps applied to the training data to ensure consistency:

  • A synthetic treatment column with value 1 was added to reflect that all customers were considered potential targets, since the goal was to simulate the effect of contacting them.
  • Columns were reordered and aligned with the training set to avoid mismatches.
  • Categorical features were one-hot encoded.
  • The most skewed numeric columns were log-transformed.
  • Outliers were capped at the 99th percentile based on the training data.
  • All numeric continuous columns were scaled using the scaler fitted on the training set.

Once the data was prepared, the model predicted the purchase probability for each customer. These probabilities were used to calculate the expected profit for every customer, using the formula:

expected_profit = 100 * purchase_proba - 5

For example, a customer with a predicted purchase probability of 0.10 has an expected profit of 100 × 0.10 − 5 = 5 units, while one at 0.03 has −2 units. This approach directly links the business goal (profit maximization) with the model outputs (purchase probability).

Reasoning behind final selection process

To make the contact decisions, the distribution of predicted purchase probabilities was analyzed. The 75th percentile became the natural cut-off, selecting the top 25% of customers, which matched the call center’s maximum capacity.

At this threshold, the expected profit showed a clear positive margin, meaning it was reasonable to contact the full 25%. However, this process was designed to stay flexible. If the expected profit at the 75th percentile had been negative or close to zero, the threshold would have been adjusted upwards — reducing the number of contacts and prioritizing profitability over simply filling capacity.

This approach ensured that customer selection was guided not just by purchase probability, but by expected financial impact. In the final recommendation, 23750 customers were selected — meeting the call center’s limit while keeping the decision fully aligned with the profit-first goal.

The decisions were saved in the required format: 20250305_LogisticRegression.csv, containing a single column 'decision' with binary values (1 = contact, 0 = no contact). Final verification confirmed that exactly 23750 customers were selected.

# Step 5. Apply final Logistic Regression model to customer base

# Add synthetic treatment column
customers['treatment'] = 1

# One-hot encode categorical features (V2, V19)
customers = pd.get_dummies(customers, columns=['V2', 'V19'], drop_first=True)

# Align columns with the training set: keep exactly the training columns, in training
# order, filling any dummy column missing in customers with 0. Reindexing avoids the
# pitfall of align(join='inner'), which would also mutate the training frame; dummies
# created only for the customer base (e.g. the grouped 'Other' category) are dropped here
customers = customers.reindex(columns=X_train_logistic.columns, fill_value=0)

# Log transform highly skewed numeric columns
for col in ['V5', 'V12', 'V14', 'V18']:
    customers[col] = np.log1p(customers[col])

# Cap numeric outliers at 99th percentile (based on train data)
outlier_columns = ['V1', 'V5', 'V6', 'V7', 'V9', 'V10', 'V11', 'V12', 'V14', 'V15', 'V17', 'V20']

for col in outlier_columns:
    cap = X_train[col].quantile(0.99)
    customers[col] = np.clip(customers[col], None, cap)

# Scale numeric continuous features
customers[continuous_columns_logistic] = scaler.transform(customers[continuous_columns_logistic])

# Predict purchase probability using Logistic Regression
customers['purchase_proba'] = logreg.predict_proba(customers)[:, 1]

# Calculate expected profit for each customer
customers['expected_profit'] = 100 * customers['purchase_proba'] - 5

# Preliminary decision based on expected profit > 0
# (this marks every profitable customer; the capacity cap below overrides it)
customers['decision'] = (customers['expected_profit'] > 0).astype(int)

# Check distribution of predicted purchase probabilities
print('Distribution of predicted purchase probabilities:')
print(customers['purchase_proba'].describe())

# Sort customers by predicted purchase probability (descending)
customers = customers.sort_values('purchase_proba', ascending=False)

# Define the expected-profit threshold at the top 25% of customers
capacity = int(len(customers) * 0.25)
threshold = customers['expected_profit'].iloc[capacity - 1]
print(f'Expected profit at the capacity cut-off: {threshold:.2f}')

# Adjust decision rule to cap at the top 25% of customers only; selecting the first
# `capacity` rows of the sorted frame keeps the limit exact even if expected profits tie
customers['decision'] = 0
customers.iloc[:capacity, customers.columns.get_loc('decision')] = 1

# Confirm number of positive decisions
print(f"\nNumber of customers to contact: {customers['decision'].sum()} out of {len(customers)} (max capacity = {capacity}).\n")

# Save the final decisions to a CSV file named YYYYMMDD_[NameOfYourModel].csv
final_decision = customers['decision']
final_decision.to_csv('20250305_LogisticRegression.csv', index=False)

# Verify the results in the new CSV file
final = pd.read_csv('20250305_LogisticRegression.csv')
print(final['decision'].value_counts())

Summary

The goal of this analysis was to improve marketing campaign effectiveness by maximizing profit while respecting call center capacity. After testing two models, Logistic Regression was chosen for the final recommendation as it provided better ranking and recall, aligning better with the profit-driven goal.

The final decision process followed a profit-based flexible approach, where the contact threshold was set by balancing expected profit with the call center’s capacity. This ensured that only potentially profitable customers were selected, and the process could easily adjust if the profit threshold changed in future campaigns.

Further improvements to consider

  • Handle skewness more thoroughly by transforming more of the highly skewed columns, not just the most extreme ones, to better support the Logistic Regression model.
  • Tune existing models by adjusting regularization for Logistic Regression and experimenting with tree depth or splits for Random Forest.
  • Optimize hyperparameters using Grid Search or Random Search to find better parameter combinations, especially for Logistic Regression, since it has only a few important hyperparameters (a brief tuning sketch follows this list).
  • Test advanced models like Gradient Boosting or XGBoost, which often perform well on structured data with mixed features.
  • Explore uplift modeling in future campaigns, especially if the business wants to identify persuadable customers — those who only buy if contacted.
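
As mentioned above, a brief tuning sketch for the Logistic Regression model. This assumes the Step 4 objects (X_train_logistic, y_train) are still in scope; the parameter grid is illustrative, not taken from the original analysis:

# Hypothetical hyperparameter search for Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 10],   # inverse regularization strength
    'penalty': ['l2'],         # default penalty supported by the lbfgs solver
}

grid_search = GridSearchCV(
    LogisticRegression(random_state=42, max_iter=500),
    param_grid,
    scoring='roc_auc',         # AUC matches the ranking-based selection goal
    cv=5,
)
grid_search.fit(X_train_logistic, y_train)

print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validated AUC: {grid_search.best_score_:.4f}')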

These steps could help improve model accuracy and the overall effectiveness of future campaigns.