Practical Exam: Customer Purchase Prediction

RetailTech Solutions is a fast-growing international e-commerce platform operating in over 20 countries across Europe, North America, and Asia. They specialize in fashion, electronics, and home goods, with a unique business model that combines traditional retail with a marketplace for independent sellers.

The company has seen rapid growth. A key part of their success has been their data-driven approach to personalization. However, as they plan their expansion into new markets, they need to improve their ability to predict customer behavior.

Their marketing team wants to predict which customers are most likely to make a purchase based on their browsing behavior.

As an AI Engineer, you will help build this prediction system. Your work will directly impact RetailTech's growth strategy and their goal of increasing revenue.

Data Description

| Column Name | Criteria |
|---|---|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

Task 1

The marketing team has collected customer session data in raw_customer_data.csv, but it contains missing values and inconsistencies that need to be addressed. Create a cleaned version of the dataframe:

  • Start with the data in the file raw_customer_data.csv
  • Your output should be a DataFrame named clean_data
  • All column names and values should match the table below.

| Column Name | Criteria |
|---|---|
| customer_id | Integer. Unique identifier for each customer. No missing values. |
| time_spent | Float. Minutes spent on website per session. Missing values should be replaced with median. |
| pages_viewed | Integer. Number of pages viewed in session. Missing values should be replaced with mean. |
| basket_value | Float. Value of items in basket. Missing values should be replaced with 0. |
| device_type | String. One of: Mobile, Desktop, Tablet. Missing values should be replaced with "Unknown". |
| customer_type | String. One of: New, Returning. Missing values should be replaced with "New". |
| purchase | Binary. Whether customer made a purchase (1) or not (0). Target variable. |

# Write your answer to Task 1 here 
import pandas as pd

# Load raw customer dataset
customer_data = pd.read_csv('raw_customer_data.csv')

# Handle missing values
customer_data['time_spent'] = customer_data['time_spent'].fillna(
    customer_data['time_spent'].median()
)

customer_data['pages_viewed'] = (
    customer_data['pages_viewed']
    .fillna(customer_data['pages_viewed'].mean().round())
    .astype(int)
)

customer_data['basket_value'] = customer_data['basket_value'].fillna(0)

customer_data['device_type'] = customer_data['device_type'].fillna('Unknown')

customer_data['customer_type'] = customer_data['customer_type'].fillna('New')

# Create final cleaned dataset as an independent copy
clean_data = customer_data.copy()

# Display cleaned data
clean_data
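
The imputation rules can be checked on a small example before trusting them on the real file. The sketch below is an illustration only: the toy rows are invented, and `impute_customer_data` is a hypothetical helper that mirrors the steps above rather than part of the exam solution.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw_customer_data.csv (values invented for illustration)
raw = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "time_spent": [10.0, np.nan, 30.0, 20.0],
    "pages_viewed": [2.0, 4.0, np.nan, 7.0],
    "basket_value": [50.0, np.nan, 0.0, 20.0],
    "device_type": ["Mobile", None, "Desktop", "Tablet"],
    "customer_type": ["New", "Returning", None, "New"],
    "purchase": [0, 1, 0, 1],
})


def impute_customer_data(df):
    """Hypothetical helper: apply the criteria-table rules to a copy of df."""
    out = df.copy()
    # Median for time_spent
    out["time_spent"] = out["time_spent"].fillna(out["time_spent"].median())
    # Rounded mean for pages_viewed, then cast so the column is integer
    out["pages_viewed"] = (
        out["pages_viewed"].fillna(round(out["pages_viewed"].mean())).astype(int)
    )
    # Fixed fill values for the remaining columns
    out["basket_value"] = out["basket_value"].fillna(0)
    out["device_type"] = out["device_type"].fillna("Unknown")
    out["customer_type"] = out["customer_type"].fillna("New")
    return out


toy_clean = impute_customer_data(raw)

# No missing values should remain, and pages_viewed should be integer-typed
print(toy_clean.isna().sum().sum())
print(toy_clean.dtypes)
```

Working on a copy keeps the raw frame untouched, so the cleaned result can be compared against the original when a fill value looks suspicious.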

Task 2

The pre-cleaned dataset model_data.csv needs to be prepared for our neural network. Create the model features:

  • Start with the data in the file model_data.csv
  • Scale numerical features (time_spent, pages_viewed, basket_value) to 0-1 range
  • Apply one-hot encoding to the categorical features (device_type, customer_type)
    • The column names should have the following format: variable_name_category_name (e.g., device_type_Desktop)
  • Your output should be a DataFrame named model_feature_set, with all column names from model_data.csv except for the columns where one-hot encoding was applied.
# Write your answer to Task 2 here
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load dataset
model_data = pd.read_csv('model_data.csv')

# Initialize scaler
scaler = MinMaxScaler()

# Apply Min-Max scaling to numeric columns
numeric_cols = ['time_spent', 'pages_viewed', 'basket_value']
model_data[numeric_cols] = scaler.fit_transform(model_data[numeric_cols])

# One-hot encode categorical variables as 0/1 integers
# (recent pandas versions return booleans unless dtype is set)
model_feature_set = pd.get_dummies(
    model_data,
    columns=['device_type', 'customer_type'],
    prefix=['device_type', 'customer_type'],
    dtype=int
)

# Inspect the transformed dataset
model_feature_set.head()
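
To make the two transformations concrete, here is a minimal sketch on invented toy rows. It applies the min-max formula by hand, (x - min) / (max - min), which is what `MinMaxScaler` computes with its default 0-1 range, and shows the column names `pd.get_dummies` produces. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for model_data.csv (values invented for illustration)
toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time_spent": [5.0, 15.0, 25.0],
    "device_type": ["Mobile", "Desktop", "Mobile"],
    "purchase": [0, 1, 0],
})

# Min-max scaling by hand: x_scaled = (x - min) / (max - min)
col = toy["time_spent"]
toy["time_spent"] = (col - col.min()) / (col.max() - col.min())

# One-hot encoding: new columns are named <variable_name>_<category_name>
toy_features = pd.get_dummies(toy, columns=["device_type"], dtype=int)

print(toy_features["time_spent"].tolist())  # [0.0, 0.5, 1.0]
print(toy_features.columns.tolist())
```

The encoded column is dropped and replaced by one indicator column per category, while every other column keeps its original name.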

Task 3

Now that all preparatory work has been done, create and train a neural network that would allow the company to predict purchases.

  • Using PyTorch, create a network with:
    • At least one hidden layer with 8 units
    • ReLU activation for hidden layer
    • Sigmoid activation for the output layer
  • Using the prepared features in input_model_features.csv, train the model to predict purchases.
  • Use the validation dataset validation_features.csv to predict new values based on the trained model.
  • Your model should be named purchase_model and your output should be a DataFrame named validation_predictions with columns customer_id and purchase. The purchase column must be your predicted values.
# Write your answer to Task 3 here
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import numpy as np

# Load training and validation datasets
input_model_features = pd.read_csv('input_model_features.csv')
validation_features = pd.read_csv('validation_features.csv')

# Prepare training and validation tensors
# Use the same feature columns, in the same order, for training and validation
feature_cols = [c for c in input_model_features.columns if c not in ('customer_id', 'purchase')]
X_train = input_model_features[feature_cols].values.astype(np.float32)
y_train = input_model_features['purchase'].values.astype(np.float32)
X_val = validation_features[feature_cols].values.astype(np.float32)

X_train_tensor = torch.tensor(X_train)
y_train_tensor = torch.tensor(y_train).unsqueeze(1)  # Make it (N,1) for BCELoss
X_val_tensor = torch.tensor(X_val)

# Define the neural network model
class PurchaseModel(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_size, 8),  # Hidden layer with 8 units
            nn.ReLU(),
            nn.Linear(8, 1),           # Output layer
            nn.Sigmoid()               # Output probability for binary classification
        )

    def forward(self, x):
        return self.network(x)

# Initialize model, loss function, and optimizer
torch.manual_seed(42)  # Fix the seed so weight initialization is reproducible
purchase_model = PurchaseModel(X_train_tensor.shape[1])
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss
optimizer = optim.Adam(purchase_model.parameters(), lr=0.001)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    purchase_model.train()
    optimizer.zero_grad()
    outputs = purchase_model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()

# Predict on validation data
purchase_model.eval()
with torch.no_grad():
    probabilities = purchase_model(X_val_tensor)
    # Threshold at 0.5 to convert probabilities to integer class labels (0 or 1)
    predicted_purchase = (probabilities >= 0.5).int()

# Create final DataFrame with predictions
validation_predictions = pd.DataFrame({
    'customer_id': validation_features['customer_id'].values,
    'purchase': predicted_purchase.flatten().numpy()
})

validation_predictions
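
For intuition about what the training loop minimizes, the sketch below computes binary cross-entropy by hand with NumPy and then thresholds the probabilities at 0.5. The probabilities and targets are invented, `binary_cross_entropy` is a hypothetical helper, and the `eps` clipping is an assumption made here to avoid log(0); it is not a claim about how `nn.BCELoss` handles that case internally.

```python
import numpy as np


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    # Clip to keep log() finite (an assumption of this sketch)
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1 - eps)
    targets = np.asarray(targets, dtype=np.float64)
    per_sample = targets * np.log(probs) + (1 - targets) * np.log(1 - probs)
    return float(-np.mean(per_sample))


# Invented sigmoid outputs and true labels for four sessions
probs = np.array([0.9, 0.2, 0.6, 0.4])
targets = np.array([1, 0, 1, 0])

loss = binary_cross_entropy(probs, targets)
labels = (probs >= 0.5).astype(int)  # Same 0.5 cut-off used for the predictions

print(round(loss, 4))
print(labels.tolist())  # [1, 0, 1, 0]
```

Confident correct predictions contribute little to the loss, while the two less certain ones (0.6 and 0.4) dominate it, which is why the loss can keep falling even after every thresholded label is already correct.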