Supervised learning for revenue forecasting
International Essentials is an international supermarket chain.
Shoppers at their supermarkets can sign up for a loyalty program that provides rewards each year to customers based on their spending. The more you spend the bigger the rewards.
The supermarket would like to be able to predict the likely amount customers in the program will spend, so they can estimate the cost of the rewards.
This will help them to predict the likely profit at the end of the year.
The dataset contains records of customers for their last full year of the loyalty program.
At least one of the two models must have a Root Mean Squared Error below 0.35 to pass.
Task 1
The table below shows what the data should look like.
Create a cleaned version of the dataframe.
- You should start with the data in the file "loyalty.csv".
- Your output should be a dataframe named
clean_data. - All column names and values should match the table below.
| Column Name | Criteria |
|---|---|
| customer_id | Unique identifier for the customer. Missing values are not possible due to the database structure. |
| spend | Continuous. The total spend of the customer in their last full year. This can be any positive value to two decimal places. Missing values should be replaced with 0. |
| first_month | Continuous. The amount spent by the customer in their first month of the year. This can be any positive value, rounded to two decimal places. Missing values should be replaced with 0. |
| items_in_first_month | Discrete. The number of items purchased in the first month. Any integer value greater than or equal to zero. Missing values should be replaced by 0. |
| region | Nominal. The geographic region that the customer is based in. One of four values Americas, Asia/Pacific, Europe, Middle East/Africa. Missing values should be replaced with "Unknown". |
| loyalty_years | Oridinal. The number of years the customer has been a part of the loyalty program. One of five ordered categories, '0-1', '1-3', '3-5', '5-10', '10+'. Missing values should be replaced with '0-1'. |
| joining_month | Nominal. The month the customer joined the loyalty program. One of 12 values "Jan", "Feb", "Mar", "Apr", etc. Missing values should be replaced with "Unknown". |
| promotion | Nominal. Did the customer join the loyalty program as part of a promotion? Either 'Yes' or 'No'. Missing values should be replaced with 'No'. |
# Import necessary libraries
import pandas as pd
import numpy as np
# Load dataset
data = pd.read_csv('loyalty.csv')
# Handle missing values
data['joining_month'] = data['joining_month'].fillna('Unknown')
# Address inconsistency in categorical values
data['promotion'] = data['promotion'].str.capitalize()
# Check for duplicates
data.duplicated().sum()
# Find non-numeric values in 'first_month'
non_numeric_values = data[pd.to_numeric(data['first_month'], errors='coerce').isna()]['first_month'].unique()
# Replace '.' with NaN
data['first_month'] = data['first_month'].replace('.', np.nan)
# Handle missing values
data['first_month'] = data['first_month'].fillna(0)
# Convert to float
data['first_month'] = data['first_month'].astype(float)
# Convert all object columns to category dtype
data = data.apply(lambda x: x.astype('category') if x.dtype == 'object' else x)
# Store ordinal categorical data as ordered categories in natural order
data['loyalty_years'] = data['loyalty_years'].cat.set_categories(
['0-1', '1-3', '3-5', '5-10', '10+'], ordered=True)
# Save cleaned dataset
clean_data = data.copy()
clean_data.head()Task 2
The team at International Essentials have told you that they have always believed that the number of years in the loyalty scheme is the biggest driver of spend.
Producing a table showing the difference in the average spend by number of years in the loyalty programme along with the variance to investigate this question for the team.
-
You should start with the data in the file 'loyalty.csv'.
-
Your output should be a data frame named
spend_by_years. -
It should include the three columns
loyalty_years,avg_spend,var_spend. -
Your answers should be rounded to 2 decimal places.
# Load dataset
df = pd.read_csv('loyalty.csv')
# Produce a new dataframe with average spend by number of years along with the variance
spend_by_years = df.groupby('loyalty_years')['spend'].agg(
avg_spend=np.mean,
var_spend=np.var
).round(2).reset_index()
spend_by_yearsTask 3
Fit a baseline model to predict the spend over the year for each customer.
- Fit your model using the data contained in “train.csv”
- Use “test.csv” to predict new values based on your model. You must return a dataframe named
base_result, that includescustomer_idandspend. Thespendcolumn must be your predicted values.
# 1. Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
# 2. Load datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# 3. Data preprocessing
# Identify categorical columns
categorical_cols = ['region', 'loyalty_years', 'joining_month', 'promotion']
# Preprocessing pipeline for One-Hot Encoding
preprocessor = ColumnTransformer(
transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)],
remainder='passthrough' # Keep numerical columns as-is
)
# 4. Model pipeline (preprocessing + Linear Regression)
pipeline = Pipeline([
('preprocessor', preprocessor),
('regressor', LinearRegression())
])
# Define features (X) and target (y)
X_train = train.drop(['customer_id', 'spend'], axis=1)
y_train = train['spend']
X_test = test.drop(['customer_id'], axis=1)
# 5. Cross-Validation (5-Fold)
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')
# 6. Model training & predictions
pipeline.fit(X_train, y_train)
predictions = np.maximum(0, pipeline.predict(X_test)) # Ensure no negative spend
# 7. RMSE calculation (training set)
train_rmse = np.sqrt(mean_squared_error(y_train, pipeline.predict(X_train)))
# 8. Prepare submission
base_result = pd.DataFrame({
'customer_id': test['customer_id'],
'spend': np.round(predictions, 2)
})
# 9. Display results
print(base_result.head())
print(f'\nCross-validated RMSE: {abs(cv_scores.mean()):.2f}')
print(f'Root Mean Squared Error (Training Set): {train_rmse:.2f}')
# 10. RMSE threshold check
if train_rmse < 0.35:
print('Model passed with RMSE below 0.35!')
else:
print('Model did NOT pass. RMSE is above 0.35.')Task 4
Fit a comparison model to predict the spend over the year for each customer.
- Fit your model using the data contained in “train.csv”
- Use “test.csv” to predict new values based on your model. You must return a dataframe named
compare_result, that includescustomer_idandspend. Thespendcolumn must be your predicted values.
# 1. Import necessary libraries
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
# 2. Load datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# 3. Feature engineering
# Spending ratio
for df in [train, test]:
df['spending_ratio'] = df['first_month'] * df['items_in_first_month']
# Cyclical encoding for joining_month
month_map = {month: idx for idx, month in enumerate(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}
for df in [train, test]:
df['joining_month_num'] = df['joining_month'].map(month_map)
df['month_sin'] = np.sin(2 * np.pi * df['joining_month_num'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['joining_month_num'] / 12)
# 4. Data preprocessing
categorical_cols = ['region', 'loyalty_years', 'promotion']
numerical_cols = ['first_month', 'items_in_first_month', 'spending_ratio', 'month_sin', 'month_cos']
# Preprocessing pipeline
preprocessor = ColumnTransformer([
('num', StandardScaler(), numerical_cols),
('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
])
# 5. Model training with Gradient Boosting
gbr = GradientBoostingRegressor(random_state=42)
# Hyperparameter tuning
param_grid = {
'regressor__n_estimators': [100, 200],
'regressor__learning_rate': [0.05, 0.1],
'regressor__max_depth': [3, 5]
}
# Pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('regressor', gbr)
])
# Prepare training data
X_train = train.drop(columns=['customer_id', 'spend', 'joining_month', 'joining_month_num'])
y_train = train['spend']
# GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)
# Best parameters
print('Best Parameters:', grid_search.best_params_)
# 6. Predictions
X_test = test.drop(columns=['customer_id', 'joining_month', 'joining_month_num'])
predictions = np.maximum(0, grid_search.predict(X_test)) # Ensure no negative spend
# 7. RMSE calculation
train_rmse = np.sqrt(mean_squared_error(y_train, grid_search.predict(X_train)))
# 8. Prepare submission
compare_result = pd.DataFrame({
'customer_id': test['customer_id'],
'spend': np.round(predictions, 2)
})
# 9. Display results
print(compare_result.head())
print(f'\nRoot Mean Squared Error (Training Set): {train_rmse:.2f}')
# 10. RMSE threshold check
if train_rmse < 0.35:
print('Model passed with RMSE below 0.35!')
else:
print('Model did NOT pass. RMSE is above 0.35.')