E-commerce businesses rely on explainable artificial intelligence (AI) models to anticipate customer needs, improve inventory planning, personalize marketing campaigns, and, most importantly, explain the results to stakeholders. Understanding which factors drive purchases in a specific product category, such as home decor, helps businesses tailor their strategies to maximize sales. For example, if a company knows that customers who historically spend more on children's accessories will also spend more on home decor going forward, it can target those customers with promotional ads for home decor items and offer bundling discounts to reinforce the pattern.
A major online retailer has enlisted your help for exactly this task. You already have two fitted forecast models (`model.pkl`, `knn_model.pkl`); now you need to explain their results to stakeholders so they can make key business decisions about marketing and budgets.
## Data

Each row in `X_train` represents a snapshot of a customer's features for a specific month, and `y_train` is the customer's sales for the next month in the `home_decor` product category. The data is a modified version of the original dataset, which is publicly available on Kaggle.

### X_train/X_test.csv
| Column | Description |
|---|---|
| `logsales` | Logarithm of (customer sales + 1) (+1 to handle 0 sales) |
| `lag1` | The log of sales from 1 month ago |
| `lag2` | The log of sales from 2 months ago |
| `sma_2m` | Average log sales over the last 2 months (simple moving average) |
| `sma_4m` | Average log sales over the last 4 months (simple moving average) |
| `sma_6m` | Average log sales over the last 6 months (simple moving average) |
| `months_since_first` | Months since first purchase |
| *(per-category columns)* | Category-specific logarithm of (customer sales + 1) |
| *(per-category columns)* | 2-, 4-, and 6-month average log sales per category (simple moving average) |
### y_train/y_test.csv

`nextmonth__home_decor`: logarithm of (customer sales + 1) for the `home_decor` product category in the next month (the prediction target).
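Before explaining the models, it can help to confirm this column layout in the raw files. Below is a minimal sketch, assuming the files live under `data/` as in the loading cell further down; the `X_train_full`/`y_train_full` names are just for this sketch:

```python
import pandas as pd

# Inspect the raw files to confirm the documented layout
X_train_full = pd.read_csv("data/X_train.csv")
y_train_full = pd.read_csv("data/y_train.csv")

print(X_train_full.shape)                  # rows = customer-month snapshots
print(X_train_full.columns.tolist()[:10])  # base features: logsales, lag1, lag2, sma_2m, ...
print(y_train_full["nextmonth__home_decor"].describe())  # log-scale target
```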
## Model

Both forecast models below have been trained on the provided `X_train` and `y_train`.
### model.pkl

- Fitted `sklearn.ensemble.RandomForestRegressor` on `X_train`, `y_train`
### knn_model.pkl

- Fitted `sklearn.neighbors.KNeighborsRegressor` on `X_train`, `y_train`
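As a quick sanity check, you can load both pickles and confirm the estimator classes match the descriptions above. A minimal sketch, assuming the `data/` paths used in the loading cell below:

```python
import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Load the fitted models and verify their classes and key hyperparameters
model = joblib.load("data/model.pkl")
knn_model = joblib.load("data/knn_model.pkl")

assert isinstance(model, RandomForestRegressor)
assert isinstance(knn_model, KNeighborsRegressor)
print(model.get_params().get("n_estimators"), knn_model.get_params().get("n_neighbors"))
```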
## Update to Python 3.10

Due to how frequently the libraries required for this project are updated, you'll need to update your environment to Python 3.10:

- In the workbook, click on "Environment" in the top toolbar and select "Session details".
- In the workbook language dropdown, select "Python 3.10".
- Click "Confirm" and hit "Done" once the session is ready.
# Re-run this cell
# Import required libraries
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
import joblib
import sys
assert (
sys.version_info.major == 3 and sys.version_info.minor == 10
), "Please ensure that you are on Python 3.10."
# Load a sample of the data and the models
X_train = pd.read_csv("data/X_train.csv").sample(500, random_state=42)
X_test = pd.read_csv("data/X_test.csv").sample(500, random_state=42)
# The shared random_state keeps the sampled rows of X and y aligned
y_train = pd.read_csv("data/y_train.csv")["nextmonth__home_decor"].sample(500, random_state=42)
y_test = pd.read_csv("data/y_test.csv")["nextmonth__home_decor"].sample(500, random_state=42)
model = joblib.load("data/model.pkl")
knn_model = joblib.load("data/knn_model.pkl")

# Start coding here
# Use as many cells as you need
import shap

# Record the XAI method used
xai = 'shap'

# Explain the random forest with the tree-specific SHAP explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
print(shap_values.shape)

# Mean absolute SHAP value per feature as a global importance score
mean_shap_values = np.mean(np.abs(shap_values), axis=0)
df_features_shapvalues = pd.DataFrame({'feature': X_test.columns, 'avg shap value': mean_shap_values})
print(df_features_shapvalues)

# Top 5 most influential features for the random forest
top_feats = df_features_shapvalues.sort_values('avg shap value', ascending=False).reset_index(drop=True).iloc[:5]
print(top_feats)

# Smaller sample for the model-agnostic KernelExplainer, which is much slower
X_test2 = pd.read_csv("data/X_test.csv").sample(50, random_state=42)
# Explain the KNN model with the model-agnostic KernelExplainer,
# using k-means centroids of the sample as background data
explainer2 = shap.KernelExplainer(knn_model.predict, data=shap.kmeans(X_test2, 10))
shap_values2 = explainer2.shap_values(X_test2)
print(shap_values2.shape)

mean_shap_values2 = np.mean(np.abs(shap_values2), axis=0)
df_features_shapvalues2 = pd.DataFrame({'feature': X_test2.columns, 'avg shap value': mean_shap_values2})
print(df_features_shapvalues2)

# Top 5 most influential features for the KNN model
top_feats2 = df_features_shapvalues2.sort_values('avg shap value', ascending=False).reset_index(drop=True).iloc[:5]
print(top_feats2)

# Cosine similarity between the two models' top-5 importance magnitudes
consistency = np.round(cosine_similarity([top_feats['avg shap value'].values], [top_feats2['avg shap value'].values])[0][0], 2)
print(consistency)

# Sensitivity check: how do predictions react to a large perturbation of 'lag2'?
print(X_test['lag2'].describe())
y_pred = model.predict(X_test)
print(X_test.head(2))  # the first two sampled rows have index labels 2389 and 3034

# Perturb 'lag2' for these two customers
print(X_test.loc[2389, 'lag2'], X_test.loc[3034, 'lag2'])
X_test.loc[2389, 'lag2'] += 100
X_test.loc[3034, 'lag2'] += 100
print(X_test.loc[2389, 'lag2'], X_test.loc[3034, 'lag2'])

# Re-predict; positions 0 and 1 correspond to the two perturbed rows shown above
y_pred2 = model.predict(X_test)
print(y_pred2[0] - y_pred[0])
print(y_pred2[1] - y_pred[1])

# Verdict from the sensitivity check
reliable = 'yes'