Retail Sales Forecasting
Project Overview
In today's data-driven retail environment, accurate sales forecasting is essential for sound inventory management, marketing strategy, and financial planning. This project applies time series forecasting methods to predict future retail sales from historical transaction records. The data consists of daily sales across multiple stores, products, and regions, along with influencing factors such as price, discounts, promotions, weather, and competitor pricing. The analysis aggregates the data to the daily level and then models and forecasts overall sales performance using advanced forecasting techniques, with Facebook Prophet as the primary model because of its ability to capture trend, seasonality, and external regressors. The project yields practical insights to support demand planning, resource allocation, and business decision-making in retail operations.
Problem Statement
Retailers face persistent difficulty in predicting sales patterns driven by seasonal changes, promotional campaigns, pricing policies, and external factors such as weather and competitor activity. Weak forecasting leads to overstocking or stockouts, hurting both profitability and customer satisfaction. This project seeks to answer the following key question:
“Can historical sales data be leveraged to accurately forecast future retail sales and uncover key patterns influencing demand?”
Objectives
- Perform exploratory data analysis to understand sales trends, seasonality, and influencing variables.
- Aggregate and prepare the dataset for time series modeling.
- Develop and compare forecasting models using Facebook Prophet and XGBoost.
- Evaluate model performance using metrics such as MAE, RMSE, and MAPE.
- Provide actionable recommendations for sales planning and operational decision-making.
Data Exploration and Preparation
The dataset comprised 73,100 records across 15 columns, covering daily retail transactions between January 2022 and January 2024. Each entry included store, product, and regional information, alongside operational and external features such as price, discount, weather conditions, promotions, and competitor pricing. Data validation showed:
- No missing or duplicate values
- Consistent datetime range from 2022-01-01 to 2024-01-01
- Zero nulls across all columns after initial checks
- Balanced numeric ranges, with Units Sold varying between 0 and 499
After aggregation, the daily-level dataset contained 731 observations, representing overall sales volume per day.
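As a minimal sketch, the aggregation step looks like the following (the date column name 'Date' is assumed here; the full pipeline in the code section below detects the date column automatically):

# Collapse row-level transactions to one total-sales observation per day,
# using Prophet's expected column names ('ds' for date, 'y' for the target).
daily_sales = (
    df.groupby('Date')['Units Sold'].sum()
      .reset_index()
      .rename(columns={'Date': 'ds', 'Units Sold': 'y'})
)
print(daily_sales.shape)  # expected: (731, 2)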
Exploratory Data Analysis (EDA)
EDA revealed strong weekly and seasonal trends, with notable peaks around holidays and promotional periods.
- Seasonality: Demand fluctuated in identifiable cycles, especially during the Autumn and Summer seasons.
- Promotions and Discounts: Spikes in Units Sold strongly correlated with promotional events and higher discount percentages.
- Competitor Pricing: Showed an inverse relationship with sales; lower competitor prices corresponded to minor dips in demand.
The stability of the daily aggregated sales indicated a predictable demand structure, suitable for time series forecasting.
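These relationships can be spot-checked directly on the row-level data; a minimal sketch, assuming the column names described in the dataset overview:

# Correlation of Units Sold with pricing-related features
print(df[['Units Sold', 'Price', 'Discount', 'Competitor Pricing']].corr()['Units Sold'])

# Average demand with and without an active promotion flag
if 'Holiday/Promotion' in df.columns:
    print(df.groupby('Holiday/Promotion')['Units Sold'].mean())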
Model Approach
Two models were employed and compared:
- XGBoost Regressor
A tree-based gradient boosting model trained on lag features and exogenous regressors (price, discount, and promotions). While capable of capturing nonlinear dependencies, XGBoost lacks inherent seasonality modeling, making it less interpretable for time-dependent data.
- Facebook Prophet
A robust additive time series model capable of capturing:
- Trend – long-term upward or downward movement in sales.
- Seasonality – repeating weekly and yearly cycles.
- Holiday and Promotion Effects – modeled through additional regressors.
Prophet was trained on the aggregated daily sales data, incorporating external features to improve accuracy and interpretability.
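In compact form, the training step looks roughly like this (a sketch assuming the aggregated daily_sales frame with a promo_flag regressor; the full training cell appears in the code section below):

from prophet import Prophet

# Weekly and yearly seasonality plus the promotion flag as an external regressor
m = Prophet(weekly_seasonality=True, yearly_seasonality=True)
m.add_regressor('promo_flag')
m.fit(daily_sales[['ds', 'y', 'promo_flag']])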
Model Evaluation
Both forecasting models, XGBoost and Facebook Prophet, were evaluated using standard time-series performance metrics: MAE, RMSE, MAPE, and sMAPE. The dataset was split chronologically into an 80% training set and a 20% test set to preserve temporal integrity.
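A minimal sketch of the chronological split (no random shuffling, so the test period lies strictly after the training period):

# 80/20 split by position in time, not at random
split = int(len(daily_sales) * 0.8)
train, test = daily_sales.iloc[:split], daily_sales.iloc[split:]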
Evaluation Metrics Used:
- MAE (Mean Absolute Error): Measures average magnitude of errors.
- RMSE (Root Mean Squared Error): Penalizes larger errors more heavily.
- MAPE (Mean Absolute Percentage Error): Indicates percentage deviation.
- sMAPE (Symmetric MAPE): A scale-independent measure for time series.
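For reference, a minimal NumPy sketch of these four metrics on arrays of actuals and predictions (the project's evaluation helper in the code section below is equivalent, with a small epsilon guard against division by zero):

import numpy as np

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    return np.mean(np.abs((y - yhat) / y)) * 100

def smape(y, yhat):
    return np.mean(2 * np.abs(yhat - y) / (np.abs(y) + np.abs(yhat))) * 100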
Results Summary
- Prophet delivered more stable and interpretable forecasts, capturing long-term trends and seasonality.
- XGBoost performed well due to engineered lag and rolling features but showed less interpretability for temporal patterns.
- Based on RMSE comparison, Prophet was selected as the preferred model for final forecasting.
Forecasting Results
Using the best-performing model (Prophet), a 30-day forward forecast was generated. The forecast illustrated:
- Continuation of established weekly sales patterns with predictable weekday peaks.
- Expected demand increases during promotional and seasonal windows.
- Smooth transitions in trend components, indicating a stable and non-volatile sales environment.
These insights support production scheduling, inventory planning, and promotion strategy alignment.
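A sketch of how the 30-day forecast is produced and read off the fitted Prophet model m (see the code section below); yhat_lower and yhat_upper are Prophet's uncertainty interval columns:

# Extend the timeline 30 days past the last observed date and predict.
# (If a regressor such as promo_flag was added, its future values must be filled in.)
future = m.make_future_dataframe(periods=30)
forecast = m.predict(future)

# The last 30 rows are the forward forecast with uncertainty bounds
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(30))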
Insights & Key Findings
The analysis provided several business-relevant insights:
- Strong Weekly Seasonality:
Sales consistently peaked during weekends and dipped mid-week, suggesting strategic timing for staffing, inventory replenishment, and marketing pushes.
- Promotion Impact:
Promotional events were strongly associated with sharp increases in units sold. Incorporating promotion flags significantly improved forecasting accuracy.
- Price Elasticity:
Higher discount levels showed a positive correlation with demand, whereas competitor price reductions resulted in minor demand dips.
- Stable Overall Trend:
The aggregated sales series displayed predictable cyclical behavior, making it suitable for reliable medium-term forecasting.
Recommendations
Based on the forecasting insights:
- Inventory Optimization: Align stock ordering with forecasted weekly peaks and seasonal surges.
- Promotion Planning: Use model predictions to schedule high-impact promotion days during naturally low-demand periods.
- Dynamic Pricing: Monitor competitor pricing closely and adjust strategies accordingly.
- Continuous Model Retraining: Refresh forecasting models monthly or quarterly as new data arrives.
- Granular Modeling: Extend analysis to product-level or store-level forecasting for more precise operational decisions.
# Import Libraries
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from datetime import timedelta
# Time series / stats
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Modeling
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt
# Gradient boosting
from xgboost import XGBRegressor
# Prophet
try:
    from prophet import Prophet
    PROPHET_AVAILABLE = True
except Exception:
    PROPHET_AVAILABLE = False
# Plot settings
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)

# 1. Load dataset & Basic Overview
DATA_PATH = Path("retail_store_inventory.xlsx")
assert DATA_PATH.exists(), f"File not found: {DATA_PATH.resolve()}"
df = pd.read_excel(DATA_PATH)
print("Loaded data shape:", df.shape)
# Preview
display(df.head())
print("\nColumns:", df.columns.tolist())
# Quick info
print("\nDtypes and non-null counts:")
display(df.info())
print("\nBasic numeric summary:")
display(df.describe())

# 2. Data Validation
# Check Date column name and conversion
date_col_candidates = [c for c in df.columns if 'date' in c.lower()]
if not date_col_candidates:
    raise ValueError("No date-like column found. Rename the date column to 'Date'.")
date_col = date_col_candidates[0]
print("Using date column:", date_col)
# Parse dates
df[date_col] = pd.to_datetime(df[date_col], errors='coerce')
print("Date parse - null dates:", df[date_col].isna().sum())
# Check required columns for aggregation-level forecasting
required_cols = ['Units Sold']
for c in required_cols:
    if c not in df.columns:
        raise ValueError(f"Required column '{c}' not found in dataset.")
# Missing values summary
print("\nMissing values per column:")
display(df.isnull().sum().sort_values(ascending=False))
# Duplicates
dups = df.duplicated().sum()
print(f"\nDuplicate rows: {dups}")
# Basic consistency checks (Units Sold >=0)
print("\nUnits Sold min/max:", df['Units Sold'].min(), df['Units Sold'].max())
# 3. Data Cleaning
# Make a working copy
data = df.copy()
# Drop exact duplicates
initial_rows = data.shape[0]
data = data.drop_duplicates()
print(f"Dropped {initial_rows - data.shape[0]} duplicate rows.")
# Fill or drop missing values intelligently
# Strategy:
# - For categorical features fill with mode (where reasonable)
# - For numeric non-target columns, fill with median
# - For critical fields (Date, Units Sold) drop rows with missing values
# Fill some categorical columns if present
categorical_fill_cols = ['Weather Condition', 'Seasonality', 'Region', 'Category']
for col in categorical_fill_cols:
    if col in data.columns:
        data[col] = data[col].fillna(data[col].mode().iloc[0])
# Numeric imputations
numeric_impute_cols = ['Demand Forecast', 'Competitor Pricing', 'Price', 'Discount',
'Inventory Level', 'Units Ordered']
for col in numeric_impute_cols:
    if col in data.columns:
        data[col] = pd.to_numeric(data[col], errors='coerce')
        data[col] = data[col].fillna(data[col].median())
# Drop rows missing essential fields
data = data.dropna(subset=[date_col, 'Units Sold'])
data['Units Sold'] = pd.to_numeric(data['Units Sold'], errors='coerce')
data = data.dropna(subset=['Units Sold'])
# Logical consistency: remove rows where Units Sold < 0 or Units Sold > Inventory (if Inventory exists)
data = data[data['Units Sold'] >= 0]
if 'Inventory Level' in data.columns:
    data = data[data['Units Sold'] <= data['Inventory Level']]
# Final cleaning summary
print("\nAfter cleaning shape:", data.shape)
print("Date range:", data[date_col].min(), "to", data[date_col].max())
print("Any nulls left?:")
display(data.isnull().sum().sort_values(ascending=False).head(20))

# 4. Exploratory Data Analysis (EDA)
# Aggregate to daily total units sold (Option A: overall daily sales)
daily_sales = data.groupby(date_col)['Units Sold'].sum().reset_index().rename(columns={date_col: 'ds', 'Units Sold': 'y'})
daily_sales = daily_sales.sort_values('ds').reset_index(drop=True)
print("\nDaily sales shape:", daily_sales.shape)
display(daily_sales.head())
# Plot time series
plt.figure(figsize=(14,4))
plt.plot(daily_sales['ds'], daily_sales['y'], label='Daily Units Sold')
plt.title('Daily Total Units Sold')
plt.xlabel('Date'); plt.ylabel('Units Sold')
plt.legend(); plt.show()
# Rolling stats
daily_sales['y_ma_7'] = daily_sales['y'].rolling(7, center=True).mean()
daily_sales['y_ma_30'] = daily_sales['y'].rolling(30, center=True).mean()
plt.figure(figsize=(14,4))
plt.plot(daily_sales['ds'], daily_sales['y'], alpha=0.5, label='y')
plt.plot(daily_sales['ds'], daily_sales['y_ma_7'], label='7-day MA')
plt.plot(daily_sales['ds'], daily_sales['y_ma_30'], label='30-day MA')
plt.title('Daily Sales with Moving Averages'); plt.legend(); plt.show()
# Seasonal decomposition (additive)
# require daily series with no large gaps — if many gaps, resample first
ts = daily_sales.set_index('ds')['y'].asfreq('D') # daily frequency
ts_interpolated = ts.ffill()  # forward fill any small gaps (fillna(method=...) is deprecated)
decomp = seasonal_decompose(ts_interpolated, model='additive', period=7) # weekly seasonality
decomp.plot()
plt.suptitle('Seasonal Decomposition (period=7)'); plt.show()
# Stationarity test (ADF)
adf_result = adfuller(ts_interpolated.dropna())
print("ADF Statistic:", adf_result[0])
print("p-value:", adf_result[1])
# If p-value > 0.05, series likely non-stationary (differencing required for ARIMA)
# ACF / PACF
plot_acf(ts_interpolated.dropna(), lags=40); plt.show()
plot_pacf(ts_interpolated.dropna(), lags=40, method='ywm'); plt.show()
# Additional EDA: top categories / regions
if 'Category' in data.columns:
    top_cat = data.groupby('Category')['Units Sold'].sum().sort_values(ascending=False).head(10)
    plt.figure(figsize=(10,4)); sns.barplot(x=top_cat.index, y=top_cat.values); plt.xticks(rotation=45)
    plt.title('Top Categories by Units Sold'); plt.show()
if 'Region' in data.columns:
    top_reg = data.groupby('Region')['Units Sold'].sum().sort_values(ascending=False)
    plt.figure(figsize=(8,4)); sns.barplot(x=top_reg.index, y=top_reg.values); plt.title('Units Sold by Region'); plt.show()

# Feature Engineering for ML models (if used)
# Create time features on the aggregated daily_sales
daily_sales['day_of_week'] = daily_sales['ds'].dt.dayofweek
daily_sales['is_weekend'] = daily_sales['day_of_week'].isin([5,6]).astype(int)
daily_sales['month'] = daily_sales['ds'].dt.month
daily_sales['quarter'] = daily_sales['ds'].dt.quarter
daily_sales['day'] = daily_sales['ds'].dt.day
# Optionally add external regressors aggregated daily: e.g., total discounts per day, holiday flag
# If dataset has 'Holiday/Promotion' and that field is 0/1 per row, aggregate it
if 'Holiday/Promotion' in data.columns:
    promo_daily = data.groupby(date_col)['Holiday/Promotion'].max().reset_index().rename(columns={date_col: 'ds', 'Holiday/Promotion': 'promo_flag'})
    daily_sales = daily_sales.merge(promo_daily, on='ds', how='left')
    daily_sales['promo_flag'] = daily_sales['promo_flag'].fillna(0)

# 6. Modeling Approach
# Prophet Model
if PROPHET_AVAILABLE:
    print("\n--- Prophet modeling ---")
    prophet_df = daily_sales[['ds', 'y']].copy()
    # Data is aggregated per day, so intraday (daily) seasonality does not apply
    m = Prophet(daily_seasonality=False, weekly_seasonality=True, yearly_seasonality=True)
    # Add regressors if available
    if 'promo_flag' in daily_sales.columns:
        m.add_regressor('promo_flag')
        prophet_df = prophet_df.merge(daily_sales[['ds', 'promo_flag']], on='ds', how='left')
    m.fit(prophet_df)
    future = m.make_future_dataframe(periods=30)  # forecast next 30 days
    if 'promo_flag' in prophet_df.columns:
        # conservatively assume no promotion in future
        future = future.merge(daily_sales[['ds', 'promo_flag']], on='ds', how='left').fillna(0)
    forecast = m.predict(future)
    # Plot forecast and components
    fig1 = m.plot(forecast); plt.title("Prophet Forecast (next 30 days)"); plt.show()
    fig2 = m.plot_components(forecast); plt.show()
else:
    print("Prophet not available in this environment. Skipping Prophet section.")

# -- 6C. XGBoost regression with lag features --
# Create lags and rolling features for ML
ml_df = daily_sales[['ds','y']].copy()
ml_df = ml_df.set_index('ds')
# Create lag features (1, 7, 14 days) and rolling means
for lag in [1, 7, 14, 21]:
    ml_df[f'lag_{lag}'] = ml_df['y'].shift(lag)
for window in [7, 14, 30]:
    ml_df[f'roll_mean_{window}'] = ml_df['y'].shift(1).rolling(window).mean()
# Add time features
ml_df['day_of_week'] = ml_df.index.dayofweek
ml_df['month'] = ml_df.index.month
ml_df['is_weekend'] = (ml_df['day_of_week'] >=5).astype(int)
# Add promo_flag if exists
if 'promo_flag' in daily_sales.columns:
    ml_df = ml_df.merge(daily_sales.set_index('ds')[['promo_flag']], left_index=True, right_index=True, how='left')
# Drop rows with NaNs due to lagging
ml_df = ml_df.dropna().reset_index().rename(columns={'index':'ds'})
# Train/test split by time (no random split)
train_size = int(len(ml_df)*0.8)
train_ml = ml_df.iloc[:train_size].copy()
test_ml = ml_df.iloc[train_size:].copy()
X_train = train_ml.drop(columns=['ds','y'])
y_train = train_ml['y']
X_test = test_ml.drop(columns=['ds','y'])
y_test = test_ml['y']
# Fit XGBoost
xgb = XGBRegressor(n_estimators=300, learning_rate=0.05, random_state=42)
xgb.fit(X_train, y_train)
# Predict
y_pred = xgb.predict(X_test)
# 7. Evaluation
def smape(A, F):
    return 100 / len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F) + 1e-8))

def evaluate_forecast(y_true, y_pred, prefix="Model"):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = sqrt(mean_squared_error(y_true, y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / (y_true + 1e-8))) * 100
    sm = smape(y_true, y_pred)
    print(f"\n{prefix} Evaluation:")
    print(f"MAE: {mae:.3f}")
    print(f"RMSE: {rmse:.3f}")
    print(f"MAPE: {mape:.2f}%")
    print(f"sMAPE: {sm:.2f}%")
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "sMAPE": sm}
xgb_metrics = evaluate_forecast(y_test.values, y_pred, prefix="XGBoost")
# If Prophet was run, evaluate Prophet on the test timeframe (if overlap)
if PROPHET_AVAILABLE:
    # get Prophet predictions that correspond to the test period
    prophet_forecast = forecast.set_index('ds').loc[test_ml['ds'], 'yhat'].values
    prophet_metrics = evaluate_forecast(y_test.values, prophet_forecast, prefix="Prophet")
# Plot actual vs predicted (XGBoost)
plt.figure(figsize=(12,5))
plt.plot(test_ml['ds'], y_test.values, label='Actual', marker='o')
plt.plot(test_ml['ds'], y_pred, label='XGBoost Predicted', marker='o')
if PROPHET_AVAILABLE:
    plt.plot(test_ml['ds'], prophet_forecast, label='Prophet Predicted', marker='o')
plt.xlabel('Date'); plt.ylabel('Units Sold'); plt.title('Actual vs Predicted (Test Period)')
plt.legend(); plt.show()
# Residual plot for XGBoost
res = y_test.values - y_pred
plt.figure(figsize=(10,3))
sns.histplot(res, kde=True)
plt.title("Residuals distribution (XGBoost)")
plt.show()

# 8. Forecasting next N days using best model
# Decide best model (choose by RMSE or business preference)
best_model_name = "xgboost"
if PROPHET_AVAILABLE:
    # compare RMSEs: choose the lower RMSE between xgb and prophet
    if prophet_metrics['RMSE'] < xgb_metrics['RMSE']:
        best_model_name = "prophet"
print("Best model chosen:", best_model_name)
# Forecast next 30 days
forecast_horizon = 30
last_date = daily_sales['ds'].max()
if best_model_name == "xgboost":
    # Recursive multi-step forecast: predict one day at a time, then feed the
    # prediction back into the history so the next step's lag/rolling features
    # can use it. (A placeholder that reuses the last observed value would
    # freeze the lags and flatten the forecast.)
    history = daily_sales.set_index('ds')['y'].copy()
    forecast_dates, preds = [], []
    for i in range(1, forecast_horizon + 1):
        next_date = last_date + timedelta(days=i)
        # lag_k for next_date is the value k days back, i.e. history.iloc[-k]
        row = {
            'lag_1': history.iloc[-1],
            'lag_7': history.iloc[-7] if len(history) >= 7 else history.iloc[-1],
            'lag_14': history.iloc[-14] if len(history) >= 14 else history.iloc[-1],
            'lag_21': history.iloc[-21] if len(history) >= 21 else history.iloc[-1],
            # rolling means over the window ending the day before next_date,
            # matching the shift(1).rolling(window) construction used in training
            'roll_mean_7': history.tail(7).mean(),
            'roll_mean_14': history.tail(14).mean(),
            'roll_mean_30': history.tail(30).mean(),
            'day_of_week': next_date.dayofweek,
            'month': next_date.month,
            'is_weekend': int(next_date.dayofweek >= 5),
            'promo_flag': 0,  # no future promo calendar available; assume no promotions
        }
        # Align the feature row with the training matrix (drop extras, zero-fill gaps)
        X_step = pd.DataFrame([row], index=[next_date]).reindex(columns=X_train.columns, fill_value=0)
        y_step = float(xgb.predict(X_step)[0])
        forecast_dates.append(next_date)
        preds.append(y_step)
        # Append the prediction so the next step's lags can use it
        history = pd.concat([history, pd.Series([y_step], index=[next_date])])
    xgb_forecast_df = pd.DataFrame({'ds': forecast_dates, 'yhat': preds})
    display(xgb_forecast_df.head())
    # plot
    plt.figure(figsize=(12,4)); plt.plot(daily_sales['ds'], daily_sales['y'], label='Historical')
    plt.plot(xgb_forecast_df['ds'], xgb_forecast_df['yhat'], label='XGBoost Forecast', marker='o')
    plt.title("XGBoost Forecast (next 30 days)"); plt.legend(); plt.show()
elif best_model_name == "prophet" and PROPHET_AVAILABLE:
    future = m.make_future_dataframe(periods=forecast_horizon)
    if 'promo_flag' in daily_sales.columns:
        future = future.merge(daily_sales[['ds', 'promo_flag']], on='ds', how='left').fillna(0)
    forecast_prophet = m.predict(future)
    future_forecast = forecast_prophet[['ds', 'yhat']].tail(forecast_horizon)
    display(future_forecast)
    m.plot(forecast_prophet); plt.title("Prophet Forecast (next 30 days)"); plt.show()

# 9. Insights & Recommendations
print("\n=== Insights Summary ===")
print("- Top drivers observed: seasonality (weekly), price/discounts, promotions (promo_flag), and competitor pricing.")
print("- Forecasting models provide 30-day estimates. Use these forecasts for inventory planning and promotion timing.")
print("- Recommended: retrain model monthly with new data; incorporate holiday calendar and promotion schedules as regressors.")
print("- Consider building per-store or per-category models for more granular inventory optimization.")