
Project: Building a Demand Forecasting Model

It's simple to buy any product with a click and have it delivered to your door. Online shopping has evolved rapidly over the last few years, making our lives easier. But behind the scenes, e-commerce companies face the complex challenge of matching supply to uncertain demand.

Uncertainty plays a big role in how supply chains plan and organize their operations to ensure that products are delivered on time. This uncertainty can lead to stockouts, delayed deliveries, and increased operational costs.

You work for the Sales & Operations Planning (S&OP) team at a multinational e-commerce company. The team needs your help planning for the upcoming end-of-year sales: your forecasts will guide promotional opportunities and inventory management, so that the right products are in stock when needed and customers receive prompt deliveries to their doorstep.

Data

You are provided with a sales dataset. A summary and preview are provided below.

Online Retail.csv

Column | Description
'InvoiceNo' | A 6-digit number uniquely assigned to each transaction
'StockCode' | A 5-digit number uniquely assigned to each distinct product
'Description' | The product name
'Quantity' | The quantity of each product (item) per transaction
'UnitPrice' | Product price per unit
'CustomerID' | A 5-digit number uniquely assigned to each customer
'Country' | The name of the country where each customer resides
'InvoiceDate' | The day and time when each transaction was generated ("MM/DD/YYYY")
'Year' | The year when each transaction was generated
'Month' | The month when each transaction was generated
'Week' | The week when each transaction was generated (1-52)
'Day' | The day of the month when each transaction was generated (1-31)
'DayOfWeek' | The day of the week when each transaction was generated (0 = Monday, 6 = Sunday)

# Import required libraries
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col, to_date, to_timestamp

# Initialize Spark session
my_spark = SparkSession.builder.appName("SalesForecast").getOrCreate()

# Importing sales data
sales_data = my_spark.read.csv(
    "Online Retail.csv", header=True, inferSchema=True, sep=",")

# Parse InvoiceDate ("d/M/yyyy H:mm") as a timestamp, then truncate it to a date
sales_data = sales_data.withColumn("InvoiceDate", to_date(
    to_timestamp(col("InvoiceDate"), "d/M/yyyy H:mm")))
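
A quick optional check, assuming the load succeeded, is to print the inferred schema and a few parsed rows:

# Optional sanity check: inspect the schema and the parsed dates
sales_data.printSchema()
sales_data.select("InvoiceNo", "StockCode", "Quantity", "InvoiceDate").show(5)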

Aggregate the Data

Group the data into daily sales intervals.

# Create an aggregated dataset
daily_sales_data = sales_data.groupBy(
    "Country", "StockCode", "InvoiceDate", "Year", "Month", "Day", "Week", "DayOfWeek") \
    .agg({"Quantity": "sum"})
daily_sales_data = daily_sales_data.withColumnRenamed("sum(Quantity)", "Quantity")
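
An optional preview helps confirm the aggregation yields one row per country, product, and day:

# Optional: preview the aggregated daily sales
daily_sales_data.orderBy("InvoiceDate", "Country", "StockCode").show(5)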

Split your dataset

You will need to divide your aggregated daily sales data into training and testing sets.

# Split dataset based on the date provided
split_date_train_test = "2011-09-25"

## Split the daily sales data into separate train and test datasets
train_data = daily_sales_data.filter(
    col("InvoiceDate") <= split_date_train_test
)
test_data = daily_sales_data.filter(
    col("InvoiceDate") > split_date_train_test
)

## Transform the train_data into a pandas DataFrame
pd_daily_train_data = train_data.toPandas()
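
An optional check verifies the split boundary; spark_min and spark_max are aliased imports used here only to avoid shadowing Python's built-in min and max:

# Optional: verify the date-based split
from pyspark.sql.functions import min as spark_min, max as spark_max

print("Train rows:", train_data.count(), "Test rows:", test_data.count())
train_data.select(spark_max("InvoiceDate").alias("last_train_date")).show()
test_data.select(spark_min("InvoiceDate").alias("first_test_date")).show()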

Building your regression model

You first need to create indexers for your categorical columns and combine the features. Then, you need to train the Random Forest model.

# Build your Random Forest Regression model
## Encode categorical columns
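## Note: handleInvalid="keep" reserves an extra index for labels that appear
## only in the test data, so transform() won't fail on unseen categories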
country_indexer = StringIndexer(
    inputCol="Country",
    outputCol="CountryIndex"    
).setHandleInvalid("keep")
stock_code_indexer = StringIndexer(
    inputCol="StockCode",
    outputCol="StockCodeIndex"
).setHandleInvalid("keep")

## Combine all selected features into a vector
feature_cols = ["CountryIndex", "StockCodeIndex", "Month", "Year", "DayOfWeek", "Day", "Week"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

## Initializing a Random Forest model
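## maxBins must be at least the number of distinct values of any categorical
## feature; StockCode has several thousand distinct codes, hence the large value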
rf = RandomForestRegressor(
    featuresCol="features",
    labelCol="Quantity",
    maxBins=4000
)

## Create a pipeline to organize the whole modelling process
pipeline = Pipeline(stages=[country_indexer, stock_code_indexer, assembler, rf])

## Fitting pipeline to the train dataset
model = pipeline.fit(train_data)
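
With the pipeline fitted, an optional look at feature importances shows which inputs drive the forest; model.stages[-1] is the trained RandomForestRegressionModel at the end of the pipeline:

## Optional: inspect the feature importances of the trained forest
rf_model = model.stages[-1]
for name, score in zip(feature_cols, rf_model.featureImportances.toArray()):
    print(f"{name}: {score:.3f}")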

Evaluating the model

You will need to make predictions on the test set and evaluate them using the Mean Absolute Error (MAE).

# Transform the test data and make predictions
## Getting test predictions
test_predictions = model.transform(test_data)

## Update test_predictions by converting the prediction column into a double data type
test_predictions = test_predictions.withColumn(
    "Prediction", 
    col("prediction").cast("double")
)

# Calculate MAE
mae_evaluator = RegressionEvaluator(
    labelCol="Quantity",
    predictionCol="Prediction",
    metricName="mae"
)
mae = mae_evaluator.evaluate(test_predictions)
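
Printing the metric makes the result easy to read off:

## Report the evaluation result
print(f"Mean Absolute Error on the test set: {mae:.2f}")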

Identify the quantity sold in a specific week

You need to identify the quantity predicted to be sold during week 39 of 2011 globally.

# Getting weekly prediction sales
## Group predicted sales by year and week and sum them into weekly totals
weekly_test_predictions = test_predictions \
    .groupBy("Year", "Week") \
    .agg({"prediction": "sum"})

## Get the week you are interested in
promotion_week = weekly_test_predictions.filter(col("Week") == 39)

## Get the quantity predicted
quantity_sold_w39 = int(promotion_week.select("sum(prediction)").collect()[0][0])
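
## Print the predicted quantity for the promotion week
print(f"Predicted global quantity for week 39 of 2011: {quantity_sold_w39}")
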
# Stop the Spark session
my_spark.stop()