
Online shopping has evolved rapidly over the last few years, making it simple to buy almost any product with a click and have it delivered to your door. But behind the scenes, e-commerce companies face a complex challenge that must be addressed.

Uncertainty plays a big role in how supply chains plan and organize their operations to ensure that products are delivered on time. These uncertainties can lead to stockouts, delayed deliveries, and increased operational costs.

You work for the Sales & Operations Planning (S&OP) team at a multinational e-commerce company. They need your help planning for the upcoming end-of-year sales. They want to use your insights to identify promotional opportunities and manage their inventory, ensuring they have the right products in stock when needed and that customers are satisfied with prompt delivery to their doorstep.

The Data

Online Retail.csv

| Column | Description |
|--------|-------------|
| 'InvoiceNo' | A 6-digit number uniquely assigned to each transaction |
| 'StockCode' | A 5-digit number uniquely assigned to each distinct product |
| 'Description' | The product name |
| 'Quantity' | The quantity of each product (item) per transaction |
| 'UnitPrice' | Product price per unit |
| 'CustomerID' | A 5-digit number uniquely assigned to each customer |
| 'Country' | The name of the country where each customer resides |
| 'InvoiceDate' | The day and time when each transaction was generated ("MM/DD/YYYY") |
| 'Year' | The year when each transaction was generated |
| 'Month' | The month when each transaction was generated |
| 'Week' | The week when each transaction was generated (1-52) |
| 'Day' | The day of the month when each transaction was generated (1-31) |
| 'DayOfWeek' | The day of the week when each transaction was generated (0 = Monday, 6 = Sunday) |
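When loading this file, an alternative to schema inference is to declare the schema explicitly from the data dictionary above. The sketch below is one hedged way to do this; the types are assumptions drawn from the column descriptions, and InvoiceDate is kept as a string so we can parse it ourselves later.

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType)

# A possible explicit schema for Online Retail.csv (an alternative to
# inferSchema=True); the types below are assumptions based on the data dictionary
retail_schema = StructType([
    StructField("InvoiceNo", StringType()),    # 6-digit transaction number
    StructField("StockCode", StringType()),    # 5-digit product code
    StructField("Description", StringType()),  # product name
    StructField("Quantity", IntegerType()),    # items per transaction
    StructField("UnitPrice", DoubleType()),    # price per unit
    StructField("CustomerID", IntegerType()),  # 5-digit customer number
    StructField("Country", StringType()),      # customer country
    StructField("InvoiceDate", StringType()),  # date and time, parsed later
    StructField("Year", IntegerType()),
    StructField("Month", IntegerType()),
    StructField("Week", IntegerType()),
    StructField("Day", IntegerType()),
    StructField("DayOfWeek", IntegerType()),
])
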
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, isnan, when
from pyspark.sql.functions import (dayofmonth, month, year, to_date,
                                   to_timestamp, weekofyear, dayofweek)
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
# Initialize Spark session
my_spark = SparkSession.builder.appName("SalesForecast").getOrCreate()

# Importing sales data
sales_data = my_spark.read.csv("Online Retail.csv", 
                               header=True, 
                               inferSchema=True, 
                               sep=",")

Exploratory Data Analysis

With the dataset loaded, let's begin by taking a look at the first five rows.

sales_data.show(5)

By examining these initial rows, we can infer the data types of each column—or at least what they should be. Let's now ensure that every column has the appropriate data type by inspecting the schema.

sales_data.printSchema()

From the schema we can conclude that all columns have an appropriate data type except for InvoiceDate, which still carries a time component we don't need. Since only the date matters in this case, we can proceed to convert it to a date type, which also helps reduce complexity and simplify data aggregation.

# Convert InvoiceDate to a date type, dropping the time component
sales_data = sales_data.withColumn("InvoiceDate", to_date(to_timestamp(col("InvoiceDate"), "d/M/yyyy H:mm")))

sales_data.show(5, truncate=12)
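
To confirm that the conversion took effect, we can inspect the schema again; InvoiceDate should now be reported as a date type.

# Verify that InvoiceDate is now a date
sales_data.printSchema()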

With the correct data types in place, we can explore other aspects of the dataset: its dimensions (the number of rows and columns), the presence of null values in each column, and some descriptive statistics.

print(f"sales_data dimensions: ({sales_data.count()}, {len(sales_data.columns)})")
# Count null values in each column
for c in sales_data.columns:
    nas = sales_data.filter(f"{c} IS NULL").count()
    print(f"{nas} null values in column {c}")

Regarding the descriptive statistics of the columns, we can divide the analysis into two parts, the target variable and the features, keeping in mind the regression model that will be implemented.
Starting with the target variable, since it is numeric, we can extract statistics such as the count, mean, standard deviation, minimum, and maximum.

# Examine the target column
sales_data.select('Quantity').describe().show()
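
Note that describe() does not report the median; we can approximate it separately with approxQuantile(), where the last argument is the allowed relative error (0.0 would compute the exact median at a higher cost).

# Approximate the median of Quantity with a relative error of 0.01
median_quantity = sales_data.approxQuantile("Quantity", [0.5], 0.01)[0]
print(f"Approximate median Quantity: {median_quantity}")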