Online shopping decisions depend on how consumers engage with online store content. In this project, I play the role of a data scientist at a startup that has just launched a new online shopping website. The marketing team has asked me to review a dataset of online shoppers' purchasing intentions gathered over the last year. Specifically, the team wants insights into customer browsing behavior during the busiest months for shoppers. I have decided to identify two groups of customers: those with a low purchase rate and returning customers. After identifying these groups, I will estimate the probability that these customers make a purchase during a new marketing campaign, to help gauge the potential success of next year's sales. I will also look for trends and correlations with purchases made.

Data description:

I have been given an online_shopping_session_data.csv file that contains several columns about each shopping session. Each shopping session corresponds to a single user.

| Column | Description |
|---|---|
| SessionID | unique session ID |
| Administrative | number of pages visited related to the customer account |
| Administrative_Duration | total time spent (in seconds) on administrative pages |
| Informational | number of pages visited related to the website and the company |
| Informational_Duration | total time spent (in seconds) on informational pages |
| ProductRelated | number of pages visited related to available products |
| ProductRelated_Duration | total time spent (in seconds) on product-related pages |
| BounceRates | average bounce rate of pages visited by the customer |
| ExitRates | average exit rate of pages visited by the customer |
| PageValues | average page value of pages visited by the customer |
| SpecialDay | closeness of the site visiting time to a specific special day |
| Weekend | indicator of whether the session took place on a weekend |
| Month | month of the session date |
| CustomerType | customer type |
| Purchase | class label indicating whether the customer made a purchase |
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import mannwhitneyu

# Load and view data
shopping_data = pd.read_csv("online_shopping_session_data.csv")
print(shopping_data.shape)
print(shopping_data.info())
display(shopping_data.tail())
# Forward-fill the single null value in the dataset
df = shopping_data.ffill()

# Create total duration column
df['Total_Duration'] = df['ProductRelated_Duration'] + df['Informational_Duration'] + df['Administrative_Duration']

# There are no values for the month of April, yet May counts are particularly high,
# suggesting April sessions were mislabeled as May; reassign half of them to April
may_count = int(df[df['Month']=='May']['Month'].count()/2)

# Select index of May values
may_index = df[df['Month'] == 'May'].index

# Set the NumPy seed and select indices to move
np.random.seed(0)
random_index = np.random.choice(may_index, may_count, replace=False)

# Update DataFrame
df.loc[random_index, 'Month'] = 'Apr'

# Organize Month column
df['Month'] = df['Month'].replace('N', 'Nov')
df['Month'] = df['Month'].replace('June', 'Jun')
months = pd.CategoricalDtype(['Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], ordered=True)
df['Month'] = df['Month'].astype(months)

# View progress
print(df['Month'].value_counts())
# Double-check for missing or incorrect values: each pair of counts below
# should sum to the total number of sessions
print(df['CustomerType'].value_counts())
print(10387 + 1668)
print(df['Weekend'].value_counts())
print(9249 + 2806)
# Organize data for visualization 
graph = df.groupby(['Month', 'Purchase'], as_index=False)\
        [['Total_Duration']].sum().sort_values(by='Month')

# Convert seconds to hours
graph['Total_Duration'] = graph['Total_Duration'] / 3600

# Map for yes and no 
graph['Purchase'] = graph['Purchase'].map({0: 'No', 1: 'Yes'})

# Plot Total Duration by Month
custom_palette = {'Yes': '#1E3A8A', 'No': '#B22222'}
sns.set_style('whitegrid')  # set the style before plotting so it takes effect
sns.pointplot(data=graph, x='Month', y='Total_Duration', hue='Purchase', hue_order=['Yes', 'No'], palette=custom_palette)
plt.ylabel('Total Duration (Hours)')
plt.xlabel('Month')
plt.title('Online Shopping Trends by Month')
plt.show()

From our graph, it is apparent that November and December, the holiday season, are the busiest months of the year for online shopping in our dataset.
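
To back this up numerically, here is a quick check of session counts and total browsing hours per month (a minimal sketch, assuming the df built above and the SessionID column from the data description):

# Quick numeric check: sessions and total browsing hours per month
monthly = df.groupby('Month', observed=True).agg(
    Sessions=('SessionID', 'count'),
    Hours=('Total_Duration', lambda s: s.sum() / 3600)
).round(1)
print(monthly.sort_values('Hours', ascending=False).head())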

# Sort for most popular months, November and December
nov_dec = df[(df['Month'] == 'Nov') | (df['Month'] == 'Dec')]

# Find purchase rate
rates = nov_dec.groupby('CustomerType', as_index=False)[['Purchase']].value_counts(normalize=True)

# Filter for only customers who made a purchase
purchase_rate = rates[rates['Purchase'] > 0][['CustomerType', 'proportion']]

# Rename columns
purchase_rate = purchase_rate.rename(columns={'proportion': 'PurchaseRate'})\
                         .sort_values(by='PurchaseRate').round(4)
# Take a look
display(purchase_rate.head())

# Convert to dictionary
purchase_rates = dict(zip(purchase_rate['CustomerType'], purchase_rate['PurchaseRate']))

# Print results
display(purchase_rates)
# A new campaign for returning customers next holiday season should boost the purchase rate by 15%
# (purchase_rate is sorted ascending, so the first row holds the returning customers' rate)
purchase_rate_returning = purchase_rate['PurchaseRate'].values[0]
print("Current purchase rate:", purchase_rate_returning)
increased_purchase_rate_returning = (1.15 * purchase_rate_returning).round(4)
print("Increased purchase rate:", increased_purchase_rate_returning)

# Probability of at most 100 sales out of 500 customers
cdf = stats.binom.cdf(k=100, n=500, p=increased_purchase_rate_returning)
print(f"Probability of at most 100 sales: {round(cdf * 100, 2)}%")

# Probability of having more than 100 sales
probability = 1 - cdf
print(f"Probability of having more than 100 sales: {round(probability * 100, 2)}%")
# Filter and sort to search for trends
duration = df.groupby(['CustomerType', 'Weekend', 'Purchase'], as_index=False).agg(
    Administrative=('Administrative_Duration', 'mean'),
    Informational=('Informational_Duration', 'mean'),
    ProductRelated=('ProductRelated_Duration', 'mean'),
    TotalDuration=('Total_Duration', 'mean')
).round(2)
display(duration)

# Melt columns together for visualization
duration_melted = duration.melt(id_vars=['CustomerType', 'Weekend', 'Purchase'], \
                                value_vars=['Administrative', 'Informational', 'ProductRelated'], \
                                var_name='DurationType', value_name='AverageDuration')
print(duration_melted)

# Violin plot
plt.figure(figsize=(12, 6))
sns.violinplot(data=duration_melted, x='CustomerType', y='AverageDuration', hue='DurationType', inner='quart')
plt.title('Average Duration by Customer Type')
plt.ylabel('Average Duration (seconds)')
plt.legend(title='Duration Type', loc='upper left')
plt.show()

From our violin plot, it is clear that time spent on product-related pages far outweighs both the administrative and informational durations. Next, I will explore whether the duration spent on product-related pages correlates with more purchases.
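
As a quick first read before any formal testing, a point-biserial correlation (Pearson correlation between a binary and a continuous variable) quantifies the relationship; a minimal sketch, assuming Purchase is encoded 0/1 as in the filtering below:

# Point-biserial correlation between the binary Purchase label and duration
corr, p = stats.pointbiserialr(df['Purchase'], df['ProductRelated_Duration'])
print(f"Point-biserial correlation: {corr:.3f} (p-value: {p:.3g})")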

# Filter for each individual group
purchase = df[df['Purchase']==1][['ProductRelated_Duration']]
no_purchase = df[df['Purchase']==0][['ProductRelated_Duration']]

# Create a 2x1 grid of subplots
fig, axes = plt.subplots(2, 1, figsize=(12, 10))

# Plot distributions
sns.histplot(data=purchase, x='ProductRelated_Duration', ax=axes[0])
axes[0].set_title('Purchase Distribution')
axes[0].set_xlim(0, 15000)

sns.histplot(data=no_purchase, x='ProductRelated_Duration', ax=axes[1])
axes[1].set_title('No Purchase Distribution')
axes[1].set_xlim(0, 15000)
axes[1].set_ylim(0, 100)
plt.tight_layout()
plt.show()

The data in this representation is skewed to the right, with a long tail of high durations, meaning that the mean is greater than the median. Since the distribution does not follow a normal distribution, I will perform the Mann-Whitney U test to determine whether users who purchase spend significantly more time on product-related pages than those who do not.
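
A quick look at the summary statistics confirms the direction of the skew (assuming the purchase and no_purchase frames defined above):

# Right-skewed data: mean exceeds median and skewness is positive
for name, frame in [('Purchase', purchase), ('No purchase', no_purchase)]:
    col = frame['ProductRelated_Duration']
    print(f"{name}: mean={col.mean():.1f}, median={col.median():.1f}, skew={col.skew():.2f}")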

Null Hypothesis: There is no significant difference in the time spent on product-related pages between users who make a purchase and those who do not.

Alternative Hypothesis: Users who make a purchase spend significantly more time on product-related pages than those who do not.

I will use a 5% significance level (equivalently, a 95% confidence level) and remove outliers to reduce the influence of extreme values.

# Function to remove outliers
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Clean dataframes
purchase_clean = remove_outliers(purchase, 'ProductRelated_Duration')
no_purchase_clean = remove_outliers(no_purchase, 'ProductRelated_Duration')
# State significance level
alpha = 0.05

# Purchase vs. no purchase; one-sided test to match the alternative hypothesis
u_stat, p_value = mannwhitneyu(purchase_clean['ProductRelated_Duration'],
                               no_purchase_clean['ProductRelated_Duration'],
                               alternative='greater')
print(f"U statistic: {u_stat}, p-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis: purchasers spend significantly more time on product-related pages.")
else:
    print("Fail to reject the null hypothesis.")