
Online shopping decisions rely on how consumers engage with online store content. You work for a new startup company that has just launched a new online shopping website. The marketing team asks you, a new data scientist, to review a dataset of online shoppers' purchasing intentions gathered over the last year. Specifically, the team wants you to generate some insights into customer browsing behaviors in November and December, the busiest months for shoppers. You have decided to identify two groups of customers: those with a low purchase rate and returning customers. After identifying these groups, you want to determine the probability that any of these customers will make a purchase in a new marketing campaign to help gauge potential success for next year's sales.

Data description:

You are given a file, online_shopping_session_data.csv, that contains several columns describing each shopping session. Each shopping session corresponds to a single user.

| Column | Description |
| --- | --- |
| SessionID | unique session ID |
| Administrative | number of pages visited related to the customer account |
| Administrative_Duration | total amount of time spent (in seconds) on administrative pages |
| Informational | number of pages visited related to the website and the company |
| Informational_Duration | total amount of time spent (in seconds) on informational pages |
| ProductRelated | number of pages visited related to available products |
| ProductRelated_Duration | total amount of time spent (in seconds) on product-related pages |
| BounceRates | average bounce rate of pages visited by the customer |
| ExitRates | average exit rate of pages visited by the customer |
| PageValues | average page value of pages visited by the customer |
| SpecialDay | closeness of the site visiting time to a specific special day |
| Weekend | indicator of whether the session is on a weekend |
| Month | month of the session date |
| CustomerType | customer type |
| Purchase | class label indicating whether the customer made a purchase |

Purchase Rates for November and December

What are the purchase rates for online shopping sessions by customer type for November and December?

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Load and view your data
shopping_data = pd.read_csv("online_shopping_session_data.csv")
shopping_data.head()
# Check which months appear in the data and how many sessions each one has
shopping_data['Month'].value_counts()
# Keep only the months of interest (November and December),
# then compare row counts to see what share of the data they represent
shopping_data_nov_december = shopping_data[ shopping_data['Month'].isin(['Nov', 'Dec'])]

print(f"all months: {shopping_data.shape[0]:,}")
print(f"only november and december: {shopping_data_nov_december.shape[0]:,}")
#Grouping data to calculate purchase rate for the different groups
Purchase_rate_grouped_df = shopping_data_nov_december.groupby(['CustomerType'])['Purchase'].agg(
    Total_purchase = 'sum',
    Total_sessions = 'count'
).reset_index()


Purchase_rate_grouped_df
#Calculating the purchase rate for the groups
Purchase_rate_grouped_df['Purchase_rate'] = Purchase_rate_grouped_df['Total_purchase']/Purchase_rate_grouped_df['Total_sessions']

Purchase_rate_grouped_df
# Extract the purchase rate of each customer type to store in the dictionary
p_rate_ret_customer = Purchase_rate_grouped_df.loc[
    Purchase_rate_grouped_df['CustomerType'] == "Returning_Customer", 'Purchase_rate'
].values[0]

p_rate_new_customer = Purchase_rate_grouped_df.loc[
    Purchase_rate_grouped_df['CustomerType'] == "New_Customer", 'Purchase_rate'
].values[0]

# Store the rounded values in the dictionary
purchase_rates = {
    "Returning_Customer": round(p_rate_ret_customer, 3),
    "New_Customer": round(p_rate_new_customer, 3)
}

purchase_rates
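
Because the purchase rate is total purchases divided by total sessions, it equals the mean of the Purchase column within each customer type, provided Purchase is stored as 0/1 or True/False (an assumption about the file, not something confirmed above). A minimal sketch on a small made-up DataFrame shows the equivalence; toy_sessions and toy_rates are hypothetical names and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the November/December sessions;
# the values are invented purely for illustration.
toy_sessions = pd.DataFrame({
    "CustomerType": [
        "Returning_Customer", "Returning_Customer",
        "Returning_Customer", "Returning_Customer",
        "New_Customer", "New_Customer",
    ],
    "Purchase": [1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 (or boolean) column equals sum / count,
# which is the purchase rate of each group.
toy_rates = (
    toy_sessions.groupby("CustomerType")["Purchase"]
    .mean()
    .round(3)
    .to_dict()
)

print(toy_rates)
```

On the real data, the same chain applied to shopping_data_nov_december should give the same numbers as the sum/count approach, as long as the Purchase column has no missing values.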

Strongest Correlation

What is the strongest correlation in total time spent among page types by returning customers in November and December?

shopping_data_nov_december.head(5)

shopping_data_nov_december.columns.tolist()

#shopping_data_nov_december['CustomerType'].value_counts()
# Step 1: Keep only the sessions flagged as returning customers
returning_nov_dec = shopping_data_nov_december[shopping_data_nov_december['CustomerType'] == "Returning_Customer"]

# Step 2: Keep only the columns that contain page durations
corr_returning_nov_dec = returning_nov_dec[['Administrative_Duration', 'Informational_Duration', 'ProductRelated_Duration']]

# Step 3: Calculate the correlation in time spent on pages of each type
corr_Table_ret_customer = corr_returning_nov_dec.corr()

corr_Table_ret_customer
#Unpivoting the correlation matrix
corr_long = corr_Table_ret_customer.stack().reset_index()

corr_long.head(5)
corr_long.columns.tolist()
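
One possible way to go from the long table to the answer is to drop the self-correlations (each column against itself, which is always 1) and take the row with the largest absolute value. The sketch below uses a made-up correlation matrix rather than the real data; toy_corr, the column names page_a, page_b and correlation, and the numbers themselves are illustrative assumptions.

```python
import pandas as pd

duration_cols = [
    "Administrative_Duration",
    "Informational_Duration",
    "ProductRelated_Duration",
]

# Hypothetical correlation matrix; the numbers are invented for illustration.
toy_corr = pd.DataFrame(
    [[1.0, 0.2, 0.4],
     [0.2, 1.0, 0.3],
     [0.4, 0.3, 1.0]],
    index=duration_cols,
    columns=duration_cols,
)

# Unpivot to one row per (page_a, page_b) pair, as in the cell above.
toy_long = toy_corr.stack().reset_index()
toy_long.columns = ["page_a", "page_b", "correlation"]

# Drop the diagonal: a column is always perfectly correlated with itself.
off_diagonal = toy_long[toy_long["page_a"] != toy_long["page_b"]]

# Pick the pair with the largest absolute correlation.
strongest = off_diagonal.loc[off_diagonal["correlation"].abs().idxmax()]

print(strongest["page_a"], strongest["page_b"], strongest["correlation"])
```

Each pair appears twice in the long table (once in each order), which does not affect the maximum; idxmax simply returns the first of the two rows.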