Online shopping decisions rely on how consumers engage with online store content. You work for a new startup company that has just launched a new online shopping website. The marketing team asks you, a new data scientist, to review a dataset of online shoppers' purchasing intentions gathered over the last year. Specifically, the team wants you to generate some insights into customer browsing behaviors in November and December, the busiest months for shoppers. You have decided to identify two groups of customers: those with a low purchase rate and returning customers. After identifying these groups, you want to determine the probability that any of these customers will make a purchase in a new marketing campaign to help gauge potential success for next year's sales.

Data description:

You are given a file, online_shopping_session_data.csv, that contains several columns describing each shopping session. Each shopping session corresponds to a single user.

| Column | Description |
| --- | --- |
| SessionID | unique session ID |
| Administrative | number of pages visited related to the customer account |
| Administrative_Duration | total amount of time spent (in seconds) on administrative pages |
| Informational | number of pages visited related to the website and the company |
| Informational_Duration | total amount of time spent (in seconds) on informational pages |
| ProductRelated | number of pages visited related to available products |
| ProductRelated_Duration | total amount of time spent (in seconds) on product-related pages |
| BounceRates | average bounce rate of pages visited by the customer |
| ExitRates | average exit rate of pages visited by the customer |
| PageValues | average page value of pages visited by the customer |
| SpecialDay | closeness of the site visiting time to a specific special day |
| Weekend | indicator of whether the session is on a weekend |
| Month | month of the session date |
| CustomerType | customer type |
| Purchase | class label indicating whether the customer made a purchase |

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
from scipy.stats import pearsonr

# Load and view your data
df = pd.read_csv("online_shopping_session_data.csv")
df.head()
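# Inspect the available columns
print(df.columns)

# 1. Calculate purchase rates by customer type for November and December
# Keep only November and December sessions, then select returning customers
nov_dec_data = df[df["Month"].isin(["Nov", "Dec"])]
returning_customer_data = nov_dec_data[nov_dec_data["CustomerType"] == "Returning_Customer"]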

# Step 1: Calculate the purchase rate for Returning Customers
returning_purchase_rate = returning_customer_data["Purchase"].sum() / len(returning_customer_data)

# Step 2: Filter data for New Customers
new_customer_data = nov_dec_data[nov_dec_data["CustomerType"] == "New_Customer"]

# Step 3: Calculate the purchase rate for New Customers
new_purchase_rate = new_customer_data["Purchase"].sum() / len(new_customer_data)

# Store results in the dictionary
purchase_rates = {
    "Returning_Customer": round(returning_purchase_rate, 3),
    "New_Customer": round(new_purchase_rate, 3)}

print("Purchase Rates:", purchase_rates)
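The two purchase rates above could also be obtained in a single pass with a groupby. The following is a minimal sketch on synthetic toy data (the `toy` frame and its values are made up for illustration and are not rows from the dataset), assuming Purchase is stored as 0/1 or boolean so that its mean equals the share of sessions that ended in a purchase.

```python
import pandas as pd

# Synthetic toy sessions for illustration only; not rows from the dataset.
toy = pd.DataFrame({
    "Month": ["Nov", "Nov", "Dec", "Dec", "Feb", "Nov"],
    "CustomerType": ["Returning_Customer", "New_Customer", "Returning_Customer",
                     "New_Customer", "Returning_Customer", "Returning_Customer"],
    "Purchase": [1, 0, 0, 1, 1, 0],
})

# Keep November and December sessions only
nov_dec = toy[toy["Month"].isin(["Nov", "Dec"])]

# The mean of a 0/1 column is the fraction of sessions with a purchase,
# so a grouped mean gives one purchase rate per customer type.
rates = nov_dec.groupby("CustomerType")["Purchase"].mean().round(3).to_dict()
print(rates)
```

A grouped mean avoids repeating the filter-and-divide logic for each customer type, which may matter if more customer types appear in the data.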


# 2. Find the strongest correlation between page durations for returning customers
# Step 1: Select relevant columns for page durations
page_durations = returning_customer_data[
    ["Administrative_Duration", "Informational_Duration", "ProductRelated_Duration"]
]

# Step 2: Compute pairwise correlations using Pearson
correlations = {}
columns = page_durations.columns
for i, col1 in enumerate(columns):
    for j, col2 in enumerate(columns):
        if i < j:  # Visit each unordered pair once and skip self-correlations such as (A, A)
            corr, _ = pearsonr(page_durations[col1], page_durations[col2])
            correlations[(col1, col2)] = corr

# Step 3: Identify the strongest correlation by absolute value,
# so that a strong negative correlation is not overlooked
strongest_pair = max(correlations, key=lambda pair: abs(correlations[pair]))
strongest_correlation_value = correlations[strongest_pair]

# Step 4: Store the result in the required format
top_correlation = {
    "pair": strongest_pair,
    "correlation": round(strongest_correlation_value, 3)
}

print("Top Correlation:", top_correlation)
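As a cross-check on the pairwise loop above, the same Pearson coefficients can be read from a correlation matrix. This is a minimal sketch on synthetic toy data (the `toy` frame and its values are invented for illustration). It ranks pairs by absolute value so that a strong negative correlation would also be considered, and it compares one matrix entry against `pearsonr`.

```python
import pandas as pd
from scipy.stats import pearsonr

# Synthetic toy durations for illustration only; not rows from the dataset.
toy = pd.DataFrame({
    "Administrative_Duration": [0.0, 10.0, 20.0, 30.0, 40.0],
    "Informational_Duration": [5.0, 3.0, 8.0, 1.0, 9.0],
    "ProductRelated_Duration": [12.0, 30.0, 55.0, 70.0, 110.0],
})

# Pearson correlation matrix for all column pairs in one call
corr_matrix = toy.corr(method="pearson")

# Keep each unordered pair once (i < j), which also skips the diagonal
cols = list(toy.columns)
pairs = {
    (cols[i], cols[j]): corr_matrix.iloc[i, j]
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
}

# Rank by absolute correlation to find the strongest relationship
strongest = max(pairs, key=lambda pair: abs(pairs[pair]))

# Cross-check the matrix entry against scipy's pearsonr
r, _ = pearsonr(toy[strongest[0]], toy[strongest[1]])
print(strongest, round(pairs[strongest], 3), round(r, 3))
```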

Binomial Distribution: The binomial distribution models the number of successes (sales) in a fixed number of independent trials (sessions), where each trial has the same probability of success (boosted purchase rate).
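To make the "at least k successes" calculation concrete, here is a minimal sketch with small hypothetical numbers (`n_trials`, `p_success`, and `k` are chosen only for illustration and are not taken from the dataset). It computes the same tail probability three ways: as the complement of the CDF, as a direct sum of the PMF, and with the survival function.

```python
from scipy.stats import binom

# Hypothetical values chosen only for illustration; not taken from the dataset.
n_trials = 20      # number of independent sessions
p_success = 0.3    # probability that a single session ends in a purchase
k = 8              # minimum number of purchases of interest

# P(X >= k) as the complement of the CDF evaluated at k - 1
tail_from_cdf = 1 - binom.cdf(k - 1, n_trials, p_success)

# The same tail obtained by summing the PMF over k, k + 1, ..., n_trials
tail_from_pmf = sum(binom.pmf(x, n_trials, p_success) for x in range(k, n_trials + 1))

# The survival function sf(k - 1) gives P(X > k - 1), which equals P(X >= k)
tail_from_sf = binom.sf(k - 1, n_trials, p_success)

print(tail_from_cdf, tail_from_pmf, tail_from_sf)
```

All three values should agree up to floating-point error. The survival function is often preferred when the tail probability is very small, because it avoids subtracting a number close to 1 from 1.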

# 3. Calculate the likelihood of achieving at least 100 sales
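# Boosted purchase rate for returning customers (a 15% relative increase over the observed rate)
boosted_rate = purchase_rates["Returning_Customer"] * 1.15
n_sessions = 500         # number of sessions in the campaign scenario
success_threshold = 100  # minimum number of sales of interest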

# Calculate the probability using the binomial distribution (binom is imported at the top)
# binom.cdf(99, n, p) is the cumulative probability of 0, 1, 2, ..., 99 sales,
# i.e. fewer than 100 sales. Subtracting it from 1 gives the probability
# of at least 100 sales.
prob_at_least_100_sales = 1 - binom.cdf(success_threshold - 1, n_sessions, boosted_rate)

print("Probability of achieving at least 100 sales:", prob_at_least_100_sales)