Online shopping decisions rely on how consumers engage with online store content. You work for a new startup company that has just launched a new online shopping website. The marketing team asks you, a new data scientist, to review a dataset of online shoppers' purchasing intentions gathered over the last year. Specifically, the team wants you to generate some insights into customer browsing behaviors in November and December, the busiest months for shoppers. You have decided to identify two groups of customers: those with a low purchase rate and returning customers. After identifying these groups, you want to determine the probability that any of these customers will make a purchase in a new marketing campaign to help gauge potential success for next year's sales.
Data description:
You are given online_shopping_session_data.csv, which contains several columns describing each shopping session. Each shopping session corresponds to a single user.
| Column | Description |
|---|---|
| SessionID | unique session ID |
| Administrative | number of pages visited related to the customer account |
| Administrative_Duration | total amount of time spent (in seconds) on administrative pages |
| Informational | number of pages visited related to the website and the company |
| Informational_Duration | total amount of time spent (in seconds) on informational pages |
| ProductRelated | number of pages visited related to available products |
| ProductRelated_Duration | total amount of time spent (in seconds) on product-related pages |
| BounceRates | average bounce rate of pages visited by the customer |
| ExitRates | average exit rate of pages visited by the customer |
| PageValues | average page value of pages visited by the customer |
| SpecialDay | closeness of the site visiting time to a specific special day |
| Weekend | indicator of whether the session is on a weekend |
| Month | month of the session date |
| CustomerType | customer type |
| Purchase | class label indicating whether the customer made a purchase |

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns
# Load and view your data
shopping_data = pd.read_csv("online_shopping_session_data.csv")
shopping_data.shape
shopping_data.info()
shopping_data.head(10)

# Filter to Nov/Dec only
nov_dec_data = shopping_data[(shopping_data['Month'] == 'Nov') | (shopping_data['Month'] == 'Dec')]
# Quick count of purchases by new and returning customers
purchase_count = nov_dec_data.groupby('CustomerType')['Purchase'].sum()
print(purchase_count)
customer_type_count = nov_dec_data.groupby('CustomerType')['Purchase'].count()
print(customer_type_count)
#Determine the purchasing rate for new and returning customers
new_customer_purchase_rate = nov_dec_data[nov_dec_data['CustomerType']=='New_Customer']['Purchase'].sum()/len(nov_dec_data[nov_dec_data['CustomerType']=='New_Customer'])
returning_customer_purchase_rate = nov_dec_data[nov_dec_data['CustomerType']=='Returning_Customer']['Purchase'].sum()/len(nov_dec_data[nov_dec_data['CustomerType']=='Returning_Customer'])
#store results to a dictionary and print them out
purchase_rates = {"Returning_Customer": returning_customer_purchase_rate, "New_Customer" : new_customer_purchase_rate}
print(purchase_rates)

Returning customers made 3.5x as many purchases as new customers, but new customers had a higher purchase rate. This is because returning customers accounted for about 5x as many visitors to the site as new customers.
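
The same counts and rates can be produced in one pass with a grouped aggregation. The sketch below uses a small made-up DataFrame (the numbers are illustrative only and are not taken from the real CSV) to show how a group can have more purchases overall and still have a lower purchase rate when it also has many more sessions. It assumes `Purchase` is coded as 0/1, so its mean equals the purchase rate.

```python
import pandas as pd

# Toy data, invented for illustration: 4 new-customer sessions, 20 returning
toy = pd.DataFrame({
    "CustomerType": ["New_Customer"] * 4 + ["Returning_Customer"] * 20,
    "Purchase": [1, 1, 0, 0] + [1] * 6 + [0] * 14,
})

# One grouped aggregation gives purchases, sessions, and purchase rate together
summary = toy.groupby("CustomerType")["Purchase"].agg(
    purchases="sum",       # number of sessions that ended in a purchase
    sessions="count",      # total number of sessions in the group
    purchase_rate="mean",  # purchases / sessions when Purchase is 0/1
)
print(summary)
```

In this toy example, returning customers make three times as many purchases but have five times as many sessions, so their rate is lower. Applied to `nov_dec_data`, the same `groupby(...).agg(...)` call would replace the separate sum, count, and rate calculations.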
# What is the top correlation in total time spent among page types by returning customers?
returning_customers = nov_dec_data[nov_dec_data['CustomerType'] == 'Returning_Customer']
returning_customers.head()

# Compute the pairwise correlations between time spent on each page type for returning customers
admin_info_corr = returning_customers['Administrative_Duration'].corr(returning_customers['Informational_Duration'])
info_product_corr = returning_customers['Informational_Duration'].corr(returning_customers['ProductRelated_Duration'])
product_admin_corr = returning_customers['ProductRelated_Duration'].corr(returning_customers['Administrative_Duration'])
print(admin_info_corr,info_product_corr, product_admin_corr)
top_correlation = {'pair': ('ProductRelated_Duration', 'Administrative_Duration'), 'correlation': product_admin_corr}

There is a moderate correlation between the time spent on administrative pages and the time spent on product-related pages by returning customers during the holiday season.
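
Rather than reading the largest value off the printed output, the top pair can be selected programmatically. The following is a minimal sketch under stated assumptions: the DataFrame `toy` is synthetic (randomly generated so that two of the duration columns move together), and only the column names are borrowed from the dataset.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic durations for illustration only; not the real session data
rng = np.random.default_rng(0)
n = 200
base = rng.exponential(scale=100, size=n)
toy = pd.DataFrame({
    "Administrative_Duration": base + rng.normal(0, 20, n),
    "Informational_Duration": rng.exponential(scale=30, size=n),
    "ProductRelated_Duration": 5 * base + rng.normal(0, 100, n),
})

duration_cols = [
    "Administrative_Duration",
    "Informational_Duration",
    "ProductRelated_Duration",
]

# Pearson correlation for every unordered pair of duration columns
pair_corrs = {
    (a, b): toy[a].corr(toy[b]) for a, b in combinations(duration_cols, 2)
}

# Pick the pair with the largest correlation
top_pair = max(pair_corrs, key=pair_corrs.get)
top_correlation = {"pair": top_pair, "correlation": pair_corrs[top_pair]}
print(top_correlation)
```

Swapping `toy` for `returning_customers` would apply the same selection to the real data. If strong negative correlations should also count as "top", comparing absolute values would be the natural variation.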
Assuming a new marketing campaign boosts the purchase rate by 15% (a relative increase, i.e. 1.15 times the current rate), what is the probability of achieving at least 100 sales out of 500 online sessions for returning customers?
This is a binomial probability problem: each session has only two outcomes (the customer either buys a product or does not), and the sessions are treated as independent trials with the same purchase probability.
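
For a discrete distribution, "at least k" means P(X >= k) = 1 - P(X <= k - 1), so the CDF has to be evaluated at k - 1 rather than k. The sketch below checks this in three equivalent ways. The value `p = 0.2` is a hypothetical purchase probability chosen only for illustration; it is not the rate computed from the dataset.

```python
import numpy as np
from scipy.stats import binom

n_trials = 500
k = 100
p = 0.2  # hypothetical purchase probability, for illustration only

# Three equivalent ways to compute P(X >= k)
tail_from_cdf = 1 - binom.cdf(k - 1, n_trials, p)
tail_from_sf = binom.sf(k - 1, n_trials, p)  # survival function: P(X > k - 1)
tail_from_pmf = binom.pmf(np.arange(k, n_trials + 1), n_trials, p).sum()

# Evaluating the CDF at k instead gives P(X > k), which leaves out P(X = k)
strictly_more = 1 - binom.cdf(k, n_trials, p)

print(f"P(X >= {k}): {tail_from_cdf:.4f}")
print(f"P(X >  {k}): {strictly_more:.4f}")
```

The gap between the two printed values is exactly P(X = k). `binom.sf` is usually preferable to `1 - binom.cdf` when the tail probability is very small, because it avoids subtracting two nearly equal numbers.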
from scipy.stats import binom
successes = 100
n_trials = 500
success_probability = 1.15 * returning_customer_purchase_rate
# P(X >= 100) = 1 - P(X <= 99), so evaluate the CDF at successes - 1
prob_at_least_100_sales = 1 - binom.cdf(k=successes - 1, n=n_trials, p=success_probability)
print(f"The probability of observing at least {successes} successes in {n_trials} trials is: {prob_at_least_100_sales:.1%}")

# Plot of the probability distribution
n_sessions = 500
k_values = np.arange(n_sessions + 1)  # possible sales counts: 0 through 500
p_binom_values = stats.binom.pmf(k_values, n_sessions, success_probability)  # pmf is vectorized over k
plt.bar(k_values, p_binom_values)
plt.vlines(100, 0, 0.08, color='r', linestyle='dashed', label="sales=100")
plt.xlabel("number of sales")
plt.ylabel("probability")
plt.legend()
plt.show()