Project: Will This Customer Purchase Your Product?
    Online shopping decisions rely on how consumers engage with online store content. You work for a new startup company that has just launched a new online shopping website. The marketing team asks you, a new data scientist, to review a dataset of online shoppers' purchasing intentions gathered over the last year. Specifically, the team wants you to generate some insights into customer browsing behaviors in November and December, the busiest months for shoppers. You have decided to identify two groups of customers: those with a low purchase rate and returning customers. After identifying these groups, you want to determine the probability that any of these customers will make a purchase in a new marketing campaign to help gauge potential success for next year's sales.

    Data description:

    You are given online_shopping_session_data.csv, which contains several columns about each shopping session. Each shopping session corresponds to a single user.

    Column                    Description
    SessionID                 unique session ID
    Administrative            number of pages visited related to the customer account
    Administrative_Duration   total amount of time spent (in seconds) on administrative pages
    Informational             number of pages visited related to the website and the company
    Informational_Duration    total amount of time spent (in seconds) on informational pages
    ProductRelated            number of pages visited related to available products
    ProductRelated_Duration   total amount of time spent (in seconds) on product-related pages
    BounceRates               average bounce rate of pages visited by the customer
    ExitRates                 average exit rate of pages visited by the customer
    PageValues                average page value of pages visited by the customer
    SpecialDay                closeness of the site visiting time to a specific special day
    Weekend                   indicator whether the session is on a weekend
    Month                     month of the session date
    CustomerType              customer type
    Purchase                  class label whether the customer made a purchase

    # Import required libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats
    
    # Load and preview data
    shopping_data = pd.read_csv("online_shopping_session_data.csv")
    shopping_data.head()
    
    ## Calculate the online purchase rates during online shopping sessions by customer type for November and December.
    # Subset dataframe for November and December data
    shopping_Nov_Dec = shopping_data[shopping_data['Month'].isin(['Nov', 'Dec'])]
    
    # Preview to make sure the subset is correct
    print(shopping_Nov_Dec.head())
    
    # Make sure we only have November and December data
    print(shopping_Nov_Dec['Month'].unique())
    
    # Get session frequency stats by CustomerType and Purchase
    count_session = shopping_Nov_Dec.groupby(['CustomerType'])['Purchase'].value_counts()
    print(count_session)
    
    # Total number of sessions by CustomerType
    total_new_customer = np.sum(count_session['New_Customer'])
    total_returning_customer = np.sum(count_session['Returning_Customer'])
    
    # Total number of purchases by CustomerType
    purchase_new_customer = count_session[('New_Customer', 1)]
    purchase_returning_customer = count_session[('Returning_Customer', 1)]
    
    # Calculate purchase rates (purchases divided by sessions)
    purchase_rate_new = purchase_new_customer / total_new_customer
    purchase_rate_returning = purchase_returning_customer / total_returning_customer
    
    # Store and print the rates: the purchase rate for the returning customers is lower than that of the new customers.
    purchase_rates = {"Returning_Customer": purchase_rate_returning, "New_Customer": purchase_rate_new}
    print(purchase_rates)
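
Because Purchase is coded as 0 or 1, the purchase rate per customer type is simply the mean of Purchase within each group. The sketch below shows that shortcut on a small made-up frame; the `toy` data is hypothetical and only stands in for the November-December subset.

```python
import pandas as pd

# Hypothetical toy data standing in for the Nov-Dec subset (not real sessions)
toy = pd.DataFrame({
    "CustomerType": ["New_Customer", "New_Customer",
                     "Returning_Customer", "Returning_Customer",
                     "Returning_Customer", "Returning_Customer"],
    "Purchase": [1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column within a group equals purchases / sessions
toy_rates = toy.groupby("CustomerType")["Purchase"].mean().to_dict()
print(toy_rates)
```

On the real data the same pattern would be applied to the November-December DataFrame, which avoids indexing the value counts by hand.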
    
    ## Identify the strongest correlation in total time spent between different types of pages visited by the returning customers during the months of November and December. 
    # The question is about returning customers, so subset the Nov-Dec data first
    returning_Nov_Dec = shopping_Nov_Dec[shopping_Nov_Dec['CustomerType'] == 'Returning_Customer']
    
    # Calculate pairwise Pearson correlations with pandas
    cor_admin_info = returning_Nov_Dec['Administrative_Duration'].corr(returning_Nov_Dec['Informational_Duration'])
    cor_admin_product = returning_Nov_Dec['Administrative_Duration'].corr(returning_Nov_Dec['ProductRelated_Duration'])
    cor_product_info = returning_Nov_Dec['ProductRelated_Duration'].corr(returning_Nov_Dec['Informational_Duration'])
    
    print(cor_admin_info)
    print(cor_admin_product)
    print(cor_product_info)
    
    # Another way to solve this is to use the SciPy pearsonr function (returns the coefficient and a p-value)
    # cor_admin_info = stats.pearsonr(returning_Nov_Dec['Administrative_Duration'], returning_Nov_Dec['Informational_Duration'])
    
    # Another way to solve this is to use a pandas correlation matrix
    # returning_Nov_Dec[['Administrative_Duration', 'Informational_Duration', 'ProductRelated_Duration']].corr()
    
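    # Store top correlation: pick the pair with the largest absolute correlation instead of hard-coding it
    correlations = {
        ('Administrative_Duration', 'Informational_Duration'): cor_admin_info,
        ('Administrative_Duration', 'ProductRelated_Duration'): cor_admin_product,
        ('ProductRelated_Duration', 'Informational_Duration'): cor_product_info,
    }
    top_pair = max(correlations, key=lambda pair: abs(correlations[pair]))
    top_correlation = {"pair": top_pair, "correlation": correlations[top_pair]}
    print(top_correlation)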
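
The correlation-matrix route mentioned in the comments can also pick out the strongest pair automatically. The following is a minimal sketch, assuming the strongest pair is the one with the largest absolute Pearson coefficient; the helper `strongest_pair` and the `toy` values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def strongest_pair(df, columns):
    """Return (col_a, col_b, r) for the column pair with the largest |Pearson r|."""
    corr = df[columns].corr()
    # Keep only the strict upper triangle: drops the diagonal and mirrored pairs
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    best = pairs.abs().idxmax()
    return best[0], best[1], float(pairs.loc[best])

# Hypothetical durations in seconds, for illustration only
toy = pd.DataFrame({
    "Administrative_Duration": [0, 10, 20, 30, 40],
    "Informational_Duration": [5, 3, 8, 1, 6],
    "ProductRelated_Duration": [1, 21, 39, 62, 80],
})

pair_a, pair_b, r = strongest_pair(toy, list(toy.columns))
print(pair_a, pair_b, round(r, 3))
```

Using the absolute value means a strong negative correlation would also be reported, which may or may not be what the marketing question intends.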
    
    ## A new campaign for the returning customers will boost the purchase rate by 15%. 
    ## What is the likelihood of achieving at least 100 sales out of 500 online shopping sessions for the returning customers?
    
    # Each session's Purchase is a Bernoulli (0 or 1) outcome, so the number of sales in n sessions is binomial
    # We know that the current purchase rate for the returning customers is
    print("Current purchase rate for the returning customer:", purchase_rate_returning)
    
    # A 15% (relative) increase in this rate would be
    increased_purchase_rate_returning = 1.15 * purchase_rate_returning
    print("Increased purchase rate for the returning customer:", increased_purchase_rate_returning)
    
    # First, we find the likelihood of having <100 sales (at most 99) out of 500 sessions
    # binom.cdf(k) returns P(X <= k), so k=99 gives P(X < 100)
    prob_sales_100_less = stats.binom.cdf(k=99, n=500, p=increased_purchase_rate_returning)
    print("probability of having <100 sales:", prob_sales_100_less)
    
    # Then, the probability of having 100 or more sales is the complement, 1-prob_sales_100_less
    prob_at_least_100_sales = 1 - prob_sales_100_less
    print("probability of having at least 100 sales:", prob_at_least_100_sales)
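
Because the binomial distribution is discrete, it matters whether k = 100 itself is counted inside the CDF. The sketch below compares the boundary choices using SciPy's `binom.cdf`, `binom.sf`, and `binom.pmf`; the rate `p_assumed` is a hypothetical value chosen only for illustration, not the rate from the dataset.

```python
from scipy import stats

n_sessions = 500
p_assumed = 0.20  # hypothetical purchase rate, for illustration only

# P(X >= 100) = 1 - P(X <= 99): the complement must exclude k = 100
tail_via_cdf = 1 - stats.binom.cdf(99, n_sessions, p_assumed)

# sf(k) is P(X > k), so sf(99) is the same tail without the subtraction
tail_via_sf = stats.binom.sf(99, n_sessions, p_assumed)

# Using k = 100 instead yields P(X >= 101), which omits P(X = 100)
tail_from_101 = 1 - stats.binom.cdf(100, n_sessions, p_assumed)

print(tail_via_cdf, tail_via_sf, tail_from_101)
```

The gap between the two tails is exactly `stats.binom.pmf(100, n_sessions, p_assumed)`, which is not negligible when 100 lies near the mean of the distribution.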
    
    # Plotting the binomial probability distribution over all possible outcomes (0 to 500 sales)
    n_sessions = 500
    k_values = np.arange(n_sessions + 1)
    # binom.pmf accepts an array of k values, so no Python loop is needed
    p_binom_values = stats.binom.pmf(k_values, n_sessions, increased_purchase_rate_returning)
    plt.bar(k_values, p_binom_values)
    plt.vlines(100, 0, p_binom_values.max(), color='r', linestyle='dashed', label="sales=100")
    plt.xlabel("number of sales")
    plt.ylabel("probability")
    plt.legend()
    plt.show()
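
A tail probability like the one computed for the campaign can be sanity-checked by simulation. This is a rough sketch, assuming each campaign is an independent run of 500 sessions with a fixed purchase probability; `p_assumed`, the seed, and the number of simulated campaigns are all hypothetical choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_sessions = 500
p_assumed = 0.20  # hypothetical purchase rate, for illustration only

# Each draw is the number of sales in one simulated campaign of 500 sessions
simulated_sales = rng.binomial(n=n_sessions, p=p_assumed, size=200_000)

# Share of simulated campaigns that reach at least 100 sales
simulated_tail = np.mean(simulated_sales >= 100)

# Exact binomial tail P(X >= 100) for comparison
exact_tail = stats.binom.sf(99, n_sessions, p_assumed)

print(simulated_tail, exact_tail)
```

The two numbers should agree to about two decimal places; a larger gap would suggest a boundary error in the exact calculation.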