Product Sales Data Analysis - Code

About Pens and Printers

Pens and Printers was founded in 1984 and provides high quality office products to large organizations. We are a trusted provider of everything from pens and notebooks to desk chairs and monitors. We don’t produce our own products but sell those made by other companies. We have built long lasting relationships with our customers and they trust us to provide them with the best products for them. As the way in which consumers buy products is changing, our sales tactics have to change too. Launching a new product line is expensive and we need to make sure we are using the best techniques to sell the new product effectively. The best approach may vary for each new product so we need to learn quickly what works and what doesn’t.

New Product Sales Methods

Six weeks ago we launched a new line of office stationery. Despite the world becoming increasingly digital, there is still demand for notebooks, pens and sticky notes. Our focus has been on selling products to enable our customers to be more creative, focused on tools for brainstorming. We have tested three different sales strategies for this, targeted email and phone calls, as well as combining the two.

Email: Customers in this group received an email when the product line was launched, and further email three weeks later. This required very little work for the team.
Call: Customers in this group were called by a member of the sales team. On average members of the team were on the phone for around thirty minutes per customer.
Email and call: Customers in this group were first sent the product information email, then called a week later by the sales team to talk about their needs and how this new product may support their work. The email required little work from the team, the call was around ten minutes per customer.

Data Information

The sales rep has pulled some data from their sales tracking system for us. They haven’t included numbers for how much time was spent on each customer, but there may be some other useful customer information in here. The data only relates to the new products sold. As there are multiple different products, the revenue will vary depending on which products were sold.

We need to know:

How many customers were there for each approach?
What does the spread of the revenue look like overall? And for each method?
Was there any difference in revenue over time for each of the methods?
Based on the data, which method would you recommend we continue to use? Some of these methods take more time from the team so they may not be the best for us to use if the results are similar.

We don’t really know if there are other differences between the customers in each group, so anything you can tell us would be really helpful to give some context to what went well.

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter
import matplotlib.ticker as ticker

# Load and view data
sales = pd.read_csv('product_sales.csv')

### Data validation & cleaning

## Addressing missing values

# Check for missing values in all columns, and for unexpected missing values represented by '-', etc.
missing_values = sales.isnull().sum()
unexpected_missing_values = (sales.isin(['-', 'missing', ' ', 'NaN'])).sum()

# Since revenue seems to be a critial value address business request, impute missing revenue using the median revenue per sales method
sales['revenue'] = sales['revenue'].fillna(sales.groupby('sales_method')['revenue'].transform('median'))

# Check for missing values again to ensure successful imputation
upd_missing_values = sales.isnull().sum()

## Clean categorical and text data

# Clean 'state': Ensure consistent values (remove spaces, convert to title case)
sales['state'] = sales['state'].str.strip().str.title()

# Clean the 'sales_method' column and correct specific inconsistencies
sales['sales_method'] = sales['sales_method'].str.strip().str.title()
sales['sales_method'] = sales['sales_method'].replace({
    'Em + Call': 'Email + Call'
})

# Convert and validate numeric columns, combining steps
sales['week'] = pd.to_numeric(sales['week'], errors='coerce').astype('Int64')
sales['nb_sold'] = pd.to_numeric(sales['nb_sold'], errors='coerce').astype('Int64')
sales['revenue'] = pd.to_numeric(sales['revenue'], errors='coerce').round(2).astype('float')
sales['years_as_customer'] = pd.to_numeric(sales['years_as_customer'], errors='coerce').astype('Int64')
sales['nb_site_visits'] = pd.to_numeric(sales['nb_site_visits'], errors='coerce').astype('Int64')

# Validate 'years_as_customer' matches company founding year with filtering and dropping rows with invalid 'years_as_customer'
sales = sales[sales['years_as_customer'] <= (2024-1984)]

# Drop the 'valid_years_as_customer' column if it exists
if 'valid_years_as_customer' in sales.columns:
    sales = sales.drop(columns=['valid_years_as_customer'])

sales.head()

### Exploratory Data Analysis (EDA)

# Define a function to format ticks with a dollar sign
def dollar_formatter(x, pos):
    return f'${x:,.0f}'

formatter = FuncFormatter(dollar_formatter)

# Set a common figure size and grid style for all plots
common_figsize = (10, 6)
sns.set(style="whitegrid")

## 1. Number of customers by sales method

# Calculate the counts of customers by sales method
customer_counts = sales['sales_method'].value_counts()

# Print the customer counts for reference
print('Number of customers by sales method:')
for method, count in customer_counts.items():
    print(f'{method}: {count}')

# Generate the countplot with sorted bars
plt.figure(figsize=(6, 6))
sns.countplot(x='sales_method', data=sales, palette='Set2', order=customer_counts.index)
plt.title('Number of customers by sales method', fontsize=14, fontweight='bold')
plt.xlabel('Sales method', fontweight='bold')
plt.ylabel('Number of customers', fontweight='bold')
plt.show()

### Exploratory Data Analysis (EDA)

## 2. Spread of revenue

# Overall revenue spread and spread by sales method calculations
overall_revenue = sales['revenue'].describe()
revenue_by_method = sales.groupby('sales_method')['revenue'].describe()

# Sort the sales methods by median revenue (50th percentile)
sorted_revenue_methods = revenue_by_method['50%'].sort_values()

# Get the most, middle, and least profitable methods based on median revenue
most_profitable_method = sorted_revenue_methods.idxmax()
most_profitable_value = sorted_revenue_methods.max()
middle_profitable_method = sorted_revenue_methods.index[1]
middle_profitable_value = sorted_revenue_methods.iloc[1]
least_profitable_method = sorted_revenue_methods.idxmin()
least_profitable_value = sorted_revenue_methods.min()

# Present summary
print("Overall revenue summary:")
print(f"Revenue ranges from a minimum of ${overall_revenue['min']:.2f} to a maximum of ${overall_revenue['max']:.2f}.")
print(f"The overall average revenue is ${overall_revenue['mean']:.2f}, and most typical revenue is ${overall_revenue['50%']:.2f}.")
print("\nSummary of revenue performance by sales method:")
print(f"Sales method with highest revenue is '{most_profitable_method}', with representative revenue of ${most_profitable_value:.2f}.")
print(f"Sales method with middle revenue is '{middle_profitable_method}', with representative revenue of ${middle_profitable_value:.2f}.")
print(f"Sales method with lowest revenue is '{least_profitable_method}', with representative revenue of ${least_profitable_value:.2f}.")

# Overall revenue spread
plt.figure(figsize=common_figsize)
sns.histplot(sales['revenue'], bins=30, kde=True, color='dodgerblue')  # kde=True adds a smooth curve (density estimate)
plt.title('Overall revenue distribution', fontsize=14, fontweight='bold')
plt.xlabel('Revenue', fontweight='bold')
plt.ylabel('Frequency', fontweight='bold')
plt.gca().xaxis.set_major_formatter(formatter)  # Apply dollar formatter to the x-axis
plt.show()

# Revenue spread by sales method
plt.figure(figsize=common_figsize)
sns.boxplot(x='sales_method', y='revenue', data=sales, palette='Set3')
plt.title('Revenue spread by sales method', fontsize=14, fontweight='bold')
plt.xlabel('Sales method', fontweight='bold')
plt.ylabel('Revenue', fontweight='bold')
plt.gca().yaxis.set_major_formatter(formatter)  # Apply dollar formatter
plt.ylim(0, 250)
plt.show()

### Exploratory Data Analysis (EDA)

## 3. Revenue over time

# Calculate the mean revenue for the first and last weeks for each method
mean_revenue_first_week = sales[sales['week'] == sales['week'].min()].groupby('sales_method')['revenue'].mean().round(2)
mean_revenue_last_week = sales[sales['week'] == sales['week'].max()].groupby('sales_method')['revenue'].mean().round(2)

# Calculate the percentage change in mean revenue between the first and last weeks
percentage_change = ((mean_revenue_last_week - mean_revenue_first_week) / mean_revenue_first_week) * 100

# Print the mean_revenue_first_week with 2 decimal places and a dollar symbol
print('At the first week average revenue by sales method was:')
for method, revenue in mean_revenue_first_week.items():
    print(f'{method}: ${revenue:.2f}')

# Print the mean_revenue_last_week with 2 decimal places and a dollar symbol
print('\nAt the last week average revenue by sales method was:')
for method, revenue in mean_revenue_last_week.items():
    print(f'{method}: ${revenue:.2f}')
    
# Print the percentage change with 2 decimal places and a percentage symbol
print('\nDifference in revenue between first and last weeks by sales method:')
for method, percentage in percentage_change.items():
    print(f'{method}: {percentage:.2f}%')
    
# plt.figure(figsize=common_figsize)
sales_pivot = sales.pivot_table(index='week', columns='sales_method', values='revenue', aggfunc='mean')
sales_pivot.plot(kind='area', stacked=False, colormap='Set1', figsize=(10, 6))
plt.title('Revenue trend over time by sales method', fontsize=14, fontweight='bold')
plt.xlabel('Week', fontweight='bold')
plt.ylabel('Revenue', fontweight='bold')
plt.gca().yaxis.set_major_formatter(formatter)  # Apply dollar formatter
plt.ylim(0, 250)

# Get the handles and labels from the existing plot
handles, labels = plt.gca().get_legend_handles_labels()

# Manually reorder them based on the desired order
order = ['Email + Call', 'Email', 'Call']

# Create the legend using the reordered labels and handles
plt.legend([handles[labels.index(l)] for l in order], order, title='Sales method')
plt.show()

### Exploratory Data Analysis (EDA)

## 4. Sales method recommendations

# Calculate total revenue for each sales method
total_revenue_by_method = (sales.groupby('sales_method')['revenue'].sum()).round(2)

# Count the number of customers for each sales method
customers_by_method = sales['sales_method'].value_counts()

# Effort data (in minutes per customer) for each sales method
effort_by_method = {
    'Email': 1,  # Minimal effort (assume ~ 1 min per customer)
    'Call': 30,  # 30 minutes per customer for calls
    'Email + Call': 11  # 1 minute email + 10 minutes per customer for follow-up call
}

# Combine total revenue, customer count, and effort into one DataFrame for comparison
revenue_effort = pd.DataFrame({
    'Total revenue': total_revenue_by_method,
    'Customers count': customers_by_method,
    'Effort (minutes per customer)': pd.Series(effort_by_method)
})

# Calculate total effort (minutes per customer * number of customers)
revenue_effort['Total effort (minutes)'] = revenue_effort['Effort (minutes per customer)'] * revenue_effort['Customers count']

# Calculate revenue per minute of effort (corrected)
revenue_effort['Revenue per minute of effort'] = (revenue_effort['Total revenue'] / revenue_effort['Total effort (minutes)']).round(2)

# Print the results
print('Revenue and effort comparison:')
print(revenue_effort.to_string())

# Recommendation based on corrected revenue efficiency
best_method = revenue_effort['Revenue per minute of effort'].idxmax()
best_revenue_efficiency = revenue_effort['Revenue per minute of effort'].max().round(2)

# Conclusions
print(f"\nMaximize total revenue: \nThe '{total_revenue_by_method.idxmax()}' method generates the highest total revenue (${total_revenue_by_method.max()}). This strategy is most effective for campaigns focused on broad outreach and ensuring the largest overall financial return.")
print(f"\nMaximize individual returns: \nThe '{most_profitable_method}' method produces the highest typical revenue per customer (${most_profitable_value}). This approach is more time-intensive but provides a higher return per customer interaction, making it the best choice for high-value customers who may need more personalized attention.")
print(f"\nMaximize efficiency: \nThe '{best_method}' method has the highest revenue efficiency per minute of effort (${best_revenue_efficiency}). It’s the preferred option for high-volume, low-effort sales campaigns, especially when the team wants to reach a broad audience with minimal time investment.")

# Final recommendation: the best technique to continue using to sell the product effectively
if best_method == total_revenue_by_method.idxmax():
    print(f"\nAddressing concerns of the sales team about time investments and campaign costs, we recommend continuing with the '{best_method}' method.")
    print(f"Using the '{best_method}' strategy, the sales team achieves both the highest total revenue (${total_revenue_by_method.max()}) and the greatest efficiency (${best_revenue_efficiency} per minute of effort), making it a highly effective and low-cost sales approach.")
else:
    print(f"\nAddressing concerns of the sales team about time investments and campaign costs, we recommend continuing with the '{best_method}' method.")
    print(f"While '{total_revenue_by_method.idxmax()}' generates the highest total revenue (${total_revenue_by_method.max()}), the '{best_method}' method provides the greatest efficiency (${best_revenue_efficiency} per minute of effort), making it the top choice for cost-effective, large-scale campaigns.")

### Exploratory Data Analysis (EDA)

## 5. Customer behavior differences across sales methods

# Create cohorts based on years as a customer
sales['cohort'] = pd.cut(sales['years_as_customer'], bins=[0, 5, 10, 15, 20, 25, 30, 35, 40], 
                         labels=['0-5', '6-10', '11-15', '16-20', '21-25', '26-30', '31-35', '36-40'])

# Calculate total revenue per cohort
cohort_revenue_data = sales.pivot_table(index='cohort', columns='sales_method', values='revenue', aggfunc='sum').fillna(0)

# Find the cohort and method with the highest revenue
max_revenue_method = cohort_revenue_data.max().idxmax()  # Get the sales method with the highest revenue
max_revenue_cohort = cohort_revenue_data[max_revenue_method].idxmax()  # Get the cohort for that sales method
max_revenue_value = cohort_revenue_data[max_revenue_method].max()  # Get the highest revenue value

# Calculate total revenue for the '0-5' cohort across all methods
total_revenue_0_5 = sales[sales['cohort'] == '0-5'].groupby('cohort')['revenue'].sum().values[0]

# Calculate total revenue across all cohorts and methods
total_revenue_all = sales['revenue'].sum()

# Calculate percentage of total revenue generated by the '0-5' cohort
percentage_revenue_0_5 = (total_revenue_0_5 / total_revenue_all) * 100

# Print the results
print(f"Total revenue for the '0-5' cohort across all methods: ${total_revenue_0_5:,.2f}.")
print(f"Percentage of total revenue generated by the '0-5' cohort: {percentage_revenue_0_5:.2f}%.")
print(f"Customers in the '{max_revenue_cohort} years' cohort and '{max_revenue_method}' method generate the highest revenue (${max_revenue_value:,.2f}).")

# Adjust the data for log scaling (add a small constant to avoid log(0) issues)
log_data = np.log1p(cohort_revenue_data)

# Format cohort_revenue_data to include dollar symbols for annotations
annot_data = cohort_revenue_data.applymap(lambda x: f"${x:,.0f}")

# Log-transformed revenue values
plt.figure(figsize=common_figsize)

# Create the heatmap with log scale
heatmap = sns.heatmap(log_data, cmap='coolwarm', 
                      annot=cohort_revenue_data.applymap(lambda v: f"${int(v):,}"), 
                      fmt="", cbar_kws={'label': 'Log-transformed revenue'})

# Title and labels to indicate log-transformed scale
plt.title('Revenue distribution by customer tenure and sales method', fontsize=14, fontweight='bold')
plt.xlabel('Sales method', fontweight='bold')
plt.ylabel('Years as customer', fontweight='bold')

# Show the plot
plt.show()

Product Sales Data Analysis - Code

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}About Pens and Printers

New Product Sales Methods

Data Information

We need to know:

About Pens and Printers