Skip to content

1. Data Validation

Checking for data types, possible null-values, and anomalies:

A. Null-values Imputation

  1. Using the df.info() method. There is a possibility of null-values in the revenue column. The other column has 15000 values while the revenue column only has 13926.
  2. There are 1074 missing values in the revenue column, which account for 7% of the data. Therefore we could not remove the records with null-values in the revenue column since it is >5%.
  3. After checking the null-value records in the revenue column, there were actually products sold on the record and no 'revenue' column's record with null-values has 0 product sold. This means that we cannot put 0 to the missing values. Since there are products sold, the revenue could not be 0.
  4. I tried to plot the distribution of value per item. There were items with average revenue of 15 /item. After calculation, the average revenue per item is $9. Therefore, I can impute the value with nb_sold * 9 formula to fill the null records.

B. Checking for Anomalies

  1. There's also something not right when looking at the max value in the years_as_customer's column. There are two records that the customer has already been a customer to the company for 63 and 47 years whereas the company has only lasted for 41 years. I need to confirm to the sales rep about this if the data is necessary for the presentation. But for now, I can neglect this outliers.
  2. It was found out that the values in the sales_method column need to be fixed. Since there were 5 unique values where there should be only 3. I replaced the values so that only the right sales method is displayed, they are: Call, Email, and Email + Call
  3. There was no anomaly in the customer_id column. All the values are unique, meaning that no same customer doing repeat purchase on the record. Need to clarify with sales team if needed.

C. Checking for Data Types

  1. There is no need to fix the data type since all the data types are correct in representing the data in the respective column.

2. Exploratory Data Analysis to answer customer's questions

1. How many customers were there for each approach?

Here is the summary:

  • There are 7466 customers that were approached using Email only
  • There are 4985 customers that were approached using Call only, and
  • There are 2549 customers that were approached both using Call and Email

2. What does the spread of the revenue look like overall? And for each method?

The overall spread of the revenue could be visualized by plotting histogram for the revenue.

The mean of the overall revenue is $90.1

With overall standard deviation of $40.96

Here is the spread of revenue from each sales method:

Visually, the boxplot below will show the spread of revenue for each sales method:

3. Was there any difference in revenue over time for each of the methods?

The graph above shows the average revenue over time for each sales method, they yield to different average revenue over time during the 6 weeks period.

  • The Email + Call method yield the most revenue during the 6 weeks period.
  • The Email only method has the second biggest revenue out of the three method.
  • Call only method ranks last in generating revenue.

However, despite the increase in average revenue over time for each sales method, it was found out that total revenue for Email method was significantly decreasing during the six week period. Below is the total revenue over time for each sales method.

Further deep dive shows that the number of customer for Email method has been decreasing over the six week period. Sales team needs to check whether they need to do something with the Email method. The number of customer for both the Call and Email+Call was steadily increasing.

4. Based on data, which method would you recommend we continue to use? Some of these methods take more time from the team so they may not the best for us to use if the results are similar.

The best method to use is Email + Call with the revenue result significantly higher compared to Call or Email only method. The growth of the revenue overtime using the Email + Call method is also bigger. As shown by the larger slope of the line graph answering question No. 3.

5. Is there any differences in customers between each group?

Customers that have been a long time customers tend to puchase via Email or Call only compared to the Email + Call method. The treshold seen is > 28 years customers. But not all customers with that threshold that has been with the company for that long prefers call or email only, sales team can tailor this.

Average site visit over time for each sales method also shows that the Email + Call method is the best among the three, although with no significant difference.

How should sales team monitor what they want to achieve? initial value for the metric based on the current data?

The sales team could monitor the average weekly sales and average week-on-week (WoW) sales growth to monitor how the sales of the new product line go. This could be done in overview or for each sales method. Below is the initial values of the two metrics (mean weekly revenue in USD($) & mean weekly growth in percent vs previous week(%))) for each sales method:

On top of those two metrics, sales team could also track the ARPU (Average Revenue Per Unit sold) and Customer Count Growth.

Final Summary & Recommendations

After analyzing the data provided by the sales team. There are few key points that could be taken, they are:

  1. The revenue distribution is not normal. Most of the revenue falls into the left side of the normal distribution graph. Most of the revenue falls in the range of $50/customer and $90/customer. Average revenue per customer is $90.1 with $40.96 standard deviation.
  2. The average revenue per item sold is $9.01
  3. Revenue for Call and Email+Call method was growing over the six week period while the revenue of Email sales method was decreasing. The decrease of revenue in the Email method was probably due to the decrease of customer count.
  4. The average revenue for each sales method was increasing steadily over the six weeks period. With the Email+Call contributing to the highest average revenue at $220 in the 6th week, followed by Email method with $130 at the 6th week, and the third is Call method at $70 in the 6h week).
  5. The Email method yield better revenue compared to the Call method. Both in average and in exact term. But the total revenue in the Email method was decreasing during the 6 weeks period. Which relates to the decrease of customer count. Sales team need to check on this, since Email delivers the biggest revenue on the first week.
  6. Looking at the overall data, we recommend the Email+Call method to be implemented. Since it would yield better revenue and revenue growth compared to the other two. The Email+Call method also require shorter time (10 minutes) for the employee compared to the Call only method (30 minutes).
  7. The recommended metrics to track are:
    • Average Weekly Revenue
    • Week on Week Revenue Growth
    • Number of customer growth, and
    • Average revenue per unit
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.stats import iqr

#Step 1 - Data Validation
#1.1 Getting the data type and the possible columns with null values. Also checking the summary statistics of each variable.
df = pd.read_csv('product_sales.csv')

#1.2 Filtering the data to subset the non-null value to get the average revenue/item without the records with null values.
df_non_null = df.dropna(subset=['revenue'])

df_non_null['revenue_per_item'] = df_non_null['revenue']/df_non_null['nb_sold']
avg_rev_per_item = df_non_null['revenue_per_item'].mean()

#1.3 Filling the null value in the revenue columns with formula: nb_sold * average revenue per item
df['revenue'] = df['revenue'].fillna(avg_rev_per_item * df['nb_sold'])

#1.4 Checking the off-values from the years_as_customer column
df[df['years_as_customer'] > (2025-1986)]

#1.5 Checking unique values in sales_method column & changing the method
df['sales_method'].unique()
df['sales_method'] = df['sales_method'].replace({'em + call':'Call','email':'Email'})

#Step 2 - Answering Sales Rep Questions
#2.1 How many customers were there for each approach?
approach_grouping = df.groupby('sales_method')['sales_method'].count()

#2.2 What does the spread of the revenue look like overall? And for each method?
#2.2.1 Revenue distribution
plt.hist(df['revenue'], bins=20)

sns.catplot(data=df, kind='box', ci=False, x='sales_method', y='revenue', order=['Email','Call', 'Email + Call'])

#2.2.2 Mean of revenue
mean_rev = df['revenue'].mean

#2.2.3 Standard Deviation
rev_std = np.std(df['revenue'])

#2.2.4 Mean and Std dev for each method
mean_std_grouped = round(df.groupby('sales_method')['revenue'].agg(['mean','std']), 2)

#2.3 Was there any difference in revenue over time for each of the methods?
sns.relplot(kind='line', data=df, x='week', y='revenue', hue='sales_method', ci=None, col_order=['Call','Email','Email + Call'])

#Plotting the sum of revenue per week for each sales method during the 6-week period
rev_sum = df.groupby(['sales_method','week'], as_index=False)['customer_id'].sum()
sns.relplot(kind='line', data=rev_sum, x='week', y='customer_id', hue='sales_method', ci=None)

#Plotting sum of customer each week for each sales method during the 6-week period
rev_sum = df.groupby(['sales_method','week'], as_index=False)['revenue'].sum()
sns.relplot(kind='line', data=rev_sum, x='week', y='revenue', hue='sales_method', ci=None)

#3 Is there any differences in customer characteristics between each sales method?
sns.heatmap(df.corr(), annot=True)
sns.catplot(kind='bar', data=df, x='years_as_customer', y='revenue', col='sales_method', ci=None,  col_order=['Call','Email','Email + Call'])
sns.relplot(kind='line', data=df, x='week', y='nb_site_visits', hue='sales_method', ci=None)

#4 Metrics and the initial value
#4.1 Average revenue for each sales method
df.groupby('sales_method')['revenue'].mean()

#4.2 WoW revenue growth for each sales method
rev_grouped_per_week = df.groupby(['sales_method','week'], as_index=False)['revenue'].sum()
rev_grouped_per_week['pct_growth'] = rev_grouped_per_week.groupby('sales_method')['revenue'].pct_change()
Hidden output