Background
- The following report is commissioned on behalf of the Head of Analytics at Pens and Printers.
Context:
-
Established in 1984, Pens and Printers are a B2B wholesale office supply company that provides high quality office products to large organizations, from stationary to office furniture and devices.
-
Timely response to customer feedback is imperative for the company’s sales team to ensure that they can adapt their strategy accordingly to maximize revenue.
-
Six weeks ago the team launched a new line of office stationery. Despite the world becoming increasingly digital, demand for notebooks, pens and sticky notes still exists.
-
For this product launch, the sales’ messaging strategy has centered on how tools for brainstorming can help to improve and promote creativity.
-
In order to cater to the differing needs and expectations of their customers, the team have chosen 3 different outreach methods, the details for which can be found below:
- Email: Customers receive an email when the product line was launched, then a follow-up email three weeks later. Minimal work for the team.
- Call: Customers were called by a member of the sales team. On average members of the team were on the phone for around thirty minutes per customer.
- Email and call: Customers were first sent the product information email, then called a week later by the sales team to talk about their needs and how this new product may support their work. The email required little work from the team, the call was around ten minutes per customer.
Brief:
-
In general the team would like to understand, given the investment and demand of each approach, which method looks most effective at maximising revenue.
-
Thus the primary metric we will use to evaluate campaign success will be overall revenue attributed to each group.
-
The following analysis will help to answer these questions put forth by the team:
- What does the spread of the revenue look like overall?
- How many customers were there for each method and what does the spread of the revenue look like for each method?
- Was there any difference in revenue over time for each of the methods?
- What method would optimize our return on investment (maximize revenue for given time spent) ?
Things to Consider:
-
Although we are given some information regarding time frames it is not explicitly stated when the calls took place.
-
For instance it’s not clear if all the calls that were made to the call group all occurred on the same day during the product launch, or if the calls were made at any time during launch week. Same for the 10 min follow up calls made to the Email and call group.
-
Thus our analysis will assume that all users who were supposed to receive calls were all called on the same day of their respective groups e.g. All users in the call group were only called on the first day of the launch week and all users in the Email + Call group were only called on the first day of the second week after launch.
-
Another important factor to consider which is not explicitly stated, is the outcome of the call. We cannot be certain whether all calls were successful at reaching customers. Without this information it is difficult to assess whether successful sales conversions are the result of actual sales pitches, or did just being called suffice e.g users may have converted even though they missed the call.
-
Thus our analysis will assume that the result of a successful sales conversion is not tied to the content of a call, but just the fact you received a call regardless of whether you were able to speak to the sales team.
-
This assumption calls into question the validity of needing 30 and 10 min call conversations, perhaps in future iterations of analysis we can include a call outcome to see if customers who successfully received calls had higher conversion rates than those who missed them.
Data Validation
Summary:
- The sales data set contains 15000 rows and 8 columns.
- According to the data dictionary only 3 out of the 8 fields should be character (sales_method, customer_id, state), revenue should be the only float field and the remainder 4 fields should be all be integer.
- The only column with missing values in the data set was revenue.
Revenue:
- Since the field number of units sold (nb_sold ) has no missing values with a min value of 7, we cannot infer that missing revenue values should be assigned to zero.
- Intuitively revenue is most affected by number of items sold, since revenue is units sold multiplicative of price.
- Different states also have different tax rates which can affect revenue.
- We are going to assume that the team did not give out any special discounts that took place during the 6 week promotional stint so we can rule out the impact of time (week) and years as a customer.
- Thus the only factors that should be considered for estimating missing revenue are number of units sold (nb_sold) and state.
- For our analysis we imputed missing revenue values with the average revenue segmented by nb_sold and state.
Years as Customer:
- Since the company has been active since 1984 the maximum number of years a customer could be active is only 40 (2024 -1984 = 40).
- Only 2 customers had values that exceeded our calculated value (customer ids: 13731: 63, 13800: 47).
- These values were replaced with 40 to reflect the correct maximum possible number of years as a customer (this year:2024 - 1984).
Sales method
- The field sales_method originally had 5 unique values ("Email", "Email + Call", "Call", "em + call", "email") which cleaned to match the 3 listed in the data dictionary ("Email", "Email and Call", "Call").
- Since by definition “Email and Call” customers were first sent the product information email, then called a week later, any “Email and Call” customer who purchased during the first week would be by default an "Email" customer.
- Thus “Email and Call” customers who made a purchase during week 1 were reclassified as “Email”. We have decided to make a separate field called cleaned_sales_method to reflect this.
Below is a summary of the data validation findings against the provided data dictionary:
1. week:
- No missing values
- 6 unique integer values 1 through 6 ( 1,2,3,4,5,6)
- No cleaning needed.
2. sales_method:
- No missing values
- 5 unique values: "Email", "Email + Call", "Call", "em + call", "email".
- Items were cleaned to match the fields listed in the data dictionary ("Email", "Email + Call", "Call").
3. customer_id:
- No missing values, 15000 unique string values, no cleaning needed.
4. nb_sold:
- No missing values
- 10 unique integer values 7 to 16 (7, 8, 9, 10, 11, 12, 13, 14, 15, 16)
- No cleaning needed.
5. revenue:
- 6743 unique float values out of a posible 15000
- 1074 missing values which accounts for ~ 7.16% of data set.
- Since the field number of units sold (nb_sold ) has no missing values with a min value of 7, we cannot infer that missing revenue values should be assigned to zero.
- For our analysis we imputed missing revenue values with the average revenue segmented by nb_sold and state.
6. years_as_customer:
- No missing values, 42 unique integer values 0 to 63.
- Since the company has been active since 1984 the maximum number of years a customer could be active is only 40 (2024 -1984 = 40).
- Only 2 customers had values that exceeded our calculated value (customer ids: 13731: 63, 13800: 47).
- These values were replaced with 40 to reflect the correct maximum possible number of years as a customer (this year:2024 - 1984).
7. nb_site_visits:
- No missing values
- 27 integer unique values 12 to 41 (12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,29, 30, 31, 32, 33, 34, 35, 36, 37, 41)
- No cleaning needed.
8. state:
- No missing values
- 50 unique string categories ('Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming')
- No cleaning needed.
4 hidden cells
6 hidden cells
Baseket Size (nb_sold)
- We started by sense checking number of units sold because this will verify if all users in the data set have completed a purchase
- We can see that there are no missing values and the minimum number of units sold is 7, which means the data only includes successful purchases.
- Thus the findings of this analysis can only be applied to customers who have successfully completed a purchase.
- We cannot evaluate how successful each sales method is at converting a sales purchase since we are unable to calculate conversion rate, without knowing how many customers who did not go on to make a purchase for each method.
1 hidden cell
- Most of our customers had basket sizes between 8 and 11 items
Site Visits (nb_site_visits)
- The number of site visits has no null values and a minimum value of 12.
- This suggests our data is subset for customers who successfully made an online purchase only.
- Thus the findings of this analysis is limited to ecommerce customers who made a purchase.
- The data suggests that 50% of ecommerce customers who made a purchase did so by visiting the site 23-27 times before their purchase.
- 75% users who made a purchase did so with 27 or less site visits.
- Creating propensity triggers based on site visit thresholds could be beneficial for the team