📝 Task List
Your written report should include written text summaries and graphics of the following:
- Data validation:
- Describe validation and cleaning steps for every column in the data
- Exploratory Analysis:
- Include two different graphics showing single variables only to demonstrate the characteristics of data
- Include at least one graphic showing two or more variables to represent the relationship between features
- Describe your findings
- Definition of a metric for the business to monitor
- How should the business use the metric to monitor the business problem
- Can you estimate initial value(s) for the metric based on the current data
- Final summary including recommendations that the business should undertake
Start writing report here..
Data Validation
The dataset contains 15000 rows and 8 columns before cleaning and validataion. All the columns against the criteria in the dataset table are validated as follows:
- week : 6 unique values without missing values, same as description. No cleaning is needed.
- sales_method : after cleaning, there are 3 methods, no missing values, same as description. The cleaning process involves replacing 2 sales method names with with correct formatting.
- customer_id : 15000 unique values without missing values, same as description. No cleaning is needed.
- nb_sold : numeric values without missing values, same as description. No cleaning is needed.
- revenue : numeric values rounded to 2 decimal places, 1074 null values are replaced with the average revenue of the respective number of products sold (nb_sold).
- years_as_customer: numeric values without missing values, same as description. The cleaning process involves dropping 2 rows with error (years_as_customer > 40, while the company was founded in 1984).
- nb_site_visits : numeric values without missing values. No cleaning is needed. However, the period observed is "6 weeks", not "6 months" as in the description.
- state : 50 states without missing values, same as description. No cleaning is needed.
After the data validation, the dataset contains 14998 rows and 8 columns without missing values.
import pandas as pd
product_sales = pd.read_csv('product_sales.csv')
product_sales#Data Cleaning
product_sales['sales_method'] = product_sales['sales_method'].replace("em + call","Email + Call")
product_sales['sales_method'] = product_sales['sales_method'].replace("email","Email")
print('sales_methods: ', product_sales['sales_method'].unique())
product_sales['revenue'] = product_sales.groupby('nb_sold')['revenue'].transform(lambda x: x.fillna(round(x.mean(),2)))
print('revenue_non_null: ', product_sales['revenue'].isna().count())
# Dropping rows that are longer than 40 'years_as_customer'
product_sales.drop(product_sales[product_sales['years_as_customer'] > 40].index, axis=0, inplace=True)
# Checking to see that they've been removed
outliers = product_sales[product_sales['years_as_customer'] > 40]
print('outliers:',outliers)
#Check
print(product_sales.info())
print(product_sales.isnull().sum())
product_sales#import necessary library
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd#How many customers were there for each approach?
sm_plot = sns.countplot(data=product_sales, x="sales_method", hue ='sales_method',hue_order=['Email + Call','Email','Call'])
sm_plot.set_xlabel(" ")
sm_plot.set_ylabel("Number of customers")
sm_plot.set_title("Number of customers for each Sales method")
print(product_sales['sales_method'].value_counts())
# What does the spread of the revenue look like overall? And for each method?
rev_plot = sns.histplot(data=product_sales, x='revenue', binwidth=20)
rev_plot.set_ylabel("Customers")
rev_plot.set_title("The distribution of Revenue")ind_rev_plot = sns.boxplot(data=product_sales, x='sales_method',y='revenue', hue ='sales_method',hue_order=['Email + Call','Email','Call'])
ind_rev_plot.set_xlabel("")
ind_rev_plot.set_title("Revenue distribution")nbs_method_plot = sns.boxplot(data=product_sales, x='sales_method',y='nb_sold', hue ='sales_method',hue_order=['Email + Call','Email','Call'])
nbs_method_plot.set_xlabel("")
nbs_method_plot.set_ylabel("Products")
nbs_method_plot.set_title("Number of product sold distribution")sbw = product_sales.groupby(['week','sales_method'])['revenue'].sum().reset_index().pivot(index="week", columns="sales_method", values="revenue")
sbw = sbw[['Email + Call','Email','Call']]
sbw.plot(kind="line", marker="o")
plt.title("Total Revenue by Sales Method over weeks")
plt.xlabel("Week")
plt.ylabel("Total Revenue")
plt.legend(title="Sales Method")
plt.show()cbw = product_sales[['week', 'sales_method']]
cbw = cbw.groupby(['week', 'sales_method']).size().reset_index(name='count')
cbw_pivot = cbw.pivot(index='week',columns='sales_method',values='count')
cbw_pivot = cbw_pivot[['Email + Call','Email','Call']]
cbw_pivot.plot(kind="line", marker="o")
plt.title("Number of customers by Sales Method over weeks")
plt.xlabel("Week")
plt.ylabel("Customers")
plt.legend(title="Sales Method")
plt.show()
print(cbw)The differences among Sales methods
Based on the collected data, the sales team is implementing the Email method the most to connect with customers, which is about 1.5 times higher than the second most used method - Call. And the least used method is Email + Call, which is applied to half the number of customers compared to Call.
This result is understandable as Email requires the least time and effort from the team.
With the above-mentioned methods, the Revenue gained from each customers throughout 6 weeks is majorly less than 200 GBP, any values greater can be considered as outliers.
Regarding each sales method, Email + Call seems to have the highest number of product sold, as well as the highest revenue. The average numbers (median) of product sold using Mail and Call are similar, at around 10 products/customers. However, the revenue generated by the Mail method ranks second, while the Call method yields the lowest revenue.
Next, it is interesting to observe the changes to each sales method over the six week period. In the first week, the Email method reaches the highest number of customers, followed by Call and Email + Call. This might explain the similar ranking of the total revenue earned in the first week.
Over the period, the number of customers and the revenue gained by Email method drop significantly to the lowest rank. The number of customers reached by the remaining sales methods experiences a slight increase until week 5, followed by a decrease in week 6. In terms of revenue, the Call method shows a similar pattern, whereas the revenue from the Email + Call method visibly increases from week 1 to week 5 and only experiences a slight decline in the final week.
sales_by_week = product_sales.groupby(['week','sales_method'])['revenue'].sum().reset_index()
sales_by_week_method = sales_by_week.groupby('week')['revenue'].transform('sum')
sales_by_week['pct'] = round((sales_by_week['revenue']/sales_by_week_method *100),2)
sales_by_week
pivot_sbw = sales_by_week.pivot_table(index='week', columns='sales_method', values='pct', aggfunc='sum')
#stacked bar chart
pivot_sbw = pivot_sbw[['Email + Call','Email','Call']]
pivot_sbw.plot(kind='bar', stacked = True, label = 'sales_method')
plt.ylabel('% of revenue')
plt.title('% of revenue by sales method each week')
plt.legend(title='Sales Method', loc = 'upper right', bbox_to_anchor=(1.4,1))
plt.show()
print(pivot_sbw)
nbp_by_week = product_sales.groupby(['week','sales_method'])['nb_sold'].sum().reset_index()
nbp_by_week_method = nbp_by_week.groupby('week')['nb_sold'].transform('sum')
nbp_by_week['pct'] = round((nbp_by_week['nb_sold']/nbp_by_week_method *100),2)
nbp_by_week
pivot_nbw = nbp_by_week.pivot_table(index='week', columns='sales_method', values='pct', aggfunc='sum')
#stacked bar chart
pivot_nbw = pivot_nbw[['Email + Call','Email','Call']]
pivot_nbw.plot(kind='bar', stacked = True, label = 'sales_method')
plt.ylabel('% of no. product sold')
plt.title('% of no. product sold by sales method each week')
plt.legend(title='Sales Method', loc = 'upper right', bbox_to_anchor=(1.4,1))
plt.show()
print(pivot_nbw)Business Metric
The goal is to pick out the best sales strategy for the new product line, I would recommend using the metric "percentage of products sold by sales method each week" to measure the effectiveness of the sales methods.
Although "number of customers" might often seen as the metric to measure the effectiveness of sales methods, the collected data suggests that current most used method has lower number of products sold and lower revenue, compared to the other approaches. Therefore, to identify the most recommendable sales method, the metrics should rely on the number of products sold or revenue instead. However, as there are null values in revenue data, the number of products sold is a more precise estimation.
According to our currently available data, at the end of the observed period (week 6), Email + Call is currently has the highest share (50.18%), and that of Call and Email are 34.14% and 14.98% correspondingly.