Data Analyst Practical Exam
# Import packages that will be used.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
product_sales = pd.read_csv('product_sales.csv')
# Validate & Summarize Sales Method
product_sales['sales_method'] = product_sales['sales_method'].str.lower().replace('em + call', 'email + call')
product_sales['sales_method'] = pd.Categorical( product_sales['sales_method'], categories=['email', 'call', 'email + call'])
#Validate Revenue
med_revenue = product_sales['revenue'].quantile(0.50)
product_sales['revenue'] = product_sales['revenue'].fillna(med_revenue)
product_sales['revenue'] = round(product_sales['revenue'], 2)
#Validate Years as Customers
product_sales = product_sales[product_sales['years_as_customer'] <= 40]
Data Validation
The original data set provided contains 15,000 rows and 8 columns. After validation, there were 14,988 rows remaining. The data validation steps taken are as follows:
- week: Numeric data type with 6 unique values without missing values. Maximum time of 6 weeks with a minimum of 1 week. No outliers. Data is as presented in the provided table description; no cleaning is needed.
- sales_method: Object data type with initially 5 unique values and no missing values. Issues with capitalization and spelling (i.e. 'Email + Call' & 'em + call'). Data was transformed to all lower case and 'em + call' was changed to 'email + call'. After cleaning, there are 3 unique sales methods. This is in accordance with the table description provided.
- customer_id: Object data type, unique identifier for each customer; 15,000 unique values in the original dataset. Data is as presented in the table description provided; no cleaning is needed.
- nb_sold: Numeric data type with values ranging between 7 and 16. No missing values or outliers. Data is as presented in the provided table description; no cleaning is needed.
- revenue: Numeric data type, same as description; 1,074 missing values or 7.16% of records in the original data set were missing revenue values. Missing values were replaced with median revenue due to the data being right skewed. Data rounded to two decimal places.
- years_as_customer: Numeric values; 2 values exceeded the time the company was in business and were most likely inputted in error. Those two values were dropped from the data set. Final values ranged from 0 to 40, this is to be expected.
- nb_site_visits: Numeric values, 27 unique values; range of values between 12 and 41, no missing values. Data is as presented in the table description provided; no cleaning is needed.
- state: Object data type, 50 unique values; no missing values. Data is as presented in the table description provided; no cleaning is needed.
How many customers were there for each approach?
The most frequent sales method performed by the sales team was email. 7,465 of the sales, or 49.77%, were done by email. Calls accounted for 4,961 sales, or 33.08%. The combination of both methods were used for the remaining 2,572 sales, or 17.15%. Since emails required the least amount of effort, these results are to be expected.
ax1 = sns.set_style('ticks')
ax1 = sns.histplot(data=product_sales, x='sales_method', palette='bright', hue='sales_method')
ax1 = plt.title("Distribution of Sales by Method")
ax1 = plt.xlabel("Sales Method")
ax1 = plt.ylabel("# of Sales")
What does the spread of the revenue look like overall? And for each method?
The spread of revenue overall is right skewed with a median revenue of 89.50, an average revenue of 93.62, and outliers towards the higher end of revenue values.
Emails accounted for 51% of the total revenues generated. The 'Email and Calls' method represented 31% of total revenues and just calls represented only 17%.
However, the distribution of revenue per sale by sales method tells a better story. Since the highest frequency of sales were performed via email, email is showing the highest total revenue. Yet, the combination of email & calls accounted for the highest median revenue per sale of 182.14 as well as the largest spread of revenue values. This is almost twice as high as the highest total revenue generating method, email, which only had a median revenue of 94.27 per sale. The most laborious sales method, calls, had the smallest median revenue per sale of 49.93 and the smallest spread of revenues.
While emails are the easiest method of sales, the combination of email and calls shows the highest potential for revenue and only added a small amount of effort to the sales team (approximately ten minutes per customer). The large revenue disparity between methods indicates that strictly calling customers should be a last resort.
Going forward, Email & Calling customers should be best practice for Pens and Printers (as indicated by the revenue data).
per_of_rev = pd.DataFrame(product_sales.groupby(by='sales_method')['revenue'].sum())
per_of_rev.loc["Total"] = per_of_rev.sum(axis=0)
per_of_rev = per_of_rev.reset_index()
per_of_rev["rev_percent"] = round((per_of_rev["revenue"] / per_of_rev.iloc[3,1]),2)
data_use = per_of_rev.drop(index=3, axis=0)
print(data_use)
sns.barplot(data=data_use, x='sales_method', y='revenue', hue='sales_method', palette='bright')
plt.xlabel('Sales Method')
plt.ylabel('Total Revenue')
plt.title('Total Revenue by Sales Method')
plt.show()
print(per_of_rev)
ax2 = sns.set_style('ticks')
ax2 = sns.boxplot(data=product_sales, x="sales_method", y="revenue", palette='bright')
ax2 = plt.title("Distribution of Revenue by Sales Method")
ax2 = plt.xticks(ticks=['email', 'call', 'email + call'], labels=['Email', 'Call', 'Email & Call'])
ax2 = plt.xlabel("Sales Method")
ax2 = plt.ylabel("Revenue")
plt.show()
ax3 = sns.set_style('ticks')
ax3 = sns.boxplot(data=product_sales, x='revenue')
ax3 = plt.title("Distribution of Revenue")
ax3 = plt.xlabel("Revenue")
plt.show()
Was there any difference in revenue over time for each of the methods?
Irregardless of week sold, the 'Email & Call' sales method consistently had higher revenues per sale over the 'Email only' and 'Call only' sales method(s). All three sales methods show the same pattern: there is a positive relationship between revenues and weeks since the product was launched.
Another visualization showing the same pattern, below the sharper increase in revenue over time for the 'Email & Call' sales method is highlighted.
Based on what the data is showing it can be interpreted that customers are spending more money as the amount they are contacted increases over time through multiple methods. While the customer is spending more money regardles of sales method, the combination of both is showing to be the most effective method.
A look at the table below shows the top 10 sales by revenue. The trend from above is shown again: the top 10 highest revenue producing sales were all done in the 6th week since the product launch and all were performed using the 'Email and Call' sales method. Additionally - it is noted that 'returning customers' didn't generate higher revenues, rather customer's who have only been with us for 1-3 years.
(While this is a relatively 'crude' visualization of the data, it is being used to further support the previous visualizations and interpretations of the data by the analyst.)
g = sns.FacetGrid(data=product_sales, col="sales_method", hue='sales_method', palette='bright')
g.map(sns.barplot, 'week', 'revenue', alpha=.7)
g.fig.subplots_adjust(top=.85)
g.fig.suptitle("Revenue by Week for each Sales Method (Seperated by Method)")
plt.show()
ax = sns.barplot(data=product_sales, x='week', y='revenue', hue='sales_method', alpha=.7, palette='bright')
plt.title("Revenue by Week for each Sales Method")
plt.show()
Based on the data, which method would you recommend we continue to use?
Based on the data provided, I would recommend we continue to use the "Email and Call" and "Email" sales methods.
Per your email, with both of these methods customers, are intially emailed when the product was launched. When they were followed up with a phone call a week later (the "Email and Call" sales method), company revenues were higher per sale the first week. As the weeks since the product launch progress, the number of units sold and revenue's generated by the "Email and Call" sales method continued to increase and outperformed the other two (with outliers as exceptions) per sale.
While there is a positive trend between number of units sold and revenue per sale with all sales methods, "Email and Call" outperforms both other methods on average.
Without restructuring the sales teams' work flow and CRM methods, I would recommend all customer's be initially emailed when the product line is launched and then followed up with a phone call a week later. The email requires very little work from the sales team and the call only takes ten minutes per customer. They can then follow up with the customer via email in weeks 3-4 if needed. This would optimize revenue and units sold per sale while minimzing time spent on each sale.
g = sns.FacetGrid(data=product_sales, col="week", col_wrap = 3, hue='sales_method', palette='bright')
g.add_legend()
g.map(sns.barplot, 'nb_sold', 'revenue', alpha=.7)
g.fig.subplots_adjust(top=.85)
g.fig.suptitle("Revenue(per sale) for Units Sold by Week for each Sales Method")
g.set_xlabels("# of Units Sold per Sale")
g.set_ylabels("Revenue per Sale")
g.add_legend()
plt.show()
Other Factors to Consider
While my analysis primarily focuses on: weeks since product launch and method of sales, initially other factors were considered with the data provided.
Number of Units Sold and Site Visits
The number of units sold per sale and the number of times the customer visited the site showed a relatively weak positive relationship. Even amongst sales methods, any differences in number of units sold can most likely be attributed to the sales method itself. The sales method doesn't seem to have impacted the number of site visits either.
Revenue by Years as Customer
Similarly, the average years a customer has had a relationship with us is only 4.97 years, with a median of 3 years. While we value the length of our relationship with our customers, the majority of sales were with customers of tenure 3 years and less.
Revenue by States
Finally, the top states for total revenue seems to mirror the U.S. population distribution. California makes up 12.3% of total revenue. California, Texas, and New York combined account for 26.7% of total revenues.