Data Analyst Professional Practical Exam Submission
You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.
You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.
📝 Task List
Your written report should include written text summaries and graphics of the following:
- Data validation:
- Describe validation and cleaning steps for every column in the data
- Exploratory Analysis:
- Include two different graphics showing single variables only to demonstrate the characteristics of data
- Include at least one graphic showing two or more variables to represent the relationship between features
- Describe your findings
- Definition of a metric for the business to monitor
- How should the business use the metric to monitor the business problem
- Can you estimate initial value(s) for the metric based on the current data
- Final summary including recommendations that the business should undertake
Start writing report here..
Hello, Head of Analytics!
I've done the analysis you asked for. Sending you the report. I divided the content into three main parts: Data validation, Exploratory data analysis, final summary.
Data validation
The original data is 15000 rows and 8 columns. After validation there were 13926 rows remaining. The following describes what I did to each column:
- Week: Values were numerical and column had no missing data or unusual values, as expected.
- Sales_method: Had other data type (object instead of category), and has some extra values ('em + call', 'email'm 'call') apart from 'Email', 'Call', 'Email + Call'. I changed dtype to category and replaces all extra values with Email, Call, Email+Call as appropriate.
- Customer_id: Had other data type (object instead of category). I changed dtype to category. All values were unique, as expected.
- Nb_sold: Values were numerical and column had no missing data or unusual values, as expected.
- Revenue: contained missing data. I decided to look in details on all column before I would choose what to do with it. I've made a new dataset based on rows with null values in this column to see whether there is any system (my idea was that there may be missing values i.e. based on state or sales method), but found nothing specific. As all questions of sales department related to revenue, I dropped all rows with null values in this column. Then I rounded values to two decimals.
- Years_as_customer: Values were numerical as it should be but the maximum value was 63, which is impossible if the company was founded in 1984 (2023 - 1984=39, therefore, no values may be higher than that). I found that there are only 2 values > 39, it is 47 and 63. Assuming that there should be 37 and 36 respectivelly, I replaced them.
- Np_site_visits: Values were numerical and column had no missing data or unusual values, as expected.
- State: Had other data type (object instead of category). I changed dtype to category. I found no typos or strange values.
Exploratory data analysis
Our goal was to see what method of communication we should stay on. Which one has the highest revenue, is there any difference between methods in revenue spread, how the revenue changed over time in each method, etc.
As Sales Rep interested in how many customers were there for each approach, I made a following barplot for this (Pic 1).
Pic 1. Barplot of total revenue per method.
As you can see, the majority of customers were informed due to email method. Also I've made a table for Rep which included info about how many records were dropped due to missing values in revenue (Table 1). I will inform them about big percentage of missing data for Email+Call method.
Table 1. Number of customers per each method with missing data.
Also, I created a histogram and a single boxplot to show how the spread looks overall, answering their second question (Pic 2).
Pic 2. Histogram and boxplot of total revenue.
As you can see, the data has a non-normal distribution (right-skewed) with several peaks. This looks like it comes from several subgroups. Hovewer, the mean and median are quite close.
I have done a barplot of total revenue by method to see which method earned the most money (Pic 3).
Pic 3. Barplot of total revenue per method.
And here you can see that total revenue of Call+email method is about 2/3 from total revenue of Email method, despite the difference between the groups in about 3 times. This hinted at a much higher income from the group Email+Call.
I have done a histogram and boxplots of revenue based on method, as they would like me to do, where you can see that the Email+Call had the highest average income and was on the right tail of histogram values (Pic 4).
Pic 4. Histogram and boxplot of revenue per method.
Thus, the most profitable in terms of revenue is the Email + Call method.
Sales Rep also would like me to show was there any difference in revenue over time for each method. I have done a lineplot (Pic 5) and a barplot (Pic 6) to show changes in average revenue and total revenue per method per week.
Pic 5. Lineplot of average revenue per method over time.
Pic 6. Barplot of total revenue per method over time.
We can see that the highest average revenue per customer was for Email + Call throughout all the time, and the total revenue increased for this group over time. This correlates with an increase in the number of buyers depending on the method over time, so it would be useful to also record the time of purchase and the time of informing the buyer to see if there is a difference in response time to informing.
Also Sales Rep additionaly asked to inform if I found any other findings from data. I created a new column called 'Customer category' based on the values in 'years_as_customer' column. If the value == 0, the value in new category was 'New customer', if value>0 then 'Existing customer'. Then I made a boxplot to see whether there is a difference in revenue per old and new customers. I found none significant difference (Pic 7)
Pic 7. Boxplot of revenue per method by Customer category
I have made a countplot to see is there any visible difference between the method in number of new and existing customers and visually it seemed that for Email+Call method the percent of new customers was higher (Pic 8)
Pic 8. Number of new and old customers per method
Then I calculated Percentage of Revenue from New vs. Existing Customers by formula ([New / Existent] Customers Revenue / Total Revenue) and found that Email+Call has the highest percentage of revenue from new customers (Table 2)
Table 2. Percentage of revenue per new/old customers per method
Metrics for the business to monitor
I recommend to use such metrics:
• Total revenue per method (TRPM) = sum of revenue per customers that were informed by a certain method
• Average revenue per customer (ARPC) = Total Revenue/Customer Count
• Percentage of Revenue from New vs. Existing Customers = [New/Existent] Customers Revenue / Total Revenue
We can use them to see how much revenue we take per day, month, year etc and to see the difference between methods irrespectively to the number of customers. Also this allow us to monitor how good is the method for attracting new customers.
TRPM can be seen in Table 3:
Table 3. Total revenue per method
ARPC can bee seen in Table 4:
Table 4. Average revenue per customer by method
Percentage of Revenue from New vs. Existing Customers can be seen in Table 5:
Table 5. Percentage of revenue per new/old customers per method
Final summary:
Which method to choose? • Email + call has the highest average revenue per customer (which is growing over time), the highest percent of new customers revenue among all who bought the product, quite fast way to deliver. • Just email has a moderate average revenue per customer, but the fastest to deliver. • Just calling seems to be the less effective – the lowest average revenue per customer, the lowest percent of new customers revenue among all who bought the product, the longest way to deliver.
Therefore, I would recommend to use Email+Call method.
Metrics for the business to monitor: • Total revenue per method (TRPM) • Average revenue per customer (ARPC) • Percentage of Revenue from New vs. Existing Customers
What else we can do? To start collecting data about date of informing the customer an date of buying the product by this customer to see whether there is a difference between methods in this.
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#importing data
sales = pd.read_csv(r'https://s3.amazonaws.com/talent-assets.datacamp.com/product_sales.csv')
#EXPORATORY DATA ANALYSIS
## 1. exploring the whole dataset
display(sales.head())
display(sales.info())
##viewing descriptive statistics for 'week' column
print('Descriptive statistics for week column:')
display(sales['week'].describe())
##viewing unique values for 'sales_method'
print('Unique values for sales_method:')
display(sales['sales_method'].unique())
#counting the number of unique customers
print('The number of unique customers:', sales['customer_id'].nunique())
##viewing descriptive statistics for 'nb_sold' column
print('Descriptive statistics for nb_sold column:')
display(sales['nb_sold'].describe())
##viewing descriptive statistics for 'revenue' column
print('Descriptive statistics for revenue column:')
display(sales['revenue'].describe())
##viewing descriptive statistics for 'years_as_customer' column
print('Descriptive statistics for years_as_customer column:')
display(sales['years_as_customer'].describe())
##viewing descriptive statistics for 'nb_site_visits' column
print('Descriptive statistics for nb_site_visits column:')
display(sales['nb_site_visits'].describe())
##viewing unique values for 'state'
print('Unique values for state:')
display(sales['state'].unique())
## DATA CLEANING
## removing extra strings in sales_method
sales['sales_method'] = sales['sales_method'].str.replace('email', 'Email')
sales['sales_method'] = sales['sales_method'].str.replace('em', 'Email')
sales['sales_method'] = sales['sales_method'].str.replace('call', 'Call')
## rounding revenue to 2 decimals
sales['revenue'] = sales['revenue'].round(2)
#looking for extreme values in years_as_customer column
unusual_cust_sales = sales.loc[sales['years_as_customer'] > 39]
display(unusual_cust_sales)
#replacing 63 to 36 and 47 to 37 in years_as_customer
sales['years_as_customer'].replace(47, 37, inplace=True)
sales['years_as_customer'].replace(63, 36, inplace=True)
#How many customers there for each approach?
grouped_data = sales.groupby('sales_method')['customer_id'].count().reset_index()
display(grouped_data)
#Visualizing number of customers per approach
sns.set_style('whitegrid')
sns.color_palette('Set2')
sns.set(font_scale=1.5)
pal = sns.color_palette("Greens_d", len(grouped_data))
sns.barplot(data=grouped_data, x='sales_method', y='customer_id').set(xlabel='Sales method', ylabel='Customers', title='Number of customers for each approach')
plt.show()
#dropping rows that has null value in revenue column
sales.dropna(subset=['revenue'], inplace=True)
display(sales.info())
#How many customers there for each approach after excluding missing values?
grouped_data = sales.groupby('sales_method')['customer_id'].count().reset_index()
display(grouped_data)
#Creating new category of existing customers and new customers
sales['Customer category'] = ['New customer' if x == 0 else 'Existing customer' for x in sales['years_as_customer']]
## changing sales_method, customer_id, state dtype to categorical
cols = ['sales_method', 'customer_id', 'state','Customer category']
for col in cols:
sales[col] = sales[col].astype('category')# exploring the whole dataset again
display(sales.info())
display(sales.head())EDA
#Calculating sum per each method
sales_call = sales[sales['sales_method'] == 'Call']
sales_email = sales[sales['sales_method'] == 'Email']
sales_email_call= sales[sales['sales_method'] == 'Email + Call']
print(sales_call['revenue'].sum())
print(sales_email['revenue'].sum())
print(sales_email_call['revenue'].sum())
print(sales['revenue'].sum())# Calculating percentage of Revenue from New vs. Existing Customers per each method
#For email
sales_email = sales[sales['sales_method'] == 'Email']
new_cust_email = sales_email[sales_email['Customer category'] == 'New customer']
exist_cust_email = sales_email[sales_email['Customer category'] == 'Existing customer']
perc_new_cust_email = ((new_cust_email['revenue'].sum().round(2) / sales_email['revenue'].sum().round(2)) * 100).round(2)
perc_exist_cust_email = ((exist_cust_email['revenue'].sum().round(2) / sales_email['revenue'].sum().round(2)) * 100).round(2)
print('New customers in email method: ', perc_new_cust_email)
print('Existing customers in email method: ', perc_exist_cust_email)
#For calls
sales_call = sales[sales['sales_method'] == 'Call']
new_cust_call = sales_call[sales_call['Customer category'] == 'New customer']
exist_cust_call = sales_call[sales_call['Customer category'] == 'Existing customer']
perc_new_cust_call = ((new_cust_call['revenue'].sum().round(2) / sales_call['revenue'].sum().round(2)) * 100).round(2)
perc_exist_cust_call = ((exist_cust_call['revenue'].sum().round(2) / sales_call['revenue'].sum().round(2)) * 100).round(2)
print('New customers in call method: ', perc_new_cust_call)
print('Existing customers in call method: ', perc_exist_cust_call)
#For Email+Call
sales_email_call= sales[sales['sales_method'] == 'Email + Call']
new_cust_email_call = sales_email_call[sales_email_call['Customer category'] == 'New customer']
exist_cust_email_call = sales_email_call[sales_email_call['Customer category'] == 'Existing customer']
perc_new_cust_email_call = ((new_cust_email_call['revenue'].sum().round(2) / sales_email_call['revenue'].sum().round(2)) * 100).round(2)
perc_exist_cust_email_call = ((exist_cust_email_call['revenue'].sum().round(2) / sales_email_call['revenue'].sum().round(2)) * 100).round(2)
print('New customers in email+call method: ', perc_new_cust_email_call)
print('Existing customers in email+call method: ', perc_exist_cust_email_call)
#What does the spread of the revenue look like overall? And for each method?
sns.set_style('whitegrid')
sns.color_palette('Set2')
sns.set(font_scale=1.5)
fig, axes = plt.subplots(ncols=2, figsize=(15,7))
#making histogram for revenue values
sns.histplot(data=sales, x='revenue', ax=axes[0]).set(xlabel='Revenue', ylabel='Count', title='Revenue histogram (overall)')
#making histogram for revenue values per sales method
sns.boxplot(data=sales, y='revenue', ax=axes[1], showmeans=True, meanprops={"marker":"o", "markerfacecolor":"white", "markeredgecolor":"black", "markersize":"10"}).set(xlabel='Revenue', ylabel='Revenue spread', title='Revenue boxplot (overall)')
plt.show()#What does the spread of the revenue look like overall? And for each method?
sns.set_style('whitegrid')
sns.color_palette('Set2')
sns.set(font_scale=1.5)
fig, axes = plt.subplots(ncols=2,figsize=(15,7))
#making histogram for revenue values
sns.histplot(data=sales, x='revenue', hue='sales_method', ax=axes[0]).set(xlabel='Revenue', ylabel='Count', title='Revenue spread for each method (Email/Call/Both)')
#making histogram for revenue values per sales method
sns.boxplot(data=sales, x='sales_method', y='revenue', ax=axes[1], showmeans=True, meanprops={"marker":"o", "markerfacecolor":"white", "markeredgecolor":"black", "markersize":"10"}).set(xlabel='Sales method', ylabel='Average and median revenue per customer', title='Revenue spread for each method')
plt.show()#Was there any difference in revenue over time for each of the methods?
fig = plt.figure(figsize=(15, 8))
sns.set(font_scale=1.5)
sns.lineplot(data=sales, x='week', y='revenue', hue='sales_method', style='sales_method', markers=True).set(xlabel='Week', ylabel='Median revenue', title='Median revenue over time for each method')
fig = plt.figure(figsize=(15, 8))
sns.set(font_scale=1.5)
sns.pointplot(x="week", y="revenue", hue="sales_method", data=sales, markers="o", dodge=True, estimator= "mean")
plt.title('Average revenue per customer over time')
plt.xlabel('Week')
plt.ylabel('Average revenue')
plt.show()# Calculate the total revenue by sales method and week
total_revenue_by_week = sales.groupby(['week', 'sales_method'])['revenue'].sum().reset_index()
# Create the barplot of total revenue by sales method and week
sns.barplot(x="week", y="revenue", hue="sales_method",hue_order=['Email + Call', 'Call', 'Email'], data=total_revenue_by_week)
# Set the title and axes labels
plt.title('Total Revenue by Sales Method and Week')
plt.xlabel('Week')
plt.ylabel('Total Revenue')
# Show the plot
plt.show()# Calculate the total revenue by sales method and week
total_revenue_by_week = sales.groupby(['week', 'sales_method'])['customer_id'].count().reset_index()
# Create the barplot of total revenue by sales method and week
sns.barplot(x="week", y="customer_id", hue="sales_method",hue_order=['Email + Call', 'Call', 'Email'], data=total_revenue_by_week)
# Set the title and axes labels
plt.title('Number Customers by Sales Method and Week')
plt.xlabel('Week')
plt.ylabel('Number of customers')
# Show the plot
plt.show()# Calculate the total revenue by sales method
total_revenue_by_method = sales.groupby('sales_method')['revenue'].sum().reset_index()
# Create the barplot of total revenue by sales method and week
sns.barplot(x="sales_method", y="revenue", data=total_revenue_by_method)
# Set the title and axes labels
plt.title('Total Revenue by Sales Method')
plt.xlabel('Sales method')
plt.ylabel('Total Revenue')
# Show the plot
plt.show()