Skip to content

Data Analyst Professional Practical Exam Submission

You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.

You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.

📝 Task List

Your written report should include written text summaries and graphics of the following:

  • Data validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Definition of a metric for the business to monitor
    • How should the business use the metric to monitor the business problem
    • Can you estimate initial value(s) for the metric based on the current data
  • Final summary including recommendations that the business should undertake

Data Validation

The dataset contains 15000 rows and 8 columns before cleaning and validataion. I have validated all the columns against the criteria in the dataset table:

  • week: Week sale was made, counted as weeks since product launch, the data type was the same as the description given.
  • sales_method: Character, which of the three sales methods were used for that customer; the methods should only be Email, Call, and Email + Calls. But there were extraneous data that shouldnt be there, like 'en + calls' and 'enail'. These extraneous data were removed in order to match the description given. The data will remain 14967.
  • customer_id: Character, unique identifier for the customer. I checked for duplicate values, but there was none. The data matched the description.
  • nb_sold: Numeric, number of new products sold. The data matched the description given.
  • revenue: Numeric, revenue from the sales, rounded to 2 decimal places. There were 1068 missing values, which represent about 7.14% of the data. This was imputed with the median of the data, since the distribution is not mormal.
  • years_as_customer: Numeric, number of years customer has been buying from us (company founded in 1984). The data matched the description given.
  • nb_site_visits: Numeric, number of times the customer has visited our website in the last 6 months. The data matched the description given.
  • state: Character, location of the customer i.e. where orders are shipped. The data matched the dedescription given.

After the data validation, the dataset contains 14967 rows and 8 columns without missing values.

Exploratory Analysis

Single Variable Analysis:
Distribution of the Revenue

Image1

From the distribution plot above, it can be seen that most of the revenue lie between 20 to 200 was generated from most of the customers. Values beyond this range could be seen as outiers.

How many customers were there for each approach?

'image'

  • Email: 7456
  • Call: 4962
  • Email + Call: 2549

Also we can see that most of the customers that purchased the products prefer using emails rather than calls. next we will compare the revenue generated from these sales method to see the most effective sales method, we will try looking for a way to improve the most prefered sales method if the revenue generated is poor.

Multiple Variable Analysis
What does the spread of the revenue look like overall? And for each method?

Revenue Spread by Sales Method

We can see the high revenue range between 120-200 for 'Email + Calls', other methods were below 100. This signals a focus on 'Email' method for improvement.

Revenue Generated by Sales Method

Revenue by Sales Method

Now having seen that most of our customers prefer been reached out to via email before making purchases, it has also been seen that not as more revenue was generated in the last six week per customer basis from this method compared to 'Email + Calls'. I also understand that the later can be very time consuming and expenseive. We will see how we can improve the revenue generated per customer using this email approach.

Was there any difference in revenue over time for each of the methods?

So from the plot above I can clearly infer there was a significant difference in revenue over time for each method, as 'Email + Calls' by far generated more revenue per customer compared to other methods.

Weekly Revenue by Sales Method

The plot below shows that in the past six weeks, that massive revenue have been generated per customer by combining both emails and follow-up via calls, compared to other methods.

Weekly Revenue by Sales Method

Definition of a metric for the business to monitor

How should the business monitor what they want to achieve?

Now from the exploratory Analysis, I have gained lots of tremendous insights from the behavior of customers in response to the sales methods which include, Emails, Calls, and Email + Calls. I have found out that most of our customers prefer pruchasing after reaching out to them via email. But less revenue have been generated per customer basis in the last six(6) weeks, as the trend above clearly showed. So the key metric here, will be to use this last six weeks results to monitor our Total revenue per Number of new products sold for the next 4 weeks per method, since the goal of every company would be to increase revenue, it means if we can have the same amount of customers for all groups, we can generate much more revenue using the 'Email + Calls' method. I want to see how the Total Revenue generated per Number of new products sold from email will improve because this method is less time consuming and more cost effective.

sales_method for Total Revenue per Number of new products sold

  • Call: 5.165740
  • Email: 9.926226
  • Email + Call: 13.976339 dtype: float64

The percentage revenue generated per Number of new products sold from email only is about 34% of the sales method within the last 6 weeks, so this is a very important metric to monitor overtime to se improvement.

Recommendation for Sales Approach at "Pens and Printers"

1. Introduction: Our analysis over the past six weeks reveals distinct patterns in customer behavior in response to different sales methods: Email, Call, and a combination of both (Email + Calls). By analyzing revenue, conversion rates, and customer preferences, we can recommend an optimal sales strategy.

2. Key Findings:

  • Customer Preference: A significant portion of our customers show a clear inclination towards communication via Email. This method reached out to the highest number of customers, suggesting its scalability and acceptance.

  • Revenue Generation: The "Email + Calls" method, while being the third most popular, generated the highest revenue per product sold. This indicates a higher conversion rate or purchase value when this combined approach is used.

  • Efficiency and Effectiveness: While emails are time and cost-efficient, the combination of "Email + Calls" proves to be the most effective, albeit time-consuming.

3. Recommendations:

  • Adopt a Tiered Approach:

    • Tier 1 (Primary Outreach): Continue using Emails as the primary method of outreach given its scalability and preference among customers.
    • Tier 2 (High-Value Conversion): For high-value products or potentially large orders, follow up with a call after the initial email. The data suggests that this combined approach can potentially boost revenue for such sales.
  • Enhance Email Quality: Given that emails are the most preferred method, invest in enhancing the content. This could include personalized recommendations, attractive visuals of the new product line, or limited-time promotions.

  • Training & Scripts: For the combined "Email + Calls" approach, provide sales representatives with training and possibly scripts to ensure the phone call's effectiveness, given its proven potential to significantly boost revenue.

  • Feedback Loop: Implement a feedback mechanism for customers who purchase via the combined method. Understanding what specifically drove their purchase decision can provide insights for refining the strategy.

  • Cost-Benefit Analysis: While the combined method is more revenue-effective, it is also more time-consuming. A thorough cost-benefit analysis should be conducted to gauge the net profit of this approach versus the others, considering operational costs.

4. Future Monitoring:

  • Metric Monitoring: Monitor the "Total Revenue per Number of New Products Sold" metric, especially for the "Email" method. Given its current percentage revenue is about 34% of the total, any significant improvement here could have a substantial impact on overall revenue.

5. Conclusion:

While our findings clearly favor the combined "Email + Calls" method in terms of revenue generation, it's crucial to balance efficiency and effectiveness. By leveraging the scalability of emails and the effectiveness of calls judiciously, "Pens and Printers" can maximize revenue without overwhelming the sales team or alienating customers.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

prod_sale = pd.read_csv("product_sales.csv")
prod_sale.head()
prod_sale.info()
# Count rows with duplicates for customer_id column
count_duplicates_ci = prod_sale['customer_id'].duplicated(keep=False).sum() 
count_duplicates_ci
prod_sale['customer_id'].isnull().sum()
prod_sale['customer_id'].str.contains('[^a-zA-Z]', regex=True).any()
prod_sale['customer_id'] = prod_sale['customer_id'].str.replace('[^a-zA-Z]', '', regex=True)
prod_sale.info()
prod_sale['customer_id'].isnull().sum()
# checking for extraneous data in the sales_method column.
prod_sale['sales_method'].value_counts()
# changing the extraneous data contained in the sales_method column to match the actual categories.
replace_values = {'em + call':'Email + Call', 'email':'Email'}
prod_sale['sales_method'] = prod_sale['sales_method'].replace(replace_values)
prod_sale['sales_method'].value_counts()