Skip to content

Data Analyst Professional Practical Exam Submission

📝 Task List

Your written report should include written text summaries and graphics of the following:

  • Data validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Definition of a metric for the business to monitor
    • How should the business use the metric to monitor the business problem
    • Can you estimate initial value(s) for the metric based on the current data
  • Final summary including recommendations that the business should undertake

Report

Hello, Head of Analytics!

I've done the analysis you asked for. Sending you the report. I divided the content into three main parts: Data validation, Exploratory data analysis, final summary.

Data validation

The original data is 15000 rows and 8 columns. After validation there were 13926 rows remaining. The following describes what I did to each column:

  1. Week: Values were numerical and column had no missing data or unusual values, as expected.
  2. Sales_method: Had other data type (object instead of category), and has some extra values ('em + call', 'email'm 'call') apart from 'Email', 'Call', 'Email + Call'. I changed dtype to category and replaces all extra values with Email, Call, Email+Call as appropriate.
  3. Customer_id: Had other data type (object instead of category). I changed dtype to category. All values were unique, as expected.
  4. Nb_sold: Values were numerical and column had no missing data or unusual values, as expected.
  5. Revenue: contained missing data. I decided to look in details on all column before I would choose what to do with it. I've made a new dataset based on rows with null values in this column to see whether there is any system (my idea was that there may be missing values i.e. based on state or sales method), but found nothing specific. As all questions of sales department related to revenue, I dropped all rows with null values in this column. Then I rounded values to two decimals.
  6. Years_as_customer: Values were numerical as it should be but the maximum value was 63, which is impossible if the company was founded in 1984 (2023 - 1984=39, therefore, no values may be higher than that). I found that there are only 2 values > 39, it is 47 and 63. Assuming that there should be 37 and 36 respectivelly, I replaced them.
  7. Np_site_visits: Values were numerical and column had no missing data or unusual values, as expected.
  8. State: Had other data type (object instead of category). I changed dtype to category. I found no typos or strange values.

Exploratory data analysis

Our goal was to see what method of communication we should stay on. Which one has the highest revenue, is there any difference between methods in revenue spread, how the revenue changed over time in each method, etc.

As Sales Rep interested in how many customers were there for each approach, I made a following barplot for this (Pic 1).

Pic 1. Barplot of total revenue per method.

As you can see, the majority of customers were informed due to email method. Also I've made a table for Rep which included info about how many records were dropped due to missing values in revenue (Table 1). I will inform them about big percentage of missing data for Email+Call method.

Table 1. Number of customers per each method with missing data.

Also, I created a histogram and a single boxplot to show how the spread looks overall, answering their second question (Pic 2).

Pic 2. Histogram and boxplot of total revenue.

As you can see, the data has a non-normal distribution (right-skewed) with several peaks. This looks like it comes from several subgroups. Hovewer, the mean and median are quite close.

I have done a barplot of total revenue by method to see which method earned the most money (Pic 3).

Pic 3. Barplot of total revenue per method.

And here you can see that total revenue of Call+email method is about 2/3 from total revenue of Email method, despite the difference between the groups in about 3 times. This hinted at a much higher income from the group Email+Call.

I have done a histogram and boxplots of revenue based on method, as they would like me to do, where you can see that the Email+Call had the highest average income and was on the right tail of histogram values (Pic 4).

Pic 4. Histogram and boxplot of revenue per method.

Thus, the most profitable in terms of revenue is the Email + Call method.

Sales Rep also would like me to show was there any difference in revenue over time for each method. I have done a lineplot (Pic 5) and a barplot (Pic 6) to show changes in average revenue and total revenue per method per week.

Pic 5. Lineplot of average revenue per method over time.

Pic 6. Barplot of total revenue per method over time.

We can see that the highest average revenue per customer was for Email + Call throughout all the time, and the total revenue increased for this group over time. This correlates with an increase in the number of buyers depending on the method over time, so it would be useful to also record the time of purchase and the time of informing the buyer to see if there is a difference in response time to informing.

Also Sales Rep additionaly asked to inform if I found any other findings from data. I created a new column called 'Customer category' based on the values in 'years_as_customer' column. If the value == 0, the value in new category was 'New customer', if value>0 then 'Existing customer'. Then I made a boxplot to see whether there is a difference in revenue per old and new customers. I found none significant difference (Pic 7)

Pic 7. Boxplot of revenue per method by Customer category

I have made a countplot to see is there any visible difference between the method in number of new and existing customers and visually it seemed that for Email+Call method the percent of new customers was higher (Pic 8)

Pic 8. Number of new and old customers per method

Then I calculated Percentage of Revenue from New vs. Existing Customers by formula ([New / Existent] Customers Revenue / Total Revenue) and found that Email+Call has the highest percentage of revenue from new customers (Table 2)

Table 2. Percentage of revenue per new/old customers per method

Metrics for the business to monitor
I recommend to use such metrics: • Total revenue per method (TRPM) = sum of revenue per customers that were informed by a certain method • Average revenue per customer (ARPC) = Total Revenue/Customer Count • Percentage of Revenue from New vs. Existing Customers = [New/Existent] Customers Revenue / Total Revenue

We can use them to see how much revenue we take per day, month, year etc and to see the difference between methods irrespectively to the number of customers. Also this allow us to monitor how good is the method for attracting new customers.

TRPM can be seen in Table 3:

Table 3. Total revenue per method

ARPC can bee seen in Table 4:

Table 4. Average revenue per customer by method

Percentage of Revenue from New vs. Existing Customers can be seen in Table 5:

Table 5. Percentage of revenue per new/old customers per method

Final summary:

Which method to choose? • Email + call has the highest average revenue per customer (which is growing over time), the highest percent of new customers revenue among all who bought the product, quite fast way to deliver. • Just email has a moderate average revenue per customer, but the fastest to deliver. • Just calling seems to be the less effective – the lowest average revenue per customer, the lowest percent of new customers revenue among all who bought the product, the longest way to deliver.

Therefore, I would recommend to use Email+Call method.

Metrics for the business to monitor: • Total revenue per method (TRPM) • Average revenue per customer (ARPC) • Percentage of Revenue from New vs. Existing Customers

What else we can do? To start collecting data about date of informing the customer an date of buying the product by this customer to see whether there is a difference between methods in this.