Data Analyst Practical Exam
Data Validation
The dataset contains 15000 rows and 8 columns before cleaning and validataion. I have validated all the columns against the criteria in the dataset table:
- week: 6 unique values for weeks representing the week sale was made, counted as weeks since product launch, without missing values. No cleaning is needed.
- sales_method: initially with 5 unique values, needing to correct typing mistakes to get 3 unique sales methods as in description. Correction was made.
- customer_id: character unique identifier for the customer, without missing values. No cleaning is needed.
- nb_sold: numeric values without missing values, representing the number of new products sold. No cleaning is needed.
- revenue: revenue from the sales, containing 1074 (7.16%) missing values. I grouped the revenue by the nb_sold and calculated the mean between them. Then I replaced the missing values with the average revenue, taking into account how many products had been sold.
- years_as_customer: number of years customer has been buying from us, 42 unique values. It was necessary cleaning two observations because the time the client had been with us was incompatible with the age of the company.
- nb_site_visits: numeric values without missing values, same as the description. No cleaning is needed.
- state: location of the customer i.e. where orders are shipped, 50 states without missing values. No cleaning is needed.
After the data validation, the dataset contains 14998 rows and 8 columns without missing values.
How many customers were there for each approach?
The number of customers per approach is as follows:
- Email: 7465 customers
- Call: 4961 customers
- Email + Call: 2572 customers After data cleaning, two records were discarded.So the total number of customers after cleaning the data is 14998.
A count plot for the sales method is provided below to illustrate the result:
What does the spread of the revenue look like overall? And for each method?
The revenue distribution appears to be right-skewed, with a concentration of values on the lower end of the spectrum. The mean revenue is approximately 95.03 , with a standard deviation of 47.63, indicating a moderate level of variability around the mean. The minimum revenue recorded is 32.54, while the maximum reaches 238.32, showcasing a wide range of revenue values. The interquartile range (IQR), which spans from the 25th percentile ( 53.01 ) to the 75th percentile (107.56), suggests that the middle 50% of the revenue data is relatively tightly grouped. The median revenue is 89.66, which is slightly lower than the mean, further supporting the notion of a right-skewed distribution. In summary, the revenue distribution is characterized by a majority of lower revenue values with a long tail extending towards higher revenue figures. This indicates that while most transactions or entities generate revenue on the lower end, there are a few that significantly exceed the average, contributing to the overall spread. The histogram illustrating the revenue distribution can be seen below:
And for each method?
Call:
- Mean Revenue: 48.86
- Standard Deviation: 11.53
- IQR: 41.56 to 52.94
- Median Revenue: 49.85
- Range: 32.54 to 170.16
The revenue distribution for the Call method is relatively narrow, with most revenues concentrated around the mean. The low standard deviation and IQR suggest that this method generates consistent but lower revenue compared to other methods.
Email:
- Mean Revenue: 96.33
- Standard Deviation: 12.04
- IQR: 87.52 to 104.60
- Median Revenue: 94.43
- Range: 64.53 to 170.16
The Email method shows a higher mean revenue compared to the Call method, with a similar standard deviation. The IQR indicates that the middle 50% of revenues are closely grouped, suggesting consistency in revenue generation.
Email + Call:
- Mean Revenue: 180.30
- Standard Deviation: 34.15
- IQR: 154.17 to 191.28
- Median Revenue: 183.70
- Range: 64.53 to 238.32
The Email + Call method generates the highest mean revenue, with a broader spread indicated by the higher standard deviation. The IQR shows that the middle 50% of revenues are significantly higher than the other methods, indicating that this method is more effective in generating higher revenues.
The boxplot chart illustrating these distributions can be observed below, providing a visual representation of the revenue spread for each sales method:
Was there any difference in revenue over time for each of the methods?
-
Email: The revenue generated through the Email method shows a consistent trend over the weeks. There is no significant increase or decrease, indicating stable performance.
-
Email + Call: The Email + Call method demonstrates a noticeable increase in revenue over the weeks. This suggests that this method becomes more effective over time, possibly due to the combined approach enhancing customer engagement.
-
Call: The Call method shows a relatively stable trend with slight fluctuations. There is no significant change in revenue over the weeks, indicating consistent but not improving performance.
There is a clear difference in revenue trends over time for each method. The Email + Call method shows an upward trend, indicating increasing effectiveness. The Email and Call methods remain relatively stable, with no significant changes in revenue over the observed period. It's possible to see the revenue changes for each sales method over time through the graph below:
Recommendation
Based on the data, we can analyze the sales methods and recommend the best strategy.
- Average Revenue per Customer:
- Call: Lowest average revenue.
- Email: Intermediate average revenue.
- Email + Call: Highest average revenue per customer.
The Email + Call method generates more revenue per customer, indicating that the dual contact approach may be more effective for conversion. As can be seen in the graph below:
- Total Revenue by Sales Method:
- Call: Lowest total revenue.
- Email: Highest total revenue, suggesting that more customers were reached.
- Email + Call: High total revenue, but lower than Email alone.
The Email method generated the highest total revenue, possibly because it reached more customers with less effort. Email + Call has good total revenue and excellent average revenue but may be more labor-intensive. Call alone has the worst performance. It can be seen in the graph below:
Business Metrics
Since our goal is to optimize revenue generation and improve the effectiveness of sales methods, I recommend using Average Revenue per Customer as our primary metric. This metric will help us understand how much revenue each sales method generates per customer and identify which methods are most profitable.
Based on the current data, the Average Revenue per Customer for each sales method is as follows:
- Call: 48.86
- Email: 96.33
- Email + Call: 180.30
If the average for the Email + Call method increases over the next quarter, it will indicate that this method is becoming even more effective at driving higher revenue per customer. Similarly, an increase in average for the Email method would show improved efficiency in reaching and converting customers. Monitoring these values will provide clear insights into whether our strategies are working and help us achieve our revenue goals.
If the goal is total revenue, it's better continuing with Email (less effort, more customers). If we want to increase revenue per customer, Email + Call are better (more conversions, but requires more time). Discard Call alone, as it has the worst performance in both average and total revenue.
Final recommendation: Prioritizing Email, with strategic tests of Email + Call for higher-value customers.