Data Analyst Professional Practical Exam Submission
You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.
You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.
13 hidden cells
Introduction
Pens and Printers is a stationary distributor founded in 1984, and they recently added a new line of products that have been selling for the past 6 weeks. This sales analysis report aims to evaluate the effectiveness of three primary sales methods employed by the company: email, calls, and a combination of emails and calls. The focus of this analysis is to determine which sales method should be prioritized and continued to ensure the highest revenue growth.
Business questions this analysis seeks to answer:
- How many customers were there for each approach?
- What does the spread of the revenue look like overall? And for each method?
- Was there any difference in revenue over time for each of the methods?
- Which method should we continue to use?
Data Validation
The dataset contains 15000 rows and 8 columns before data cleaning. Here is some additional information regarding each column and the cleaning steps followed for those that needed some adjustments:
- week: It contains discrete values ranging from 1 to 6 which correspond to the number of weeks passed since the product launch. No missing values, and no cleaning needed.
- sales_method: It contains 3 categories corresponding to the different sales methods used (Calls, Emails, Emails + Calls) with no missing values. This column required some cleaning because there were inconsistencies in the naming and capitalization of a few categories.
- customer_id: It contains 15,000 unique customer ids. No duplicates, no missing values, and no cleaning needed.
- nb_sold: It contains discrete values ranging from 7 to 16 which correspond to the number of new products sold per transaction. No missing values, and no cleaning needed.
- revenue: It contains numeric values corresponding to the total revenue per transaction. Values in this column range from $32.54 to $238.32. This column had a total of 1074 missing values which is around 7% of all of the data. Since the percentage of missing data is greater than the accepted threshold of 5% I proceeded to impute the missing values. To impute the values I calculated the average sale price of new products sold and I multiplied this average times the number of items sold in each row with missing values. Using this impute method yields no missing data retaining the original 15,000 rows.
- years_as_customer: It contains discrete values which ranged from 0 to 63 years as a customer before the data cleaning. However, the company has only been in business for 39 years; therefore, the outliers were swapped with the maximum number of years in business (39), and a total of 2 rows were corrected without affecting the total number of rows.
- nb_site_visits: It contains discrete values ranging from 12 to 41. No missing values, and no cleaning needed.
- state: It contains the state to which the sales will be dispatched. This column has a total of 50 states with no capitalization or misspelling issues.
After validating the data the dataset contains 15000 rows and 8 columns without missing values.
How many customers per sales approach?
As seen in the figure below, approximately 50% of the customers bought our products as a result of the email campaign. Ranking second is the calls approach with around 33% of the total transactions, and lastly, we have Email + Call with around 17% of the total customers.
These numbers are a good starting point for our analysis; however, they cannot be used to draw any conclusions because there are several other missing variables needed to draw insights from this data, such as:
- Resources allocation: How many man-hours were allocated to each sales method? and what criteria were used to allocate customers to each sales method?
- Conversion rate: How many customers were targetted per sales method? and what was the success rate per method?
What does the spread of the revenue look like overall?
Understanding the spread of revenue in sales analysis is crucial to identify areas of strength, weakness, and growth potential, enabling effective resource allocation and decision-making.
The histogram displayed below indicates a positively skewed and bimodal distribution of revenue, with the highest occurrence of revenues observed around $60 and $90.
1 hidden cell
Revenue distribution per sales method
The violin plot below shows the probability density and the distribution of revenue per sales method. The data reveals that the email + call sales method has the greatest variation and the highest median revenue per customer.
Exploring the difference in revenue over time for each method
The two figures provided below depict the sales trends for each sales method over the analyzed period. Our earlier observation highlighted that email sales generated the highest number of sales. However, upon closer examination, it is evident that the performance of email sales is declining over time, whereas the other two sales methods demonstrate a positive upward trend.
An interesting observation is that in week 5, calls resulted in a higher number of items sold compared to other sales methods. However, it is important to note that even though calls had higher sales volume, the combined approach of email + call still outperformed in terms of revenue generated during that week. This highlights the importance of being able to generate more revenue per interaction, showcasing the effectiveness of the email + call strategy.