Background
PetMind is a retailer of products for pets. They are based in the United States. PetMind sells products that are a mix of luxury items and everyday items. Luxury items include toys. Everyday items include food. The company wants to increase sales by selling more everyday products repeatedly. They have been testing this approach for the last year. They now want a report on how repeat purchases impact sales.
For every column in the data:
a. State whether the values match the description given in the table above.
b. State the number of missing values in the column.
c. Describe what you did to make values match the description if they did not match.
| Column Name | Description | Missing values |
|---|---|---|
| product_id | The number of distinct values matches the number of total rows | No missing values |
| category | 25 records have "-" instead of one of the 6 given categories; these were replaced with ‘unknown’ | 25 missing values |
| animal | All distinct values match the four given animals | No missing values |
| size | 1050 values had capitalization that did not match the description; they were corrected to Small, Medium, Large | No missing values |
| price | 150 rows have an unlisted price; these were replaced with the overall median, rounded to two decimals. All values are positive | 150 missing values |
| sales | Rounded to two decimals; no negative values | No missing values |
| rating | 150 NA values were found and replaced with 0 | 150 missing values |
| repeat_purchase | Nothing was changed | No missing values |
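The cleaning steps in the table can be sketched in pandas roughly as follows. This is a minimal sketch, not the code actually used for the report: the function name `clean_pet_data` and the four-row frame `raw` are hypothetical, and it assumes that an unlisted price is stored as the text `unlisted` and a missing category as `-`.

```python
import numpy as np
import pandas as pd


def clean_pet_data(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps from the table to a copy of df (hypothetical helper)."""
    out = df.copy()
    # category: "-" stands in for a missing category; label it 'unknown'
    out["category"] = out["category"].replace("-", "unknown")
    # size: fix capitalization so values read Small / Medium / Large
    out["size"] = out["size"].str.strip().str.capitalize()
    # price: non-numeric entries (assumed to be the text "unlisted") become NaN,
    # then are filled with the overall median and rounded to two decimals
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out["price"] = out["price"].fillna(out["price"].median()).round(2)
    # sales: round to two decimals
    out["sales"] = out["sales"].round(2)
    # rating: NA values are replaced with 0
    out["rating"] = out["rating"].fillna(0)
    return out


# Hypothetical rows, only to show the function running end to end
raw = pd.DataFrame({
    "product_id": [1, 2, 3, 4],
    "category": ["Food", "-", "Toys", "Food"],
    "animal": ["Dog", "Cat", "Bird", "Fish"],
    "size": ["small", "MEDIUM", "Large", "large"],
    "price": ["10.00", "unlisted", "30.50", "20.00"],
    "sales": [100.123, 200.0, 300.456, 50.0],
    "rating": [5.0, np.nan, 7.0, 3.0],
    "repeat_purchase": [1, 0, 1, 0],
})
clean = clean_pet_data(raw)
```

Working on a copy keeps the raw frame available, so the number of missing values per column can still be counted from the original data after cleaning.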
Create a visualization that shows how many products are repeat purchases.
Use the visualization to:
a. State which category of the variable repeat purchases has the most observations
b. Explain whether the observations are balanced across categories of the variable repeat purchases
a. Repeat-purchase products (1) have the most observations: 906, or 60.4% of the total.
b. The observations are not balanced across the categories of repeat_purchase. Repeat-purchase products (1) have noticeably more observations than one-off products (0). In a balanced dataset, we would expect an approximately equal number of observations in each category. Here, about 60.4% of the observations are repeat-purchase products (1), while only 39.6% are one-off products (0).
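The counts and shares behind this answer can be checked with a short pandas sketch. It rebuilds the `repeat_purchase` column from the reported count of 906 and assumes 1500 rows in total, which is what 906 observations at 60.4% implies; the real column would come from the cleaned data instead.

```python
import pandas as pd

# Assumption: 1500 rows in total, so 906 repeat purchases (1) leave 594 one-off products (0)
repeat_purchase = pd.Series([1] * 906 + [0] * 594, name="repeat_purchase")

# Absolute number of observations per category
counts = repeat_purchase.value_counts()

# Share of each category as a percentage, rounded to one decimal
shares = (repeat_purchase.value_counts(normalize=True) * 100).round(1)

print(counts)
print(shares)
```

If matplotlib is installed, `counts.plot(kind="bar")` draws the corresponding bar chart.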
Describe the distribution of all of the sales. Your answer must include a visualization that shows the distribution.
Sales appear to be skewed to the right, with most of the sales values falling in the lower range, and fewer sales in the higher range.
This type of distribution is common in situations where a large number of observations have smaller values (in this case, sales), and a few observations have much larger values. This could suggest that while most products have moderate sales, there are a few products that sell significantly more.
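Right skew can also be checked numerically, alongside the histogram. The sketch below uses hypothetical sales values, chosen only to illustrate the shape; in a right-skewed sample the mean sits above the median and the sample skewness is positive.

```python
import pandas as pd

# Hypothetical sales values: many small values and one much larger one
sales = pd.Series([10, 12, 15, 18, 20, 25, 30, 200], name="sales")

mean_sales = sales.mean()      # pulled upward by the large value
median_sales = sales.median()  # stays with the bulk of the data
skewness = sales.skew()        # sample skewness; positive means a longer right tail

print(mean_sales, median_sales, skewness)
```

If matplotlib is installed, `sales.plot(kind="hist")` draws the histogram for the same series.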
When looking for products that have h
Describe the relationship between repeat purchases and sales. Your answer must include a visualization to demonstrate the relationship.
To understand the impact of repeat purchases on sales, we need to combine the two pieces of information. While sales above 1000 would be ideal, we need to look at both variables together to see if this is realistic.
When we looked at the sales data alone, we removed the outlier so that we could see the majority of the data. To show the impact of this outlier, we can look at the range of sales by repeat purchase with the outlier included. In the graphic above, the outlier dominates the plot and makes it difficult to compare the rest of the data. Therefore, we remove it here as well.
After removing the outlier, we can focus on the main range of the data. Even though repeat purchases include the products with the highest sales, the interquartile range (IQR) of sales for one-off purchases (0) is lower than the IQR of sales for repeat purchases (1). This suggests that sales for most products may be higher when customers make repeat purchases. However, this could also be an artifact of the larger number of repeat-purchase products, since the larger group could bring its median down.
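The comparison above can be sketched in pandas as follows. The rows are hypothetical and the helper names `drop_sales_outliers` and `iqr_by_group` are illustrative. The report does not say how the outlier was removed, so the 1.5 × IQR fence used here is an assumption (one common rule), and the IQR is read as the width Q3 − Q1.

```python
import pandas as pd

# Hypothetical rows: five one-off products (0), six repeat-purchase products (1),
# with one extreme sales value in the repeat-purchase group
df = pd.DataFrame({
    "repeat_purchase": [0] * 5 + [1] * 6,
    "sales": [800, 900, 1000, 1100, 1200, 700, 850, 1000, 1150, 1300, 5000],
})


def drop_sales_outliers(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose sales lie within 1.5 * IQR of the quartiles (assumed rule)."""
    q1, q3 = frame["sales"].quantile([0.25, 0.75])
    fence = 1.5 * (q3 - q1)
    keep = frame["sales"].between(q1 - fence, q3 + fence)
    return frame[keep]


def iqr_by_group(frame: pd.DataFrame) -> pd.DataFrame:
    """Return Q1, Q3 and IQR of sales for each repeat_purchase category."""
    q = frame.groupby("repeat_purchase")["sales"].quantile([0.25, 0.75]).unstack()
    q.columns = ["q1", "q3"]
    q["iqr"] = q["q3"] - q["q1"]
    return q


trimmed = drop_sales_outliers(df)
summary = iqr_by_group(trimmed)
medians = trimmed.groupby("repeat_purchase")["sales"].median()
print(summary)
print(medians)
```

Reporting Q1 and Q3 next to the IQR shows whether one group's box is narrower or simply sits lower on the sales axis, which is the distinction the paragraph above depends on.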