Background

PetMind is a US-based retailer of products for pets. It sells a mix of luxury items, such as toys, and everyday items, such as food. The company wants to increase sales by selling more everyday products repeatedly and has been testing this approach for the last year. It now wants a report on how repeat purchases impact sales.

For every column in the data:

a. State whether the values match the description given in the table above.

b. State the number of missing values in the column.

c. Describe what you did to make values match the description if they did not match.

Column Name | Description | Missing values
product_id | The number of distinct values matches the total number of rows. | No missing values
category | 25 records did not match the 6 given categories (they contained "-" instead of a category); these were replaced with "unknown". | 25 missing values
animal | All distinct values were checked and match the four given animals. | No missing values
size | 1050 values had capitalization that did not match the given categories; capitalization was corrected to Small, Medium, Large. | No missing values
price | 150 rows had an unlisted price; these were replaced with the overall median, rounded to two decimals. All values are positive. | 150 missing values
sales | Values were rounded to two decimals; there are no negative values. | No missing values
rating | 150 NA values were found and replaced with 0. | 150 missing values
repeat_purchase | Nothing was changed. | No missing values
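
The cleaning rules in the table can be sketched in a few lines of Python. This is a minimal, dependency-free illustration rather than the code used for the report: it assumes each row is a dict keyed by the column names above, that unlisted prices and NA ratings have already been read in as None, and the three sample rows (including the category names) are invented for the example.

```python
from statistics import median

def clean_rows(rows):
    """Apply the cleaning rules from the table to a list of row dicts."""
    # Median of the listed prices only (assumption: unlisted prices are None).
    listed_prices = [r["price"] for r in rows if r["price"] is not None]
    median_price = round(median(listed_prices), 2)

    cleaned = []
    for r in rows:
        row = dict(r)  # copy so the input rows are left untouched
        if row["category"] == "-":
            row["category"] = "unknown"
        # Normalise capitalization to Small / Medium / Large.
        row["size"] = row["size"].strip().capitalize()
        if row["price"] is None:
            row["price"] = median_price
        row["price"] = round(row["price"], 2)
        row["sales"] = round(row["sales"], 2)
        if row["rating"] is None:
            row["rating"] = 0
        cleaned.append(row)
    return cleaned

# Hypothetical sample rows, not taken from the PetMind data.
sample = [
    {"category": "Food", "size": "large", "price": 10.0, "sales": 100.456, "rating": 7},
    {"category": "-", "size": "MEDIUM", "price": None, "sales": 50.0, "rating": None},
    {"category": "Toys", "size": "Small", "price": 30.0, "sales": 75.129, "rating": 5},
]
cleaned = clean_rows(sample)
for row in cleaned:
    print(row)
```

Note that replacing missing ratings with 0 follows the table as written; it lowers any average computed over the rating column, so an analysis of ratings may want to treat those rows separately.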

Create a visualization that shows how many products are repeat purchases.

Use the visualization to:

a. State which category of the variable repeat purchases has the most observations

b. Explain whether the observations are balanced across categories of the variable repeat purchases

a. Products that are bought repeatedly have the most observations: 906, which represents 60.4% of the total.

b. The observations are not balanced across the categories of the variable repeat_purchase. Repeatedly bought products (1) have considerably more observations than one-off products (0): about 60.4% of the observations fall in category 1, while only 39.6% fall in category 0. In a balanced dataset, we would expect an approximately equal number of observations in each category.
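
The balance check behind this answer is a simple count per category. The sketch below is illustrative only: the 906 repeat-purchase rows come from the answer above, while the 594 one-off rows are an assumption inferred from the stated 60.4% share, not a figure reported in the text.

```python
from collections import Counter

# 906 is from the answer above; 594 is an assumption inferred from the 60.4% share.
repeat_purchase = [1] * 906 + [0] * 594

counts = Counter(repeat_purchase)
total = sum(counts.values())
# Percentage share of each category, rounded to one decimal.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label in sorted(counts, reverse=True):
    print(f"repeat_purchase={label}: {counts[label]} rows ({shares[label]}%)")
```

The same counts are what a bar chart of repeat_purchase would display, so the table and the visualization can be checked against each other.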

Describe the distribution of all of the sales. Your answer must include a visualization that shows the distribution.

Sales appear to be skewed to the right, with most of the sales values falling in the lower range, and fewer sales in the higher range.

This type of distribution is common in situations where a large number of observations have smaller values (in this case, sales), and a few observations have much larger values. This could suggest that while most products have moderate sales, there are a few products that sell significantly more.
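
A right skew can also be checked numerically alongside the histogram: the mean sits above the median and the skewness coefficient is positive. The following is a rough sketch using made-up sales figures (they are not PetMind values) and a simple moment-based skewness, one of several possible definitions.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Moment-based (population) skewness; positive values indicate a right skew."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical sales figures, not taken from the PetMind data.
sales = [520.0, 640.0, 700.0, 760.0, 810.0, 900.0, 1050.0, 1400.0, 2200.0]

print("mean:", round(mean(sales), 2))
print("median:", median(sales))
print("skewness:", round(skewness(sales), 2))
```

When a few large values pull the mean well above the median, as in this toy list, the median is usually the more representative summary of typical sales.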

When looking for products that have h

Describe the relationship between repeat purchases and sales. Your answer must include a visualization to demonstrate the relationship.

To understand the impact of repeat purchases on sales, we need to look at the two variables together. While sales of over 1000 would be ideal, looking at both variables together shows whether this is realistic.

When we looked at the sales data alone, we removed the outlier so that we could see the majority of the data. To show its impact, we can look at the range of sales by repeat-purchase category with the outlier included. In the graphic above, the outlier dominates the plot and makes it difficult to compare the rest of the data. Therefore, we remove it before comparing the two categories.

After removing the outlier, we can focus on the main range of the data. Even though the repeat-purchase group (1) contains the largest number of observations, the interquartile range (IQR) of sales for one-off purchases (0) sits lower than the IQR of sales for repeated purchases (1). This suggests that typical sales may be higher for products that customers buy repeatedly. However, this could also be an artifact of the imbalance between the groups, since the larger number of repeated purchases could bring that group's median down.
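
The quartiles that a box plot draws for each group can be computed directly, which makes the comparison easier to verify. This is a minimal sketch under stated assumptions: the (repeat_purchase, sales) pairs are invented for illustration, and the quartiles use the default "exclusive" method of Python's statistics.quantiles, which may differ slightly from the method used by a plotting library.

```python
from statistics import quantiles

def iqr_summary(values):
    """Return the quartiles and interquartile range of a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4)
    return {"q1": q1, "median": q2, "q3": q3, "iqr": q3 - q1}

# Hypothetical (repeat_purchase, sales) pairs, not taken from the PetMind data.
rows = [
    (0, 600.0), (0, 700.0), (0, 800.0), (0, 900.0), (0, 1000.0),
    (1, 650.0), (1, 750.0), (1, 850.0), (1, 950.0), (1, 1050.0), (1, 1150.0),
]

# Group sales by the repeat_purchase flag.
groups = {}
for flag, sale in rows:
    groups.setdefault(flag, []).append(sale)

summary = {flag: iqr_summary(values) for flag, values in groups.items()}
for flag in sorted(summary):
    print(flag, summary[flag])
```

Comparing the medians and quartile boundaries of the two groups side by side shows whether one box really sits above the other, independently of how many observations each group contains.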