
Project: Data Analysis, Manipulation, Visualization and Reporting of Pet Box Subscription

Information

Task 1

For every column in the data: State whether the values match the description given in the table above. State the number of missing values in the column. Describe what you did to make values match the description if they did not match.

  • product_id: There are 1500 unique identifier values in this column. There are no missing values. No changes were made to this column.
  • category: There are 6 unique categories in this column. There were 25 missing values, which were replaced with “Unknown”.
  • animal: There are 4 unique categories in this column. There were no missing values. No changes were made to this column.
  • size: There are 3 unique categories in this column and no missing values. However, the same category appeared with more than one spelling because of upper- and lower-case differences (for example, MEDIUM instead of Medium). These values were standardized to a single spelling.
  • price: The values of this column ranged from 12.85 to 54.16, which is consistent with the description given. There were 150 missing values. The missing values were replaced with the median of the remaining data, which was 28.065 (rounded to 28.07).
  • sales: The values of this column ranged from 286.94 to 2255.96, which is consistent with the description given. There were no missing values.
  • rating: The values of this column ranged from 0 to 9. There were 150 missing values. The missing values were replaced with 0.
  • repeat_purchase: There are 2 unique categories in this column and no missing values. However, more than 2 distinct spellings were present, because 0 was sometimes written as -0. This typo occurred 142 times. These values were corrected to 0.
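The cleaning steps listed above can be sketched in a few lines of Python. This is only an illustration using the standard library: the sample rows are made up (the real dataset has 1500 products), and the function name `clean_rows` is a hypothetical helper, not the code used for the project.

```python
from statistics import median

# Hypothetical sample rows, invented for illustration only.
raw_rows = [
    {"product_id": 1, "category": "Food", "size": "MEDIUM",
     "price": 20.0, "rating": 7, "repeat_purchase": "1"},
    {"product_id": 2, "category": None, "size": "small",
     "price": None, "rating": None, "repeat_purchase": "-0"},
    {"product_id": 3, "category": "Toys", "size": "Large",
     "price": 30.0, "rating": 4, "repeat_purchase": "0"},
    {"product_id": 4, "category": "Equipment", "size": "medium",
     "price": 25.5, "rating": 9, "repeat_purchase": "1"},
]


def clean_rows(rows):
    """Apply the Task 1 fixes and return new, cleaned row dictionaries."""
    # Median of the prices that are present, rounded to 2 decimals.
    known_prices = [r["price"] for r in rows if r["price"] is not None]
    price_fill = round(median(known_prices), 2)

    cleaned = []
    for r in rows:
        row = dict(r)  # copy so the raw data stays untouched
        if row["category"] is None:
            row["category"] = "Unknown"
        # Unify spelling: "MEDIUM" and "medium" both become "Medium".
        row["size"] = row["size"].strip().capitalize()
        if row["price"] is None:
            row["price"] = price_fill
        if row["rating"] is None:
            row["rating"] = 0
        # "-0" parses to -0.0, which int() turns into a plain 0.
        row["repeat_purchase"] = int(float(row["repeat_purchase"]))
        cleaned.append(row)
    return cleaned


cleaned_rows = clean_rows(raw_rows)
for row in cleaned_rows:
    print(row)
```

Computing the fill value once, before the loop, keeps the imputed price based only on the originally observed prices.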

Task 2

Create a visualization that shows how many products are repeat purchases. Use the visualization to: State which category of the variable repeat purchases has the most observations. Explain whether the observations are balanced across categories of the variable repeat purchases.

There are 6 categories in this data. The categories are mostly balanced. The most common category is equipment. Four categories (food, housing, medicine and toys) have almost the same counts, but the equipment category has 47% more repeat-purchase products than the average of these four. Therefore, the company should focus on these 5 categories, with primary focus on equipment.
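The comparison behind a figure such as "47% more than the average of the other four" can be reproduced as follows. This is a minimal sketch with invented counts, so it prints 50.0 rather than the project's 47; `percent_above_mean` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical (category, repeat_purchase) pairs, not the real data.
products = (
    [("Equipment", 1)] * 30
    + [("Food", 1)] * 20
    + [("Housing", 1)] * 22
    + [("Medicine", 1)] * 18
    + [("Toys", 1)] * 20
    + [("Food", 0)] * 5
)

# Count repeat-purchase products per category.
repeat_counts = Counter(cat for cat, repeat in products if repeat == 1)


def percent_above_mean(counts, top, others):
    """How many percent `top` lies above the mean count of `others`."""
    mean_others = sum(counts[c] for c in others) / len(others)
    return (counts[top] - mean_others) / mean_others * 100


top_category, _ = repeat_counts.most_common(1)[0]
pct = percent_above_mean(
    repeat_counts, "Equipment", ["Food", "Housing", "Medicine", "Toys"]
)
print(top_category, round(pct, 1))
```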

Task 3

Describe the distribution of all of the sales. Your answer must include a visualization that shows the distribution.

The distribution of all of the sales is approximately symmetrical, with no noticeable skewness. There are some outliers, but they are uncommon. The visual shows that about half of the sales (54.2%) fall between 800 and 1100, and 91% fall between 600 and 1500. The company should target these sales ranges when looking to increase its sales by repeatedly selling more everyday products.
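Shares such as the 54.2% and 91% quoted above are simply the fraction of sales values that fall inside a range. A small sketch, assuming a plain list of sales figures (the numbers below are invented, and `share_between` is a hypothetical helper name):

```python
def share_between(values, low, high):
    """Percentage of values that lie in the closed range [low, high]."""
    inside = sum(1 for v in values if low <= v <= high)
    return 100 * inside / len(values)


# Hypothetical sales figures, for illustration only.
sales = [650, 820, 900, 1000, 1050, 1200, 1400, 2100]

core_share = share_between(sales, 800, 1100)
wide_share = share_between(sales, 600, 1500)
print(core_share, wide_share)
```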

Task 4

Describe the relationship between repeat purchases and sales. Your answer must include a visualization to demonstrate the relationship.

  • So far we have determined which categories sell more repeat-purchase products, and we have examined the range of sales amounts for all purchases without regard to whether they are repeat purchases or not.
  • We now analyze and visualize the sales distribution according to whether customers buy the product repeatedly or not.
  • Further analysis should be done to understand the effects of repeat-purchase products in detail. For conclusive results, the categories should be examined one by one and the results compared.
  • The box plot for non-repeat-purchase products has a median of 1028.37. The box plot for repeat-purchase products has a median of 975.77. The non-repeat group has a larger interquartile range (IQR), and outliers are uncommon.
  • Overall, in light of these data, the repeat-purchase approach, which has been tested for the last year, does not appear to affect sales positively; non-repeat products generate higher sales.
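The numbers a box plot summarizes (median and IQR per group) can be computed directly. The sketch below uses invented sales values for each group and Python's `statistics` module. The `"inclusive"` quantile method is an assumption chosen for this example, so its results may differ slightly from those of a plotting library; `box_summary` is a hypothetical helper name.

```python
from statistics import median, quantiles

# Hypothetical sales values per group (0 = non repeat, 1 = repeat).
sales_by_group = {
    0: [700, 900, 1000, 1200, 1500],
    1: [800, 900, 950, 1000, 1100],
}


def box_summary(values):
    """Median and interquartile range, as drawn in a box plot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"median": median(values), "iqr": q3 - q1}


summary = {group: box_summary(v) for group, v in sales_by_group.items()}
for group, stats in summary.items():
    print(group, stats)
```

Comparing the two medians and IQRs side by side is the numeric counterpart of reading the two boxes in the plot.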