Data Analyst Associate Practical Exam Submission
You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.
You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.
Background
PetMind is a retailer of products for pets. They are based in the United States. PetMind sells products that are a mix of luxury items and everyday items. Luxury items include toys. Everyday items include food. The company wants to increase sales by selling more everyday products repeatedly. They have been testing this approach for the last year. They now want a report on how repeat purchases impact sales.
Task 1
For every column in the data:
a. State whether the values match the description given in the table above.
b. State the number of missing values in the column.
c. Describe what you did to make values match the description if they did not match.
product_id - no adjustment made since index is 1459 and product_id ended in 1460 as expected
category - column is split into 6 category(Equipment, Food, Toys, Medicine, Housing, and Accesory) but has 25 '-' values and relaced to 'unknown'
animal -colum split into 4 category with no missing values, and no adjustment has been made.
size - there is a value inconsistency in size column such as (medium, MEDIUM, large, Large), adjusted to a uniform lower case on all categories
price - there is 150 'unlisted' in price column and replaced with overall median price of 28.06
sales - no null values and rounded to 2 decimal places
rating - rating has 150 NULL values and assigned 0 to Null, now the range is from 0-9 with no decimal places as expected.
repeat_purchase - Assigned values of 'yes' to 1 and 'no' to 0.
import pandas as pd
import numpy as np
pet_supplies_2212df = pd.read_csv('pet_supplies_2212.csv')
#pet_supplies_2212df.head(50)
# category with '-' values and relaced to 'unknown'
pet_supplies_2212df['category'] = pet_supplies_2212df['category'].replace('-', 'unknown')
# make all size column values uniform to lower case
pet_supplies_2212df['size'] = pet_supplies_2212df['size'].str.lower()
#change the 'unlisted' string to the median of existing price.
pet_supplies_2212df['price'] = pd.to_numeric(pet_supplies_2212df['price'], errors='coerce')
median_price = np.median(pet_supplies_2212df['price'].dropna())
pet_supplies_2212df['price'] = pet_supplies_2212df['price'].fillna(median_price)
#rounding to 2 decimal places
pet_supplies_2212df['price'] = pet_supplies_2212df['price'].round(2)
pet_supplies_2212df['sales'] = pet_supplies_2212df['sales'].round(2)
#fill NULL with 0
pet_supplies_2212df['rating'] = pet_supplies_2212df['rating'].fillna(0)
#repeat_purchase change to 'yes' and 'no'
pet_supplies_2212df['repeat_purchase'] = pet_supplies_2212df['repeat_purchase'].replace(1, 'yes')
pet_supplies_2212df['repeat_purchase'] = pet_supplies_2212df['repeat_purchase'].replace(0, 'no')
#Review of the Changes
pet_supplies_2212df.head(100)
Task 2
Create a visualization that shows how many products are repeat purchases. Use the visualization to:
a. State which category of the variable repeat purchases has the most observations
b. Explain whether the observations are balanced across categories of the variable repeat purchases
How many observations of repeated purchase in the dataframe.?
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(data=pet_supplies_2212df, x='repeat_purchase')
plt.xlabel("Repeat Purchase")
plt.ylabel("Count")
plt.title('Purchase Type')
sns.set()
plt.show()
Upon analyzing the dataset, it is evident that the majority of observations correspond to repeat purchases, with a count of 906, whereas non-repeat purchases account for only 594.With repeat purchases accounting for a majority of the observations and comprising around 60% of the total observation, based on that we can now focus on repeated sales, and how it impact sales.
Among the repeat purchases, which category has the highest number of observations?
#Subset only repeat purchase 'yes'
repeat_purchases_yes = pet_supplies_2212df[pet_supplies_2212df['repeat_purchase'].isin(['yes'])]
categorize_repeat_purchases = repeat_purchases_yes.groupby('category')['repeat_purchase'].count().reset_index()
#Plotting pie chart
plt.pie(categorize_repeat_purchases['repeat_purchase'],
labels = categorize_repeat_purchases['category'],
autopct='%.0f%%')
plt.title('Repeat Purchase by Category')
sns.set()
plt.show()
categorize_repeat_purchases
The highest number of observations is in the category of "Equipment" with 24% of repeat purchase. This indicates that the equipment category has the most instances of repeat purchases in the dataset. Comparing with different categories, we can observe varying levels of repeat purchases, while the categories of "Food", "Housing", and "Medicine" all have an equal percentage of repeat purchases in the dataset.