Pet Box Subscription Data Analysis
Tasks 1
For every column in the data:
- State whether the values match the description given in the table above.
- State the number of missing values in the column.
- Describe what you did to make values match the description if they did not match.
Solution:
Product ID: There were 1500 unique values that match the description given. There are no missing values. No changes were made to this column.
Category: This column had seven unique categories,'Food' ,'Housing' ,'Medicine', 'Toys', 'Equipment' 'Accessory', '-'
that not match those in the description. There are missing values '-'
, and I renamed the '-'
category to 'Unknown' like the discription of the column ask.After cleaning of the column we have these categories,'Food' ,'Housing' ,'Medicine', 'Toys', 'Equipment' 'Accessory', 'Unknown'
.
Animal: This column had four unique categories,'Bird', 'Dog', 'Cat', 'Fish'
, that match those in the description. There were no missing values.
Size: This column these unique categories, 'large', 'MEDIUM', 'medium', 'small', 'Small', 'Large' 'SMALL', 'Medium', 'LARGE', that not match those in the description. There were miss writing of values.I replace the repeated value and harmonize the values and after cleaning we have three unique categories size, 'Large', 'Medium', 'Small'
that match those in the description.
price: The values of this column were number
and string
, this string
values are unlisted
and we transfrom the values to null
then we get 150 values that were null, which is not consistent with the description given.The missing values were replaced with the median value of the remaining data, which was 29.42.
Sales: The values of this column were all rounded to 2 decimal places, and the values ranges from 286.94 to 2255.96. There were no missing values.
Rating: The values of the column are integer
and NA
.The number of NA
values are 150.The values of the column did not match the description.We replace the NA
values to 0
as the description said.
Repeat Purchase: There were (0, 1) unique values that match the description given. There are no missing values. No changes were made to this column.
Task 2
Create a visualization that shows how many products are repeat purchases. Use the visualization to:
- State which category of the variable repeat purchases has the most observations
- Explain whether the observations are balanced across categories of the variable repeat purchases
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.style as style
style.use('ggplot')
plt.figure(figsize=(10,6))
sns.set_palette(['gray'])
df = pd.read_csv('Pet_Box_Subscription.csv')
df.head()
df.info()
df.describe()
df.boxplot()
nulls_count = df.isnull().sum()
print(nulls_count)
Data Validation
I will be doing validation one after another.
Let's start the data validation with the first column product_id
# 1 product_id
# Check if there is any duplicates
has_duplicates = df['product_id'].duplicated().any()
# Print duplicate
print(has_duplicates)
# Check for number of duplicates
df['product_id'].duplicated().sum()
# Check for number of unique ID
len(df['product_id'].unique())
# 2 category
# Unique category
unique_category = df['category'].unique()
print(unique_category)
# Replace value
df['category'].replace({'-': 'Unknown'}, inplace=True)
# Count nul category
null_counts_category = df['category'].isnull().sum()
print(null_counts_category)
# Group count
group = df.groupby('category')['category'].count()
print(group)
# 3 animal
# Unique animal
unique_animal = df['animal'].unique()
print(unique_animal)
# Count null animal
null_counts_animal = df['animal'].isnull().sum()
print(null_counts_animal)
# Group count
group = df.groupby('animal')['animal'].count()
print(group)