Skip to content
Pet Box Subscription Exploratory Data Analysis
  • AI Chat
  • Code
  • Report
  • Pet Box Subscription Data Analysis

    Tasks 1

    For every column in the data:

    • State whether the values match the description given in the table above.
    • State the number of missing values in the column.
    • Describe what you did to make values match the description if they did not match.

    Solution:

    Product ID: There were 1500 unique values that match the description given. There are no missing values. No changes were made to this column.

    Category: This column had seven unique categories,'Food' ,'Housing' ,'Medicine', 'Toys', 'Equipment' 'Accessory', '-' that not match those in the description. There are missing values '-', and I renamed the '-' category to 'Unknown' like the discription of the column ask.After cleaning of the column we have these categories,'Food' ,'Housing' ,'Medicine', 'Toys', 'Equipment' 'Accessory', 'Unknown'.

    Animal: This column had four unique categories,'Bird', 'Dog', 'Cat', 'Fish', that match those in the description. There were no missing values.

    Size: This column these unique categories, 'large', 'MEDIUM', 'medium', 'small', 'Small', 'Large' 'SMALL', 'Medium', 'LARGE', that not match those in the description. There were miss writing of values.I replace the repeated value and harmonize the values and after cleaning we have three unique categories size, 'Large', 'Medium', 'Small' that match those in the description.

    price: The values of this column were number and string, this string values are unlisted and we transfrom the values to null then we get 150 values that were null, which is not consistent with the description given.The missing values were replaced with the median value of the remaining data, which was 29.42.

    Sales: The values of this column were all rounded to 2 decimal places, and the values ranges from 286.94 to 2255.96. There were no missing values.

    Rating: The values of the column are integer and NA.The number of NA values are 150.The values of the column did not match the description.We replace the NA values to 0 as the description said.

    Repeat Purchase: There were (0, 1) unique values that match the description given. There are no missing values. No changes were made to this column.

    Task 2

    Create a visualization that shows how many products are repeat purchases. Use the visualization to:

    • State which category of the variable repeat purchases has the most observations
    • Explain whether the observations are balanced across categories of the variable repeat purchases
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt 
    import numpy as np
    import matplotlib.style as style
    style.use('ggplot')
    plt.figure(figsize=(10,6))
    sns.set_palette(['gray'])
    df = pd.read_csv('Pet_Box_Subscription.csv')
    
    df.head()
    df.info()
    df.describe()
    df.boxplot()
    nulls_count = df.isnull().sum()
    print(nulls_count)

    Data Validation

    I will be doing validation one after another.

    Let's start the data validation with the first column product_id

    # 1 product_id
    
    # Check if there is any  duplicates
    
    has_duplicates = df['product_id'].duplicated().any()
    
    # Print duplicate
    print(has_duplicates)
    
    # Check for number of duplicates
    df['product_id'].duplicated().sum()
    
    # Check for number of unique ID
    len(df['product_id'].unique())
    # 2 category
    
    # Unique category
    unique_category = df['category'].unique()
    
    print(unique_category)
    
    # Replace value
    df['category'].replace({'-': 'Unknown'}, inplace=True)
    
    # Count nul category
    null_counts_category = df['category'].isnull().sum()
    print(null_counts_category)
    
    # Group count
    group = df.groupby('category')['category'].count()
    print(group)
    # 3 animal
    
    # Unique animal
    unique_animal = df['animal'].unique()
    print(unique_animal)
    
    # Count null animal
    null_counts_animal = df['animal'].isnull().sum()
    print(null_counts_animal)
    
    # Group count
    group = df.groupby('animal')['animal'].count()
    print(group)