Competition - Cats and dogs

Cats vs Dogs: The Great Pet Debate 🐱🐶

📖 Background

You and your friend have debated for years whether cats or dogs make more popular pets. You finally decide to settle the score by analyzing pet data across different regions of the UK. Your friend found data on estimated pet populations, average pets per household, and geographic factors across UK postal code areas. It's time to dig into the numbers and settle the cat vs. dog debate!

💾 The data

There are three data files, described below.

The `population_per_postal_code.csv` data contains these columns:

Column	Description
`postal_code`	An identifier for each postal code area
`estimated_cat_population`	The estimated cat population for the postal code area
`estimated_dog_population`	The estimated dog population for the postal code area

The `avg_per_household.csv` data contains these columns:

Column	Description
`postal_code`	An identifier for each postal code area
`cats_per_household`	The average number of cats per household in the postal code area
`dog_per_household`	The average number of dogs per household in the postal code area

The `postal_code_areas.csv` data contains these columns:

Column	Description
`postal_code`	An identifier for each postal code area
`town`	The town/towns which are contained in the postal code area
`county`	The UK county that the postal code area is located in
`population`	The population of people in each postal code area
`num_households`	The number of households in each postal code area
`uk_region`	The region in the UK which the postal code is located in

Acknowledgments: Data has been assembled and modified from two different sources: Animal and Plant Health Agency, Postcodes.

💪 Challenge

Leverage the pet data to analyze and compare cat vs. dog rates across different regions of the UK. Your goal is to identify factors associated with higher cat or dog popularity.

Some examples:

Examine whether pet preferences correlate with estimated pet populations or geographic regions. Create visualizations to present your findings.
Develop an accessible summary of study findings on factors linked to cat and dog ownership rates for non-technical audiences.
See if you can identify any regional trends; which areas prefer cats vs. dogs?
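
As a starting point for the regional question, here is a minimal sketch of one possible approach: sum the estimated populations per region and compare them as a ratio. The rows and region labels below are made up for illustration; only the column names (`uk_region`, `estimated_cat_population`, `estimated_dog_population`) come from the data description above. The `cat_to_dog_ratio` and `preferred_pet` columns are hypothetical helper names, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for the merged competition data
toy = pd.DataFrame({
    "uk_region": ["London", "London", "Scotland", "Scotland"],
    "estimated_cat_population": [1200.0, 800.0, 500.0, 700.0],
    "estimated_dog_population": [1000.0, 600.0, 900.0, 600.0],
})

# Total cats and dogs per region
regional = toy.groupby("uk_region")[
    ["estimated_cat_population", "estimated_dog_population"]
].sum()

# A ratio above 1 means more cats than dogs in that region
regional["cat_to_dog_ratio"] = (
    regional["estimated_cat_population"] / regional["estimated_dog_population"]
)
regional["preferred_pet"] = regional["cat_to_dog_ratio"].gt(1).map(
    {True: "cats", False: "dogs"}
)

print(regional)
```

A ratio of totals weights large postal code areas more heavily than small ones; averaging per-area ratios instead would answer a slightly different question.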

Data Cleansing

import pandas as pd

# Load the data from the CSV files
population_per_postal_code = pd.read_csv('data/population_per_postal_code.csv')
avg_per_household = pd.read_csv('data/avg_per_household.csv')
postal_code_areas = pd.read_csv('data/postal_codes_areas.csv')

# Display the first few rows of each dataset
population_per_postal_code_head = population_per_postal_code.head()
avg_per_household_head = avg_per_household.head()
postal_code_areas_head = postal_code_areas.head()

population_per_postal_code_head

avg_per_household_head

postal_code_areas_head

# Strip thousands separators and convert the population estimates to float
# (astype(str) guarantees the .str accessor is available)
population_per_postal_code['estimated_cat_population'] = population_per_postal_code['estimated_cat_population'].astype(str).str.replace(',', '').astype(float)
population_per_postal_code['estimated_dog_population'] = population_per_postal_code['estimated_dog_population'].astype(str).str.replace(',', '').astype(float)

# Convert 'postcode' in avg_per_household to string (to match with 'postal_code')
avg_per_household['postcode'] = avg_per_household['postcode'].astype(str)

# Display the data types to confirm the changes
population_per_postal_code.dtypes, avg_per_household.dtypes
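
The conversion above uses `.astype(float)`, which raises a `ValueError` if any entry is not a number once the commas are removed. A possible alternative, sketched here on a made-up series rather than the competition data, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into `NaN` so that the missing-value check can catch them. The names `raw` and `cleaned` are hypothetical.

```python
import pandas as pd

# Hypothetical raw values: a thousands separator, a plain number,
# a non-numeric placeholder, and a missing entry
raw = pd.Series(["1,234.5", "987", "n/a", None])

# Remove commas literally (regex=False), then coerce anything
# unparseable to NaN instead of raising an error
cleaned = pd.to_numeric(
    raw.astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

print(cleaned)
print("Unparseable entries:", cleaned.isna().sum())
```

Whether silently coercing bad entries is acceptable depends on the data; counting the resulting `NaN` values makes the loss visible.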

# Check for missing values in each dataset
missing_values_population = population_per_postal_code.isnull().sum()
missing_values_avg_household = avg_per_household.isnull().sum()
missing_values_postal_code_areas = postal_code_areas.isnull().sum()

missing_values_population, missing_values_avg_household, missing_values_postal_code_areas

# Handle missing values in the 'county' and 'uk_region' columns
# Remove rows with missing 'county' as there's only one
postal_code_areas = postal_code_areas.dropna(subset=['county'])

# For 'uk_region', we will impute with the most common region if possible
most_common_region = postal_code_areas['uk_region'].mode()[0]
postal_code_areas['uk_region'] = postal_code_areas['uk_region'].fillna(most_common_region)
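
Filling every missing `uk_region` with the single most common region can place an area in the wrong part of the country. A different technique, not the one used in this notebook, is to take the most frequent region within the same county. The sketch below uses a few hand-written toy rows; `areas` and `region_by_county` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows; the real table is postal_code_areas
areas = pd.DataFrame({
    "county": ["Kent", "Kent", "Fife", "Fife"],
    "uk_region": ["South East", None, "Scotland", None],
})

# Most frequent known region for each county
region_by_county = (
    areas.dropna(subset=["uk_region"])
    .groupby("county")["uk_region"]
    .agg(lambda s: s.mode().iloc[0])
)

# Fill gaps with the county-level value; rows whose county has no
# known region remain missing
areas["uk_region"] = areas["uk_region"].fillna(
    areas["county"].map(region_by_county)
)

print(areas)
```

This assumes each county lies mostly within one region, which should be checked against the actual data before relying on it.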

# Review rows with missing 'population' or 'num_households'
missing_population_households = postal_code_areas[postal_code_areas['population'].isnull() | postal_code_areas['num_households'].isnull()]

# Display the rows with missing 'population' or 'num_households' for further action
missing_population_households

# Check the total number of rows in the postal_code_areas dataset
total_rows_postal_code_areas = postal_code_areas.shape[0]
total_rows_postal_code_areas

# Remove rows with missing 'population' or 'num_households'
# (.copy() avoids a SettingWithCopyWarning when columns are reassigned later)
postal_code_areas_cleaned = postal_code_areas.dropna(subset=['population', 'num_households']).copy()

# Verify the number of remaining rows
remaining_rows_postal_code_areas = postal_code_areas_cleaned.shape[0]
remaining_rows_postal_code_areas

# Ensure postal codes are consistent
# Trim whitespace and convert to uppercase if necessary (generally, postal codes are uppercase in the UK)
population_per_postal_code['postal_code'] = population_per_postal_code['postal_code'].str.strip().str.upper()
avg_per_household['postcode'] = avg_per_household['postcode'].str.strip().str.upper()
postal_code_areas_cleaned['postal_code'] = postal_code_areas_cleaned['postal_code'].str.strip().str.upper()

# Check for unique postal codes in each dataset
unique_population_postal_codes = population_per_postal_code['postal_code'].nunique()
unique_avg_household_postal_codes = avg_per_household['postcode'].nunique()
unique_postal_code_areas = postal_code_areas_cleaned['postal_code'].nunique()

unique_population_postal_codes, unique_avg_household_postal_codes, unique_postal_code_areas

# Merge the datasets on postal codes
merged_data = pd.merge(population_per_postal_code, avg_per_household, left_on='postal_code', right_on='postcode', how='inner')
merged_data = pd.merge(merged_data, postal_code_areas_cleaned, on='postal_code', how='inner')

# Drop redundant columns
merged_data = merged_data.drop(columns=['postcode'])

# Display the first few rows of the merged dataset to confirm the merge
merged_data.head()
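
An inner merge silently drops postal codes that appear in only one file. To see how many rows would be lost, one option is an outer merge with `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. This is a minimal sketch on invented postal codes; `left`, `right`, `audit`, and `match_counts` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frames mirroring the key columns used above
left = pd.DataFrame({
    "postal_code": ["AB10", "AB11", "ZZ99"],
    "estimated_cat_population": [100.0, 200.0, 50.0],
})
right = pd.DataFrame({
    "postcode": ["AB10", "AB11", "YY88"],
    "cats_per_household": [0.3, 0.4, 0.2],
})

# Outer merge keeps unmatched rows; validate raises an error if either
# key column contains duplicates
audit = pd.merge(
    left,
    right,
    left_on="postal_code",
    right_on="postcode",
    how="outer",
    indicator=True,
    validate="one_to_one",
)

# Count matched and unmatched keys
match_counts = audit["_merge"].value_counts()
print(match_counts)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would discard, so they are worth inspecting before committing to `how='inner'`.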
