
Cats vs Dogs: The Great Pet Debate 🐱🐶

📖 Background

You and your friend have debated for years whether cats or dogs make more popular pets. You finally decide to settle the score by analyzing pet data across different regions of the UK. Your friend found data on estimated pet populations, average pets per household, and geographic factors across UK postal code areas. It's time to dig into the numbers and settle the cat vs. dog debate!

💾 The data

There are three data files, described below.

The population_per_postal_code.csv data contains these columns:

Column                     Description
postal_code                An identifier for each postal code area
estimated_cat_population   The estimated cat population for the postal code area
estimated_dog_population   The estimated dog population for the postal code area

The avg_per_household.csv data contains these columns:

Column                     Description
postal_code                An identifier for each postal code area
cats_per_household         The average number of cats per household in the postal code area
dog_per_household          The average number of dogs per household in the postal code area

The postal_code_areas.csv data contains these columns:

Column                     Description
postal_code                An identifier for each postal code area
town                       The town or towns contained in the postal code area
county                     The UK county in which the postal code area is located
population                 The number of people in each postal code area
num_households             The number of households in each postal code area
uk_region                  The UK region in which the postal code area is located

Acknowledgments: Data has been assembled and modified from two different sources: Animal and Plant Health Agency, Postcodes.

💪 Challenge

Leverage the pet data to analyze and compare cat vs. dog rates across different regions of the UK. Your goal is to identify factors associated with higher cat or dog popularity.

Some examples:

  • Examine whether pet preferences correlate with estimated pet populations or geographic regions. Create visualizations to present your findings.
  • Develop an accessible summary of study findings on factors linked to cat and dog ownership rates for non-technical audiences.
  • See if you can identify any regional trends; which areas prefer cats vs. dogs?

Data Cleansing

import pandas as pd

# Load the data from the CSV files
population_per_postal_code = pd.read_csv('data/population_per_postal_code.csv')
avg_per_household = pd.read_csv('data/avg_per_household.csv')
postal_code_areas = pd.read_csv('data/postal_codes_areas.csv')

# Display the first few rows of each dataset
population_per_postal_code_head = population_per_postal_code.head()
avg_per_household_head = avg_per_household.head()
postal_code_areas_head = postal_code_areas.head()

population_per_postal_code_head
avg_per_household_head
postal_code_areas_head
# Ensure the columns are strings before using .str accessor
population_per_postal_code['estimated_cat_population'] = population_per_postal_code['estimated_cat_population'].astype(str).str.replace(',', '').astype(float)
population_per_postal_code['estimated_dog_population'] = population_per_postal_code['estimated_dog_population'].astype(str).str.replace(',', '').astype(float)

# Convert 'postcode' in avg_per_household to string (to match with 'postal_code')
avg_per_household['postcode'] = avg_per_household['postcode'].astype(str)

# Display the data types to confirm the changes
population_per_postal_code.dtypes, avg_per_household.dtypes
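The comma stripping above could also be done at load time. The following is a minimal sketch, assuming the population columns hold thousands separators inside quoted fields. The inline CSV text and its values are made up for illustration and are not taken from the real file. It relies on the `thousands` parameter of `pd.read_csv`:

```python
import io

import pandas as pd

# Hypothetical sample standing in for population_per_postal_code.csv;
# the values are illustrative only.
sample_csv = io.StringIO(
    'postal_code,estimated_cat_population,estimated_dog_population\n'
    'AB10,"1,675","1,109"\n'
    'AB11,850,920\n'
)

# thousands=',' makes the parser treat commas in numeric fields as separators
sample = pd.read_csv(sample_csv, thousands=',')

print(sample.dtypes)
print(sample)
```

With this approach the columns arrive as numeric dtypes, so the later `.astype(str).str.replace(...)` step would not be needed.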
# Check for missing values in each dataset
missing_values_population = population_per_postal_code.isnull().sum()
missing_values_avg_household = avg_per_household.isnull().sum()
missing_values_postal_code_areas = postal_code_areas.isnull().sum()

missing_values_population, missing_values_avg_household, missing_values_postal_code_areas
# Handle missing values in the 'county' and 'uk_region' columns
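# Remove rows with missing 'county' as there's only one
# (.copy() makes the result independent, which can avoid a SettingWithCopyWarning on later column assignment)
postal_code_areas = postal_code_areas.dropna(subset=['county']).copy()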

# For 'uk_region', we will impute with the most common region if possible
most_common_region = postal_code_areas['uk_region'].mode()[0]
postal_code_areas['uk_region'] = postal_code_areas['uk_region'].fillna(most_common_region)

# Review rows with missing 'population' and 'num_households'
missing_population_households = postal_code_areas[postal_code_areas['population'].isnull() | postal_code_areas['num_households'].isnull()]

# Display the rows with missing 'population' and 'num_households' for further action
missing_population_households
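Filling every missing uk_region with the overall mode assigns areas to the most common region regardless of where they are. One possible alternative, which is not the method used in this notebook, is to fill each gap with the most common region within the same county. The sketch below assumes that a county mostly sits inside a single region. The `areas` frame and the `county_mode` helper are hypothetical and exist only for illustration:

```python
import pandas as pd

# Hypothetical toy data mirroring the columns of postal_code_areas
areas = pd.DataFrame({
    'postal_code': ['AB10', 'AB11', 'AB12', 'B1', 'B2'],
    'county': ['Aberdeenshire', 'Aberdeenshire', 'Aberdeenshire',
               'West Midlands', 'West Midlands'],
    'uk_region': ['Scotland', 'Scotland', None, 'West Midlands', None],
})

def county_mode(regions):
    """Return the most common non-missing region, or None if there is none."""
    modes = regions.mode()  # mode() ignores missing values by default
    return modes.iloc[0] if not modes.empty else None

# Most common region per county, then use it to fill only the missing entries
county_region = areas.groupby('county')['uk_region'].agg(county_mode)
areas['uk_region'] = areas['uk_region'].fillna(areas['county'].map(county_region))

print(areas)
```

Counties with no known region at all would stay missing under this scheme and would still need a fallback, such as the overall mode.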
# Check the total number of rows in the postal_code_areas dataset
total_rows_postal_code_areas = postal_code_areas.shape[0]
total_rows_postal_code_areas
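# Remove rows with missing 'population' and 'num_households'
# (.copy() makes the result independent, which can avoid a SettingWithCopyWarning when 'postal_code' is reassigned later)
postal_code_areas_cleaned = postal_code_areas.dropna(subset=['population', 'num_households']).copy()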

# Verify the number of remaining rows
remaining_rows_postal_code_areas = postal_code_areas_cleaned.shape[0]
remaining_rows_postal_code_areas
# Ensure postal codes are consistent
# Trim whitespace and convert to uppercase if necessary (generally, postal codes are uppercase in the UK)
population_per_postal_code['postal_code'] = population_per_postal_code['postal_code'].str.strip().str.upper()
avg_per_household['postcode'] = avg_per_household['postcode'].str.strip().str.upper()
postal_code_areas_cleaned['postal_code'] = postal_code_areas_cleaned['postal_code'].str.strip().str.upper()

# Check for unique postal codes in each dataset
unique_population_postal_codes = population_per_postal_code['postal_code'].nunique()
unique_avg_household_postal_codes = avg_per_household['postcode'].nunique()
unique_postal_code_areas = postal_code_areas_cleaned['postal_code'].nunique()

unique_population_postal_codes, unique_avg_household_postal_codes, unique_postal_code_areas
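The unique counts show how many postal codes each dataset has, but not which codes are absent from which dataset. Since an inner merge silently drops unmatched keys, it may help to list them first. This is a rough sketch on hypothetical toy frames. The names `population`, `household`, and `areas` are placeholders, and only the key column names follow the notebook:

```python
import pandas as pd

# Hypothetical toy frames that share only some postal codes
population = pd.DataFrame({'postal_code': ['AB10', 'AB11', 'B1']})
household = pd.DataFrame({'postcode': ['AB10', 'B1', 'B2']})
areas = pd.DataFrame({'postal_code': ['AB10', 'AB11', 'B1', 'B2']})

keys = {
    'population': set(population['postal_code']),
    'household': set(household['postcode']),
    'areas': set(areas['postal_code']),
}

all_codes = set.union(*keys.values())      # codes seen in any dataset
common = set.intersection(*keys.values())  # codes that survive an inner merge

# For each dataset, the codes it lacks relative to the full set
missing = {name: sorted(all_codes - codes) for name, codes in keys.items()}

print(sorted(common))
print(missing)
```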
# Merge the datasets on postal codes
merged_data = pd.merge(population_per_postal_code, avg_per_household, left_on='postal_code', right_on='postcode', how='inner')
merged_data = pd.merge(merged_data, postal_code_areas_cleaned, on='postal_code', how='inner')

# Drop redundant columns
merged_data = merged_data.drop(columns=['postcode'])

# Display the first few rows of the merged dataset to confirm the merge
merged_data.head()
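With the merged table in place, one way to approach the regional question from the challenge is to compare the cat share of the total estimated pet population per region. The sketch below uses a hypothetical toy frame with invented numbers. The `cat_share` and `preferred_pet` columns are illustrative names, not columns from the data:

```python
import pandas as pd

# Hypothetical stand-in for merged_data; the numbers are invented
merged = pd.DataFrame({
    'postal_code': ['AB10', 'AB11', 'B1', 'B2'],
    'uk_region': ['Scotland', 'Scotland', 'West Midlands', 'West Midlands'],
    'estimated_cat_population': [1000.0, 3000.0, 2000.0, 1000.0],
    'estimated_dog_population': [1000.0, 1000.0, 4000.0, 2000.0],
})

pet_columns = ['estimated_cat_population', 'estimated_dog_population']

# Total estimated cats and dogs per region
regional = merged.groupby('uk_region')[pet_columns].sum()

# Share of cats among all estimated cats and dogs in the region
regional['cat_share'] = (
    regional['estimated_cat_population'] / regional[pet_columns].sum(axis=1)
)

# Label each region; an exact 50/50 split is labelled 'dog' by this rule
regional['preferred_pet'] = regional['cat_share'].gt(0.5).map({True: 'cat', False: 'dog'})

print(regional)
```

Summing populations before dividing weights each postal code area by its size. Averaging per-area shares instead would give small areas the same influence as large ones.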