Cats vs Dogs: The Great Pet Debate 🐱🐶
Background
My friend and I have debated for years whether cats or dogs make more popular pets. I finally decided to settle the score the best way I know how: with data, by analyzing pet populations across different regions of the UK. I found data on estimated pet populations, average pets per household, and geographic factors across UK postal code areas. It's time to dig into the numbers and settle the cat vs. dog debate. Stay tuned!
💾 The data
There are three data files, described below.
The `population_per_postal_code.csv` data contains these columns:
| Column | Description |
|---|---|
| `postal_code` | An identifier for each postal code area |
| `estimated_cat_population` | The estimated cat population for the postal code area |
| `estimated_dog_population` | The estimated dog population for the postal code area |
The `avg_per_household.csv` data contains these columns:
| Column | Description |
|---|---|
| `postal_code` | An identifier for each postal code area |
| `cats_per_household` | The average number of cats per household in the postal code area |
| `dogs_per_household` | The average number of dogs per household in the postal code area |
The `postal_code_areas.csv` data contains these columns:
| Column | Description |
|---|---|
| `postal_code` | An identifier for each postal code area |
| `town` | The town/towns which are contained in the postal code area |
| `county` | The UK county that the postal code area is located in |
| `population` | The population of people in each postal code area |
| `num_households` | The number of households in each postal code area |
| `uk_region` | The region in the UK which the postal code is located in |
*Acknowledgments: Data has been assembled and modified from two different sources: Animal and Plant Health Agency, Postcodes.*
import pandas as pd
import numpy as np
import json
import folium
!pip install geojson
import geojson
import matplotlib.pyplot as plt
import seaborn as sns
from urllib.request import urlopen
population_raw_data = pd.read_csv('data/population_per_postal_code.csv')
population_raw_data
avg_raw_data = pd.read_csv('data/avg_per_household.csv')
avg_raw_data
postcodes_raw_data = pd.read_csv('data/postal_codes_areas.csv')
postcodes_raw_data
Executive Summary
This Python project focuses on analyzing and visualizing pet population data across different postal codes in the UK. Utilizing several data sources, the project aims to provide insights into the distribution of pet populations, specifically cats and dogs, and how these distributions correlate with human populations and household numbers.
Data Sources and Preparation
The project leverages three primary datasets:
- `population_per_postal_code.csv` - Contains estimated cat and dog populations per postal code.
- `avg_per_household.csv` - Provides average numbers of cats and dogs per household, categorized by postcode.
- `postal_codes_areas.csv` - Includes postal code areas along with town, county, population, number of households, and UK region information.
These datasets were loaded into pandas DataFrames for cleaning and analysis. The data preparation phase involved handling missing values, merging datasets, and ensuring data consistency across different sources.
Analysis
The analysis focused on understanding the distribution of pet populations (cats and dogs) across different regions and how these distributions relate to human demographic factors such as population size and the number of households. Key metrics calculated include pets per household and estimated pet populations.
Visualization
To aid in the interpretation of the findings, several visualizations were created:
- Maps using Folium to geographically display the distribution of pet populations across the UK.
- Bar charts and scatter plots using Matplotlib and Seaborn to explore the relationships between pet populations, human populations, and the number of households.
Key Findings
- The project identified regions with the highest and lowest densities of pet populations.
- Analysis revealed correlations between the size of human populations, the number of households, and pet populations, providing insights into pet ownership trends.
- The data suggests regional variations in pet ownership, with some areas showing a higher preference for dogs and others for cats.
Conclusion
This project provides valuable insights into pet ownership patterns across the UK, highlighting the importance of considering both human demographic factors and regional preferences in understanding pet population distributions. The findings can inform stakeholders, including pet supply companies and veterinary services, about potential market opportunities and areas requiring pet-related services.
# Inspect the population_raw_data
population_raw_data.describe()
population_raw_data.isna().sum()
population_raw_data["estimated_cat_population"] = population_raw_data["estimated_cat_population"].str.replace(",", "")
population_raw_data["estimated_dog_population"] = population_raw_data["estimated_dog_population"].str.replace(",", "")
population_data = population_raw_data
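Stripping the commas still leaves the population columns as strings until they are cast later on. As a minimal sketch of doing the conversion in one step, the snippet below runs on a small made-up Series (the values are illustrative, not taken from the dataset) and assumes the raw values look like `"1,200.5"`. Using `errors="coerce"` is my own choice here, so that anything unparseable shows up as NaN rather than raising.

```python
import pandas as pd

# Made-up values for illustration only; not taken from the dataset
raw = pd.Series(["1,200.5", "950.0", "n/a"], name="estimated_cat_population")

# Strip thousands separators literally (regex=False), then convert;
# errors="coerce" turns unparseable entries into NaN instead of raising
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(cleaned)
print("Unparseable entries:", int(cleaned.isna().sum()))
```

Counting the NaN values afterwards is a quick way to see whether anything other than commas was hiding in the column.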
# Inspect the avg_raw_data
avg_raw_data.describe()
avg_raw_data.isna().sum()
avg_data = avg_raw_data
# Inspect the postcodes_raw_data
postcodes_raw_data.describe()
print(postcodes_raw_data.isna().sum())
# Begin postcodes_data cleanup
# Drop single row without county
postcodes_data = postcodes_raw_data.dropna(subset=["county"])
# Identify counties with all records missing both population and num_households data
counties_with_data = postcodes_data.dropna(subset=["population", "num_households"])["county"].unique()
all_counties = postcodes_data["county"].unique()
counties_no_data_all_records = [county for county in all_counties if county not in counties_with_data]
# Remove all counties without population and num_households data
postcodes_data = postcodes_data[~postcodes_data["county"].isin(counties_no_data_all_records)].copy()
# Impute the mean summary statistic to missing values in population and num_households
cols_with_missing_values = ["population", "num_households"]
for col in cols_with_missing_values:
    postcodes_data[col] = postcodes_data.groupby("county")\
        [col].transform(lambda x: x.fillna(x.mean()).astype(int))
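To make the county-level mean imputation easier to follow, here is a small sketch on a toy frame. The county names and numbers are made up for the example; only the column names mirror `postcodes_data`. It shows that `transform` keeps the original row order, so each gap is filled with the mean of its own county.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror postcodes_data
toy = pd.DataFrame({
    "county": ["Kent", "Kent", "Kent", "Devon", "Devon"],
    "population": [1000.0, np.nan, 3000.0, 500.0, np.nan],
})

# transform returns a result aligned to the original rows, so each missing
# value is replaced by the mean of its own county
toy["population"] = (
    toy.groupby("county")["population"]
       .transform(lambda s: s.fillna(s.mean()))
       .astype(int)
)
print(toy)
```

Note that a county whose values are all missing would still contain NaN after `fillna`, and the cast to `int` would then fail, which is why such counties need to be removed first.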
# Impute the mode to missing values in uk_region
postcodes_data["uk_region"] = postcodes_data.groupby("county")\
["uk_region"].transform(lambda x: x.fillna(x.mode().iloc[0]))
print(postcodes_data.isna().sum())
# Merging the datasets on postal code
merged_data = postcodes_data.merge(population_data, how='left', on='postal_code')
merged_data = merged_data.merge(avg_data, how='left', left_on='postal_code', right_on='postcode').drop('postcode', axis=1)
# Ensure columns are the right dtype
merged_data = merged_data.astype({"postal_code": "string",
"town": "string",
"county": "string",
"uk_region": "string",
"estimated_cat_population": "float64",
"estimated_dog_population": "float64",
})
print(merged_data.dtypes)
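A left merge silently keeps postal codes that have no match in the other file, so it can be worth checking how many there are. The sketch below is one possible check, not part of the original analysis. It uses two tiny made-up frames and the `indicator=True` option of `DataFrame.merge`, which adds a `_merge` column describing where each row came from.

```python
import pandas as pd

# Toy frames with made-up values; only the key column name follows the notebook
areas = pd.DataFrame({"postal_code": ["AB1", "AB2", "AB3"]})
pops = pd.DataFrame({
    "postal_code": ["AB1", "AB3"],
    "estimated_cat_population": [1200.5, 800.0],
})

# indicator=True adds a "_merge" column saying where each row found a match
checked = areas.merge(pops, how="left", on="postal_code", indicator=True)

# Rows marked "left_only" exist in areas but have no pet population record
unmatched = checked.loc[checked["_merge"] == "left_only", "postal_code"].tolist()
print("Postal codes without pet data:", unmatched)
```

Any postal codes listed here would end up with NaN pet populations in the merged frame.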
# Inspect correlation
sns.heatmap(merged_data.corr(numeric_only=True), annot=True)
Based on the heatmap generated from the `merged_data` dataframe, here are some potential key findings:
- Correlation between Cat and Dog Populations: If there is a high positive correlation between `estimated_cat_population` and `estimated_dog_population`, it suggests that areas with high cat populations also tend to have high dog populations. This could indicate that pet ownership in general is higher in these areas.
- Household Count and Pet Ownership: The correlation between `num_households` and both `cats_per_household` and `dogs_per_household` could reveal whether areas with more households tend to keep more pets per household. Note that `num_households` counts the households in an area; it does not measure household size.
- Population Size and Pet Ownership: The correlation between `population` and the pet populations (`estimated_cat_population`, `estimated_dog_population`) could indicate whether more populous areas have more pets. Because `population` is a head count rather than a density, it says nothing directly about how densely populated an area is.
- Regional Differences in Pet Ownership: The heatmap won't directly show correlations with `uk_region` since it's a categorical variable, but if the data were encoded or if additional analysis were done, one might find regional trends in pet ownership.
- Data Completeness and Reliability: If there are any NaN or very low correlation values (close to 0) between variables that are expected to be related, it might indicate missing or unreliable data in those fields.
Remember, correlation does not imply causation. High or low correlation between two variables does not mean that one variable causes the other to increase or decrease. Further analysis would be required to understand the underlying reasons for these correlations.
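Since the heatmap can't show regional differences, one possible follow-up is a simple group-by on `uk_region`. The sketch below is an illustration under assumptions, not a result: the frame is toy data with made-up numbers, and `cat_share` is a hypothetical helper column I introduce for the example. Only the other column names mirror `merged_data`.

```python
import pandas as pd

# Toy data with made-up values; column names mirror merged_data
toy = pd.DataFrame({
    "uk_region": ["Scotland", "Scotland", "London", "London"],
    "estimated_cat_population": [100.0, 300.0, 500.0, 100.0],
    "estimated_dog_population": [200.0, 400.0, 200.0, 100.0],
})

# Sum both pet populations per region
by_region = toy.groupby("uk_region")[
    ["estimated_cat_population", "estimated_dog_population"]
].sum()

# cat_share (hypothetical column): cats as a fraction of all cats and dogs
total = by_region["estimated_cat_population"] + by_region["estimated_dog_population"]
by_region["cat_share"] = by_region["estimated_cat_population"] / total
print(by_region)
```

A share above 0.5 would mean cats outnumber dogs in that region, and a share below 0.5 would mean the opposite.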
# Create separate dataframes for cat and dog data to handle them individually
cat_data = merged_data[['postal_code', 'town', 'county', 'population', 'num_households', 'uk_region',
'estimated_cat_population', 'cats_per_household']]
dog_data = merged_data[['postal_code', 'town', 'county', 'population', 'num_households', 'uk_region',
'estimated_dog_population', 'dogs_per_household']]
# Rename the columns so cat_data and dog_data match
cat_data = cat_data.rename(columns={
"estimated_cat_population": "estimated_pet_population",
"cats_per_household": "pets_per_household"
})
dog_data = dog_data.rename(columns={
"estimated_dog_population": "estimated_pet_population",
"dogs_per_household": "pets_per_household"
})
# Add columns to differentiate both dataframes
cat_data["pet_type"] = "Cat"
dog_data["pet_type"] = "Dog"
# Combine cat and dog dataframes into one dataframe
pet_data = (
    pd.concat([cat_data, dog_data], ignore_index=True)
    .sort_values(by="postal_code", ascending=True)
)
pet_data
pet_data["uk_region"].value_counts()
# Visualize relationship between pet population and number of pets per household
palette = {
'Cat': 'tab:blue',
'Dog': 'tab:red',
}
plot = sns.scatterplot(data=pet_data, x="estimated_pet_population", y="pets_per_household", alpha=0.7, hue="pet_type", palette=palette)
plot.set(xlabel="Estimated Pet Population", ylabel="Pets per Household")
plot.get_legend().set_title("Pet Type")