
Cats vs Dogs: The Great Pet Debate 🐱🐶

📖 Background

You and your friend have debated for years whether cats or dogs make more popular pets. You finally decide to settle the score by analyzing pet data across different regions of the UK. Your friend found data on estimated pet populations, average pets per household, and geographic factors across UK postal code areas. It's time to dig into the numbers and settle the cat vs. dog debate!

💾 The data

There are three data files, which contain the data described below.

The population_per_postal_code.csv data contains these columns:

| Column | Description |
| --- | --- |
| postal_code | An identifier for each postal code area |
| estimated_cat_population | The estimated cat population for the postal code area |
| estimated_dog_population | The estimated dog population for the postal code area |

The avg_per_household.csv data contains these columns:

| Column | Description |
| --- | --- |
| postal_code | An identifier for each postal code area |
| cats_per_household | The average number of cats per household in the postal code area |
| dog_per_household | The average number of dogs per household in the postal code area |

The postal_code_areas.csv data contains these columns:

| Column | Description |
| --- | --- |
| postal_code | An identifier for each postal code area |
| town | The town or towns contained in the postal code area |
| county | The UK county in which the postal code area is located |
| population | The number of people living in the postal code area |
| num_households | The number of households in the postal code area |
| uk_region | The UK region in which the postal code area is located |

*Acknowledgments: Data has been assembled and modified from two different sources: Animal and Plant Health Agency, Postcodes.
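Since all three files share the postal_code column, they can be combined into a single table keyed on it. The snippet below is a minimal sketch of that idea; the rows are invented toy values, not figures from the real files, and the variable names are illustrative only.

```python
import pandas as pd

# Toy rows standing in for the three CSV files (values invented for illustration)
population = pd.DataFrame({
    "postal_code": ["AB1", "CD2"],
    "estimated_cat_population": [1200.0, 800.0],
    "estimated_dog_population": [1500.0, 600.0],
})
avg = pd.DataFrame({
    "postal_code": ["AB1", "CD2"],
    "cats_per_household": [1.4, 1.2],
    "dog_per_household": [1.3, 1.1],
})
areas = pd.DataFrame({
    "postal_code": ["AB1", "CD2"],
    "uk_region": ["Scotland", "Wales"],
    "num_households": [5000, 3000],
})

# Inner joins keep only the postal codes present in every table
combined = (
    population
    .merge(avg, on="postal_code", how="inner")
    .merge(areas, on="postal_code", how="inner")
)
print(combined.shape)
```

An inner join silently drops postal codes that are missing from any one table, so it is worth checking the keys for mismatches before relying on the combined result.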

import pandas as pd

# Load the estimated cat and dog populations per postal code area
population_raw_data = pd.read_csv('data/population_per_postal_code.csv')
population_raw_data
population_raw_data.info()
# Strip the thousands separators from the population columns and convert them to floats
population_raw_data['estimated_cat_population'] = population_raw_data['estimated_cat_population'].str.replace(',', '', regex=False).astype(float)
population_raw_data['estimated_dog_population'] = population_raw_data['estimated_dog_population'].str.replace(',', '', regex=False).astype(float)
population_raw_data.info()
population_raw_data
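The comma-stripping step above can also be written as a loop over the affected columns, which avoids repeating the same expression. This is a small sketch on made-up values; the "n/a" entry is a hypothetical bad value added to show how `pd.to_numeric` with `errors="coerce"` turns unparseable strings into NaN instead of raising.

```python
import pandas as pd

# Made-up values; "n/a" is a hypothetical unparseable entry
raw = pd.DataFrame({
    "postal_code": ["AB1", "CD2"],
    "estimated_cat_population": ["1,200", "800"],
    "estimated_dog_population": ["1,500", "n/a"],
})

count_cols = ["estimated_cat_population", "estimated_dog_population"]
for col in count_cols:
    # Remove the literal comma, then convert; bad values become NaN
    cleaned = raw[col].str.replace(",", "", regex=False)
    raw[col] = pd.to_numeric(cleaned, errors="coerce")

print(raw.dtypes)
```

`pd.read_csv` also accepts a `thousands=","` argument, which may make this cleaning step unnecessary if the values are quoted consistently in the file.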
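# Load the average number of cats and dogs per household
avg_raw_data = pd.read_csv('data/avg_per_household.csv')
avg_raw_data
avg_raw_data.info()
# Rename the key column to 'postal_code' so it matches the other two tables (no effect if it already has that name)
avg_raw_data.rename(columns={'postcode': 'postal_code'}, inplace=True)
avg_raw_data.info()
avg_raw_data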
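# Load the geographic details for each postal code area
postcodes_raw_data = pd.read_csv('data/postal_codes_areas.csv')
postcodes_raw_data
# Remove all spaces from the region names
postcodes_raw_data['uk_region'] = postcodes_raw_data['uk_region'].str.replace(' ', '', regex=False)
postcodes_raw_data.info()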
# Compare the postal codes of two tables and report the codes found in only one of them
def findDiffandCommon(df1, df2):
    codes_1 = set(df1['postal_code'])
    codes_2 = set(df2['postal_code'])
    codes_diff = codes_1.symmetric_difference(codes_2)
    codes_common = codes_1.intersection(codes_2)

    print('Count: ' + str(len(codes_diff)) + ' & postal codes found in only one table:', codes_diff)
    #print('Count: ' + str(len(codes_common)) + ' & postal codes found in both tables:', codes_common)

findDiffandCommon(population_raw_data, avg_raw_data)
print('-------')
findDiffandCommon(population_raw_data, postcodes_raw_data)
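The set comparison above reports which postal codes are unmatched, but not which table each one came from. One alternative, sketched here on toy frames with invented codes, is an outer merge with `indicator=True`, which adds a `_merge` column labelling each row as `left_only`, `right_only`, or `both`.

```python
import pandas as pd

# Toy key columns (invented codes) standing in for two of the tables
left = pd.DataFrame({"postal_code": ["AB1", "CD2", "EF3"]})
right = pd.DataFrame({"postal_code": ["CD2", "EF3", "GH4"]})

# The outer merge keeps every key; _merge records where each key was found
check = left.merge(right, on="postal_code", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched)
```

Here `unmatched` holds AB1 (present only in `left`) and GH4 (present only in `right`), so the source of each mismatch is visible at a glance.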

# Print the number of missing (NA) values in every column of a table
def tableMissing(df):
    print('Total # of NA values in every column')
    #print(df.isnull().sum())
    for col in df:
        count_nan = df[col].isna().sum()
        print(str(col) + ': ' + str(count_nan))

tableMissing(population_raw_data)
print(' ')
print('-----')
tableMissing(avg_raw_data)
print(' ')
print('-----')
tableMissing(postcodes_raw_data)
# Drop the postal code areas that lack a population, household count, or region
postcodes_raw_data.dropna(subset=['population', 'num_households', 'uk_region'], inplace=True)
tableMissing(postcodes_raw_data)
postcodes_raw_data
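The per-column count and the row drop above can be checked together on a small example. The sketch below uses invented rows and a non-destructive `dropna` (no `inplace`), so the original frame stays available for comparing row counts before and after.

```python
import numpy as np
import pandas as pd

# Invented rows with a few deliberately missing values
areas = pd.DataFrame({
    "postal_code": ["AB1", "CD2", "EF3"],
    "population": [12000.0, np.nan, 9000.0],
    "num_households": [5000.0, 3000.0, np.nan],
    "uk_region": ["Scotland", None, "Wales"],
})

# One call returns the NA count for every column as a Series
missing_before = areas.isna().sum()

# Returns a new frame; `areas` itself is left unchanged
complete = areas.dropna(subset=["population", "num_households", "uk_region"])

print(missing_before)
print(len(areas) - len(complete), "rows dropped")
```

Reporting how many rows were dropped makes it easier to judge whether the removal is negligible or large enough to bias a later regional comparison.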
