Create population segments with US Census Bureau data

In this workbook we use open data from the US Census Bureau to build population segments for the state of California. The goal is to showcase k-means clustering by transforming a set of demographic and socio-economic measures into a vector space that can then be partitioned into a specified number of clusters. The optimal number of clusters yields high within-cluster density and good separation between clusters (one way to measure this is sketched below). Note that all measures in this analysis have the Census Block Group (CBG) as their lowest level of detail; CBGs can be rolled up to counties and states.
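Cluster density and separation can be quantified with the silhouette score from scikit-learn. The snippet below is only an illustration of that metric on synthetic data; it does not use the census data loaded later, and the sample size and candidate cluster counts are arbitrary choices.

# illustration only: compare candidate cluster counts with the silhouette score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))

Higher scores indicate tighter, better-separated clusters; the same check can be applied to the census feature vectors once they are built.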

Data source:

US Census Bureau (https://www.census.gov/programs-surveys/decennial-census/data.html)

# load libraries
import pandas as pd
import numpy as np
import math as mt
import json
import matplotlib.pyplot as plt
import seaborn as sns
import random as rdm
import plotly.figure_factory as ff
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn import metrics
# load demographic and socio-economic data by census block group in CA
cadf = pd.read_csv('censusx_ca.csv')
# look at the data file structure
cadf.info()

There are over 70 attributes, many of them demographic and socio-economic characteristics of the population such as education, age, income and work. In addition, there are a bit more than 20 thousand census block groups in the state of CA; with a population of roughly 40M, that works out to about 2,000 people per CBG. There are also some columns that are not needed for this analysis and will be discarded later.

Hidden code: histogram of the POP_TOTAL distribution across census block groups
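Since the hidden cell is not included in this export, here is a minimal sketch of how such a histogram could be produced; it reuses the plt and sns imports above, and the figure size and bin count are arbitrary choices.

# sketch of a POP_TOTAL histogram (figure size and bin count chosen arbitrarily)
fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(cadf['POP_TOTAL'], bins=80, ax=ax)
ax.set_xlabel('Population per census block group')
ax.set_ylabel('Number of CBGs')
plt.show()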

The histogram shows a long tail; one CBG has a population close to 40,000! Where is it located?

mk1 = cadf['POP_TOTAL'] > 35000
cadf.loc[mk1,['COUNTY', 'CITY', 'POP_TOTAL']]
# before proceeding, set index and remove columns not needed for k-means clustering
df = cadf.copy()
df.set_index('CENSUS_BLOCK_GROUP', inplace = True)

# remove fields not needed
cc = ['Unnamed: 0', 'COUNTY', 'STATE', 'URBANITY', 'URBANITY_CODE', 'CITY', 'POP_TOTAL']
df.drop(cc, axis = 1, inplace = True)
# check for null values
cnn = df.isnull().sum()
cnn[cnn > 0]

Excellent: no columns with null values were found.

## Standardize data values

We use RobustScaler, which centers each feature on its median and scales it by the interquartile range, making the scaling less sensitive to outliers than standard z-scoring.

# initialise the scaler
scaler = RobustScaler()

# create a copy of the cleaned dataset (unneeded columns were already dropped above)
scale_df = df.copy()

# fit transform all of our data
for c in scale_df.columns:
    feature = scale_df[c].values.reshape(-1,1)
    scale_df[c] = scaler.fit_transform(feature)
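
As a quick optional check (assuming the loop above ran over every column), the scaled medians should now be near zero and the interquartile ranges near one for non-constant columns.

# optional sanity check on the scaled features
print(scale_df.median().abs().max())
print((scale_df.quantile(0.75) - scale_df.quantile(0.25)).describe())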
# load feature dictionary to create multiple vector spaces via principal component analysis
# the feature dictionary groups geographic, demographic, socio-economic and work-related attributes
with open('./pca_features.json', 'r') as fp:
    pcaf = json.load(fp)

for fgroup in pcaf:
    print(fgroup['name'] + '  ' + str(fgroup['num_components']))
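
For reference, each entry in pca_features.json is expected to provide a group name, the number of principal components to keep, and the list of feature columns in that group. The entry below is purely illustrative; the group name, component count, and column names are made up, not taken from the actual file.

# illustrative shape of one entry in pca_features.json (all values hypothetical)
example_entry = {
    'name': 'socio_economic',
    'num_components': 3,
    'features': ['MEDIAN_HH_INCOME', 'PCT_BACHELORS_DEGREE', 'UNEMPLOYMENT_RATE']
}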

## Transform data with PCA

Principal component analysis (PCA) is a statistical technique used to reduce the dimensionality of multivariate data by creating new uncorrelated variables that successively maximize variance. It is a linear dimensionality reduction technique with applications in data analysis, visualization, and data preprocessing. PCA linearly transforms the data onto a new coordinate system (the principal components) such that the directions capturing the largest variation in the data can be easily identified.

# run PCA for each group of features
pca_dict = {}

for fgroup in pcaf:
    name = fgroup['name']
    n = fgroup['num_components']
    features = fgroup['features']
    x = scale_df.loc[:, features].values
    pca = PCA(n_components=n)
    # fit to the scaled dataset and store the fitted model in the dictionary
    pca_dict[name] = pca.fit(x)
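
pca_dict now holds one fitted PCA model per feature group. A possible next step, not shown in this excerpt, is to project the scaled data onto each group's components, inspect the explained variance, and concatenate the results into a single matrix for k-means. The sketch below assumes that workflow; the variable and column names are illustrative.

# sketch: project each feature group onto its principal components and combine the results
pca_parts = []
for fgroup in pcaf:
    name = fgroup['name']
    x = scale_df.loc[:, fgroup['features']].values
    comps = pca_dict[name].transform(x)
    cols = [f'{name}_pc{i+1}' for i in range(comps.shape[1])]
    pca_parts.append(pd.DataFrame(comps, index=scale_df.index, columns=cols))
    print(name, pca_dict[name].explained_variance_ratio_.round(3))

pca_df = pd.concat(pca_parts, axis=1)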