Forecasting 2024 ZIP-Level Child Populations from ACS 2011–2023 Data Using Linear Regression
Project Overview:
This project uses five-year estimates from the American Community Survey (ACS) covering years 2011 through 2023 to build a linear regression pipeline that forecasts child populations (Under 5 and Ages 5-9) at the ZIP code level across Ohio. The resulting model generates actionable estimates for 2024, enabling stakeholders in youth services and policy planning to better allocate resources and anticipate shifts in community demand.
The workflow ingests all Excel tabs, standardizes column labels across schema changes, and reshapes to a tidy panel. For each ZIP code, a simple linear model is fit on annual totals to project 2024 population for ages Under 5 and 5–9.
The deliverables are Tableau‑ready Excel outputs and a clean ZIP‑level panel for dashboards. Geographic focus is Northeast Ohio: Cuyahoga, Lake, Geauga, and Ashtabula counties. The pipeline is intentionally lightweight and explainable, with room to drop in more advanced forecasting later.
Import libraries
# Import pandas
import pandas as pd
import numpy as np
About this File:
Building the ZIP–PUMA Mapping Framework
This file maps ZIP codes to their corresponding PUMAs to link ZIP-level and census-level population data. For ZIPs spanning multiple PUMAs, assignments were made based on the majority of geographic area. It also includes the percentage of 5-year-olds within the 5–9 age group by PUMA, allowing for more accurate 0–5 population estimates used in childcare forecasting and site planning.
# ABOUT: ZIP→PUMA mapping + share of 5-year-olds within 5–9 by PUMA to refine 0–5 estimates.
# Read template for permitted ZIP codes (Tableau integration)
template = pd.read_excel('Zip Code Forecast.xlsx', sheet_name='tableau')
template.info()
template.head()About this File:
ACS 5-Year Population Data (2011–2023)
This Excel workbook contains American Community Survey (ACS) 5-Year Estimate data from 2011 to 2023. Each sheet represents one year of census data, including population counts by age group. It is the primary source for building annual trends and forecasting 2024 child population estimates for ages 0–5 across Northeast Ohio.
This Excel workbook contains American Community Survey (ACS) 5-Year Estimate data from 2011 to 2023. Each sheet represents one year of census data, including population counts by age group. It is the primary source for building annual trends and forecasting 2024 child population estimates for ages 0–5 across Northeast Ohio.
# Use the provided function to read all sheets from the excel file
FILEPATH = 'ACSDP5Y2011-2023.xlsx'
# read sheets into a dictionary
sheets_dict = pd.read_excel(FILEPATH, sheet_name=None)
Assigning Year Labels and Standardizing Headers
This step promotes the first row to headers, then assigns each dataframe its correct survey year (2011–2023). The years array ensures consistent year tagging for later merging and trend analysis.
# list to store processed dataframes
processed_dfs = []
# Create a separate dataframe for each tab in FILEPATH
for sheet_name, df in sheets_dict.items():
processed_dfs.append(df)
# Iterate over each dataframe in processed_dfs
for i, df in enumerate(processed_dfs):
# Set the first row as the header
df.columns = df.iloc[0]
# Drop the first row as it is now the header
df = df[1:]
# Update the dataframe in the list
processed_dfs[i] = dfAssigning Survey Years to DataFrames
This step assigns a year label to each dataframe in the list, corresponding to the ACS 5-Year data from 2011 through 2023. The years array ensures each sheet’s data is tagged with the correct survey year for later merging and trend analysis.
# Create years array to be added to each dataframe in processed_dfs
years = np.arange(2011, 2024, 1)
# Add the column Year to each dataframe using the data in the years array
for i, df in enumerate(processed_dfs):
df['Year'] = years[i]Standardizing ACS Column Schemas Across Years
The ACS column names changed between 2011–2016 and 2017–2023, so this step standardizes them.
It filters each dataframe to keep only the relevant population columns (Under 5 and 5–9) using the correct field names for each period, ensuring consistency before merging all years into a single dataset.
# Columns in years 2011 to 2016 have different column headings for the columns of interest than the last 7 years (2017-2023)
first_6 = ["Geography", "Estimate!!SEX AND AGE!!Under 5 years", "Estimate!!SEX AND AGE!!5 to 9 years", "Year"]
last_7 = ["Geography", "Estimate!!SEX AND AGE!!Total population!!Under 5 years", "Estimate!!SEX AND AGE!!Total population!!5 to 9 years", "Year"]
# Filter the first 6 dataframes with the column names in first_6
filtered_first_6_dfs = [df[df.columns.intersection(first_6)] for df in processed_dfs[:6]]
# Filter the last 7 dataframes with the column names in last_7
filtered_last_7_dfs = [df[df.columns.intersection(last_7)] for df in processed_dfs[-7:]]