Ozone Watch: Cleaning and Analyzing EPA Air Quality Data
by Abdulazeez Saliu
from IPython.display import Image, display
# Display the image
Image(filename=r'ozone.jpg', width=1100, height=700)
1. Introduction
Air quality is a critical public health and environmental concern, particularly in densely populated and industrialized regions like California. Among the key pollutants, ozone (O₃) plays a significant role due to its impact on respiratory health and overall air quality. To address this issue, the U.S. Environmental Protection Agency (EPA) monitors daily ozone concentrations through an extensive network of air quality stations.
This project focuses on evaluating ozone pollution levels across California in 2024, using daily summary data provided by the EPA. The dataset includes detailed measurements such as 8-hour ozone concentration levels, Air Quality Index (AQI) values, observation completeness, and geographic identifiers like site coordinates and county names. However, as is often the case with real-world data, the dataset is not immediately ready for analysis: it contains missing values, inconsistencies, duplicates, and potential outliers that must be addressed before any meaningful insights can be drawn.
Objective:
The primary goals of this project are to:
- Clean and validate the ozone dataset for accuracy and completeness,
- Analyze spatial and temporal trends in ozone pollution across California,
- Identify high-risk regions with consistently poor air quality, and
- Provide data-driven insights that can support environmental policy and public health interventions.
Through this analysis, we aim to transform raw environmental data into actionable knowledge, helping stakeholders better understand the extent and distribution of ozone pollution and where mitigation efforts should be focused most urgently.
Methodology:
This project followed a structured approach to analyze ozone pollution data in California for 2024:
- Data Acquisition: Retrieved daily ozone data from the U.S. EPA, including AQI, ozone concentration, and site information.
- Data Cleaning: Handled missing values, removed duplicates, corrected data types, and addressed outliers to ensure data quality.
- Validation: Verified data completeness, consistency, and correct geographic placement of monitoring sites.
- Exploratory Data Analysis (EDA): Analyzed temporal and regional trends in ozone levels, compared data sources, and assessed observation quality.
- Visualization: Used charts, time series, and maps to highlight pollution trends and regional risk levels.
- Insight Generation: Identified high-risk areas, peak pollution periods, and provided data-driven recommendations for action.
Data:
The data is a modified dataset from the U.S. Environmental Protection Agency (EPA).
The Ozone dataset contains daily air quality summary statistics by monitor for the state of California in 2024. Each row contains the date and the air quality metrics per collection method and site.
Dataset Schema Overview
| Category | Column Name | Description |
|---|---|---|
| Date & Source Info | Date | Calendar date associated with air quality values |
| Date & Source Info | Source | Data origin: EPA's AQS or AirNow reports |
| Identification | Site ID | Unique ID for the air monitoring site |
| Identification | POC | Parameter Occurrence Code (monitor identifier) |
| Identification | Method Code | Code indicating the method of data collection |
| Measurements | Daily Max 8-hour Ozone Concentration | Highest 8-hour ozone concentration for the day |
| Measurements | Units | Measurement unit (parts per million, ppm) |
| Measurements | Daily AQI Value | Daily Air Quality Index (AQI); 0-50 is good, >300 is hazardous (see the sketch below) |
| Location Info | Local Site Name | Name of the monitoring location |
| Location Info | Site Latitude | Latitude of the monitoring site |
| Location Info | Site Longitude | Longitude of the monitoring site |
| Observation Metrics | Daily Obs Count | Number of observations reported that day |
| Observation Metrics | Percent Complete | Percentage of expected samples collected |
| Geographic Codes | CBSA Code | Code for Core Based Statistical Area |
| Geographic Codes | CBSA Name | Name of Core Based Statistical Area |
| Geographic Codes | County FIPS Code | County's Federal Information Processing Standard code |
| Geographic Codes | County | Name of the county |
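The AQI thresholds noted in the table follow the EPA's standard category breakpoints (0-50 Good up to 301+ Hazardous). As a minimal, self-contained sketch, the snippet below shows one way to bucket the Daily AQI Value column into those category labels; the label_aqi helper and its usage are illustrative assumptions, not part of the original analysis.
# Illustrative: map numeric AQI values to the EPA category labels (assumed helper)
import pandas as pd

aqi_bins = [0, 50, 100, 150, 200, 300, float("inf")]
aqi_labels = ["Good", "Moderate", "Unhealthy for Sensitive Groups",
              "Unhealthy", "Very Unhealthy", "Hazardous"]

def label_aqi(values: pd.Series) -> pd.Series:
    """Return the EPA AQI category for each numeric AQI value."""
    return pd.cut(values, bins=aqi_bins, labels=aqi_labels, include_lowest=True)

# Example usage once the dataset is loaded (column name from the schema above):
# df["AQI Category"] = label_aqi(df["Daily AQI Value"])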
Summary & Recommendation
Ozone pollution in California shows strong seasonal and weekday patterns, with peak levels during summer and on weekdays, especially in counties like San Bernardino, Riverside, and Los Angeles. These trends highlight the urgent need for targeted intervention.
Recommended Action:
Launch a focused ozone mitigation plan that combines:
- Stricter summer emissions controls
- Real-time public health alerts
- Upgraded monitoring infrastructure

This strategy will protect communities and drive cleaner air statewide.
2. Libraries & Configurations
Loading the relevant libraries and setting the configurations to be used for our analysis
"""importing relevant libraries"""
import pandas as pd # for data manipulation
import numpy as np # for data computation
import matplotlib.pyplot as plt  # for 2D data visualization
import seaborn as sns  # for 2D data visualization
import plotly.express as px  # for interactive plots
from scipy import stats # for statistics
from IPython.display import Markdown, display
import warnings
from dateutil import parser
from datetime import timedelta
# import calplot # Commented out because it's not installed
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import gaussian_kde
from scipy.stats import f_oneway
from scipy.stats import ttest_ind
# Install calplot if not already installed
import sys
!{sys.executable} -m pip install calplot
import calplot
%matplotlib inline
"""setting configurations"""
#set seaborn theme
sns.set_theme(style="darkgrid", palette="colorblind")
#displaying all columns
pd.set_option('display.max_columns', None)
plt.rcParams['font.family'] = 'DejaVu Sans' # or 'Arial'
warnings.filterwarnings("ignore", category=UserWarning, module='matplotlib.font_manager')
3. Data Wrangling
The dataset contains 54,759 rows and 17 columns, comprising both numeric and categorical features. An initial inspection reveals inconsistencies in the Date column, such as formatting errors and placeholder values, that must be corrected before time-based analysis can proceed.
#loading the dataframe
df = pd.read_csv(r'data/ozone.csv')
#viewing the dataframe
display(df.head(10))
#checking the number of rows and columns in the dataframe
display(df.shape)
Table 1.0 presents the DataFrame containing various columns from the dataset sourced from the United States Environmental Protection Agency (EPA).
3.1 Data Validation
Data validation is a critical component of the data cleaning and preparation process. It ensures that the dataset is accurate, consistent, and reliable prior to any analytical or modeling tasks. Without rigorous validation, insights derived from the data may be misleading or incorrect.

In this project, the dataset was validated through a combination of summary statistics, heatmaps to identify missing values, and categorical value checks. These steps were employed to confirm the integrity and readiness of the data for meaningful analysis.

All columns demonstrated consistency in their data types, aligning with expectations for both numerical and categorical analysis. However, the Date column was initially stored as an object (string), which is unsuitable for time-series operations. To facilitate temporal filtering and enable robust time-series analysis, the Date column was explicitly converted to a datetime format.
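As a minimal sketch of that conversion (illustrative only; the notebook applies its own conversion during the cleaning steps), the Date column can be parsed with pandas' to_datetime, using errors='coerce' so that malformed or placeholder dates surface as NaT for later inspection.
# Illustrative conversion of the Date column to datetime (errors='coerce' turns
# unparseable or placeholder entries into NaT so they can be reviewed later)
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Confirm the new dtype and count any dates that failed to parse
display(df['Date'].dtype)
display(df['Date'].isna().sum())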
#checking information on all the columns
df.info()
Table 1.1 displays the data types and corresponding row counts for each column in the dataset.
#checking statistical information about the numeric columns
df.describe()
Table 1.2 provides the statistical summary of the numeric columns in the dataset.
The dataset contains 54,759 rows and 12 numeric columns, with most columns showing consistent and reasonable ranges. However, a few validity concerns are observed:
- Daily Max 8-hour Ozone Concentration and Daily AQI Value fall within expected environmental ranges but should be cross-validated with EPA standards.
- Site Latitude and Site Longitude values fall within U.S. geographic bounds, supporting spatial validity.
- The Daily Obs Count ranges from 1 to 1000, which is unusually wide and may need verification against measurement frequency norms.

Overall, while the data is largely consistent, specific columns warrant deeper validation due to potential anomalies or limited variability; an illustrative set of range checks follows below.
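As a rough, self-contained sketch of such range checks (the bounds used here are assumptions for illustration, not EPA-defined limits), the snippet below counts how many rows fall outside plausible ranges for the key numeric columns.
# Illustrative range checks for key numeric columns (thresholds are assumed for demonstration)
range_checks = {
    'Site Latitude': df['Site Latitude'].between(32, 42),        # approximate California latitudes
    'Site Longitude': df['Site Longitude'].between(-125, -114),  # approximate California longitudes
    'Daily AQI Value': df['Daily AQI Value'].between(0, 500),    # AQI is reported on a 0-500 scale
    'Daily Obs Count': df['Daily Obs Count'].between(1, 24),     # at most 24 hourly observations per day
}

# Report the number of rows outside each expected range
for column, within_range in range_checks.items():
    print(f"{column}: {(~within_range).sum()} rows outside expected range")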
3.2 Missing Data
Checking for missing data.