Executive Summary
This report details the comprehensive cleaning and analysis of the U.S. Environmental Protection Agency (EPA) ozone dataset for various regions across California. The primary objective was to identify areas with consistently high ozone pollution and explore potential temporal trends, such as differences between weekday and weekend concentrations.
The raw data presented significant quality issues, including inconsistent date formats, invalid observation counts, conflicting completeness percentages, and missing values. A rigorous data cleaning process was executed to create a reliable dataset for analysis. This involved standardizing dates, reconciling observation data, handling missing values, and filtering out locations with insufficient data to ensure statistical validity. One key limitation discovered during this process was the loss of all Method Code data, which prevented a comparative analysis of different measurement techniques.
The analysis of the cleaned data yielded two primary conclusions:
- High-Concentration Areas: Specific regions consistently exhibit higher ozone levels than others. The analysis identified Tulare, Imperial, El Dorado, and Riverside as the counties with the highest average ozone concentrations. These findings were visualized using both bar charts and a geospatial heatmap to clearly illustrate the high-risk areas.
- Weekday vs. Weekend Effect: There is no statistically significant evidence that urban activity, as measured by a weekday versus weekend comparison, affects ozone levels. While the raw averages were nearly identical, a formal independent t-test was performed, resulting in a p-value of 0.2308. This value is substantially greater than the standard 0.05 significance level, leading to the conclusion that the minor observed difference is not statistically meaningful.
Exploratory Data Analysis and Cleaning of EPA Ozone Data
1. Introduction
The goal of this analysis is to explore daily ozone measurement data from the U.S. Environmental Protection Agency (EPA) for monitoring stations across California. The raw dataset contains numerous quality issues typical of real-world data, including missing values, inconsistent formatting, and invalid entries.
This notebook documents the entire process, including:
- Initial Data Inspection: Understanding the structure, data types, and initial state of the data.
- Data Cleaning and Preprocessing: A step-by-step process to handle inconsistencies, errors, and missing information.
- Exploratory Analysis: Answering key questions about ozone pollution patterns.
2. Initial Data Loading and Inspection
The first step is to load the dataset and perform a high-level overview to identify potential issues.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from scipy import stats
from folium.plugins import HeatMap
# Load the dataset from the location specified.
ozone = pd.read_csv('data/ozone.csv')
# Display basic information about the DataFrame
ozone.info()
# Get value counts to understand problematic columns
for column in ozone.columns:
    print("===============================")
    print(ozone[column].value_counts())
    print(ozone[column].isna().sum())
    print("===============================")
    print()
# Display the first few rows of data
ozone.head()
Initial Findings:
- Inconsistent Dates: The Date column contained multiple formats, including partial dates (e.g., '/2024') and month names (e.g., 'March 02/2024'), making it unusable for time-series analysis.
- Missing Data: Numerous columns, including the primary target Daily Max 8-hour Ozone Concentration, had thousands of missing values.
- Invalid Data: The Daily Obs Count column contained impossible values (e.g., 1000).
- Conflicting Data: The Daily Obs Count and Percent Complete columns were inconsistent with each other.
- Inconsistent Categorical Data: The County column contained both full names and abbreviations (e.g., "Los Angeles" and "LA").
3. Data Cleaning and Preprocessing
A series of cleaning steps were performed to create a reliable dataset for analysis.
3.1 Standardizing the 'Date' Column
- Problem: The Date column's inconsistent formats prevented proper date-based analysis.
- Action: Used Pandas' to_datetime function with a strict format (%m/%d/%Y). This approach converts any date string that does not match the format into a null value (NaT), which can then be easily removed.
- Result: This step successfully standardized all valid dates and allowed for the removal of over 9,000 rows with improperly formatted date entries.
# Coerce the Date column to standardize the format, using NaT in any rows that couldn't be coerced.
ozone['Date'] = pd.to_datetime(ozone['Date'], format='%m/%d/%Y', errors='coerce')
# Drop the rows where Date is NaT (null)
ozone.dropna(subset=['Date'], inplace=True)
print(ozone['Date'].value_counts())
print(ozone['Date'].isna().sum())
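As an illustrative sanity check (not part of the original notebook), the number of rows removed by this step can be estimated by re-parsing the raw Date column with the same strict format; the file path and column name are assumed to match the loading step above.
# Re-read only the raw Date column and count entries that fail the strict parse.
raw_dates = pd.read_csv('data/ozone.csv', usecols=['Date'])['Date']
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y', errors='coerce')
print(f"Rows with missing or improperly formatted dates: {parsed.isna().sum()}")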
3.2 Standardizing Categorical Data
- Problem: Categorical columns like County and Local Site Name had inconsistencies that would lead to incorrect grouping.
- Action:
  - Mapped county abbreviations (LA, SF) to their full names.
  - Standardized the Local Site Name column by converting all names to title case and trimming whitespace.
- Result: Ensured that each location is treated as a single, unique entity, enabling accurate regional analysis.
# Define replacements
county_replacements = {
'LA': 'Los Angeles',
'SF': 'San Francisco'
}
# Apply the replacements
ozone['County'] = ozone['County'].replace(county_replacements)
# Convert site names to title case and trim whitespace
ozone['Local Site Name'] = ozone['Local Site Name'].str.title().str.strip()
print(ozone['Local Site Name'].value_counts())
print()
print(ozone['County'].value_counts())
3.3 Handling Missing and Sparse Data
- Problem: Key analytical columns had missing values, and some monitoring sites had too few observations to provide meaningful insights.
- Action:
  - Dropped all rows where the target variables (Daily Max 8-hour Ozone Concentration and Daily AQI Value) were null.
  - Dropped rows where Core-Based Statistical Area (CBSA) information was missing.
  - Removed all monitoring sites that had fewer than 15 days of valid data after all previous cleaning steps (see Section 3.5).
- Result: This produced a final, clean dataset with no missing values in critical columns and ensured that analysis was only performed on locations with a reasonable amount of data. However, this did result in the loss of the Method Code column.
# Drop rows where ozone concentration, AQI, or CBSA Name/Code is missing
ozone.dropna(subset=['Daily Max 8-hour Ozone Concentration', 'Daily AQI Value', 'CBSA Name', 'CBSA Code'], inplace=True)
print(ozone.isna().sum())
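As a hedged follow-up to the Method Code note above (the column name is taken from the report text), a quick check can confirm that no usable Method Code values remain after the row drops, and remove the now-empty column.
# Illustrative check: verify the Method Code column is effectively empty, then drop it.
if 'Method Code' in ozone.columns:
    print(ozone['Method Code'].notna().sum())  # expected to be 0, per the note above
    ozone = ozone.drop(columns=['Method Code'])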
3.4 Correcting and Validating Observation Data
- Problem: The dataset contained invalid observation counts and conflicting information between the Daily Obs Count and Percent Complete columns.
- Action:
  - Filtered out all rows where Daily Obs Count was greater than 24.
  - Recalculated the Percent Complete column based on the validated Daily Obs Count ((count / 24) * 100).
  - Applied a 75% completeness threshold, a common standard in air quality analysis, to ensure that daily summary values are based on a representative sample of hourly readings (at least 18 hours).
- Result: This ensured the integrity of daily measurements, removing unreliable data points and correcting internal inconsistencies.
# Filter out rows where Daily Obs Count is greater than 24
ozone = ozone[ozone['Daily Obs Count'] <= 24]
# Recalculate the percentage based on 24 hours in a day
ozone['Percent Complete'] = (ozone['Daily Obs Count'] / 24) * 100
# Keep only the rows that meet the 75% completeness threshold
ozone = ozone[ozone['Percent Complete'] >= 75]
print(ozone['Daily Obs Count'].value_counts())
print()
print(ozone['Percent Complete'].value_counts())
3.5 Filtering Data-Sparse Locations
- Problem: After the previous cleaning steps, some monitoring sites were left with very few data points (e.g., fewer than 15 observations for the entire year). Including these data-sparse locations in the final analysis could lead to unreliable conclusions, as a few readings are not representative of a site's overall pollution level.
- Action: To ensure that our analysis of "consistent" high-pollution areas is statistically robust, a minimum threshold was established. All monitoring sites with fewer than 15 total observations were identified and filtered out of the final dataset.
- Result: This final filtering step improved the overall quality of the dataset, ensuring that the subsequent analysis is based only on locations with a meaningful amount of data.
# Calculate the number of observations for each site
site_counts = ozone['Local Site Name'].value_counts()
# Identify sites that have fewer than 15 observations
sites_to_remove = site_counts[site_counts < 15].index
# Filter the DataFrame to remove these data-sparse sites
ozone = ozone[~ozone['Local Site Name'].isin(sites_to_remove)]
print(ozone['Local Site Name'].value_counts())
4. Analysis and Findings
4.1 Variation of Ozone Over Time and Region
Objective: To answer how ozone concentration varies over time and region, the analysis was broken down into two components: the seasonal trend over time and the differences across geographical regions.
Temporal Analysis: Seasonal Trend
To understand how ozone levels change throughout the year, the daily data was resampled into monthly averages. This approach smooths out daily noise and clearly reveals the underlying seasonal pattern.
# Set the Date as the index for time-series analysis
ozone_by_time = ozone.set_index('Date')
# Resample to get monthly average ozone concentration
monthly_avg_ozone = ozone_by_time['Daily Max 8-hour Ozone Concentration'].resample('M').mean()
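A minimal plotting sketch to visualize the monthly series computed above (illustrative; the exact figure styling used in the original notebook may differ).
# Plot the monthly average ozone concentration to reveal the seasonal pattern.
fig, ax = plt.subplots(figsize=(10, 5))
monthly_avg_ozone.plot(ax=ax, marker='o')
ax.set_xlabel('Month')
ax.set_ylabel('Average Daily Max 8-hour Ozone Concentration')
ax.set_title('Monthly Average Ozone Concentration')
plt.show()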