
Decoding California's Skies: A Data-Driven Analysis of 2024 Ozone Pollution Patterns


Executive Summary

This report analyzes California's daily maximum 8-hour ozone concentration data for 2024 to identify temporal and geospatial risk patterns that can inform targeted public health and environmental policy. Data preparation involved standardizing inconsistent date formats and imputing missing values in multiple stages. For the 136 records in which both ozone concentration and AQI were missing, a K-Nearest Neighbors (KNN) model was used, with the hyperparameter k selected via 10-fold cross-validation to minimize prediction error.

The analysis yielded two primary findings. First, a temporal analysis revealed that while median ozone concentrations are remarkably stable throughout the week, the risk of acute, high-pollution episodes is disproportionately concentrated on weekdays. This contradicts the common "weekend ozone effect" hypothesis and indicates that the most dangerous pollution spikes are tied to the workweek cycle.

Second, a geospatial analysis identified clear pollution hotspots. The Riverside-San Bernardino-Ontario, CA metropolitan area registered the highest average ozone concentration among metropolitan areas. At a more granular level, the Sequoia & Kings Canyon NPs - Lower Kaweah monitoring site had the highest average concentration of any site, marking it as a priority for intervention.

Based on these data-driven insights, we recommend the following strategic actions:

  1. Refocus Public Health Alerts: Public health advisories and alert systems should be recalibrated to emphasize the heightened risk of extreme pollution spikes during the weekdays, rather than focusing on weekly averages.

  2. Prioritize Regional and Local Interventions: Emission reduction resources and policy enforcement should be strategically concentrated on the Riverside-San Bernardino-Ontario metropolitan area. Specific attention and local mitigation efforts should be directed at the Sequoia & Kings Canyon NPs - Lower Kaweah monitoring site and its surrounding area to address the severe, localized pollution levels.



1. Introduction & Objective

The primary objective of this analysis is to conduct a comprehensive exploratory data analysis (EDA) of the 2024 ozone pollution data for California, as provided by the U.S. Environmental Protection Agency (EPA). This report aims to move beyond surface-level metrics to identify significant temporal patterns, pinpoint high-risk geographical hotspots, and understand the factors associated with elevated ozone concentrations. The ultimate goal is to translate these data-driven insights into actionable recommendations for public health agencies and environmental policymakers.

2. Methodology: Data Cleaning & Preparation

A robust and accurate analysis is built upon a foundation of clean, reliable data. The raw dataset presented several challenges, including inconsistent date formats, missing values in critical measurement columns, and data entry anomalies. To address these issues, a multi-step data preparation methodology was implemented:

a) Data Standardization and Anomaly Correction:

  • Date Normalization: The Date column, which contained multiple inconsistent formats (e.g., mm/dd/yyyy, Month dd/yyyy) and implicit sequential entries, was parsed and standardized into a consistent ISO 8601 format (YYYY-MM-DD).
  • Categorical Data Handling: Missing values in categorical columns such as Method.Code and CBSA.Name were assigned a distinct "Unknown" category (1) to retain all records without introducing bias.
  • Outlier Correction: Anomalous entries in Daily.Obs.Count (values of 1000) were identified and corrected by replacing them with the column's mode (17), aligning them with standard operational values.
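The three cleaning steps above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the code used for the report. The format strings in `DATE_FORMATS` are guesses based on the two example formats mentioned, the helper names `parse_date` and `clean` are hypothetical, and the "implicit sequential entries" are not handled here because the source does not describe how they were resolved.

```python
from datetime import datetime
from statistics import mode

# Assumed formats, inferred from the examples "mm/dd/yyyy" and "Month dd/yyyy".
DATE_FORMATS = ("%m/%d/%Y", "%B %d/%Y")


def parse_date(text):
    """Return a date for the first matching format, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean(rows):
    """Normalize dates, label missing categories, and fix count anomalies."""
    # Compute the mode on plausible counts only, so the 1000s cannot win.
    valid_counts = [r["Daily.Obs.Count"] for r in rows
                    if r["Daily.Obs.Count"] != 1000]
    typical = mode(valid_counts)
    cleaned = []
    for r in rows:
        out = dict(r)
        parsed = parse_date(r["Date"]) if r["Date"] else None
        out["Date"] = parsed.isoformat() if parsed else None  # YYYY-MM-DD
        out["CBSA.Name"] = r["CBSA.Name"] or "Unknown"
        if r["Daily.Obs.Count"] == 1000:
            out["Daily.Obs.Count"] = typical
        cleaned.append(out)
    return cleaned
```

Dates that match none of the assumed formats come back as `None` rather than being guessed, which keeps unparsed rows visible for manual review.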

b) Advanced Imputation for Missing Ozone Data: A significant challenge was the presence of 2,738 missing values in the Daily.Max.8.hour.Ozone.Concentration and Daily.AQI.Value columns. A multi-stage imputation strategy was employed to handle these missing records with maximum precision:

  1. Linear Regression Imputation: Leveraging the extremely strong linear relationship between ozone concentration and AQI (Pearson's r = 0.942), a regression model was developed. This model was able to explain 88.7% of the variance (R² = 0.887) in the data, providing a highly accurate method for imputing missing values where one of the two metrics was present.
  2. Optimized KNN Imputation: For the remaining 136 records where both metrics were missing, a K-Nearest Neighbors (KNN) imputation model was developed. To ensure the highest accuracy, the optimal hyperparameter for k (the number of neighbors) was determined through a 10-fold cross-validation process using the caret package. This data-driven approach selected the k value that minimized prediction error, ensuring a robust and defensible imputation of the final missing values.
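The regression step in item 1 can be illustrated with a small least-squares fit. The sketch below is written in Python with the standard library and is not the report's own code. The record keys `aqi` and `ozone` and the function names are hypothetical, and only one direction (ozone predicted from AQI) is shown; the reverse case would be analogous. For a simple one-predictor regression, R² equals the square of Pearson's r, which is why r = 0.942 corresponds to R² ≈ 0.887.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson's r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


def impute_ozone(records):
    """Fill missing ozone from AQI using a line fitted on complete pairs.

    Returns the filled records and the R^2 of the fit.
    """
    complete = [(rec["aqi"], rec["ozone"]) for rec in records
                if rec["aqi"] is not None and rec["ozone"] is not None]
    slope, intercept, r = fit_line(*zip(*complete))
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left untouched
        if rec["ozone"] is None and rec["aqi"] is not None:
            rec["ozone"] = intercept + slope * rec["aqi"]
        filled.append(rec)
    return filled, r ** 2
```

Records with both values missing pass through unchanged, which matches the description of a separate second stage for those cases.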

This rigorous preparation process resulted in a complete and reliable dataset, ready for in-depth exploratory analysis.
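The selection of k by cross-validation described in step 2 above can be sketched as follows. The report used R's caret package; this Python version is only an illustration of the general technique, with several assumptions: the source does not say which predictors fed the KNN model, so each data point is simply a `(features, target)` pair; the candidate grid `(1, 3, 5, 7, 9)` is made up; and the error is pooled over all held-out points, whereas caret averages the per-fold RMSE. Features are assumed to be scaled already so that Euclidean distance is meaningful.

```python
import math
import random


def knn_predict(train, query, k):
    """Average the targets of the k training points nearest to query."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


def cv_rmse(data, k, folds=10, seed=0):
    """Pooled root-mean-square error of k-NN under k-fold cross-validation."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    squared_error = 0.0
    for i in range(folds):
        held_out = shuffled[i::folds]  # every folds-th point forms fold i
        train = [p for j, p in enumerate(shuffled) if j % folds != i]
        squared_error += sum((knn_predict(train, x, k) - y) ** 2
                             for x, y in held_out)
    return math.sqrt(squared_error / len(shuffled))


def select_k(data, candidates=(1, 3, 5, 7, 9), folds=10):
    """Return the candidate k with the lowest cross-validated RMSE."""
    return min(candidates, key=lambda k: cv_rmse(data, k, folds))
```

Once a k is chosen, `knn_predict` can be applied to each record with both values missing, using all complete records as the training set.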


3. Exploratory Analysis & Key Findings

With a clean and complete dataset, we can now explore the data to uncover patterns in ozone pollution. This section is organized thematically to first understand when the pollution occurs, and then where it is most concentrated.

3.1 The Rhythm of Pollution: Temporal Patterns

Understanding how ozone levels fluctuate over time is critical for public health alerts and policy timing. We analyzed the data across monthly, weekly, and weekday-versus-weekend cycles.

Monthly Seasonality

First, we examined the monthly distribution of ozone concentrations to identify seasonal trends within 2024.

Figure 1: Monthly Ozone Concentration Distribution. The analysis reveals a clear seasonal pattern, with ozone levels beginning to rise in the spring and peaking during the summer months. The highest median concentrations and the greatest variability are observed in June, July, and August. This trend is consistent with the known drivers of ground-level ozone formation, which is catalyzed by increased sunlight and higher temperatures.
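The two quantities read off the monthly boxplots, the median and the spread, can be computed directly. A minimal Python sketch, assuming each record is a hypothetical `(iso_date, ozone)` pair with the date already in YYYY-MM-DD form and at least two readings per month:

```python
from collections import defaultdict
from statistics import median, quantiles


def monthly_summary(records):
    """Map month number -> (median, interquartile range) of ozone values."""
    by_month = defaultdict(list)
    for iso_date, ozone in records:
        by_month[int(iso_date[5:7])].append(ozone)  # characters 5-6 hold MM
    summary = {}
    for month, values in sorted(by_month.items()):
        q1, _, q3 = quantiles(values, n=4)  # needs >= 2 values per month
        summary[month] = (median(values), q3 - q1)
    return summary
```

Sorting the result by median or by interquartile range would show which months lead on each measure.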

The Weekday vs. Weekend Dynamic

Next, we investigated the "weekend ozone effect," a phenomenon in which ozone levels can paradoxically rise on weekends despite reduced traffic. A direct comparison between weekdays and weekends reveals a more nuanced reality.

Figure 2: Ozone Concentration Comparison, Weekday vs. Weekend. Contrary to the weekend-effect hypothesis, our analysis shows that the median ozone concentration is virtually identical between weekdays and weekends. The critical difference lies in the distribution of extreme events. Weekdays exhibit a visibly higher frequency and magnitude of high-pollution spikes (outliers), as shown by the longer upper tail of the distribution. While the "typical" day is similar, the risk of acute, dangerous pollution episodes is concentrated during the workweek.
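One way to put numbers on this comparison is sketched below in Python. It is an illustration under stated assumptions, not the report's method: records are hypothetical `(iso_date, ozone)` pairs, "extreme" is defined by the usual boxplot rule (above Q3 + 1.5 × IQR), and the fence is computed once on all readings rather than per group. The sketch reports the share of readings above the fence instead of a raw count, because weekdays contribute roughly five readings for every two weekend readings.

```python
from datetime import date
from statistics import median, quantiles


def day_type(iso_date):
    """Classify an ISO date string as 'weekday' or 'weekend'."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return "weekend" if date.fromisoformat(iso_date).weekday() >= 5 else "weekday"


def compare_day_types(records):
    """Per day type: median ozone and share of readings above the upper fence."""
    values = [ozone for _, ozone in records]
    q1, _, q3 = quantiles(values, n=4)
    fence = q3 + 1.5 * (q3 - q1)  # shared threshold so groups are comparable
    groups = {"weekday": [], "weekend": []}
    for iso_date, ozone in records:
        groups[day_type(iso_date)].append(ozone)
    return {
        name: {
            "median": median(vals),
            "share_above_fence": sum(v > fence for v in vals) / len(vals),
        }
        for name, vals in groups.items() if vals
    }
```

Similar medians combined with a higher weekday share above the fence would be the numeric counterpart of the longer upper tail seen in the figure.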

3.2 The Geography of Risk: Identifying Pollution Hotspots

After understanding the temporal dynamics, the next critical step is to identify the geographical areas most affected by high ozone levels. This allows for the prioritization of resources and policy interventions. We analyzed the data at both a broad regional level (Core-Based Statistical Area - CBSA) and a granular local level (monitoring site).

Top 10 Most Polluted Regions (CBSA)

To identify the metropolitan areas with the highest overall pollution, we calculated the average ozone concentration for each CBSA.

Figure 3: Top 10 CBSA by Average Ozone Concentration. The analysis clearly shows that ozone pollution is not evenly distributed across California. The Riverside-San Bernardino-Ontario, CA metropolitan area emerges as the region with the highest average ozone concentration, followed by other areas in Southern California and the Central Valley. This highlights a significant regional disparity in air quality.
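The ranking behind this kind of chart is a group-wise mean followed by a sort. A minimal Python sketch, assuming each record is a dict with a hypothetical `ozone` key plus a grouping key such as `CBSA.Name`; the function name `top_groups` is likewise illustrative:

```python
from collections import defaultdict


def top_groups(records, key, n=10):
    """Rank groups by mean ozone; returns [(group, mean, count), ...]."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for rec in records:
        acc = totals[rec[key]]
        acc[0] += rec["ozone"]
        acc[1] += 1
    ranked = [(group, total / count, count)
              for group, (total, count) in totals.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked[:n]
```

The same function applies at the monitoring-site level by passing a site-name key instead. Returning the count alongside the mean makes it easy to spot groups whose average rests on few readings, and records labeled "Unknown" during cleaning can be filtered out before ranking if they should not compete with named regions.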

Top 10 Most Polluted Monitoring Sites

To pinpoint the problem at a more local level, we identified the specific monitoring sites with the highest average concentrations. This provides actionable intelligence for local authorities.

Figure 4: Top 10 Monitoring Sites by Average Ozone Concentration. At the local level, the Sequoia & Kings Canyon NPs - Lower Kaweah monitoring site records the highest average ozone level, making it a critical location of concern. Identifying these specific sites is essential for investigating local emission sources and implementing targeted mitigation strategies.