Ozone Pollution Analysis in California (2024)

📄 Introduction

Welcome!

Welcome to our exploration of Ozone Pollution in California (2024)! This notebook is the result of a structured, analytical workflow aimed at uncovering meaningful environmental insights from real-world air quality data. My primary objective is to provide clear, actionable recommendations for policymakers while demonstrating a systematic approach to data analysis based on industry best practices.

Through this project, we’ll dive into data cleaning, exploratory analysis, geospatial mapping, and trend identification to understand ozone patterns across California. By combining environmental science concepts with data analytics techniques, we’ll translate raw data into insights that can help address one of California’s most persistent environmental challenges. Let’s turn this challenge into an opportunity to learn, act, and make a measurable impact on public health and sustainability.

Why this matters:

Ozone pollution is not just an environmental concern; it also has severe public health and economic consequences. Prolonged exposure to elevated ozone levels is linked to respiratory inflammation, worsened asthma, reduced lung function, and increased cardiovascular risks. Economically, high ozone levels harm agricultural yields, increase healthcare costs, and reduce workforce productivity.

California Air Resources Board – Ozone & Health

Ground-level ozone exposure leads to respiratory inflammation, worsened asthma, cardiovascular harm, and significant economic impacts on ecosystem productivity and public health, including increased healthcare and maintenance costs. For detailed evidence, refer to "Costs of Air Pollution in California’s San Joaquin Valley: A Societal Perspective of the Burden of Asthma on Emergency Departments and Inpatient Care".

What You’ll Learn from This Notebook

This notebook goes beyond just presenting results: it’s a step-by-step guide to data-driven environmental analysis. Each section is designed to walk you through the analytical process, from understanding ozone data to pinpointing seasonal spikes, geographic hotspots, and human activity patterns.

The structure reflects my approach to building a reusable analysis framework, showing how each method (time series analysis, hotspot detection, weekday/weekend comparisons) fits into the bigger picture of environmental decision-making. By blending analytical techniques with clear communication, this notebook serves as both a learning resource and a practical tool for making informed policy recommendations. Whether you’re a data enthusiast, environmental researcher, or policymaker, this project aims to inform, inspire, and drive action.

💪 Competition Information

Competition Challenge

Create a report that covers the following:

  • Analyze and interpret ozone pollution trends across California for the year 2024.

  • Identify seasonal patterns, geographic hotspots, and human activity impacts on ozone levels.

  • Evaluate whether certain monitoring sites or regions show consistently higher pollution levels and why.

  • Recommend evidence-based policy actions to target the most affected areas and times of year.

💾 Data Columns Description

The data is a modified dataset from the U.S. Environmental Protection Agency (EPA).

The ozone dataset contains the daily air quality summary statistics by monitor for the state of California for 2024. Each row contains the date and the air quality metrics per collection method and site.
  • "Date" - the calendar date with which the air quality values are associated
  • "Source" - the data source: EPA's Air Quality System (AQS) or AirNow reports
  • "Site ID" - the id for the air monitoring site
  • "POC" - the id number for the monitor
  • "Daily Max 8-hour Ozone Concentration" - the highest 8-hour value of the day for ozone concentration
  • "Units" - parts per million by volume (ppm)
  • "Daily AQI Value" - the highest air quality index value for the day, telling how clean or polluted the air is (a value of 50 or below represents good air quality, while a value above 300 is hazardous)
  • "Local Site Name" - name of the monitoring site
  • "Daily Obs Count" - number of observations reported in that day
  • "Percent Complete" - indicates whether all expected samples were collected
  • "Method Code" - identifier for the collection method
  • "CBSA Code" - identifier for the core-based statistical area (CBSA)
  • "CBSA Name" - name of the core-based statistical area
  • "State FIPS Code" - identifier for the state
  • "State" - name of the state
  • "County FIPS Code" - identifier for the county
  • "County" - name of the county
  • "Site Latitude" - latitude coordinates of the site
  • "Site Longitude" - longitude coordinates of the site

📄 Dataset Overview

Daily ozone measurements at monitoring stations.

Columns:

  • Date → Date of reading.

  • Source, Site ID, POC, Method Code, etc. → Identifiers.

  • Daily Max 8-hour Ozone Concentration → Key measurement (ppm).

  • Daily AQI Value → Air Quality Index (interpretable scale: ≤50 = good, >300 = hazardous).

  • Local Site Name, CBSA Name, County, State, Latitude, Longitude → Location info.

  • Daily Obs Count, Percent Complete → Data quality metrics.
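
The AQI scale noted above can be made easier to read by bucketing each value into its named category. The following is a minimal sketch, assuming the standard EPA AQI category breakpoints; the helper `add_aqi_category`, the output column name "AQI Category", and the sample values are illustrative and not part of the original notebook.

```python
import pandas as pd

# Standard EPA AQI category upper bounds (inclusive): 50, 100, 150, 200, 300.
AQI_BINS = [-1, 50, 100, 150, 200, 300, float("inf")]
AQI_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def add_aqi_category(df: pd.DataFrame, aqi_col: str = "Daily AQI Value") -> pd.DataFrame:
    """Return a copy of df with an added (hypothetical) 'AQI Category' column."""
    out = df.copy()
    # pd.cut uses right-closed intervals by default, so 50 falls in "Good".
    out["AQI Category"] = pd.cut(out[aqi_col], bins=AQI_BINS, labels=AQI_LABELS)
    return out


# Made-up values for illustration only; not taken from the dataset.
sample = pd.DataFrame({"Daily AQI Value": [35, 50, 51, 120, 301, None]})
categorized = add_aqi_category(sample)
print(categorized)
```

Missing AQI values stay missing in the new column, so this step can run before or after cleaning.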

🧐 Data Quality Issues

Missing values:

  • Daily Max 8-hour Ozone Concentration → ~2,738 missing.

  • Daily AQI Value → ~2,738 missing.

  • Method Code → ~6,490 missing.

  • CBSA Code & CBSA Name → ~2,408 missing.

  • Several text fields have no missing values.

  • Data types look correct, except that Date is stored as object and should be converted to datetime.
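
Counts like those above are easier to compare when shown next to the share of rows they represent. The sketch below is one way to build such a summary; the helper name `missing_summary` and the tiny sample frame are hypothetical, and the real counts come from running it on the full dataset.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Tabulate missing-value counts and percentages for columns with gaps."""
    counts = df.isna().sum()
    summary = pd.DataFrame(
        {
            "missing": counts,
            "percent": (counts / len(df) * 100).round(2),
        }
    )
    # Keep only columns that actually have missing values, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up rows for illustration only; not taken from the dataset.
sample = pd.DataFrame(
    {
        "Daily AQI Value": [40, None, 55, None],
        "Method Code": [None, None, None, 87.0],
        "County": ["A", "B", "C", "D"],
    }
)
summary = missing_summary(sample)
print(summary)
```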

🧹 Data Cleaning & Preprocessing

To ensure robust results, we cleaned and validated the dataset:

  • Converted Date to datetime.

  • Dropped duplicate records.

  • Removed rows with <75% completeness.

  • Excluded rows with missing key metrics (Daily Max 8-hour Ozone Concentration, Daily AQI Value).

  • Checked missingness patterns and data types.
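
The cleaning steps listed above could be sketched roughly as follows. This is an illustrative outline, not the notebook's exact code: the function name `clean_ozone` and the sample rows are assumptions, and `errors="coerce"` is one possible choice for handling unparseable dates (it turns them into `NaT`).

```python
import pandas as pd

KEY_METRICS = ["Daily Max 8-hour Ozone Concentration", "Daily AQI Value"]


def clean_ozone(df: pd.DataFrame, min_complete: float = 75.0) -> pd.DataFrame:
    """Apply the listed cleaning steps and return a new DataFrame."""
    out = df.copy()
    # 1. Convert Date to datetime (unparseable values become NaT).
    out["Date"] = pd.to_datetime(out["Date"], errors="coerce")
    # 2. Drop exact duplicate records.
    out = out.drop_duplicates()
    # 3. Remove rows with less than 75% completeness.
    out = out[out["Percent Complete"] >= min_complete]
    # 4. Exclude rows missing either key metric.
    out = out.dropna(subset=KEY_METRICS)
    return out.reset_index(drop=True)


# Made-up rows for illustration only; not taken from the dataset.
first = {"Date": "2024-01-01", "Site ID": 1, "Percent Complete": 100.0,
         "Daily Max 8-hour Ozone Concentration": 0.031, "Daily AQI Value": 29.0}
sample = pd.DataFrame([
    first,
    dict(first),  # exact duplicate, dropped in step 2
    {"Date": "2024-01-02", "Site ID": 1, "Percent Complete": 50.0,
     "Daily Max 8-hour Ozone Concentration": 0.040, "Daily AQI Value": 37.0},
    {"Date": "2024-01-03", "Site ID": 1, "Percent Complete": 100.0,
     "Daily Max 8-hour Ozone Concentration": None, "Daily AQI Value": None},
    {"Date": "2024-01-04", "Site ID": 1, "Percent Complete": 88.0,
     "Daily Max 8-hour Ozone Concentration": 0.050, "Daily AQI Value": 46.0},
])
cleaned = clean_ozone(sample)
print(cleaned)
```

Returning a copy keeps the raw frame available, so row counts before and after cleaning can be compared.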

🌎 Ozone Pollution Dashboard — California 2024: Key Insights at a Glance

Key Findings:

The ozone pollution dashboard reveals clear seasonal and geographic patterns across California in 2024. Ozone concentrations peak sharply during the summer months (June–August), coinciding with atmospheric conditions that accelerate ozone formation. Southern California counties, particularly Los Angeles, Riverside, and San Bernardino, consistently record the highest levels, with monitoring sites such as Claremont and Riverside emerging as persistent hotspots. A weekday versus weekend comparison indicates slightly higher ozone levels on weekdays, reflecting the influence of commuter traffic and industrial operations. Method code analysis shows only minor variations between measurement techniques, though certain sites display anomalous readings that warrant calibration. Geospatial mapping confirms that urban and inland areas face the greatest ozone burden, while many coastal regions remain relatively unaffected. These insights highlight the urgent need for seasonal, location-specific, and activity-focused interventions to reduce pollution and protect public health.
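
The seasonal and weekday/weekend comparisons described above come down to simple group-by aggregations. The sketch below shows one way to compute them, assuming `Date` has already been converted to datetime; the helper names and sample values are hypothetical and do not reproduce the dashboard's figures.

```python
import pandas as pd

OZONE = "Daily Max 8-hour Ozone Concentration"


def monthly_mean(df: pd.DataFrame) -> pd.Series:
    """Mean daily max 8-hour ozone per calendar month (1-12)."""
    return df.groupby(df["Date"].dt.month)[OZONE].mean().rename_axis("Month")


def weekday_weekend_mean(df: pd.DataFrame) -> pd.Series:
    """Mean ozone for weekdays (Mon-Fri) versus weekends (Sat-Sun)."""
    is_weekend = df["Date"].dt.dayofweek >= 5  # Monday=0 ... Sunday=6
    day_type = is_weekend.map({True: "Weekend", False: "Weekday"})
    return df.groupby(day_type)[OZONE].mean().rename_axis("Day Type")


# Made-up readings for illustration only; not taken from the dataset.
sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2024-01-08", "2024-01-06", "2024-07-01", "2024-07-06"]),
        OZONE: [0.030, 0.020, 0.060, 0.050],
    }
)
monthly = monthly_mean(sample)
by_day_type = weekday_weekend_mean(sample)
print(monthly)
print(by_day_type)
```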

Recommendations

  1. Target High-Impact Regions: Focus monitoring and pollution control measures on Los Angeles, Riverside, and San Bernardino counties, where ozone levels remain persistently high.

  2. Seasonal Policy Measures: Implement stricter emission controls during summer months, particularly in identified hotspot areas.

  3. Reduce Weekday Emissions: Promote public transportation, carpooling, and low-emission commuting policies to address weekday pollution spikes.

  4. Calibrate Monitoring Equipment: Conduct regular equipment checks for sites showing anomalous readings to ensure data accuracy.

  5. Leverage Hotspot Mapping: Use ongoing geospatial monitoring to track emerging high-risk zones and adjust policies accordingly.

Ozone Pollution Dashboard

The next dashboard view provides a closer look at ozone distribution across regions and monitoring sites.

📄 Executive Summary

This project analyzes ozone pollution across California during 2024 using official EPA monitoring data, identifying seasonal patterns, geographic hotspots, and human activity impacts. The objective is to provide data-driven recommendations to help policymakers target high-risk areas and reduce ozone-related health risks.

Key Findings:

  1. Seasonal Peaks: Ozone concentrations are highest during summer months (June–August), driven by atmospheric conditions that accelerate ozone formation.

  2. Geographic Hotspots: Southern California counties—notably Los Angeles, Riverside, and San Bernardino—consistently record elevated ozone levels.

  3. Site-Level Insights: Monitoring locations such as Claremont and Riverside report some of the highest readings statewide.

  4. Human Activity Influence: Weekday ozone levels are slightly higher than weekends, reflecting traffic and industrial activity patterns.

  5. Methodology Review: Minor variations exist between measurement methods; certain sites may require calibration for accuracy.

  6. Spatial Distribution: Geospatial analysis reveals concentrated urban and inland hotspots, while coastal areas generally maintain lower levels.
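
Hotspot findings of this kind can be checked by ranking groups on their mean concentration. A minimal sketch follows, assuming a cleaned DataFrame with the dataset's column names; `top_hotspots` and the placeholder county names are illustrative only.

```python
import pandas as pd

OZONE = "Daily Max 8-hour Ozone Concentration"


def top_hotspots(df: pd.DataFrame, by: str, n: int = 3) -> pd.Series:
    """Rank groups (e.g. 'County' or 'Local Site Name') by mean ozone, highest first."""
    return df.groupby(by)[OZONE].mean().sort_values(ascending=False).head(n)


# Made-up readings with placeholder county names; not taken from the dataset.
sample = pd.DataFrame(
    {
        "County": ["County A", "County A", "County B", "County C"],
        OZONE: [0.060, 0.070, 0.040, 0.050],
    }
)
hotspots = top_hotspots(sample, by="County", n=2)
print(hotspots)
```

Passing `by="Local Site Name"` instead gives the site-level ranking.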

Recommendations:

  • Target High-Impact Areas: Prioritize monitoring and emission reduction efforts in Los Angeles, Riverside, and San Bernardino counties.

  • Seasonal Interventions: Introduce stricter emission controls during summer months when ozone peaks are most severe.

  • Enhance Public Transport & Traffic Measures: Expand low-emission transport policies, especially on weekdays, to mitigate pollution from human activity.

  • Calibrate Measurement Methods: Regularly inspect and adjust monitoring sites with anomalous readings to maintain data integrity.

  • Leverage Hotspot Mapping: Continuously track ozone trends through geospatial analysis to adapt policies for persistent high-risk areas.

Data Cleaning + EDA for Ozone Dataset

Data Cleaning

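# Core libraries for data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno  # missing-data visualizations

# Load the 2024 California ozone dataset and preview the first rows
df = pd.read_csv('data/ozone.csv')
df.head()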

Missing Values Analysis — Initial Assessment

  • This section visualizes the amount and distribution of missing values in the dataset before cleaning. The bar chart highlights which columns have missing data and how much, which helps us identify and prioritize cleaning steps. Understanding missingness is crucial for data quality and reliable downstream analysis.

To maintain the reliability of our analysis and ensure that our insights remain accurate, it is essential to address missing values in the ozone pollution dataset. Missing data, if left untreated, can skew trends, bias results, and reduce the credibility of recommendations.

We’ll apply techniques and lessons learned from the DataCamp course Dealing with Missing Data in Python to handle these missing values.

This step confirms the extent of missingness and justifies the cleaning choices.

# Bar chart of missing-value counts per column
df.isnull().sum().plot(kind='bar', figsize=(12,5), color="#1f77b4")
plt.title("Missing Values per Column")
plt.ylabel("Count")
plt.show()

# Column dtypes and non-null counts
df.info()