California - Breathe At Your Own Risk

📖 Background

Ozone pollution across California’s regions was evaluated using data obtained from the U.S. Environmental Protection Agency (EPA). This dataset included daily ozone measurements collected from monitoring stations throughout the state.

As is often the case with real-world environmental data, the dataset was not ready for immediate analysis. It was affected by missing values, inconsistent formats, potential duplicates, and outliers.

Before any meaningful insights could be drawn, the data had to be thoroughly cleaned and validated. Only after these steps were completed could trends be uncovered, high-risk regions identified, and areas in urgent need of policy intervention assessed.

Executive Summary

Data Cleaning

We cleaned the dataset to ensure reliability by fixing date formats, removing duplicates, and handling invalid or missing values. About 11.4% of rows had invalid dates — we kept them for general context but excluded them from any time-based analysis. We also removed 3,650 exact duplicates and filtered out rows with less than 80% completeness or fewer than 14 daily readings.

For missing ozone concentration and AQI values, we used group-wise imputation by Site ID and Method Code, choosing the mean or the median depending on each group's skewness. A fallback method filled any remaining gaps. Missing CBSA codes occurred mostly in rural counties outside official metro areas, such as Siskiyou and Amador.

Temporal Trends

Ozone levels in California during 2024 show a clear seasonal pattern, with concentrations peaking in summer (Q3) and dropping in winter. This trend holds for both ozone and AQI, which shifts from mostly “Good” in cooler months to “Moderate” in summer (per EPA categories). Weekday vs. weekend averages are similar, but weekdays show more extreme ozone spikes, likely due to traffic and urban activity. While average levels stay within EPA limits (0.070 ppm), they often exceed WHO’s stricter guidelines, especially during summer — suggesting potential health risks even when EPA thresholds aren’t crossed.
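The seasonal and weekday comparisons described here can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's own code: the `ozone_demo` frame and its values are made up, and only the column name and the 0.05 ppm WHO figure come from this report.

```python
import pandas as pd

col = "Daily Max 8-hour Ozone Concentration"
WHO_LIMIT_PPM = 0.05  # WHO threshold as quoted in this report

# Toy rows for illustration only; the real analysis uses the EPA dataset.
ozone_demo = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2024-01-15", "2024-04-13", "2024-07-20", "2024-07-22", "2024-10-05"]
    ),
    col: [0.030, 0.042, 0.061, 0.055, 0.038],
})

# Quarter of the year (1 to 4) taken from the parsed date.
ozone_demo["Quarter"] = ozone_demo["Date"].dt.quarter
quarterly_mean = ozone_demo.groupby("Quarter")[col].mean()

# dayofweek is Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
ozone_demo["Day Type"] = ozone_demo["Date"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)
# The mean captures the typical level, the max captures spikes.
day_type_stats = ozone_demo.groupby("Day Type")[col].agg(["mean", "max"])

# Share of days above the WHO threshold within each quarter.
share_above_who = (
    (ozone_demo[col] > WHO_LIMIT_PPM).groupby(ozone_demo["Quarter"]).mean()
)
```

Comparing both the mean and the max per day type is one way to separate "similar averages" from "more extreme spikes"; a high percentile would serve the same purpose.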

Regional Trends

Notably, 24% of all monitoring sites exceeded the WHO threshold of 0.05 ppm. Many of them are located in mountain, inland, rural, or recreational areas, which are often underserved in air quality planning and face unique geographic and meteorological challenges. These areas may be high-priority zones for intervention (see the interactive geospatial map).
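One way such a site-level share could be computed is sketched below. It assumes a site "exceeds" the threshold when its mean daily maximum is above 0.05 ppm; the report does not state the exact rule used, and the `demo` rows are invented for illustration.

```python
import pandas as pd

col = "Daily Max 8-hour Ozone Concentration"
WHO_LIMIT_PPM = 0.05  # WHO threshold as quoted in this report

# Toy rows for illustration only.
demo = pd.DataFrame({
    "Site ID": [1, 1, 2, 2, 3, 3, 4, 4],
    col: [0.040, 0.044, 0.058, 0.060, 0.030, 0.035, 0.049, 0.047],
})

# Assumed rule: compare each site's mean daily maximum with the threshold.
site_mean = demo.groupby("Site ID")[col].mean()
exceeds = site_mean > WHO_LIMIT_PPM

share_exceeding = exceeds.mean()              # fraction of sites above the limit
flagged_sites = exceeds[exceeds].index.tolist()  # IDs of those sites
```

The flagged site IDs can then be joined back to Site Latitude and Site Longitude for mapping.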

Method Comparison

Ozone concentration varies by measurement method:

  • Methods 87 and 199 report similar averages (~0.045 ppm), with Method 199 showing higher variability.
  • Method 53 shows the highest average (~0.060 ppm), but its sample is too small to be representative.
  • Method 47 reports lower averages.
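A per-method comparison like the one above can be sketched with a grouped aggregation. The rows below are invented, and `MIN_ROWS` is a hypothetical cut-off for flagging methods with too few readings to compare fairly.

```python
import pandas as pd

col = "Daily Max 8-hour Ozone Concentration"
MIN_ROWS = 3  # hypothetical minimum sample size, chosen for this toy example

# Toy rows for illustration only.
demo = pd.DataFrame({
    "Method Code": [47, 47, 87, 87, 87, 199, 199, 199, 53],
    col: [0.035, 0.037, 0.044, 0.045, 0.046, 0.035, 0.045, 0.055, 0.060],
})

# Sample size, average, and spread of the readings for each method.
method_stats = demo.groupby("Method Code")[col].agg(["count", "mean", "std"])

# Flag methods whose averages rest on very few rows.
method_stats["small_sample"] = method_stats["count"] < MIN_ROWS
```

Reporting the count next to the mean keeps an unusually high average from being read as a real difference when it comes from a handful of rows.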

Recommendations

  1. Revise California’s Air Quality Goals to Reflect WHO Guidelines

    • While average ozone levels generally meet EPA standards, they frequently exceed WHO's tighter 0.05 ppm threshold, especially in Q3 (summer).
    • Recommendation: Consider adopting WHO thresholds as advisory targets, especially during peak ozone months, to better protect vulnerable populations (e.g., children, the elderly).
  2. Deploy Seasonal, Region-Specific Communication Campaigns

    • Public understanding of “Good” EPA levels may falsely imply low health risk. The data shows moderate levels still pose concern under WHO criteria.
    • Recommendation: Launch localized seasonal alerts in plain language, especially in summer, explaining why even “moderate” days can affect health. Leverage mobile apps, local radio, and recreational signage.
  3. Target Rural and Recreational Zones for Air Quality Action Plans

    • These areas often lack robust air quality infrastructure or tailored regulations despite being ecologically and economically sensitive (e.g., tourism, agriculture).
    • Recommendation: Prioritize outreach and adaptive strategies like seasonal public health alerts, mobile monitoring units, or zoning policies. Consider expanding low-cost sensor networks and using innovative monitoring technologies to better capture local pollution.
  4. Make Ozone Measurements More Consistent

    • Different methods report slightly different ozone levels — some run higher, others lower — which could be due to equipment, setup, or calibration.
    • Recommendation: Run regular checks to make sure all monitoring tools are well-calibrated and give consistent results. When analyzing or comparing ozone levels, adjust for these differences or focus on the most reliable methods.
  5. Integrate Weekday Spike Control into Urban Emissions Policies

    • While weekday and weekend ozone averages are similar, weekdays show stronger outlier spikes, likely tied to commuter traffic and industrial activity.
    • Recommendation: Implement or expand rush-hour emission reduction policies, encourage flex work schedules, and target urban NOx reduction as part of coordinated, regional pollution control efforts to minimize peak-day ozone events.

Sources:

  • The Importance of NOx Control for Peak Ozone Mitigation
  • 3 Strategies for Reducing Toxic Ozone Pollution
  • Reducing Southern California Ozone Concentrations under a Low Carbon Energy Scenario

💾 The data

The data is a modified dataset from the U.S. Environmental Protection Agency (EPA).

The ozone dataset contains daily air quality summary statistics by monitor for the state of California for 2024. Each row contains the date and the air quality metrics for one collection method and site.

Column Name | Definition
Date | The calendar date with which the air quality values are associated
Source | The data source: EPA's Air Quality System (AQS), or Airnow reports
Site ID | The ID for the air monitoring site
POC | The ID number for the monitor
Daily Max 8-hour Ozone Concentration | The highest 8-hour value of the day for ozone concentration
Units | Parts per million by volume (ppm)
Daily AQI Value | The highest air quality index value for the day (50 = good, 300+ = hazardous)
Local Site Name | Name of the monitoring site
Daily Obs Count | Number of observations reported in that day
Percent Complete | Indicates whether all expected samples were collected
Method Code | Identifier for the collection method
CBSA Code | Identifier for the Core Based Statistical Area (CBSA)
CBSA Name | Name of the Core Based Statistical Area
County FIPS Code | Identifier for the county (Federal Information Processing Standards code)
County | Name of the county
Site Latitude | Latitude coordinates of the site
Site Longitude | Longitude coordinates of the site

Data Cleaning Process and Results

  • Converted & unified Date to datetime format and identified records with invalid dates.
  • 11.39% of the data contains invalid dates. These rows were kept for general analysis only and excluded from all time-based analysis.
  • Ensured correct data types (e.g., numeric concentration columns).
  • Investigated and handled invalid and missing values:
    • Removed 3,650 exact duplicates.
    • Missing CBSA Code and CBSA Name values were found in counties outside official metro or micro areas (e.g., Amador, Calaveras, Colusa, Glenn, Mariposa, Siskiyou).
    • Filtered out unreliable rows with less than 80% completeness or fewer than 14 daily observations, since they showed skewed readings.
    • Imputed missing values for Daily Max 8-hour Ozone Concentration and Daily AQI Value using group-based imputation, based on Site ID and Method Code.
      • Chose between mean or median depending on how skewed each group was.
      • A fallback method handled any remaining gaps.
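The reliability filter and group-based imputation listed above could look roughly like this. It is a sketch under stated assumptions, not the notebook's hidden code: the `SKEW_LIMIT` cut-off, the `fill_value` helper, and the use of the overall median as the fallback are all hypothetical choices, and the rows are invented.

```python
import numpy as np
import pandas as pd

col = "Daily Max 8-hour Ozone Concentration"
group_cols = ["Site ID", "Method Code"]
SKEW_LIMIT = 1.0  # assumed cut-off: above this, prefer the median

# Toy rows for illustration only.
demo = pd.DataFrame({
    "Site ID":          [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3],
    "Method Code":      [87, 87, 87, 87, 87, 87, 47, 47, 47, 47, 87],
    "Daily Obs Count":  [17, 17, 17, 17, 17, 9, 17, 17, 17, 17, 17],
    "Percent Complete": [100, 100, 100, 100, 100, 53, 100, 100, 100, 100, 100],
    col: [0.040, 0.041, 0.042, 0.090, np.nan, 0.200,
          0.030, 0.032, 0.034, np.nan, np.nan],
})

# Keep only rows with at least 80% completeness and 14 daily observations.
reliable = demo[
    (demo["Percent Complete"] >= 80) & (demo["Daily Obs Count"] >= 14)
].copy()


def fill_value(values: pd.Series) -> float:
    """Return the group median if the group is strongly skewed, else its mean."""
    skew = values.skew()  # NaN when fewer than 3 non-null values
    if pd.notna(skew) and abs(skew) > SKEW_LIMIT:
        return values.median()
    return values.mean()


# One fill value per (Site ID, Method Code) group, broadcast to its rows.
group_fill = reliable.groupby(group_cols)[col].transform(fill_value)
reliable[col] = reliable[col].fillna(group_fill)

# Assumed fallback: groups with no readings at all get the overall median.
reliable[col] = reliable[col].fillna(reliable[col].median())
```

In this toy data the first group is right-skewed by a single high reading, so it is filled with its median, the second is symmetric and uses its mean, and the third has no readings and falls through to the fallback. The same steps would apply to Daily AQI Value.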

Quick Sneak Peek at Our Data

# Summary statistics for the numeric columns
print(ozone.describe())
# Column dtypes and non-null counts
ozone.info()

Removing Exact Duplicates

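The notebook's own cell is not shown here. A minimal sketch of counting and dropping exact duplicates in pandas, using invented rows, might look like this:

```python
import pandas as pd

# Toy rows for illustration only; the last row repeats the first exactly.
demo = pd.DataFrame({
    "Date": ["01/01/2024", "01/02/2024", "01/01/2024", "01/01/2024"],
    "Site ID": [1, 1, 2, 1],
    "Daily Max 8-hour Ozone Concentration": [0.031, 0.034, 0.040, 0.031],
})

# duplicated() marks rows that match an earlier row on every column.
n_exact_duplicates = int(demo.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = demo.drop_duplicates().reset_index(drop=True)
```

Because no `subset` is passed, only rows that match on all columns count as duplicates, which corresponds to "exact" duplicates.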

Checking Dates


⚠️ 11.39% of the data (5,819 rows) contains invalid dates. Because this share is too large to ignore (well above a 1% threshold), these rows are flagged and retained: they are used for general analysis only and excluded from all temporal analysis.
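Flagging invalid dates while keeping the rows can be sketched as below. The raw date strings and the two entries in `ASSUMED_FORMATS` are hypothetical, since the report does not list the formats actually present in the data.

```python
import pandas as pd

# Hypothetical date formats, chosen only to make the example concrete.
ASSUMED_FORMATS = ["%m/%d/%Y", "%B %d/%Y"]

# Toy rows for illustration only.
demo = pd.DataFrame({
    "Date": ["01/15/2024", "January 16/2024", "/2024", None],
})

# Try each format in turn; values that do not match become NaT.
candidates = [
    pd.to_datetime(demo["Date"], format=fmt, errors="coerce")
    for fmt in ASSUMED_FORMATS
]
parsed = candidates[0]
for extra in candidates[1:]:
    parsed = parsed.fillna(extra)

demo["Date Parsed"] = parsed
demo["Invalid Date"] = parsed.isna()  # flag the row instead of dropping it

invalid_share = demo["Invalid Date"].mean()

# Time-based analysis uses only rows with a valid date.
temporal = demo[~demo["Invalid Date"]]
```

Keeping the flag as a column lets general statistics use every row while any time-based step filters on it.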