California - Breathe At Your Own Risk
📖 Background
Ozone pollution across California’s regions was evaluated using data obtained from the U.S. Environmental Protection Agency (EPA). This dataset included daily ozone measurements collected from monitoring stations throughout the state.
As is often the case with real-world environmental data, the dataset was not ready for immediate analysis. It was affected by missing values, inconsistent formats, potential duplicates, and outliers.
Before any meaningful insights could be drawn, the data had to be thoroughly cleaned and validated. Only after these steps were completed could trends be uncovered, high-risk regions identified, and areas in urgent need of policy intervention assessed.
Executive Summary
Data Cleaning
We cleaned the dataset to ensure reliability by fixing date formats, removing duplicates, and handling invalid or missing values. About 11.4% of rows had invalid dates — we kept them for general context but excluded them from any time-based analysis. We also removed 3,650 exact duplicates and filtered out rows with less than 80% completeness or fewer than 14 daily readings.
For missing Ozone Concentrations and AQI values, we used group-wise imputation based on Site and Methods used, choosing mean or median depending on skewness. A fallback method filled any remaining gaps. Missing CBSA codes were mostly found in rural counties outside official metro zones, like Siskiyou and Amador.
Temporal Trends
Ozone levels in California during 2024 show a clear seasonal pattern, with concentrations peaking in summer (Q3) and dropping in winter. This trend holds for both ozone and AQI, which shifts from mostly “Good” in cooler months to “Moderate” in summer (per EPA categories). Weekday vs. weekend averages are similar, but weekdays show more extreme ozone spikes, likely due to traffic and urban activity. While average levels stay within EPA limits (0.070 ppm), they often exceed WHO’s stricter guidelines, especially during summer — suggesting potential health risks even when EPA thresholds aren’t crossed.
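A minimal sketch of how the quarterly and weekday/weekend comparisons could be computed with pandas. It assumes the `Date` column is already parsed to datetime and uses the concentration column name from the data dictionary. The sample rows are made up for illustration and are not EPA values.

```python
import pandas as pd

col = "Daily Max 8-hour Ozone Concentration"

# Made-up sample rows, for illustration only (not the EPA data).
sample = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2024-01-15", "2024-02-10", "2024-07-06", "2024-07-08", "2024-08-20"]
    ),
    col: [0.030, 0.032, 0.058, 0.062, 0.060],
})

# Mean ozone per calendar quarter (1 = Jan-Mar, ..., 4 = Oct-Dec).
quarterly_mean = sample.groupby(sample["Date"].dt.quarter)[col].mean()

# Weekday vs. weekend: dayofweek runs from 0 (Monday) to 6 (Sunday).
is_weekend = sample["Date"].dt.dayofweek >= 5
day_type = is_weekend.map({True: "Weekend", False: "Weekday"})

# The mean captures the typical level, the max captures the spikes.
day_type_stats = sample.groupby(day_type)[col].agg(["mean", "max"])
```

Comparing `mean` with `max` per day type is one simple way to see similar averages alongside different extremes.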
Regional Trends
Notably, 24% of all monitoring sites exceeded the WHO threshold of 0.05 ppm. Many of them are located in mountain, inland, rural, or recreational areas, which are often underserved in air quality planning and face unique geographic and meteorological challenges. These areas may be high-priority zones for intervention (see the interactive geospatial map).
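The share of sites above the WHO threshold could be computed along these lines. This is a sketch under one assumption: a site counts as exceeding when its mean daily maximum is above 0.05 ppm. The original analysis may have used a different criterion, and the sample values are made up.

```python
import pandas as pd

WHO_THRESHOLD_PPM = 0.05  # WHO threshold as quoted in the text
col = "Daily Max 8-hour Ozone Concentration"

# Made-up sample rows with hypothetical site IDs, for illustration only.
sample = pd.DataFrame({
    "Site ID": [1, 1, 2, 2, 3, 3, 4, 4],
    col: [0.040, 0.044, 0.055, 0.061, 0.030, 0.038, 0.049, 0.047],
})

# Assumed criterion: compare each site's mean daily max with the threshold.
site_mean = sample.groupby("Site ID")[col].mean()
exceeding_sites = site_mean[site_mean > WHO_THRESHOLD_PPM]
share_exceeding = len(exceeding_sites) / len(site_mean)
```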
Method Comparison
Ozone concentration varies by measurement method:
- Methods 87 and 199 report similar averages (~0.045 ppm), with Method 199 showing higher variability.
- Method 53 shows the highest average (~0.060 ppm), but its sample is too small to be representative.
- Method 47 reports lower averages.
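A per-method summary like the one above can be sketched with a single `groupby`. Including the row count next to the mean and standard deviation makes small samples visible. The values below are made up for illustration and do not reproduce the reported figures.

```python
import pandas as pd

col = "Daily Max 8-hour Ozone Concentration"

# Made-up sample rows, for illustration only (not the EPA data).
sample = pd.DataFrame({
    "Method Code": [87, 87, 87, 199, 199, 199, 53],
    col: [0.044, 0.045, 0.046, 0.030, 0.045, 0.060, 0.060],
})

# count flags thin samples; std shows variability around the mean.
method_stats = sample.groupby("Method Code")[col].agg(["count", "mean", "std"])
```

A method with a single row gets a `std` of NaN here, which is a useful warning sign before comparing averages.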
Recommendations
- Revise California’s Air Quality Goals to Reflect WHO Guidelines
  - While average ozone levels generally meet EPA standards, they frequently exceed WHO's tighter 0.05 ppm threshold, especially in Q3 (summer).
  - Recommendation: Consider adopting WHO thresholds as advisory targets, especially during peak ozone months, to better protect vulnerable populations (e.g., children, the elderly).
- Deploy Seasonal, Region-Specific Communication Campaigns
  - The public may read EPA labels such as “Good” as implying low health risk, but the data show that even “Moderate” levels raise concern under WHO criteria.
  - Recommendation: Launch localized seasonal alerts in plain language, especially in summer, explaining why even “moderate” days can affect health. Leverage mobile apps, local radio, and recreational signage.
- Target Rural and Recreational Zones for Air Quality Action Plans
  - These areas often lack robust air quality infrastructure or tailored regulations despite being ecologically and economically sensitive (e.g., tourism, agriculture).
  - Recommendation: Prioritize outreach and adaptive strategies like seasonal public health alerts, mobile monitoring units, or zoning policies. Consider expanding low-cost sensor networks and using innovative monitoring technologies to better capture local pollution.
- Make Ozone Measurements More Consistent
  - Different methods report slightly different ozone levels (some run higher, others lower), which could be due to equipment, setup, or calibration.
  - Recommendation: Run regular checks to make sure all monitoring tools are well-calibrated and give consistent results. When analyzing or comparing ozone levels, adjust for these differences or focus on the most reliable methods.
- Integrate Weekday Spike Control into Urban Emissions Policies
  - While weekday and weekend ozone averages are similar, weekdays show stronger outlier spikes, likely tied to commuter traffic and industrial activity.
  - Recommendation: Implement or expand rush-hour emission reduction policies, encourage flex work schedules, and target urban NOx reduction as part of coordinated, regional pollution control efforts to minimize peak-day ozone events.
💾 The data
The data is a modified dataset from the U.S. Environmental Protection Agency (EPA).
Ozone contains the daily air quality summary statistics by monitor for the state of California for 2024. Each row contains the date and the air quality metrics per collection method and site.
| Column Name | Definition |
|---|---|
| Date | The calendar date with which the air quality values are associated |
| Source | The data source: EPA's Air Quality System (AQS), or AirNow reports |
| Site ID | The ID for the air monitoring site |
| POC | The ID number for the monitor |
| Daily Max 8-hour Ozone Concentration | The highest 8-hour value of the day for ozone concentration |
| Units | Parts per million by volume (ppm) |
| Daily AQI Value | The highest air quality index value for the day (50 = good, 300+ = hazardous) |
| Local Site Name | Name of the monitoring site |
| Daily Obs Count | Number of observations reported in that day |
| Percent Complete | Indicates whether all expected samples were collected |
| Method Code | Identifier for the collection method |
| CBSA Code | Identifier for the Core-Based Statistical Area (CBSA) |
| CBSA Name | Name of the Core-Based Statistical Area |
| County FIPS Code | Identifier for the county (Federal Information Processing Standards code) |
| County | Name of the county |
| Site Latitude | Latitude coordinates of the site |
| Site Longitude | Longitude coordinates of the site |
Data Cleaning Process and Results
- Converted & unified `Date` to datetime format and identified records with invalid dates.
  - 11.39% of the data contains invalid dates. These rows were kept for general analysis only and excluded from all time-based analysis.
- Ensured correct data types (e.g., numeric concentration columns).
- Investigated and handled invalid and missing values:
  - Removed 3,650 exact duplicates.
  - Missing `CBSA Code` and `CBSA Name` values were found in counties outside official metro or micro areas (e.g., Amador, Calaveras, Colusa, Glenn, Mariposa, Siskiyou).
  - Filtered out unreliable rows with less than 80% completeness or fewer than 14 daily observations, since they showed skewed readings.
  - Imputed missing values for `Daily Max 8-hour Ozone Concentration` and `Daily AQI Value` using group-based imputation, based on `Site ID` and `Method Code`.
    - Chose between mean or median depending on how skewed each group was.
    - A fallback method handled any remaining gaps.
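The reliability filter and the group-based imputation described above could look roughly like the sketch below. Several details are assumptions and are not taken from the original notebook: the helper name `impute_by_group`, the skewness cutoff `skew_limit=1.0`, the use of the overall median as the fallback, and the reading of `Percent Complete` as a 0 to 100 value. The sample rows are made up.

```python
import pandas as pd

col = "Daily Max 8-hour Ozone Concentration"

def impute_by_group(df, value_col, group_cols, skew_limit=1.0):
    """Fill NaNs with the group mean or median, then with the overall median.

    Hypothetical helper: the cutoff and the fallback are assumptions.
    """
    def fill(group):
        skew = group.skew()  # NaN when fewer than 3 non-missing values
        use_median = pd.notna(skew) and abs(skew) > skew_limit
        fill_value = group.median() if use_median else group.mean()
        if pd.isna(fill_value):  # group has no observed values at all
            return group
        return group.fillna(fill_value)

    filled = df.groupby(group_cols)[value_col].transform(fill)
    # Assumed fallback: overall median for groups with nothing to learn from.
    return filled.fillna(df[value_col].median())

# Made-up sample rows with hypothetical site IDs, for illustration only.
sample = pd.DataFrame({
    "Site ID": ["A"] * 4 + ["B"] * 5 + ["C", "A"],
    "Method Code": [87] * 9 + [47, 87],
    col: [0.040, 0.042, None, 0.044,
          0.030, 0.031, 0.032, 0.090, None,
          None, 0.010],
    "Percent Complete": [100] * 10 + [50],
    "Daily Obs Count": [17] * 10 + [9],
})

# Keep rows with at least 80% completeness and at least 14 observations.
mask = (sample["Percent Complete"] >= 80) & (sample["Daily Obs Count"] >= 14)
reliable = sample[mask].copy()

reliable[col] = impute_by_group(reliable, col, ["Site ID", "Method Code"])
```

In this sample, the roughly symmetric group A is filled with its mean, the right-skewed group B with its median, and group C, which has no observed values, with the overall median.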
Quick Sneak Peek at Our Data
```python
ozone.describe()
ozone.info()
```
Removing Exact Duplicates
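A minimal sketch of this step, assuming an exact duplicate means a row that is identical in every column. The sample rows and the site ID are made up for illustration.

```python
import pandas as pd

# Made-up sample rows with a hypothetical site ID, for illustration only.
sample = pd.DataFrame({
    "Date": ["01/01/2024", "01/01/2024", "01/02/2024"],
    "Site ID": [1, 1, 1],
    "Daily Max 8-hour Ozone Concentration": [0.031, 0.031, 0.037],
})

# duplicated() marks every repeat of an earlier, fully identical row.
n_duplicates = int(sample.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduplicated = sample.drop_duplicates().reset_index(drop=True)
```

Counting duplicates before dropping them gives a figure that can be reported alongside the cleaned row count.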
Checking Dates
⚠️ 11.39% (5,819 rows) of the data contain invalid dates. As this is too significant to ignore (exceeding a 1% threshold), these rows will be flagged and retained. More importantly, they will be excluded from all temporal analysis, and used for general analysis only.
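One way to flag invalid dates without dropping the rows is sketched below. It assumes valid dates follow a `%m/%d/%Y` pattern, which may not match every format in the real file. The `Invalid Date` column name and the sample strings are hypothetical.

```python
import pandas as pd

# Made-up sample values, for illustration only.
sample = pd.DataFrame(
    {"Date": ["01/05/2024", "2024-13-45", "/2024", "03/17/2024", None]}
)

# Unparseable values become NaT instead of raising an error.
parsed = pd.to_datetime(sample["Date"], format="%m/%d/%Y", errors="coerce")

sample["Date"] = parsed
sample["Invalid Date"] = parsed.isna()  # hypothetical flag column

# Share of rows with invalid dates, in percent.
invalid_share = sample["Invalid Date"].mean() * 100

# Rows kept for time-based analysis; the full frame stays available otherwise.
temporal = sample[~sample["Invalid Date"]]
```

Keeping the flag as a column lets the general analysis use every row while time-based steps work from `temporal`.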