California Air Quality: From Data Challenges to Insights
A Data-Driven Look at California Air Quality Variations
California’s diverse geography and industrial landscape create complex air quality challenges. Through this analysis of 2024 ozone data from the U.S. Environmental Protection Agency (EPA), sourced through both the AirNow and AQS systems, we dive into where, when, and why air pollution worsens.
Summary
As part of our mission to assess air quality and support environmental decision-making, our company was tasked with evaluating ozone pollution across different regions in California. We utilized daily ozone measurement data. We conducted a thorough data quality assessment, validated measurement consistency across different methods and monitoring sites, and analyzed spatial and temporal pollution patterns to identify high-risk areas.
Our analysis was structured around three key phases:
Data Cleaning:
- We addressed key data quality issues to ensure reliable analysis: handling missing values, resolving inconsistencies, and filtering outliers.
Data Validation
- AQS data covered all California counties, while AirNow was available for all except Humboldt and Lake.
- We validated AQI variation across method codes; notably, method code 053 consistently reported higher ozone concentrations, indicating potential measurement bias if only one method is used in a county.
Insight Extraction
- Summer months showed significantly higher AQI values
- Increased sunlight and higher temperatures accelerate ozone formation.
- Wildfires driven by extreme heat contributed substantially to poor air quality. California led the U.S. in 2024 with over 8,300 wildfires burning 1.08 million acres.
- Geography plays a critical role:
- San Joaquin Valley counties had the worst summer AQI, influenced by topography that traps pollutants, and emissions from agriculture, traffic, and industry.
- Coastal counties like Humboldt and San Francisco maintained better AQI year-round, aided by ocean breezes that help disperse pollutants.
- Human activity impacts are evident in weekday vs. weekend AQI trends.
- Elevated NOx and VOCs—emitted from vehicles, industrial operations, and household sources—lead to higher ground-level ozone during weekdays.
🔍 Data Quality: Cleaning Before Meaning
Before diving into insights, we tackled significant data quality issues. It's a vital step to ensure accurate conclusions:
1. Missing Data
- Method Code: 6,490 missing entries—all from AirNow. We flagged this as a known limitation.
- CBSA Info: 2,408 records from non-metro areas lacked CBSA codes; we used a standard placeholder (99999).
- AQI & Ozone: 2,783 missing values. Because AQI is calculated from ozone concentration, this strong one-to-one relationship allowed us to confidently impute the missing values from known pairs. We dropped rows where both Daily AQI Value and MAX 8-hours ozone concentration are null.
2. Inconsistent Formats
- County Names: Variations like “LA, SF” vs. “Los Angeles, San Francisco” inflated the number of unique counties—standardization resolved this.
- Partial Dates: 9,202 records like “/2024” defaulted to January 1st, which inflated January’s average AQI. These rows were excluded to avoid seasonal bias.
3. Outliers
- Daily Observation Count values of 1000 (vs. the typical 1–24) introduced skew. These anomalies were removed.
4. Duplicates
- 267 rows removed to prevent overcounting.
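The four cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names, the `MM/DD/YYYY` date format, the alias map, and the "above 24 observations" outlier rule are all assumptions chosen to match the description.

```python
import pandas as pd

# Hypothetical column names; adjust them to the actual dataset.
DATE, COUNTY, CBSA = "Date", "County", "CBSA Code"
AQI, OZONE = "Daily AQI Value", "Daily Max 8-hour Ozone Concentration"
OBS = "Daily Obs Count"

# Assumed alias map; the report only names "LA" and "SF" as examples.
COUNTY_ALIASES = {"LA": "Los Angeles", "SF": "San Francisco"}


def clean_ozone(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Inconsistent formats: standardize county names.
    out[COUNTY] = out[COUNTY].str.strip().replace(COUNTY_ALIASES)

    # Partial dates such as "/2024" fail strict parsing and become NaT.
    out[DATE] = pd.to_datetime(out[DATE], format="%m/%d/%Y", errors="coerce")

    # Missing CBSA codes (non-metro areas) get the placeholder 99999.
    out[CBSA] = out[CBSA].fillna(99999)

    # Impute AQI from ozone (and ozone from AQI) using rows where both are known.
    known = out.dropna(subset=[AQI, OZONE])
    ozone_to_aqi = known.groupby(OZONE)[AQI].first()
    aqi_to_ozone = known.groupby(AQI)[OZONE].first()
    out[AQI] = out[AQI].fillna(out[OZONE].map(ozone_to_aqi))
    out[OZONE] = out[OZONE].fillna(out[AQI].map(aqi_to_ozone))

    # Row filters: partial dates, rows with neither AQI nor ozone,
    # observation-count outliers (assumed: anything above 24), duplicates.
    out = out[out[DATE].notna()]
    out = out.dropna(subset=[AQI, OZONE], how="all")
    out = out[~(out[OBS] > 24)]
    return out.drop_duplicates().reset_index(drop=True)
```

Calling `clean_ozone(raw_df)` on a combined AirNow and AQS frame returns a cleaned copy and leaves the input untouched. Missing method codes are deliberately not filled here, since the report treats them as a known limitation rather than something to impute.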
🔍 How Ozone Shapes the Air We Breathe
Unveiling the Link Between Daily AQI and Ozone Levels
Ever wondered how clean — or polluted — the air really is? The Air Quality Index (AQI) offers a simple answer.
Table-1: The AQI is divided into six categories, each associated with a specific level of health concern.
Value | AQI Status | Description |
---|---|---|
0 to 50 | Good | Air quality is satisfactory, and air pollution poses little or no risk. |
51 to 100 | Moderate | Air quality is acceptable. However, there may be a risk for some people, particularly those who are unusually sensitive to air pollution. |
101 to 150 | Unhealthy for Sensitive Groups | Members of sensitive groups may experience health effects. The general public is less likely to be affected. |
151 to 200 | Unhealthy | Some members of the general public may experience health effects |
201 to 300 | Very Unhealthy | Health alert: The risk of health effects is increased for everyone. |
301 and higher | Hazardous | Health warning of emergency conditions: everyone is more likely to be affected. |
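For analysis, the six bands can be attached to each record as a categorical label. The sketch below uses `pandas.cut`; the function and constant names are our own and purely illustrative.

```python
import pandas as pd

# Upper edges of the AQI bands; the last band is open-ended.
AQI_BINS = [0, 50, 100, 150, 200, 300, float("inf")]
AQI_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_category(aqi: pd.Series) -> pd.Series:
    """Map integer AQI values to their health-concern category."""
    # Intervals are right-closed, so 50 is "Good" and 51 is "Moderate";
    # include_lowest=True keeps an AQI of 0 inside the first band.
    return pd.cut(aqi, bins=AQI_BINS, labels=AQI_LABELS, include_lowest=True)
```

For example, `aqi_category(pd.Series([42, 135]))` labels the first value "Good" and the second "Unhealthy for Sensitive Groups".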
But behind that number lies a key driver: Ozone Concentration.
The following scatter plot reveals a clear, strong correlation between the daily AQI value and the maximum 8-hour ozone concentration (in parts per million), highlighting how this pollutant plays a pivotal role in determining air quality levels across the EPA's defined health bands.
- The relationship appears mostly linear, with subtle curvature at higher maximum 8-hour ozone concentrations.
- The data transitions smoothly across AQI bands, validating the reliability of ozone concentration as a basis for AQI categorization.
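The near-linear shape with slight bends follows from how the AQI is defined: it is a piecewise-linear interpolation of concentration between band breakpoints, so the slope changes at each band edge. The sketch below assumes the EPA's published 8-hour ozone breakpoints (expressed here in parts per billion to avoid floating-point comparisons) and half-up rounding; both should be checked against the current EPA technical assistance document before reuse.

```python
import math

# (C_lo, C_hi, I_lo, I_hi): 8-hour ozone in ppb and the matching AQI range.
# Assumed from the EPA AQI breakpoint table; verify before relying on them.
OZONE_8HR_BREAKPOINTS = [
    (0, 54, 0, 50),
    (55, 70, 51, 100),
    (71, 85, 101, 150),
    (86, 105, 151, 200),
    (106, 200, 201, 300),
]


def ozone_to_aqi(ppm: float):
    """Approximate AQI for a maximum 8-hour ozone concentration in ppm."""
    # Truncate to three decimal places (whole ppb); the small epsilon
    # guards against values like 0.055 * 1000 landing just below 55.
    ppb = math.floor(ppm * 1000 + 1e-9)
    for c_lo, c_hi, i_lo, i_hi in OZONE_8HR_BREAKPOINTS:
        if c_lo <= ppb <= c_hi:
            value = (i_hi - i_lo) / (c_hi - c_lo) * (ppb - c_lo) + i_lo
            return math.floor(value + 0.5)  # round half up (assumption)
    # Above 0.200 ppm the 8-hour scale is not defined; the EPA uses
    # 1-hour ozone for AQI values of 301 and higher.
    return None
```

Within a single band the mapping is a straight line, which is why the scatter plot looks linear over short ranges and shows a change of slope where one band gives way to the next.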
We use the AQI value for the rest of the report because it is easier to interpret.
🔍 Data Validation: Potential Bias in Methodology and Sampling
Source Validation: Potential Bias from Single-Method Sensors
Is your air quality data fair, or just flawed?
When analyzing data from different sources, the first step is to compare the measurement methods used to collect ozone concentrations. This ensures the readings are consistent, reliable, and not biased by variations in instrumentation, calibration, or data reporting standards. Validating these differences is critical before drawing any conclusions or merging the datasets for further analysis.
Table-2: Ozone Concentration Measurement Methods
Code | Method Description | Type | Notes |
---|---|---|---|
047 | UV Photometric | FEM | Widely used, accurate |
087 | Non-regulatory / Unknown O₃ Method | Non-FEM | Often from AirNow; public estimates |
199 | Other / Undefined Method | N/A | Placeholder, unclassified |
053 | UV Absorption (Gas Phase Chemiluminescence) | FRM | EPA-approved for compliance use |
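One simple way to check for method-related bias is to compare mean AQI per method code within each county and note which counties rely on a single method. The sketch below is illustrative only; the function name and default column names are assumptions, and a gap between methods may also reflect differences between monitoring sites rather than the instruments themselves.

```python
import pandas as pd


def method_bias_by_county(
    df: pd.DataFrame,
    county: str = "County",
    method: str = "Method Code",
    aqi: str = "Daily AQI Value",
) -> pd.DataFrame:
    """Mean AQI per (county, method) and each method's gap from the county average."""
    summary = (
        df.dropna(subset=[method])  # AirNow rows without a method code are skipped
        .groupby([county, method])[aqi]
        .agg(mean_aqi="mean", n="count")
        .reset_index()
    )
    # Average of the per-method means, so a heavily sampled method does not dominate.
    county_mean = summary.groupby(county)["mean_aqi"].transform("mean")
    summary["gap_vs_county_mean"] = summary["mean_aqi"] - county_mean
    # Counties with a single method cannot be cross-checked internally.
    summary["n_methods_in_county"] = summary.groupby(county)[method].transform("nunique")
    return summary
```

Rows with a consistently positive `gap_vs_county_mean` for one method code, across many counties, would support the kind of upward bias described for code 053; counties where `n_methods_in_county` is 1 are the ones most exposed to it.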