Skip to content
0

Data Cleaning and Insights of Ozone Concentration of Californaia for 2024

💾 The data

The data is a modified dataset from the U.S. Environmental Protection Agency (EPA).

Ozone contains the daily air quality summary statistics by monitor for the state of California for 2024. Each row contains the date and the air quality metrics per collection method and site
  • "Date" - the calendar date with which the air quality values are associated
  • "Source" - the data source: EPA's Air Quality System (AQS), or Airnow reports
  • "Site ID" - the id for the air monitoring site
  • "POC" - the id number for the monitor
  • "Daily Max 8-hour Ozone Concentration" - the highest 8-hour value of the day for ozone concentration
  • "Units" - parts per million by volume (ppm)
  • "Daily AQI Value" - the highest air quality index value for the day, telling how clean or polluted the air is (a value of 50 represents good air quality, while a value above 300 is hazardous)
  • "Local Site Name" - name of the monitoring site
  • "Daily Obs Count" - number of observations reported in that day
  • "Percent Complete" - indicates whether all expected samples were collected
  • "Method Code" - identifier for the collection method
  • "CBSA Code" - identifier for the core base statistical area (CBSA)
  • "CBSA Name" - name of the core base statistical area
  • "State FIPS Code" - identifier for the state
  • "State" - name of the state
  • "County FIPS Code" - identifer for the county
  • "County" - name of the county
  • "Site Latitude" - latitude coordinates of the site
  • "Site Longitude" - longitude coordinates of the side

Problem Statements

  • The EPA’s ozone dataset for California contains inconsistencies such as missing values and outliers.
  • This project seeks to clean and validate the data before analyzing spatial and temporal ozone concentration patterns to identify areas needing urgent attention.

📊 Summary of Ozone Data Insights

🏙️ Regional Analysis

Among the top 10 sites with the highest ozone concentrations:

  • The place with the highest ozone pollution was Sequoia & Kings Canyon – Lower Kaweah, which is in the Visalia-Porterville area.

  • 7 sites were located in the Riverside–San Bernardino–Ontario metropolitan area, indicating a regional pollution hotspot.

  • 2 sites were in the Los Angeles–Long Beach–Anaheim area.

  • This suggests that Southern California, especially the Inland Empire, experienced some of the most consistently elevated ozone levels in 2024.

  • This may be because of: High temperatures in summer,Lots of cars and industry, Mountain ranges that trap air pollution.

🌞 Monthly Trends
  • Ozone levels were highest June and September , with a peak in August.

  • Many locations stayed above the safe limit (0.07 ppm) until October.

  • This happens because hot, sunny weather helps create more ozone in the air due to photochemical reaction.

📆 Weekly Patterns
  • There was no big difference between weekdays and weekends.

  • This means ozone pollution isn’t caused directly by daily traffic, but by weather and sunlight.

🌞 Seasonal Trends for Top 10 Constant High Ozone Sites
  • Based on seasonal trends in 2024, nearly all of the top monitoring sites—except Joshua Tree National Park—showed ozone concentrations above 0.07 ppm from July to the end of September, with the highest levels in August.

  • San Bernardino, Redlands, and Crestline had the highest ozone levels, each going over 0.09 ppm.

  • They were followed by Santa Clarita, Rubidoux, and Glendora, with concentrations around 0.08 ppm.

  • These levels can be harmful to human health, particularly for people with asthma or other respiratory issues.

📊 Method Code Comparison
  • Method Code 53 shows a notably higher median ozone value of 0.62 with no outliers, unlike the other three methods which have median values around 0.04 and display several outliers.
  • This discrepancy may not be entirely due to the locations where these methods were applied, but may also reflect differences in the measurement techniques themselves.
  • To confirm this, further analysis should be conducted by applying multiple methods at the same sites to isolate the effect of the method from the site characteristics.
⚠️ Health and Policy
  • Even if the average ozone level looks safe according to EPA policy, short-term spikes in summer and air pollution due to other pollutants in the air can still be dangerous to health.

  • Even though Ozone is not directly affected by human activities, ozone formation is due to the reaction of sunlight with other pollutants like NOₓ and VOCs released from cars and factories.

  • Therefore, to reduce ozone pollution, car emissions and factory pollution are needed to cut down in these areas.

Other Insights
  • The correlation between ozone concentration and AQI level is high, with a coefficient of 0.897, indicating a strong positive relationship.

  • However, AQI is influenced by multiple air pollutants, not solely ozone.

  • Consequently, even when ozone concentrations are similar, variations in AQI values suggest that other pollutants may be contributing to overall air quality differences.

  • According to the map, the whole California State experienced poor air quality due to ozone and other pollutents, which shows the urgent actions for the whole state are needed to control the pollutions.

#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt

ozone = pd.read_csv('data/ozone.csv')
ozone.head(6)

1. Data Understanding & Data Validation

ozone.info()
#drop columns
ozone.drop(['Source','Units'], axis = 1,inplace=True)

#rename columns
ozone.rename(columns={
    'Date': 'date',
    'Site ID': 'site_id',
    'POC': 'poc',
    'Daily Max 8-hour Ozone Concentration': 'max_ozone_8hr',
    'Daily AQI Value': 'aqi',
    'Local Site Name': 'site_name',
    'Daily Obs Count': 'obs_count',
    'Percent Complete': 'pct_complete',
    'Method Code': 'method_code',
    'CBSA Code': 'cbsa_code',
    'CBSA Name': 'cbsa_name',
    'County FIPS Code': 'county_fips',
    'County': 'county',
    'Site Latitude': 'lat',
    'Site Longitude': 'lon'
}, inplace=True)


num_cols = ['max_ozone_8hr', 'aqi', 'obs_count', 'pct_complete']
cat_cols = [col for col in ozone.columns if col not in num_cols ]
ozone[num_cols].describe()
#Check null values of each columns
pd.DataFrame({
    'nunique': ozone.nunique(),
    'null_count': ozone.isnull().sum(),
    'null_percent': (ozone.isnull().mean() * 100).round(2)
})
# Check uniques
ozone['aqi'].unique()
ozone['pct_complete'].unique()
ozone['method_code'].unique()
ozone['cbsa_code'].unique()
ozone['obs_count'].unique()
#Changing data types
ozone[['pct_complete', 'method_code', 'cbsa_code']] = (
    ozone[['pct_complete', 'method_code', 'cbsa_code']]
    .astype('Int64')
)
#handling date colummns
ozone['date'] = (
    ozone['date']
    .replace(r'^/?\d{4}$', pd.NA, regex=True)
    .str.strip()
)

month_map = {
    'January': '01/',
    'February': '02/',
    'March': '03/',
    'April': '04/',
    'May': '05/',
    'June': '06/',
    'July': '07/',
    'August': '08/',
    'September': '09/',
    'October': '10/',
    'November': '11/',
    'December': '12/'
}

for month_name, month_num in month_map.items():
    ozone['date'] = ozone['date'].str.replace(month_name, month_num, regex=False)

# Remove spaces between month and day
ozone['date'] = ozone['date'].str.replace(r'/\s*(\d+)', r'/\1', regex=True)

# Now convert to datetime
ozone['date'] = pd.to_datetime(ozone['date'], errors='coerce')
# Function to fill missing dates by group
def fill_date(group):
    dates = group['date'].copy()

    # Iterate backwards through the dates
    for i in range(len(dates) - 1, -1, -1):
        if pd.isna(dates.iloc[i]):
            filled = False

            # Try to fill using next date - 1 day
            for j in range(i + 1, len(dates)):
                if pd.notna(dates.iloc[j]):
                    dates.iloc[i] = dates.iloc[j] - pd.Timedelta(days=1)
                    filled = True
                    break

            # If not filled and it's the last row, use previous date + 1 day
            if not filled and i == len(dates) - 1:
                for j in range(i - 1, -1, -1):
                    if pd.notna(dates.iloc[j]):
                        dates.iloc[i] = dates.iloc[j] + pd.Timedelta(days=1)
                        break

    group['date'] = dates
    return group

# Apply the filling logic to each group
ozone = ozone.groupby('site_name', group_keys=False).apply(fill_date)
# Fill missing ozone concentration with group mean
ozone['max_ozone_8hr'] = (
    ozone.groupby('site_name')['max_ozone_8hr']
    .transform(lambda x: x.fillna(x.mean().round(3)))
)

# Fill missing AQI value with group median
ozone['aqi'] = (
    ozone.groupby('site_name')['aqi']
    .transform(lambda x: x.fillna(x.median()))
)