
California Ozone Analysis 2024

Project Summary

In this project we cleaned and analyzed a dataset containing information about ozone levels recorded in California in 2024.

In our analysis we first found that the daily max 8-hour ozone concentration varies significantly over time and across regions. Statewide, the average daily max 8-hour ozone concentration is lowest in December and January, increases from January to July, and then decreases from July to December. We also found a significant amount of variation from day to day. Across regions, the yearly average of the daily max 8-hour ozone concentration is generally lower in coastal areas than in inland areas and generally lower in the north than in the south. When we looked at variation in daily max 8-hour ozone concentrations both over time and across regions, we found that (a) coastal areas tend to have relatively low ozone levels throughout the year, (b) most areas record low ozone levels in the winter, (c) ozone levels begin to increase in the spring, starting with the southern inland part of the state, (d) ozone levels continue to increase through July, when both northern and southern inland areas are recording relatively high ozone levels, (e) ozone levels start to decrease in the late summer, starting with northern inland areas and followed by southern inland areas in the fall and early winter, and (f) all areas revert to recording relatively low ozone levels by December.

We also looked into whether any areas of the state consistently showed high ozone concentrations. To do this we looked for counties where the average daily max 8-hour ozone concentration across all recording sites in that county was above 0.070 ppm on a large number of days (0.070 ppm is the national air quality standard; ozone levels above this threshold are considered potentially harmful). We found that San Bernardino, Tulare, Kern, and Riverside counties all have a high number of days with an average daily max 8-hour ozone concentration above 0.070 ppm (San Bernardino: 84 days, Tulare: 67 days, Kern: 44 days, Riverside: 37 days). All of these counties are located next to each other in the southern inland part of the state, so we concluded that southern inland California is a region with consistently high ozone concentrations.
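The county-level count described above can be sketched roughly as follows. This is a minimal illustration on invented toy data, not the notebook's actual code: the `date` column, the county labels `A` and `B`, and every concentration value are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two counties, two sites each, three days.
toy = pd.DataFrame({
    "County": ["A"] * 6 + ["B"] * 6,
    "date": pd.to_datetime(
        ["2024-07-01", "2024-07-01", "2024-07-02",
         "2024-07-02", "2024-07-03", "2024-07-03"] * 2
    ),
    "Daily Max 8-hour Ozone Concentration": [
        0.080, 0.072, 0.065, 0.069, 0.075, 0.071,  # county A
        0.050, 0.060, 0.074, 0.060, 0.040, 0.045,  # county B
    ],
})

THRESHOLD_PPM = 0.070  # the standard cited in the text

# Average across all sites in a county for each day...
county_daily_mean = toy.groupby(["County", "date"])[
    "Daily Max 8-hour Ozone Concentration"
].mean()

# ...then count the days on which that county average exceeds the threshold.
days_above = (
    (county_daily_mean > THRESHOLD_PPM)
    .groupby(level="County")
    .sum()
    .sort_values(ascending=False)
)
print(days_above)
```

Averaging across sites before comparing to the threshold means a single high-reading site does not by itself count as an exceedance day for the county.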

We also briefly considered whether different collection methods reported different ozone levels. Ozone levels did vary by collection method, but the methods were not used equally often, nor were they used evenly across the state. We therefore concluded that the differences in ozone levels between collection methods were more likely a result of how often each method was used and where it tended to be used than of any difference in the accuracy of the methods themselves.
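A check of this kind might look like the sketch below, which compares how often each method appears and where it is used alongside the per-method mean. The data is invented for illustration; the method codes, the county labels, and the concentration values are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: method 47 is used mostly inland, method 87 only on the coast.
toy = pd.DataFrame({
    "Method Code": [47, 47, 47, 87, 87, 199],
    "County": ["inland_county", "inland_county", "inland_county",
               "coastal_county", "coastal_county", "inland_county"],
    "Daily Max 8-hour Ozone Concentration": [0.060, 0.070, 0.065, 0.030, 0.035, 0.050],
})

# How often each method was used.
method_counts = toy["Method Code"].value_counts()

# Where each method was used (rows: method, columns: county).
method_by_county = pd.crosstab(toy["Method Code"], toy["County"])

# Mean ozone level per method, to read alongside the two tables above.
method_means = toy.groupby("Method Code")[
    "Daily Max 8-hour Ozone Concentration"
].mean()

print(method_counts, method_by_county, method_means, sep="\n\n")
```

If a method with a low mean turns out to be used only in low-ozone areas, the difference in means says little about the method itself.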

Finally we attempted to discover if ozone levels are generally higher on weekdays or on weekends. We found that the overall state trend showed higher ozone levels on weekdays than on weekends in May, June, July, and August, and ozone levels in other months were about the same on weekdays and weekends. We speculated that this pattern could be caused by pollution from cars during weekday commutes interacting with the higher temperatures in the summer to produce higher ozone levels on weekdays in the summer. However, when we looked at the difference between ozone levels on weekdays and weekends in specific areas the pattern was less clear - not all areas show higher ozone concentrations on weekdays in the summer. We speculated that one difference between areas that follow the overall state trend and those that don't is their proximity to the coast. Inland areas seemed to follow the state trend of having higher ozone levels on weekdays in the summer, while coastal areas seemed to have more consistent ozone levels from weekday to weekend.
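The weekday and weekend comparison can be sketched as below. This is an illustration on invented data, assuming a hypothetical `date` column of parsed timestamps; the `day_type` column and the concentration values are also made up for the example.

```python
import pandas as pd

# Hypothetical toy data: two full weeks in July 2024 (July 1, 2024 is a Monday).
dates = pd.date_range("2024-07-01", "2024-07-14", freq="D")
toy = pd.DataFrame({
    "date": dates,
    # Invented values: slightly higher on weekdays than on weekends.
    "Daily Max 8-hour Ozone Concentration": [
        0.06 if d.dayofweek < 5 else 0.05 for d in dates
    ],
})

# dayofweek is 0 for Monday through 6 for Sunday.
toy["day_type"] = toy["date"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)
toy["month"] = toy["date"].dt.month

# Mean ozone per month for weekdays and weekends, one column each.
summary = (
    toy.groupby(["month", "day_type"])["Daily Max 8-hour Ozone Concentration"]
    .mean()
    .unstack("day_type")
)

# Positive values mean higher ozone on weekdays in that month.
weekday_minus_weekend = summary["weekday"] - summary["weekend"]
print(weekday_minus_weekend)
```

Repeating the same grouping with a region or county column added to the `groupby` keys would give the per-area comparison discussed above.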

Based on the results of our analysis we made the following recommendations for groups sensitive to bad air quality (such as children, older adults, people with asthma, etc.):

  1. Sensitive groups should be cautious when planning outdoor activities in summer months, especially if they are farther inland and/or farther south.
  2. Sensitive groups that are located in San Bernardino, Tulare, Kern, and Riverside counties and surrounding areas should become familiar with checking ozone pollution levels daily during the summer in order to identify the risks of spending time outdoors.

We also made the following recommendations for environmental policy makers:

  1. Policy efforts to reduce ozone pollution should be targeted towards southern inland areas of the state, particularly in summer months.
  2. Air quality alerts should be highly publicized, especially in the summer. Particular effort should be made to inform residents of San Bernardino, Tulare, Kern, and Riverside counties and surrounding areas that they are in an area that consistently reaches potentially harmful ozone concentration levels.
  3. Policies and incentives such as electric vehicle credits and other clean fuel initiatives should be put into effect and/or expanded in order to reduce automobile pollution during weekday work commutes. These policies should also be targeted toward inland and southern areas of the state.
  4. Ozone pollution monitoring should be improved, and efforts to inform the public of the causes and effects of ozone pollution should be expanded.

The data

The data is a modified dataset from the U.S. Environmental Protection Agency (EPA).

The ozone dataset contains the daily air quality summary statistics by monitor for the state of California for 2024. Each row contains the date and the air quality metrics per collection method and site.

  • "Date" - the calendar date with which the air quality values are associated
  • "Source" - the data source: EPA's Air Quality System (AQS), or AirNow reports
  • "Site ID" - the id for the air monitoring site
  • "POC" - the id number for the monitor
  • "Daily Max 8-hour Ozone Concentration" - the highest 8-hour value of the day for ozone concentration
  • "Units" - parts per million by volume (ppm)
  • "Daily AQI Value" - the highest air quality index value for the day, telling how clean or polluted the air is (a value of 50 or below represents good air quality, while a value above 300 is hazardous)
  • "Local Site Name" - name of the monitoring site
  • "Daily Obs Count" - number of observations reported in that day
  • "Percent Complete" - indicates whether all expected samples were collected
  • "Method Code" - identifier for the collection method
  • "CBSA Code" - identifier for the core-based statistical area (CBSA)
  • "CBSA Name" - name of the core-based statistical area
  • "State FIPS Code" - identifier for the state
  • "State" - name of the state
  • "County FIPS Code" - identifier for the county
  • "County" - name of the county
  • "Site Latitude" - latitude coordinates of the site
  • "Site Longitude" - longitude coordinates of the site
# make necessary imports and read the data
import pandas as pd
import geopandas as gpd
import numpy as np
import folium
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as mcolors
import seaborn as sns
sns.set_style('darkgrid')

ozone = pd.read_csv('data/ozone.csv')
ozone.head()

Data Validation and Cleaning

We begin by validating and cleaning our data set. We will check the values in each column and deal with any formatting issues, check for missing values and replace them when possible, and drop any duplicate rows. Let's start with some basic info about the data.

ozone.info()

Our data contains 54759 rows. The Date, Source, Units, Local Site Name, CBSA Name, and County columns are strings, and the remaining columns are either integers or floats. The Daily Max 8-hour Ozone Concentration, Daily AQI Value, Method Code, CBSA Code, and CBSA Name columns are the only columns that contain missing values.
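The missing-value summary read off `info()` can also be produced directly as a per-column count. The sketch below uses a small invented frame rather than the real `ozone` data; the values are hypothetical and only the column names come from the dataset description.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two of its three columns.
toy = pd.DataFrame({
    "Site ID": [1, 2, 3, 4],
    "Daily Max 8-hour Ozone Concentration": [0.04, np.nan, 0.05, np.nan],
    "Method Code": [47.0, 87.0, np.nan, 47.0],
})

# Number of missing values in every column.
missing_per_column = toy.isna().sum()

# Keep only the columns that actually have missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

Running the same two lines on `ozone` would list only the columns that need attention during cleaning.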

We now move on to validating and cleaning each column.

Date

print(ozone['Date'].head(10))
endswith = ozone['Date'].str.endswith('/2024').sum()
length = len(ozone['Date'])
print(f'Number of values that end with "/2024": {endswith} out of {length}')

Looking at the Date column above we see that it does not have a standard format and some of the values are missing the day and the month. We also notice that each value ends with the string "/2024". To deal with the formatting issue we will split the Date column into year, month, and day columns. This will give the data a standard format to represent the date of the recording and will also allow us to see which of the dates are missing day and month values (assigning the value '/2024' to the date 01/01/2024 would be a mistake - we want to preserve the missing values in this case). If the date is missing a day or month value then the day or month will be saved as a 0.

# Create year, month, and day columns. We know from above that all dates end with '/2024' but some are missing the month and/or the day.
ozone['year'] = 2024
parsed_dates = pd.to_datetime(ozone['Date'])  # parse once and reuse for month and day
ozone['month'] = parsed_dates.dt.month
ozone['day'] = parsed_dates.dt.day

# Set the day and month columns to 0 for rows missing a day or month.
missing_date = ozone['Date'] == '/2024'
ozone.loc[missing_date, ['day', 'month']] = 0

# Drop the Date column since it has dates with missing day/month values.
ozone.drop(columns=['Date'], inplace=True)

The date of each recording is now standardized. However, a significant number of records are missing both the day and the month.
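The splitting step can also be sketched with explicit format strings, which avoids relying on automatic format inference. This is an illustration on a toy series, not a description of the real data: the two format strings `%m/%d/%Y` and `%B %d/%Y` are assumptions about what the non-standard dates might look like, and the sample values are invented.

```python
import pandas as pd

# Hypothetical sample: a numeric date, a value missing day and month, and a month-name date.
raw = pd.Series(["01/02/2024", "/2024", "January 04/2024"])

# Try each assumed format; values that do not match become NaT instead of raising.
numeric = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
named = pd.to_datetime(raw, format="%B %d/%Y", errors="coerce")

# Use the numeric parse where it worked, otherwise fall back to the month-name parse.
parsed = numeric.fillna(named)

# Rows that matched neither format keep 0 for month and day, as in the text above.
month = parsed.dt.month.fillna(0).astype(int)
day = parsed.dt.day.fillna(0).astype(int)
print(pd.DataFrame({"raw": raw, "month": month, "day": day}))
```

Because unmatched values fall through to 0 rather than to a guessed date, any unexpected format shows up in the same bucket as the missing dates and can be inspected by hand.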

Source

missing = ozone['Source'].isna().sum()
unique = ozone['Source'].unique()
print(f'Source Values: {unique}')
print(f'Number of Missing Values: {missing}')

The Source column takes only two values and has no missing values.