Executive summary

In this workbook I explore data containing daily maximum 8-hour ozone concentrations over time. The workbook follows a systematic approach: it first cleans the data, then explores it across regions and time, and summarises the results at the end. I found that ozone concentrations are higher during the summer months and that a few regions are at high risk at that time of year.

💾 Description of the data

The data is a modified dataset from the U.S. Environmental Protection Agency (EPA).

Ozone contains the daily air quality summary statistics by monitor for the state of California for 2024. Each row contains the date and the air quality metrics per collection method and site:
  • "Date" - the calendar date with which the air quality values are associated
  • "Source" - the data source: EPA's Air Quality System (AQS), or Airnow reports
  • "Site ID" - the id for the air monitoring site
  • "POC" - the id number for the monitor
  • "Daily Max 8-hour Ozone Concentration" - the highest 8-hour value of the day for ozone concentration
  • "Units" - parts per million by volume (ppm)
  • "Daily AQI Value" - the highest air quality index value for the day, telling how clean or polluted the air is (values up to 50 represent good air quality, while values above 300 are hazardous)
  • "Local Site Name" - name of the monitoring site
  • "Daily Obs Count" - number of observations reported in that day
  • "Percent Complete" - indicates whether all expected samples were collected
  • "Method Code" - identifier for the collection method
  • "CBSA Code" - identifier for the core-based statistical area (CBSA)
  • "CBSA Name" - name of the core-based statistical area
  • "State FIPS Code" - identifier for the state
  • "State" - name of the state
  • "County FIPS Code" - identifier for the county
  • "County" - name of the county
  • "Site Latitude" - latitude coordinates of the site
  • "Site Longitude" - longitude coordinates of the site
library(readr)
ozone <- read_csv('data/ozone.csv', show_col_types = FALSE)
head(ozone)

1) First load required packages

If any of them are not installed, install them first.


# List of required packages
required_packages <- c("janitor","data.table","lubridate","tigris", "ggthemes", "sf", "ggplot2", "dplyr", "patchwork","plotly","viridis","geojsonio","RColorBrewer")

# Install any missing packages
for (pkg in required_packages) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load the required libraries
library(janitor)
library(data.table)
library(lubridate)
library(tigris)
library(ggthemes)
library(sf)
library(ggplot2)
library(dplyr)
library(patchwork)
library(plotly)
library(viridis)
library(geojsonio)
library(RColorBrewer)

2) Understanding the data and its columns

ozone <- janitor::clean_names(ozone) #standardises the name of the columns so that it is easy to use later on
colnames(ozone)
colSums(is.na(ozone)) #gives total number of NAs present in each of the columns

Note that the columns "daily_max_8_hour_ozone_concentration", "daily_aqi_value", "method_code", "cbsa_code" and "cbsa_name" contain NAs.
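
Raw NA counts are easier to judge next to the share of rows they represent. The following is a minimal sketch in base R on a made-up toy table (the values, the county labels and the names `toy` and `na_summary` are assumptions for illustration, not part of the ozone data):

```r
# Toy data frame standing in for the ozone table (all values are made up)
toy <- data.frame(
  daily_aqi_value = c(31, NA, 45, 52),
  cbsa_code       = c(NA, NA, 100, 200),
  county          = c("County A", "County B", "County C", "County D")
)

# Number and percentage of missing values in each column
na_count <- colSums(is.na(toy))
na_share <- round(100 * colMeans(is.na(toy)), 1)

na_summary <- data.frame(
  column      = names(na_count),
  n_missing   = as.integer(na_count),
  pct_missing = as.numeric(na_share)
)

# Show only the columns that actually contain NAs
print(na_summary[na_summary$n_missing > 0, ])
```

Running the same two calls on `ozone` instead of `toy` would show how much data each NA-containing column stands to lose during cleaning.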

ozone <- data.table(ozone) #converting to data table because it has faster operations
str(ozone) #gives information about the columns, their content and their type
summary(ozone) #produces some statistics of the numeric quantities, if present

From this output we can see that the date column does not follow a single, consistent format!
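
One way to see how inconsistent a date column is: parse it with the expected format and count the non-missing entries that fail. The sketch below uses base R's `as.Date` instead of lubridate so that it runs without extra packages; the sample strings and the names `raw_dates`, `parsed` and `failed` are made up for illustration and are not taken from the ozone data.

```r
# Made-up date strings: two in month/day/year form, one in another form, one missing
raw_dates <- c("01/02/2024", "1/3/2024", "January 04/2024", NA)

# Parse with the expected month/day/year format; anything else becomes NA
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Entries that were present in the raw data but could not be parsed
failed <- !is.na(raw_dates) & is.na(parsed)

print(sum(failed))        # how many entries do not match the expected format
print(raw_dates[failed])  # which ones they are
```

Inspecting the failing entries before cleaning shows how many rows would be lost if unparsed dates were simply dropped.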

3) Cleaning the data table

I removed the rows containing NAs, unified the date column and dropped duplicate rows.

#Remove rows whose region (CBSA) is not known
ozone_no_na <- ozone[!is.na(cbsa_code)]
#Then create a new date column with a unified date format
#(dates that do not match the month/day/year order become NA; the warnings are suppressed)
ozone_no_na$date2 <- suppressWarnings(parse_date_time(ozone_no_na$date, orders = "m/d/Y", tz = "UTC"))
#Then remove every row that still contains an NA, including rows whose date could not be parsed
ozone_no_na <- ozone_no_na[complete.cases(ozone_no_na)]
#print(nrow(ozone_no_na))
#Finally clear all the duplicates
ozone_no_na <- unique(ozone_no_na)
#print(nrow(ozone_no_na))
colSums(is.na(ozone_no_na)) #to check the data
#count the distinct values at each of the three regional levels
length(unique(ozone_no_na$local_site_name)) #prints number of local sites
length(unique(ozone_no_na$cbsa_code)) #prints number of CBSA areas
length(unique(ozone_no_na$county)) #prints number of counties

So local sites are smaller units than counties, which in turn are smaller than CBSA regions. I will follow a top-down approach: first I study the ozone concentrations across CBSA regions, then I look for the counties with higher ozone concentrations, and from there I identify the local sites that are most exposed to ozone.
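
The top-down approach relies on the three levels being nested. A quick check, sketched here in base R on a made-up toy table (site names, county names and CBSA codes are hypothetical), is to count the distinct values at each level and confirm that every county maps to exactly one CBSA:

```r
# Toy table: 4 sites in 3 counties in 2 CBSA regions (all labels are made up)
toy <- data.frame(
  local_site_name = c("Site 1", "Site 2", "Site 3", "Site 4", "Site 4"),
  county          = c("County A", "County A", "County B", "County C", "County C"),
  cbsa_code       = c(100, 100, 100, 200, 200)
)

# Number of distinct values at each regional level
n_levels <- sapply(
  toy[c("local_site_name", "county", "cbsa_code")],
  function(x) length(unique(x))
)
print(n_levels)

# Number of distinct CBSA codes observed within each county
cbsa_per_county <- tapply(toy$cbsa_code, toy$county, function(x) length(unique(x)))

# TRUE when every county belongs to exactly one CBSA, i.e. the levels are nested
print(all(cbsa_per_county == 1))
```

If the final check were FALSE on the real data, aggregating by county and then by CBSA would count some rows under more than one region.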

4) Exploratory data analysis