Cleaning data and the skies
📖 Background
You are a data analyst at an environmental company. Your task is to evaluate ozone pollution across various regions.
You’ve obtained data from the U.S. Environmental Protection Agency (EPA), containing daily ozone measurements at monitoring stations across California. However, like many real-world datasets, it’s far from clean: there are missing values, inconsistent formats, potential duplicates, and outliers.
Before you can provide meaningful insights, you must clean and validate the data. Only then can you analyze it to uncover trends, identify high-risk regions, and assess where policy interventions are most urgently needed.
💾 The data
The data is a modified dataset from the U.S. Environmental Protection Agency (EPA).
Ozone contains daily air quality summary statistics by monitor for the state of California in 2024. Each row contains the date and the air quality metrics per collection method and site.
- "Date" - the calendar date with which the air quality values are associated
- "Source" - the data source: EPA's Air Quality System (AQS), or Airnow reports
- "Site ID" - the id for the air monitoring site
- "POC" - the id number for the monitor
- "Daily Max 8-hour Ozone Concentration" - the highest 8-hour value of the day for ozone concentration
- "Units" - parts per million by volume (ppm)
- "Daily AQI Value" - the highest air quality index value for the day, telling how clean or polluted the air is (a value of 50 represents good air quality, while a value above 300 is hazardous)
- "Local Site Name" - name of the monitoring site
- "Daily Obs Count" - number of observations reported in that day
- "Percent Complete" - indicates whether all expected samples were collected
- "Method Code" - identifier for the collection method
- "CBSA Code" - identifier for the core base statistical area (CBSA)
- "CBSA Name" - name of the core base statistical area
- "State FIPS Code" - identifier for the state
- "State" - name of the state
- "County FIPS Code" - identifer for the county
- "County" - name of the county
- "Site Latitude" - latitude coordinates of the site
- "Site Longitude" - longitude coordinates of the side
💪 Competition challenge
Create a report that covers the following:
- Your EDA and data cleaning process.
- How does the daily maximum 8-hour ozone concentration vary over time and across regions?
- Are there any areas that consistently show high ozone concentrations? Do different methods report different ozone levels?
- Consider whether urban activity (weekend vs. weekday) has any effect on ozone levels across different days.
- Bonus: plot a geospatial heatmap showing any high ozone concentrations (one possible approach is sketched just below).
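
For the bonus, a density map is one way to visualize hotspots. This is a minimal sketch rather than the final implementation: it assumes the cleaned dataframe ozone_no_dup built in the cleaning section, and the map center is simply a rough midpoint of California.

import plotly.express as px

# Density heatmap of ozone concentrations over a California base map
fig = px.density_mapbox(
    ozone_no_dup,
    lat='Site Latitude',
    lon='Site Longitude',
    z='Daily Max 8-hour Ozone Concentration',
    radius=15,                            # smoothing radius in pixels
    center={'lat': 36.7, 'lon': -119.4},  # approximate center of California
    zoom=5,
    mapbox_style='open-street-map',
)
fig.show()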
Executive Summary
After cleaning and validating the EPA’s ozone monitoring data for California, I conducted an in-depth analysis to identify patterns in air quality across regions and monitoring sites.
The results point to Fresno, Tulare, San Bernardino, and Los Angeles counties as persistent high-ozone areas. At the site level, Sequoia & Kings Canyon NPs – Lower Kaweah, Crestline, and Redlands recorded the highest average concentrations, indicating zones where environmental and public health interventions are most urgently needed.
I also observed slightly higher ozone levels on weekdays, suggesting a link to urban and commuter activity. While most sites used consistent measurement methods, a few variations were present and addressed. Missing values in fields like Method Code and CBSA were handled carefully to preserve data integrity.
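
The weekday observation comes from comparing average concentrations by day type; a minimal sketch of that comparison, assuming the cleaned dataframe ozone_no_dup produced in the cleaning section below:

import numpy as np

# Label each row as weekday or weekend (Monday=0 ... Sunday=6)
ozone_no_dup['Day Type'] = np.where(
    ozone_no_dup['Date'].dt.dayofweek < 5, 'Weekday', 'Weekend')

# Compare average ozone concentration between the two groups
print(ozone_no_dup
      .groupby('Day Type')['Daily Max 8-hour Ozone Concentration']
      .mean())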
Based on these findings, I recommend:
- Prioritizing high-exposure counties and sites for mitigation efforts.
- Implementing weekday-focused emission controls.
- Standardizing measurement methods across the network.
- Maintaining monitoring coverage in non-metropolitan counties where ozone is present.
- Strengthening data validation procedures to support long-term air quality monitoring.
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load the ozone dataset
ozone = pd.read_csv('data/ozone.csv')
ozone.head()

EDA
# Display the shape of the ozone dataframe
ozone.shape

# Summary of the ozone dataframe
ozone.info()
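
Since the cleaning work relies on knowing which fields have gaps (Method Code and the CBSA columns among them), a quick missing-value count is worth running at this point:

# Count missing values per column to see which fields need attention
missing = ozone.isna().sum()
print(missing[missing > 0].sort_values(ascending=False))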
# Correcting date dtype
ozone['Date'] = pd.to_datetime(ozone['Date'])

Finding duplicates
# Define the keys to identify duplicate rows
dup_keys = ['Date', 'Site ID', 'POC',
            'Daily Max 8-hour Ozone Concentration',
            'Daily AQI Value', 'Daily Obs Count', 'Percent Complete']
# Find duplicate rows in the ozone dataframe based on the defined keys
duplicates = ozone[ozone.duplicated(subset=dup_keys, keep=False)]
# Display the number of duplicate rows found
print(f"Duplicate rows found: {len(duplicates)}")Handling duplicates
# Sort by 'Method Code' so the lowest code comes first; pandas sorts missing
# codes last, so rows with a known method are preferred when deduplicating
ozone_sorted = ozone.sort_values(by='Method Code')
# Drop duplicate rows based on the defined keys, keeping the first occurrence
ozone_no_dup = ozone_sorted.drop_duplicates(subset=dup_keys, keep='first')
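
As a quick sanity check (a sketch), the following confirms that no duplicates remain under the same keys, restores the original row order, and reports how many rows were dropped:

# Verify the deduplication and report how many rows were removed
assert ozone_no_dup.duplicated(subset=dup_keys).sum() == 0
ozone_no_dup = ozone_no_dup.sort_index()
print(f"Rows removed: {len(ozone) - len(ozone_no_dup)}")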