
Cleaning data and the skies

📖 Background

You are a data analyst at an environmental company. Your task is to evaluate ozone pollution across various regions.

You’ve obtained data from the U.S. Environmental Protection Agency (EPA), containing daily ozone measurements at monitoring stations across California. However, like many real-world datasets, it’s far from clean: there are missing values, inconsistent formats, potential duplicates, and outliers.

Before you can provide meaningful insights, you must clean and validate the data. Only then can you analyze it to uncover trends, identify high-risk regions, and assess where policy interventions are most urgently needed.

💾 The data

The data is a modified dataset from the U.S. Environmental Protection Agency (EPA).

Ozone contains the daily air quality summary statistics, by monitor, for the state of California in 2024. Each row contains the date and the air quality metrics per collection method and site:
  • "Date" - the calendar date with which the air quality values are associated
  • "Source" - the data source: EPA's Air Quality System (AQS) or AirNow reports
  • "Site ID" - the id for the air monitoring site
  • "POC" - the id number for the monitor
  • "Daily Max 8-hour Ozone Concentration" - the highest 8-hour value of the day for ozone concentration
  • "Units" - parts per million by volume (ppm)
  • "Daily AQI Value" - the highest air quality index value for the day, telling how clean or polluted the air is (a value of 50 represents good air quality, while a value above 300 is hazardous)
  • "Local Site Name" - name of the monitoring site
  • "Daily Obs Count" - number of observations reported in that day
  • "Percent Complete" - indicates whether all expected samples were collected
  • "Method Code" - identifier for the collection method
  • "CBSA Code" - identifier for the core-based statistical area (CBSA)
  • "CBSA Name" - name of the core-based statistical area
  • "State FIPS Code" - identifier for the state
  • "State" - name of the state
  • "County FIPS Code" - identifier for the county
  • "County" - name of the county
  • "Site Latitude" - latitude coordinates of the site
  • "Site Longitude" - longitude coordinates of the site
import pandas as pd
import numpy as np
from datetime import date
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Load the EPA ozone dataset
ozone = pd.read_csv('data/ozone.csv')
ozone

Air Aware: Investigating Ozone Trends Through Cleaned and Visualized Data

Executive Summary of the Report

📊 1. Exploratory Data Analysis (EDA) and Data Cleaning

A comprehensive EDA and cleaning process was conducted to ensure the dataset was accurate and reliable. Key steps included:

  • Identifying and resolving duplicate entries and inconsistencies (e.g., "LA" vs. "Los Angeles").
  • Replacing malformed date strings and converting to datetime.
  • Interpolating missing numerical values (e.g., ozone concentration, AQI) per site using time-based or linear methods.
  • Replacing categorical columns with null or zero values with placeholder notes (e.g., "No Code").
  • Applying a threshold of ≥330 days of data per site to filter out incomplete records.

Proportion of sites with ≥ 330 days of data: 83.95% — acceptable in environmental studies to ensure robust seasonal coverage without overly restricting the sample size.
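The interpolation and completeness steps above can be sketched on a small synthetic frame. This is an illustrative outline, not the report's exact code: the column names mirror the ozone dataset, but the rows are invented, and the day threshold is scaled down from the 330 used on the full year.

```python
import pandas as pd
import numpy as np

# Tiny synthetic frame standing in for the ozone dataset
df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                            "2024-01-01", "2024-01-02", "2024-01-03"]),
    "Local Site Name": ["A", "A", "A", "B", "B", "B"],
    "Daily Max 8-hour Ozone Concentration": [0.030, np.nan, 0.034,
                                             0.050, 0.052, np.nan],
})

# Interpolate missing numeric values within each site's own series
df["Daily Max 8-hour Ozone Concentration"] = (
    df.groupby("Local Site Name")["Daily Max 8-hour Ozone Concentration"]
      .transform(lambda s: s.interpolate(method="linear", limit_direction="both"))
)

# Keep only sites with at least N days of data (N = 330 on the real dataset)
N = 3
day_counts = df.groupby("Local Site Name")["Date"].nunique()
keep = day_counts[day_counts >= N].index
cleaned = df[df["Local Site Name"].isin(keep)]
print(cleaned)
```

Interpolating per site (rather than across the whole column) avoids borrowing values from a different monitor's series, and `limit_direction="both"` also fills leading and trailing gaps.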


🌤️ 2. Temporal and Spatial Variation in Ozone Concentration

Analysis of the daily maximum 8-hour ozone concentration uncovered strong temporal and spatial patterns:

🕒 Temporal Trends:

  • Ozone levels rise during spring and peak in summer (July–August), forming a distinct seasonal curve.
  • This is consistent with known ozone chemistry: increased sunlight and temperature accelerate ozone formation in the atmosphere.
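The seasonal curve comes from a monthly aggregation of the daily maxima. A hedged sketch of that step, using a toy sinusoidal signal in place of the real measurements (the report runs the same groupby on the cleaned ozone frame):

```python
import pandas as pd
import numpy as np

# One synthetic year of daily values with a mid-summer peak
rng = pd.date_range("2024-01-01", "2024-12-31", freq="D")
doy = rng.dayofyear.to_numpy()
conc = 0.035 + 0.02 * np.sin(np.pi * (doy - 15) / 365).clip(min=0)
daily = pd.DataFrame({"Date": rng, "Ozone": conc})

# Mean daily-max ozone per calendar month
monthly = daily.groupby(daily["Date"].dt.month)["Ozone"].mean()
print(monthly.idxmax())  # month with the highest mean ozone
```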

🗺️ Spatial Distribution:

  • Certain inland areas like Lancaster – Fairgrounds consistently reported high concentrations, occasionally exceeding the EPA standard of 0.070 ppm.
  • In contrast, coastal and mountainous regions such as Santa Cruz and Carpinteria maintained low, stable readings.

Animated maps and geospatial graphs emphasize these temporal-spatial ozone surges and concentration clusters.


🧪 3. Method Code Differences in Ozone Reporting

To assess whether different measurement methods (Method Code) influenced reported ozone values:

  • Data grouped by Method Code showed high consistency across instruments.
  • Regardless of the method, ozone trends were generally aligned at any given site and time.

🔍 While minor variability may exist, the results support the instrumental reliability and standardization of monitoring protocols.
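One way to quantify between-method agreement is the spread of reported values per site and day. A minimal sketch of that check, with invented rows and method codes (the report applies this grouping to the full cleaned frame):

```python
import pandas as pd

# Synthetic rows: two methods reporting at two sites on the same day
df = pd.DataFrame({
    "Date": ["2024-06-01"] * 4,
    "Site ID": [60370002, 60370002, 60371103, 60371103],
    "Method Code": [47, 87, 47, 87],
    "Daily Max 8-hour Ozone Concentration": [0.061, 0.062, 0.044, 0.045],
})

# Per (site, date), the max-min range across methods measures disagreement
spread = (
    df.groupby(["Site ID", "Date"])["Daily Max 8-hour Ozone Concentration"]
      .agg(lambda s: s.max() - s.min())
)
print(spread)
```

A small spread (here on the order of 0.001 ppm) is what "high consistency across instruments" looks like numerically.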


🏙️ 4. Urban Activity Impact (Weekday vs. Weekend)

Preliminary analysis was conducted to assess whether urban activity patterns, such as reduced traffic on weekends, influence ozone levels:

  • In several urban locations, a slight dip in average ozone levels was observed on weekends.
  • However, variability and meteorological factors mean further modeling is needed to confirm causal effects.
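The weekday/weekend split itself is straightforward. A sketch on a synthetic summer series (the report applies the same split to per-site daily ozone):

```python
import pandas as pd
import numpy as np

# Synthetic daily series over one summer
rng = pd.date_range("2024-06-01", "2024-08-31", freq="D")
gen = np.random.default_rng(0)
df = pd.DataFrame({
    "Date": rng,
    "Ozone": 0.055 + gen.normal(0, 0.003, len(rng)),
})

# dayofweek: Monday = 0 ... Saturday = 5, Sunday = 6
df["is_weekend"] = df["Date"].dt.dayofweek >= 5
means = df.groupby("is_weekend")["Ozone"].mean()
print(means)
```

Comparing the two group means is only a first look; as noted above, confounding by meteorology means a difference here is suggestive, not causal.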

🗺️ 5. Geospatial Heatmap of Ozone Concentration

An animated scatter mapbox heatmap visualizes daily ozone levels across California. Features include:

  • Color-coded concentration levels using a custom yellow-to-red scale.
  • Basemap background for geographic context.
  • Date slider for temporal exploration.
  • Highlights temporal surges and spatial clustering of high ozone zones.

1. EDA and Data Cleaning Process

1.1 Handling Duplicate Values

To ensure data accuracy and prevent skewed analysis, it’s essential to identify and remove any duplicate records. Duplicate entries can arise from repeated data logging or merging datasets, and may lead to inflated ozone concentration readings or biased trend interpretations.

Abbreviation duplicates

Upon inspection of the "Local Site Name" and "County" columns, it was observed that the "County" column contains abbreviated entries for certain counties. Specifically, 'Los Angeles' appears as 'LA' and 'San Francisco' as 'SF'. These abbreviations should be standardized to their full names to maintain consistency and accuracy in county-level analysis.

ozone["Local Site Name"].unique()
ozone["County"].unique()
  • The abbreviations have been replaced by their full name counterparts:
ozone["County"] = ozone["County"].replace({
    "LA": "Los Angeles",
    "SF": "San Francisco"
})
# Verify the replacement: the reduced number of unique values confirms both abbreviations were merged.
ozone["County"].nunique()

Duplicate Entries Based on Date and Local Site Name

A review of the dataset revealed duplicate records that share the same "Date" and "Local Site Name". These duplicates may indicate repeated measurements or data entry errors. Further validation is necessary to determine whether to retain, aggregate, or remove these entries.

# Duplicates found in Oakland (scroll below)
ozone[ozone["Local Site Name"] == "Oakland"]
  • Duplicate entries begin appearing from index 54,259 onward. These records will be removed to ensure data accuracy and prevent distortion in analysis.
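The removal step can be sketched with `drop_duplicates` on the two identifying columns, keeping the first occurrence of each (Date, Local Site Name) pair. Synthetic rows for illustration; the report applies the same call to `ozone`:

```python
import pandas as pd

# Synthetic frame with one repeated (Date, Local Site Name) row
df = pd.DataFrame({
    "Date": ["2024-03-01", "2024-03-01", "2024-03-02"],
    "Local Site Name": ["Oakland", "Oakland", "Oakland"],
    "Daily Max 8-hour Ozone Concentration": [0.041, 0.041, 0.039],
})

deduped = df.drop_duplicates(subset=["Date", "Local Site Name"], keep="first")
print(len(deduped))  # → 2
```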