
Breathing the Data: What the Air Is Telling Us In California 2024

I’m a data analyst at an environmental research firm, and my job isn’t just about numbers; it’s about people.

When I opened the 2024 ozone dataset from the U.S. Environmental Protection Agency, I wasn’t just looking at rows and columns. I was looking at the air children breathe on their way to school, the skyline hikers see in national parks, and the summer afternoons outdoor workers spend under the sun.

Like so many real-world environmental records, it was messy:

  • Missing values
  • Inconsistent formats
  • Duplicate entries
  • Suspicious outliers: one site even reported near-zero ozone on a smoggy August afternoon

I knew there was a story worth uncovering. Because in environmental science, bad data doesn’t just mislead, it endangers.

So I rolled up my sleeves. My mission was clear: clean the data, validate it, and turn it into insight about when and where ozone concentrations rise, which communities are most affected, and how human activity (like weekday commutes) might be making it worse.

And what I found wasn’t just a trend. It was a pattern of inequality, exposure, and seasonal crisis with solutions hidden in the data.

This is the story of what the air is really telling us, and what we must do about it.



Data Cleaning and Preparation

After importing the dataset, I began with an initial inspection. The data contained 54,759 records across 17 columns, representing ozone concentration readings from air monitoring sites across California during 2024.
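A first inspection like this can be sketched in pandas. The notebook's own code is hidden, so this is only an illustration: the helper name `summarize_frame` is hypothetical, and it assumes the dataset has already been loaded into a DataFrame.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> dict:
    """Return basic facts about a DataFrame: size, dtypes, and missing counts."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # Column types as strings, handy for spotting dates stored as text
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Number of missing values per column
        "missing_per_column": df.isna().sum().to_dict(),
    }
```

Calling `summarize_frame(df)` on the raw data should report the row and column counts quoted above, along with the fields that need attention.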

But before we can uncover patterns in ozone levels, we must first ensure the data is accurate and ready for analysis.



1. Fixing the Date

Nearly 17% of the records (9,202 entries) had corrupted Date values, such as "/2024". With incomplete dates, we couldn’t analyze seasonal patterns, weekday trends, or monthly summaries, which are the core of this report.

So, I cleaned the Date column with a structured approach:

  1. Standardized Formats: The dataset mixed formats such as "Jan 05/2024" and "01/10/2024", so I converted them all to a single date format.

  2. Removed Incomplete Dates: Rows with unrecognizable or incomplete dates (e.g., "/2024") were removed, as we could not determine the month or day when ozone was measured, which is critical for our analysis.
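The two steps above might look like this in pandas. This is a sketch rather than the notebook's hidden code: the function name `clean_dates` is hypothetical, and it assumes that numeric dates are month-first ("%m/%d/%Y") and that month abbreviations are in English.

```python
import pandas as pd

def clean_dates(df: pd.DataFrame, col: str = "Date") -> pd.DataFrame:
    """Parse the two known date layouts and drop rows that match neither."""
    raw = df[col].astype(str).str.strip()
    # Layout 1 (assumed month-first): "01/10/2024"
    numeric = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
    # Layout 2: "Jan 05/2024"
    named = pd.to_datetime(raw, format="%b %d/%Y", errors="coerce")
    out = df.copy()
    # Use whichever layout parsed; anything else (e.g. "/2024") stays NaT
    out[col] = numeric.fillna(named)
    return out.dropna(subset=[col])
```

Parsing each layout explicitly, instead of letting pandas guess, keeps an ambiguous value like "01/10/2024" from being read day-first in some rows and month-first in others.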

Recommendation

To ensure accurate and usable data in future collections, data entry systems must enforce complete and standardized date formats.

  • All records should include day, month, and year (e.g., YYYY-MM-DD).
  • If data is entered manually, implement form validation that blocks incomplete dates (e.g., "/2024").
  • If dates are collected from automated sensors or external systems, ensure the ETL (Extract-Transform-Load) pipeline includes a validation layer to flag or reject invalid entries.

This small change would have saved nearly 1 in every 6 rows from being lost in 2024.
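A validation layer of this kind can be very small. The sketch below assumes records arrive as dictionaries with a "Date" field and that YYYY-MM-DD is the required format; the helper names are hypothetical.

```python
import re
from datetime import datetime

# Exactly four digits, two digits, two digits
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def is_valid_iso_date(value) -> bool:
    """True only for a complete, real calendar date written as YYYY-MM-DD."""
    if not isinstance(value, str) or not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")  # rejects dates like 2024-02-30
    except ValueError:
        return False
    return True

def split_valid_invalid(records, field="Date"):
    """Separate records into (valid, rejected) by their date field."""
    valid, rejected = [], []
    for record in records:
        target = valid if is_valid_iso_date(record.get(field)) else rejected
        target.append(record)
    return valid, rejected
```

Rejected records can be logged or sent back for correction instead of silently entering the dataset.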



2. Removing Duplicates

I checked the dataset for duplicate records. 267 rows were exact copies of another row across all columns, and these were removed.
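Full-row deduplication is a one-liner in pandas. The sketch below wraps it in a hypothetical helper, `drop_full_duplicates`, that also reports how many rows were removed. It assumes the first occurrence of each duplicated row is the one to keep.

```python
import pandas as pd

def drop_full_duplicates(df: pd.DataFrame):
    """Remove rows identical to an earlier row in every column.

    Returns the deduplicated frame and the number of rows removed.
    """
    dup_mask = df.duplicated(keep="first")  # True for repeats, not first copies
    return df.loc[~dup_mask].copy(), int(dup_mask.sum())
```

Reporting the removed count makes it easy to confirm a figure like the 267 rows above.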



3. Handling Missing Values

Several key fields had missing values. Each was addressed with a purposeful strategy to preserve data integrity while enabling meaningful analysis.

  1. Daily AQI Value and Daily Max 8-hour Ozone Concentration
    • These are the core variables of this analysis.
    • Rows where both columns are missing were dropped.
    • Rows with missing values in Daily Max 8-hour Ozone Concentration were dropped, as reliable ozone data is essential for trend analysis and public health assessment.
    • Rows with missing values in Daily AQI Value were filled from Daily Max 8-hour Ozone Concentration.
      • This approach preserved data consistency and sample size without introducing artificial bias.
  2. Method Code
    • While exploring the dataset in Excel, I observed that the missing values in the Method Code column were all associated with records where the data source is "AirNow".
    • This suggests that AirNow uses a consistent, system-wide method that is not individually coded.
    • To maintain clarity and allow for method-based comparisons, I filled these missing values with 'AirNow' as a placeholder method identifier.
  3. CBSA Code/CBSA Name
    • These fields had low completeness and were not essential for answering the primary questions about time, region, and ozone levels.
    • They were excluded from analysis to focus on high-impact, reliable variables.
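The strategy above can be sketched as a single pandas function. Two parts are assumptions, because the notebook's code is hidden: the data-source column is assumed to be named "Source", and the AQI fill is illustrated by interpolating over rows where both values are known, which may not be the method actually used. The function name `handle_missing` is hypothetical.

```python
import numpy as np
import pandas as pd

OZONE = "Daily Max 8-hour Ozone Concentration"
AQI = "Daily AQI Value"

def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    # Dropping rows without ozone also covers rows missing both core values
    out = df.dropna(subset=[OZONE]).copy()

    # Placeholder method identifier for AirNow rows ("Source" is an assumed name)
    gap = out["Method Code"].isna() & (out["Source"] == "AirNow")
    out["Method Code"] = out["Method Code"].astype("object")
    out.loc[gap, "Method Code"] = "AirNow"

    # Illustrative AQI fill: interpolate over known (ozone, AQI) pairs.
    # groupby sorts its keys, so the x-values are increasing as np.interp needs.
    known = out.dropna(subset=[AQI]).groupby(OZONE)[AQI].median()
    missing = out[AQI].isna()
    if missing.any() and len(known) > 0:
        out.loc[missing, AQI] = np.interp(
            out.loc[missing, OZONE], known.index.to_numpy(), known.to_numpy()
        )

    # Low-completeness fields that the analysis does not use
    return out.drop(columns=["CBSA Code", "CBSA Name"], errors="ignore")
```

If the AQI values were instead computed from a published breakpoint table, only the interpolation step would need to change.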
