Ozone Watch: Cleaning and Analysing California's Air Quality Data
Competition: DataCamp – California Air Quality Monitoring
Author: Najib Yusuf Ubandiya
Date: July 2025
Executive Summary
This notebook presents an analytical overview of ozone pollution across California using state-level air monitoring data. After cleaning and preprocessing the raw dataset, we conducted statistical and geospatial analyses to identify when and where ozone concentrations exceed safe levels.
Key findings include:
- Seasonal Variation: Ozone levels during summer are on average 36.6% higher than in winter, indicating heat and sunlight are key contributing factors.
- Weekday Activity Impact: Weekday readings average 0.9% higher than weekend readings, a modest difference consistent with influence from traffic and industrial activity.
- EPA Compliance: No monitoring sites currently exceed the EPA ozone threshold of 0.070 ppm, although one site approaches high-risk levels.
- Regional Trends: Hotspots are concentrated in San Bernardino, Riverside, and Tulare counties, all of which show average ozone levels above 0.052 ppm.
These patterns point toward the need for localised interventions, especially during high-risk seasons. The insights generated support data-driven environmental policies aimed at reducing health risks and ensuring regulatory compliance.
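For reference, the seasonal and weekday comparisons above can be reproduced with a short, self-contained pandas sketch like the one below. It defines summer as June–August and winter as December–February purely for illustration, so the exact percentages depend on the definitions used later in the notebook.
import pandas as pd

# Sketch: reproduce the summer-vs-winter and weekday-vs-weekend comparisons.
# Column names follow the raw file used in this notebook; the season and
# weekday definitions here are illustrative assumptions.
df = pd.read_csv('data/ozone.csv')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
conc = df['Daily Max 8-hour Ozone Concentration']

summer = conc[df['Date'].dt.month.isin([6, 7, 8])].mean()
winter = conc[df['Date'].dt.month.isin([12, 1, 2])].mean()
print(f"Summer vs winter: {(summer - winter) / winter:.1%} higher")

weekday = conc[df['Date'].dt.dayofweek < 5].mean()
weekend = conc[df['Date'].dt.dayofweek >= 5].mean()
print(f"Weekday vs weekend: {(weekday - weekend) / weekend:.1%} higher")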
Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
Load the dataset
How we're doing it:
We start with pandas for data manipulation and load several visualization libraries (matplotlib, seaborn, plotly) for the plots that follow. The initial .head() and .shape calls give us a quick overview of the data's structure.
ozone = pd.read_csv('data/ozone.csv')
print("Dataset Shape:", ozone.shape)
print("\nFirst 5 rows:")
ozone.head()
1. Examine data types and missing values
print("Data Types and Missing Values:")
print()
ozone.info()
print("\nMissing Values Summary:")
missing_summary = ozone.isnull().sum()
missing_summary[missing_summary > 0]
🧾 Initial Data Audit: Data Types and Missing Values
We begin with a structural overview of the dataset using the .info() method to inspect data types and identify any missing values. This is essential for guiding our cleaning strategy and understanding how complete and usable each column is.
The dataset contains 54,759 records and 17 columns. Key observations from the output:
- `Date` is currently stored as an `object` type, which suggests inconsistent formatting and will need to be parsed into a proper `datetime` format.
- Both `Daily Max 8-hour Ozone Concentration` and `Daily AQI Value` have 2,738 missing values each. Since these are central to the analysis, their absence must be addressed.
- `Method Code`, `CBSA Code`, and `CBSA Name` also contain missing entries. These may affect regional and methodological breakdowns.
- Other fields like geolocation, site info, and observation counts appear to be complete.
The next step will involve:
- Cleaning the `Date` field,
- Evaluating whether to impute or remove records with missing ozone measurements,
- And understanding the impact of missing metadata (`CBSA`, `Method Code`) on regional analyses; a sketch of these steps follows below.
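As a sketch of what this cleaning could look like (dropping rather than imputing rows with missing ozone readings is an illustrative assumption here, not a settled decision):
# Sketch: parse dates and handle missing values (illustrative choices, on a copy)
oz = ozone.copy()
oz['Date'] = pd.to_datetime(oz['Date'], errors='coerce')  # unparseable dates become NaT

# Drop rows missing the core measurements; imputation is a possible alternative
oz_clean = oz.dropna(subset=['Daily Max 8-hour Ozone Concentration', 'Daily AQI Value'])

# Flag rows missing regional metadata so CBSA-level breakdowns can exclude them
oz_clean = oz_clean.assign(has_cbsa=oz_clean['CBSA Name'].notna())
print(f"Rows retained after cleaning: {len(oz_clean):,} of {len(oz):,}")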
2. Data Quality Assessment and Cleaning
What we're doing:
Systematically identifying and addressing data quality issues including missing values, inconsistent date formats, outliers, and duplicate records.
Examine the Date column for inconsistencies
print("Unique Date formats sample:")
print(ozone['Date'].head(20).tolist())
Check for obvious data quality issues
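The kinds of quick checks this heading refers to might look like the following sketch; the duplicate, range, and IQR-based outlier checks are illustrative rather than the notebook's definitive procedure.
# Sketch: quick duplicate, range, and outlier checks (illustrative)
print("Duplicate rows:", ozone.duplicated().sum())

conc = ozone['Daily Max 8-hour Ozone Concentration']
print("Negative concentrations:", (conc < 0).sum())
print("Readings above the 0.070 ppm EPA threshold:", (conc > 0.070).sum())

# Flag extreme readings with a simple 1.5 * IQR rule
q1, q3 = conc.quantile([0.25, 0.75])
iqr = q3 - q1
print("IQR outliers:", ((conc < q1 - 1.5 * iqr) | (conc > q3 + 1.5 * iqr)).sum())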