Ozone Watch: Cleaning and Analysing California's Air Quality Data
Competition: DataCamp – California Air Quality Monitoring
Author: Najib Yusuf Ubandiya
Date: July 2025
Executive Summary
This notebook presents an analytical overview of ozone pollution across California using state-level air monitoring data. After cleaning and preprocessing the raw dataset, we conducted statistical and geospatial analyses to identify when and where ozone concentrations exceed safe levels.
Key findings include:
- Seasonal Variation: Ozone levels during summer are on average 36.6% higher than in winter, indicating heat and sunlight are key contributing factors.
- Weekday Activity Impact: Weekday readings average 0.9% higher than weekend readings, a modest difference consistent with influence from traffic and industrial activity.
- EPA Compliance: No monitoring sites currently exceed the EPA ozone threshold of 0.070 ppm, although one site approaches high-risk levels.
- Regional Trends: Hotspots are concentrated in San Bernardino, Riverside, and Tulare counties, all of which show average ozone levels above 0.052 ppm.
These patterns point toward the need for localised interventions, especially during high-risk seasons. The insights generated support data-driven environmental policies aimed at reducing health risks and ensuring regulatory compliance.
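For reference, the seasonal and weekday comparisons above can be reproduced with a short, self-contained pandas sketch like the one below. It defines summer as June–August and winter as December–February purely for illustration, so the exact percentages depend on the definitions used later in the notebook.
import pandas as pd

# Sketch: reproduce the summer-vs-winter and weekday-vs-weekend comparisons.
# Column names follow the raw file used in this notebook; the season and
# weekday definitions here are illustrative assumptions.
df = pd.read_csv('data/ozone.csv')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
conc = df['Daily Max 8-hour Ozone Concentration']

summer = conc[df['Date'].dt.month.isin([6, 7, 8])].mean()
winter = conc[df['Date'].dt.month.isin([12, 1, 2])].mean()
print(f"Summer vs winter: {(summer - winter) / winter:.1%} higher")

weekday = conc[df['Date'].dt.dayofweek < 5].mean()
weekend = conc[df['Date'].dt.dayofweek >= 5].mean()
print(f"Weekday vs weekend: {(weekday - weekend) / weekend:.1%} higher")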
Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
Load the dataset
How we're doing it:
We start with pandas for data manipulation and load several visualization libraries (matplotlib, seaborn, plotly) for the plots that follow. The initial .head() and .shape calls give us a quick overview of the data's structure.
ozone = pd.read_csv('data/ozone.csv')
print("Dataset Shape:", ozone.shape)
print("\nFirst 5 rows:")
ozone.head()
1. Examine data types and missing values
print("Data Types and Missing Values:")
print()
ozone.info()
print("\nMissing Values Summary:")
missing_summary = ozone.isnull().sum()
missing_summary[missing_summary > 0]
🧾 Initial Data Audit: Data Types and Missing Values
We begin with a structural overview of the dataset using the .info() method to inspect data types and identify any missing values. This is essential for guiding our cleaning strategy and understanding how complete and usable each column is.
The dataset contains 54,759 records and 17 columns. Key observations from the output:
- `Date` is currently stored as an `object` type, which suggests inconsistent formatting and will need to be parsed into a proper `datetime` format.
- Both `Daily Max 8-hour Ozone Concentration` and `Daily AQI Value` have 2,738 missing values each. Since these are central to the analysis, their absence must be addressed.
- `Method Code`, `CBSA Code`, and `CBSA Name` also contain missing entries. These may affect regional and methodological breakdowns.
- Other fields like geolocation, site info, and observation counts appear to be complete.
The next step will involve:
- Cleaning the `Date` field,
- Evaluating whether to impute or remove records with missing ozone measurements,
- And understanding the impact of missing metadata (`CBSA`, `Method Code`) on regional analyses; a sketch of these steps follows below.
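As a sketch of what this cleaning could look like (dropping rather than imputing rows with missing ozone readings is an illustrative assumption here, not a settled decision):
# Sketch: parse dates and handle missing values (illustrative choices, on a copy)
oz = ozone.copy()
oz['Date'] = pd.to_datetime(oz['Date'], errors='coerce')  # unparseable dates become NaT

# Drop rows missing the core measurements; imputation is a possible alternative
oz_clean = oz.dropna(subset=['Daily Max 8-hour Ozone Concentration', 'Daily AQI Value'])

# Flag rows missing regional metadata so CBSA-level breakdowns can exclude them
oz_clean = oz_clean.assign(has_cbsa=oz_clean['CBSA Name'].notna())
print(f"Rows retained after cleaning: {len(oz_clean):,} of {len(oz):,}")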
2. Data Quality Assessment and Cleaning
What we're doing:
Systematically identifying and addressing data quality issues including missing values, inconsistent date formats, outliers, and duplicate records.
Examine the Date column for inconsistencies
print("Unique Date formats sample:")
print(ozone['Date'].head(20).tolist())
Check for obvious data quality issues
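The kinds of quick checks this heading refers to might look like the following sketch; the duplicate, range, and IQR-based outlier checks are illustrative rather than the notebook's definitive procedure.
# Sketch: quick duplicate, range, and outlier checks (illustrative)
print("Duplicate rows:", ozone.duplicated().sum())

conc = ozone['Daily Max 8-hour Ozone Concentration']
print("Negative concentrations:", (conc < 0).sum())
print("Readings above the 0.070 ppm EPA threshold:", (conc > 0.070).sum())

# Flag extreme readings with a simple 1.5 * IQR rule
q1, q3 = conc.quantile([0.25, 0.75])
iqr = q3 - q1
print("IQR outliers:", ((conc < q1 - 1.5 * iqr) | (conc > q3 + 1.5 * iqr)).sum())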