Ozone: An air pollutant with a high presence in California

Cleaning data and the skies

📖 Background

Your are a data analyst at an environmental company. Your task is to evaluate ozone pollution across various regions.

You’ve obtained data from the U.S. Environmental Protection Agency (EPA), containing daily ozone measurements at monitoring stations across California. However, like many real-world datasets, it’s far from clean: there are missing values, inconsistent formats, potential duplicates, and outliers.

Before you can provide meaningful insights, you must clean and validate the data. Only then can you analyze it to uncover trends, identify high-risk regions, and assess where policy interventions are most urgently needed.

1 hidden cell

import pandas as pd
ozone = pd.read_csv('data/ozone.csv')
ozone.head()

💪 Competition challenge

Create a report that covers the following:

Your EDA and data cleaning process.
How does daily maximum 8-hour ozone concentration vary over time and regions?
Are there any areas that consistently show high ozone concentrations? Do different methods report different ozone levels?
Consider if urban activity (weekend vs. weekday) has any affect on ozone levels across different days.
Bonus: plot a geospatial heatmap showing any high ozone concentrations.

🧑‍⚖️ Judging criteria

CATEGORY	WEIGHTING	DETAILS
Recommendations	35%	Clarity of recommendations - how clear and well presented the recommendation is. Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid? Number of relevant insights found for the target audience.
Storytelling	35%	How well the data and insights are connected to the recommendation. How the narrative and whole report connects together. Balancing making the report in-depth enough but also concise.
Visualizations	20%	Appropriateness of visualization used. Clarity of insight from visualization.
Votes	10%	Up voting - most upvoted entries get the most points.

✅ Checklist before publishing into the competition

Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
Remove redundant cells like the judging criteria, so the workbook is focused on your story.
Make sure the workbook reads well and explains how you found your insights.
Try to include an executive summary of your recommendations at the beginning.
Check that all the cells run without error

⌛️ Time is ticking. Good luck!

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Loading the dataset and performing an initial exploration.
ozone = pd.read_csv('data/ozone.csv')
print("1. Starting with the Exploratory Data Analysis(EDA):")
print('\nGetting more information about data types and missing values:\n')

print(ozone.info())

print("\nIt looks like there are columns with missing values in the dataset: Daily Max 8-hour Ozone Concentration, Daily AQI Value, Method Code, CBSA Code and CBSA Name.")

# Setting a seaborn style for better visuals.
sns.set(style="whitegrid")

# Visualizing the distribution of daily maximum 8-hour ozone concentration.
plt.figure(figsize=(5, 5))
sns.histplot(data=ozone, x='Daily Max 8-hour Ozone Concentration')
plt.title("Distribution of the daily maximum 8-hour ozone concentration")
plt.xlabel("Daily maximum 8-hour ozone concentration (ppm)")
plt.ylabel("Number of daily ozone measurements")
plt.show()

print("In order to quickly review the daily maximum ozone concentrations, I created this visualization. From this distribution it can be concluded that indeed there are missing values and, possibly, outliers as well.")

# Counting the number of missing values per column.
print("\nFirst step in the data cleaning: Dealing with missing values")
print("\nCounting the number of missing values per column:\n")

print(ozone.isna().sum())

# Calculating our missing values threshold(5% or less of all values).
threshold = round(len(ozone) * 0.05)

# Using Boolean indexing to filter for columns with missing values less than or equal to this threshold.
cols_to_drop = ozone.columns[ozone.isna().sum() <= threshold]

# Dropping missing values. We set inplace=True so the DataFrame is updated.
ozone.dropna(subset=cols_to_drop, inplace=True)

print("\nThere are various approaches to handle missing data. In this case I calculated a threshold(5% or less of all values): " + str(threshold))
print("\nWith this value I can filter the columns to drop, remove them and then count the number of missing values again:\n")

print(ozone.isna().sum())

# Imputing a summary statistic: mode
cols_with_missing_values = ozone.columns[ozone.isna().sum() > 0]

for col in cols_with_missing_values[:-1]:
    ozone[col].fillna(ozone[col].mode()[0], inplace=True) # Added inplace=True

# Checking the remaining missing values.
print('\nThere are still missing values, so I will be imputing a summary statistic(mode) with a for loop.')
print('Counting the number of missing values after imputation:\n')

print(ozone.isna().sum())

# The volume has changed from 6490 missing values to 5681. This is because some rows may have contained missing values for our subset columns. Imputing by sub-group.
ozone_dict = ozone.groupby("CBSA Name")["Method Code"].median().to_dict()
ozone["Method Code"] = ozone["Method Code"].fillna(ozone["CBSA Name"].map(ozone_dict))

print("\nThe for loop didn't work out, so the remaining missing values will be eliminated grouping the CBSA Name with the Method Code column. Conditionally imputing the missing values for Method Code column based on a dictionary.")
print('Checking for missing values:\n')

print(ozone.isna().sum())

# Converting Date to datetime. The pandas to_datetime function automatically accepts most date formats
ozone['Date'] = pd.to_datetime(ozone["Date"])

print("\nSecond step in the data cleaning: Converting the Date column data type to datetime")
print("Checking the values format in the Date column:\n")
print(ozone["Date"])

print("\nThird step in the data cleaning: Treating duplicates")

# Getting duplicates: The subset argument lets us set a list of column names to check for duplication and the keep argument lets us keep all occurrences of duplicate values by setting it to False.
duplicates = ozone.duplicated(subset = ["Daily Max 8-hour Ozone Concentration", "Date", "CBSA Code", "Method Code", "County FIPS Code", "POC", "Percent Complete", "Site ID", "Source", "Daily Obs Count", "County"], keep = False)

# Sorting the duplicate values.
duplicates_ozone = ozone[duplicates].sort_values(by = "Daily Max 8-hour Ozone Concentration")

print("\nLooking at all duplicates:\n")
print(duplicates_ozone[["Daily Max 8-hour Ozone Concentration", "Date", "CBSA Code", "Method Code", "County FIPS Code", "POC", "Percent Complete", "Site ID", "Source", "Daily Obs Count", "County"]])

# Dropping complete duplicates only.
ozone.drop_duplicates(inplace = True) # inplace argument which drops the duplicated values directly inside the DataFrame.

# Cheking if there are any remaining duplicates.
duplicates = ozone.duplicated(subset = ["Daily Max 8-hour Ozone Concentration", "Date", "CBSA Code", "Method Code", "County FIPS Code", "POC", "Percent Complete", "Site ID", "Source", "Daily Obs Count", "County"], keep = False)

duplicates_ozone = ozone[duplicates].sort_values(by = "Daily Max 8-hour Ozone Concentration")

print("\nLooking at the remaining duplicates:\n")
print(duplicates_ozone[["Daily Max 8-hour Ozone Concentration", "Date", "CBSA Code", "Method Code", "County FIPS Code", "POC", "Percent Complete", "Site ID", "Source", "Daily Obs Count", "County"]])
print("\nThe result was an empty DataFrame which means that all duplicates were eliminated, so it is not necesary to treat incomplete duplicates.")

# Looking for outliers. The maximum is almost three times the median and mean, seems extreme so we will use the IQR.
print("\nFourth step in the data cleaning: Handling outliers")
print("\nA starting place for identifying outliers is with the pandas .describe() method. I will use this method in the main numerical column that will be analyzed:\n")
print(ozone['Daily Max 8-hour Ozone Concentration'].describe())

# IQR = 75th - 25th percentile, so in this case IQR = 0.018. This can be represented using a boxplot.
print("\nThe maximum ozone concentration is almost three times the median and the mean, seems extreme so I will analyze this using a boxplot.")

plt.figure(figsize=(12, 5))
sns.boxplot(data=ozone, x='Daily Max 8-hour Ozone Concentration')
plt.title("Boxplot of the daily maximum 8-hour ozone concentration(ppm) for ozone data")
plt.xlabel("Daily maximum 8-hour ozone concentration(ppm)")
plt.show()

print("There are a lot of extreme values but I consider that this outliers are representative of a subset of our data since ozone concentration can show extreme variations from one place to another. So I would not make a further analysis for outliers.")
print("\n2. How does daily maximum 8-hour ozone concentration vary over time and regions?")

# Extracting parts of a full date using dt.month, dt.day, and dt.year attributes.
ozone["year"] = ozone["Date"].dt.year
ozone["month"] = ozone["Date"].dt.month
ozone["day"] = ozone["Date"].dt.day

# Visualizing how daily maximum 8-hour ozone concentration vary over time and regions.
plt.figure(figsize=(5, 5))
sns.lineplot(data=ozone, x="month", y="Daily Max 8-hour Ozone Concentration")
plt.title("Variation of the daily maximum 8-hour ozone concentration by month")
plt.xlabel("months")
plt.ylabel("daily maximum 8-hour ozone concentration(ppm)")
plt.show()

plt.figure(figsize=(10, 5))
sns.scatterplot(data=ozone, x='CBSA Name', y='Daily Max 8-hour Ozone Concentration')
plt.title("Variation of the daily maximum 8-hour ozone concentration by region")
plt.xlabel("Core Based Statistical Areas(CBSA)")
plt.ylabel("daily maximum 8-hour ozone concentration(ppm)")
plt.xticks(rotation=90)
plt.show()

print("Daily maximum 8-hour ozone concentration(ppm) vs. months vs. Core Based Statistical Areas(CBSA)")
sns.relplot(x="month", y="Daily Max 8-hour Ozone Concentration", data=ozone, kind="line", hue="CBSA Name", ci=None)
plt.xlabel("months")
plt.ylabel("daily maximum 8-hour ozone concentration(ppm)")
plt.show()

print("The variation in daily maximum ozone concentrations by month was chosen because the confidence intervals are smaller and the graph was much easier to understand. This visualization shows an increase in ozone concentrations starting in the first month of 2024, reaching a maximum around July and then declining again in December of the same year. The second figure simply shows the regions where the different ozone concentrations were recorded; no clear relationship is observed between the two variables. The third visualization integrates the two previous graphs, showing that ozone concentrations generally increase in July in all regions, but their behavior remains highly variable over time and place. There is not enough information to determine the reason behind this unusual increase in July. The regional variation may be linked to traffic and industrial development, but there is also insufficient data to corroborate this.")

# Visualizing the ozone concentration for different regions. To know which regions consistenly present a high ozone concentration, I will use the upper outlier definition.
seventy_fifth = ozone["Daily Max 8-hour Ozone Concentration"].quantile(0.75)
twenty_fifth = ozone["Daily Max 8-hour Ozone Concentration"].quantile(0.25)
ozone_iqr = seventy_fifth - twenty_fifth
upper_outliers = seventy_fifth + (1.5 * ozone_iqr)

high_ozone = ozone[ozone["Daily Max 8-hour Ozone Concentration"] > upper_outliers]

"""There was a problem using the upper outliers since when plotting the data there wasn't a region that consistently show these extreme values all the time. So I decided to split the data using the 75th percentile(0.079) as a reference."""

high_ozone_75 = ozone[ozone["Daily Max 8-hour Ozone Concentration"] > seventy_fifth]

print("\n3. Are there any areas that consistently show high ozone concentrations? Do different methods report different ozone levels?\n")

print("Daily maximum 8-hour ozone concentration above 0.079 ppm(75th percentile) vs. months vs. Core Based Statistical Areas(CBSA)")
sns.relplot(x="month", y="Daily Max 8-hour Ozone Concentration", data=high_ozone_75, kind="line", hue='CBSA Name', ci=None)
plt.xlabel("months")
plt.ylabel("daily maximum 8-hour ozone concentration(ppm)")
plt.show()

plt.figure(figsize=(12, 10))
sns.scatterplot(x="month", y="CBSA Name", data=high_ozone_75)
plt.title("Core Based Statistical Areas(CBSA) vs. month in which a daily maximum 8-hour ozone concentration major than 0.079 ppm(75th percentile) was recorded")
plt.xlabel("month in which a measurement was recorded")
plt.ylabel("Core Based Statistical Areas(CBSA)")
plt.show()

plt.figure(figsize=(12, 10))
sns.scatterplot(data=high_ozone_75, x="month", y="County")
plt.title("California Counties vs. month in which a daily maximum 8-hour ozone concentration major than 0.079 ppm(75th percentile) was recorded")
plt.xlabel("month in which a measurement was recorded")
plt.ylabel("California Counties")
plt.show()

print("From a statistical point of view, ozone concentrations above the 75th percentile can be considered high. The first graph shows the variation of the highest ozone concentrations by month and by region, but the visualization is not very clear; it is difficult to know in which regions the line has been continuous throughout all months. For this reason, two more visualizations were created, showing in which regions and counties high ozone measurements were recorded each month. Visually, it can be observed that there are only two: Riverside-San Bernardino-Ontario (Riverside County) and San Diego-Carlsbad (San Diego County).")
print('\nFrom another point of view: Using the Air Quality Index(AQI)')
print('\nOn the United States Environmental Agency (EPA) website, the Air Quality Index, or AQI, is a nationally uniform index for reporting and forecasting daily air quality. The AQI tells the public how clean or polluted the air is and how to avoid potential associated health effects. The AQI uses a normalized scale from 0 to 500. The higher the AQI value, the greater the level of pollution and the greater the health concern. When AQI values are above 150, air quality is considered to be unhealthy for everyone.')

# Filtering the ozone data to get daily AQI values greater than 150.
unhealthy_ozone = ozone[ozone['Daily AQI Value'] > 150]

# Visualizing the counties that registered unhealthy levels of pollution. 
plt.figure(figsize=(5, 5))
sns.scatterplot(x="month", y="County", data=unhealthy_ozone)
plt.title("California Counties vs. month in which a daily maximum 8-hour ozone concentration above an AQI value of 150 was recorded")
plt.xlabel("month in which a measurement was recorded")
plt.ylabel("California Counties")
plt.show()

print('Fortunately, no California county has consistently recorded dangerous ozone concentrations. But at least 4 counties: Tulare, Riverside, Los Angeles, and Kern, have experienced significant ozone pollution for at least 6 consecutive months. San Bernardino has had the longest period of dangerous ozone levels (8 consecutive months).')

# Visualizing how ozone concentration vary for different methods.
plt.figure(figsize=(5, 5))
sns.lineplot(data=ozone, x='month', y="Daily Max 8-hour Ozone Concentration", hue="Method Code", style='Method Code', markers=True, ci=None)
plt.title("Daily maximum 8-hour ozone concentration vs. month vs. method")
plt.xlabel("months")
plt.ylabel("daily maximum 8-hour ozone concentration")
plt.show()

print("The visualization clearly shows that four different measurement methods were used, with similar results for methods identified by codes 199.0 and 87.0. While the method with code 53.0 tends to record higher ozone concentrations compared to the aforementioned methods, its behavior is similar. On the other hand, the method with code 47.0 records much lower values and much more stable behavior over time. It is possible that the methods with codes 53.0 and 47.0 are responsible for the extreme values, but further analysis is needed to determine this.")
print("\n4. Consider if urban activity (weekend vs. weekday) has any affect on ozone levels across different days.")

# Using a datetime-related attribute to convert the date column to weekdays.
ozone["weekday"] = ozone["Date"].dt.day_name()

#Visualizing the ozone concentration on weekdays and weekend.
plt.figure(figsize=(5, 5))
sns.lineplot(data=ozone, x="day", y="Daily Max 8-hour Ozone Concentration")
plt.title("Variation of the daily maximum 8-hour ozone concentration by day")
plt.xlabel("day")
plt.ylabel("daily maximum 8-hour ozone concentration(ppm)")
plt.show()

ozone_weekday = ozone[ozone["weekday"].isin(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"])]

print("Daily maximum 8-hour ozone concentration vs. weekday vs. Core Based Statistical Areas (CBSA)")
sns.relplot(x="weekday", y="Daily Max 8-hour Ozone Concentration", data=ozone_weekday, kind="line", hue="CBSA Name", ci=None)
plt.xlabel("weekday")
plt.ylabel("daily maximum 8-hour ozone concentration(ppm)")
plt.show()

plt.figure(figsize=(5, 5))
sns.lineplot(x="weekday", y="Daily Max 8-hour Ozone Concentration", data=ozone_weekday)
plt.title("Variation of the daily maximum 8-hour ozone concentration by weekday")
plt.xlabel("weekday")
plt.ylabel("daily maximum 8-hour ozone concentration(ppm)")
plt.show()

ozone_weekend = ozone[ozone["weekday"].isin(["Saturday", "Sunday"])]

print("Daily maximum 8-hour ozone concentration vs. weekend vs. Core Based Statistical Areas(CBSA)")
sns.relplot(x="weekday", y="Daily Max 8-hour Ozone Concentration", data=ozone_weekend, kind="line", hue="CBSA Name", ci=None)
plt.xlabel("weekend")
plt.ylabel("daily maximum 8-hour ozone concentration(ppm)")
plt.show()

plt.figure(figsize=(5, 5))
sns.lineplot(x="weekday", y="Daily Max 8-hour Ozone Concentration", data=ozone_weekend)
plt.title("Variation of the daily maximum 8-hour ozone concentration by weekend")
plt.xlabel("weekend")
plt.ylabel("daily maximum 8-hour ozone concentration(ppm)")
plt.show()

print("The first visualization shows a highly variable daily maximum ozone concentration across all days; however, the confidence intervals are not very pronounced, implying that this pattern persists monthly. From the second visualization, it can be inferred that the confidence intervals in the third visualization are quite large due to the large variations recorded by region/county. However, both show how the maximum daily ozone concentration increases between Mondays and Fridays, primarily between Wednesdays and Thursdays. Likewise, from the fourth visualization, it can be inferred that due to the large variations in measurements between regions, the confidence intervals in the fifth visualization are also large, but in general, a decrease in ozone concentrations is observed between Saturdays and Sundays. This decrease may be due to less traffic on weekends and a reduction in industrial activity, but more data would be needed to establish a clear relationship.")