
Cleaning data and the skies

📖 Background

You are a data analyst at an environmental company. Your task is to evaluate ozone pollution across various regions.

You’ve obtained data from the U.S. Environmental Protection Agency (EPA), containing daily ozone measurements at monitoring stations across California. However, like many real-world datasets, it’s far from clean: there are missing values, inconsistent formats, potential duplicates, and outliers.

Before you can provide meaningful insights, you must clean and validate the data. Only then can you analyze it to uncover trends, identify high-risk regions, and assess where policy interventions are most urgently needed.



💪 Competition challenge

Create a report that covers the following:

  1. Your EDA and data cleaning process.
  2. How does daily maximum 8-hour ozone concentration vary over time and regions?
  3. Are there any areas that consistently show high ozone concentrations? Do different methods report different ozone levels?
  4. Consider whether urban activity (weekend vs. weekday) has any effect on ozone levels.
  5. Bonus: plot a geospatial heatmap showing any high ozone concentrations.

✅ Checklist before publishing into the competition

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria, so the workbook is focused on your story.
  • Make sure the workbook reads well and explains how you found your insights.
  • Try to include an executive summary of your recommendations at the beginning.
  • Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

Loading the data

import pandas as pd

df = pd.read_csv('ozone.csv')
display(df.head())

Explore the data

print("Data Types:")
display(df.dtypes)

print("\nShape of the DataFrame:")
display(df.shape)

print("\nMissing Values per Column:")
display(df.isnull().sum())

print("\nDescriptive Statistics for Numerical Columns:")
display(df.describe())

print("\nDescriptive Statistics for Categorical Columns:")
display(df.describe(include='object'))

Cleaning the data

# Inspect columns with missing values
print("Percentage of missing values:")
display(df[['Daily Max 8-hour Ozone Concentration', 'Daily AQI Value', 'Method Code', 'CBSA Code', 'CBSA Name']].isnull().mean() * 100)

# Strategy: Remove rows where key measurement columns ('Daily Max 8-hour Ozone Concentration', 'Daily AQI Value') are missing.
# For 'Method Code', 'CBSA Code', and 'CBSA Name', also remove rows with missing values for simplicity in this subtask.
df.dropna(subset=['Daily Max 8-hour Ozone Concentration', 'Daily AQI Value', 'Method Code', 'CBSA Code', 'CBSA Name'], inplace=True)

# Check for and remove duplicate rows
print("\nNumber of duplicate rows before removal:", df.duplicated().sum())
df.drop_duplicates(inplace=True)
print("Number of duplicate rows after removal:", df.duplicated().sum())

# Verify that missing values have been handled
print("\nMissing values after handling:")
display(df[['Daily Max 8-hour Ozone Concentration', 'Daily AQI Value', 'Method Code', 'CBSA Code', 'CBSA Name']].isnull().sum())

# Verify that duplicates have been removed
print("\nNumber of duplicate rows after verification:", df.duplicated().sum())

Analyzing ozone variation over time and regions

df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df.dropna(subset=['Date'], inplace=True)
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day_of_Week'] = df['Date'].dt.day_name()
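The `Day_of_Week` column derived above feeds directly into challenge task 4 (weekend vs. weekday activity). A minimal sketch of that comparison, using a small synthetic DataFrame in place of `df` (the dates and values are illustrative only):

```python
import pandas as pd

# Synthetic stand-in for df after the date features above are added
df = pd.DataFrame({
    'Date': pd.to_datetime(['2023-06-05', '2023-06-10', '2023-06-11', '2023-06-12']),
    'Daily Max 8-hour Ozone Concentration': [0.040, 0.050, 0.052, 0.042],
})
df['Day_of_Week'] = df['Date'].dt.day_name()

# Flag weekends, then compare mean ozone across the two groups
df['Is_Weekend'] = df['Day_of_Week'].isin(['Saturday', 'Sunday'])
weekend_ozone = df.groupby('Is_Weekend')['Daily Max 8-hour Ozone Concentration'].mean()
print(weekend_ozone)
```

On the real data, a difference between the two group means would hint at a traffic-related (or other urban-activity) effect, though season and region should be controlled for before drawing conclusions.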

# Analyze by year
yearly_ozone = df.groupby('Year')['Daily Max 8-hour Ozone Concentration'].mean()
print("Average Daily Max 8-hour Ozone Concentration by Year:")
display(yearly_ozone)

# Analyze by month
monthly_ozone = df.groupby('Month')['Daily Max 8-hour Ozone Concentration'].mean()
print("\nAverage Daily Max 8-hour Ozone Concentration by Month:")
display(monthly_ozone)

# Analyze by region (CBSA Name)
regional_ozone = df.groupby('CBSA Name')['Daily Max 8-hour Ozone Concentration'].mean()
print("\nAverage Daily Max 8-hour Ozone Concentration by Region:")
display(regional_ozone.sort_values(ascending=False))

# Analyze daily changes over time by region
daily_regional_ozone = df.groupby(['CBSA Name', 'Date'])['Daily Max 8-hour Ozone Concentration'].mean().reset_index()
print("\nAverage Daily Max 8-hour Ozone Concentration by Region and Date (first 10 rows):")
display(daily_regional_ozone.head(10))
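The long-format table above is easier to read as a line chart with one series per region. A sketch using matplotlib, with a small synthetic stand-in for `daily_regional_ozone` (region names and values are illustrative):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Synthetic stand-in for daily_regional_ozone from the step above
daily_regional_ozone = pd.DataFrame({
    'CBSA Name': ['Los Angeles', 'Los Angeles', 'Fresno', 'Fresno'],
    'Date': pd.to_datetime(['2023-06-01', '2023-06-02', '2023-06-01', '2023-06-02']),
    'Daily Max 8-hour Ozone Concentration': [0.055, 0.060, 0.048, 0.050],
})

# Pivot to one column per region, then draw one line per region
pivoted = daily_regional_ozone.pivot(index='Date', columns='CBSA Name',
                                     values='Daily Max 8-hour Ozone Concentration')
ax = pivoted.plot(figsize=(10, 5), title='Daily Max 8-hour Ozone by Region')
ax.set_ylabel('Ozone (ppm)')
plt.tight_layout()
plt.savefig('ozone_by_region.png')
```

With the full dataset the chart will be crowded; restricting the pivot to the handful of regions with the highest means (from `regional_ozone` above) keeps it legible.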

Identifying high-ozone areas and comparing measurement methods
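A sketch of this step, again on a small synthetic stand-in for the cleaned `df`: rank regions by mean concentration, then group by `Method Code` to see whether different instruments report systematically different levels (the codes and values below are illustrative, not taken from the EPA data):

```python
import pandas as pd

# Synthetic stand-in for the cleaned df; 'Method Code' identifies the measurement method
df = pd.DataFrame({
    'Method Code': [47.0, 47.0, 87.0, 87.0],
    'Daily Max 8-hour Ozone Concentration': [0.040, 0.042, 0.050, 0.052],
    'CBSA Name': ['Fresno', 'Fresno', 'Los Angeles', 'Los Angeles'],
})

# Regions ranked by mean ozone: consistently high areas surface at the top
top_regions = (df.groupby('CBSA Name')['Daily Max 8-hour Ozone Concentration']
                 .mean().sort_values(ascending=False))
print(top_regions.head())

# Mean and sample size per method: large gaps may reflect instrument differences
method_comparison = (df.groupby('Method Code')['Daily Max 8-hour Ozone Concentration']
                       .agg(['mean', 'count']))
print(method_comparison)
```

One caveat for the real data: methods are not deployed uniformly across regions, so a raw per-method mean can be confounded by geography. Grouping by both `CBSA Name` and `Method Code` separates the two effects.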