Cleaning data and the skies
📖 Background
You are a data analyst at an environmental company. Your task is to evaluate ozone pollution across various regions.
You’ve obtained data from the U.S. Environmental Protection Agency (EPA), containing daily ozone measurements at monitoring stations across California. However, like many real-world datasets, it’s far from clean: there are missing values, inconsistent formats, potential duplicates, and outliers.
Before you can provide meaningful insights, you must clean and validate the data. Only then can you analyze it to uncover trends, identify high-risk regions, and assess where policy interventions are most urgently needed.
💪 Competition challenge
Create a report that covers the following:
- Your EDA and data cleaning process.
- How does daily maximum 8-hour ozone concentration vary over time and regions?
- Are there any areas that consistently show high ozone concentrations? Do different methods report different ozone levels?
- Consider whether urban activity (weekend vs. weekday) has any effect on ozone levels across different days.
- Bonus: plot a geospatial heatmap showing any high ozone concentrations.
🧑‍⚖️ Judging criteria
| CATEGORY | WEIGHTING | DETAILS |
|---|---|---|
| Recommendations | 35% | |
| Storytelling | 35% | |
| Visualizations | 20% | |
| Votes | 10% | |
✅ Checklist before publishing into the competition
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your story.
- Make sure the workbook reads well and explains how you found your insights.
- Try to include an executive summary of your recommendations at the beginning.
- Check that all the cells run without error.
⌛️ Time is ticking. Good luck!
Ozone: An air pollutant with a high presence in California
Competition: DataCamp - Cleaning data and the skies
Author: Noelia Fernandez Paez
Date: July 2025
Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Set a seaborn style for better visuals
sns.set(style="whitegrid")
Read in the dataset as a DataFrame
ozone = pd.read_csv('data/ozone.csv')
ozone.head()
Exploratory Data Analysis (EDA): Initial exploration
print("Dataset shape:", ozone.shape)
print('\nGetting more information about data types and missing values:\n')
ozone.info()  # info() prints its summary directly and returns None, so no print() is needed
print("\nCounting the number of missing values per column:\n")
print(ozone.isna().sum())
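The initial exploration could be extended along the lines below. This is only a minimal sketch: it uses a small made-up DataFrame rather than the real `ozone` data, and the column names (`date`, `site_id`, `ozone_ppm`) are hypothetical placeholders, not the actual EPA column names. It shows missing values as a percentage per column and counts rows that are exact duplicates, two of the issues mentioned in the background.

```python
import pandas as pd

# Toy stand-in for the ozone data; column names and values are hypothetical.
sample = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", None],
    "site_id": [1, 1, 2, 2],
    "ozone_ppm": [0.031, 0.031, None, 0.045],
})

# Share of missing values per column, as a percentage.
missing_pct = sample.isna().mean().mul(100).round(1)
print(missing_pct)

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(sample.duplicated().sum())
print("Fully duplicated rows:", n_duplicates)
```

The same two checks should work on the real data by swapping `sample` for `ozone`. Whether duplicates ought to be judged on all columns or only on a subset (for example, date plus station) depends on the dataset and is left open here.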