Cleaning data and the skies

📖 Background

You are a data analyst at an environmental company. Your task is to evaluate ozone pollution across various regions.

You’ve obtained data from the U.S. Environmental Protection Agency (EPA), containing daily ozone measurements at monitoring stations across California. However, like many real-world datasets, it’s far from clean: there are missing values, inconsistent formats, potential duplicates, and outliers.

Before you can provide meaningful insights, you must clean and validate the data. Only then can you analyze it to uncover trends, identify high-risk regions, and assess where policy interventions are most urgently needed.
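
The cleaning steps named above (missing values, duplicates, outliers) can be sketched on a small made-up table. This is only an illustration under stated assumptions: the column names `date`, `site`, and `ozone_ppm`, the helper `clean_ozone`, and the 1.5 × IQR outlier rule are all hypothetical choices, not the actual schema or method of the EPA file.

```python
import pandas as pd

# Toy data standing in for the real file; column names and values are made up.
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-03",
             "2024-01-04", "2024-01-05", "2024-01-06"],
    "site": ["A"] * 7,
    "ozone_ppm": [0.031, 0.031, None, 0.035, 0.033, 0.034, 0.900],
})

def clean_ozone(df):
    """Hypothetical helper: parse dates, drop duplicates and gaps, flag outliers."""
    out = df.copy()
    # Unparseable dates become NaT instead of raising an error.
    out["date"] = pd.to_datetime(out["date"], errors="coerce")
    # Remove rows that are exact copies of an earlier row.
    out = out.drop_duplicates()
    # Drop rows with no usable date or measurement.
    out = out.dropna(subset=["date", "ozone_ppm"])
    # Flag (rather than delete) values outside 1.5 * IQR of the quartiles.
    q1, q3 = out["ozone_ppm"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out["is_outlier"] = ~out["ozone_ppm"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out.reset_index(drop=True)

cleaned = clean_ozone(raw)
print(cleaned)
```

Flagging outliers instead of dropping them keeps the decision reversible, which helps when an extreme reading might be a genuine pollution episode rather than a sensor error.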



💪 Competition challenge

Create a report that covers the following:

  1. Your EDA and data cleaning process.
  2. How does daily maximum 8-hour ozone concentration vary over time and regions?
  3. Are there any areas that consistently show high ozone concentrations? Do different methods report different ozone levels?
  4. Consider whether urban activity (weekend vs. weekday) has any effect on ozone levels across different days.
  5. Bonus: plot a geospatial heatmap showing any high ozone concentrations.
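
One way to approach the weekend vs. weekday question in item 4 is to label each date and compare group means. The sketch below uses invented values and hypothetical column names (`date`, `ozone_ppm`, `day_type`); it shows the mechanics only and says nothing about the real data.

```python
import pandas as pd

# Two made-up weeks of daily readings, starting on Monday 2024-06-03.
toy = pd.DataFrame({
    "date": pd.date_range("2024-06-03", periods=14, freq="D"),
    "ozone_ppm": [0.040, 0.042, 0.041, 0.043, 0.044, 0.050, 0.052] * 2,
})

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
toy["day_type"] = toy["date"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

# Mean concentration and number of observations per day type.
summary = toy.groupby("day_type")["ozone_ppm"].agg(["mean", "count"])
print(summary)
```

A difference in means on its own is not evidence of an effect; with real data it would be worth checking the spread within each group and whether the gap holds across sites and seasons.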

🧑‍⚖️ Judging criteria

CATEGORY (WEIGHTING): DETAILS

Recommendations (35%)
  • Clarity of recommendations - how clear and well presented the recommendation is.
  • Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid?
  • Number of relevant insights found for the target audience.

Storytelling (35%)
  • How well the data and insights are connected to the recommendation.
  • How the narrative and whole report connects together.
  • Balancing making the report in-depth enough but also concise.

Visualizations (20%)
  • Appropriateness of visualization used.
  • Clarity of insight from visualization.

Votes (10%)
  • Up voting - most upvoted entries get the most points.

✅ Checklist before publishing into the competition

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria, so the workbook is focused on your story.
  • Make sure the workbook reads well and explains how you found your insights.
  • Try to include an executive summary of your recommendations at the beginning.
  • Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

Ozone: An air pollutant with a high presence in California

Competition: DataCamp - Cleaning data and the skies

Author: Noelia Fernandez Paez

Date: July 2025

Importing necessary libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Set a seaborn style for better visuals

sns.set_theme(style="whitegrid")

Read in the dataset as a DataFrame

ozone = pd.read_csv('data/ozone.csv')
ozone.head()

Exploratory Data Analysis (EDA): Initial exploration

print("Dataset shape:", ozone.shape)
print('\nGetting more information about data types and missing values:\n')

# info() prints its summary directly and returns None, so it is not wrapped in print()
ozone.info()

print("\nCounting the number of missing values per column:\n")

print(ozone.isna().sum())
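
Raw missing-value counts are easier to judge alongside percentages and a duplicate count. The following is a minimal sketch; `missing_report` is a hypothetical helper and `demo` is a made-up frame, used here so the example runs without the competition file. In the notebook the same function could be called as `missing_report(ozone)`.

```python
import pandas as pd

def missing_report(df):
    """Hypothetical helper: missing values per column as counts and percentages."""
    n_missing = df.isna().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(1),
    })
    # Columns with the most gaps come first.
    return report.sort_values("n_missing", ascending=False)

# Made-up frame for demonstration only.
demo = pd.DataFrame({
    "a": [1, None, 3, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

report = missing_report(demo)
print(report)

# Number of rows that exactly repeat an earlier row.
n_dupes = int(demo.duplicated().sum())
print("Fully duplicated rows:", n_dupes)
```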