Skip to content
0

Cleaning data and the skies

📖 Background

Your are a data analyst at an environmental company. Your task is to evaluate ozone pollution across various regions.

You’ve obtained data from the U.S. Environmental Protection Agency (EPA), containing daily ozone measurements at monitoring stations across California. However, like many real-world datasets, it’s far from clean: there are missing values, inconsistent formats, potential duplicates, and outliers.

Before you can provide meaningful insights, you must clean and validate the data. Only then can you analyze it to uncover trends, identify high-risk regions, and assess where policy interventions are most urgently needed.

💾 The data

The data is a modified dataset from the U.S. Environmental Protection Agency (EPA).

Ozone contains the daily air quality summary statistics by monitor for the state of California for 2024. Each row contains the date and the air quality metrics per collection method and site
  • "Date" - the calendar date with which the air quality values are associated
  • "Source" - the data source: EPA's Air Quality System (AQS), or Airnow reports
  • "Site ID" - the id for the air monitoring site
  • "POC" - the id number for the monitor
  • "Daily Max 8-hour Ozone Concentration" - the highest 8-hour value of the day for ozone concentration
  • "Units" - parts per million by volume (ppm)
  • "Daily AQI Value" - the highest air quality index value for the day, telling how clean or polluted the air is (a value of 50 represents good air quality, while a value above 300 is hazardous)
  • "Local Site Name" - name of the monitoring site
  • "Daily Obs Count" - number of observations reported in that day
  • "Percent Complete" - indicates whether all expected samples were collected
  • "Method Code" - identifier for the collection method
  • "CBSA Code" - identifier for the core base statistical area (CBSA)
  • "CBSA Name" - name of the core base statistical area
  • "State FIPS Code" - identifier for the state
  • "State" - name of the state
  • "County FIPS Code" - identifer for the county
  • "County" - name of the county
  • "Site Latitude" - latitude coordinates of the site
  • "Site Longitude" - longitude coordinates of the side

Import dataframe ozone and display

import pandas as pd
ozone = pd.read_csv('data/ozone.csv')
ozone.head(20)

EXECUTIVE SUMMARY

Are there any areas that consistently show high ozone concentrations?

Bay area shows consistently low ozone levels and AQI values showing low air pollution.

In contrast, Los Angeles city has high ozone and AQI levels indicating high air pollution.

Counties outside the periphery of Los Angeles city, like Santa Barbara and Santa Maria also have low air pollution like the bay area.

How does daily maximum 8-hour ozone concentration vary over time?

Ozone concentration as well as AQI levels increase during summer months and declines in winter time.

Both pollution measures are well representerd by second order regression.

Consider if urban activity (weekend vs. weekday) has any affect on ozone levels across different days.

Both ozone concentration and AQI levels increase during weekdeys and drop on weekends.

EDA and data cleaning

# EDA and data cleaning
import missingno as msno
import pandas as pd
display(ozone.head(10))

# Print original ozone info and missing values
print('\nozone df info******')
print(ozone.info())
print('\nozone df missing values******')
print(ozone.isna().sum())

# Display missing values - complete random

msno.matrix(ozone.sort_values('Percent Complete'))

# Check date. Get month, day and day of year
print('\nConvert data and extract month,,day, day of year, and weekday******')
ozone['Date'] = pd.to_datetime(ozone['Date'])
ozone['Month'] = ozone['Date'].dt.month
ozone['Day'] = ozone['Date'].dt.day
ozone['DayofYear'] = ozone['Date'].dt.dayofyear
ozone['Weekday']=ozone['Date'].dt.weekday

# Interpolate missing daily AQI value
print('\nInterpolate missing AQI values and save as InterpolAQI******')
print('\nInterpolate Ozone concentration and save as Interpol OzoneCon******')
ozone['Interpol AQI'] = ozone['Daily AQI Value'].interpolate(method='linear')
ozone['Interpol OzoneCon']=ozone['Daily Max 8-hour Ozone Concentration'].\
interpolate(method='linear')
print ('\nNo action for other missing values for now******')


# Convert onjects to categories
print('\nConverts objects to categories******')
columns=ozone.columns
print(columns)
for col in columns:
    if ozone[col].dtype=='object':
        ozone[col]=ozone[col].astype('category')
        print ('\n',col, ':\n')
        print(ozone[col].nunique())
        print(ozone[col].unique())

print('\nCleaned ozone info******')
print(ozone.info())     

print('\nDisplay descriptive statistic of numerical values of ozone df******')
display(ozone.describe().transpose())

print('\nTop outliers for AQI and ozone concentration******')
AQI_top_outlier=73.5
ozone_con_top_outlier=0.079
print(f'AQI top outlier {AQI_top_outlier}')
print(f'ozone concentration top outlier {ozone_con_top_outlier}')

Remark

Missing values checked and missing Ozone concentration. and AQI levels are interpolated between top and bottom values. The interpolated values are kept as two additional variables while keeping the originals.

Other missing values are left as they are left as they are. subject to change if needed.

month day of month and day of year and weekdas are extract from datetime data.

Object type variables are converted to categories.

Unuqie categories are checked. There are some unknown CBSA names, but mo action was takes since we did not use CBSA and county nakes, becasue we used an official map of california from Folium as a base to bubble maps.

Plot geospatial bubble maps

Are there any areas that consistently show high or low ozone concentrations?

import matplotlib.pyplot as plt
import folium
import pandas as pd

def prep_bubble(df, col):
    # Get coordinates of areas
    df_coord = df.groupby('Site ID').agg({'Site Longitude': min, 'Site Latitude': max}).reset_index()
    # Get the mean
    df_mean = df.groupby('Site ID')[col].mean()
    df_max = df.groupby('Site ID')[col].max()
    # Extract latitude and longitude as arrays
    x = df_coord['Site Longitude'].to_numpy()
    y = df_coord['Site Latitude'].to_numpy()
    spmean = df_mean.to_numpy()
    spmax = df_max.to_numpy()

    return x, y, spmean, spmax

def map_folium(x, y, sx):
    map = folium.Map(location=[36., -119.0], tiles="OpenStreetMap", opacity=0.4, zoom_start=6)
    for i in range(len(x)):
        folium.CircleMarker(
            location=[y[i], x[i]],
            radius=sx[i],
            color='red',
            fill=True,
            fill_color='red'
        ).add_to(map)
        
    return map

# Assuming 'ozone' is a DataFrame that has been defined earlier
x, y, spm, spx = prep_bubble(ozone, 'Interpol AQI')
lat_long_df = pd.DataFrame({'long': x, 'lat': y, 'sp': spm})

lat_long_300 = lat_long_df[lat_long_df['sp'] > 60 ]
x = lat_long_300['long'].values
y = lat_long_300['lat'].values
sx = lat_long_300['sp'].values  # Convert to numpy array
map1a = map_folium(x, y, sx/10)

# Display the map
print('.             Average Daily AQI Value greater than 50 ppm-critial level ')
print(f"                   Maximum recorded average AQI value is {round(lat_long_df['sp'].max(),2)} ppm")
map1a
# Assuming 'ozone' is a DataFrame that has been defined earlier
x, y, spm, spx = prep_bubble(ozone, 'Interpol AQI')
lat_long_df = pd.DataFrame({'long': x, 'lat': y, 'sp': spm})

lat_long_300 = lat_long_df[lat_long_df['sp'] < 28]
x = lat_long_300['long'].values
y = lat_long_300['lat'].values
sx = lat_long_300['sp'].values  # Convert to numpy array

map1b = map_folium(x, y, sx/5)

# Display the map
print('.                         Average Daily AQI Value less than 31 ppm')
map1b

# Recreate the DataFrame with the original lengths of x, y, and spx
x, y, spm, spx = prep_bubble(ozone, 'Interpol AQI')
#print(len(x),len(spm),len(spx))

lat_long_df = pd.DataFrame({'long': x, 'lat': y, 'sp': spx})

lat_long_300 = lat_long_df[lat_long_df['sp'] > 175]
x = lat_long_300['long'].values
y = lat_long_300['lat'].values
sx = lat_long_300['sp'].values  # Convert to numpy array
map2a = map_folium(x, y, sx/25)

# Display the map
print('.           Locations where recorded maximum daily AQI Value is greater than 175 ppm')
print(f"                              Maximum recorded AQI value is {lat_long_df['sp'].max()} ppm")
map2a
# Recreate the DataFrame with the original lengths of x, y, and spx
x, y, spm, spx = prep_bubble(ozone, 'Interpol AQI')


lat_long_df = pd.DataFrame({'long': x, 'lat': y, 'sp': spx})
lat_long_300 = lat_long_df[lat_long_df['sp'] <50]
x = lat_long_300['long'].values
y = lat_long_300['lat'].values
sx = lat_long_300['sp'].values  # Convert to numpy array
map2b = map_folium(x, y, sx/8)

# Display the map
print('.             Site locations where maximum recorded daily AQI value is less than 75 ppm')
map2b
# Recreate the DataFrame with the original lengths of x, y, and spx
x, y, spm, spx = prep_bubble(ozone, 'Interpol OzoneCon')


lat_long_df = pd.DataFrame({'long': x, 'lat': y, 'sp': spx})

lat_long_300 = lat_long_df[lat_long_df['sp'] > 0.1]
x = lat_long_300['long'].values
y = lat_long_300['lat'].values
sx = lat_long_300['sp'].values  # Convert to numpy array


map3a = map_folium(x, y, sx*50)

# Display the map
print('.      Site Locaions wherem Maximum Daily Ozone concentration ppm greater than 0.1 ppm')
print(f"                 Maximum recorded ozone concentration is {lat_long_df['sp'].max()} ppm")
map3a