MICE Imputation of Pollution Dataset
This report provides a detailed analysis of air pollution data for Sebokeng, a densely populated, low-income settlement in southern Gauteng, located near industrial zones. The dataset covers daily measurements of five key air pollutants (PM2.5, PM10, SO₂, NO₂, and O₃) over the period from January 2011 to February 2020.
The analysis focuses on identifying trends, outliers, and the extent of missing data within the dataset. Visualizing the raw data helps in spotting patterns, assessing missing data, and understanding correlations between pollutants. In air quality research, missing data presents significant challenges, such as the risk of bias and reduced statistical accuracy, which complicate the assessment of exposure and health risks.
Addressing these gaps matters because missing data can bias findings and reduce their statistical power. Missing data is a common challenge across research disciplines, particularly in environmental health sciences (Hadeed et al., 2020). Monitoring environmental contaminants plays a crucial role in exposure science research and public health efforts: government agencies often use environmental monitors to ensure regulatory compliance, while researchers employ them for scientific investigations (Hadeed et al., 2020). In environmental health research, these monitors are essential for measuring contaminant concentrations and linking those levels to potential exposures and associated health outcomes (Hadeed et al., 2020). Regardless of how the data are sampled, data that are missing at random (MAR) are frequently encountered in environmental health studies. Understanding the nature of the missingness is crucial for guiding imputation processes that can produce reliable estimates.
import pandas as pd
data = pd.read_excel('Sebokeng_Data Spreadsheet.xlsx')
data.info()
Overview of the Dataset
The dataset comprises daily measurements of five air pollutants in Sebokeng: Particulate Matter (PM2.5 and PM10), Sulfur Dioxide (SO₂), Nitrogen Dioxide (NO₂), and Ozone (O₃). The period covers January 2011 to February 2020.
# Create Date Column
data['Date'] = pd.date_range(start='2011-01-01', periods=len(data), freq='D')
data.set_index('Date', inplace=True)
data.head()
# Drop the 'seb' prefix from the pollutant column names in a single call
data.rename(columns={'sebSO2': 'SO2', 'sebNO2': 'NO2', 'sebO3': 'O3',
                     'sebPM10': 'PM10', 'sebPM25': 'PM25'}, inplace=True)
print(f"Head: \n{data.head()}")
print("\n", f"Tail: \n{data.tail()}")
print("\n", f"Shape: \n{data.shape}")
Summary Statistics of Key Variables
Below we see the key summary statistics that have been calculated for each pollutant.
# Descriptive Stats
data.describe()
import matplotlib.pyplot as plt
import seaborn as sns
# Line Plot
plt.figure(figsize=(10, 8))
sns.lineplot(data=data)
plt.title("Pollutant Concentrations Over Time")
plt.xlabel("Date")
plt.ylabel("Concentration")
plt.xticks(rotation=45)
plt.savefig('Line_Plot.png')
plt.show()
# Box Plot of Pollutant Levels
plt.figure(figsize = (10, 8))
sns.boxplot(data = data)
plt.title("Box Plot of Pollutant Concentrations")
plt.xlabel("Pollutant")
plt.ylabel("Concentration")
plt.savefig('Box_Plot_1.png')
plt.show()
# Heatmap of Correlations Between Pollutants
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of Pollutant Levels")
plt.xlabel("Pollutant")
plt.ylabel("Pollutant")
plt.show()
# Plot FacetGrid
data_melted = data.reset_index().melt(id_vars='Date', var_name='Pollutant', value_name='Concentration') # Melt the DataFrame
g = sns.FacetGrid(data_melted, col="Pollutant", col_wrap=3, sharey=False, height=4)
g.map(sns.lineplot, "Date", "Concentration") # Map a line plot to each facet
g.set_xticklabels(rotation=45) # Rotate x-axis labels for readability
plt.tight_layout()
plt.show()
From the original dataset, the general concentrations of pollutants from highest to lowest were O₃, PM10, PM2.5, NO₂, and SO₂. The mean measurements over the nine-year period for all the pollutants are significantly higher than the levels recommended by the Global Air Quality Guidelines (AQG), which were released by the World Health Organization (WHO) in September 2021 (Figure below) (Garland et al., 2021). Comparing Sebokeng’s average pollutant levels against the WHO targets: the mode for SO₂ (3.45 μg/m³) sits between Interim Target (IT) 3 and the AQG; the mean for NO₂ (25.35 μg/m³) is between the annual IT-2 and IT-3; PM2.5 (31.67 μg/m³) is between IT-1 and IT-2; and PM10 (46.37 μg/m³) is between IT-2 and IT-3.
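The comparison against the WHO targets can be reproduced with a small lookup. The sketch below is illustrative only: the annual threshold values are quoted from memory of the 2021 WHO guidelines and should be treated as an assumption to verify against the official table, and the helper name `classify` is hypothetical. SO₂ and O₃ are left out because their WHO guideline values are defined over shorter averaging periods rather than as annual means.

```python
# Annual-mean thresholds in µg/m³, ordered from least to most stringent.
# ASSUMPTION: values recalled from the WHO 2021 guidelines; verify before use.
WHO_ANNUAL = {
    "PM25": [("IT-1", 35), ("IT-2", 25), ("IT-3", 15), ("IT-4", 10), ("AQG", 5)],
    "PM10": [("IT-1", 70), ("IT-2", 50), ("IT-3", 30), ("IT-4", 20), ("AQG", 15)],
    "NO2": [("IT-1", 40), ("IT-2", 30), ("IT-3", 20), ("AQG", 10)],
}


def classify(pollutant, value):
    """Return the band of WHO annual targets that `value` falls into."""
    levels = WHO_ANNUAL[pollutant]
    if value > levels[0][1]:
        return f"above {levels[0][0]}"
    # Walk down adjacent pairs of targets until the value exceeds the lower one.
    for (upper_name, _), (lower_name, lower) in zip(levels, levels[1:]):
        if value > lower:
            return f"between {upper_name} and {lower_name}"
    return "meets AQG"


# Averages reported in the text above (µg/m³).
reported = {"PM25": 31.67, "PM10": 46.37, "NO2": 25.35}
bands = {name: classify(name, value) for name, value in reported.items()}
for name, band in bands.items():
    print(f"{name}: {reported[name]} µg/m³ is {band}")
```

Under these assumed thresholds, the bands agree with the placement of PM2.5, PM10, and NO₂ described above.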
The observed correlations between the pollutants were generally weak. The strongest positive association was between PM2.5 and PM10 (0.71), while the strongest negative correlation was found between O₃ and NO₂. The strong correlation between PM2.5 and PM10 suggests that they may have similar emission sources (Zhou et al., 2016). According to Muyemeki et al. (2021), the positive correlation between PM2.5 and PM10 emanates from dust-related contributions (over 60%) and secondary aerosols (11%) from predominantly domestic coal burning.
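The strongest positive and negative pairs can be read off the correlation matrix programmatically rather than from the heatmap. The following is a minimal sketch on synthetic data: the helper `strongest_pairs` is hypothetical, and the simulated values are not the Sebokeng measurements, so the printed coefficients will not match the figures quoted above.

```python
import numpy as np
import pandas as pd


def strongest_pairs(df):
    """Return (pair, r) for the most positive and most negative correlations."""
    corr = df.corr()  # Pearson by default; NaNs are excluded pairwise
    # Keep only the upper triangle so each pair is counted once
    # and the diagonal (always 1.0) is ignored.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return (pairs.idxmax(), pairs.max()), (pairs.idxmin(), pairs.min())


# Synthetic stand-in data with a built-in PM link and an O3/NO2 trade-off.
rng = np.random.default_rng(0)
n = 1000
pm10 = rng.normal(45, 15, n)
no2 = rng.normal(25, 8, n)
demo = pd.DataFrame({
    "PM25": 0.6 * pm10 + rng.normal(0, 6, n),
    "PM10": pm10,
    "NO2": no2,
    "O3": 60 - 0.8 * no2 + rng.normal(0, 8, n),
})

(pos_pair, pos_r), (neg_pair, neg_r) = strongest_pairs(demo)
print(f"Strongest positive: {pos_pair} r={pos_r:.2f}")
print(f"Strongest negative: {neg_pair} r={neg_r:.2f}")
```

Applied to the report's own DataFrame, the call would be `strongest_pairs(data)`.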
Data was notably missing during the following periods:
• May 2011 to January 2012
• March 2014
• January 2015 to January 2016 (the largest period of missing data)
• May 2017 to July 2017
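Gaps like those listed above can be located by grouping consecutive missing days into runs. The sketch below shows one way to do this for a single daily series; the helper `missing_runs` and its `min_days` parameter are hypothetical names, and the demo series is synthetic.

```python
import numpy as np
import pandas as pd


def missing_runs(series, min_days=1):
    """List contiguous runs of missing values as (start, end, days)."""
    is_na = series.isna()
    # A new run id begins whenever the missing/observed flag flips.
    run_id = is_na.ne(is_na.shift()).cumsum()
    runs = []
    for _, block in series[is_na].groupby(run_id[is_na]):
        if len(block) >= min_days:
            runs.append((block.index[0], block.index[-1], len(block)))
    return pd.DataFrame(runs, columns=["start", "end", "days"])


# Synthetic daily series with a 3-day gap and a 30-day gap.
idx = pd.date_range("2011-01-01", periods=60, freq="D")
series = pd.Series(np.arange(60.0), index=idx)
series.iloc[5:8] = np.nan
series.iloc[20:50] = np.nan

gaps = missing_runs(series)
print(gaps)
```

On the report's data this could be called per pollutant, for example `missing_runs(data['PM10'], min_days=14)`, to keep only the longer outages.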
print("\n", f"Missing Data: \n{data.isnull().sum()}", "\n")
Justification for Using MICE Imputation
A substantial share of the data is missing, and we assumed it to be missing at random (MAR). Multivariate Imputation by Chained Equations (MICE) was chosen because it estimates missing values from the relationships between multiple variables. The method fills gaps by iteratively generating predicted values for the missing entries (Kunal, 2024). In each cycle, the missing values of one variable are estimated using the information from the other variables. The cycles are repeated until the results stabilize, indicating that convergence has been achieved (Kunal, 2024).
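To make the chained-equations cycle concrete, the sketch below implements a stripped-down version with ordinary least squares. It is an illustration of the idea only, not the miceforest implementation used in this report: it produces a single deterministic imputation, whereas full MICE draws from predictive distributions and generates multiple imputed datasets. The function name `chained_imputation` and the synthetic data are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def chained_imputation(df, n_iter=10):
    """Simplified chained-equations imputation using linear regression."""
    missing = df.isna()
    filled = df.fillna(df.mean())  # starting point: column means
    for _ in range(n_iter):  # repeat cycles so estimates can settle
        for col in df.columns:
            miss = missing[col].to_numpy()
            if not miss.any() or miss.all():
                continue
            # Regress this column on all others, using current fills.
            others = filled.drop(columns=col).to_numpy()
            X = np.column_stack([np.ones(len(filled)), others])
            y = filled[col].to_numpy()
            beta, *_ = np.linalg.lstsq(X[~miss], y[~miss], rcond=None)
            # Overwrite only the originally missing entries.
            filled.loc[miss, col] = X[miss] @ beta
    return filled


# Synthetic demo: PM25 depends on PM10, then 20% of each is removed.
rng = np.random.default_rng(42)
n = 500
pm10 = rng.normal(45, 10, n)
truth = pd.DataFrame({
    "PM10": pm10,
    "PM25": 0.6 * pm10 + rng.normal(0, 2, n),
    "NO2": rng.normal(25, 5, n),
})
observed = truth.copy()
observed.loc[rng.random(n) < 0.2, "PM25"] = np.nan
observed.loc[rng.random(n) < 0.2, "PM10"] = np.nan

imputed = chained_imputation(observed)
print(imputed.isna().sum())
```

Because each variable is predicted from the others, the imputed PM25 values track PM10 wherever PM10 was observed, which a plain column-mean fill cannot do.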
pip install miceforest