Project: Exploring Experimental Design in the Energy Field

The energy sector involves intricate experiments to improve efficiency and sustainability. Proper experimental design helps to maximize insights and minimize errors. There are two common types of experimental design: factorial designs, which study multiple independent variables within a single experiment, and randomized block designs, which group experimental units to control variance. Understanding when to use each design is crucial for energy-related studies.

An environmental research team is investigating the impact of various fuel sources on CO2 emissions across different geographical regions. The goal is to understand which assigned fuel source contributes the most to CO2 emissions and whether this varies depending on location. The team has collected data from four distinct geographical regions: North, South, East, and West. In each region, multiple fuel sources—Natural Gas, Biofuel, and Coal—are used to generate energy. The resulting CO2 emissions are measured to evaluate the environmental impact of each fuel source.

As the data scientist on this project, you have access to two datasets, each representing data from one of the two mentioned experimental designs. The aim is to determine whether a factorial design or a randomized block design was used for the given experimental setup above and to analyze the dataset to identify key patterns and insights.

# Import required libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f_oneway, ttest_ind
from statsmodels.sandbox.stats.multicomp import multipletests

# Load datasets
energy_design_a = pd.read_csv("energy_design_a.csv")
energy_design_b = pd.read_csv("energy_design_b.csv")

energy_design_a.head()

energy_design_b.head()

design = "randomized_block"

data = energy_design_b

sns.boxplot(
    x="Geographical_Region",
    y="CO2_Emissions",
    data=data,
    hue="Fuel_Source"
)
plt.title('CO2 Emissions by Geographical Region and Fuel Source')
plt.xlabel('Geographical Region')
plt.ylabel('CO2 Emissions (tons)')
plt.show()

# Identify highest median CO2 emission region and source after viewing the plot
highest_co2_region = "South"
highest_co2_source = "Coal"

# Group by Geographical Region and apply ANOVA to check significance
test_results = data.groupby('Geographical_Region').apply(
    lambda x: f_oneway(
        x[x['Fuel_Source'] == 'Natural_Gas']['CO2_Emissions'],
        x[x['Fuel_Source'] == 'Biofuel']['CO2_Emissions'],
        x[x['Fuel_Source'] == 'Coal']['CO2_Emissions']
    )
)

print("Test Results:", test_results)

# Ensure test_results show significant results (one or more with p-value < 0.05)
if any(result.pvalue < 0.05 for result in test_results):
    bonferroni_p_values = []

    # Perform pairwise comparisons for Bonferroni correction
    for zone in ['North', 'South', 'East', 'West']:
        fuels = ['Natural_Gas', 'Biofuel', 'Coal']
        comparisons = [
            (fuels[i], fuels[j]) \
            for i in range(len(fuels)) \
            for j in range(i + 1, len(fuels))
        ]

        for fuel1, fuel2 in comparisons:
            group1 = data[(
                data['Geographical_Region'] == zone
            ) & (
                data['Fuel_Source'] == fuel1
                )]['CO2_Emissions']
            group2 = data[(
                data['Geographical_Region'] == zone
            ) & (
                data['Fuel_Source'] == fuel2
            )]['CO2_Emissions']
            _, p_val = ttest_ind(group1, group2)
            bonferroni_p_values.append(p_val)

    # Apply Bonferroni correction for multiple comparisons
    diff_results = multipletests(bonferroni_p_values, alpha=0.05, method='bonferroni')

print("Bonferroni Corrected P-values:", diff_results[1])