
The energy sector involves intricate experiments to improve efficiency and sustainability. Proper experimental design helps to maximize insights and minimize errors. There are two common types of experimental design: factorial designs, which study multiple independent variables within a single experiment, and randomized block designs, which group experimental units to control variance. Understanding when to use each design is crucial for energy-related studies.

An environmental research team is investigating the impact of various fuel sources on CO2 emissions across different geographical regions. The goal is to understand which assigned fuel source contributes the most to CO2 emissions and whether this varies depending on location. The team has collected data from four distinct geographical regions: North, South, East, and West. In each region, multiple fuel sources—Natural Gas, Biofuel, and Coal—are used to generate energy. The resulting CO2 emissions are measured to evaluate the environmental impact of each fuel source.

As the data scientist on this project, you have access to two datasets, each collected under one of the two experimental designs mentioned above. The aim is to determine whether a factorial design or a randomized block design matches the experimental setup described above, and then to analyze the corresponding dataset to identify key patterns and insights.

# Import required libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import f_oneway, ttest_ind
# multipletests lives in the public statsmodels.stats.multitest module
from statsmodels.stats.multitest import multipletests

# Load datasets
energy_design_a = pd.read_csv('energy_design_a.csv')
energy_design_b = pd.read_csv('energy_design_b.csv')

1 - Identifying the experimental design

# Exploring the data
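# Print both previews explicitly; a bare expression is only displayed
# when it is the last statement in a notebook cell
print(energy_design_a.head())
print(energy_design_b.head())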
design = 'randomized_block'
print(design)
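A quick way to look at the structure behind a design is to count observations for each region and fuel source combination. The sketch below uses a small hypothetical table (`toy` and its values are made up for illustration and are not the project's data), with column names assumed to match those used in the cells that follow. A count table alone does not settle which design was used; it only shows whether every fuel source was observed in every region and whether the cells are balanced.

```python
import pandas as pd

# Hypothetical toy data, NOT the project's datasets
toy = pd.DataFrame({
    "Geographical_Region": ["North", "North", "North", "South", "South", "South"],
    "Fuel_Source": ["Coal", "Biofuel", "Natural_Gas", "Coal", "Biofuel", "Natural_Gas"],
    "CO2_Emissions": [30.1, 12.4, 20.3, 33.8, 11.9, 22.5],
})

# Number of observations for each region / fuel source combination
counts = pd.crosstab(toy["Geographical_Region"], toy["Fuel_Source"])
print(counts)

# True if every fuel source was observed in every region
all_cells_observed = bool((counts > 0).all().all())
# True if every combination has the same number of observations
balanced = bool(counts.to_numpy().min() == counts.to_numpy().max())
print(all_cells_observed, balanced)
```

The same `pd.crosstab` call could be run on `energy_design_a` and `energy_design_b` to compare how the two datasets are laid out.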

2 - Create a boxplot

# Generate a boxplot of CO2 emissions by geographical region, colored by fuel source
sns.boxplot(x='Geographical_Region', y='CO2_Emissions', hue='Fuel_Source', data=energy_design_b)
plt.title('CO2 Emissions by Geographical Region and Fuel Source')
plt.xlabel('Geographical Region')
plt.ylabel('CO2 Emissions (tons)')
plt.show()
highest_co2_region = 'South'
highest_co2_source = 'Coal'
print(highest_co2_region)
print(highest_co2_source)

3 - Apply a statistical test

# Use a statistical test to determine if there is a significant difference in CO2 emissions based on fuel source and geographical region
test_results = energy_design_b.groupby('Geographical_Region').apply(
    lambda x: f_oneway(
        x[x['Fuel_Source'] == 'Natural_Gas']['CO2_Emissions'],
        x[x['Fuel_Source'] == 'Biofuel']['CO2_Emissions'],
        x[x['Fuel_Source'] == 'Coal']['CO2_Emissions'])
)
print(test_results)

All p-values are below the 0.05 significance level, so within each region there is a statistically significant difference in CO2 emissions between fuel sources.
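The per-region results can be easier to read as a table with one row per region. The following is a minimal sketch on synthetic data (the `toy` frame, its group means, and the random seed are assumptions chosen for illustration, not the project's measurements); it runs the same one-way ANOVA with `f_oneway` inside each region and flags p-values below 0.05.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Synthetic data for illustration only (assumed means, unit variance)
rng = np.random.default_rng(0)
rows = []
for region in ["North", "South"]:
    for fuel, mean in [("Natural_Gas", 20.0), ("Biofuel", 10.0), ("Coal", 30.0)]:
        for value in rng.normal(loc=mean, scale=1.0, size=10):
            rows.append({"Geographical_Region": region,
                         "Fuel_Source": fuel,
                         "CO2_Emissions": value})
toy = pd.DataFrame(rows)

# One-way ANOVA across fuel sources, computed separately for each region
records = []
for region, block in toy.groupby("Geographical_Region"):
    samples = [g["CO2_Emissions"].to_numpy()
               for _, g in block.groupby("Fuel_Source")]
    f_stat, p_value = f_oneway(*samples)
    records.append({"Geographical_Region": region, "F": f_stat, "p_value": p_value})

anova_by_region = pd.DataFrame(records)
anova_by_region["significant"] = anova_by_region["p_value"] < 0.05
print(anova_by_region)
```

Grouping by `Fuel_Source` instead of filtering on hard-coded labels avoids silently passing an empty sample to `f_oneway` if a label is spelled differently in the data.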

4 - Perform a correction
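Running several tests on the same data inflates the chance of a false positive, so the p-values are usually adjusted for multiple comparisons. The source imports `ttest_ind` and `multipletests` but does not show this step, so the following is only a sketch under assumptions: pairwise t-tests between fuel sources, a Bonferroni adjustment, and synthetic samples (the `samples` dictionary and its values are made up for illustration).

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Synthetic samples for illustration only (assumed means, unit variance)
rng = np.random.default_rng(1)
samples = {
    "Natural_Gas": rng.normal(20.0, 1.0, 15),
    "Biofuel": rng.normal(10.0, 1.0, 15),
    "Coal": rng.normal(30.0, 1.0, 15),
}

# Independent two-sample t-test for every pair of fuel sources
pairs = list(combinations(samples, 2))
raw_p = [ttest_ind(samples[a], samples[b]).pvalue for a, b in pairs]

# Bonferroni: each p-value is multiplied by the number of tests (capped at 1)
reject, corrected_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p_adj, is_rejected in zip(pairs, corrected_p, reject):
    print(f"{a} vs {b}: corrected p = {p_adj:.3g}, reject H0 = {is_rejected}")
```

Bonferroni is the most conservative of the common options; `multipletests` also accepts other `method` values such as `"holm"` or `"fdr_bh"` if a less strict adjustment is preferred.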