Unveiling Trends in Renewable Energy 🌍🔋

The Data Scientist Master

📖 Background

The race to net-zero emissions is heating up. As nations work to combat climate change and meet rising energy demands, renewable energy has emerged as a cornerstone of the clean transition. Solar, wind, and hydro are revolutionizing how we power our lives. Some countries are leading the charge, while others are falling behind. But which nations are making the biggest impact? What’s driving their success? And what lessons can we learn to accelerate the green energy transition?

As a data scientist at NextEra Energy, one of the world’s leading renewable energy providers, your role is to move beyond exploration into prediction. Using a rich, real-world dataset, you’ll build models to forecast renewable energy production, drawing on indicators like GDP, population, carbon emissions, and policy metrics.

With the world watching, your model could help shape smarter investments, forward-thinking policies, and a faster transition to clean energy. 🔮⚡🌱

💾 The data

Your team has gathered a global renewable energy dataset ("Training_set_augmented.csv") covering energy production, investments, policies, and economic factors shaping renewable adoption worldwide:

🌍 Basic Identifiers

  • Country – Country name
  • Year – Calendar year (YYYY)
  • Energy Type – Type of renewable energy (e.g., Solar, Wind)
⚡ Energy Metrics
  • Production (GWh) – Renewable energy produced (Gigawatt-hours)
  • Installed Capacity (MW) – Installed renewable capacity (Megawatts)
  • Investments (USD) – Total investment in renewables (US Dollars)
  • Energy Consumption (GWh) – Total national energy use
  • Energy Storage Capacity (MWh) – Capacity of energy storage systems
  • Grid Integration Capability (Index) – Scale of 0–1; ability to handle renewables in grid
  • Electricity Prices (USD/kWh) – Average cost of electricity
  • Energy Subsidies (USD) – Government subsidies for energy sector
  • Proportion of Energy from Renewables (%) – Share of renewables in total energy mix
🧠 Innovation & Tech
  • R&D Expenditure (USD) – R&D spending on renewables
  • Renewable Energy Patents – Number of patents filed
  • Innovation Index (Index) – Global innovation score (0–100)
💰 Economy & Policy
  • GDP (USD) – Gross domestic product
  • Population – Total population
  • Government Policies – Number of policies supporting renewables
  • Renewable Energy Targets – Whether national targets are in place (1 = Yes, 0 = No)
  • Public-Private Partnerships in Energy – Number of active collaborations
  • Energy Market Liberalization (Index) – Scale of 0–1
🧑‍🤝‍🧑 Social & Governance
  • Ease of Doing Business (Score) – World Bank index (0–100)
  • Regulatory Quality – Governance score (-2.5 to 2.5)
  • Political Stability – Governance score (-2.5 to 2.5)
  • Control of Corruption – Governance score (-2.5 to 2.5)
🌿 Environment & Resources
  • CO2 Emissions (MtCO2) – Emissions in million metric tons
  • Average Annual Temperature (°C) – Country’s avg. temp
  • Solar Irradiance (kWh/m²/day) – Solar energy availability
  • Wind Speed (m/s) – Average wind speed
  • Hydro Potential (Index) – Relative hydropower capability (0–1)
  • Biomass Availability (Tons/year) – Total available biomass
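
Several of the fields above come with a documented range (0–1 indices, 0–100 scores, governance scores from -2.5 to 2.5), so a quick sanity check before modelling can flag values that fall outside it. The following is a minimal sketch, not part of the original notebook: the helper `count_out_of_range`, the `EXPECTED_RANGES` mapping, and the column names without unit suffixes are all assumptions, and the toy frame is made-up data for illustration only.

```python
import pandas as pd

# Documented ranges taken from the data dictionary. The column names are
# assumptions and may differ from the actual CSV headers.
EXPECTED_RANGES = {
    "Grid Integration Capability": (0.0, 1.0),
    "Energy Market Liberalization": (0.0, 1.0),
    "Hydro Potential": (0.0, 1.0),
    "Innovation Index": (0.0, 100.0),
    "Ease of Doing Business": (0.0, 100.0),
    "Regulatory Quality": (-2.5, 2.5),
    "Political Stability": (-2.5, 2.5),
    "Control of Corruption": (-2.5, 2.5),
}


def count_out_of_range(frame, ranges=EXPECTED_RANGES):
    """Return {column: count of non-missing values outside the documented range}."""
    counts = {}
    for name, (low, high) in ranges.items():
        if name not in frame.columns:
            continue  # skip columns that are absent or named differently
        values = frame[name].dropna()
        counts[name] = int(((values < low) | (values > high)).sum())
    return counts


# Tiny made-up frame for illustration only (not real data)
toy = pd.DataFrame({
    "Hydro Potential": [0.2, 1.4, 0.9],
    "Regulatory Quality": [-3.0, 0.5, 2.5],
})
violations = count_out_of_range(toy)
print(violations)
```

On the real data, the same call would be `count_out_of_range(df)` once the frame is loaded; columns missing from the mapping or from the frame are simply skipped.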

EDA

# Reading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('data/Training_set_augmented.csv')
df.isna().sum()
df.describe()
df.dtypes
# Changing data types
df['Country'] = df['Country'].astype('category')
df['Energy Type'] = df['Energy Type'].astype('category')
df['Renewable Energy Targets'] = df['Renewable Energy Targets'].astype('bool')
df['Energy Market Liberalization'] = df['Energy Market Liberalization'].astype('float64')
# Move the target column to the end of the frame
col = df.pop('Production (GWh)')
df['Production (GWh)'] = col
df.head()
# Correlation of each numeric column with production
correlation_with_production = df.corr(numeric_only=True)['Production (GWh)'].sort_values(ascending=False)
print(correlation_with_production)
# Plots: histogram of each selected feature, with its mean marked

columns = [
    'Average Annual Temperature',
    'Solar Irradiance',
    'Electricity Prices',
    'Proportion of Energy from Renewables',
    'Wind Speed',
    'Hydro Potential',
    'CO2 Emissions',
    'Energy Consumption',
    'Population',
    'Biomass Availability',
    'Energy Storage Capacity',
    'Grid Integration Capability',
    'Installed Capacity (MW)',
    'R&D Expenditure'
]

plt.figure(figsize=(18, 15))

n_cols = 4
n_rows = (len(columns) + n_cols - 1) // n_cols

for i, col in enumerate(columns, 1):
    plt.subplot(n_rows, n_cols, i)
    data = df[col]
    plt.hist(data, bins=30, color='skyblue', edgecolor='black')
    mean_val = np.mean(data)
    plt.axvline(mean_val, color='red', linestyle='dashed', linewidth=2, label=f'Mean: {mean_val:.2f}')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.legend()

plt.tight_layout()
plt.show()
# Defining X and y
X_lin_coef = df.loc[:, df.columns.difference(['Production (GWh)', 'Energy Type', 'Country'])]
y_lin_coef = df['Production (GWh)']
# Standardized linear regression coefficients
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Scale X and y
scaler_X = StandardScaler()
X_scaled = scaler_X.fit_transform(X_lin_coef[columns])

scaler_y = StandardScaler()
y_scaled = scaler_y.fit_transform(y_lin_coef.values.reshape(-1, 1)).flatten()

# Fit linear regression
model = LinearRegression()
model.fit(X_scaled, y_scaled)

# Plot bar chart
plt.figure(figsize=(12, 6))
plt.bar(X_lin_coef[columns].columns, model.coef_, color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.title('Standardized Coefficients from Linear Regression')
plt.ylabel('Coefficient Value')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
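
Scaling both X and y before fitting makes the coefficients unit-free, so they can be compared across features measured in very different units. For ordinary least squares with an intercept, each standardized coefficient equals the raw coefficient multiplied by sd(x) / sd(y). The sketch below checks that identity on synthetic data; the names `X_demo`, `y_demo`, `raw`, and `std_model` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative only)
X_demo = np.column_stack([
    rng.normal(0, 1, 200),
    rng.normal(0, 1000, 200),
])
y_demo = 3.0 * X_demo[:, 0] + 0.002 * X_demo[:, 1] + rng.normal(0, 0.5, 200)

# Fit on the raw scale
raw = LinearRegression().fit(X_demo, y_demo)

# Fit on the standardized scale, as in the cell above
X_std = StandardScaler().fit_transform(X_demo)
y_std = StandardScaler().fit_transform(y_demo.reshape(-1, 1)).ravel()
std_model = LinearRegression().fit(X_std, y_std)

# Rescale the raw coefficients: beta_std = beta_raw * sd(x) / sd(y).
# NumPy's default std (ddof=0) matches what StandardScaler uses.
rescaled = raw.coef_ * X_demo.std(axis=0) / y_demo.std()
print(np.allclose(std_model.coef_, rescaled))
```

The raw coefficients here differ by three orders of magnitude purely because of units, which is why the bar chart is drawn from the standardized fit.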
# SHAP analysis
from sklearn.ensemble import RandomForestRegressor
import shap

# Use every column except the target as a feature
X_shap = df.loc[:, df.columns.difference(['Production (GWh)'])]
y_shap = df['Production (GWh)']
# One-hot encode the categorical features (Country, Energy Type)
X_encoded = pd.get_dummies(X_shap)

# Initialize model
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Train on encoded features 
rf.fit(X_encoded, y_shap)

# SHAP explanation

# Use TreeExplainer for Random Forest
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_encoded)

# Summary plot
shap.summary_plot(
    shap_values, 
    X_encoded, 
    plot_type='bar', 
    show=False,
    max_display=X_encoded.shape[1] 
)

# Adjust layout
plt.xlabel("Mean |SHAP value|", fontsize=12) 
plt.title("SHAP Feature Importance (Bar Plot)", fontsize=14)
plt.tight_layout()
plt.xticks(rotation=45, ha='right')  
plt.subplots_adjust(bottom=0.25)    

plt.show()
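
The random forest above is fit on every row, so the SHAP ranking describes what the model learned but says nothing about out-of-sample accuracy. Because the stated goal is forecasting production, a hold-out check is a natural next step. The sketch below shows one way to do it on synthetic data from `make_regression`, which stands in for `X_encoded` and `y_shap`; the 80/20 random split and the metric choice are assumptions, and for a true forecast a split by Year may be more appropriate.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_encoded / y_shap (illustrative only)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same model settings as the SHAP cell, fit on the training rows only
rf_demo = RandomForestRegressor(n_estimators=100, random_state=42)
rf_demo.fit(X_train, y_train)

# Score on rows the model has not seen
pred = rf_demo.predict(X_test)
print(f"R^2: {r2_score(y_test, pred):.3f}")
print(f"MAE: {mean_absolute_error(y_test, pred):.2f}")
```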