Renewable Energy Prediction: One Watt at a Time

Unveiling Trends in Renewable Energy 🌍🔋

The Data Scientist Master

📖 Background

The race to net-zero emissions is heating up. As nations work to combat climate change and meet rising energy demands, renewable energy has emerged as a cornerstone of the clean transition. Solar, wind, and hydro are revolutionizing how we power our lives. Some countries are leading the charge, while others are falling behind. But which nations are making the biggest impact? What’s driving their success? And what lessons can we learn to accelerate the green energy transition?

As a data scientist at NextEra Energy, one of the world’s leading renewable energy providers, your role is to move beyond exploration and into prediction. Using a rich, real-world dataset, you’ll build models to forecast renewable energy production, drawing on indicators like GDP, population, carbon emissions, and policy metrics.

With the world watching, your model could help shape smarter investments, forward-thinking policies, and a faster transition to clean energy. 🔮⚡🌱

💾 The data

Your team has gathered a global renewable energy dataset ("Training_set_augmented.csv") covering energy production, investments, policies, and economic factors shaping renewable adoption worldwide:

🌍 Basic Identifiers

Country – Country name
Year – Calendar year (YYYY)
Energy Type – Type of renewable energy (e.g., Solar, Wind)

⚡ Energy Metrics

Production (GWh) – Renewable energy produced (Gigawatt-hours)
Installed Capacity (MW) – Installed renewable capacity (Megawatts)
Investments (USD) – Total investment in renewables (US Dollars)
Energy Consumption (GWh) – Total national energy use
Energy Storage Capacity (MWh) – Capacity of energy storage systems
Grid Integration Capability (Index) – Scale of 0–1; ability of the grid to handle renewables
Electricity Prices (USD/kWh) – Average cost of electricity
Energy Subsidies (USD) – Government subsidies for energy sector
Proportion of Energy from Renewables (%) – Share of renewables in total energy mix

🧠 Innovation & Tech

R&D Expenditure (USD) – R&D spending on renewables
Renewable Energy Patents – Number of patents filed
Innovation Index (Index) – Global innovation score (0–100)

💰 Economy & Policy

GDP (USD) – Gross domestic product
Population – Total population
Government Policies – Number of policies supporting renewables
Renewable Energy Targets – Whether national targets are in place (1 = Yes, 0 = No)
Public-Private Partnerships in Energy – Number of active collaborations
Energy Market Liberalization (Index) – Scale of 0–1

🧑‍🤝‍🧑 Social & Governance

Ease of Doing Business (Score) – World Bank index (0–100)
Regulatory Quality – Governance score (-2.5 to 2.5)
Political Stability – Governance score (-2.5 to 2.5)
Control of Corruption – Governance score (-2.5 to 2.5)

🌿 Environment & Resources

CO2 Emissions (MtCO2) – Emissions in million metric tons
Average Annual Temperature (°C) – Country’s average annual temperature
Solar Irradiance (kWh/m²/day) – Solar energy availability
Wind Speed (m/s) – Average wind speed
Hydro Potential (Index) – Relative hydropower capability (0–1)
Biomass Availability (Tons/year) – Total available biomass
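
Several fields above come with documented bounds (0–1 indices, 0–100 scores, governance scores from -2.5 to 2.5), so a quick range check can flag rows that violate the data dictionary before any modelling. The following is a minimal sketch, not part of the original notebook: the helper name `count_out_of_range` is hypothetical, and the column headers are assumed to match the names used later in the code (without unit suffixes), so the keys may need adjusting to the actual CSV.

```python
import pandas as pd

# Bounds taken from the data dictionary above. The exact column headers in the
# CSV are an assumption; adjust the keys to match df.columns.
DOCUMENTED_BOUNDS = {
    "Grid Integration Capability": (0.0, 1.0),
    "Energy Market Liberalization": (0.0, 1.0),
    "Hydro Potential": (0.0, 1.0),
    "Innovation Index": (0.0, 100.0),
    "Ease of Doing Business": (0.0, 100.0),
    "Regulatory Quality": (-2.5, 2.5),
    "Political Stability": (-2.5, 2.5),
    "Control of Corruption": (-2.5, 2.5),
}


def count_out_of_range(frame, bounds=DOCUMENTED_BOUNDS):
    """Return {column: number of non-missing values outside [low, high]}."""
    counts = {}
    for column, (low, high) in bounds.items():
        if column not in frame.columns:
            continue  # skip headers that are named differently in the file
        values = frame[column].dropna()
        # Series.between is inclusive on both ends by default
        counts[column] = int((~values.between(low, high)).sum())
    return counts


# Toy frame (made-up values, not from the dataset) with one violation
toy = pd.DataFrame({
    "Grid Integration Capability": [0.2, 0.9, 1.4],
    "Regulatory Quality": [-2.0, 0.3, 2.5],
})
print(count_out_of_range(toy))
```

On the real data, calling `count_out_of_range(df)` after loading would list how many values per column fall outside the documented scale.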

EDA

# Reading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('data/Training_set_augmented.csv')

df.isna().sum()
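
The raw per-column counts from `isna().sum()` can be hard to scan on a wide table. As a possible complement, the sketch below keeps only columns that actually have gaps and adds the missing share in percent. The helper name `missing_summary` is hypothetical and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Columns with at least one missing value, with counts and percentages."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),
        "percent": frame.isna().mean() * 100,
    })
    # Keep only columns with gaps, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame (made-up values, not from the dataset)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1, 2, 3, 4],
    "c": [np.nan, 1.0, 2.0, 3.0],
})
print(missing_summary(toy))
```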

df.describe()

df.dtypes

#Changing data types
df['Country'] = df['Country'].astype('category')
df['Energy Type'] = df['Energy Type'].astype('category')
df['Renewable Energy Targets'] = df['Renewable Energy Targets'].astype('bool')
df['Energy Market Liberalization'] = df['Energy Market Liberalization'].astype('float64')

col = df.pop('Production (GWh)')
df['Production (GWh)'] = col
df.head()

# Pearson correlation of each numeric feature with production, strongest first
correlation_with_production = df.corr(numeric_only=True)['Production (GWh)'].sort_values(ascending=False)
print(correlation_with_production)
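
`DataFrame.corr` defaults to Pearson correlation, which only measures linear association. A rank-based (Spearman) comparison can reveal monotonic but non-linear relationships that Pearson understates. The sketch below uses made-up toy data rather than the dataset, and computes Spearman as Pearson on ranks so that it needs nothing beyond pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Toy data (not from the dataset): y grows monotonically but non-linearly with x
toy = pd.DataFrame({"x": np.arange(1, 11, dtype=float)})
toy["y"] = toy["x"] ** 3

pearson = toy.corr()["y"]["x"]
# Spearman is Pearson computed on ranks, so ranking first gives the same result
spearman = toy.rank().corr()["y"]["x"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the notebook's frame, the equivalent call would be `df.corr(method='spearman', numeric_only=True)['Production (GWh)']`.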

# Plots: histogram of each selected feature, with the mean marked in red

columns = [
    'Average Annual Temperature',
    'Solar Irradiance',
    'Electricity Prices',
    'Proportion of Energy from Renewables',
    'Wind Speed',
    'Hydro Potential',
    'CO2 Emissions',
    'Energy Consumption',
    'Population',
    'Biomass Availability',
    'Energy Storage Capacity',
    'Grid Integration Capability',
    'Installed Capacity (MW)',
    'R&D Expenditure'
]

plt.figure(figsize=(18, 15))

n_cols = 4
n_rows = (len(columns) + n_cols - 1) // n_cols

for i, col in enumerate(columns, 1):
    plt.subplot(n_rows, n_cols, i)
    data = df[col]
    plt.hist(data, bins=30, color='skyblue', edgecolor='black')
    mean_val = np.mean(data)
    plt.axvline(mean_val, color='red', linestyle='dashed', linewidth=2, label=f'Mean: {mean_val:.2f}')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.legend()

plt.tight_layout()
plt.show()

# Define X and y; the target and the categorical identifiers are excluded from X
X_lin_coef = df.loc[:, df.columns.difference(['Production (GWh)', 'Energy Type', 'Country'])]
y_lin_coef = df['Production (GWh)']

# Standardized linear regression coefficients
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# Scale X and y
scaler_X = StandardScaler()
X_scaled = scaler_X.fit_transform(X_lin_coef[columns])

scaler_y = StandardScaler()
y_scaled = scaler_y.fit_transform(y_lin_coef.values.reshape(-1, 1)).flatten()

# Fit linear regression
model = LinearRegression()
model.fit(X_scaled, y_scaled)

# Plot bar chart
plt.figure(figsize=(12, 6))
plt.bar(X_lin_coef[columns].columns, model.coef_, color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.title('Standardized Coefficients from Linear Regression')
plt.ylabel('Coefficient Value')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
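
Scaling both X and y before fitting is what makes the bars comparable across features: each standardized coefficient equals the raw coefficient multiplied by sd(x) / sd(y). The sketch below checks that identity on synthetic data using NumPy least squares instead of scikit-learn. The helper `ols_coefficients` and all data here are hypothetical and only illustrate the relationship.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two synthetic features on very different scales
X = rng.normal(size=(n, 2)) * np.array([1.0, 50.0])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.5, size=n)


def ols_coefficients(features, target):
    """Least-squares slopes from a fit with an intercept (intercept dropped)."""
    design = np.column_stack([np.ones(len(features)), features])
    solution, *_ = np.linalg.lstsq(design, target, rcond=None)
    return solution[1:]


raw = ols_coefficients(X, y)

# Standardize with the population std (ddof=0), as StandardScaler does
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_std = (y - y.mean()) / y.std()
standardized = ols_coefficients(X_std, y_std)

# Identity: standardized coefficient = raw coefficient * sd(x) / sd(y)
rescaled = raw * X.std(axis=0) / y.std()
print(np.allclose(standardized, rescaled))
```

Note that strongly correlated features can still share or trade off weight between their coefficients, so the bar heights are best read as a rough ranking rather than a causal attribution.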

# SHAP analysis
from sklearn.ensemble import RandomForestRegressor
import shap

# Build the feature matrix and target from df, then one-hot encode
# the categorical features (Country, Energy Type)
X_shap = df.loc[:, df.columns.difference(['Production (GWh)'])]
y_shap = df['Production (GWh)']
X_encoded = pd.get_dummies(X_shap)

# Initialize model
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Train on encoded features 
rf.fit(X_encoded, y_shap)

# SHAP explanation: TreeExplainer is designed for tree ensembles such as Random Forest
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_encoded)

# Summary plot
shap.summary_plot(
    shap_values, 
    X_encoded, 
    plot_type='bar', 
    show=False,
    max_display=X_encoded.shape[1] 
)

# Adjust layout
plt.xlabel("Mean |SHAP value|", fontsize=12) 
plt.title("SHAP Feature Importance (Bar Plot)", fontsize=14)
plt.tight_layout()
plt.xticks(rotation=45, ha='right')  
plt.subplots_adjust(bottom=0.25)    

plt.show()
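
Because `pd.get_dummies` splits each categorical feature into one column per level, the bar plot shows `Country` and `Energy Type` as many small bars. Since SHAP values are additive per sample, one way to compare them with the numeric features is to sum the dummy columns back into their source feature before averaging. The sketch below is an illustration under assumptions: `aggregate_shap_by_feature` is a hypothetical helper, the matrix is made-up data standing in for `shap_values`, and the prefix match relies on the default `get_dummies` naming (`<feature>_<level>`).

```python
import numpy as np
import pandas as pd


def aggregate_shap_by_feature(shap_matrix, encoded_columns, categorical_features):
    """Sum per-sample SHAP values of dummy columns into their source feature,
    then return mean |SHAP| per feature, sorted descending."""
    values = pd.DataFrame(shap_matrix, columns=encoded_columns)

    def source_feature(column):
        for feature in categorical_features:
            if column.startswith(feature + "_"):  # get_dummies default naming
                return feature
        return column  # numeric columns map to themselves

    totals = {}
    for column in values.columns:
        key = source_feature(column)
        totals[key] = totals.get(key, 0) + values[column]

    summed = pd.DataFrame(totals)
    return summed.abs().mean().sort_values(ascending=False)


# Made-up SHAP-like matrix: 2 samples x 4 encoded columns
toy_columns = ["Country_A", "Country_B", "Energy Type_Solar", "GDP"]
toy_shap = np.array([
    [0.5, -0.2, 0.1, 1.0],
    [-0.5, 0.2, 0.3, -2.0],
])
importance = aggregate_shap_by_feature(toy_shap, toy_columns, ["Country", "Energy Type"])
print(importance)
```

With the notebook's objects, the call would look like `aggregate_shap_by_feature(shap_values, X_encoded.columns, ['Country', 'Energy Type'])`, assuming `shap_values` is the 2-D array returned for a regressor.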
