Unveiling Trends in Renewable Energy
The Data Scientist Master
Background
The race to net-zero emissions is heating up. As nations work to combat climate change and meet rising energy demands, renewable energy has emerged as a cornerstone of the clean transition. Solar, wind, and hydro are revolutionizing how we power our lives. Some countries are leading the charge, while others are falling behind. But which nations are making the biggest impact? What's driving their success? And what lessons can we learn to accelerate the green energy transition?
As a data scientist at NextEra Energy, one of the world's leading renewable energy providers, your role is to move beyond exploration and into prediction. Using a rich, real-world dataset, you'll build models to forecast renewable energy production, drawing on indicators like GDP, population, carbon emissions, and policy metrics.
With the world watching, your model could help shape smarter investments, forward-thinking policies, and a faster transition to clean energy.
The data
Your team has gathered a global renewable energy dataset ("Training_set_augmented.csv") covering energy production, investments, policies, and economic factors shaping renewable adoption worldwide:
Basic Identifiers
Country: Country name
Year: Calendar year (YYYY)
Energy Type: Type of renewable energy (e.g., Solar, Wind)

Energy Metrics
Production (GWh): Renewable energy produced (Gigawatt-hours)
Installed Capacity (MW): Installed renewable capacity (Megawatts)
Investments (USD): Total investment in renewables (US Dollars)
Energy Consumption (GWh): Total national energy use
Energy Storage Capacity (MWh): Capacity of energy storage systems
Grid Integration Capability (Index): Scale of 0 to 1; ability of the grid to handle renewables
Electricity Prices (USD/kWh): Average cost of electricity
Energy Subsidies (USD): Government subsidies for the energy sector
Proportion of Energy from Renewables (%): Share of renewables in total energy mix

Innovation & Tech
R&D Expenditure (USD): R&D spending on renewables
Renewable Energy Patents: Number of patents filed
Innovation Index (Index): Global innovation score (0 to 100)

Economy & Policy
GDP (USD): Gross domestic product
Population: Total population
Government Policies: Number of policies supporting renewables
Renewable Energy Targets: Whether national targets are in place (1 = Yes, 0 = No)
Public-Private Partnerships in Energy: Number of active collaborations
Energy Market Liberalization (Index): Scale of 0 to 1

Social & Governance
Ease of Doing Business (Score): World Bank index (0 to 100)
Regulatory Quality: Governance score (-2.5 to 2.5)
Political Stability: Governance score (-2.5 to 2.5)
Control of Corruption: Governance score (-2.5 to 2.5)

Environment & Resources
CO2 Emissions (MtCO2): Emissions in million metric tons
Average Annual Temperature (°C): Country's average temperature
Solar Irradiance (kWh/m²/day): Solar energy availability
Wind Speed (m/s): Average wind speed
Hydro Potential (Index): Relative hydropower capability (0 to 1)
Biomass Availability (Tons/year): Total available biomass
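The documented scales in the data dictionary allow a quick sanity check before any modeling. The following is a minimal sketch, assuming the column names match the descriptions above without the unit suffix (the actual CSV headers may differ). The `EXPECTED_RANGES` mapping and the `out_of_range_counts` helper are hypothetical names introduced here for illustration, and the toy frame stands in for the real dataset.

```python
import pandas as pd

# Documented value ranges, taken from the data dictionary.
# Column names are assumptions; adjust them to the actual CSV headers.
EXPECTED_RANGES = {
    "Grid Integration Capability": (0.0, 1.0),
    "Energy Market Liberalization": (0.0, 1.0),
    "Hydro Potential": (0.0, 1.0),
    "Innovation Index": (0.0, 100.0),
    "Ease of Doing Business": (0.0, 100.0),
    "Regulatory Quality": (-2.5, 2.5),
    "Political Stability": (-2.5, 2.5),
    "Control of Corruption": (-2.5, 2.5),
}


def out_of_range_counts(frame, ranges=EXPECTED_RANGES):
    """Count values outside the documented range for each column present."""
    counts = {}
    for name, (low, high) in ranges.items():
        if name in frame.columns:
            # NaN would fail the range check, so drop missing values first.
            values = frame[name].dropna()
            # between() is inclusive on both ends by default.
            counts[name] = int((~values.between(low, high)).sum())
    return pd.Series(counts, dtype="int64")


# Toy data standing in for the real dataset (illustrative only).
toy = pd.DataFrame({
    "Hydro Potential": [0.2, 0.9, 1.4],
    "Regulatory Quality": [-2.5, 0.3, 2.5],
})
violations = out_of_range_counts(toy)
print(violations)
```

On the real data, the same call would be `out_of_range_counts(df)`; any non-zero count points to rows worth inspecting before they reach a model.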
EDA
# Reading data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data/Training_set_augmented.csv')

# Missing values per column
df.isna().sum()

# Summary statistics
df.describe()

# Column data types
df.dtypes

# Changing data types
df['Country'] = df['Country'].astype('category')
df['Energy Type'] = df['Energy Type'].astype('category')
df['Renewable Energy Targets'] = df['Renewable Energy Targets'].astype('bool')
df['Energy Market Liberalization'] = df['Energy Market Liberalization'].astype('float64')
# Move the target column to the end of the DataFrame
col = df.pop('Production (GWh)')
df['Production (GWh)'] = col
df.head()

# Correlation
correlation_with_production = df.corr(numeric_only=True)['Production (GWh)'].sort_values(ascending=False)
print(correlation_with_production)

# Plots
columns = [
    'Average Annual Temperature',
    'Solar Irradiance',
    'Electricity Prices',
    'Proportion of Energy from Renewables',
    'Wind Speed',
    'Hydro Potential',
    'CO2 Emissions',
    'Energy Consumption',
    'Population',
    'Biomass Availability',
    'Energy Storage Capacity',
    'Grid Integration Capability',
    'Installed Capacity (MW)',
    'R&D Expenditure'
]
plt.figure(figsize=(18, 15))
n_cols = 4
n_rows = (len(columns) + n_cols - 1) // n_cols  # ceiling division: enough rows for every plot

# One histogram per feature, with the mean marked as a dashed line
for i, col in enumerate(columns, 1):
    plt.subplot(n_rows, n_cols, i)
    data = df[col]
    plt.hist(data, bins=30, color='skyblue', edgecolor='black')
    mean_val = np.mean(data)
    plt.axvline(mean_val, color='red', linestyle='dashed', linewidth=2, label=f'Mean: {mean_val:.2f}')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.legend()
plt.tight_layout()
plt.show()
# Defining X and y
X_lin_coef = df.loc[:, df.columns.difference(['Production (GWh)', 'Energy Type', 'Country'])]
y_lin_coef = df['Production (GWh)']

# Linear coefficients
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
# Scale X and y
scaler_X = StandardScaler()
X_scaled = scaler_X.fit_transform(X_lin_coef[columns])
scaler_y = StandardScaler()
y_scaled = scaler_y.fit_transform(y_lin_coef.values.reshape(-1, 1)).flatten()
# Fit linear regression
model = LinearRegression()
model.fit(X_scaled, y_scaled)
# Plot bar chart
plt.figure(figsize=(12, 6))
plt.bar(X_lin_coef[columns].columns, model.coef_, color='skyblue')
plt.xticks(rotation=45, ha='right')
plt.title('Standardized Coefficients from Linear Regression')
plt.ylabel('Coefficient Value')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
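Because both the features and the target are standardized before fitting, each bar is a standardized coefficient. For ordinary least squares with an intercept, it equals the raw-unit coefficient multiplied by std(x_j) / std(y), which is what makes features on very different scales comparable. The following is a minimal sketch on synthetic data (the feature scales and true weights are made up for illustration) that checks this relationship numerically.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative only).
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] - 0.002 * X[:, 2] + rng.normal(size=200)

# Fit in raw units.
raw = LinearRegression().fit(X, y)

# Fit on standardized X and y, mirroring the notebook cell.
scaler_X = StandardScaler().fit(X)
scaler_y = StandardScaler().fit(y.reshape(-1, 1))
X_scaled = scaler_X.transform(X)
y_scaled = scaler_y.transform(y.reshape(-1, 1)).ravel()
std_model = LinearRegression().fit(X_scaled, y_scaled)

# Standardized coefficient = raw coefficient * std(x_j) / std(y).
converted = raw.coef_ * scaler_X.scale_ / scaler_y.scale_[0]

print(np.round(std_model.coef_, 3))
print(np.round(converted, 3))
```

The two printed vectors should agree. One caveat when reading the bar chart: if predictors are strongly correlated with each other, individual coefficients can be unstable, so the bar heights are best treated as a rough ranking rather than a precise measure of influence.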
# SHAP analysis
from sklearn.ensemble import RandomForestRegressor
import shap

# Use every column except the target as a feature
X_shap = df.loc[:, df.columns.difference(['Production (GWh)'])]
y_shap = df['Production (GWh)']

# Convert categorical features to numerical using one-hot encoding
X_encoded = pd.get_dummies(X_shap)
# Initialize model
rf = RandomForestRegressor(n_estimators=100, random_state=42)
# Train on encoded features
rf.fit(X_encoded, y_shap)
# SHAP explanation
# Use TreeExplainer for Random Forest
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_encoded)

# Summary plot
shap.summary_plot(
    shap_values,
    X_encoded,
    plot_type='bar',
    show=False,
    max_display=X_encoded.shape[1]
)
# Adjust layout
plt.xlabel("Mean |SHAP value|", fontsize=12)
plt.title("SHAP Feature Importance (Bar Plot)", fontsize=14)
plt.tight_layout()
plt.xticks(rotation=45, ha='right')
plt.subplots_adjust(bottom=0.25)
plt.show()
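With `plot_type='bar'`, the summary plot ranks features by their mean absolute SHAP value across rows. The same ranking can be computed directly, which is handy for exporting it as a table. The following is a minimal sketch, assuming the SHAP values form a 2-D array of shape (n_samples, n_features), as is typical for a single-output regressor; the toy matrix and the feature names attached to it are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up SHAP matrix: 4 rows (samples) x 3 columns (features).
toy_shap_values = np.array([
    [ 2.0, -0.5, 0.1],
    [-4.0,  0.5, 0.0],
    [ 3.0, -1.5, 0.1],
    [-3.0,  1.5, 0.2],
])
feature_names = ["Installed Capacity (MW)", "Wind Speed", "Population"]

# Take absolute values first: signed contributions would otherwise cancel out.
importance = (
    pd.Series(np.abs(toy_shap_values).mean(axis=0), index=feature_names)
    .sort_values(ascending=False)
)
print(importance)
```

On the real run, substituting `shap_values` for the toy matrix and `X_encoded.columns` for `feature_names` should reproduce the ordering shown in the bar plot, under the shape assumption stated above.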