Predicting renewable energy production - sailing in rough seas

Unveiling Trends in Renewable Energy 🌍🔋

The Data Scientist Master

📖 Background

The race to net-zero emissions is heating up. As nations work to combat climate change and meet rising energy demands, renewable energy has emerged as a cornerstone of the clean transition. Solar, wind, and hydro are revolutionizing how we power our lives. Some countries are leading the charge, while others are falling behind. But which nations are making the biggest impact? What’s driving their success? And what lessons can we learn to accelerate the green energy transition?

As a data scientist at NextEra Energy, one of the world’s leading renewable energy providers, your role is to move beyond exploration, into prediction. Using a rich, real-world dataset, you’ll build models to forecast renewable energy production, drawing on indicators like GDP, population, carbon emissions, and policy metrics.

With the world watching, your model could help shape smarter investments, forward-thinking policies, and a faster transition to clean energy. 🔮⚡🌱

💾 The data

Your team has gathered a global renewable energy dataset ("Training_set_augmented.csv") covering energy production, investments, policies, and economic factors shaping renewable adoption worldwide:

🌍 Basic Identifiers

Country – Country name
Year – Calendar year (YYYY)
Energy Type – Type of renewable energy (e.g., Solar, Wind)

⚡ Energy Metrics

Production (GWh) – Renewable energy produced (Gigawatt-hours)
Installed Capacity (MW) – Installed renewable capacity (Megawatts)
Investments (USD) – Total investment in renewables (US Dollars)
Energy Consumption (GWh) – Total national energy use
Energy Storage Capacity (MWh) – Capacity of energy storage systems
Grid Integration Capability (Index) – Scale of 0–1; ability to handle renewables in grid
Electricity Prices (USD/kWh) – Average cost of electricity
Energy Subsidies (USD) – Government subsidies for energy sector
Proportion of Energy from Renewables (%) – Share of renewables in total energy mix

🧠 Innovation & Tech

R&D Expenditure (USD) – R&D spending on renewables
Renewable Energy Patents – Number of patents filed
Innovation Index (Index) – Global innovation score (0–100)

💰 Economy & Policy

GDP (USD) – Gross domestic product
Population – Total population
Government Policies – Number of policies supporting renewables
Renewable Energy Targets – Whether national targets are in place (1 = Yes, 0 = No)
Public-Private Partnerships in Energy – Number of active collaborations
Energy Market Liberalization (Index) – Scale of 0–1

🧑‍🤝‍🧑 Social & Governance

Ease of Doing Business (Score) – World Bank index (0–100)
Regulatory Quality – Governance score (-2.5 to 2.5)
Political Stability – Governance score (-2.5 to 2.5)
Control of Corruption – Governance score (-2.5 to 2.5)

🌿 Environment & Resources

CO2 Emissions (MtCO2) – Emissions in million metric tons
Average Annual Temperature (°C) – Country’s avg. temp
Solar Irradiance (kWh/m²/day) – Solar energy availability
Wind Speed (m/s) – Average wind speed
Hydro Potential (Index) – Relative hydropower capability (0–1)
Biomass Availability (Tons/year) – Total available biomass

Input and evaluate training data

import pandas as pd
import os
import matplotlib.pyplot as plt


# Loading and surveying data

# List the files in the 'data' directory
print(os.listdir('data'))


# Read the CSV file from the 'data' directory

print('\n***read data***')
df = pd.read_csv('data/Training_set_augmented.csv')
print(df.head())
print('\n***info on columns of dataframe:\n')
print(df.info())

# Check categories
print('\n***check categories')
print('\n***Countries:')
print(df['Country'].unique())

print('\n***Years from to:')
print(df['Year'].min(), df['Year'].max())

print('\n***Renewable energy types')
print(df['Energy Type'].unique())

print('\n***Nominals:')
cols = ['Government Policies', 'Renewable Energy Targets',
        'Public-Private Partnerships in Energy', 'Energy Market Liberalization']
for col in cols:
    print(f' {col} {df[col].unique()}')

Input and evaluate test data

import pandas as pd
import os
import matplotlib.pyplot as plt


print(os.listdir('data'))
# Read the CSV file from the 'data' directory
print('\n***input test data and print info***')
df_test = pd.read_csv('data/Public_Test_Set.csv')
print(df_test.head())
print('\n***info on columns of dataframe:\n')
print(df_test.info())

Prediction all countries undifferentiated


# Predict for all countries undifferentiated
print('\n***Predict renewable energy all countries undifferentiated***')
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as MSE
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

print('\n***drop columns for train and test data***')
cols_to_drop=['Country','Energy Type','Production (GWh)']

df_cntry = df
X=df_cntry.drop(cols_to_drop,axis=1) 
X_test=df_test.drop(['ID','Country','Energy Type'],axis=1)

print('\n***prepare predictor and predicted data***')
Xcols=X.columns
X = X.values
y = df_cntry['Production (GWh)'].values  

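print('\n***run randomforest regressor***')
rf = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=42)
rf.fit(X, y)
# X was converted to a NumPy array above, so pass the test features as an array too
# (this assumes the test columns are in the same order as the training columns)
y_pred = rf.predict(X_test.values)
predictions_all_countries = pd.DataFrame({'ID': df_test['ID'],
                                          'Predicted Renewable Energy (GWh)': y_pred})
# Note: this R^2 is computed on the training data, not on held-out data
rf_r2 = rf.score(X, y)

print('\n***overall R^2 all countries included')
# built-in round() works whether score() returns a Python float or a NumPy float
print(f" R^2 Score: {round(rf_r2, 2)}")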

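print('\n***plot feature importances***')
importances = pd.Series(data=rf.feature_importances_, index=Xcols)
# sort ascending and keep the last 10 rows, i.e. the 10 largest importances
# (slicing [0:10] after an ascending sort would show the 10 smallest instead)
importances_sorted = importances.sort_values().tail(10)
importances_sorted.plot(kind='barh')
plt.title('Top 10 Feature_importances')
plt.xlabel('Feature importances')
plt.show()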

print('\n***print predicted values for all countries undifferentiated***')
print(predictions_all_countries.head())

Observations

Random forest analysis of the data with all countries pooled yields disappointing results.

R^2 is only 0.13, and that score is computed on the training data itself, so predictions on unseen data will be very unreliable and the RMSE is in all likelihood far from acceptable.

Therefore, we try analysing each country one by one, in the hope that individual countries yield better scores. The poor pooled results suggest that different factors drive renewable energy production in different countries.

Overall, CO2 emissions, control of corruption, energy subsidies, political stability, and population emerge as the most important features in determining renewable energy production.
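
The R^2 above comes from rf.score(X, y), that is, from the same rows the model was fitted on. A held-out split gives a more honest picture, including an RMSE. The sketch below is only an illustration on synthetic stand-in data, not on the competition dataset; the helper name holdout_scores, the 80/20 split, and the demo arrays are assumptions introduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def holdout_scores(X, y, test_size=0.2, random_state=42):
    """Fit on a training split and score on a held-out validation split."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    # Same hyperparameters as the notebook's forest
    model = RandomForestRegressor(n_estimators=25, max_depth=5,
                                  random_state=random_state)
    model.fit(X_tr, y_tr)
    val_pred = model.predict(X_val)
    return {
        'train_r2': float(r2_score(y_tr, model.predict(X_tr))),
        'val_r2': float(r2_score(y_val, val_pred)),
        'val_rmse': float(np.sqrt(mean_squared_error(y_val, val_pred))),
    }


# Synthetic stand-in data (assumption): one informative feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

scores = holdout_scores(X_demo, y_demo)
print(scores)
```

With the real data, the X and y arrays prepared above could be passed in place of X_demo and y_demo; the gap between train_r2 and val_r2 then indicates how much the forest overfits.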

Prediction for each country separately

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as MSE
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Predicting renewable energy production for each country separately

print('\n***Predicting renewable energy production by each country separately***\n')

print('\n***prepare data drop columns, set country list***')
predictions_by_country=pd.DataFrame()
cols_to_drop=['Country','Energy Type','Production (GWh)']
test_cols_to_drop=['Country','Energy Type']
country_list=['UK','Australia', 'USA', 'Germany', 'France', 'Russia', 'China', 'Japan','India','Canada','Brazil']

energy_list=['Hydro', 'Solar', 'Biomass', 'Geothermal', 'Wind']
print('\n***renewable energy list***')
print(energy_list)

print('\n***calculate predicted values by random forest regressor for each country***\n')
pred_all_countries=pd.DataFrame()
for ct in country_list:
    df_cntry = df[df['Country'] == ct]  

    X=df_cntry.drop(cols_to_drop,axis=1) 
    Xcols=X.columns
    #print(X.shape)
    X = X.values
    y = df_cntry['Production (GWh)'].values  
    rf = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=42)

    X_test_cntr=df_test[df_test['Country']==ct]
    X_test=X_test_cntr.drop(test_cols_to_drop,axis=1)
    test_id=X_test['ID']
    # convert to a NumPy array so the test features match the training features
    X_test=X_test.drop('ID',axis=1).values
    rf.fit(X, y)
    # training R^2 (computed on the same rows the model was fitted on)
    rf_r2 = rf.score(X, y)
    print(f" {ct} - R^2 Score: {round(rf_r2, 2)}")
    importances = pd.Series(data=rf.feature_importances_, index=Xcols)
    # ascending sort, then keep the last 10 rows: the 10 largest importances
    importances_sorted = importances.sort_values().tail(10)
    importances_sorted.plot(kind='barh')
    plt.title(f'Top 10 Feature_importances {ct}')
    plt.xlabel('Feature importances')
    plt.show()

    # predict once for this country's test rows and keep the IDs alongside
    y_pred = rf.predict(X_test)
    predictions_ct=pd.DataFrame({'ID': test_id,
            'Predicted Renewable Energy (GWh)':y_pred}).reset_index(drop=True)
    pred_all_countries = pd.concat([pred_all_countries,predictions_ct],ignore_index=True)

Observations:

On average, the predictors explain only about 50% of the variance in renewable energy production.

The top country is the UK, with R^2 = 0.8.

Each country has a very different set of important features.

For example, the top three features for the UK are Energy Consumption, Investments, and Innovation Index, whereas for Germany they are Hydro Potential, Electricity Prices, and CO2 Emissions.
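
Reading the top features off one bar chart per country is error-prone, so it can help to print them as well. The sketch below is a hypothetical helper (top_features is not part of the original notebook) that uses pandas Series.nlargest to return the n largest importances of a fitted forest; it runs on synthetic data where feature 'b' is constructed to dominate.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def top_features(model, feature_names, n=3):
    """Return the n largest feature importances, largest first."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.nlargest(n)


# Synthetic stand-in data (assumption): the target depends almost only on 'b'
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=['a', 'b', 'c', 'd'])
target = 5.0 * demo['b'] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=42)
model.fit(demo.values, target.values)

top = top_features(model, demo.columns, n=2)
print(top)
```

Inside the country loop, a call such as top_features(rf, Xcols) after rf.fit would list the three leading features for each country next to its R^2.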

Combining country-by-country predictions into a single file to print out

print('\n***concatenate predicted values for each country into one file***')
print('***sort and make it ready to write out***')

predicted_energy_production_values = pred_all_countries.sort_values(by='ID').set_index('ID')
print('\n***check size of the predicted data file must have the same number of rows***')
print(pred_all_countries.shape)
print('\n***print top a few lines of the predicted data file***')

# 'ID' is already the index (set above), so no further reset is needed
print(predicted_energy_production_values.head())

print('\n***output the predicted values to data/ directory***')
# 'ID' is the index here, so write the index too (index=False would drop the IDs)
#predicted_energy_production_values.to_csv('data/predicted_energy_production_values.csv')
#print(os.listdir('data'))


print(os.listdir('data'))

Conclusion:

Evaluating each country separately should yield more accurate results than the pooled approach.

The results for each country are concatenated into a single file and written to the data/ directory.

Even more accurate results could be obtained by evaluating each country and each renewable energy type separately.
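
As a rough sketch of that last idea, one forest could be fitted per (Country, Energy Type) pair with a pandas groupby. This was not run on the competition data; the helper name fit_per_group, the min_rows threshold, and the synthetic frame below are assumptions for illustration. Smaller groups mean fewer training rows per model, so whether this actually improves accuracy would have to be checked on held-out data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def fit_per_group(train, group_cols, target, min_rows=20):
    """Fit one random forest per group; skip groups with too few rows."""
    feature_cols = [c for c in train.columns if c not in group_cols + [target]]
    models = {}
    for key, part in train.groupby(group_cols):
        if len(part) < min_rows:
            continue  # too little data to fit a separate model
        rf = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=42)
        rf.fit(part[feature_cols], part[target])
        models[key] = rf
    return models, feature_cols


# Synthetic stand-in data (assumption): 2 countries x 2 energy types x 30 rows
rng = np.random.default_rng(2)
n = 120
demo = pd.DataFrame({
    'Country': np.repeat(['A', 'B'], n // 2),
    'Energy Type': np.tile(np.repeat(['Solar', 'Wind'], n // 4), 2),
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
})
demo['Production (GWh)'] = 2.0 * demo['x1'] + rng.normal(scale=0.3, size=n)

models, feature_cols = fit_per_group(
    demo, ['Country', 'Energy Type'], 'Production (GWh)')
print(sorted(models))
```

Each key in models is a (country, energy type) tuple, so test rows can be routed to the matching model; groups skipped by min_rows would need a fallback such as the per-country model.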