Unveiling Trends in Renewable Energy
The Data Scientist Master
Background
The race to net-zero emissions is heating up. As nations work to combat climate change and meet rising energy demands, renewable energy has emerged as a cornerstone of the clean transition. Solar, wind, and hydro are revolutionizing how we power our lives. Some countries are leading the charge, while others are falling behind. But which nations are making the biggest impact? What's driving their success? And what lessons can we learn to accelerate the green energy transition?
As a data scientist at NextEra Energy, one of the world's leading renewable energy providers, your role is to move beyond exploration and into prediction. Using a rich, real-world dataset, you'll build models to forecast renewable energy production, drawing on indicators like GDP, population, carbon emissions, and policy metrics.
With the world watching, your model could help shape smarter investments, forward-thinking policies, and a faster transition to clean energy.
The data
Your team has gathered a global renewable energy dataset ("Training_set_augmented.csv") covering energy production, investments, policies, and economic factors shaping renewable adoption worldwide:
Basic Identifiers
- Country: Country name
- Year: Calendar year (YYYY)
- Energy Type: Type of renewable energy (e.g., Solar, Wind)
Energy Metrics
- Production (GWh): Renewable energy produced (Gigawatt-hours)
- Installed Capacity (MW): Installed renewable capacity (Megawatts)
- Investments (USD): Total investment in renewables (US Dollars)
- Energy Consumption (GWh): Total national energy use
- Energy Storage Capacity (MWh): Capacity of energy storage systems
- Grid Integration Capability (Index): Scale of 0–1; ability to handle renewables in grid
- Electricity Prices (USD/kWh): Average cost of electricity
- Energy Subsidies (USD): Government subsidies for energy sector
- Proportion of Energy from Renewables (%): Share of renewables in total energy mix
Innovation & Tech
- R&D Expenditure (USD): R&D spending on renewables
- Renewable Energy Patents: Number of patents filed
- Innovation Index (Index): Global innovation score (0–100)
Economy & Policy
- GDP (USD): Gross domestic product
- Population: Total population
- Government Policies: Number of policies supporting renewables
- Renewable Energy Targets: Whether national targets are in place (1 = Yes, 0 = No)
- Public-Private Partnerships in Energy: Number of active collaborations
- Energy Market Liberalization (Index): Scale of 0–1
Social & Governance
- Ease of Doing Business (Score): World Bank index (0–100)
- Regulatory Quality: Governance score (-2.5 to 2.5)
- Political Stability: Governance score (-2.5 to 2.5)
- Control of Corruption: Governance score (-2.5 to 2.5)
Environment & Resources
- CO2 Emissions (MtCO2): Emissions in million metric tons
- Average Annual Temperature (°C): Country's average temperature
- Solar Irradiance (kWh/m²/day): Solar energy availability
- Wind Speed (m/s): Average wind speed
- Hydro Potential (Index): Relative hydropower capability (0–1)
- Biomass Availability (Tons/year): Total available biomass
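The documented ranges above lend themselves to a quick sanity check once the data is loaded. The sketch below is one possible way to do that. It assumes the CSV column names match the labels in this list exactly, which the source does not confirm (the notebook's own code refers to `Energy Market Liberalization` without the `(Index)` suffix, so the keys may need adjusting). `EXPECTED_RANGES`, `out_of_range_counts`, and the small demo frame are hypothetical names and made-up values used only for illustration.

```python
import pandas as pd

# Documented bounds, copied from the data dictionary above.
# Assumption: the CSV headers match these labels exactly.
EXPECTED_RANGES = {
    "Grid Integration Capability (Index)": (0.0, 1.0),
    "Innovation Index (Index)": (0.0, 100.0),
    "Energy Market Liberalization (Index)": (0.0, 1.0),
    "Ease of Doing Business (Score)": (0.0, 100.0),
    "Regulatory Quality": (-2.5, 2.5),
    "Political Stability": (-2.5, 2.5),
    "Control of Corruption": (-2.5, 2.5),
    "Hydro Potential (Index)": (0.0, 1.0),
}


def out_of_range_counts(frame: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count, per documented column present in `frame`, the values outside its bounds."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in frame.columns:
            # between() is inclusive on both ends; NaN counts as out of range here.
            counts[col] = int((~frame[col].between(low, high)).sum())
    return pd.Series(counts, dtype="int64")


# Made-up rows for illustration only (not taken from the real dataset).
demo = pd.DataFrame({
    "Regulatory Quality": [0.4, -2.7, 1.9],
    "Hydro Potential (Index)": [0.2, 0.9, 1.3],
    "Political Stability": [0.0, 1.1, -0.6],
})
violations = out_of_range_counts(demo, EXPECTED_RANGES)
print(violations)
```

On the real data, a call such as `out_of_range_counts(df, EXPECTED_RANGES)` after loading would surface any column whose values fall outside the documented scale.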
Input and evaluate training data
import pandas as pd
import os
import matplotlib.pyplot as plt
# Loading and surveying data
# List the files in the 'data' directory
print(os.listdir('data'))
# Read the CSV file from the 'data' directory
print('\n***read data***')
df = pd.read_csv('data/Training_set_augmented.csv')
print(df.head())
print('\n***info on columns of dataframe:\n')
print(df.info())
# Check categories
print('\n***check categories')
print('\n***Countries:')
print(df['Country'].unique())
print('\n***Years from to:')
print(df['Year'].min(), df['Year'].max())
print('\n***Renewable energy types')
print(df['Energy Type'].unique())
print('\n***Nominals:')
cols = ['Government Policies', 'Renewable Energy Targets',
        'Public-Private Partnerships in Energy', 'Energy Market Liberalization']
for col in cols:
    print(f' {col} {df[col].unique()}')

Input and evaluate test data
import pandas as pd
import os
import matplotlib.pyplot as plt
print(os.listdir('data'))
# Read the CSV file from the 'data' directory
print('\n***input test data and print info***')
df_test = pd.read_csv('data/Public_Test_Set.csv')
print(df_test.head())
print('\n***info on columns of dataframe:\n')
print(df_test.info())
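Because the model is later fitted on the training columns and applied to the test columns, it can help to confirm that both frames expose the same predictors in the same order. The following is a minimal sketch, assuming the test set lacks the target and carries an extra `ID` column (as the later cells suggest). `compare_columns` and the toy frames are hypothetical and stand in for `df` and `df_test`.

```python
import pandas as pd


def compare_columns(train: pd.DataFrame, test: pd.DataFrame,
                    target: str = "Production (GWh)", id_col: str = "ID") -> dict:
    """Report predictors that appear in only one of the two frames."""
    train_features = [c for c in train.columns if c != target]
    test_features = [c for c in test.columns if c != id_col]
    return {
        "only_in_train": sorted(set(train_features) - set(test_features)),
        "only_in_test": sorted(set(test_features) - set(train_features)),
        "same_order": train_features == test_features,
    }


# Toy frames standing in for df and df_test (values are made up).
toy_train = pd.DataFrame({"Country": ["UK"], "Year": [2020],
                          "GDP": [1.0], "Production (GWh)": [10.0]})
toy_test = pd.DataFrame({"ID": [1], "Country": ["UK"],
                         "Year": [2021], "GDP": [1.1]})
report = compare_columns(toy_train, toy_test)
print(report)
```

Running `compare_columns(df, df_test)` after both reads would flag a mismatch before any model is fitted. Column order matters here because the notebook converts the predictors to plain arrays.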
Prediction for all countries, undifferentiated
# Predict for all countries undifferentiated
print('\n***Predict renewable energy all countries undifferentiated***')
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as MSE
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
print('\n***drop columns for train and test data***')
cols_to_drop=['Country','Energy Type','Production (GWh)']
df_cntry = df
X=df_cntry.drop(cols_to_drop,axis=1)
X_test=df_test.drop(['ID','Country','Energy Type'],axis=1)
print('\n***prepare predictor and predicted data***')
Xcols=X.columns
X = X.values
y = df_cntry['Production (GWh)'].values
print('\n***run randomforest regressor***')
rf = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=42)
rf.fit(X, y)
# Pass a plain array, matching the array used in fit (avoids a feature-name warning)
y_pred = rf.predict(X_test.values)
predictions_all_countries = pd.DataFrame({'ID': df_test['ID'],
                                          'Predicted Renewable Energy (GWh)': y_pred})
# Note: this R^2 is computed on the training rows, not on held-out data
rf_r2 = rf.score(X, y)
print('\n***overall R^2 all countries included')
print(f" R^2 Score: {rf_r2:.2f}")
print('\n***plot feature importances***')
importances = pd.Series(data=rf.feature_importances_, index=Xcols)
# Sort ascending and keep the last 10, so the plot shows the 10 largest importances
importances_sorted = importances.sort_values().tail(10)
importances_sorted.plot(kind='barh')
plt.title('Top 10 Feature_importances')
plt.xlabel('Feature importances')
plt.show()
print('\n***print predicted values for all countries undifferentiated***')
print(predictions_all_countries.head())

Observations
A random forest fitted to all countries pooled together yields disappointing results.
R^2, computed on the training rows themselves, is only 0.13, so predictions will be very unreliable and the RMSE is unlikely to be acceptable.
We therefore analyse each country one by one, in the hope that individual countries will yield better scores. The poor pooled result suggests that different factors drive renewable energy production in different countries.
Overall, CO2 emissions, control of corruption, energy subsidies, political stability, and population emerge as the most important features in determining renewable energy production.
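The `rf.score(X, y)` call in the cell above scores the model on the same rows it was fitted on, so the reported R^2 is an in-sample figure. One way to estimate out-of-sample performance is a hold-out split, using the `train_test_split` and `mean_squared_error` utilities that the cell already imports. The sketch below runs on synthetic data, not the real dataset; `X_demo`, `y_demo`, and the 25% split size are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and target (not the real dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=400)

# Hold out 25% of the rows so the score reflects unseen data.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Same hyperparameters as the notebook's model.
model = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=42)
model.fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)      # in-sample, as in the notebook
val_pred = model.predict(X_val)
val_r2 = r2_score(y_val, val_pred)      # out-of-sample
val_rmse = float(np.sqrt(mean_squared_error(y_val, val_pred)))

print(f"train R^2: {train_r2:.2f}  validation R^2: {val_r2:.2f}  "
      f"validation RMSE: {val_rmse:.2f}")
```

Applied to the real predictors, the gap between the training and validation scores would indicate how much of the reported fit carries over to unseen rows.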
Prediction for each country separately
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as MSE
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
# Predicting renewable energy production for each country separately
print('\n***Predicting renewable energy production by each country separately***\n')
print('\n***prepare data drop columns, set country list***')
predictions_by_country=pd.DataFrame()
cols_to_drop=['Country','Energy Type','Production (GWh)']
test_cols_to_drop=['Country','Energy Type']
country_list=['UK','Australia', 'USA', 'Germany', 'France', 'Russia', 'China', 'Japan','India','Canada','Brazil']
energy_list=['Hydro', 'Solar', 'Biomass', 'Geothermal', 'Wind']
print('\n***renewable energy list***')
print(energy_list)
print('\n***calculate predicted values by random forest regressor for each country***\n')
pred_all_countries=pd.DataFrame()
for ct in country_list:
    # Training rows for this country
    df_cntry = df[df['Country'] == ct]
    X = df_cntry.drop(cols_to_drop, axis=1)
    Xcols = X.columns
    X = X.values
    y = df_cntry['Production (GWh)'].values
    rf = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=42)
    # Test rows for this country; keep the IDs, then drop them from the predictors
    X_test_cntr = df_test[df_test['Country'] == ct]
    X_test = X_test_cntr.drop(test_cols_to_drop, axis=1)
    test_id = X_test['ID']
    X_test = X_test.drop('ID', axis=1).values
    rf.fit(X, y)
    # R^2 on the training rows (in-sample)
    rf_r2 = rf.score(X, y)
    print(f" {ct} - R^2 Score: {rf_r2:.2f}")
    importances = pd.Series(data=rf.feature_importances_, index=Xcols)
    # Sort ascending and keep the last 10, so the plot shows the 10 largest importances
    importances_sorted = importances.sort_values().tail(10)
    importances_sorted.plot(kind='barh')
    plt.title(f'Top 10 Feature_importances {ct}')
    plt.xlabel('Feature importances')
    plt.show()
    y_pred = rf.predict(X_test)
    predictions_ct = pd.DataFrame({'ID': test_id,
                                   'Predicted Renewable Energy (GWh)': y_pred}).reset_index(drop=True)
    pred_all_countries = pd.concat([pred_all_countries, predictions_ct], ignore_index=True)
Observations:
On average, the predictors explain only about 50% of the variance in renewable energy production.
The best-scoring country is the UK, with R^2 = 0.8.
Each country has a very different set of important features.
For example, the UK has Energy Consumption, Investments, and Innovation Index as its top three features, whereas Germany has Hydro Potential, Electricity Prices, and CO2 Emissions.
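The per-country scores and top features discussed above are read off the printed output and the individual plots. Collecting them into one table can make the comparison easier. The sketch below shows one way to do this on synthetic data with two made-up countries and three made-up predictors. `demo`, `rows`, and `summary` are hypothetical names, and the R^2 values are in-sample, as in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: two made-up countries, three made-up predictors.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Country": np.repeat(["A", "B"], 120),
    "f1": rng.normal(size=240),
    "f2": rng.normal(size=240),
    "f3": rng.normal(size=240),
})
# Country A depends mostly on f1, country B mostly on f3.
demo["Production (GWh)"] = np.where(
    demo["Country"] == "A", 5.0 * demo["f1"], 5.0 * demo["f3"]
) + rng.normal(scale=0.3, size=240)

rows = []
for country, group in demo.groupby("Country"):
    features = group.drop(columns=["Country", "Production (GWh)"])
    target = group["Production (GWh)"]
    model = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=42)
    model.fit(features, target)
    # Rank features from most to least important for this country.
    ranked = pd.Series(model.feature_importances_, index=features.columns)
    ranked = ranked.sort_values(ascending=False)
    rows.append({
        "Country": country,
        "train_r2": round(float(model.score(features, target)), 2),  # in-sample R^2
        "top_features": ", ".join(ranked.index[:3]),
    })

summary = pd.DataFrame(rows).set_index("Country")
print(summary)
print("mean in-sample R^2:", round(float(summary["train_r2"].mean()), 2))
```

The same loop structure could replace the print statements in the per-country cell, so that the average score and each country's leading features are available as data instead of console output.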
Combining country-by-country predictions into a single file for output
print('\n***concatenate predicted values for each country into one file***')
print('***sort and make it ready to write out***')
predicted_energy_production_values = pred_all_countries.sort_values(by='ID').set_index('ID')
print('\n***check size of the predicted data file must have the same number of rows***')
print(pred_all_countries.shape)
print('\n***print top a few lines of the predicted data file***')
# 'ID' is already the index here, so no further reset is needed
print(predicted_energy_production_values.head())
print('\n***output the predicted values to data/ directory***')
# 'ID' is the index, so keep the index when writing (index=False would drop the IDs)
#predicted_energy_production_values.to_csv('data/predicted_energy_production_values.csv')
print(os.listdir('data'))

Conclusion:
Evaluating each country separately should yield more accurate results than the pooled approach.
The results for each country are concatenated into a single file and written to the data/ directory.
Even more accurate results could be obtained by evaluating each country and each renewable energy type separately.
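The last suggestion could be sketched with a `groupby` over both `Country` and `Energy Type`. This is a rough illustration on synthetic data, not a tested extension of the notebook: the `min_rows` threshold, the single predictor `x`, and the frame names `train` and `test` are all assumptions. Splitting the data further leaves fewer rows per model, so any gain in accuracy would need to be confirmed on held-out data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (made-up countries, energy types, and predictor).
rng = np.random.default_rng(1)
n = 200
train = pd.DataFrame({
    "Country": rng.choice(["A", "B"], size=n),
    "Energy Type": rng.choice(["Solar", "Wind"], size=n),
    "x": rng.normal(size=n),
})
train["Production (GWh)"] = 2.0 * train["x"] + rng.normal(scale=0.2, size=n)
test = pd.DataFrame({
    "ID": range(8),
    "Country": ["A", "A", "B", "B"] * 2,
    "Energy Type": ["Solar", "Wind"] * 4,
    "x": rng.normal(size=8),
})

keys = ["Country", "Energy Type"]
feature_cols = ["x"]
min_rows = 10  # hypothetical threshold below which a group is skipped

# Index the training rows by (country, energy type) for quick lookup.
train_groups = {key: grp for key, grp in train.groupby(keys)}

parts = []
for key, test_group in test.groupby(keys):
    train_group = train_groups.get(key)
    if train_group is None or len(train_group) < min_rows:
        continue  # too little data for a dedicated model; handle separately
    model = RandomForestRegressor(n_estimators=25, max_depth=5, random_state=42)
    model.fit(train_group[feature_cols], train_group["Production (GWh)"])
    parts.append(pd.DataFrame({
        "ID": test_group["ID"].to_numpy(),
        "Predicted Renewable Energy (GWh)": model.predict(test_group[feature_cols]),
    }))

predictions = pd.concat(parts, ignore_index=True).sort_values("ID").reset_index(drop=True)
print(predictions)
```

Groups that fall below the threshold are skipped in this sketch. In practice they would need a fallback, for example the per-country model from the earlier cell.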