Unveiling Trends in Renewable Energy
The Data Scientist Master
Background
The race to net-zero emissions is heating up. As nations work to combat climate change and meet rising energy demands, renewable energy has emerged as a cornerstone of the clean transition. Solar, wind, and hydro are revolutionizing how we power our lives. Some countries are leading the charge, while others are falling behind. But which nations are making the biggest impact? What's driving their success? And what lessons can we learn to accelerate the green energy transition?
As a data scientist at NextEra Energy, one of the world's leading renewable energy providers, your role is to move beyond exploration and into prediction. Using a rich, real-world dataset, you'll build models to forecast renewable energy production, drawing on indicators like GDP, population, carbon emissions, and policy metrics.
With the world watching, your model could help shape smarter investments, forward-thinking policies, and a faster transition to clean energy.
The data
Your team has gathered a global renewable energy dataset ("Training_set_augmented.csv") covering energy production, investments, policies, and economic factors shaping renewable adoption worldwide:
Basic Identifiers
- Country: Country name
- Year: Calendar year (YYYY)
- Energy Type: Type of renewable energy (e.g., Solar, Wind)
Energy Metrics
- Production (GWh): Renewable energy produced (gigawatt-hours)
- Installed Capacity (MW): Installed renewable capacity (megawatts)
- Investments (USD): Total investment in renewables (US dollars)
- Energy Consumption (GWh): Total national energy use
- Energy Storage Capacity (MWh): Capacity of energy storage systems
- Grid Integration Capability (Index): Scale of 0–1; ability of the grid to handle renewables
- Electricity Prices (USD/kWh): Average cost of electricity
- Energy Subsidies (USD): Government subsidies for the energy sector
- Proportion of Energy from Renewables (%): Share of renewables in the total energy mix
Innovation & Tech
- R&D Expenditure (USD): R&D spending on renewables
- Renewable Energy Patents: Number of patents filed
- Innovation Index (Index): Global innovation score (0–100)
Economy & Policy
- GDP (USD): Gross domestic product
- Population: Total population
- Government Policies: Number of policies supporting renewables
- Renewable Energy Targets: Whether national targets are in place (1 = Yes, 0 = No)
- Public-Private Partnerships in Energy: Number of active collaborations
- Energy Market Liberalization (Index): Scale of 0–1
Social & Governance
- Ease of Doing Business (Score): World Bank index (0–100)
- Regulatory Quality: Governance score (-2.5 to 2.5)
- Political Stability: Governance score (-2.5 to 2.5)
- Control of Corruption: Governance score (-2.5 to 2.5)
Environment & Resources
- CO2 Emissions (MtCO2): Emissions in million metric tons
- Average Annual Temperature (°C): Country's average temperature
- Solar Irradiance (kWh/m²/day): Solar energy availability
- Wind Speed (m/s): Average wind speed
- Hydro Potential (Index): Relative hydropower capability (0–1)
- Biomass Availability (Tons/year): Total available biomass
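Several of the fields above come with documented ranges, which makes a quick sanity check possible before modeling. The sketch below counts values that fall outside those ranges. It assumes the CSV headers match the names in the data dictionary exactly, which the source does not confirm; the helper name `out_of_range_counts` and the sample frame are made up for illustration.

```python
import pandas as pd

# Ranges copied from the data dictionary; header names are assumed to match the CSV.
DOCUMENTED_RANGES = {
    "Grid Integration Capability (Index)": (0.0, 1.0),
    "Energy Market Liberalization (Index)": (0.0, 1.0),
    "Hydro Potential (Index)": (0.0, 1.0),
    "Innovation Index (Index)": (0.0, 100.0),
    "Ease of Doing Business (Score)": (0.0, 100.0),
    "Regulatory Quality": (-2.5, 2.5),
    "Political Stability": (-2.5, 2.5),
    "Control of Corruption": (-2.5, 2.5),
}

def out_of_range_counts(df):
    """Count non-missing values outside the documented range, per column present in df."""
    counts = {}
    for col, (low, high) in DOCUMENTED_RANGES.items():
        if col in df.columns:
            outside = ~df[col].between(low, high) & df[col].notna()
            counts[col] = int(outside.sum())
    return counts

# Tiny made-up frame: one hydro index and one corruption score are out of range.
sample = pd.DataFrame({
    "Control of Corruption": [-1.2, 0.4, 3.1],
    "Hydro Potential (Index)": [0.2, 1.4, None],
})
print(out_of_range_counts(sample))
```

Missing values are deliberately not counted as out of range here, since they are a separate data-quality question.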
Challenge
As a data scientist at NextEra Energy, your task is to use the Training Set (80% of the data) to train a powerful machine learning model that can predict renewable energy production (GWh). Once your model is trained, you will use it to generate predictions for the Test Set, which does not include the target (Production (GWh)) but has an additional ID column.
Your Task:
- Train Your Model:
  - Use the Training Set, which contains all features and the target (`Production (GWh)`), to build and fine-tune your model.
  - Explore, clean, and transform the data as needed.
- Generate Predictions:
  - Use your trained model to make predictions for the Test Set (20%), which has all the features except `Production (GWh)`.
  - The Test Set also has an `ID` column, which uniquely identifies each row.
- Submit Your Results:
  - Save your predictions as a CSV file with exactly two columns:
    - `ID`: Directly from the Test Set (must match exactly).
    - `Predicted Production (GWh)`: Your model's predictions for each row.
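The final step above can be sketched as follows. This is a minimal illustration, not the required workflow: the small `test_df` frame and the `preds` array stand in for the real Test Set and a fitted model's output, and the output filename `submission.csv` is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in for the real Test Set; only the ID column matters here (made-up rows).
test_df = pd.DataFrame({"ID": [1, 2, 3], "Year": [2020, 2021, 2022]})

# Stand-in for the output of model.predict(...) on the test features.
preds = np.array([50200.34, 67820.78, 45210.55])

# Build the two-column frame described in the task, keeping the test-set row order.
submission = pd.DataFrame({
    "ID": test_df["ID"].to_numpy(),
    "Predicted Production (GWh)": preds,
})

# index=False prevents pandas from writing a third, unnamed index column.
submission.to_csv("submission.csv", index=False)
```

Taking `ID` straight from the test frame, rather than regenerating it, keeps each prediction attached to the row it was made for.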
Ready to Start?
- Download the Training Set and Test Set.
- Build, train, and test your model.
- Submit your predictions.
Your model won't just generate predictions; it will also uncover the underlying drivers of renewable energy production and reveal where the biggest gains can be made!
Judging Criteria
Your submission will be evaluated using a hybrid system, combining Model Accuracy (80%) and Community Votes (20%).
1. Model Accuracy (80%)
- Your submission will be scored using Root Mean Squared Error (RMSE), which measures how close your predictions are to the actual values in our hidden test set.
- The lower your RMSE, the better your model's performance.
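For reference, RMSE is the square root of the mean squared difference between predicted and actual values. The sketch below computes it with NumPy on made-up numbers; the function name `rmse` and the sample values are illustrative, and the organizers' own scoring code is not shown in the source.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values purely for illustration (errors of -10, +10 and -30 GWh).
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
score = rmse(actual, predicted)
print(f"RMSE: {score:.2f} GWh")
```

Because errors are squared before averaging, a few large misses on high-production rows can dominate the score.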
Submission Instructions:
- First, submit your Datalab workbook.
- Then, submit your predictions as a .csv file via this Google Form.
- Your file must contain exactly two columns:
  - `ID`: Directly from the Test Set (must match exactly).
  - `Predicted Production (GWh)`: Your model's predictions for each row.
Submission Example:
| ID | Predicted Production (GWh) |
|---|---|
| 1 | 50200.34 |
| 2 | 67820.78 |
| 3 | 45210.55 |
| ... | ... |
Important:
- Use the same email address for the Google Form as the one associated with your DataCamp account. This is how we will link your submission to your Datalab workbook.
- Only submissions in the correct format will be accepted and scored.
- We will automatically check for formatting errors (missing IDs, extra IDs, or invalid columns).
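A quick local check along these lines can catch the formatting errors listed above before submitting. It is only a sketch: the helper name `check_submission` and the sample frame are hypothetical, and the organizers' actual checker may apply different or additional rules.

```python
import pandas as pd

EXPECTED_COLUMNS = ["ID", "Predicted Production (GWh)"]

def check_submission(submission, test_ids):
    """Return a list of formatting problems; an empty list means none were found."""
    problems = []
    if list(submission.columns) != EXPECTED_COLUMNS:
        problems.append(f"columns must be exactly {EXPECTED_COLUMNS}")
        return problems  # the remaining checks need both columns
    expected = set(test_ids)
    got = set(submission["ID"].tolist())
    if expected - got:
        problems.append(f"missing IDs: {sorted(expected - got)}")
    if got - expected:
        problems.append(f"extra IDs: {sorted(got - expected)}")
    if submission["ID"].duplicated().any():
        problems.append("duplicate IDs")
    if submission["Predicted Production (GWh)"].isna().any():
        problems.append("missing predictions")
    return problems

# Hypothetical example: ID 3 is missing and ID 4 is not in the test set.
sub = pd.DataFrame({
    "ID": [1, 2, 4],
    "Predicted Production (GWh)": [50200.34, 67820.78, 45210.55],
})
print(check_submission(sub, test_ids=[1, 2, 3]))
```

Running the check on the saved file after reading it back with `pd.read_csv` also confirms that the CSV on disk has the expected columns.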
2. Community Votes (20%)
- Once the competition ends, you will be able to view the top submissions from other participants.
- Vote for the most insightful, creative, or well-explained solutions.
Checklist before publishing
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the introduction to data science notebooks, so the workbook is focused on your story.
- Check that all the cells run without error.
Data is the new fuel: let's generate insights and electrify the future!
Renewable Energy Production Prediction
DataCamp Competition Portfolio Project
Executive Summary
This project analyzes global renewable energy production patterns and builds predictive models to forecast energy generation. Using advanced machine learning techniques and comprehensive feature engineering, we aim to identify key drivers of renewable energy success worldwide.
Table of Contents
- Data Loading & Initial Exploration
- Exploratory Data Analysis
- Feature Engineering
- Model Development
- Model Evaluation & Selection
- Predictions & Submission
- Business Insights
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler, RobustScaler, LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.feature_selection import SelectKBest, f_regression
import xgboost as xgb
import warnings
warnings.filterwarnings('ignore')
# Set style for beautiful plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
print("Environment setup complete! Ready to predict renewable energy production.")
## 1. Data Loading & Initial Exploration {#data-loading}
# Load the datasets
print("Loading datasets...")
train_df = pd.read_csv('Training_set_augmented.csv')
test_df = pd.read_csv('Public_Test_Set.csv') # Assuming test set filename
print(f"Training set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")
# Display basic information
print("\nDataset Overview:")
train_df.info()
# First look at the data
print("\nFirst 5 rows of training data:")
train_df.head()
# Check for missing values
print("\nMissing Values Analysis:")
missing_train = train_df.isnull().sum()
missing_test = test_df.isnull().sum()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Training set missing values
missing_train_pct = (missing_train / len(train_df)) * 100
missing_train_pct = missing_train_pct[missing_train_pct > 0].sort_values(ascending=False)
if not missing_train_pct.empty:
    ax1.barh(range(len(missing_train_pct)), missing_train_pct.values, color='coral')
    ax1.set_yticks(range(len(missing_train_pct)))
    ax1.set_yticklabels(missing_train_pct.index)
    ax1.set_xlabel('Missing Percentage (%)')
    ax1.set_title('Missing Values - Training Set')
else:
    ax1.text(0.5, 0.5, 'No Missing Values!', ha='center', va='center', fontsize=16, color='green')
    ax1.set_title('Missing Values - Training Set')
# Test set missing values
missing_test_pct = (missing_test / len(test_df)) * 100
missing_test_pct = missing_test_pct[missing_test_pct > 0].sort_values(ascending=False)
if not missing_test_pct.empty:
    ax2.barh(range(len(missing_test_pct)), missing_test_pct.values, color='lightblue')
    ax2.set_yticks(range(len(missing_test_pct)))
    ax2.set_yticklabels(missing_test_pct.index)
    ax2.set_xlabel('Missing Percentage (%)')
    ax2.set_title('Missing Values - Test Set')
else:
    ax2.text(0.5, 0.5, 'No Missing Values!', ha='center', va='center', fontsize=16, color='green')
    ax2.set_title('Missing Values - Test Set')
plt.tight_layout()
plt.show()
# Summary statistics
print("\nSummary Statistics:")
train_df.describe()
## 2. Exploratory Data Analysis {#eda}
### 2.1 Target Variable Analysis
# Analyze the target variable
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))
# Distribution of Production
ax1.hist(train_df['Production (GWh)'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
ax1.set_xlabel('Production (GWh)')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Renewable Energy Production')
ax1.grid(True, alpha=0.3)
# Log-transformed distribution
ax2.hist(np.log1p(train_df['Production (GWh)']), bins=50, alpha=0.7, color='lightgreen', edgecolor='black')
ax2.set_xlabel('Log(Production + 1)')
ax2.set_ylabel('Frequency')
ax2.set_title('Log-Transformed Production Distribution')
ax2.grid(True, alpha=0.3)
# Box plot by Energy Type
energy_types = train_df['Energy Type'].value_counts().head(8).index
filtered_df = train_df[train_df['Energy Type'].isin(energy_types)]
sns.boxplot(data=filtered_df, x='Energy Type', y='Production (GWh)', ax=ax3)
# Rotate tick labels without resetting them (avoids a fixed-locator warning)
ax3.tick_params(axis='x', rotation=45)
ax3.set_title('Production by Energy Type')
# Production over time
yearly_production = train_df.groupby('Year')['Production (GWh)'].sum()
ax4.plot(yearly_production.index, yearly_production.values, marker='o', linewidth=2, markersize=6)
ax4.set_xlabel('Year')
ax4.set_ylabel('Total Production (GWh)')
ax4.set_title('Total Renewable Energy Production Over Time')
ax4.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Target Variable Statistics:")
print(f"Mean Production: {train_df['Production (GWh)'].mean():,.2f} GWh")
print(f"Median Production: {train_df['Production (GWh)'].median():,.2f} GWh")
print(f"Standard Deviation: {train_df['Production (GWh)'].std():,.2f} GWh")
print(f"Skewness: {train_df['Production (GWh)'].skew():.2f}")
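If the skewness printed above is large, one common option (not something the source commits to) is to model `log1p` of the target and map predictions back with `expm1`. The sketch below shows the effect on synthetic, right-skewed data rather than the competition data; the lognormal parameters are arbitrary assumptions chosen only to produce a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for 'Production (GWh)' (not the competition data).
rng = np.random.default_rng(42)
production = pd.Series(
    rng.lognormal(mean=10.0, sigma=1.0, size=5_000),
    name="Production (GWh)",
)

# log1p compresses the long right tail; expm1 is its inverse.
log_production = np.log1p(production)
recovered = np.expm1(log_production)

print(f"Skewness (raw):   {production.skew():.2f}")
print(f"Skewness (log1p): {log_production.skew():.2f}")
print(f"Round trip matches: {np.allclose(recovered, production)}")
```

Since the competition is scored by RMSE on the original GWh scale, predictions made in log space must be transformed back before scoring or submission. scikit-learn's `TransformedTargetRegressor` (with `func=np.log1p` and `inverse_func=np.expm1`) is one way to wrap both steps around a regressor.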
### 2.2 Geographic and Energy Type Analysis
# Top producing countries
top_countries = train_df.groupby('Country')['Production (GWh)'].sum().sort_values(ascending=False).head(15)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 12))
# Top producing countries
colors = plt.cm.viridis(np.linspace(0, 1, len(top_countries)))
bars = ax1.barh(range(len(top_countries)), top_countries.values, color=colors)
ax1.set_yticks(range(len(top_countries)))
ax1.set_yticklabels(top_countries.index)
ax1.set_xlabel('Total Production (GWh)')
ax1.set_title('Top 15 Renewable Energy Producing Countries')
ax1.grid(True, alpha=0.3)
# Add value labels just to the right of each bar
for i, value in enumerate(top_countries.values):
    ax1.text(value + max(top_countries.values) * 0.01, i, f'{value:,.0f}',
             va='center', ha='left', fontweight='bold')
# Energy type breakdown
energy_production = train_df.groupby('Energy Type')['Production (GWh)'].sum().sort_values(ascending=False)
colors_energy = plt.cm.Set3(np.linspace(0, 1, len(energy_production)))
wedges, texts, autotexts = ax2.pie(energy_production.values, labels=energy_production.index,
                                   autopct='%1.1f%%', colors=colors_energy, startangle=90)
ax2.set_title('Renewable Energy Production by Type')
# Enhance pie chart text
for autotext in autotexts:
    autotext.set_color('black')
    autotext.set_fontweight('bold')
plt.tight_layout()
plt.show()
### 2.3 Correlation Analysis
# Select numeric columns for correlation analysis
numeric_cols = train_df.select_dtypes(include=[np.number]).columns
correlation_matrix = train_df[numeric_cols].corr()
# Create a more focused correlation heatmap
fig, ax = plt.subplots(figsize=(16, 14))
# Create mask for upper triangle
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
# Generate heatmap
sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='RdYlBu_r', center=0,
            square=True, fmt='.2f', cbar_kws={'shrink': 0.8}, ax=ax)
ax.set_title('Feature Correlation Matrix', fontsize=16, pad=20)
plt.tight_layout()
plt.show()
# Show strongest absolute correlations with the target (excluding the target itself)
target_corr = (correlation_matrix['Production (GWh)']
               .drop('Production (GWh)')
               .abs()
               .sort_values(ascending=False))
print("Strongest correlations with Production:")
print(target_corr.head(10))