Unveiling trends in renewable energy with Linear Regression 🌍🔋

The Data Scientist Master

📖 Background

The race to net-zero emissions is heating up. As nations work to combat climate change and meet rising energy demands, renewable energy has emerged as a cornerstone of the clean transition. Solar, wind, and hydro are revolutionizing how we power our lives. Some countries are leading the charge, while others are falling behind. But which nations are making the biggest impact? What’s driving their success? And what lessons can we learn to accelerate the green energy transition?

As a data scientist at NextEra Energy, one of the world’s leading renewable energy providers, your role is to move beyond exploration and into prediction. Using a rich, real-world dataset, you’ll build models to forecast renewable energy production, drawing on indicators like GDP, population, carbon emissions, and policy metrics.

With the world watching, your model could help shape smarter investments, forward-thinking policies, and a faster transition to clean energy. 🔮⚡🌱

💾 The data

Your team has gathered a global renewable energy dataset ("Training_set_augmented.csv") covering energy production, investments, policies, and economic factors shaping renewable adoption worldwide:

🌍 Basic Identifiers

Country – Country name
Year – Calendar year (YYYY)
Energy Type – Type of renewable energy (e.g., Solar, Wind)

⚡ Energy Metrics

Production (GWh) – Renewable energy produced (Gigawatt-hours)
Installed Capacity (MW) – Installed renewable capacity (Megawatts)
Investments (USD) – Total investment in renewables (US Dollars)
Energy Consumption (GWh) – Total national energy use
Energy Storage Capacity (MWh) – Capacity of energy storage systems
Grid Integration Capability (Index) – Scale of 0–1; ability to handle renewables in grid
Electricity Prices (USD/kWh) – Average cost of electricity
Energy Subsidies (USD) – Government subsidies for energy sector
Proportion of Energy from Renewables (%) – Share of renewables in total energy mix

🧠 Innovation & Tech

R&D Expenditure (USD) – R&D spending on renewables
Renewable Energy Patents – Number of patents filed
Innovation Index (Index) – Global innovation score (0–100)

💰 Economy & Policy

GDP (USD) – Gross domestic product
Population – Total population
Government Policies – Number of policies supporting renewables
Renewable Energy Targets – Whether national targets are in place (1 = Yes, 0 = No)
Public-Private Partnerships in Energy – Number of active collaborations
Energy Market Liberalization (Index) – Scale of 0–1

🧑‍🤝‍🧑 Social & Governance

Ease of Doing Business (Score) – World Bank index (0–100)
Regulatory Quality – Governance score (-2.5 to 2.5)
Political Stability – Governance score (-2.5 to 2.5)
Control of Corruption – Governance score (-2.5 to 2.5)

🌿 Environment & Resources

CO2 Emissions (MtCO2) – Emissions in million metric tons
Average Annual Temperature (°C) – Country’s avg. temp
Solar Irradiance (kWh/m²/day) – Solar energy availability
Wind Speed (m/s) – Average wind speed
Hydro Potential (Index) – Relative hydropower capability (0–1)
Biomass Availability (Tons/year) – Total available biomass
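
Several of the fields above come with a documented range (0–1, 0–100, or -2.5 to 2.5), so a quick sanity check before modelling can flag rows that fall outside it. The snippet below is a minimal sketch of such a check, not part of the official workflow. The column names in `EXPECTED_RANGES`, the helper `out_of_range_counts`, and the toy data are all assumptions made for illustration; the names in the actual CSV may carry different unit suffixes.

```python
import pandas as pd

# Documented ranges taken from the data dictionary above.
# The exact column names are assumed and may differ in the real CSV.
EXPECTED_RANGES = {
    "Grid Integration Capability": (0.0, 1.0),
    "Innovation Index": (0.0, 100.0),
    "Regulatory Quality": (-2.5, 2.5),
}


def out_of_range_counts(df, ranges):
    """Count, per column, the non-missing values outside the documented range."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col not in df.columns:
            continue  # Skip columns that are absent from this frame
        values = df[col].dropna()
        counts[col] = int(((values < low) | (values > high)).sum())
    return counts


# Toy data for illustration only; not taken from the competition dataset
toy = pd.DataFrame({
    "Grid Integration Capability": [0.2, 1.4, None],
    "Innovation Index": [55.0, 72.5, 101.0],
    "Regulatory Quality": [-1.0, 0.3, 2.5],
})
report = out_of_range_counts(toy, EXPECTED_RANGES)
print(report)
```

Missing values are ignored here on purpose, since they are a separate cleaning question from out-of-range values.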

💪 Challenge

As a data scientist at NextEra Energy, your task is to use the Training Set (80% of the data) to train a powerful machine learning model that can predict renewable energy production (GWh). Once your model is trained, you will use it to generate predictions for the Test Set, which does not include the target (Production (GWh)) but has an additional ID column.

🚀 Your Task:

Train Your Model:
- Use the Training Set, which contains all features and the target (Production (GWh)), to build and fine-tune your model.
- Explore, clean, and transform the data as needed.
Generate Predictions:
- Use your trained model to make predictions for the Test Set (20%), which has all the features except Production (GWh).
- The Test Set also has an ID column, which uniquely identifies each row.
Submit Your Results:
- Save your predictions as a CSV file with exactly two columns:
  - ID: Directly from the Test Set (must match exactly).
  - Predicted Production (GWh): Your model’s predictions for each row.
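
The two-column file described above can be produced with pandas in a few lines. This is a minimal sketch, assuming the IDs and predictions are already available as equal-length sequences. The helper name `build_submission` is hypothetical, and the values simply reuse the numbers from the submission example in this brief.

```python
import pandas as pd


def build_submission(test_ids, predictions):
    """Return a two-column frame in the column order the brief asks for."""
    if len(test_ids) != len(predictions):
        raise ValueError("Need exactly one prediction per test-set ID")
    return pd.DataFrame({
        "ID": list(test_ids),
        "Predicted Production (GWh)": list(predictions),
    })


# Illustrative IDs and predictions only
submission = build_submission([1, 2, 3], [50200.34, 67820.78, 45210.55])

# index=False keeps pandas from adding a third, unnamed index column.
# Pass a path (for example "submission.csv", an assumed filename) to write a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Leaving the default index in place is an easy way to end up with three columns instead of two, which is why `index=False` matters here.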

🌐 Ready to Start?

Download the Training Set and Test Set.
Build, train, and test your model.
Submit your predictions. 🚀

🔎 Your model won’t just generate predictions — it will uncover underlying drivers of renewable energy production and reveal where the biggest gains can be made!

🧑‍⚖️ Judging Criteria

Your submission will be evaluated using a hybrid system, combining Model Accuracy (80%) and Community Votes (20%).

📊 1. Model Accuracy (80%)

Your submission will be scored using Root Mean Squared Error (RMSE), which measures how close your predictions are to the actual values in our hidden test set.
The lower your RMSE, the better your model’s performance.
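
For reference, RMSE is the square root of the mean of the squared differences between predicted and actual values. The sketch below computes it with NumPy on made-up numbers; the function name `rmse` and the sample values are illustrative assumptions, and the organizers' own scoring script is not shown in this brief.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


actual = [100.0, 200.0]

# Both prediction sets are off by 20 GWh in total, spread differently
even_errors = rmse(actual, [110.0, 190.0])    # errors of 10 and 10
uneven_errors = rmse(actual, [100.0, 220.0])  # errors of 0 and 20

print(even_errors, uneven_errors)
```

Because the errors are squared before averaging, the second set of predictions scores worse even though the total absolute error is the same. A few large misses cost more than many small ones.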

✅ Submission Instructions:

First, submit your Datalab workbook.
Then, submit your predictions as a .csv file via this Google Form.
Your file must contain exactly two columns:
- ID: Directly from the Test Set (must match exactly).
- Predicted Production (GWh): Your model’s predictions for each row.

✅ Submission Example:

ID	Predicted Production (GWh)
1	50200.34
2	67820.78
3	45210.55
...	...

✅ Important:

Use the same email address for the Google Form as the one associated with your DataCamp account. This is how we will link your submission to your Datalab workbook.
Only submissions in the correct format will be accepted and scored.
We will automatically check for formatting errors (missing IDs, extra IDs, or invalid columns).
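
It can help to run the same kind of checks locally before uploading. The following is a rough sketch of the three checks named above (missing IDs, extra IDs, invalid columns). The organizers' actual validator is not published here, so the function `find_format_errors` and its messages are assumptions.

```python
import pandas as pd

EXPECTED_COLUMNS = ["ID", "Predicted Production (GWh)"]


def find_format_errors(submission, test_ids):
    """Return a list of human-readable problems; an empty list means none found."""
    if list(submission.columns) != EXPECTED_COLUMNS:
        # Without the right columns the ID checks below cannot run
        return [f"invalid columns: {list(submission.columns)}"]

    errors = []
    submitted = set(submission["ID"])
    expected = set(test_ids)
    missing = sorted(expected - submitted)
    extra = sorted(submitted - expected)
    if missing:
        errors.append(f"missing IDs: {missing}")
    if extra:
        errors.append(f"extra IDs: {extra}")
    return errors


# Illustrative data: ID 3 is missing and ID 4 is not in the test set
test_ids = [1, 2, 3]
bad = pd.DataFrame({
    "ID": [1, 2, 4],
    "Predicted Production (GWh)": [50200.34, 67820.78, 45210.55],
})
problems = find_format_errors(bad, test_ids)
print(problems)
```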

✍️ 2. Community Votes (20%)

Once the competition ends, you will be able to view the top submissions from other participants.
Vote for the most insightful, creative, or well-explained solutions.

# Library Imports
import os

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder, StandardScaler

training_path_training_set = os.path.join("data","Training_set_augmented.csv")
public_test_set = os.path.join("data","Public_Test_Set.csv")
target = 'Production (GWh)'

df = pd.read_csv(training_path_training_set)
test_set = pd.read_csv(public_test_set)
print(df.shape)

X = df.drop(columns=target)
y = df[target]

class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None):
        self.columns = columns  # DataFrame columns to transform; None means all columns

    def fit(self, X, y=None):
        return self  # No fitting necessary

    @staticmethod
    def _clean(values):
        # Replace NaN with the column median and +inf with the largest finite
        # value in the column. The statistics are computed on finite values
        # only, so an inf entry can never be "replaced" by inf itself.
        finite = np.where(np.isfinite(values), values, np.nan)
        values = np.where(np.isnan(values), np.nanmedian(finite, axis=0), values)
        return np.where(np.isposinf(values), np.nanmax(finite, axis=0), values)

    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            X_copy = X.copy()
            cols = list(X_copy.columns) if self.columns is None else list(self.columns)
            # log1p computes log(1 + x), which is defined at x = 0
            X_copy[cols] = np.log1p(self._clean(X_copy[cols].to_numpy(dtype=float)))
            return X_copy

        # NumPy input (e.g. the output of SimpleImputer): transform every column
        return np.log1p(self._clean(np.asarray(X, dtype=float)))

# Custom Binning Transformer; expects numeric input and returns a NumPy array
class BinningTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, bins=3):
        self.bins = bins

    def fit(self, X, y=None):
        # Learn the bin edges once, on the training data, so the same edges
        # are reused when transforming validation or test data
        self.discretizer_ = KBinsDiscretizer(n_bins=self.bins, encode='ordinal', strategy='uniform')
        self.discretizer_.fit(X)
        return self

    def transform(self, X):
        # Each column is mapped to ordinal bin indices 0 .. bins - 1
        return self.discretizer_.transform(X)

# Custom Ratio Feature Transformer to return DataFrame with column names
class RatioFeatureTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # No fitting necessary

    def transform(self, X):
        X_copy = X.copy()
        # Create ratio features
        X_copy['GDP_to_Energy_Consumption'] = X_copy['GDP'] / X_copy['Energy Consumption']
        X_copy['Energy_Storage_to_Capacity'] = X_copy['Energy Storage Capacity'] / X_copy['Installed Capacity (MW)']

        # A zero denominator produces +/-inf, which the downstream SimpleImputer
        # rejects; mark those ratios as missing so they are imputed instead
        ratio_cols = ['GDP_to_Energy_Consumption', 'Energy_Storage_to_Capacity']
        X_copy[ratio_cols] = X_copy[ratio_cols].replace([np.inf, -np.inf], np.nan)

        return X_copy

# Custom Time Feature Transformer to return DataFrame with column names
class TimeFeatureTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Remember the earliest training year so that Year_Diff is measured
        # from the same baseline for training and test data
        self.min_year_ = X['Year'].min()
        return self

    def transform(self, X):
        X_copy = X.copy()
        # Create time-based features
        X_copy['Year_Mod5'] = X_copy['Year'] % 5  # Position of the year within a 5-year cycle
        X_copy['Year_Diff'] = X_copy['Year'] - self.min_year_  # Years since the earliest training year

        return X_copy

numeric_features = X.select_dtypes(exclude='object').columns

log_transform_features = [
    'Biomass Availability',
    'CO2 Emissions',
    'Electricity Prices',
    'Energy Consumption',
    'Energy Storage Capacity',
    'Energy Subsidies',
    'GDP',
    'Government Policies',
    'Innovation Index',
    'Installed Capacity (MW)',
    'Investments (USD)',
    'Population',
    'Public-Private Partnerships in Energy',
    'R&D Expenditure',
    'Renewable Energy Patents',
    'Solar Irradiance',
    'Wind Speed'
]

passthrough_features = [
    'Control of Corruption',
    'Ease of Doing Business',
    'Energy Market Liberalization',
    'Grid Integration Capability',
    'Hydro Potential',
    'Innovation Index',
    'Political Stability',
    'Proportion of Energy from Renewables',
    'Regulatory Quality',
    'Renewable Energy Targets',
]


year_features = ['Year']
ratio_features = ['GDP', 'Energy Consumption', 'Energy Storage Capacity', 'Installed Capacity (MW)']
# Numeric columns that no other transformer already handles
handled_features = set(log_transform_features) | set(passthrough_features) | set(ratio_features) | set(year_features)
num_remaining_features = [col for col in numeric_features if col not in handled_features]


cat_features = X.select_dtypes(include='object').columns


### TRANSFORMERS
log_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('log', LogTransformer())
])


time_transformer = Pipeline(steps=[
    ('time', TimeFeatureTransformer()),
])

binning_ratio_transformer = Pipeline(steps=[
    ('ratio', RatioFeatureTransformer()),
    ('imputer', SimpleImputer(strategy='median')),  # Handle missing numerical values
    ('binning', BinningTransformer())
])

num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Handle missing numerical values
    ('scaler', StandardScaler())  # Scale the remaining numerical features
])

cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

### PIPELINE
preprocessor = ColumnTransformer(
    transformers=[
        ('log', log_transformer, log_transform_features),
        ('passthrough', 'passthrough', passthrough_features),
        ('binning_ratio', binning_ratio_transformer, ratio_features),
        ('year', time_transformer, ['Year']),
        ('num', num_transformer, num_remaining_features),
        ('cat', cat_transformer, cat_features)
    ]
)

# Evaluating three different models with 5-fold cross-validation:
# Linear Regression, Lasso (L1-regularized linear regression), and Random Forest
for model in [LinearRegression(), Lasso(), RandomForestRegressor()]:
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    cv_scores = cross_val_score(
        pipeline,
        X,
        y,
        cv=5,
        scoring='neg_mean_squared_error',  # scikit-learn negates MSE so that higher is better
        n_jobs=-1
    )
    rmse_scores = np.sqrt(-cv_scores)  # Per-fold RMSE
    rmse = rmse_scores.mean()
    print(f"{type(model).__name__}: {rmse:.2f}")

# Elected to submit a model using Linear Regression, refit on the full training set

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

pipeline.fit(X, y)

X_test = test_set.drop('ID', axis = 1)

y_pred = pipeline.predict(X_test)

prediction_df = pd.DataFrame({'ID': test_set['ID'], 'Predicted Production (GWh)': y_pred})
prediction_df.head()