Unveiling Trends in Renewable Energy: The Definitive Model
Project by The Data Scientist Master
Welcome to a definitive analysis of global renewable energy production. This notebook presents a clean, robust, and high-performing predictive model. Our journey begins with a targeted exploratory analysis, followed by simple yet effective feature engineering, and culminates in a rigorously validated model. The final result is not just a prediction, but a clear, interpretable solution backed by a stable cross-validated performance estimate.
1. Project Setup: Libraries & Data Loading
Our first step is to establish a clean and reproducible environment by importing the necessary libraries and loading our datasets. We also apply a crucial best practice: standardizing column names (stripping whitespace, lowercasing, and removing special characters) to prevent potential errors down the line.
# ===================================================================
# SECTION 1: PROJECT SETUP
# ===================================================================
# --- Core Libraries ---
import pandas as pd
import numpy as np
# --- Visualization ---
import matplotlib.pyplot as plt
import seaborn as sns
import shap
# --- Preprocessing & Metrics ---
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
# --- Models ---
import lightgbm as lgb
import joblib  # For saving models
# --- Notebook Configuration ---
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 7)
pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
# --- Load Data ---
train_df = pd.read_csv('data/Training_set_augmented.csv')
test_df = pd.read_csv('data/Public_Test_Set.csv')
# --- Standardize all column names ---
def standardize_columns(df):
    """Cleans and standardizes column names."""
    df.columns = (df.columns
                  .str.strip()
                  .str.lower()
                  .str.replace(' ', '_', regex=False)
                  .str.replace('(', '', regex=False)
                  .str.replace(')', '', regex=False)
                  .str.replace('%', 'perc', regex=False))
    return df
train_df = standardize_columns(train_df)
test_df = standardize_columns(test_df)
print("Data loaded and column names have been standardized successfully!")
print("Example of new column names:", list(train_df.columns[:5]))
print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")
2. Exploratory Data Analysis (EDA)
A focused EDA helps us understand the data's story. We will start by analyzing our most important variable, the target, and then explore its relationships with key features.
2.1. Target Variable Analysis: Production (GWh)
The target variable is highly skewed, which is common for production metrics. A log transformation is essential to normalize its distribution, which helps stabilize model training.
# Plot the distribution of the target variable and its log transformation
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Original Distribution
sns.histplot(train_df['production_gwh'], kde=True, bins=50, ax=axes[0])
axes[0].set_title('Original Distribution of Production (GWh)', fontsize=14)
axes[0].set_xlabel('Production (GWh)', fontsize=12)
# Log-Transformed Distribution
sns.histplot(np.log1p(train_df['production_gwh']), kde=True, bins=50, ax=axes[1], color='orange')
axes[1].set_title('Log-Transformed Distribution of Production (GWh)', fontsize=14)
axes[1].set_xlabel('Log(1 + Production (GWh))', fontsize=12)
plt.tight_layout()
plt.show()
2.2. Production by Key Categories
Let's explore how production varies across different energy types and over the years. This gives us intuition about major trends in the data.
# Create comparison plots: production by type and over time
fig, axes = plt.subplots(1, 2, figsize=(20, 7))
# Production by energy type, shown on a log scale for easier comparison of skewed values
sns.boxplot(x='energy_type', y='production_gwh', data=train_df, ax=axes[0])
axes[0].set_yscale('log')
axes[0].set_title('Production (GWh) by Energy Type (Log Scale)', fontsize=16)
axes[0].set_ylabel('Production (GWh, log scale)', fontsize=12)
axes[0].set_xlabel('Energy Type', fontsize=12)
# Production Trends Over Time
yearly_production = train_df.groupby('year')['production_gwh'].mean().reset_index()
sns.lineplot(x='year', y='production_gwh', data=yearly_production, marker='o', ax=axes[1])
axes[1].set_title('Average Renewable Energy Production Over Years', fontsize=16)
axes[1].set_ylabel('Average Production (GWh)', fontsize=12)
axes[1].set_xlabel('Year', fontsize=12)
plt.tight_layout()
plt.show()
2.3. Correlation Analysis
A correlation heatmap provides a quick overview of the linear relationships between numerical features. This can hint at which variables might be important predictors.
plt.figure(figsize=(20, 16))
# Select numeric columns and drop those with constant values
numeric_cols = train_df.select_dtypes(include=np.number)
numeric_cols = numeric_cols.loc[:, numeric_cols.nunique() > 1]
# Compute correlation matrix
corr_matrix = numeric_cols.corr()
# Mask upper triangle for cleaner display
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
# Plot heatmap
sns.heatmap(
    corr_matrix,
    mask=mask,
    annot=True,
    fmt=".2f",
    cmap='viridis',
    square=True,
    linewidths=0.5
)
plt.title('Correlation Heatmap of Numerical Features', fontsize=16)
plt.show()
3. Data Preprocessing & Feature Engineering
This section combines all preprocessing and feature engineering steps into a clean, logical flow for both training and testing data.
3.1. Feature Engineering: Final Feature Set
We define a robust function to engineer our final features. It safely creates domain-informed ratios and a valid lag feature (lag_production_1_year) without causing data leakage. These engineered features are applied consistently to both training and test datasets.
def feature_engineer_final(df):
    """
    Engineers the final set of features.
    Keeps the 'good leak' (lag feature) and removes the 'bad leak'.
    """
    df_copy = df.copy()
    
    # Check for essential columns
    required_cols = ['gdp', 'population', 'installed_capacity_mw']
    for col in required_cols:
        if col not in df_copy.columns:
            raise ValueError(f"Missing required column: '{col}'")
    # Sort data to correctly calculate lag features
    df_copy = df_copy.sort_values(by=['country', 'energy_type', 'year']).reset_index(drop=True)
    # --- THE "GOOD LEAK": Valid lag feature ---
    if 'production_gwh' in df_copy.columns:
        df_copy['lag_production_1_year'] = (
            df_copy.groupby(['country', 'energy_type'])['production_gwh']
            .shift(1)
            .fillna(-1)
        )
    
    # --- SAFE FEATURES ---
    df_copy['gdp_per_capita'] = df_copy['gdp'] / df_copy['population'].replace(0, np.nan)
    df_copy['capacity_per_capita'] = df_copy['installed_capacity_mw'] / df_copy['population'].replace(0, np.nan)
    # Fill remaining NaNs
    df_copy = df_copy.fillna(0)
    return df_copy
Apply this to both train and test sets:
# --- Apply Feature Engineering to Training Data ---
train_featured_full = feature_engineer_final(train_df.copy())
# Separate features and target
if 'production_gwh' not in train_featured_full.columns:
    raise ValueError("Target column 'production_gwh' missing in training data.")
train_labels = np.log1p(train_featured_full['production_gwh'])
train_features = train_featured_full.drop(columns=['production_gwh'])
# --- Apply Feature Engineering to Test Data ---
test_features = feature_engineer_final(test_df.copy())
# Ensure target is not in test
if 'production_gwh' in test_features.columns:
    test_features = test_features.drop(columns=['production_gwh'])
# Align columns (in case test is missing any features)
train_cols = train_features.columns
test_features = test_features.reindex(columns=train_cols).fillna(0)
# Optional: check results
print("Final feature engineering complete.")
print(f"Training features shape: {train_features.shape}")
print(f"Test features shape: {test_features.shape}")
print(train_features[['country', 'energy_type', 'year', 'lag_production_1_year']].head(10))
3.2. Preprocessing Pipeline Definition
This step defines a unified preprocessing pipeline to handle missing values, scale numerical features, and encode categorical ones.
# --- 3. Preprocessing Pipeline Definition ---
# A single, clean pipeline to process our final feature set.
# Identify features
categorical_features = train_features.select_dtypes(include=['object', 'category']).columns
numerical_features = train_features.select_dtypes(include=np.number).columns
# Safety check
if len(numerical_features) == 0:
    print("Warning: No numerical features found.")
if len(categorical_features) == 0:
    print("Warning: No categorical features found.")
# Define preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numerical_features),
        
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features)
    ],
    remainder='drop'  # or 'passthrough' if you want to keep extra columns
)
print("Preprocessing pipeline defined.")
4. Model Training with Robust Cross-Validation
This is the core of our project. We train our LightGBM model using a K-Fold Cross-Validation strategy to ensure our performance estimate is stable and reliable. We print detailed metrics for each fold to monitor the model's behavior.
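As a concrete illustration of this strategy, here is a minimal sketch of the cross-validation loop, assuming the train_features, train_labels, test_features, and preprocessor objects defined in the cells above. The fold count, LightGBM hyperparameters, and output filename are placeholders rather than tuned values.
# Minimal K-Fold training sketch (assumes train_features, train_labels,
# test_features, and preprocessor from the previous cells; the fold count,
# hyperparameters, and file name below are placeholders, not tuned values).
N_FOLDS = 5
kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=42)

oof_predictions = np.zeros(len(train_features))
fold_rmses = []

for fold, (train_idx, valid_idx) in enumerate(kf.split(train_features), start=1):
    X_tr, X_val = train_features.iloc[train_idx], train_features.iloc[valid_idx]
    y_tr, y_val = train_labels.iloc[train_idx], train_labels.iloc[valid_idx]

    # Bundle preprocessing and the model so scaler/encoder statistics are
    # learned from the training fold only (no leakage into the validation fold).
    model = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', lgb.LGBMRegressor(
            n_estimators=1000,    # placeholder, not tuned
            learning_rate=0.05,   # placeholder, not tuned
            random_state=42
        ))
    ])
    model.fit(X_tr, y_tr)

    val_pred = model.predict(X_val)
    oof_predictions[valid_idx] = val_pred

    # RMSE on the log1p-transformed target (equivalent to RMSLE in GWh)
    fold_rmse = np.sqrt(mean_squared_error(y_val, val_pred))
    fold_rmses.append(fold_rmse)
    print(f"Fold {fold}: RMSE (log scale) = {fold_rmse:.4f}")

print(f"Mean CV RMSE (log scale): {np.mean(fold_rmses):.4f} +/- {np.std(fold_rmses):.4f}")

# Refit on all training data, save the fitted pipeline, and predict the test set.
final_model = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05, random_state=42))
])
final_model.fit(train_features, train_labels)
joblib.dump(final_model, 'final_lightgbm_model.joblib')  # hypothetical output path

# Convert predictions back to GWh with the inverse of log1p
test_predictions_gwh = np.expm1(final_model.predict(test_features))
Wrapping the preprocessor and regressor in a single Pipeline keeps each fold's preprocessing leak-free and lets the fitted model be saved with joblib as one artifact.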