
As the climate changes, predicting the weather becomes ever more important for businesses. You have been asked to support a machine learning project aimed at building a pipeline to predict the climate in London, England. Specifically, the model should predict mean temperature in degrees Celsius (°C).

Since the weather depends on many different factors, you will want to run many experiments to determine which approach predicts the mean temperature best. In this project, you will run experiments for different regression models predicting the mean temperature, using a combination of sklearn and mlflow.

You will be working with data stored in london_weather.csv, which contains the following columns:

  • date - recorded date of measurement - (int)
  • cloud_cover - cloud cover measurement in oktas - (float)
  • sunshine - sunshine measurement in hours (hrs) - (float)
  • global_radiation - irradiance measurement in Watt per square meter (W/m2) - (float)
  • max_temp - maximum temperature recorded in degrees Celsius (°C) - (float)
  • mean_temp - target mean temperature in degrees Celsius (°C) - (float)
  • min_temp - minimum temperature recorded in degrees Celsius (°C) - (float)
  • precipitation - precipitation measurement in millimeters (mm) - (float)
  • pressure - pressure measurement in Pascals (Pa) - (float)
  • snow_depth - snow depth measurement in centimeters (cm) - (float)
import sys
print(f"The current python version being used is {sys.version}")
!pip install --upgrade pip
# Run this cell to install mlflow
!pip install -q ydata-profiling mlflow scipy graphviz
from importlib.metadata import version

pkgs = [
    "pandas",
    "numpy",
    "mlflow",
    "matplotlib",
    "scikit-learn",
    "seaborn",
    "ydata-profiling",
    "scipy",
    "graphviz"
]

for p in pkgs:
    print(f"{p} version: {version(p)}")
# Run this cell to import the modules you require
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from ydata_profiling import ProfileReport
from scipy.stats import boxcox, yeojohnson


# Read in the data
weather = pd.read_csv("london_weather.csv")

# Start coding here
# Use as many cells as you like
print("Concise summary of the data")
weather.info()
print("=" * 100)
print("sample head and tail")
print(weather.head().transpose())
print("tail")
print(weather.tail().transpose())
# explore the dataset with pandas profiling
profile = ProfileReport(weather, title="Profiling Report")
profile.to_file("weather_dataset.html")
profile

Considerations while using this dataset:

  • Precipitation and snow depth have a lot of missing values (over 50%), so keeping them unchanged would not be useful and could make model fitting difficult.
  • Sunshine and cloud cover also have a few missing values, but we can work with what we have there.
  • Many variables are highly correlated with one another, which makes models that assume independent predictors unreliable. Using mean_temp as the target works well, since it is effectively a compressed representation of the min and max temperatures.
  • The dates are all unique. With some feature engineering we can derive additional features from them to help the model.

Additional points I may have left out:

Generated with Claude 3.5 Sonnet by Anthropic

  1. Correlations:

    • Cloud cover, global radiation, max_temp, mean_temp, min_temp, and sunshine are all highly correlated with each other.
    • Pressure and precipitation are highly correlated with each other.
  2. Missing and Zero Values:

    • Snow_depth has 1441 (9.4%) missing values and is highly skewed (γ1 = 22.58706503).
    • Several variables have a significant number of zero values (reproduced in the sketch after this list):
      • Snow_depth: 13760 (89.7%) zeros
      • Precipitation: 7963 (51.9%) zeros
      • Sunshine: 2570 (16.8%) zeros
      • Cloud_cover: 381 (2.5%) zeros
  3. Unique Values:

    • The date field has unique values, which is expected for a time series dataset.
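
The missing-value and zero-value percentages quoted above can be reproduced with a quick pandas check. This is a minimal sketch against the weather DataFrame loaded earlier; the summary variable name is just illustrative.

# Sketch: reproduce the missing-value / zero-value percentages and skewness per column
summary = pd.DataFrame({
    "missing_%": weather.isnull().mean() * 100,
    "zero_%": (weather == 0).mean() * 100,
    "skew": weather.skew(numeric_only=True),
})
print(summary.round(1).sort_values("missing_%", ascending=False))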

Insights and Implications:

  1. The high correlation between cloud cover, global radiation, temperature variables, and sunshine is expected, as these are closely related weather phenomena. This multicollinearity might affect certain types of statistical analyses or machine learning models.

  2. The correlation between pressure and precipitation suggests a strong relationship between atmospheric pressure and rainfall patterns.

  3. The high percentage of zero values in snow_depth (89.7%) and precipitation (51.9%) indicates that:

    • The location likely experiences long periods without snowfall.
    • There are many days without rainfall, which could suggest a dry or seasonal climate.
  4. The skewness in snow_depth data, combined with its high percentage of zeros and missing values, suggests that snowfall is a rare event in this location. This variable might need special treatment in analyses.

  5. The presence of zero values in sunshine hours (16.8%) could indicate completely overcast days or potentially issues with measurement during night hours.

  6. The small percentage of zeros in cloud cover (2.5%) suggests that completely clear skies are relatively rare.

  7. The uniqueness of date values confirms this is likely a daily time series dataset without duplicates.

These observations are crucial for data preprocessing, feature engineering, and model selection. You may need to consider:

  • Handling missing values in snow_depth
  • Addressing the skewness in snow_depth, possibly through transformation
  • Deciding how to treat the high number of zero values, especially for precipitation and snow_depth
  • Being cautious about multicollinearity in model building
  • Potentially creating derived features that capture seasonal patterns, given the time series nature of the data

Next steps:

Generated with Claude 3.5 Sonnet by Anthropic

To improve the dataset for modeling mean temperature using tree-based models and linear regression, you can consider the following actions:

  1. Handle missing values:

    • For snow_depth, which has 9.4% missing values, you could:
      a) Impute using seasonal averages or a more sophisticated method like multiple imputation
      b) Create a binary flag indicating whether snow_depth was missing
      c) If snow is rare, consider merging missing values with zero values
  2. Address skewness and zero-inflation:

    • For highly skewed variables like snow_depth:
      a) Apply log transformation: log(snow_depth + 1)
      b) Use square root transformation
      c) Bin the values into categories
    • For zero-inflated variables (snow_depth, precipitation):
      a) Create binary indicators (e.g., did_snow, did_rain)
      b) Consider two-part models: one to predict occurrence, another to predict amount
  3. Feature engineering:

    • Create lag features for time-dependent variables (e.g., yesterday's temperature); see the sketch after this list
    • Generate rolling averages or other statistics (e.g., 7-day moving average of temperature)
    • Extract seasonal features from the date (month, day of year, season)
    • Create interaction terms between correlated variables (e.g., sunshine * global_radiation)
  4. Handle multicollinearity:

    • For tree-based models, multicollinearity is less of an issue
    • For linear regression:
      a) Use regularization techniques (Lasso, Ridge, or Elastic Net)
      b) Select one representative variable from highly correlated groups
      c) Create composite variables (e.g., principal component analysis)
  5. Feature selection:

    • Use techniques like recursive feature elimination or feature importance from tree-based models to identify the most predictive variables
  6. Normalize/Standardize data:

    • Important for linear regression, less so for tree-based models
    • Standardize numerical features to have zero mean and unit variance
  7. Handle outliers:

    • Identify and investigate extreme values in all variables
    • Consider capping extreme values (winsorization) or using robust scaling methods
  8. Cross-validation strategy:

    • Given the time series nature, use time-based cross-validation instead of random splitting (also shown in the sketch after this list)
  9. Create domain-specific features:

    • Heat index or wind chill factor if humidity and wind data are available
    • Day length calculated from the date, which affects temperature
  10. Balance the dataset:

    • If certain weather conditions are rare but important, consider oversampling or using weighted models
  11. Encoding categorical variables:

    • If you create categorical features (e.g., seasons), use appropriate encoding methods
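
As referenced in items 3 and 8, here is a minimal sketch of lag features, a rolling average, and a time-ordered split with scikit-learn's TimeSeriesSplit. It is illustrative only: the lag and window sizes are arbitrary placeholders, and it assumes the raw integer date column sorts chronologically (e.g. a YYYYMMDD encoding).

from sklearn.model_selection import TimeSeriesSplit

# Sketch only: lag and rolling features for the target (lag/window sizes are placeholders)
demo = weather.sort_values("date").copy()
demo["mean_temp_lag1"] = demo["mean_temp"].shift(1)            # yesterday's mean temperature
demo["mean_temp_roll7"] = demo["mean_temp"].rolling(7).mean()  # 7-day moving average

# Time-ordered cross-validation folds instead of a random split
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, valid_idx) in enumerate(tscv.split(demo)):
    print(f"fold {fold}: train rows {train_idx[0]}-{train_idx[-1]}, "
          f"validation rows {valid_idx[0]}-{valid_idx[-1]}")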

Remember, tree-based models and linear regression have different strengths and assumptions. Tree-based models can capture non-linear relationships and interactions automatically, while linear regression assumes linear relationships but provides more interpretable coefficients.

For linear regression, focus more on addressing multicollinearity, normalizing features, and ensuring linearity assumptions are met. For tree-based models, feature engineering and handling of zero-inflated variables might be more impactful.
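
For instance, scaling and regularization can be bundled into a single estimator with a scikit-learn Pipeline. A minimal sketch, where the alpha value is an arbitrary placeholder rather than a tuned setting:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Sketch: standardization + L2 regularization in one estimator (alpha is a placeholder)
ridge_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# Once the train/test split further down exists, this could be fit and logged just like
# the plain LinearRegression, e.g. ridge_pipeline.fit(X_train, y_train)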

Handling snow_depth

Ideally we would find out why these values are missing, but for most of the year there is simply no snow in London, which likely explains the gaps. We also cannot turn a blind eye to the fact that the variable is heavily right-skewed. Let's add an indicator for missingness, impute the missing values with zeros, engineer seasonal features so the model can tell when snow is plausible, and transform the variable to reduce its skew.

# flag whether snow_depth was observed: 1 if snow_depth is not null, 0 otherwise
weather["snow_depth_mask"] = weather["snow_depth"].notnull().astype(int)

# get approximate values to fill in the missing values using KNN
# imputer = KNNImputer(n_neighbors=5)
# weather["snow_depth_imputed"] = imputer.fit_transform(weather[["snow_depth"]])
weather["snow_depth_imputed"] = weather["snow_depth"].fillna(0)

# boxcox transformation
# weather["snow_depth_boxcox"], _ = boxcox(weather["snow_depth_imputed"] + 1)
# yeo-johnson transformation
weather["snow_depth_yeojohnson"], _ = yeojohnson(weather["snow_depth_imputed"] + 1)

print(weather["snow_depth_mask"].value_counts())
print("#" * 100)
# compare skewness of the original and Yeo-Johnson transformed snow_depth
print(f"Original skew: {weather['snow_depth'].skew()}")
# print(f"After Box-Cox transformation: {weather['snow_depth_boxcox'].skew()}")
print(f"After Yeo-Johnson transformation: {weather['snow_depth_yeojohnson'].skew()}")
# Extract seasonal features from the date (month, day of year, season)
# the date column is an integer; assuming a YYYYMMDD encoding, parse it explicitly
weather["date"] = pd.to_datetime(weather["date"], format="%Y%m%d")

# extract month, day of year and season
weather["month"] = weather["date"].dt.month
weather["dayofyear"] = weather["date"].dt.dayofyear
weather["year"] = weather["date"].dt.year

# seasons in london are defined as follows: 
# Spring: March 1 to May 31
# Summer: June 1 to August 31
# Fall: September 1 to November 30
# Winter: December 1 to February 28
weather["season"] = weather["date"].dt.month.apply(
    lambda x: "Spring" if 3 <= x <= 5
    else "Summer" if 6 <= x <= 8
    else "Fall" if 9 <= x <= 11
    else "Winter"
)

# drop the original date column
weather.drop("date", axis=1, inplace=True)

# resultant columns
print(weather[["month", "dayofyear", "season", "year"]].head())
print("=" * 100)
print(weather[["month", "dayofyear", "season", "year"]].tail())

Interaction features

Weather variables are highly correlated, so we can combine related pairs into interaction features that capture their joint effect in a single column. Multiplying related variables together condenses the observations into a coalesced form and may help model performance.

# Create interaction features: sunshine and global_radiation
weather["sunshine_global_radiation"] = weather["sunshine"] * weather["global_radiation"]

# max temperature and min temperature interaction
weather["max_min_temperature"] = weather["max_temp"] * weather["min_temp"]

# pressure and precipitation interaction
weather["pressure_precipitation"] = weather["pressure"] * weather["precipitation"]

# resultant columns
print(weather[["sunshine_global_radiation", "max_min_temperature", "pressure_precipitation"]].head())

# resultant columns
print(weather[["sunshine_global_radiation", "max_min_temperature", "pressure_precipitation"]].tail())

# drop the original columns
weather.drop(["sunshine", "global_radiation", "max_temp", "min_temp", "pressure", "precipitation", "snow_depth_imputed", "snow_depth"], axis=1, inplace=True)
# missing values percentage
missing_values = weather.isnull().mean() * 100

print(missing_values)
print("not too bad we don't have a lot of missing values now. It would be unrealistic to leave out some since some weather metrics are needed for accurate prediction")
# Preprocess data
imputer = SimpleImputer(strategy='median')

# One hot encode categorical features
weather = pd.get_dummies(weather, drop_first=True)

X = weather.drop(columns=['mean_temp'])
y = weather['mean_temp']

# Impute missing values in X and y (missing target values are also filled with the median)
X = imputer.fit_transform(X)
y = imputer.fit_transform(y.values.reshape(-1, 1)).ravel()

# Split data into training, validation, and testing sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Function to train a model, evaluate it, and log parameters, metric and model to MLflow
def train_and_log_model(model, model_name, params, input_example, X_tr, y_tr):
    with mlflow.start_run() as run:
        model.fit(X_tr, y_tr)
        # NOTE: RMSE is computed on the same data the model was fit on
        predictions = model.predict(X_tr)
        rmse = np.sqrt(mean_squared_error(y_tr, predictions))

        mlflow.log_params(params)
        mlflow.log_metric("rmse", rmse)
        mlflow.sklearn.log_model(model, model_name, input_example=input_example)

        return run.info.experiment_id, run.info.run_id, rmse, model

# Train and log Linear Regression model
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

lr = LinearRegression()
lr_params = {}
lr_input_example = pd.DataFrame(X_train_scaled[:1], columns=weather.drop(columns=['mean_temp']).columns)
lr_experiment_id, lr_run_id, lr_rmse, lr_model = train_and_log_model(lr, "LinearRegression", lr_params, lr_input_example, X_train_scaled, y_train)

# Hyperparameter tuning for Decision Tree
dt = DecisionTreeRegressor(random_state=42)
dt_param_dist = {
    "max_depth": [3, 4, None],
    "min_samples_split": [2, 10, 25],
    "min_samples_leaf": [1, 5, 10],
    #"criterion": ["mse", "mae"]
}
dt_random_search = RandomizedSearchCV(dt, param_distributions=dt_param_dist, n_iter=10, cv=5, random_state=42)
dt_random_search.fit(X_train, y_train)
dt_best_params = dt_random_search.best_params_

dt_input_example = pd.DataFrame(X_train[:1], columns=weather.drop(columns=['mean_temp']).columns)
dt_experiment_id, dt_run_id, dt_rmse, dt_model = train_and_log_model(dt_random_search.best_estimator_, "DecisionTreeRegressor", dt_best_params, dt_input_example, X_train, y_train)

# Hyperparameter tuning for Random Forest
rf = RandomForestRegressor(random_state=42)
rf_param_dist = {
    "n_estimators": [10, 50, 100],
    "max_depth": [3, 4, None],
    "min_samples_split": [2, 10, 25],
    "min_samples_leaf": [1, 5, 10],
    "bootstrap": [True, False]
}
rf_random_search = RandomizedSearchCV(rf, param_distributions=rf_param_dist, n_iter=10, cv=5, random_state=42)
rf_random_search.fit(X_train, y_train)
rf_best_params = rf_random_search.best_params_

rf_input_example = pd.DataFrame(X_train[:1], columns=weather.drop(columns=['mean_temp']).columns)
rf_experiment_id, rf_run_id, rf_rmse, rf_model = train_and_log_model(rf_random_search.best_estimator_, "RandomForestRegressor", rf_best_params, rf_input_example, X_train, y_train)

# Determine the model with the lowest RMSE
models_rmse = {
    "LinearRegression": (lr_rmse, lr_model),
    "DecisionTreeRegressor": (dt_rmse, dt_model),
    "RandomForestRegressor": (rf_rmse, rf_model)
}

best_model_name, (best_rmse, best_model) = min(models_rmse.items(), key=lambda item: item[1][0])

#print(f"The best model is {best_model_name} with an RMSE of {best_rmse}")
#print(models_rmse)

# Search all MLflow runs and store the results
#experiment_id = rf_experiment_id  # Assuming all runs are in the same experiment
experiment_results = mlflow.search_runs(filter_string="metrics.rmse < 3", search_all_experiments=True)
#experiment_results = mlflow.search_runs(experiment_ids=[experiment_id], filter_string="metrics.rmse <= 3")


print(experiment_results)
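
As a final check, the best run's model could be loaded back from MLflow and scored on the held-out test set, which has not been touched so far. A minimal sketch, assuming the artifact path matches the model name used when logging; the run_ids mapping reuses the run ids captured above and is purely illustrative.

# Sketch: reload the best run's model from MLflow and evaluate it on the test set
run_ids = {
    "LinearRegression": lr_run_id,
    "DecisionTreeRegressor": dt_run_id,
    "RandomForestRegressor": rf_run_id,
}
best_run_id = run_ids[best_model_name]
loaded_model = mlflow.sklearn.load_model(f"runs:/{best_run_id}/{best_model_name}")

# The linear model was fit on standardized features, so apply the same scaler to X_test
X_eval = scaler.transform(X_test) if best_model_name == "LinearRegression" else X_test
test_predictions = loaded_model.predict(X_eval)
test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
print(f"{best_model_name} test RMSE: {test_rmse:.3f}")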