
As the climate changes, predicting the weather becomes ever more important for businesses. Because the weather depends on many different factors, finding the best prediction approach takes many experiments. In this project, you will run experiments with different regression models that predict the mean temperature, using a combination of sklearn and MLflow.

You will be working with data stored in london_weather.csv, which contains the following columns:

  • date - recorded date of measurement - (int)
  • cloud_cover - cloud cover measurement in oktas - (float)
  • sunshine - sunshine measurement in hours (hrs) - (float)
  • global_radiation - irradiance measurement in Watt per square meter (W/m2) - (float)
  • max_temp - maximum temperature recorded in degrees Celsius (°C) - (float)
  • mean_temp - mean temperature in degrees Celsius (°C) - (float)
  • min_temp - minimum temperature recorded in degrees Celsius (°C) - (float)
  • precipitation - precipitation measurement in millimeters (mm) - (float)
  • pressure - pressure measurement in Pascals (Pa) - (float)
  • snow_depth - snow depth measurement in centimeters (cm) - (float)
# Run this cell to import the modules you require
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint, uniform

# Read in the data
weather = pd.read_csv("london_weather.csv")

# Start coding here
# Use as many cells as you like

1. Loading the data

weather.info()  # info() prints its summary directly and returns None
print("size >>> ", len(weather))
print(weather.head())
print(weather.columns)
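
Because later steps drop rows with missing values, it can help to see how many gaps each column has before cleaning. The following is a minimal sketch on a small made-up frame (the values and the `sample` and `missing` names are illustrative assumptions, not taken from london_weather.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the weather data (made-up values)
sample = pd.DataFrame({
    "date": [19790101, 19790102, 19790103, 19790104],
    "cloud_cover": [2.0, 6.0, np.nan, 8.0],
    "mean_temp": [-4.1, -2.6, np.nan, -0.3],
    "snow_depth": [9.0, 8.0, 4.0, np.nan],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": sample.isna().mean() * 100,
})
print(missing)
```

Running the same two expressions on `weather` would show which columns lose the most rows under `dropna()`.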

2. Data cleaning

# Convert the 'date' column from int (YYYYMMDD) to datetime
weather["date"] = pd.to_datetime(weather["date"].astype(str), format="%Y%m%d")
# Extract the year for grouping and plotting
weather["year"] = weather["date"].dt.year
# Count the records per year to check the coverage of the data
sns.histplot(data=weather, x="year", binwidth=1)
plt.show()
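
The integer-to-datetime conversion above can be checked in isolation. This sketch uses three made-up dates and also extracts the month, which is an assumption about what further date information might be useful; the `dates`, `parsed`, and `parts` names are illustrative only:

```python
import pandas as pd

# Made-up dates in the same int YYYYMMDD layout as the 'date' column
dates = pd.Series([19790101, 19851231, 20200229], name="date")

# Cast to string first so the format string can parse the digits positionally
parsed = pd.to_datetime(dates.astype(str), format="%Y%m%d")

# Derive calendar parts from the parsed datetimes
parts = pd.DataFrame({
    "date": parsed,
    "year": parsed.dt.year,
    "month": parsed.dt.month,
})
print(parts)
```

Passing an explicit `format` avoids ambiguous parsing and makes the conversion fail loudly if a value does not match the expected layout.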

3. Exploratory data analysis

plt.figure(figsize=(10, 6))
sns.lineplot(data=weather, x="year", y="mean_temp", label="Mean Temp")
sns.lineplot(data=weather, x="year", y="max_temp", label="Max Temp")
sns.lineplot(data=weather, x="year", y="min_temp", label="Min Temp")
plt.ylabel("Temperature (°C)")
plt.title("Yearly Mean, Max, and Min Temperatures")
plt.legend()
plt.show()
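
When several rows share the same x value, `sns.lineplot` draws the mean of y for each x by default, so the lines above show yearly averages of the daily values. The same aggregation can be sketched explicitly with a groupby, here on made-up numbers (the `sample` and `yearly` names are illustrative assumptions):

```python
import pandas as pd

# Made-up daily records spread over two years
sample = pd.DataFrame({
    "year": [1979, 1979, 1980, 1980],
    "mean_temp": [10.0, 12.0, 11.0, 15.0],
    "max_temp": [14.0, 18.0, 15.0, 21.0],
    "min_temp": [6.0, 8.0, 7.0, 9.0],
})

# Average each temperature column per year
yearly = sample.groupby("year")[["mean_temp", "max_temp", "min_temp"]].mean()
print(yearly)
```

Computing the yearly table directly makes the plotted values inspectable and reusable outside the plot.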

4. Feature selection

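# Drop rows with missing values so that every model trains on complete records
weather_without_na = weather.dropna()
# Features: every column except the target and the date-derived columns
X = weather_without_na.drop(columns=["mean_temp", "date", "year"])
# Target as a 1-D Series; a one-column DataFrame makes RandomForestRegressor.fit
# emit a DataConversionWarning
y = weather_without_na["mean_temp"]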
# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# 1. Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
# For linear regression, use the absolute value of the coefficients as "importance"
# (the features are unscaled here, so each magnitude depends on that feature's units)
lr_importances = pd.Series(data=np.abs(lr.coef_).flatten(), index=X_train.columns)

# 2. Decision Tree Regressor
dt = DecisionTreeRegressor(random_state=SEED)
dt.fit(X_train, y_train)
dt_importances = pd.Series(data=dt.feature_importances_, index=X_train.columns)

# 3. Random Forest Regressor
rf = RandomForestRegressor(n_estimators=25, random_state=SEED)
rf.fit(X_train, y_train)
rf_importances = pd.Series(data=rf.feature_importances_, index=X_train.columns)

# Combine all importances into a DataFrame
importances_df = pd.DataFrame({
    'Linear Regression': lr_importances,
    'Decision Tree': dt_importances,
    'Random Forest': rf_importances
})

# Sort features by Random Forest importance for plotting
importances_df = importances_df.loc[rf_importances.sort_values(ascending=False).index]

# Plot
importances_df.plot(kind='barh', figsize=(8,6))
plt.title('Feature Importances by Model')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.legend(loc='lower right')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
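
The absolute linear-regression coefficients plotted above are measured per unit of each feature, so a feature recorded in large units (such as pressure in Pascals) gets a small coefficient even when it matters. A minimal sketch of the effect, on synthetic data with two hypothetical features, compares raw and standardized coefficients; the data-generating formula and the `X_demo`, `y_demo`, and `comparison` names are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features on very different scales (not the real weather data)
X_demo = pd.DataFrame({
    "sunshine": rng.uniform(0, 15, n),          # hours
    "pressure": rng.normal(101_500, 1_000, n),  # Pascals
})
# Synthetic target that depends on both features, plus a little noise
y_demo = (
    0.5 * X_demo["sunshine"]
    + 0.002 * X_demo["pressure"]
    + rng.normal(0, 0.1, n)
)

# Fit once on the raw features and once on standardized features
raw_model = LinearRegression().fit(X_demo, y_demo)
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_model = LinearRegression().fit(X_scaled, y_demo)

comparison = pd.DataFrame(
    {
        "raw": np.abs(raw_model.coef_),
        "standardized": np.abs(scaled_model.coef_),
    },
    index=X_demo.columns,
)
print(comparison)
```

In this sketch the raw coefficient for pressure is tiny, while the standardized coefficients of the two features are of similar size. In a train/test workflow the scaler would normally be fitted on the training split only.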

5. Preprocess data