
As the climate changes, predicting the weather becomes ever more important for businesses. The aim of this task is to build a pipeline to predict the climate in London, England. Specifically, the model should predict mean temperature in degrees Celsius (°C).

Since the weather depends on many different factors, we will run a series of experiments to determine the best approach to predicting it. In this project, we will run experiments with different regression models for predicting the mean temperature, using a combination of sklearn and mlflow.
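To give a sense of how mlflow fits in, each model will be trained and evaluated inside an mlflow run, roughly following the pattern sketched below; the run name, parameter, and metric value here are placeholders rather than part of the project specification.

# Sketch of the mlflow logging pattern used for each experiment (placeholder values)
import mlflow  # installed and imported again in the cells below

with mlflow.start_run(run_name="example_model"):
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("rmse", 0.0)  # replaced later by the real test-set RMSE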

We will be working with data stored in london_weather.csv, which contains the following columns:

  • date - recorded date of measurement, stored as an integer in YYYYMMDD format - (int)
  • cloud_cover - cloud cover measurement in oktas - (float)
  • sunshine - sunshine measurement in hours (hrs) - (float)
  • global_radiation - irradiance measurement in Watt per square meter (W/m2) - (float)
  • max_temp - maximum temperature recorded in degrees Celsius (°C) - (float)
  • mean_temp - target mean temperature in degrees Celsius (°C) - (float)
  • min_temp - minimum temperature recorded in degrees Celsius (°C) - (float)
  • precipitation - precipitation measurement in millimeters (mm) - (float)
  • pressure - pressure measurement in Pascals (Pa) - (float)
  • snow_depth - snow depth measurement in centimeters (cm) - (float)
# install mlflow
!pip install mlflow
# Import the modules
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Read the data
weather = pd.read_csv("london_weather.csv")
# Exploratory data analysis and data cleaning
# Inspect the first rows, dimensions, summary statistics, dtypes, and missing values
weather.head()
weather.shape
weather.describe()
weather.info()
weather.isnull().sum()
# Convert date
weather['date'] = pd.to_datetime(weather['date'], format="%Y%m%d")
# Extract the year and month
weather['year'] = weather['date'].dt.year
weather['month'] = weather['date'].dt.month
weather.head()
# Aggregate and calculate average metrics per month
weather_metrics = ['cloud_cover', 'sunshine', 'global_radiation', 'max_temp', 'mean_temp', 'min_temp', 'precipitation', 'pressure', 'snow_depth']
weather_per_month = weather.groupby(['year', 'month'], as_index=False)[weather_metrics].mean()
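The monthly aggregate is not used again in this section; as an optional sanity check, one way to plot it (reusing only the variables defined above) might be:

# Optional: visualize the monthly average mean temperature over time
weather_per_month['date'] = pd.to_datetime(weather_per_month[['year', 'month']].assign(day=1))
sns.lineplot(data=weather_per_month, x='date', y='mean_temp')
plt.ylabel('Monthly mean temperature (°C)')
plt.show()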
# Visualize relationships in the data
sns.lineplot(data=weather, x='year', y='mean_temp')
plt.show()
# Correlation heatmap of the numeric columns (excludes the datetime 'date' column)
correlation = weather.corr(numeric_only=True)
sns.heatmap(correlation)
plt.show()
# Choose features, define the target, and drop rows where the target is missing
feature_selection = ['sunshine', 'cloud_cover', 'global_radiation', 'max_temp', 'min_temp', 'month']
target_variable = 'mean_temp'
weather = weather.dropna(subset=['mean_temp'])
# Subset feature and target sets
x = weather[feature_selection]
y = weather[target_variable]
# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
# Impute missing feature values with the column mean, fitting on the training data only
imputer = SimpleImputer(strategy="mean")
x_train = imputer.fit_transform(x_train)
# Apply the fitted imputer to the test data
x_test = imputer.transform(x_test)

# Scale the features, fitting the scaler on the training data only
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
# Apply the fitted scaler to the test data
x_test = scaler.transform(x_test)
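
With preprocessing done, the experiments described in the introduction can be run. The following is a minimal sketch, assuming the preprocessed x_train, x_test, y_train, and y_test above: it trains the three imported sklearn regressors, evaluates test-set RMSE, and logs each run with mlflow. The experiment name and hyperparameter values are illustrative assumptions, not prescribed by the project.

# Sketch: train each regression model, evaluate RMSE, and log the run with mlflow
# (experiment name and hyperparameter values are illustrative assumptions)
mlflow.set_experiment("london_weather_mean_temp")

models = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(max_depth=10, random_state=1),
    "random_forest": RandomForestRegressor(n_estimators=100, max_depth=10, random_state=1),
}

results = []
for model_name, model in models.items():
    with mlflow.start_run(run_name=model_name):
        # Fit on the preprocessed training data
        model.fit(x_train, y_train)
        # Evaluate with root mean squared error on the test set
        y_pred = model.predict(x_test)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        # Log parameters, the metric, and the fitted model
        mlflow.log_params(model.get_params())
        mlflow.log_metric("rmse", rmse)
        mlflow.sklearn.log_model(model, "model")
        results.append((model_name, rmse))

# Compare the logged runs
print(pd.DataFrame(results, columns=["model", "rmse"]))

The logged runs can then be compared side by side in the mlflow tracking UI (started with the mlflow ui command) to pick the model with the lowest RMSE.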