
As the climate changes, predicting the weather becomes ever more important for businesses. Since the weather depends on many different factors, you will run a couple of experiments to determine which model best predicts the weather.

The London weather data is sourced from Kaggle as london_weather.csv and contains the following columns:

  • Target Variable: mean_temp - mean temperature in degrees Celsius (°C) - (float)
  • date - recorded date of measurement, stored as YYYYMMDD - (int)
  • cloud_cover - cloud cover measurement in oktas - (float)
  • sunshine - sunshine measurement in hours (hrs) - (float)
  • global_radiation - irradiance measurement in Watt per square meter (W/m2) - (float)
  • max_temp - maximum temperature recorded in degrees Celsius (°C) - (float)
  • min_temp - minimum temperature recorded in degrees Celsius (°C) - (float)
  • precipitation - precipitation measurement in millimeters (mm) - (float)
  • pressure - pressure measurement in Pascals (Pa) - (float)
  • snow_depth - snow depth measurement in centimeters (cm) - (float)

Step 0: Import libraries

First, you'll import necessary libraries, including MLflow.

MLflow is an open-source platform designed to help manage the end-to-end machine learning lifecycle. It provides a comprehensive set of tools and features to streamline the process of building, training, and deploying machine learning models. Today, we'll be using MLflow for tracking experiments, hyperparameter tuning, model performance evaluation, and comparison and analysis of multiple models.

To use MLflow, we first need to install the package, since it is not included in the workspace by default. By prefixing a line with !, we can run a shell command to install it.

!pip install mlflow
# Basic operations
import pandas as pd
import numpy as np

# Machine learning experiments
import mlflow
import mlflow.sklearn

# Data visualizations
import seaborn as sns
import matplotlib.pyplot as plt

# sklearn for predictive analytics
## Model preparation 
from sklearn.model_selection import train_test_split

## Missing data handling and data preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

## ML algorithms
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

## Model evaluation metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

Step 1: Get to know the dataset

# Load the dataset
df = pd.read_csv('london_weather.csv')
# Check first 5 rows
df.head(5)
# Show data types and counts
df.info()

Takeaway from the first look at the dataset:

  • Data types: the "date" column must be converted to datetime format for further analysis.
  • Missing values: Most of the features have missing data. Imputation is required before modeling.
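As a preview of the imputation step, here is a minimal sketch of mean imputation with scikit-learn's SimpleImputer; the small frame below is illustrative, using two of the dataset's column names with artificial gaps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative frame with gaps, mimicking two of the weather columns
demo = pd.DataFrame({
    "cloud_cover": [7.0, np.nan, 5.0, 6.0],
    "sunshine": [1.2, 4.5, np.nan, 3.3],
})

# Replace each missing value with its column mean
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)
print(imputed.isna().sum().sum())  # → 0
```

Fitting the imputer on the training split only (then transforming both splits) avoids leaking test-set statistics into the model.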

Step 2: Exploratory Data Analysis

It is time to perform exploratory data analysis to understand the dataset better and prepare it for modelling.

Data exploration and cleaning includes:

  • Uncovering initial patterns and characteristics
  • Identifying and handling data quality issues: missing values, outliers, inconsistencies, and errors.

Time Decomposition

# Converting 'date' column to datetime format
df['date'] = pd.to_datetime(df['date'], format="%Y%m%d")
# Adding year and month columns
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
# Dropping the original date column
df = df.drop('date', axis=1)
df.head()
# View the summary statistics of the dataset
df.describe()
# Grouping data by year and month, calculating mean of weather metrics
# Excluding 'max_temp' and 'min_temp' since they closely track 'mean_temp'
df_metrics = ['month', 'cloud_cover', 'sunshine', 'global_radiation', 'mean_temp', 'precipitation', 'pressure', 'snow_depth']
df_per_month = df.groupby(['year', 'month'], as_index=False)[df_metrics].mean()

# Visualizing average temperature by year
# (errorbar=None replaces the deprecated ci=None in seaborn >= 0.12)
sns.lineplot(x="year", y="mean_temp", data=df_per_month, errorbar=None)
plt.show()

# Visualizing average sunshine hours by month
sns.barplot(x='month', y='sunshine', data=df)
plt.show()

# Visualizing heatmap of correlations
sns.heatmap(df[df_metrics].corr(), annot=True)
plt.show()
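Beyond eyeballing the heatmap, the correlations with the target can be ranked programmatically. The sketch below uses synthetic stand-in columns so it runs on its own; against the real data, the same chain would start from `df[df_metrics].corr()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: 'sunshine' tracks 'mean_temp', 'precipitation' is noise
sunshine = rng.normal(size=300)
demo = pd.DataFrame({
    "sunshine": sunshine,
    "mean_temp": sunshine * 0.8 + rng.normal(scale=0.3, size=300),
    "precipitation": rng.normal(size=300),
})

# Rank features by absolute correlation with the target
corr = demo.corr()["mean_temp"].drop("mean_temp").abs().sort_values(ascending=False)
print(corr.index[0])  # → sunshine
```

A ranking like this is a quick sanity check on feature relevance before moving on to modeling.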