As the climate changes, predicting the weather becomes ever more important for businesses. You have been asked to support a machine learning project that aims to build a pipeline for predicting the weather in London, England. Specifically, the model should predict the mean temperature in degrees Celsius (°C).
Since the weather depends on many different factors, you will want to run many experiments to determine the best approach. In this project, you will run experiments for different regression models predicting the mean temperature, using a combination of sklearn and mlflow.
You will be working with data stored in london_weather.csv, which contains the following columns:
- date - recorded date of measurement - (int)
- cloud_cover - cloud cover measurement in oktas - (float)
- sunshine - sunshine measurement in hours (hrs) - (float)
- global_radiation - irradiance measurement in Watt per square meter (W/m2) - (float)
- max_temp - maximum temperature recorded in degrees Celsius (°C) - (float)
- mean_temp - target mean temperature in degrees Celsius (°C) - (float)
- min_temp - minimum temperature recorded in degrees Celsius (°C) - (float)
- precipitation - precipitation measurement in millimeters (mm) - (float)
- pressure - pressure measurement in Pascals (Pa) - (float)
- snow_depth - snow depth measurement in centimeters (cm) - (float)
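Because date is stored as an integer, it usually helps to parse it into a real datetime before modelling. The sketch below is illustrative only: it assumes the integers follow a YYYYMMDD layout (the column list above does not state this), and the frame `sample` and the derived `year` and `month` columns are hypothetical names.

```python
import pandas as pd

# Tiny stand-in frame; the YYYYMMDD integer encoding is an assumption
sample = pd.DataFrame({"date": [19790101, 19790102, 20201231]})

# Convert the integers to strings, then parse them with an explicit format
sample["date"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

# Derive calendar features that a regressor can use directly
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month

print(sample)
```

If the real file uses a different encoding, only the `format` string should need to change.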
# Run this cell to install mlflow
!pip install mlflow
# Run this cell to import the modules you require
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
# Read in the data
weather = pd.read_csv("london_weather.csv")
# view the first few rows
print(weather.head())
Exploratory Analysis
# view the dataset columns
print("Columns in weather:", list(weather.columns))
# view the length of the dataset
num_rows = weather.shape[0]
print("Number of rows in weather:", num_rows)
# view the dataset description (describe is a method, so it must be called)
print("Description of the weather dataset:\n", weather.describe())
# view the dataset Null values
null_sum = weather.isna().sum()
print("Sum of null values per column:\n", null_sum)
Replace Missing Values
# Calculate mean values for each column
mean_values = weather.mean()
# Replace missing values (NaNs) in each column with that column's mean
weather.fillna(mean_values, inplace=True)
print(weather.isna().sum())
# view column types
print(weather.dtypes)
Explore the Target Column
print(weather.mean_temp)
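Printing the target shows its values, but a quick check of how each feature correlates with it can help you decide which features to keep. The sketch below uses a small synthetic frame rather than london_weather.csv, so the numbers it prints say nothing about the real data; `demo` and `corr_with_target` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real weather file)
rng = np.random.default_rng(0)
n = 200
max_temp = rng.normal(15, 6, n)
demo = pd.DataFrame({
    "max_temp": max_temp,
    "min_temp": max_temp - rng.uniform(4, 10, n),
    "pressure": rng.normal(101_500, 1_000, n),
})
# Build a target that depends on the two temperature columns only
demo["mean_temp"] = (demo["max_temp"] + demo["min_temp"]) / 2

# Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    demo.corr()["mean_temp"].drop("mean_temp").sort_values(ascending=False)
)
print(corr_with_target)
```

Running the same two lines on `weather` instead of `demo` would give the equivalent ranking for the real columns.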
Split the Dataset
# Split the data into features (X) and target (y)
X = weather.drop("mean_temp", axis=1)
y = weather["mean_temp"]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
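One thing to consider when splitting: if missing values are filled with means computed on the full dataset, information from the test rows leaks into training. A possible alternative, sketched here on synthetic data, is to split first and let a scikit-learn Pipeline learn the imputation and scaling statistics from the training rows only. This is an illustration under assumed data, not the project's required solution; `X_demo`, `y_demo`, and `pipeline` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with some missing feature values
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "max_temp": rng.normal(15, 6, 300),
    "min_temp": rng.normal(7, 5, 300),
})
y_demo = (X_demo["max_temp"] + X_demo["min_temp"]) / 2
X_demo.loc[X_demo.index[::10], "min_temp"] = np.nan  # blank every tenth value

# Split BEFORE imputing so test rows never influence the fill values
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # means learned on X_tr only
    ("scale", StandardScaler()),                 # scaling learned on X_tr only
    ("model", LinearRegression()),
])
pipeline.fit(X_tr, y_tr)
y_hat = pipeline.predict(X_te)
print(y_hat[:5])
```

The fitted pipeline can be passed around as a single estimator, so it can be trained, evaluated, and logged the same way as a bare model.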
Train and Log the Models
# Define models
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
# Train and evaluate models
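for model_name, model in models.items():
    # Name each run after its model so the runs are easy to tell apart
    with mlflow.start_run(run_name=model_name):
        # Train the model
        model.fit(X_train, y_train)
        # Make predictions
        y_pred = model.predict(X_test)
        # Calculate RMSE as the square root of MSE
        # (recent scikit-learn releases no longer accept squared=False)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        # Log the model, hyperparameters, and RMSE score
        mlflow.sklearn.log_model(model, f"{model_name}_model")
        mlflow.log_params(model.get_params())
        mlflow.log_metric("rmse", rmse)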
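RMSE is the square root of the mean squared error, so it is expressed in the same unit as the target (°C here). The small worked example below, on made-up numbers, checks the scikit-learn result against the formula written out by hand. Taking the root explicitly with `np.sqrt` also avoids depending on the `squared` argument, which newer scikit-learn releases no longer accept.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values purely for the arithmetic
y_true = np.array([10.0, 12.0, 14.0])
y_hat = np.array([11.0, 12.0, 12.0])

# Errors are -1, 0, 2 -> squared 1, 0, 4 -> mean 5/3
mse = mean_squared_error(y_true, y_hat)
rmse_sklearn = np.sqrt(mse)

# Same quantity written out by hand
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(rmse_sklearn, rmse_manual)
```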
# Search all MLflow runs and store the results
experiment_results = mlflow.search_runs()
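`mlflow.search_runs()` returns a pandas DataFrame in which logged metrics appear as `metrics.<name>` columns, so the runs can be ranked with ordinary pandas operations. The sketch below uses a hand-built stand-in frame so that it runs without a tracking store. The run IDs, run names, and RMSE values are placeholders, not results from this project, and the `tags.mlflow.runName` column is assumed to be present, which depends on the runs having names.

```python
import pandas as pd

# Stand-in for the DataFrame returned by mlflow.search_runs();
# every value here is a placeholder, not a real result
runs = pd.DataFrame({
    "run_id": ["run_1", "run_2", "run_3"],
    "tags.mlflow.runName": ["model_a", "model_b", "model_c"],
    "metrics.rmse": [3.0, 1.0, 2.0],
})

# Lower RMSE is better, so sort ascending and take the first row
ranked = runs.sort_values("metrics.rmse").reset_index(drop=True)
best_run = ranked.iloc[0]

print(ranked[["tags.mlflow.runName", "metrics.rmse"]])
print("Best run:", best_run["run_id"])
```

Applied to `experiment_results`, the same `sort_values("metrics.rmse")` call should order the real runs from best to worst.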