Introduction

I found this dataset on Kaggle. It documents weather observations over a 10-year span for cities across Australia. The goal is to build models that predict whether it will rain tomorrow. The target variable, RainTomorrow, takes the value No or Yes, where Yes means 1mm or more of rain.

Since this dataset covers multiple cities spread across an entire continent, I will focus on one city so the predictions reflect localized weather. We don't want observations from other weather regions affecting our predictions. I have chosen Sydney as the city to make my predictions on.

Source - https://www.kaggle.com/jsphyg/weather-dataset-rattle-package

Import Modules and Data

# Import modules
import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.model_selection import GridSearchCV

%matplotlib inline

# Import the weather data CSV
weather = pd.read_csv('weatherAUS.csv', sep=',', engine='python')
weather.shape

The Dataset

# Move Target Variable to front of dataframe
targetName = 'RainTomorrow'
targetSeries = weather[targetName]
del weather[targetName]
weather.insert(0, targetName, targetSeries)
weather.head()
weather.tail()
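
As a side note, the same column move can be written more compactly with `DataFrame.pop`. This is only a sketch on a made-up toy frame (the `toy` name and its values are assumptions for illustration, not rows from the real dataset):

```python
import pandas as pd

# Toy frame standing in for the weather data (values are made up)
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4],
    "Rainfall": [0.6, 0.0],
    "RainTomorrow": ["No", "Yes"],
})

# pop() removes the column and returns it as a Series;
# insert() then places that Series at position 0
target_name = "RainTomorrow"
toy.insert(0, target_name, toy.pop(target_name))

first_column = toy.columns[0]
print(list(toy.columns))
```

Both versions modify the dataframe in place, so either one should only be run once per session.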

Since weather is best predicted locally, I want to focus on one Australian city, Sydney. I will now filter the dataset down to the Sydney observations.

# Create a dataframe where Location is Sydney
weather_syd = weather.query('Location == "Sydney"')
weather_syd.head()
weather_syd.shape
# Drop the Date and Location fields; they are not needed.
weather_syd = weather_syd.drop(['Date', 'Location'], axis=1)
weather_syd.dtypes
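
To make the filter-then-drop step easier to see, here is a minimal sketch on a small made-up frame. The `toy` data is hypothetical, and the boolean mask is just an alternative spelling of the `query` call above, not a change to the approach:

```python
import pandas as pd

# Hypothetical rows that mimic the shape of the real data
toy = pd.DataFrame({
    "Date": ["2008-12-01", "2008-12-01", "2008-12-02"],
    "Location": ["Sydney", "Albury", "Sydney"],
    "MaxTemp": [22.9, 25.1, 25.7],
    "RainTomorrow": ["No", "No", "Yes"],
})

# Keep only the Sydney rows; .copy() gives an independent frame
toy_syd = toy.loc[toy["Location"] == "Sydney"].copy()

# Drop the columns that are constant or not used as features
toy_syd = toy_syd.drop(columns=["Date", "Location"])

n_rows, n_cols = toy_syd.shape
print(n_rows, n_cols)
```

Taking a copy after filtering is a defensive habit: later column assignments then act on the Sydney frame alone.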

Breakdown of the attributes, from the source:

  • MinTemp - Minimum temperature in the 24 hours to 9am in degrees Celsius
  • MaxTemp - Maximum temperature in the 24 hours from 9am in degrees Celsius
  • Rainfall - Precipitation (rainfall) in the 24 hours to 9am in millimeters
  • Evaporation - "Class A" pan evaporation in the 24 hours to 9am in millimeters
  • Sunshine - Bright sunshine in the 24 hours to midnight in hours
  • WindGustDir - Direction of strongest gust in the 24 hours to midnight in compass points
  • WindGustSpeed - Speed of strongest wind gust in the 24 hours to midnight in kilometers per hour
  • WindDir9am - Wind direction averaged over 10 minutes prior to 9 am in compass points
  • WindDir3pm - Wind direction averaged over 10 minutes prior to 3 pm in compass points
  • WindSpeed9am - Wind speed averaged over 10 minutes prior to 9 am in kilometers per hour
  • WindSpeed3pm - Wind speed averaged over 10 minutes prior to 3 pm in kilometers per hour
  • Humidity9am - Relative humidity at 9 am in percent
  • Humidity3pm - Relative humidity at 3 pm in percent
  • Pressure9am - Atmospheric pressure reduced to mean sea level at 9 am in hectopascals
  • Pressure3pm - Atmospheric pressure reduced to mean sea level at 3 pm in hectopascals
  • Cloud9am - Fraction of sky obscured by cloud at 9 am in eighths
  • Cloud3pm - Fraction of sky obscured by cloud at 3 pm in eighths
  • Temp9am - Temperature at 9 am in degrees Celsius
  • Temp3pm - Temperature at 3 pm in degrees Celsius
  • RainToday - Yes/No, whether 1mm or more of rain fell today

TARGET VARIABLE

  • RainTomorrow - Yes/No, whether 1mm or more of rain falls tomorrow

Data was compiled and sourced from the Australian Government Bureau of Meteorology.
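
Because the target is stored as Yes/No text, most models will need it as numbers at some point. The sketch below shows one possible encoding on made-up labels. The 0/1 mapping and the variable names are assumptions for illustration, not a step taken from this notebook:

```python
import pandas as pd

# Made-up labels in the same Yes/No format as RainTomorrow
labels = pd.Series(["No", "Yes", "No", "No"], name="RainTomorrow")

# Assumed encoding: No -> 0, Yes -> 1
encoded = labels.map({"No": 0, "Yes": 1})

# Share of each class, useful for spotting class imbalance
class_share = encoded.value_counts(normalize=True)
print(class_share)
```

Checking the class shares early helps with reading accuracy scores later, since rainy days may be the minority class.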

#Check for Null Values
#weather_syd.isna().any() - omitted for space
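
Since the null check output is omitted, here is a small sketch of what a more compact check could look like. It uses a made-up frame (`toy` and its values are hypothetical) and reports only the columns that actually contain missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
toy = pd.DataFrame({
    "Evaporation": [2.4, np.nan, 3.0, np.nan],
    "Sunshine": [8.3, 10.0, np.nan, 4.4],
    "Humidity3pm": [52, 44, 61, 70],
})

# Count of missing values per column
null_counts = toy.isna().sum()

# Fraction of rows missing per column
null_share = toy.isna().mean()

# Only the columns that have at least one missing value
columns_with_nulls = null_counts[null_counts > 0].index.tolist()
print(null_counts[null_counts > 0])
```

Compared with `isna().any()`, the counts show how much data is missing in each column, which matters when choosing between dropping rows and imputing.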