Project: Analyzing River Thames Water Levels

Analyzing River Thames Water Levels

Time series data is everywhere, from watching your stock portfolio to monitoring climate change, and even live-tracking as local cases of a virus become a global pandemic. In this project, you’ll work with a time series that tracks the tide levels of the Thames River. You’ll first load the data and inspect it data visually, and then perform calculations on the dataset to generate some summary statistics. You’ll end by decomposing the time series into its component attributes and analyzing them.

The original dataset is available from the British Oceanographic Data Center here and you can read all about this fascinating archival story in this article from the Nature journal.

Here's a map of the locations of the tidal gauges along the River Thames in London.

The dataset comes with a file called Data_description.pdf. The dataset consists of 13 .txt files, containing comma separated data. We'll begin by analyzing one of them, the London Bridge gauge, and preparing it for analysis. The same code can be used to analyze data from other files (i.e. other gauges along the river) later.

Variable Name	Description	Format
Date and time	Date and time of measurement to GMT. Note the tide gauge is accurate to one minute.	dd/mm/yyyy hh:mm:ss
Water level	High or low water level measured by tide gauge. Tide gauges are accurate to 1 centimetre.	metres (Admiralty Chart Datum (CD), Ordnance Datum Newlyn (ODN or Trinity High Water (THW))
Flag	High water flag = 1, low water flag = 0	Categorical (0 or 1)

import pandas as pd


def IQR(column):
    q25, q75 = column.quantile([0.25, 0.75])
    return q75 - q25

Task 1: Loading and filtering data from London Bridge

Begin by using pandas to load the London Bridge data as lb; you can find the file path by navigating through the data folder in your workspace. Since the column "flag, HW=1 or LW=0" has a comma, pandas will create an extra, empty column when reading this as a csv. Create a new DataFrame df, which takes only the first three columns of lb, and rename them as datetime, water_level, and is_high_tide, respectively.

filepath = "data/10-11_London_Bridge.txt"
lb = pd.read_csv(filepath)

london_bridge = pd.DataFrame(columns=["datetime", "water_level", "is_high_tide"])

london_bridge["datetime"] = lb.iloc[:, 0:1]
london_bridge["water_level"] = lb.iloc[:, 1:2]
london_bridge["is_high_tide"] = lb.iloc[:, 2:3]
london_bridge.head()

Task 2: Preparing selected columns for analysis

By calling .dtypes on df, you'll find that the datetime and water_level columns are currently of the object data type. Convert the datetime column to the datetime format and water_level to the float format. Add two new columns to df, month and year, which you'll need to extract from datetime.

london_bridge["datetime"] = pd.to_datetime(london_bridge["datetime"])
london_bridge["water_level"] = london_bridge["water_level"].astype(float)

london_bridge["year"] = london_bridge["datetime"].dt.year
london_bridge["month"] = london_bridge["datetime"].dt.month


london_bridge.head()

Task 3: Get a sense of typical tide levels for London Bridge

As shown in the Data Description table, the is_high_tide column has two values, 0 and 1, indicating low and high tide, respectively. Create a variable called tide_high, which returns the column water_level from df only where tide is high, and another variable called tide_low, which returns the column water_level from df only where tide is low. Using tide_high and tide_low, create two dictionaries, high_statistics and low_statistics, containing the mean, median, and IQR of each variable respectively, using the .agg() function from pandas and the IQR function provided.

tide_high = london_bridge[london_bridge["is_high_tide"] == 1]["water_level"]
tide_low = london_bridge[london_bridge["is_high_tide"] == 0]["water_level"]


(tide_low.head(), tide_high.head())

high_statistics = tide_high.agg(["mean", "median", IQR])
low_statistics = tide_low.agg(["mean", "median", IQR])

high_statistics, low_statistics

Task 4: Count the number of days London Bridge had record-high tide levels over the last 84 years.

Return a pandas Series that counts the number of days of high tide data you have per year in df and store it as all_high_days. Return a pandas Series that counts the number of days of data where water level was above the 75th percentile in df. Return a variable called high_ratio, which computes the ratio of high_days to all_high_days.

count_high_tide_by_year = london_bridge[london_bridge["is_high_tide"] == 1].groupby("year").count()
all_high_days = pd.Series(count_high_tide_by_year["is_high_tide"]).rename("high_days")


percentile_75 = london_bridge["water_level"].quantile(0.75)
high_days = london_bridge[london_bridge["water_level"] > percentile_75].groupby("year").count()
high_days = pd.Series(high_days["water_level"]).rename("high_days_above_75_percentile")

high_ratio = high_days / all_high_days

all_high_days.head(), high_days.head(), high_ratio

Task 5: Count the number of days London Bridge had record-low tide levels over the last 84 years.

Return a pandas Series that counts the number of days of low tide data you have per year in df and store it as all_low_days. Return a pandas Series that counts the number of days of data where the water level was below the 25th percentile in df. Return a variable called low_ratio which computes the ratio of low_days to all_low_days.

count_low_tide_by_year = london_bridge[london_bridge["is_high_tide"] == 0].groupby("year").count()
all_low_days = pd.Series(count_low_tide_by_year["is_high_tide"]).rename("low_days")


percentile_25 = london_bridge["water_level"].quantile(0.25)
low_days = london_bridge[london_bridge["water_level"] < percentile_25].groupby("year").count()
low_days = pd.Series(low_days["water_level"]).rename("low_days_below_25_percentile")

low_ratio = low_days / all_low_days

all_low_days.head(), low_days.head(), low_ratio

Task 6: Print your solution

Return your solution as a dictionary with the keys high_statistics, low_statistics, high_ratio, and low_ratio. Use the variables you have already generated as values for each of these keys.

solution = {"high_statistics": high_statistics, "low_statistics": low_statistics, "high_ratio": high_ratio, "low_ratio": low_ratio}
print(solution)

Task 7: Some Insighful Plots

‌
‌
‌