Skip to content

Analyzing River Thames Water Levels

Time series data is everywhere, from watching your stock portfolio to monitoring climate change, and even live-tracking as local cases of a virus become a global pandemic. In this project, you’ll work with a time series that tracks the tide levels of the Thames River. You’ll first load the data and inspect it data visually, and then perform calculations on the dataset to generate some summary statistics. You’ll end by reducing the time series to its component attributes and analyzing them.

The original dataset is available from the British Oceanographic Data Center.

Here's a map of the locations of the tidal meters along the River Thames in London.

The provided datasets are in the data folder in this workspace. For this project, you will work with one of these files, 10-11_London_Bridge.txt, which contains comma separated values for water levels in the Thames River at the London Bridge. After you've finished the project, you can use your same code to analyze data from the other files (at other spots in the UK where tidal data is collected) if you'd like.

The TXT file contains data for three variables, described in the table below.

Variable NameDescriptionFormat
Date and timeDate and time of measurement to GMT. Note the tide gauge is accurate to one minute.dd/mm/yyyy hh:mm:ss
Water levelHigh or low water level measured by tide meter. Tide gauges are accurate to 1 centimetre.metres (Admiralty Chart Datum (CD), Ordnance Datum Newlyn (ODN or Trinity High Water (THW))
FlagHigh water flag = 1, low water flag = 0Categorical (0 or 1)
# Read and survey data file and make necessary modifications
import pandas as pd               



# read in data file
water_levels = pd.read_csv('data/10-11_London_Bridge.txt')
print(water_levels.head())
columns = water_levels.columns

#replace column labels
#columns2 = columns.str.strip() #we could strip and keep original labels but..
#better replace with shorter labels. 
#change the last column label to redundant as we are going to get rid of it
columns2=['Date and time','Water_level','Flag','redundant']
water_levels.columns = columns2

#drop the reundant column
water_levels = water_levels.drop('redundant', axis=1)

#change date time to datetime format
water_levels['Datetime'] = pd.to_datetime(water_levels['Date and time'])

#change water level to float
water_levels['Water_level'] = water_levels['Water_level'].astype('float')

#drop Date and time columns
water_levels = water_levels.drop('Date and time', axis=1)

#add year and month columns to df
water_levels['year'] = water_levels['Datetime'].dt.year
water_levels['month'] = water_levels['Datetime'].dt.month

#print info again and top few rows of df
print(water_levels.info())
print(water_levels.head())

#separate df into two dfs for high tide and low tide
high_tide = water_levels[water_levels['Flag'] > 0]
low_tide = water_levels[water_levels['Flag'] < 1]
print('High tide*******')
print(high_tide.head())
print('Low tide *******')
print(low_tide.head())

Find the mean, median, and interquartile range for high- and low-tide data and save them as two separate pandas Series.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#define IQR as a function 
def IQR(column): 
    """ Calculates the interquartile range (IQR) for a given DataFrame column using the quantile method """
    q25, q75 = column.quantile([0.25, 0.75])
    return q75 - q25


#group by 'Flag' and aggregate using mean, median, and IQR
flag_stats = water_levels.groupby('Flag')['Water_level'].agg([np.mean, np.median, IQR]) # IQR defined above
print(flag_stats)

#extract low tide stats
low_tide_stats = flag_stats.loc[0, :]
print(f'\nlow tide stats********\n{low_tide_stats}')

#extract high tide stats
high_tide_stats = flag_stats.loc[1, :]
print(f'\nhigh tide stats********\n{high_tide_stats}')

#calculate the quantiles for high tide levels
high_tide_level=high_tide['Water_level']
high_tide_90= high_tide_level.quantile(0.9)
print(high_tide_90)
low_tide_level=low_tide['Water_level']
low_tide_10=low_tide_level.quantile(0.1)
print(low_tide_10)

#filter years with water level > high_tide_90th_quantile
high_tide_90_filter=high_tide[high_tide['Water_level']>high_tide_90]

#count all the year with high tide
years_high_days=high_tide.groupby('year')['Water_level'].count()

#count years with high tide above 90%
year_days_90=high_tide_90_filter.groupby('year')['Flag'].count()

#calculate the ratio of days above 90 to all high tide days
ratio_high_90=(year_days_90/years_high_days)
print(f'ratio_high_90:*******\n {ratio_high_90}')

#plot the ratio
plt.plot(ratio_high_90)
plt.xlabel('Years')
plt.ylabel('Ratio')
plt.title('Change in ratio of very high to high tide over the years')
plt.show()

#repeat the same procedure for low tide

#count years with low tide
years_all_low=low_tide.groupby('year')['Water_level'].count()
low_tide_10_filter=low_tide[low_tide['Water_level']<low_tide_10]

#count ears with tide lower than 90%
year_days_10=low_tide_10_filter.groupby('year')['Flag'].count()

#calculate the ratio
ratio_low_10=(year_days_10/years_all_low)
print(f'ratio_low_10: ********\n {ratio_low_10}')

#plot the ratio
plt.plot(ratio_low_10)
plt.xlabel('Years')
plt.ylabel('Ratio')
plt.title('Change in ratio of very low to low tide over the years')
plt.show()


#recast results as a dictionary

#convert to df and set water_level label for high and low tide ratios
ratio_high_90=ratio_high_90.reset_index()
ratio_high_90=ratio_high_90.rename(columns={0:'Water_level'})

ratio_low_10=ratio_low_10.reset_index()
ratio_low_10=ratio_low_10.rename(columns={0:'Water_level'})

#set up the dictionary and plot
solution={'high_statistics':high_tide_stats,'low_statistics':low_tide_stats,\
         'very_high_ratio':ratio_high_90,'very_low_ratio':ratio_low_10}
print(solution)