London, or as the Romans called it "Londonium"! Home to over 8.5 million residents who speak over 300 languages. While the City of London is a little over one square mile (hence its nickname "The Square Mile"), Greater London has grown to encompass 32 boroughs spanning a total area of 606 square miles!
Given the city's roads were originally designed for horse and cart, this area and population growth has required the development of an efficient public transport system! Since the year 2000, this has been through the local government body called Transport for London, or TfL, which is managed by the London Mayor's office. Their remit covers the London Underground, Overground, Docklands Light Railway (DLR), buses, trams, river services (clipper and Emirates Airline cable car), roads, and even taxis.
The Mayor of London's office make their data available to the public here. In this project, you will work with a slightly modified version of a dataset containing information about public transport journey volume by transport type.
The data has been loaded into a Snowflake database called TFL with a single table called JOURNEYS, including the following data:
TFL.JOURNEYS
| Column | Definition | Data type |
|---|---|---|
MONTH | Month in number format, e.g., 1 equals January | INTEGER |
YEAR | Year | INTEGER |
DAYS | Number of days in the given month | INTEGER |
REPORT_DATE | Date that the data was reported | DATE |
JOURNEY_TYPE | Method of transport used | VARCHAR |
JOURNEYS_MILLIONS | Millions of journeys, measured in decimals | FLOAT |
Note that in Snowflake all databases, tables, and columns are upper case by default.
You will execute SQL queries to answer three questions, as listed in the instructions.
Task: What are the most popular transport types, measured by the total number of journeys?
Similarly to the Carbon Emissions dataset, I was asked for the sum of the most popular transportation types in London. Since the question was similar I chose a bar chart for the same reasons.
SELECT JOURNEY_TYPE, SUM(JOURNEYS_MILLIONS) AS TOTAL_JOURNEYS_MILLIONS
FROM TFL.JOURNEYS
GROUP BY JOURNEY_TYPE
ORDER BY TOTAL_JOURNEYS_MILLIONS DESC;
I have left the last two queries as tables to showcase the data in a different form. Each uses a WHERE clause to further refine the data on London's transportation over differing dates.
Task: Which five months and years were the most popular for the Emirates Airline?
- For this task, the WHERE clause is used to exclude any data that is labelled as NULL so it does not affect the aggregation done by SUM.
- HAVING narrows down the queries output to the appropriate journey type. In this case, all data was appropriately stored so no cleaning was necessary.
- Finally, the combination of ORDER BY and DESC allows for the specifics of month and year mentioned in the task to be output.
SELECT MONTH, YEAR, ROUND(JOURNEYS_MILLIONS, 2) AS ROUNDED_JOURNEYS_MILLIONS
FROM TFL.JOURNEYS
WHERE JOURNEYS_MILLIONS IS NOT NULL
HAVING JOURNEY_TYPE = 'Emirates Airline'
ORDER BY ROUNDED_JOURNEYS_MILLIONS DESC
LIMIT 5;Task: Find the five years with the lowest volume of Underground & DLR journeys
- The WHERE clause in this task completes the same function as the HAVING clause in the previous task, again data cleanup was not necessary due to the nature of what was input.
- When using an aggregate function a GROUP BY clasue is necessary.
- Finally ORDER BY and ASC were used in the same way as previously, but this time ASC was used to gain insight into the lowest frequency of travel on the Underground and DLR.
SELECT YEAR, JOURNEY_TYPE, SUM(JOURNEYS_MILLIONS) AS TOTAL_JOURNEYS_MILLIONS
FROM TFL.JOURNEYS
WHERE JOURNEY_TYPE = 'Underground & DLR'
GROUP BY YEAR, JOURNEY_TYPE
ORDER BY TOTAL_JOURNEYS_MILLIONS ASC
LIMIT 5;