Skip to content

London Public Transport in depth Analysis

1. Understanding the data

1. Understanding the Structure and Content of the Table

First, we will inspect the structure and some sample content of the table.

Spinner
DataFrameas
df
variable
SELECT *
FROM TFL.JOURNEYS
LIMIT 10;

2. Data Cleaning

Check for and handle missing values, duplicates, and outliers.

Total rows in the journeys table

Spinner
DataFrameas
df1
variable
SELECT
	COUNT(*) AS total_rows
FROM TFL.JOURNEYS;

Check for Missing Values

Spinner
DataFrameas
df2
variable
SELECT
	COUNT(*) AS total_rows,
	COUNT(month),
	COUNT(year),
	COUNT(days),
	COUNT(report_date),
	COUNT(journey_type),
	COUNT(journeys_millions),
FROM TFL.JOURNEYS;

The journeys_millions column in journeys table have 95 Null values.

Check for duplicate records(rows)

Spinner
DataFrameas
df3
variable
SELECT DISTINCT COUNT(*) DISTINCT
FROM TFL.JOURNEYS;

The above query returned 936 Distinct rows which is equals to total rows in journeys table. Hence there are no duplicates in the table.

2. Exploratory Data Analysis

1. Descriptive Statistics