Competition - high fatality accidents

High-speed driving is a significant cause of major accidents

Context

You work for the road safety team within the department of transport and are looking into how they can reduce the number of major incidents. The safety team classes major incidents as fatal accidents involving 3+ casualties. They are trying to learn more about the characteristics of these major incidents so they can brainstorm interventions that could lower the number of deaths. They have asked for your assistance with answering a number of questions.

Research questions

What time of day and day of the week do most major incidents happen?
Are there any patterns in the time of day/ day of the week when major incidents occur?
What characteristics stand out in major incidents compared with other accidents?
On what areas would you recommend the planning team focus their brainstorming efforts to reduce major incidents?

Executive summary

Most accidents occur between 7:00 AM and 11:00 PM
Major accidents are more common on Saturdays, especially at night (7:00 PM to midnight)
Likely due to the COVID-19 lockdowns, no major accidents were observed in April, May, and November
The identified hotspots (at least five accidents with 100 m minimum distance) contain no major accidents
Speed limit shows the highest association with accident type. About 50% of the major accidents occurred on roads with a speed limit of 60 mph
The majority of high-speed roads are outside of city limits
Most major accidents involve more than two cars
Dark conditions may have an impact on the occurrence of a major accident
A single carriageway combined with a high speed limit can become a fatal accident zone
Wet/damp conditions can make roads slippery and more prone to major accidents
The presence of carriageway hazards on the roads may lead to a major accident

Introduction

No one gets into a car intending to have an accident. Yet a car accident can occur at any time, in any place, and for many reasons. Car accidents are dangerous not only because of their severe consequences but also because they happen so fast that avoiding one requires a split-second decision. Not everyone can cope with that pressure.

So the question remains: how can we reduce the chance of ending up in such a situation? There are different approaches and opinions, but I would argue that the answer starts with better knowledge of what increases the risk of a car accident. We can build that knowledge by analyzing prior accidents and identifying the factors they have in common. Once those factors are known, we can work to correct the underlying conditions and alert the public to be cautious when they encounter them.

Here, we analyze all the car accidents reported in the United Kingdom in 2020. We focus on identifying the common features of major accidents (fatal accidents with 3+ casualties) and, finally, recommend possible approaches to reduce their number.

%%capture
### install required packages and import libraries

import sys
## install required libraries
!"{sys.executable}" -m pip install shapely
!"{sys.executable}" -m pip install pyproj
!"{sys.executable}" -m pip install geopandas
!"{sys.executable}" -m pip install folium
!"{sys.executable}" -m pip install researchpy
!"{sys.executable}" -m pip install phik

### import libraries

import random
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import rcParams
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
import matplotlib.dates as md

from datetime import datetime
import seaborn as sns
from collections import Counter
import math
from sklearn.cluster import DBSCAN
from shapely.geometry import Point
from functools import partial
from shapely.ops import transform
from shapely.ops import unary_union
import pyproj
import geopandas as gpd
import folium

import scipy.stats as ss
import researchpy as rp
import itertools
import phik
from phik.report import plot_correlation_matrix

%matplotlib inline
# from matplotlib.gridspec import GridSpec

### set seed
random.seed(1169)

# ### set context and palette for seaborn
# sns.set_context('talk')
# sns.set_palette('colorblind')


# plt.rcParams["figure.figsize"] = [15,6]
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

💾 The data

The reporting department has been collecting data on every accident that is reported. They've included this along with a lookup file for 2020's accidents.

Published by the Department for Transport. https://data.gov.uk/dataset/road-accidents-safety-data Contains public sector information licensed under the Open Government Licence v3.0.

The dataset contains information on each accident reported in the U.K. in 2020. Most of its features are nominal variables. We assigned an appropriate column type to each feature and created a label column that classifies each accident as major or minor. Major accidents are fatal accidents with 3+ casualties; all other accidents are classified as minor.

## data processing

lookup = pd.read_csv(r'./data/road-safety-lookups.csv')

accidents = pd.read_csv(r'./data/accident-data.csv')
## most of these columns hold codes for categorical variables, so convert them to the object dtype
object_list =['accident_severity',
              'first_road_class',
              'road_type',
              'speed_limit',
              'junction_detail',
              'junction_control',
              'second_road_class',
              'pedestrian_crossing_human_control',
              'pedestrian_crossing_physical_facilities',
              'light_conditions','weather_conditions',
              'road_surface_conditions',
              'special_conditions_at_site',
              'carriageway_hazards',
              'urban_or_rural_area']

accidents[object_list] = accidents[object_list].astype('object')
 
## created 'date_time' column as 'datetime64[ns]' type for further ease of calculations
accidents['date_time']= pd.to_datetime(accidents['date']+' '+accidents['time'], format = '%d/%m/%Y %H:%M')

## created an accident hour time column
accidents['hour'] = accidents.loc[:,'date_time'].dt.hour

## create a 'label' column that assigns the accident type 'major' or 'minor' to each incident.
## incidents with 'accident_severity' fatal (=1) and at least 3 'number_of_casualties' are labeled 'major';
## all other incidents are labeled 'minor'.
major_or_minor=(accidents['accident_severity']==1) & (accidents['number_of_casualties']>=3)

accidents['label']= np.where(major_or_minor,'major','minor')

## remove redundant columns
drop_list =['accident_year', 'accident_reference', 'date', 'time']

accidents.drop(drop_list,axis=1, inplace=True)
accidents.head()

Handling missing values

accidents.info()

The provided data has no null values. However, some columns have a special code (-1) for missing values or out-of-range conditions. If necessary, we will carefully consider them when we conduct analysis.
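
As a minimal sketch of how the -1 codes could be audited before analysis, the helper below counts sentinel values per column. The helper name `count_sentinel` and the toy frame are made up for illustration; in the notebook the same helper would be applied to `accidents`.

```python
import pandas as pd

def count_sentinel(df, sentinel=-1):
    """Count cells equal to the sentinel in each column, keeping only columns that contain it."""
    counts = df.eq(sentinel).sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy data standing in for the accident table (values are illustrative only)
toy = pd.DataFrame({
    "speed_limit": [30, 60, -1, 30],
    "junction_control": [-1, 4, -1, 2],
    "road_type": [6, 6, 3, 6],
})

sentinel_counts = count_sentinel(toy)
print(sentinel_counts)
```

Columns with many -1 codes may need the sentinel treated as its own category or excluded, depending on the analysis.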

Analysis Plan

This project aims to identify timeline patterns and typical characteristics of major accidents. To achieve these objectives, we will conduct the following analysis steps:

Exploratory data analysis -- to determine the timeline pattern of major accidents
Density-based spatial clustering of applications with noise (DBSCAN) -- to identify accident hotspots from the geospatial coordinates of the accidents, and to check whether hotspots are related to major accidents
Pearson chi-square test
Cramér's V analysis
Theil's U (uncertainty coefficient) analysis

The last three approaches will be used to identify the features most strongly associated with major accidents. The Pearson chi-square test is applied to categorical data to evaluate how likely it is that an observed difference between groups arose by chance. It only reveals whether a significant relationship exists between two features; it cannot measure the strength of that association.
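
To make the chi-square step concrete, here is a small sketch using `scipy.stats.chi2_contingency` on a contingency table. The toy data and the two-level `light` column are assumptions for illustration, not values from the accident dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: accident label vs. a hypothetical two-level light condition
toy = pd.DataFrame({
    "label": ["major"] * 20 + ["minor"] * 180,
    "light": ["dark"] * 14 + ["day"] * 6 + ["dark"] * 40 + ["day"] * 140,
})

# Contingency table of observed counts
observed = pd.crosstab(toy["label"], toy["light"])

# Returns the test statistic, p-value, degrees of freedom, and expected counts
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4g}")
```

A small p-value only says the two features are unlikely to be independent; it says nothing about how strong the association is, which is why the next two measures are needed.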

Cramér's V is based on the chi-square statistic and provides a measure of the strength of association between two nominal variables. Its drawback is that it is symmetric: it treats the relationship as two-way, so it cannot indicate which feature drives the other.
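
A minimal sketch of Cramér's V, assuming the standard uncorrected formula V = sqrt(chi2 / (n * (min(r, c) - 1))). The function name `cramers_v` and the toy series are illustrative; bias-corrected variants also exist and are not shown here.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Uncorrected Cramér's V between two nominal variables (each needs at least two categories)."""
    table = pd.crosstab(x, y)
    # Disable Yates' continuity correction so that V reaches 1 for a perfect association
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))

a = pd.Series(["x", "x", "y", "y"])
b = pd.Series(["u", "u", "v", "v"])  # fully determined by a
c = pd.Series(["p", "q", "p", "q"])  # independent of a

v_same = cramers_v(a, b)
v_indep = cramers_v(a, c)
print(v_same, v_indep)
```

Swapping the arguments gives the same value, which illustrates the symmetry discussed above.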

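Theil's U is also a measure of nominal association. Its major advantage is that it is asymmetric: U(x|y) measures how much knowing y reduces the uncertainty about x, which generally differs from U(y|x). This lets us assess association in one direction (from a candidate feature to the accident type).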
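
The asymmetry can be seen in a short sketch that computes U(x|y) = (H(x) - H(x|y)) / H(x) from entropies. The function names and toy lists are illustrative assumptions, not the notebook's own implementation.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(x) in nats."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """Conditional entropy H(x | y) in nats."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    return -sum((c / n) * math.log(c / y_counts[yv]) for (_, yv), c in xy_counts.items())

def theils_u(x, y):
    """Uncertainty coefficient U(x | y): share of the uncertainty in x explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # x is constant, so there is nothing left to explain
    return (h_x - conditional_entropy(x, y)) / h_x

x = ["a", "a", "b", "b"]
y = ["1", "2", "3", "4"]  # knowing y pins down x, but not the reverse

u_x_given_y = theils_u(x, y)
u_y_given_x = theils_u(y, x)
print(u_x_given_y, u_y_given_x)
```

Here y fully determines x, so U(x|y) is 1, while x only halves the uncertainty about y, so U(y|x) is 0.5.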

Finally, we will use a newer approach from a paper published in 2020. The authors introduced a correlation coefficient, φK (phik), that works consistently across categorical, ordinal, and interval variables and can capture nonlinear dependency. Another useful aspect of this approach is that it reports the significance of the relationship for each pair of categories of two features. The article is available at: https://www.sciencedirect.com/science/article/abs/pii/S0167947320301341

We will compare feature importance across all the approaches and make a holistic decision about the typical characteristics of major accidents in comparison to other accidents.
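
The hotspot step in the plan above could be sketched as follows. This is a rough illustration on toy coordinates, assuming a 100 m neighbourhood radius and at least five accidents per hotspot, using scikit-learn's `DBSCAN` with the haversine metric. The exact settings used later in the notebook may differ.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, used to convert metres to radians

# Toy coordinates (latitude, longitude in degrees): six points within a few tens of
# metres of each other, plus three isolated points
base = np.array([51.5000, -0.1200])
offsets = np.array([[0, 0], [1e-4, 0], [0, 1e-4], [1e-4, 1e-4], [2e-4, 0], [0, 2e-4]])
isolated = np.array([[52.0, -1.0], [53.0, -2.0], [54.0, -3.0]])
coords_deg = np.vstack([base + offsets, isolated])

# The haversine metric expects radians; eps is the neighbourhood radius in radians
db = DBSCAN(eps=100 / EARTH_RADIUS_M, min_samples=5,
            metric="haversine", algorithm="ball_tree")
labels = db.fit_predict(np.radians(coords_deg))

print(labels)  # cluster id per point; -1 marks noise (not part of any hotspot)
```

Once each accident carries a cluster label, cross-tabulating that label against the major/minor label shows whether any hotspot contains major accidents.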

Exploratory Data Analysis (EDA)

Distribution of accident types

total_accidents= accidents.shape[0]

print('In the year 2020, there were a total of {} motor vehicle accidents in\nthe United Kingdom.'.format(total_accidents))
print('\n')

fig, ax = plt.subplots(figsize=(6,6))

# sns.set_context("paper") 

accident_type= pd.DataFrame(accidents['label'].value_counts()).reset_index()
accident_type.columns =['accident type' ,'accident count']
accident_type['percentage']=(accident_type['accident count']/accident_type['accident count'].sum())*100

colors =['grey','steelblue']
ax = sns.barplot(y="accident count",x='accident type', data=accident_type, palette=colors)

for p in ax.patches:
    x=p.get_bbox().get_points()[:,0]
    y=p.get_bbox().get_points()[1,1]
    ax.annotate('{:.4f}%'.format((p.get_height()/total_accidents)*100), (x.mean(), y), 
            ha='center', va='bottom', size=12)

plt.title('Distribution of minor and major type of accidents', fontdict={'fontsize' : 18}, loc='left', pad=10)
plt.xlabel('Accident type',fontsize=15,loc='left')
plt.ylabel('Accident count',fontsize=15, loc='top')

ax.tick_params(axis='both', which='major', labelsize=12)
ax.tick_params(axis='both', which='minor', labelsize=12)

sns.despine()
plt.show()

print('\n')
print('Among these accidents,')

for i in accident_type.index:
    print('{}. {} accidents ({:.4f}%) were classified as {} accidents'.format(i+1 ,accident_type['accident count'][i], accident_type['percentage'][i], accident_type['accident type'][i]))

## as we are going to work with major accidents only, it is better to subset the dataframe based on accident type(label).

major =  accidents[accidents['label']=='major']
minor = accidents[accidents['label']=='minor']
