GLMs and Machine Learning for General Insurance Pricing in Python
This project applies machine learning techniques to model data in the context of a non-life insurance company.
The data is the 'freMTPL2freq' dataset of motor contracts from a French insurer, available on kaggle.com. 'freMTPL2freq' contains over 650,000 contract records, each with several contract features, including the exposure and the number of claims declared by the insured, which are used to model claim frequency in this project.
Table of Contents
1. Data Overview
- 1.1 Setting the index for data
- 1.2 Cleaning and preparing data
- 1.3 Data features exploration
- 1.4 Main variable exploration
2. Model Building
- 2.1 Setting up and exploring train and test sets
- 2.2 The Dummy Estimator
- 2.3 The Dummy Estimator with scikit-learn
- 2.4 The automated Dummy Estimator with scikit-learn
- 2.5 The Generalised Linear Model (GLM)
- 2.6 The Random Forest Model and K-fold validation
3. Visualisation of the best model - Random Forest
1. Data Overview
#Data Overview
import numpy as np
import pandas as pd
import dill
df = pd.read_csv('freMTPL2freq.csv')
print('Number of rows:', len(df),
      '\nFirst few lines look like this: \n\n', df.head())
print('\nThe columns of our dataset are: \n', df.columns)
#Check for missing value:
df.isna().sum()
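The missing-value check above can be illustrated on a small toy frame (the column names mirror freMTPL2freq, but the values are made up for this sketch):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the structure of freMTPL2freq (illustrative values only)
toy = pd.DataFrame({
    'IDpol': [1, 2, 3],
    'ClaimNb': [0.0, 1.0, np.nan],
    'Exposure': [0.5, 1.0, 0.25],
})

# isna() flags missing cells; sum() counts them per column
missing = toy.isna().sum()
print(missing)
```

A column with a non-zero count would need imputation or removal before modelling.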
1.1 Setting the index for data
#Set the Policy ID to be the index of data frame
##Make sure that IDpol has no duplicate values
df['IDpol'].is_unique
## Convert IDpol column to integer type
df['IDpol'] = df['IDpol'].astype(int)
## Set IDpol column as index column
df.set_index('IDpol', inplace=True)
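The same index-setting steps can be sketched on a toy frame, where IDpol starts out as a float column, as in the raw CSV:

```python
import pandas as pd

# Toy frame: IDpol stored as float, as in the raw CSV
toy = pd.DataFrame({'IDpol': [3.0, 1.0, 2.0], 'ClaimNb': [0, 1, 0]})

# Only a unique column is safe to promote to the index
assert toy['IDpol'].is_unique

toy['IDpol'] = toy['IDpol'].astype(int)  # float -> int
toy.set_index('IDpol', inplace=True)

# Rows can now be looked up directly by policy ID
print(toy.loc[1, 'ClaimNb'])
```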
1.2 Cleaning and preparing data
#Cleaning and preparing data
##Remove single quotes in string-type columns
for column in df.columns[df.dtypes.values == object]:  # all columns with dtype object
    df[column] = df[column].str.strip("'")  # remove surrounding single quotes
##Check data
df.dtypes
print(df.head())
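The quote-stripping loop can be demonstrated on a toy frame with invented quoted values:

```python
import pandas as pd

# Toy frame with quoted string values, as in the raw CSV
toy = pd.DataFrame({'Area': ["'A'", "'B'"], 'VehPower': [5, 6]})

# Strip the surrounding quotes from every object-typed column;
# numeric columns are left untouched
for column in toy.columns[toy.dtypes.values == object]:
    toy[column] = toy[column].str.strip("'")

print(toy)
```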
1.3 Data features exploration
#Features exploration
df.describe()
df.hist('VehPower')
df.hist('VehAge')
df.hist('DrivAge')
df.hist('BonusMalus')
df.hist('Density')
1.4 Main variable exploration
#Main variable exploration
#Prepare claim frequency data for modelling
##Calculate the claim frequency given ClaimNb (number of claims) and Exposure (the fraction of the year the policy was in force)
df['Frequency'] = df['ClaimNb']/df['Exposure']
df.hist('Frequency')
#Retrieve the 20 rows with the highest frequency for exploration
df.sort_values(by=['Frequency'], ascending=False).head(20)
#-> Rows with high frequency have low exposure -> minor impact on the model
#Explore Frequency and Exposure columns
df['Frequency'].describe()
#Explore Frequency with value less than 20
df[(df['Frequency'] < 20) & (df['Frequency'] > 0)].hist('Frequency')
2. Model Building
2.1 Setting up and exploring train and test sets
#Setting up train and test sets:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2, random_state=123)
#Explore train and test sets:
print('Number of lines in training set:', len(df_train), '\n')
print('First lines of training set: \n\n', df_train.head())
print('\nNumber of lines in testing set:', len(df_test), '\n')
print('First lines of testing set: \n\n', df_test.head())
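The 80/20 split can be verified on a toy frame of ten rows; with test_size=0.2, two rows are held out and the two sets share no rows (assuming scikit-learn is installed, as the notebook already uses it):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame of 10 rows; test_size=0.2 holds out 2 of them
toy = pd.DataFrame({'x': range(10)})
train, test = train_test_split(toy, test_size=0.2, random_state=123)

print(len(train), len(test))
```

Fixing random_state makes the split reproducible, so the same rows land in the test set on every run.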
2.2 The Dummy Estimator