Predicting Survival Rate on the Titanic

Data source: Kaggle

Abstract

We examined survival on the Titanic's maiden voyage using a pair of Kaggle datasets. The training set consisted of 891 observations (individual passengers) of 12 variables, while the test set consisted of 418 observations of 11 variables. The variable missing from the test set was Survived, a binary indicator (0 or 1) of whether the passenger survived the disaster.

Our goal was to visualize relationships among the variables to show which factors played the biggest roles in passenger survival. In addition, we fit multiple classification models on the training set and used cross-validation to determine which model predicted survival most accurately. We then applied the best-performing model to the test dataset to produce a final prediction of whether each passenger in the test data survived.

Loading Datasets and Libraries

# Load libraries
library(readr)
library(dplyr)
library(ggplot2)
library(broom)
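# Install naniar only if it is not already available
if (!requireNamespace("naniar", quietly = TRUE)) install.packages("naniar")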
library(naniar)
library(rsample)
library(purrr)
library(forcats)
library(tidyr)
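# Install simputation only if it is not already available
if (!requireNamespace("simputation", quietly = TRUE)) install.packages("simputation")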
library(simputation)
library(tidymodels)
library(xgboost)

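# Load datasets (blank strings are read as NA so that empty Cabin and Embarked entries count as missing)

train_titanic <- read.csv("titanic_train.csv", na.strings = c("", "NA"))

test_titanic <- read.csv("titanic_test.csv", na.strings = c("", "NA"))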

Data Preprocessing

# Check structure and missing values in train and test sets
str(train_titanic)
miss_var_summary(train_titanic)

str(test_titanic)
miss_var_summary(test_titanic)
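
For readers without naniar installed, the kind of per-variable summary that miss_var_summary() reports can be approximated in base R. The sketch below is only an illustration: the data frame toy and its values are made up, and the output columns (variable, n_miss, pct_miss) are chosen to resemble the naniar summary rather than to reproduce it exactly.

```r
# Toy data frame standing in for the Titanic data (values are made up)
toy <- data.frame(
  Age   = c(22, NA, 35, NA),
  Fare  = c(7.25, 71.28, 8.05, 53.10),
  Cabin = c(NA, "C85", NA, NA),
  stringsAsFactors = FALSE
)

# Count and percentage of missing values per column
n_miss   <- colSums(is.na(toy))
pct_miss <- 100 * n_miss / nrow(toy)

miss_summary <- data.frame(
  variable = names(n_miss),
  n_miss   = as.integer(n_miss),
  pct_miss = as.numeric(pct_miss),
  stringsAsFactors = FALSE
)

# Sort so the most incomplete columns come first
miss_summary <- miss_summary[order(-miss_summary$n_miss), ]
print(miss_summary)
```

Sorting by the number of missing values makes it easy to see at a glance which variables need attention before modeling.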

The variables were the same across the training and test datasets, except that Survived was absent from the test data. The 12 variables in the training data were:

  • PassengerId: Numeric variable with a unique ID for each observation
  • Survived: Numeric variable with either 0 (deceased) or 1 (survived)
  • Pclass: Numeric variable indicating a passenger's class on the ship (1, 2 or 3)
  • Name: Character variable indicating a passenger's name (including any titles or prefixes)
  • Sex: Character variable with either male or female as the values
  • Age: Numeric variable with a passenger's age (including decimals)
  • SibSp: Numeric variable indicating the total number of siblings + spouse(s) a passenger boarded the Titanic with
  • Parch: Numeric variable indicating the total number of parents + children a passenger boarded the Titanic with
  • Ticket: Character variable with the passenger's ticket ID
  • Fare: Numeric variable indicating how much the passenger paid for the ticket
  • Cabin: Character variable indicating the passenger's cabin ID
  • Embarked: Character variable indicating which of the 3 sites (S, C or Q) the passenger embarked at. Those abbreviations are short for Southampton, Cherbourg and Queenstown respectively.

Our first data cleaning step was to convert the character variables Sex and Embarked and the numeric variable Survived into factor variables. We also relabeled the Embarked sites with the full location names.

# Convert Sex, Embarked and Survived to factors in the training data
train_titanic <- train_titanic %>%
  mutate(Sex = factor(Sex, levels = c("male", "female")),
         # Explicit levels keep the labels aligned with C, Q, S; any other value (e.g. a blank) becomes NA
         Embarked = factor(Embarked, levels = c("C", "Q", "S"),
                           labels = c("Cherbourg", "Queenstown", "Southampton")),
         Survived = factor(Survived, levels = c(0, 1)))
str(train_titanic)

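# Apply the same conversion to the test data (which has no Survived column)
test_titanic <- test_titanic %>%
  mutate(Sex = factor(Sex, levels = c("male", "female")),
         Embarked = factor(Embarked, levels = c("C", "Q", "S"),
                           labels = c("Cherbourg", "Queenstown", "Southampton")))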
str(test_titanic)
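
As a small illustration of how factor() pairs codes with labels, the sketch below uses a made-up vector, embarked_raw, that includes one blank entry. When levels are given explicitly, labels are matched to them by position, and any value not listed in levels is turned into NA.

```r
# Hypothetical port codes, including one blank entry
embarked_raw <- c("S", "C", "", "Q", "S")

# Labels are matched to levels by position: C -> Cherbourg, Q -> Queenstown, S -> Southampton
embarked <- factor(embarked_raw,
                   levels = c("C", "Q", "S"),
                   labels = c("Cherbourg", "Queenstown", "Southampton"))

print(embarked)

# Tabulate the result, showing NA as its own category if present
print(table(embarked, useNA = "ifany"))
```

Without an explicit levels argument, factor() derives the levels from the sorted unique values, so a stray blank string would add a fourth level and no longer line up with three labels.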

Identifying and Imputing Missing Values

The vast majority of the missing values were in the Cabin and Age variables. We used the training dataset to summarize the relationship of the missingness in these variables with other variables of interest.
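
The idea behind this comparison can be sketched in base R: create an indicator column that records whether a value is missing, then summarize other variables within each group of that indicator. The data frame toy and its values below are made up, and the hand-built Cabin_NA column only imitates the shadow columns that naniar's bind_shadow() adds.

```r
# Toy data frame standing in for the Titanic data (values are made up)
toy <- data.frame(
  Age   = c(22, NA, 35, NA, 54),
  Fare  = c(7.25, 71.28, 8.05, 53.10, 51.86),
  Cabin = c(NA, "C85", NA, "C123", "E46"),
  stringsAsFactors = FALSE
)

# Build a missingness indicator by hand: "NA" if Cabin is missing, "!NA" otherwise
toy$Cabin_NA <- factor(ifelse(is.na(toy$Cabin), "NA", "!NA"),
                       levels = c("!NA", "NA"))

# Mean fare for passengers with and without a recorded cabin
mean_fare_by_cabin <- aggregate(Fare ~ Cabin_NA, data = toy, FUN = mean)
print(mean_fare_by_cabin)
```

A large gap between the group means suggests that missingness is related to the summarized variable rather than occurring completely at random.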

Exploring missingness in the Cabin variable

# Compare mean age, family counts and fare between passengers with and without a recorded Cabin
train_titanic %>%
  bind_shadow() %>% 
  group_by(Cabin_NA) %>%
  summarize(mean_age = mean(Age, na.rm = TRUE),
            mean_siblings_spouses = mean(SibSp),
            mean_parents_children = mean(Parch),
            mean_fare = mean(Fare))

# Visualize the proportion of missing Cabin values within each passenger class
train_titanic %>%
  bind_shadow() %>% 
  ggplot(aes(Pclass, fill = Cabin_NA)) +
  geom_bar(position = "fill")

# Visualize the proportion of missing Cabin values within each sex
train_titanic %>%
  bind_shadow() %>% 
  ggplot(aes(Sex, fill = Cabin_NA)) +
  geom_bar(position = "fill")

# Visualize the proportion of missing Cabin values by port of embarkation (ignoring 2 NA values)
train_titanic %>%
  bind_shadow() %>%
  subset(!is.na(Embarked)) %>%
  ggplot(aes(Embarked, fill = Cabin_NA)) +
  geom_bar(position = "fill")

Exploring missingness in the Age variable

# Compare mean family counts and fare between passengers with and without a recorded Age
train_titanic %>%
  bind_shadow() %>% 
  group_by(Age_NA) %>%
  summarize(mean_siblings_spouses = mean(SibSp),
            mean_parents_children = mean(Parch),
            mean_fare = mean(Fare))

# Visualize the proportion of missing Age values within each passenger class
train_titanic %>%
  bind_shadow() %>% 
  ggplot(aes(Pclass, fill = Age_NA)) +
  geom_bar(position = "fill")

# Visualize the proportion of missing Age values within each sex
train_titanic %>%
  bind_shadow() %>% 
  ggplot(aes(Sex, fill = Age_NA)) +
  geom_bar(position = "fill")

# Visualize the proportion of missing Age values by port of embarkation (ignoring 2 NA values)
train_titanic %>%
  bind_shadow() %>%
  subset(!is.na(Embarked)) %>%
  ggplot(aes(Embarked, fill = Age_NA)) +
  geom_bar(position = "fill")
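
Because geom_bar(position = "fill") rescales each bar to sum to 1, the plots above display within-group proportions rather than counts. The same proportions can be tabulated directly, which can be handy when exact numbers are needed. The sketch below uses a made-up data frame, toy, and a hand-built indicator column named Age_missing (both hypothetical) rather than the real data.

```r
# Toy data frame standing in for the Titanic data (values are made up)
toy <- data.frame(
  Pclass = c(1, 1, 2, 3, 3, 3),
  Age    = c(38, NA, 27, NA, NA, 19)
)

# Flag whether Age is missing for each passenger
toy$Age_missing <- ifelse(is.na(toy$Age), "missing", "observed")

# Cross-tabulate class against missingness, then convert counts to row proportions
counts <- table(Pclass = toy$Pclass, Age_missing = toy$Age_missing)
props  <- prop.table(counts, margin = 1)

print(round(props, 2))
```

Each row of props corresponds to one bar in a filled bar chart: the entries are the heights of the stacked segments and sum to 1.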