Titanic Challenge

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
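train = pd.read_csv("train.csv")
# The test set is used below, so load it too ("test.csv" is assumed to sit next to train.csv)
test = pd.read_csv("test.csv")
train.head()
test.head()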

As you can see, the test set has no target column: we have to predict it, and we never get to see the true values. Kaggle does the scoring. That is why we must keep PassengerId in the test set; otherwise Kaggle cannot match our predictions against its own table, which contains the target, and show you the score.
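
To make the role of PassengerId concrete, here is a minimal sketch of how predictions could be paired with their ids. The frame `test_demo`, its values, and the all-zero `predictions` are made-up placeholders, not the real data or a real model.

```python
import pandas as pd

# Hypothetical stand-in for the real test set (ids and values are made up).
test_demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Pclass": [3, 3, 2]})

# Placeholder predictions; a trained model would produce these.
predictions = [0, 0, 0]

# Keep PassengerId next to each prediction so every row can be matched
# against the hidden target on Kaggle's side.
submission = pd.DataFrame({
    "PassengerId": test_demo["PassengerId"],
    "Survived": predictions,
})

# With no path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Dropping PassengerId before predicting is fine as long as it is kept somewhere and re-attached in the same row order.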

# Count rows (samples); count().sum() would add up non-null cells across all columns instead
trainingTotal = len(train)
testingTotal  = len(test)
total         = trainingTotal + testingTotal

print(f'Samples in training dataframe: {trainingTotal}')
print(f'Samples in testing  dataframe: {testingTotal}')
print(f'Proportion: Training {trainingTotal/total*100:.1f}% x Test {testingTotal/total*100:.1f}%')

The training set has 12 columns: 11 features and 1 target. It contains 891 samples.
The target is the Survived column.
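
The row and column counts can be read straight from the frame. The sketch below uses a made-up three-row frame (not the real data) to show the difference between counting samples with `shape` and counting non-null cells with `count()`.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one missing Age value.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Survived": [0, 1, 1],
})

# shape gives (number of samples, number of columns).
n_rows, n_cols = demo.shape

# count() gives the number of non-null cells per column, so missing values lower it.
non_null = demo.count()

print(f"samples: {n_rows}, columns: {n_cols}")
print(non_null)
print(f"non-null cells in total: {non_null.sum()}")
```

On the real data, `train.shape` should report the 891 samples and 12 columns mentioned above.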

  • PassengerId. Unique ID.
  • Pclass. Ticket class.
  • Name. Passenger's full name.
  • Sex. Male or female.
  • Age. Age in years.
  • SibSp. Number of siblings and spouses.
  • Parch. Number of parents and children.
  • Ticket. Ticket number.
  • Fare. Ticket price.
  • Cabin. Cabin number.
  • Embarked. Port of embarkation.
# Create a list with the features of interest
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
# Add the target as the last column; copy() avoids a SettingWithCopyWarning on later assignments
train = train[features + ['Survived']].copy()
train.head()

Let's play!

Preprocessing

First, we have to convert all numerical features to float.

train.info()

Sex, Ticket, Cabin and Embarked are categorical features, so we ignore them for now. The others we convert to float.

for feature in features:
    if feature not in ['Sex', 'Ticket', 'Cabin', 'Embarked']:
        print(feature)
        train[feature] = pd.to_numeric(train[feature], downcast="float")
        test[feature] = pd.to_numeric(test[feature], downcast="float")

train['Survived'] = pd.to_numeric(train['Survived'], downcast="float")

print('Survived')
train.info()
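
As an alternative to hard-coding the categorical column names, the numeric columns can be picked by dtype. This is only a sketch on a made-up frame: it swaps in `select_dtypes` and `astype("float32")` for the `pd.to_numeric` loop used above, and the frame `demo` is hypothetical.

```python
import pandas as pd

# Made-up frame mixing integer, text and float columns.
demo = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.28, 7.92],
})

# Pick every numeric column; text columns such as Sex are left out.
numeric_cols = list(demo.select_dtypes(include="number").columns)

# Convert only those columns to 32-bit floats.
demo = demo.astype({col: "float32" for col in numeric_cols})

print(numeric_cols)
print(demo.dtypes)
```

The explicit list in the loop above has the advantage of being visible at a glance; the dtype-based selection adapts automatically if columns are added or removed.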