In this tutorial, we will learn how to use TFLearn and TensorFlow to model the survival chances of Titanic passengers from their personal information (such as gender, age, and so on). To tackle this classic machine learning task, we are going to build a DNN classifier.
Let's take a look at the dataset (TFLearn will automatically download it for you).
For each passenger, the following information is provided:
survived Survived (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
Here are some samples extracted from the dataset:
survived | pclass | name | sex | age | sibsp | parch | ticket | fare |
1 | 1 | Aubart, Mme. Leontine Pauline | female | 24 | 0 | 0 | PC 17477 | 69.3000 |
0 | 2 | Bowenur, Mr. Solomon | male | 42 | 0 | 0 | 211535 | 13.0000 |
1 | 3 | Baclini, Miss. Marie Catherine | female | 5 | 2 | 1 | 2666 | 19.2583 |
0 | 3 | Youseff, Mr. Gerious | male | 45.5 | 0 | 0 | 2628 | 7.2250 |
There are two classes in our task: not survived (class = 0) and survived (class = 1), and the passenger data has eight features.
The Titanic dataset is stored in a CSV file, so we can use the TFLearn load_csv() function to load the data from the file into a Python list. We specify the target_column argument to indicate that our labels (survived or not) are located in the first column (id = 0). The function returns a tuple (data, labels).
Let's start by importing the numpy and TFLearn libraries:
import numpy as np
import tflearn
Download the titanic dataset:
from tflearn.datasets import titanic
titanic.download_dataset('titanic_dataset.csv')
Load the CSV file, indicating that the first column represents labels:
from tflearn.data_utils import load_csv
data, labels = load_csv('titanic_dataset.csv', target_column=0,
categorical_labels=True, n_classes=2)
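The categorical_labels=True argument converts each raw 0/1 label into a two-element one-hot vector, which is the form the softmax output layer expects. A minimal NumPy sketch of that encoding (independent of TFLearn; the sample labels are invented):

```python
import numpy as np

# Raw 0/1 labels, as read from the first column of the CSV
raw_labels = [1, 0, 1, 0]

# One-hot encode: class 0 -> [1, 0], class 1 -> [0, 1]
n_classes = 2
one_hot = np.eye(n_classes)[raw_labels]
print(one_hot)
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]
#  [1. 0.]]
```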
The data needs some preprocessing before it can be used in our DNN classifier: we must delete the fields that don't help our analysis. We discard the name and ticket fields, because we assume that a passenger's name and ticket number are not correlated with their chance of surviving:
def preprocess(data, columns_to_ignore):
    # Sort the column ids in descending order and delete those columns
    for col in sorted(columns_to_ignore, reverse=True):
        [r.pop(col) for r in data]
    for i in range(len(data)):
        # Convert the sex field to float (so it can be manipulated)
        data[i][1] = 1. if data[i][1] == 'female' else 0.
    return np.array(data, dtype=np.float32)
As already described, the name and ticket fields (column indices 1 and 6) will be ignored in the analysis:
to_ignore = [1, 6]
Here, we call the preprocess function:
data = preprocess(data, to_ignore)
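To see what preprocess actually does, here is a self-contained run on two made-up rows in the same field order as the loaded data (pclass, name, sex, age, sibsp, parch, ticket, fare; the function is repeated so the snippet runs on its own). The name and ticket columns are dropped, and sex becomes 1.0 or 0.0:

```python
import numpy as np

def preprocess(data, columns_to_ignore):
    # Delete the ignored columns, highest index first so that the
    # remaining indices stay valid
    for col in sorted(columns_to_ignore, reverse=True):
        [r.pop(col) for r in data]
    for i in range(len(data)):
        # Sex is at index 1 once the name column has been removed
        data[i][1] = 1. if data[i][1] == 'female' else 0.
    return np.array(data, dtype=np.float32)

sample = [
    [1, 'Aubart, Mme. Leontine Pauline', 'female', 24, 0, 0, 'PC 17477', 69.30],
    [2, 'Bowenur, Mr. Solomon', 'male', 42, 0, 0, '211535', 13.00],
]
processed = preprocess(sample, [1, 6])
print(processed.shape)  # (2, 6)
print(processed[0])     # pclass, sex, age, sibsp, parch, fare
```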
First of all, we specify the shape of our input data. Each input sample has a total of 6 features, and we will process samples in batches to save memory, so our data input shape is [None, 6]. The None parameter denotes an unknown dimension, so that we can change the number of samples processed in a batch:
net = tflearn.input_data(shape=[None, 6])
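The None dimension simply leaves the batch axis unspecified: any array of shape (batch_size, 6) is a valid input. A quick NumPy illustration (the zero-filled array stands in for the preprocessed data):

```python
import numpy as np

full_data = np.zeros((1309, 6), dtype=np.float32)  # all 1309 preprocessed samples
batch = full_data[:16]                             # one mini-batch of 16 samples
print(full_data.shape, batch.shape)  # (1309, 6) (16, 6)
```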
Finally, we build a three-layer neural network with this simple sequence of statements:
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
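The softmax activation turns the two output units into class probabilities, and the regression layer (whose defaults, as the training log below suggests, include the Adam optimizer and a categorical cross-entropy loss) trains them against the one-hot labels. A NumPy sketch of these two computations for a single sample (the logit values are invented):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([0.3, 1.2])   # raw outputs of the 2-unit layer
probs = softmax(logits)         # class probabilities, summing to 1
target = np.array([0.0, 1.0])   # one-hot label: survived

# Categorical cross-entropy loss for this sample
loss = -np.sum(target * np.log(probs))
print(probs, loss)
```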
TFLearn provides a model wrapper, DNN, that can automatically perform neural network classification tasks:
model = tflearn.DNN(net)
We will run it for 10 epochs, with a batch size of 16:
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
Running the model, you should have an output as follows:
Training samples: 1309
Validation samples: 0
--
Training Step: 82 | total loss: 0.64003
| Adam | epoch: 001 | loss: 0.64003 - acc: 0.6620 -- iter: 1309/1309
--
Training Step: 164 | total loss: 0.61915
| Adam | epoch: 002 | loss: 0.61915 - acc: 0.6614 -- iter: 1309/1309
--
Training Step: 246 | total loss: 0.56067
| Adam | epoch: 003 | loss: 0.56067 - acc: 0.7171 -- iter: 1309/1309
--
Training Step: 328 | total loss: 0.51807
| Adam | epoch: 004 | loss: 0.51807 - acc: 0.7799 -- iter: 1309/1309
--
Training Step: 410 | total loss: 0.47475
| Adam | epoch: 005 | loss: 0.47475 - acc: 0.7962 -- iter: 1309/1309
--
Training Step: 492 | total loss: 0.51677
| Adam | epoch: 006 | loss: 0.51677 - acc: 0.7701 -- iter: 1309/1309
--
Training Step: 574 | total loss: 0.48988
| Adam | epoch: 007 | loss: 0.48988 - acc: 0.7891 -- iter: 1309/1309
--
Training Step: 656 | total loss: 0.55073
| Adam | epoch: 008 | loss: 0.55073 - acc: 0.7427 -- iter: 1309/1309
--
Training Step: 738 | total loss: 0.50242
| Adam | epoch: 009 | loss: 0.50242 - acc: 0.7854 -- iter: 1309/1309
--
Training Step: 820 | total loss: 0.41557
| Adam | epoch: 010 | loss: 0.41557 - acc: 0.8110 -- iter: 1309/1309
--
The model accuracy is around 81%, which means that it can predict the correct outcome (survived or not) for 81% of the total passengers.
Finally, evaluate the model to get its final accuracy (note that, since we did not set aside a validation split, we are evaluating on the training data itself):
accuracy = model.evaluate(data, labels, batch_size=16)
print('Accuracy: ', accuracy)
The following is the output:
Accuracy: [0.78456837289473591]
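With the trained model you can also score new passengers, as long as they go through the same preprocessing. The two records below are hypothetical (names and values invented), and the preprocess function is repeated so the snippet stands on its own; the model.predict call is shown commented out because it requires the model trained above:

```python
import numpy as np

def preprocess(data, columns_to_ignore):
    # Same preprocessing as used for the training data
    for col in sorted(columns_to_ignore, reverse=True):
        [r.pop(col) for r in data]
    for i in range(len(data)):
        data[i][1] = 1. if data[i][1] == 'female' else 0.
    return np.array(data, dtype=np.float32)

# Hypothetical passengers: pclass, name, sex, age, sibsp, parch, ticket, fare
jack = [3, 'Jack Dawson', 'male', 19, 0, 0, 'N/A', 5.0000]
rose = [1, 'Rose DeWitt Bukater', 'female', 17, 1, 2, 'N/A', 100.0000]
pred_data = preprocess([jack, rose], [1, 6])
print(pred_data.shape)  # (2, 6)

# With the model trained above:
# pred = model.predict(pred_data)
# print('Jack survival probability:', pred[0][1])
# print('Rose survival probability:', pred[1][1])
```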