Titanic survival predictor

In this tutorial, we will learn to use TFLearn and TensorFlow to model the survival chance of Titanic passengers using their personal information (such as gender, age, and so on). To tackle this classic machine learning task, we are going to build a DNN classifier.

Let's take a look at the dataset (TFLearn will automatically download it for you).

For each passenger, the following information is provided:

survived Survived (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare

Here are some samples extracted from the dataset:

survived pclass name sex age sibsp parch ticket fare
1 1 Aubart, Mme. Leontine Pauline female 24 0 0 PC 17477 69.3000
0 2 Bowenur, Mr. Solomon male 42 0 0 211535 13.0000
1 3 Baclini, Miss. Marie Catherine female 5 2 1 2666 19.2583
0 3 Youseff, Mr. Gerious male 45.5 0 0 2628 7.2250

There are two classes in our task: not survived (class = 0) and survived (class = 1), and the passenger data has eight features.

The Titanic dataset is stored in a CSV file, so we can use the TFLearn load_csv() function to load the data from the file into a Python list. We specify the target_column argument to indicate that our labels (survived or not) are located in the first column (index 0). The function returns a tuple (data, labels).

Let's start by importing the numpy and TFLearn libraries:

import numpy as np 
import tflearn

Download the Titanic dataset:

from tflearn.datasets import titanic 
titanic.download_dataset('titanic_dataset.csv')

Load the CSV file, indicating that the first column represents labels:

from tflearn.data_utils import load_csv 
data, labels = load_csv('titanic_dataset.csv', target_column=0,
                        categorical_labels=True, n_classes=2)
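With categorical_labels=True and n_classes=2, each label comes back one-hot encoded: class 0 (not survived) becomes [1., 0.] and class 1 (survived) becomes [0., 1.]. The following is a minimal NumPy sketch of that encoding, written only as an illustration (it is not TFLearn's actual implementation):

```python
import numpy as np

def one_hot(labels, n_classes):
    # Build a zero matrix and set a 1 at each sample's class index
    encoded = np.zeros((len(labels), n_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.
    return encoded

raw = [1, 0, 1, 0]          # survived / not survived
print(one_hot(raw, 2))
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]
#  [1. 0.]]
```

This one-hot form is what the softmax output layer of the network will be trained against.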

The data needs some preprocessing before it can be used by our DNN classifier. In particular, we must delete the columns (fields) that don't help the analysis. We discard the name and ticket fields, because we estimate that a passenger's name and ticket number are not correlated with their chance of surviving:

def preprocess(data, columns_to_ignore):

The preprocessing phase starts by sorting the column ids in descending order and deleting those columns (descending order ensures that removing one column doesn't shift the indices of the columns still to be removed):

    for id in sorted(columns_to_ignore, reverse=True):
        [r.pop(id) for r in data]
    for i in range(len(data)):

The sex field is converted to a float (1. for female, 0. for male) so it can be handled numerically:

        data[i][1] = 1. if data[i][1] == 'female' else 0.
    return np.array(data, dtype=np.float32)

As already described, the name and ticket fields will be ignored in the analysis; they correspond to columns 1 and 6 of the loaded data:

to_ignore = [1, 6]

Here, we call the preprocess procedure:

data = preprocess(data, to_ignore)
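To make the effect of preprocess concrete, here is the same logic in a self-contained form, applied to two hypothetical rows (made-up values, not real dataset entries). Columns 1 (name) and 6 (ticket) are dropped, and the sex field becomes 1. for female, 0. for male:

```python
import numpy as np

def preprocess(data, columns_to_ignore):
    # Drop ignored columns from highest index to lowest,
    # so earlier pops don't shift later indices
    for col in sorted(columns_to_ignore, reverse=True):
        [r.pop(col) for r in data]
    for i in range(len(data)):
        # After dropping 'name', the sex field sits at index 1
        data[i][1] = 1. if data[i][1] == 'female' else 0.
    return np.array(data, dtype=np.float32)

sample = [
    # pclass, name, sex, age, sibsp, parch, ticket, fare
    [1, 'Doe, Ms. Jane', 'female', 24, 0, 0, 'PC 0000', 69.30],
    [3, 'Doe, Mr. John', 'male', 45, 0, 0, '0000', 7.22],
]
out = preprocess(sample, [1, 6])
print(out.shape)  # (2, 6): two samples, six numeric features each
print(out)
```

Each row goes from eight mixed-type fields to six floats, which is exactly the shape the network's input layer will expect.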

First of all, we specify the shape of our input data. Each input sample has a total of six features, and we will process samples in batches to save memory, so our data input shape is [None, 6]. The None parameter means an unknown dimension, so that we can change the number of samples processed in a batch:

net = tflearn.input_data(shape=[None, 6])
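The None dimension only fixes the number of features per sample; the batch dimension is free to vary from call to call. A quick NumPy illustration of two batches of different sizes that both fit a [None, 6] placeholder:

```python
import numpy as np

# Two batches of different sizes; both match a [None, 6] input shape
batch_a = np.zeros((16, 6), dtype=np.float32)  # a full training batch of 16
batch_b = np.zeros((5, 6), dtype=np.float32)   # e.g. a smaller final batch

print(batch_a.shape, batch_b.shape)  # (16, 6) (5, 6)
```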

Finally, we build a three-layer neural network with this simple sequence of statements:

net = tflearn.fully_connected(net, 32) 
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
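The last fully connected layer uses a softmax activation, so its two outputs form a probability distribution over the classes (not survived, survived). A small NumPy sketch of softmax on made-up logits, just to show the normalization:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then normalize
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

probs = softmax(np.array([1.2, -0.4]))  # hypothetical layer outputs
print(probs)        # two non-negative values
print(probs.sum())  # they sum to 1
```

The regression layer then trains these probabilities against the one-hot labels (by default with the Adam optimizer, as the training log below shows).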

TFLearn provides a model wrapper, DNN, that can automatically perform neural network classification tasks:

model = tflearn.DNN(net)

We will run it for 10 epochs, with a batch size of 16:

model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)

When you run the model, you should see output similar to the following:

Training samples: 1309 
Validation samples: 0
--
Training Step: 82 | total loss: 0.64003
| Adam | epoch: 001 | loss: 0.64003 - acc: 0.6620 -- iter: 1309/1309
--
Training Step: 164 | total loss: 0.61915
| Adam | epoch: 002 | loss: 0.61915 - acc: 0.6614 -- iter: 1309/1309
--
Training Step: 246 | total loss: 0.56067
| Adam | epoch: 003 | loss: 0.56067 - acc: 0.7171 -- iter: 1309/1309
--
Training Step: 328 | total loss: 0.51807
| Adam | epoch: 004 | loss: 0.51807 - acc: 0.7799 -- iter: 1309/1309
--
Training Step: 410 | total loss: 0.47475
| Adam | epoch: 005 | loss: 0.47475 - acc: 0.7962 -- iter: 1309/1309
--
Training Step: 492 | total loss: 0.51677
| Adam | epoch: 006 | loss: 0.51677 - acc: 0.7701 -- iter: 1309/1309
--
Training Step: 574 | total loss: 0.48988
| Adam | epoch: 007 | loss: 0.48988 - acc: 0.7891 -- iter: 1309/1309
--
Training Step: 656 | total loss: 0.55073
| Adam | epoch: 008 | loss: 0.55073 - acc: 0.7427 -- iter: 1309/1309
--
Training Step: 738 | total loss: 0.50242
| Adam | epoch: 009 | loss: 0.50242 - acc: 0.7854 -- iter: 1309/1309
--
Training Step: 820 | total loss: 0.41557
| Adam | epoch: 010 | loss: 0.41557 - acc: 0.8110 -- iter: 1309/1309
--

The model accuracy is around 81%, which means that it can predict the correct outcome (survived or not) for 81% of the total passengers.
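Accuracy here is simply the fraction of passengers whose predicted class (the argmax of the softmax output) matches the true label. A short NumPy sketch of that computation, using made-up predictions rather than real model output:

```python
import numpy as np

def accuracy(predictions, labels):
    # predictions: softmax outputs; labels: one-hot targets
    pred_classes = np.argmax(predictions, axis=1)
    true_classes = np.argmax(labels, axis=1)
    return float(np.mean(pred_classes == true_classes))

preds = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
truth = np.array([[1., 0.], [0., 1.], [0., 1.], [0., 1.]])
print(accuracy(preds, truth))  # 0.75: three of four predictions match
```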

Finally, evaluate the model to get the final accuracy:

accuracy = model.evaluate(data, labels, batch_size=16)
print('Accuracy: ', accuracy)

The following is the output:

Accuracy:  [0.78456837289473591]