In this tutorial, we will learn how to use TFLearn and TensorFlow to model the survival chances of Titanic passengers from their personal information (such as gender, age, and so on). To tackle this classic machine learning task, we are going to build a DNN classifier.
Let's take a look at the dataset (TFLearn will automatically download it for you).
For each passenger, the following information is provided:
survived Survived (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
Here are some samples extracted from the dataset:
survived | pclass | name | sex | age | sibsp | parch | ticket | fare |
1 | 1 | Aubart, Mme. Leontine Pauline | female | 24 | 0 | 0 | PC 17477 | 69.3000 |
0 | 2 | Bowenur, Mr. Solomon | male | 42 | 0 | 0 | 211535 | 13.0000 |
1 | 3 | Baclini, Miss. Marie Catherine | female | 5 | 2 | 1 | 2666 | 19.2583 |
0 | 3 | Youseff, Mr. Gerious | male | 45.5 | 0 | 0 | 2628 | 7.2250 |
There are two classes in our task: not survived (class = 0) and survived (class = 1), and the passenger data has eight features.
The Titanic dataset is stored in a CSV file, so we can use the TFLearn load_csv() function to load the data from the file into a Python list. We specify the target_column argument to indicate that our labels (survived or not) are located in the first column (id = 0). The function returns a tuple (data, labels).
Let's start by importing the numpy and TFLearn libraries:
import numpy as np
import tflearn
Download the titanic dataset:
from tflearn.datasets import titanic
titanic.download_dataset('titanic_dataset.csv')
Load the CSV file, indicating that the first column represents labels:
from tflearn.data_utils import load_csv
data, labels = load_csv('titanic_dataset.csv', target_column=0,
categorical_labels=True, n_classes=2)
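The categorical_labels=True argument converts each raw 0/1 label into a two-element one-hot vector, which is the form the softmax output layer expects. A minimal NumPy sketch of that encoding (independent of TFLearn; the sample labels are invented):

```python
import numpy as np

# Raw 0/1 labels, as read from the first column of the CSV
raw_labels = [1, 0, 1, 0]

# One-hot encode: class 0 -> [1, 0], class 1 -> [0, 1]
n_classes = 2
one_hot = np.eye(n_classes)[raw_labels]
print(one_hot)
# [[0. 1.]
#  [1. 0.]
#  [0. 1.]
#  [1. 0.]]
```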
The data needs some preprocessing before it can be used in our DNN classifier: we must delete the fields that don't help our analysis. We discard the name and ticket fields, because we assume that a passenger's name and ticket number are not correlated with their chance of surviving:
def preprocess(data, columns_to_ignore):
    # Sort the column ids in descending order and delete those columns
    for col in sorted(columns_to_ignore, reverse=True):
        [r.pop(col) for r in data]
    for i in range(len(data)):
        # Convert the sex field to float (so it can be manipulated)
        data[i][1] = 1. if data[i][1] == 'female' else 0.
    return np.array(data, dtype=np.float32)
As already described, the name and ticket fields (column indices 1 and 6) will be ignored in the analysis:
to_ignore = [1, 6]
Here, we call the preprocess function:
data = preprocess(data, to_ignore)
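To see what preprocess actually does, here is a self-contained run on two made-up rows in the same field order as the loaded data (pclass, name, sex, age, sibsp, parch, ticket, fare; the function is repeated so the snippet runs on its own). The name and ticket columns are dropped, and sex becomes 1.0 or 0.0:

```python
import numpy as np

def preprocess(data, columns_to_ignore):
    # Delete the ignored columns, highest index first so that the
    # remaining indices stay valid
    for col in sorted(columns_to_ignore, reverse=True):
        [r.pop(col) for r in data]
    for i in range(len(data)):
        # Sex is at index 1 once the name column has been removed
        data[i][1] = 1. if data[i][1] == 'female' else 0.
    return np.array(data, dtype=np.float32)

sample = [
    [1, 'Aubart, Mme. Leontine Pauline', 'female', 24, 0, 0, 'PC 17477', 69.30],
    [2, 'Bowenur, Mr. Solomon', 'male', 42, 0, 0, '211535', 13.00],
]
processed = preprocess(sample, [1, 6])
print(processed.shape)  # (2, 6)
print(processed[0])     # pclass, sex, age, sibsp, parch, fare
```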
First of all, we specify the shape of our input data. Each input sample has a total of 6 features, and we will process samples in batches to save memory, so our data input shape is [None, 6]. The None parameter denotes an unknown dimension, so that we can change the number of samples processed in a batch:
net = tflearn.input_data(shape=[None, 6])
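The None dimension simply leaves the batch axis unspecified: any array of shape (batch_size, 6) is a valid input. A quick NumPy illustration (the zero-filled array stands in for the preprocessed data):

```python
import numpy as np

full_data = np.zeros((1309, 6), dtype=np.float32)  # all 1309 preprocessed samples
batch = full_data[:16]                             # one mini-batch of 16 samples
print(full_data.shape, batch.shape)  # (1309, 6) (16, 6)
```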
Finally, we build a three-layer neural network with this simple sequence of statements:
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 32)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net)
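The softmax activation turns the two output units into class probabilities, and the regression layer (whose defaults, as the training log below suggests, include the Adam optimizer and a categorical cross-entropy loss) trains them against the one-hot labels. A NumPy sketch of these two computations for a single sample (the logit values are invented):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([0.3, 1.2])   # raw outputs of the 2-unit layer
probs = softmax(logits)         # class probabilities, summing to 1
target = np.array([0.0, 1.0])   # one-hot label: survived

# Categorical cross-entropy loss for this sample
loss = -np.sum(target * np.log(probs))
print(probs, loss)
```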
TFLearn provides a model wrapper, DNN, that can automatically perform neural network classification tasks:
model = tflearn.DNN(net)
We will run it for 10 epochs, with a batch size of 16:
model.fit(data, labels, n_epoch=10, batch_size=16, show_metric=True)
Running the model, you should have an output as follows:
Training samples: 1309
Validation samples: 0
--
Training Step: 82 | total loss: 0.64003
| Adam | epoch: 001 | loss: 0.64003 - acc: 0.6620 -- iter: 1309/1309
--
Training Step: 164 | total loss: 0.61915
| Adam | epoch: 002 | loss: 0.61915 - acc: 0.6614 -- iter: 1309/1309
--
Training Step: 246 | total loss: 0.56067
| Adam | epoch: 003 | loss: 0.56067 - acc: 0.7171 -- iter: 1309/1309
--
Training Step: 328 | total loss: 0.51807
| Adam | epoch: 004 | loss: 0.51807 - acc: 0.7799 -- iter: 1309/1309
--
Training Step: 410 | total loss: 0.47475
| Adam | epoch: 005 | loss: 0.47475 - acc: 0.7962 -- iter: 1309/1309
--
Training Step: 492 | total loss: 0.51677
| Adam | epoch: 006 | loss: 0.51677 - acc: 0.7701 -- iter: 1309/1309
--
Training Step: 574 | total loss: 0.48988
| Adam | epoch: 007 | loss: 0.48988 - acc: 0.7891 -- iter: 1309/1309
--
Training Step: 656 | total loss: 0.55073
| Adam | epoch: 008 | loss: 0.55073 - acc: 0.7427 -- iter: 1309/1309
--
Training Step: 738 | total loss: 0.50242
| Adam | epoch: 009 | loss: 0.50242 - acc: 0.7854 -- iter: 1309/1309
--
Training Step: 820 | total loss: 0.41557
| Adam | epoch: 010 | loss: 0.41557 - acc: 0.8110 -- iter: 1309/1309
--
The model accuracy is around 81%, which means that it can predict the correct outcome (survived or not) for 81% of the total passengers.
Finally, evaluate the model to get its final accuracy (note that, since we did not set aside a validation split, we are evaluating on the training data itself):
accuracy = model.evaluate(data, labels, batch_size=16)
print('Accuracy: ', accuracy)
The following is the output:
Accuracy: [0.78456837289473591]
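With the trained model you can also score new passengers, as long as they go through the same preprocessing. The two records below are hypothetical (names and values invented), and the preprocess function is repeated so the snippet stands on its own; the model.predict call is shown commented out because it requires the model trained above:

```python
import numpy as np

def preprocess(data, columns_to_ignore):
    # Same preprocessing as used for the training data
    for col in sorted(columns_to_ignore, reverse=True):
        [r.pop(col) for r in data]
    for i in range(len(data)):
        data[i][1] = 1. if data[i][1] == 'female' else 0.
    return np.array(data, dtype=np.float32)

# Hypothetical passengers: pclass, name, sex, age, sibsp, parch, ticket, fare
jack = [3, 'Jack Dawson', 'male', 19, 0, 0, 'N/A', 5.0000]
rose = [1, 'Rose DeWitt Bukater', 'female', 17, 1, 2, 'N/A', 100.0000]
pred_data = preprocess([jack, rose], [1, 6])
print(pred_data.shape)  # (2, 6)

# With the model trained above:
# pred = model.predict(pred_data)
# print('Jack survival probability:', pred[0][1])
# print('Rose survival probability:', pred[1][1])
```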