The Titanic: Machine Learning from Disaster problem

The dataset for the Titanic problem consists of the passenger manifest for the doomed voyage, along with various features and an indicator variable telling whether each passenger survived the sinking of the ship. The essence of the problem is to predict, given a passenger and their associated features, whether that passenger survived the sinking of the Titanic.

The data consists of two datasets: a training dataset and a test dataset. The training dataset contains 891 passenger cases, and the test dataset contains 418 passenger cases.

The training dataset has 12 variables: 11 features and 1 dependent/indicator variable, Survived, which indicates whether the passenger survived the disaster or not.

The feature variables are as follows:

  • PassengerId
  • Name
  • Ticket
  • Cabin
  • Sex
  • Pclass (passenger class)
  • Fare
  • Parch (number of parents and children aboard)
  • Age
  • SibSp (number of siblings and spouses aboard)
  • Embarked
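To make the layout concrete, here is a minimal sketch of loading the training data with pandas and separating Survived from the features. It assumes the column layout of Kaggle's train.csv; the three rows are invented for illustration and read from an in-memory string, so the file name mentioned in the comment is only a stand-in for the real download.

```python
import io

import pandas as pd

# Hypothetical rows that mimic the column layout of Kaggle's train.csv;
# the values are invented for illustration only.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,A1,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,B2,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,C3,7.92,,S
"""

# With the real data this would be something like pd.read_csv("train.csv").
train = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Separate the dependent variable from the feature columns.
y = train["Survived"]
X = train.drop(columns=["Survived"])

print(train.shape)           # (rows, columns)
print(X.columns.tolist())    # feature names
print(train.isnull().sum())  # missing values per column
```

Calling isnull().sum() early is a cheap way to see which columns will need attention before modeling.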

We can use pandas to preprocess the data in the following ways:

  • Cleaning the data and categorizing some variables
  • Excluding features that are unlikely to have any bearing on a passenger's survival; for example, Name
  • Handling missing data
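The three preprocessing steps above can be sketched with pandas as follows. This is one possible approach, not a prescribed recipe. The sample passengers are invented, the helper name preprocess is hypothetical, and the specific choices (dropping PassengerId, Ticket, and Cabin alongside Name, filling Age with the median, filling Embarked with the most frequent value) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical passengers using the Titanic column names; values are invented.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Lee, Mr. Sam"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 30.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["A1", "B2", "C3", "D4"],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})


def preprocess(frame):
    """Return a cleaned copy of the frame (hypothetical helper)."""
    out = frame.copy()

    # Exclusion: drop columns assumed to carry little predictive signal.
    out = out.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

    # Missing data: fill Age with the median, Embarked with the mode.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])

    # Categorization: encode Sex as 0/1 and one-hot encode Embarked.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out = pd.get_dummies(out, columns=["Embarked"])
    return out


clean = preprocess(df)
print(clean.head())
```

Working on a copy keeps the raw frame intact, which makes it easy to compare alternative strategies for the missing values later.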

We can tackle this problem with various algorithms, including the following:

  • Decision trees
  • Neural networks
  • Random forests
  • Support vector machines
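As a rough illustration of how the algorithms listed above might be compared, the sketch below fits one classifier of each kind and scores it on held-out rows. It assumes scikit-learn, which the text above does not name, and it uses a synthetic feature matrix as a stand-in for the preprocessed Titanic features; the hyperparameters are arbitrary choices rather than recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a preprocessed feature matrix (one row per passenger)
# and a binary target; these are NOT Titanic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Simple hold-out split: first 30 rows for training, last 10 for testing.
X_train, y_train = X[:30], y[:30]
X_test, y_test = X[30:], y[30:]

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "neural network": MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                    random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "support vector machine": SVC(kernel="rbf"),
}

# Fit each model and record its accuracy on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

All four estimators share the same fit/score interface, so swapping in the real feature matrix would only change how X and y are built.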