General boilerplate code explanation

In this section, we will introduce the boilerplate code used to implement the following algorithms with Patsy and scikit-learn. We do this because most of the code for the algorithms that follow is common to all of them, so it only needs to be presented once.

In the following sections, the workings of the algorithms will be described together with the code specific to each algorithm:

  1. First, let's make sure that we're in the correct folder by using the following command line. Assuming that the working directory is located at ~/devel/Titanic, we have the following:
    In [17]: %cd ~/devel/Titanic
            /home/youruser/devel/Titanic
  2. Here, we import the required packages and read in our training and test datasets:
    In [18]: import matplotlib.pyplot as plt
                 import pandas as pd
                 import numpy as np
                 import patsy as pt
    In [19]: train_df = pd.read_csv('csv/train.csv', header=0)
             test_df = pd.read_csv('csv/test.csv', header=0) 
  
  3. Next, we specify the formulas we would like to submit to Patsy (a short sketch of how Patsy expands one of these formulas follows the code):
    In [21]: formula1 = 'C(Pclass) + C(Sex) + Fare'
             formula2 = 'C(Pclass) + C(Sex)'
             formula3 = 'C(Sex)'
             formula4 = 'C(Pclass) + C(Sex) + Age + SibSp + Parch'
             formula5 = 'C(Pclass) + C(Sex) + Age + SibSp + Parch + C(Embarked)'
             formula6 = 'C(Pclass) + C(Sex) + Age + SibSp + C(Embarked)'
             formula7 = 'C(Pclass) + C(Sex) + Age + Parch + C(Embarked)'
             formula8 = 'C(Pclass) + C(Sex) + SibSp + Parch + C(Embarked)'
    In [23]: formula_map = {'PClass_Sex_Fare' : formula1,
                            'PClass_Sex' : formula2,
                            'Sex' : formula3,
                            'PClass_Sex_Age_Sibsp_Parch' : formula4,
                            'PClass_Sex_Age_Sibsp_Parch_Embarked' : formula5,
                            'PClass_Sex_Age_SibSp_Embarked' : formula6,
                            'PClass_Sex_Age_Parch_Embarked' : formula7,
                            'PClass_Sex_SibSp_Parch_Embarked' : formula8
                           }
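To make the formula syntax concrete, the following is a minimal, purely illustrative sketch (not part of the recipe itself) of how Patsy handles one of these formulas. Each C(..) term marks a column as categorical, and Patsy encodes it as indicator (dummy) columns using treatment coding, alongside an intercept:

    # Illustrative only: expand formula2 against the training data
    y, X = pt.dmatrices('Survived ~ ' + formula2, train_df,
                        return_type='dataframe')
    # With treatment coding, the columns of X are Intercept,
    # C(Pclass)[T.2], C(Pclass)[T.3], and C(Sex)[T.male]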

We will define a function that helps us to handle missing values. The following function finds the cells within the DataFrame that have null values, obtains the set of similar passengers, and sets the null value to the mean value of that feature for the set of similar passengers. Similar passengers are defined as those having the same gender and passenger class as the passenger with the null feature value:

    In [24]:
    def fill_null_vals(df, col_name):
        # Passengers with a null value in the given column
        null_passengers = df[df[col_name].isnull()]
        passenger_id_list = null_passengers['PassengerId'].tolist()
        df_filled = df.copy()
        for pass_id in passenger_id_list:
            idx = df[df['PassengerId'] == pass_id].index[0]
            # Similar passengers share the same gender and passenger class
            similar_passengers = df[(df['Sex'] == null_passengers['Sex'][idx]) &
                                    (df['Pclass'] == null_passengers['Pclass'][idx])]
            # Impute the mean of the feature over the similar passengers
            mean_val = np.mean(similar_passengers[col_name].dropna())
            df_filled.loc[idx, col_name] = mean_val
        return df_filled
  

Here, we create filled versions of our training and test DataFrames. The test DataFrame is what the fitted scikit-learn model will generate predictions on, producing the output that will be submitted to Kaggle for evaluation:

    In [28]: train_df_filled=fill_null_vals(train_df,'Fare')
             train_df_filled=fill_null_vals(train_df_filled,'Age')
             assert len(train_df_filled)==len(train_df)
    
             test_df_filled=fill_null_vals(test_df,'Fare')
             test_df_filled=fill_null_vals(test_df_filled,'Age')
             assert len(test_df_filled)==len(test_df)
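As a quick sanity check (our own addition, not part of the original recipe), we can confirm that the imputation left no nulls behind in the two columns we filled:

    # Hypothetical check: the filled columns should contain no nulls
    for df in (train_df_filled, test_df_filled):
        assert df['Fare'].isnull().sum() == 0
        assert df['Age'].isnull().sum() == 0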
  

Here is the actual implementation of the call to scikit-learn to learn from the training data by fitting a model and then generate predictions on the test dataset. Note that even though this is boilerplate code, for the purpose of illustration, an actual call is made to a specific algorithm—in this case, DecisionTreeClassifier.

The output data is written to files with descriptive names, for example, csv/dt_PClass_Sex_Age_Sibsp_Parch_1.csv and csv/dt_PClass_Sex_Fare_1.csv:

    In [29]:
    from sklearn import metrics, svm, tree
    for formula_name, formula in formula_map.items():
        print("name=%s formula=%s" % (formula_name, formula))
        # Build design matrices for the training features and labels
        y_train, X_train = pt.dmatrices('Survived ~ ' + formula,
                                        train_df_filled, return_type='dataframe')
        y_train = np.ravel(y_train)
        model = tree.DecisionTreeClassifier(criterion='entropy',
                                            max_depth=3, min_samples_leaf=5)
        print("About to fit...")
        dt_model = model.fit(X_train, y_train)
        print("Training score:%s" % dt_model.score(X_train, y_train))
        # Build the test design matrix and predict on it
        X_test = pt.dmatrix(formula, test_df_filled)
        predicted = dt_model.predict(X_test)
        print("predicted:%s" % predicted[:5])
        assert len(predicted) == len(test_df)
        pred_results = pd.Series(predicted, name='Survived')
        dt_results = pd.concat([test_df['PassengerId'], pred_results], axis=1)
        dt_results.Survived = dt_results.Survived.astype(int)
        results_file = 'csv/dt_%s_1.csv' % (formula_name)
        print("output file: %s\n" % results_file)
        dt_results.to_csv(results_file, index=False)
  

The preceding code follows a standard recipe, and the summary is as follows (a sketch showing how to factor the recipe into a reusable helper appears after the list):

  1. Read in the training and test datasets.
  2. Fill in any missing values for the features we wish to consider in both datasets.
  3. Define formulas for the various feature combinations we wish to generate machine learning models for in Patsy.
  4. For each formula, perform the following set of steps:
     1. Call Patsy to create design matrices for our training feature set and training label set (designated by X_train and y_train).
     2. Instantiate the appropriate scikit-learn classifier. In this case, we use DecisionTreeClassifier.
     3. Fit the model by calling the fit(..) method.
     4. Make a call to Patsy to create a design matrix (X_test) for the test dataset via a call to patsy.dmatrix(..).
     5. Predict on the X_test design matrix, and save the results in the variable predicted.
     6. Write our predictions to an output file, which will be submitted to Kaggle.
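Since only the classifier instantiation and the output filename prefix change from one algorithm to the next, the recipe lends itself to being factored into a helper function. The following is a minimal sketch of one way to do this; the name run_model and its prefix parameter are our own inventions, not part of the original code:

    def run_model(model, prefix, formula_map, train_df, test_df):
        # Fit `model` once per Patsy formula and write one Kaggle
        # submission file per formula: csv/<prefix>_<formula_name>_1.csv
        for formula_name, formula in formula_map.items():
            y_train, X_train = pt.dmatrices('Survived ~ ' + formula,
                                            train_df, return_type='dataframe')
            fitted = model.fit(X_train, np.ravel(y_train))
            X_test = pt.dmatrix(formula, test_df)
            predicted = pd.Series(fitted.predict(X_test), name='Survived')
            results = pd.concat([test_df['PassengerId'],
                                 predicted.astype(int)], axis=1)
            results.to_csv('csv/%s_%s_1.csv' % (prefix, formula_name),
                           index=False)

    # Reproduces the decision tree loop shown earlier
    run_model(tree.DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                          min_samples_leaf=5),
              'dt', formula_map, train_df_filled, test_df_filled)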

We will consider the following supervised learning algorithms; a sketch of how each one is instantiated follows the list:

  • Logistic regression
  • Support vector machine
  • Random forest
  • Decision trees
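Each of these slots into the same boilerplate by swapping the classifier instantiation. As a sketch, with representative rather than tuned hyperparameters (our own choices), the instantiations might look as follows:

    from sklearn import linear_model, svm, ensemble, tree

    classifiers = {
        'lr': linear_model.LogisticRegression(),                  # logistic regression
        'svm': svm.SVC(kernel='linear'),                          # support vector machine
        'rf': ensemble.RandomForestClassifier(n_estimators=100),  # random forest
        'dt': tree.DecisionTreeClassifier(max_depth=3),           # decision tree
    }

Each entry could then be passed to the run_model helper sketched earlier, with the dictionary key serving as the output filename prefix.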