Handling missing values

A recurring issue when preparing datasets for machine learning is how to handle missing values in the training set.

Let's visually identify where we have missing values in our feature set.

For that, we can make use of an equivalent of the missmap function in R, written by Tom Augspurger. The next screenshot shows how much data is missing for the various features in an intuitively appealing manner:

For more information and the code used to generate this data, see the following: http://tomaugspurger.github.io/blog/2014/02/22/Visualizing%20Missing%20Data/.
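As a rough illustration of the idea behind such a plot, here is a minimal sketch that draws the missing-value mask of a DataFrame as an image. It is not Tom Augspurger's code; the helper name plot_missing_matrix and the toy_df data are hypothetical stand-ins introduced only for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_missing_matrix(df):
    """Draw one column per feature and one row per record; missing cells are dark."""
    # isnull() yields a boolean mask; cast to int so it can be shown as an image
    mask = df.isnull().to_numpy().astype(int)
    fig, ax = plt.subplots(figsize=(8, 4))
    # 'gray_r' maps 0 (present) to white and 1 (missing) to black
    ax.imshow(mask, aspect='auto', cmap='gray_r', interpolation='nearest',
              vmin=0, vmax=1)
    ax.set_xticks(np.arange(df.shape[1]))
    ax.set_xticklabels(df.columns, rotation=45, ha='right')
    ax.set_ylabel('Row')
    ax.set_title('Missing values by feature')
    return fig, ax


# Hypothetical toy data standing in for train_df
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})
fig, ax = plot_missing_matrix(toy_df)
```

Calling plt.show() afterwards displays the figure. Passing train_df instead of toy_df should give a picture in the same spirit as the one described above, although the styling will differ.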

We can also calculate how much data is missing for each of the features:

In [83]: missing_perc=train_df.apply(lambda x: 100*(1-x.count()/(1.0*len(x))))
In [85]: # Series.order() has been removed from pandas; sort_values() replaces it
         sorted_missing_perc=missing_perc.sort_values(ascending=False)
         sorted_missing_perc
Out[85]: Cabin          77.104377
         Age            19.865320
         Embarked        0.224467
         Fare            0.000000
         Ticket          0.000000
         Parch           0.000000
         SibSp           0.000000
         Sex             0.000000
         Name            0.000000
         Pclass          0.000000
         Survived        0.000000
         PassengerId     0.000000
         dtype: float64

Hence, we can see that most of the Cabin data is missing (77%), while around 20% of the Age data is missing. We therefore drop the Cabin data from our learning feature set, as it is too sparse to be of much use.
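The same percentages can be obtained more directly from the boolean mask returned by isnull(), and the sparse column can then be dropped with drop(). The following is a minimal sketch on a small made-up frame; toy_df and reduced_df are hypothetical names, and the values are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train_df
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the percentage of missing entries per feature
missing_perc = toy_df.isnull().mean() * 100
sorted_missing_perc = missing_perc.sort_values(ascending=False)
print(sorted_missing_perc)

# Drop the feature that is too sparse to be useful
reduced_df = toy_df.drop(columns=['Cabin'])
print(reduced_df.columns.tolist())
```

On the real data, the equivalent call would be train_df.drop(columns=['Cabin']), assuming a pandas version that supports the columns keyword.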

Let's do a further breakdown of the various features that we would like to examine. In the case of categorical/discrete features, we use bar plots; for continuous valued features, we use histograms. The code to generate the charts is as shown:

In [137]: import numpy as np
          import matplotlib.pyplot as plt

          bar_width=0.1
          # Map display labels to the values stored in train_df for each categorical feature
          categories_map={'Pclass':{'First':1,'Second':2,'Third':3},
                          'Sex':{'Female':'female','Male':'male'},
                          'Survived':{'Perished':0,'Survived':1},
                          'Embarked':{'Cherbourg':'C','Queenstown':'Q','Southampton':'S'},
                          'SibSp':{str(x):x for x in [0,1,2,3,4,5,8]},
                          'Parch':{str(x):x for x in range(7)}
                         }
          keyorder=['Survived','Sex','Pclass','Embarked','SibSp','Parch']
          hist_features=['Age','Fare']
          cIdx=0
          # One subplot per categorical feature plus one per continuous feature
          fig,ax=plt.subplots(len(keyorder)+len(hist_features),figsize=(10,12))

          # Bar plots for the categorical/discrete features
          # (dict.items() replaces the Python 2-only iteritems())
          for category_key,category_items in sorted(categories_map.items(),
                                                    key=lambda i:keyorder.index(i[0])):
              num_bars=len(category_items)
              index=np.arange(num_bars)
              idx=0
              for cat_name,cat_val in sorted(category_items.items()):
                  # A random RGB triple gives each bar its own color
                  ax[cIdx].bar(idx,len(train_df[train_df[category_key]==cat_val]),
                               label=cat_name,color=tuple(np.random.rand(3)))
                  idx+=1
              ax[cIdx].set_title('%s Breakdown' % category_key)
              xlabels=sorted(category_items.keys())
              ax[cIdx].set_xticks(index+bar_width)
              ax[cIdx].set_xticklabels(xlabels)
              ax[cIdx].set_ylabel('Count')
              cIdx+=1

          # Histograms for the continuous features, ignoring missing values
          for hcat in hist_features:
              ax[cIdx].hist(train_df[hcat].dropna(),color=tuple(np.random.rand(3)))
              ax[cIdx].set_title('%s Breakdown' % hcat)
              ax[cIdx].set_ylabel('Frequency')
              cIdx+=1

          fig.subplots_adjust(hspace=0.8)
          plt.show()

Take a look at the following output:

From the data and illustration in the preceding screenshot, we can observe the following:

About 1.6 times as many passengers perished as survived (62% versus 38%).
There were nearly twice as many male passengers as female passengers (65% versus 35%).
There were about 20% more passengers in third class than in first and second class combined (55% versus 45%).
Most passengers were traveling alone; that is, they had no children, parents, siblings, or spouse on board.
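Percentages like these can be read off numerically rather than estimated from the charts. The following is a minimal sketch using value_counts with normalize=True; the helper percentage_breakdown and the toy_df data are hypothetical and only illustrate the call.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df
toy_df = pd.DataFrame({
    'Survived': [0, 0, 1, 0, 1],
    'Sex': ['male', 'male', 'female', 'male', 'female'],
    'Pclass': [3, 3, 1, 2, 3],
})


def percentage_breakdown(df, column):
    """Return the share of each distinct value in a column, as a percentage."""
    return (df[column].value_counts(normalize=True) * 100).round(1)


survived_pct = percentage_breakdown(toy_df, 'Survived')
sex_pct = percentage_breakdown(toy_df, 'Sex')
print(survived_pct)
print(sex_pct)
```

Applied to train_df, the same helper could be called once per feature of interest, for example percentage_breakdown(train_df, 'Pclass').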

These observations might lead us to dig deeper and investigate whether there is some correlation between the chances of survival and a passenger's gender and fare class. This is particularly relevant given that the Titanic had a women-and-children-first policy and was carrying fewer lifeboats (20) than it was designed to carry (32).

In light of this, let's further examine the relationships between survival and some of these features. We'll start with gender:

In [85]: from collections import OrderedDict
         num_passengers=len(train_df)
         num_men=len(train_df[train_df['Sex']=='male'])
         men_survived=train_df[(train_df['Survived']==1) & (train_df['Sex']=='male')]
         num_men_survived=len(men_survived)
         num_men_perished=num_men-num_men_survived
         num_women=num_passengers-num_men
         women_survived=train_df[(train_df['Survived']==1) & (train_df['Sex']=='female')]
         num_women_survived=len(women_survived)
         num_women_perished=num_women-num_women_survived
         gender_survival_dict=OrderedDict()
         gender_survival_dict['Survived']={'Men':num_men_survived,'Women':num_women_survived}
         gender_survival_dict['Perished']={'Men':num_men_perished,'Women':num_women_perished}
         gender_survival_dict['Survival Rate']={'Men':round(100.0*num_men_survived/num_men,2),
                                                'Women':round(100.0*num_women_survived/num_women,2)}
         pd.DataFrame(gender_survival_dict)
Out[85]:

Take a look at the following table:

Gender	Survived	Perished	Survival Rate
Men	109	468	18.89
Women	233	81	74.2

We'll now illustrate this data in a bar chart:

In [76]: #code to display survival by gender
         fig = plt.figure()
         ax = fig.add_subplot(111)
         perished_data=[num_men_perished, num_women_perished]
         survived_data=[num_men_survived, num_women_survived]
         N=2
         ind = np.arange(N)     # the y locations for the groups
         width = 0.35

         survived_rects = ax.barh(ind, survived_data, width,color='green')
         perished_rects = ax.barh(ind+width, perished_data, width,color='red')

         ax.set_xlabel('Count')
         ax.set_title('Count of Survival by Gender')
         yTickMarks = ['Men','Women']
         ax.set_yticks(ind+width)
         ytickNames = ax.set_yticklabels(yTickMarks)
         plt.setp(ytickNames, rotation=45, fontsize=10)

         ## add a legend
         ax.legend((survived_rects[0], perished_rects[0]), ('Survived', 'Perished') )
         plt.show()

The preceding code produces the following bar chart:

From the preceding diagram, we can see that the majority of the women survived (74%), while most of the men perished (only 19% survived).

This leads us to the conclusion that the gender of the passenger may be a contributing factor to whether a passenger survived or not.
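The survived, perished, and survival-rate figures can also be derived in a few lines with groupby, which avoids a separate set of variables per group. This is a minimal sketch on made-up data, not the approach used above; toy_df and summary are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df
toy_df = pd.DataFrame({
    'Sex': ['male', 'male', 'female', 'male', 'female', 'female'],
    'Survived': [0, 1, 1, 0, 1, 0],
})

grouped = toy_df.groupby('Sex')['Survived']
summary = pd.DataFrame({
    # Survived is coded 0/1, so the sum counts the survivors
    'Survived': grouped.sum(),
    'Perished': grouped.count() - grouped.sum(),
    # The mean of a 0/1 column is the survival fraction
    'Survival Rate': (100.0 * grouped.mean()).round(2),
})
print(summary)
```

Because the grouping column is just a parameter, the same pattern should carry over to other features, for example by grouping on 'Pclass' instead of 'Sex'.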

Next, let's look at passenger class. First, we generate the survived and perished data for each of the three passenger classes, as well as survival rates:

    In [86]: 
    from collections import OrderedDict
    num_passengers=len(train_df)
    num_class1=len(train_df[train_df['Pclass']==1])
    class1_survived=train_df[(train_df['Survived']==1 ) & (train_df['Pclass']==1)]
    num_class1_survived=len(class1_survived)
    num_class1_perished=num_class1-num_class1_survived
    num_class2=len(train_df[train_df['Pclass']==2])
    class2_survived=train_df[(train_df['Survived']==1) & (train_df['Pclass']==2)]
    num_class2_survived=len(class2_survived)
    num_class2_perished=num_class2-num_class2_survived
    num_class3=num_passengers-num_class1-num_class2
    class3_survived=train_df[(train_df['Survived']==1 ) & (train_df['Pclass']==3)]
    num_class3_survived=len(class3_survived)
    num_class3_perished=num_class3-num_class3_survived
    pclass_survival_dict=OrderedDict()
    pclass_survival_dict['Survived']={'1st Class':num_class1_survived,
                                      '2nd Class':num_class2_survived,
                                      '3rd Class':num_class3_survived}
    pclass_survival_dict['Perished']={'1st Class':num_class1_perished,
                                      '2nd Class':num_class2_perished,
                                     '3rd Class':num_class3_perished}
    pclass_survival_dict['Survival Rate']= {'1st Class' : round(100.0*num_class1_survived/num_class1,2),
                   '2nd Class':round(100.0*num_class2_survived/num_class2,2),
                   '3rd Class':round(100.0*num_class3_survived/num_class3,2),}
    pd.DataFrame(pclass_survival_dict)
    
    Out[86]:

Then, we show them in a table:

Class	Survived	Perished	Survival Rate
First Class	136	80	62.96
Second Class	87	97	47.28
Third Class	119	372	24.24

We can then plot the data by using matplotlib in a similar manner to that for the survivor count by gender described earlier:

    In [186]:
    fig = plt.figure()
    ax = fig.add_subplot(111)
    perished_data=[num_class1_perished, num_class2_perished, num_class3_perished]
    survived_data=[num_class1_survived, num_class2_survived, num_class3_survived]
    N=3
    ind = np.arange(N)                # the x locations for the groups
    width = 0.35
    survived_rects = ax.barh(ind, survived_data, width,color='blue')
    perished_rects = ax.barh(ind+width, perished_data, width,color='red')
    ax.set_xlabel('Count')
    ax.set_title('Survivor Count by Passenger class')
    yTickMarks = ['1st Class','2nd Class', '3rd Class']
    ax.set_yticks(ind+width)
    ytickNames = ax.set_yticklabels(yTickMarks)
    plt.setp(ytickNames, rotation=45, fontsize=10)
    ## add a legend
    ax.legend( (survived_rects[0], perished_rects[0]), ('Survived', 'Perished'),
              loc=10 )
    plt.show()

This produces the following bar plot:

It seems clear from the preceding data and diagram that the higher the passenger fare class, the greater the passenger's chances of survival.

Given that both gender and fare class seem to influence the chances of a passenger's survival, let's see what happens when we combine these two features and plot a combination of both. For this, we will use the crosstab function in pandas:

In [173]: survival_counts=pd.crosstab([train_df.Pclass,train_df.Sex],train_df.Survived.astype(bool))
          survival_counts
Out[173]: Survived       False  True
          Pclass Sex
          1      female      3    91
                 male       77    45
          2      female      6    70
                 male       91    17
          3      female     72    72
                 male      300    47
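Raw counts are hard to compare across groups of different sizes, so it can help to turn each row into percentages. The following is a minimal sketch using the normalize='index' option of crosstab on made-up data; toy_df and rates are hypothetical names, and the numbers are not the Titanic figures.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3, 3, 3],
    'Sex': ['female', 'female', 'male', 'female', 'female',
            'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1],
})

# normalize='index' divides every row by its own total,
# giving the share of each outcome within a (class, gender) group
rates = pd.crosstab([toy_df.Pclass, toy_df.Sex], toy_df.Survived,
                    normalize='index') * 100
# Columns come out sorted as 0, 1; give them readable names
rates.columns = ['Perished', 'Survived']
print(rates.round(1))
```

On the real data, the 'Survived' column of such a table would give the survival rate for each combination of passenger class and gender.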

Let's now display this data using matplotlib. First, let's do some re-labeling for display purposes:

In [183]: survival_counts.index=survival_counts.index.set_levels([['1st', '2nd', '3rd'], ['Women', 'Men']])
In [184]: survival_counts.columns=['Perished','Survived']
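Note that set_levels assigns the new labels by position, so the replacement lists must follow the sorted order of the existing levels (female before male). A variant that spells out each mapping explicitly is sketched below using rename with the level argument; this is an alternative shown for illustration on made-up data, and toy_df, counts, and relabeled are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['female', 'male', 'male', 'female', 'male', 'male'],
    'Survived': [1, 0, 1, 0, 0, 1],
})

counts = pd.crosstab([toy_df.Pclass, toy_df.Sex], toy_df.Survived)

# Each old label is mapped to its new label by name, not by position
relabeled = (
    counts
    .rename(index={1: '1st', 2: '2nd', 3: '3rd'}, level='Pclass')
    .rename(index={'female': 'Women', 'male': 'Men'}, level='Sex')
    .rename(columns={0: 'Perished', 1: 'Survived'})
)
print(relabeled)
```

The positional form is shorter, while the explicit mapping is less likely to mislabel a group if the set or order of categories changes.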

Now, we plot the passenger data by using the plot function of a pandas DataFrame:

In [185]: fig = plt.figure()
          ax = fig.add_subplot(111)
          ax.set_xlabel('Count')
          ax.set_title('Survivor Count by Passenger class, Gender')
          survival_counts.plot(kind='barh',ax=ax,width=0.75,
                               color=['red','black'], xlim=(0,400))
Out[185]: <matplotlib.axes._subplots.AxesSubplot at 0x7f714b187e90>
