One issue that we have to deal with in datasets for machine learning is how to handle missing values in the training set.
Let's visually identify where we have missing values in our feature set.
For that, we can make use of an equivalent of the missmap function in R, written by Tom Augspurger. The next screenshot shows how much data is missing for the various features in an intuitively appealing manner:
For more information and the code used to generate this plot, see the following: http://tomaugspurger.github.io/blog/2014/02/22/Visualizing%20Missing%20Data/.
We can also calculate how much data is missing for each of the features:
In [83]: missing_perc=train_df.apply(lambda x: 100*(1-x.count()/(1.0*len(x))))
In [85]: sorted_missing_perc=missing_perc.sort_values(ascending=False)
         sorted_missing_perc
Out[85]: Cabin          77.104377
         Age            19.865320
         Embarked        0.224467
         Fare            0.000000
         Ticket          0.000000
         Parch           0.000000
         SibSp           0.000000
         Sex             0.000000
         Name            0.000000
         Pclass          0.000000
         Survived        0.000000
         PassengerId     0.000000
         dtype: float64
Hence, we can see that most of the Cabin data is missing (77%), while around 20% of the Age data is missing. We then decide to drop the Cabin data from our learning feature set as the data is too sparse to be of much use.
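As a minimal sketch of the same idea, the missing percentage can also be computed with `isnull().mean()`, and sparse columns can be dropped against a cutoff. The toy DataFrame and the 50% threshold here are assumptions made for illustration; they are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Percentage of missing entries per column, largest first.
missing_perc = (100 * toy_df.isnull().mean()).sort_values(ascending=False)

# Hypothetical cutoff: drop any column with more than half of its values missing.
threshold = 50.0
sparse_cols = missing_perc[missing_perc > threshold].index.tolist()
reduced_df = toy_df.drop(columns=sparse_cols)

print(missing_perc)
print(list(reduced_df.columns))
```

With a cutoff like this, a column as sparse as Cabin would be removed, while Age would be kept for later imputation.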
Let's do a further breakdown of the various features that we would like to examine. In the case of categorical/discrete features, we use bar plots; for continuous valued features, we use histograms. The code to generate the charts is as shown:
In [137]: import numpy as np
          import matplotlib.pyplot as plt

          bar_width=0.1
          categories_map={'Pclass':{'First':1,'Second':2, 'Third':3},
                          'Sex':{'Female':'female','Male':'male'},
                          'Survived':{'Perished':0,'Survived':1},
                          'Embarked':{'Cherbourg':'C','Queenstown':'Q','Southampton':'S'},
                          'SibSp': {str(x):x for x in [0,1,2,3,4,5,8]},
                          'Parch': {str(x):x for x in range(7)}
                         }
          colors=['red','green','blue','yellow','magenta','orange']
          subplots=[111,211,311,411,511,611,711,811]
          cIdx=0
          fig,ax=plt.subplots(len(subplots),figsize=(10,12))

          keyorder = ['Survived','Sex','Pclass','Embarked','SibSp','Parch']

          # One bar plot per categorical/discrete feature
          for category_key,category_items in sorted(categories_map.items(),
                                                    key=lambda i:keyorder.index(i[0])):
              num_bars=len(category_items)
              index=np.arange(num_bars)
              idx=0
              for cat_name,cat_val in sorted(category_items.items()):
                  ax[cIdx].bar(idx,len(train_df[train_df[category_key]==cat_val]),
                               label=cat_name,
                               color=np.random.rand(3))  # random RGB triple
                  idx+=1
              ax[cIdx].set_title('%s Breakdown' % category_key)
              xlabels=sorted(category_items.keys())
              ax[cIdx].set_xticks(index+bar_width)
              ax[cIdx].set_xticklabels(xlabels)
              ax[cIdx].set_ylabel('Count')
              cIdx +=1
          fig.subplots_adjust(hspace=0.8)

          # One histogram per continuous feature, ignoring missing values
          for hcat in ['Age','Fare']:
              ax[cIdx].hist(train_df[hcat].dropna(),color=np.random.rand(3))
              ax[cIdx].set_title('%s Breakdown' % hcat)
              #ax[cIdx].set_xlabel(hcat)
              ax[cIdx].set_ylabel('Frequency')
              cIdx +=1
          fig.subplots_adjust(hspace=0.8)
          plt.show()
Take a look at the following output:
From the data and illustration in the preceding screenshot, we can observe the following:
- Roughly 1.6 times as many passengers perished as survived (62% versus 38%).
- There were about twice as many male passengers as female passengers (65% versus 35%).
- There were about 20% more passengers in the third class than in the first and second classes combined (55% versus 45%).
- Most passengers were solo; that is, had no children, parents, siblings, or spouse on board.
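Percentages like these can be read directly from the data rather than from the charts. The following is a small sketch using `value_counts(normalize=True)`; the five-row DataFrame is a made-up stand-in for `train_df`, so its numbers do not match the figures quoted above.

```python
import pandas as pd

# Made-up stand-in for train_df with only two of its columns.
toy_df = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1],
    "Sex": ["male", "male", "male", "female", "female"],
})

# Share of each value as a percentage of all rows.
survived_share = toy_df["Survived"].value_counts(normalize=True) * 100
sex_share = toy_df["Sex"].value_counts(normalize=True) * 100

print(survived_share)
print(sex_share)
```

Applying the same two calls to `train_df` should give the shares that the bar plots display as counts.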
These observations might lead us to dig deeper and investigate whether there is some correlation between the chances of survival and gender and fare class, particularly if we take into account the fact that the Titanic had a women-and-children-first policy and the fact that the Titanic was carrying fewer lifeboats (20) than it was designed to (32).
In light of this, let's further examine the relationships between survival and some of these features. We'll start with gender:
In [85]: from collections import OrderedDict
         num_passengers=len(train_df)
         num_men=len(train_df[train_df['Sex']=='male'])
         men_survived=train_df[(train_df['Survived']==1 ) & (train_df['Sex']=='male')]
         num_men_survived=len(men_survived)
         num_men_perished=num_men-num_men_survived
         num_women=num_passengers-num_men
         women_survived=train_df[(train_df['Survived']==1) & (train_df['Sex']=='female')]
         num_women_survived=len(women_survived)
         num_women_perished=num_women-num_women_survived
         gender_survival_dict=OrderedDict()
         gender_survival_dict['Survived']={'Men':num_men_survived,
                                           'Women':num_women_survived}
         gender_survival_dict['Perished']={'Men':num_men_perished,
                                           'Women':num_women_perished}
         gender_survival_dict['Survival Rate']= {'Men' : round(100.0*num_men_survived/num_men,2),
                                                 'Women':round(100.0*num_women_survived/num_women,2)}
         pd.DataFrame(gender_survival_dict)
Out[85]:
Take a look at the following table:
| Gender | Survived | Perished | Survival Rate (%) |
|--------|----------|----------|-------------------|
| Men    | 109      | 468      | 18.89             |
| Women  | 233      | 81       | 74.2              |
We'll now illustrate this data in a bar chart:
In [76]: #code to display survival by gender
         fig = plt.figure()
         ax = fig.add_subplot(111)
         perished_data=[num_men_perished, num_women_perished]
         survived_data=[num_men_survived, num_women_survived]
         N=2
         ind = np.arange(N)  # the y locations for the groups
         width = 0.35        # the thickness of each horizontal bar
         survived_rects = ax.barh(ind, survived_data, width,color='green')
         perished_rects = ax.barh(ind+width, perished_data, width,color='red')
         ax.set_xlabel('Count')
         ax.set_title('Count of Survival by Gender')
         yTickMarks = ['Men','Women']
         ax.set_yticks(ind+width)
         ytickNames = ax.set_yticklabels(yTickMarks)
         plt.setp(ytickNames, rotation=45, fontsize=10)
         ## add a legend
         ax.legend((survived_rects[0], perished_rects[0]),
                   ('Survived', 'Perished') )
         plt.show()
The preceding code produces the following bar chart:
From the preceding diagram, we can see that the majority of the women survived (74%), while most of the men perished (only 19% survived).
This leads us to the conclusion that the gender of the passenger may be a contributing factor to whether a passenger survived or not.
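The same summary can be produced more compactly with `groupby`. This is a sketch rather than the approach used in this section: because `train_df` is not reproduced here, it rebuilds just the Sex and Survived columns from the counts in the preceding table, and the names `counts`, `df`, and `summary` are illustrative.

```python
import pandas as pd

# Rebuild only the Sex and Survived columns from the counts in the table:
# 109 men and 233 women survived; 468 men and 81 women perished.
counts = {("male", 1): 109, ("male", 0): 468,
          ("female", 1): 233, ("female", 0): 81}
rows = [(sex, survived)
        for (sex, survived), n in counts.items()
        for _ in range(n)]
df = pd.DataFrame(rows, columns=["Sex", "Survived"])

# Survived is 0/1, so sum gives the survivors and mean gives the survival rate.
summary = df.groupby("Sex")["Survived"].agg(["sum", "count", "mean"])
summary.columns = ["Survived", "Total", "Survival Rate"]
summary["Perished"] = summary["Total"] - summary["Survived"]
summary["Survival Rate"] = (100 * summary["Survival Rate"]).round(2)

print(summary[["Survived", "Perished", "Survival Rate"]])
```

Because Survived is encoded as 0 and 1, its mean within each group is the survival rate, which avoids counting each subgroup by hand.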
Next, let's look at passenger class. First, we generate the survived and perished data for each of the three passenger classes, as well as survival rates:
In [86]: from collections import OrderedDict
         num_passengers=len(train_df)
         num_class1=len(train_df[train_df['Pclass']==1])
         class1_survived=train_df[(train_df['Survived']==1 ) & (train_df['Pclass']==1)]
         num_class1_survived=len(class1_survived)
         num_class1_perished=num_class1-num_class1_survived
         num_class2=len(train_df[train_df['Pclass']==2])
         class2_survived=train_df[(train_df['Survived']==1) & (train_df['Pclass']==2)]
         num_class2_survived=len(class2_survived)
         num_class2_perished=num_class2-num_class2_survived
         num_class3=num_passengers-num_class1-num_class2
         class3_survived=train_df[(train_df['Survived']==1 ) & (train_df['Pclass']==3)]
         num_class3_survived=len(class3_survived)
         num_class3_perished=num_class3-num_class3_survived
         pclass_survival_dict=OrderedDict()
         pclass_survival_dict['Survived']={'1st Class':num_class1_survived,
                                           '2nd Class':num_class2_survived,
                                           '3rd Class':num_class3_survived}
         pclass_survival_dict['Perished']={'1st Class':num_class1_perished,
                                           '2nd Class':num_class2_perished,
                                           '3rd Class':num_class3_perished}
         pclass_survival_dict['Survival Rate']= {'1st Class' : round(100.0*num_class1_survived/num_class1,2),
                                                 '2nd Class':round(100.0*num_class2_survived/num_class2,2),
                                                 '3rd Class':round(100.0*num_class3_survived/num_class3,2),}
         pd.DataFrame(pclass_survival_dict)
Out[86]:
Then, we show them in a table:
| Class        | Survived | Perished | Survival Rate (%) |
|--------------|----------|----------|-------------------|
| First Class  | 136      | 80       | 62.96             |
| Second Class | 87       | 97       | 47.28             |
| Third Class  | 119      | 372      | 24.24             |
We can then plot the data by using matplotlib in a similar manner to that for the survivor count by gender described earlier:
In [186]: fig = plt.figure()
          ax = fig.add_subplot(111)
          perished_data=[num_class1_perished, num_class2_perished, num_class3_perished]
          survived_data=[num_class1_survived, num_class2_survived, num_class3_survived]
          N=3
          ind = np.arange(N)  # the y locations for the groups
          width = 0.35        # the thickness of each horizontal bar
          survived_rects = ax.barh(ind, survived_data, width,color='blue')
          perished_rects = ax.barh(ind+width, perished_data, width,color='red')
          ax.set_xlabel('Count')
          ax.set_title('Survivor Count by Passenger class')
          yTickMarks = ['1st Class','2nd Class', '3rd Class']
          ax.set_yticks(ind+width)
          ytickNames = ax.set_yticklabels(yTickMarks)
          plt.setp(ytickNames, rotation=45, fontsize=10)
          ## add a legend
          ax.legend( (survived_rects[0], perished_rects[0]),
                     ('Survived', 'Perished'), loc=10 )
          plt.show()
This produces the following bar plot:
It seems clear from the preceding data and diagram that the higher the passenger fare class, the greater the passenger's chances of survival.
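The per-class bookkeeping repeats the same steps for each class, so it could be folded into a small helper. The function `survival_table` below is hypothetical, not part of pandas or of the original code; it assumes a 0/1 Survived column and is exercised on data rebuilt from the per-class counts in the preceding table.

```python
import pandas as pd

def survival_table(df, feature):
    """Hypothetical helper: counts and survival rate per value of `feature`.

    Assumes df has a 'Survived' column coded as 0 (perished) and 1 (survived).
    """
    counts = pd.crosstab(df[feature], df["Survived"])
    table = pd.DataFrame({"Survived": counts[1], "Perished": counts[0]})
    total = table["Survived"] + table["Perished"]
    table["Survival Rate"] = (100.0 * table["Survived"] / total).round(2)
    return table

# Rebuild Pclass/Survived from the per-class counts: class -> (survived, perished).
class_counts = {1: (136, 80), 2: (87, 97), 3: (119, 372)}
rows = []
for pclass, (n_survived, n_perished) in class_counts.items():
    rows += [(pclass, 1)] * n_survived + [(pclass, 0)] * n_perished
toy_df = pd.DataFrame(rows, columns=["Pclass", "Survived"])

pclass_table = survival_table(toy_df, "Pclass")
print(pclass_table)
```

The same helper could be called with `'Sex'` or `'Embarked'` on the full training set, which keeps the counting logic in one place.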
Given that both gender and fare class seem to influence the chances of a passenger's survival, let's see what happens when we combine these two features and plot a combination of both. For this, we will use the crosstab function in pandas:
In [173]: survival_counts=pd.crosstab([train_df.Pclass,train_df.Sex],
                                      train_df.Survived.astype(bool))
          survival_counts
Out[173]: Survived       False  True
          Pclass Sex
          1      female      3    91
                 male       77    45
          2      female      6    70
                 male       91    17
          3      female     72    72
                 male      300    47
Let's now display this data using matplotlib. First, let's do some re-labeling for display purposes:
In [183]: survival_counts.index=survival_counts.index.set_levels([['1st', '2nd', '3rd'],
                                                                   ['Women', 'Men']])
In [184]: survival_counts.columns=['Perished','Survived']
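Note that `set_levels` replaces the level values by position, so the new labels must be listed in the order of the existing levels (here `female` before `male`, hence `'Women'` before `'Men'`). The following sketch, on a made-up frame named `toy_counts`, compares this with an explicit mapping through `rename`, which does not depend on the order:

```python
import pandas as pd

# Made-up frame with the same two-level index layout as survival_counts.
index = pd.MultiIndex.from_product(
    [[1, 2, 3], ["female", "male"]], names=["Pclass", "Sex"]
)
toy_counts = pd.DataFrame({"n": range(6)}, index=index)

# Positional relabeling: new labels must follow the existing level order.
by_position = toy_counts.copy()
by_position.index = by_position.index.set_levels(
    [["1st", "2nd", "3rd"], ["Women", "Men"]]
)

# Explicit mapping: each old label names its replacement.
by_mapping = toy_counts.rename(index={1: "1st", 2: "2nd", 3: "3rd"}, level="Pclass")
by_mapping = by_mapping.rename(index={"female": "Women", "male": "Men"}, level="Sex")

print(list(by_position.index) == list(by_mapping.index))
```

The mapping form is slightly longer, but it cannot silently swap the labels if the level order differs from what was expected.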
Now, we plot the passenger data by using the plot function of a pandas DataFrame:
In [185]: fig = plt.figure()
          ax = fig.add_subplot(111)
          ax.set_xlabel('Count')
          ax.set_title('Survivor Count by Passenger class, Gender')
          survival_counts.plot(kind='barh',ax=ax,width=0.75,
                               color=['red','black'], xlim=(0,400))
Out[185]: <matplotlib.axes._subplots.AxesSubplot at 0x7f714b187e90>
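Raw counts can make groups of different sizes hard to compare, so it may also help to look at rates. As a sketch, `crosstab` accepts `normalize='index'` to turn each row into proportions. The data below is rebuilt from the counts shown in `Out[173]`, and the variable names are illustrative.

```python
import pandas as pd

# Counts copied from Out[173]: (Pclass, Sex) -> (perished, survived).
counts = {
    (1, "female"): (3, 91), (1, "male"): (77, 45),
    (2, "female"): (6, 70), (2, "male"): (91, 17),
    (3, "female"): (72, 72), (3, "male"): (300, 47),
}
rows = []
for (pclass, sex), (n_perished, n_survived) in counts.items():
    rows += [(pclass, sex, 0)] * n_perished + [(pclass, sex, 1)] * n_survived
toy_df = pd.DataFrame(rows, columns=["Pclass", "Sex", "Survived"])

# normalize='index' divides each row by its total, giving per-group proportions.
survival_rates = pd.crosstab(
    [toy_df.Pclass, toy_df.Sex], toy_df.Survived, normalize="index"
) * 100
survival_rates.columns = ["Perished", "Survived"]

print(survival_rates.round(2))
```

Each row then sums to 100, so the Survived column can be read as the survival rate for that combination of class and gender.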