Belligerents

Lastly, as we noticed, the axis and allies parties are swapped in some rows. This two-sided model is slightly confusing for this specific dataset: for example, we would have to mark the Soviets as axis when they attacked Poland during the initial stages of the war. Let's take a look at all the possible combinations:

battles['Belligerents.allies'].value_counts()

Here, value_counts() counts the number of occurrences of each value, so the index of the resulting Series holds the unique values. There is a more intuitive alternative, the unique() method (which is also faster). However, it returns a NumPy array, which Jupyter prints badly; that's the only reason we prefer value_counts() here.
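To make the difference concrete, here is a minimal sketch on a small made-up Series. The values are illustrative assumptions, not rows from the battles dataset, and the column is forced to object dtype so that unique() returns a plain NumPy array:

```python
import pandas as pd

# Hypothetical toy data standing in for battles['Belligerents.allies']
allies = pd.Series(
    ['Poland', 'Soviet Union', 'Poland', None, 'Germany'],
    dtype=object,
)

# Series: unique values in the index, number of occurrences as values
counts = allies.value_counts()

# NumPy array of unique values, in order of first appearance
uniques = allies.unique()

print(counts)
print(uniques)
```

Note that value_counts() skips missing values by default (pass dropna=False to include them), while unique() keeps them in the result.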

From the examination, we can observe that all the incorrect values contain at least one of 'Germany', 'Italy', or 'Estonian conscripts'. We can use these to run our swap operation:

words = ['Germany', 'Italy', 'Estonian conscripts']
# One combined mask, so a row matching several words is swapped only once
mask = battles['Belligerents.allies'].fillna('').str.contains('|'.join(words))

# Swap the two columns for the masked rows
axis_party = battles.loc[mask, 'Belligerents.allies'].copy()
battles.loc[mask, 'Belligerents.allies'] = battles.loc[mask, 'Belligerents.axis']
battles.loc[mask, 'Belligerents.axis'] = axis_party

Note that we had to use fillna(): for missing values, str.contains() returns NaN rather than True or False, and pandas cannot select rows with a mask that contains NaN.
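The effect of fillna() on the mask can be checked on a small made-up frame. The rows below are illustrative assumptions rather than data from the battles dataset, and the sketch uses a single combined pattern so that a row matching two words is swapped only once:

```python
import pandas as pd

# Hypothetical toy rows; column names follow the text, values are made up.
# Object dtype, so missing values propagate as NaN through string methods.
battles = pd.DataFrame({
    'Belligerents.allies': ['Germany, Italy', 'Poland', None],
    'Belligerents.axis': ['United Kingdom', 'Germany', 'Soviet Union'],
}, dtype=object)

words = ['Germany', 'Italy', 'Estonian conscripts']
pattern = '|'.join(words)  # regex alternation: matches any of the words

# Without fillna(), the missing value yields a missing entry in the mask
raw = battles['Belligerents.allies'].str.contains(pattern)
print(raw.isna().tolist())   # [False, False, True]

# With fillna(''), every entry is a proper boolean
mask = battles['Belligerents.allies'].fillna('').str.contains(pattern)
print(mask.tolist())         # [True, False, False]

# Swap the two columns for the masked rows only
axis_party = battles.loc[mask, 'Belligerents.allies'].copy()
battles.loc[mask, 'Belligerents.allies'] = battles.loc[mask, 'Belligerents.axis']
battles.loc[mask, 'Belligerents.axis'] = axis_party
print(battles)
```

An alternative with the same effect is str.contains(pattern, na=False), which treats missing values as non-matches without filling them first.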

OK, that was relatively easy. We've now reached the final column to parse: casualties. This is the most complex task we've tackled so far in this chapter!
