Time

Another column is time. Now, pandas has a built-in DateTime parser and a very good one! Just use pd.to_datetime() on your scalar value or a collection. In this case, however, it won't work, and neither will any external packages that usually help (dateparser is our favorite). And all that because cells describe a time range, and not just one specific date.

Again, let's (at least, for now) see whether we can make our life simpler. Indeed, we probably don't care about specific dates—all we need is the month and year. Luckily, all months are properly stated and uniform—and pd.to_datetime can parse them. So, all we need is to correctly extract two month-year pairs from each.

Now, it seems hard to define one regular expression that will work here. Instead, we can try to get all years (we know all of them are four-digit numbers, starting with 19) and all months (there are just 12 variants). Then, we can combine them, using the year twice if there is only one value.

Let's try it out! First, we define the patterns:

d = ('January', 'February', 'March', 'April', 'May', 
'June', "July",' August', 'September', 'October', 'November', 'December')

month_pattern = r'(' + "|".join(d) + ')'
year_pattern = r'(19dd)'

Now, instead of str.extract, we will use the str.extractall method—it will try to retrieve ALL occurrences of the pattern in the string. As a result, it will create multiindex—an index with multiple levels. In this case, the first level will be the original one, taken from the argument. The second one will represent the number of occurrences within the string. Here, we should use the .unstack() function, which will rotate Series into DataFrame, so that the first level will be its index, and the second its columns.

As you may have guessed, there is an opposite function, stack(), which converts a dataframe into a series with a multilevel index.

In the following code, we run a regex to extract two values—the start and end of the column:

year_extracted = battles['Date'].str.extractall(year_pattern).unstack()

Notice that there are four, not two, columns here. Those are empty for most of the columns, but its mere existence means that there is at least one row where this last column is not empty. In the following code block, we mask our dataframe to show only records where the last column is not null:

>>> year_extracted[year_extracted.iloc[:, -1].notnull()]
0
match 0 1 2 3
94 1943 1943 1943 1943

It seems that only one record has a value in there. Let's take a look at the corresponding raw value:

>>> battles.loc[94, 'Date']
'3 November 1943 – 13 November 1943(Offensive operation) 13 November 1943 – 22 December 1943(Defensive operation)'

The corresponding record indeed has four-year values, but all of them are the same. Another row has three values, but again, all of them are the same—so there is no harm in dropping all but the first two columns:

year_extracted = year_extracted.iloc[:, :2]

We can also fill empty cells of the second row with values from the first one, using the fillna() function. This function can fill empty cells in a series with a given scalar value, or corresponding values from another series of the same length (our case), or even from the series itself, using one of a few methods (for example, using the value in the previous cell). The following code does precisely that in that it fills the empty second column with the corresponding values from the first one:

year_extracted.iloc[:, 1].fillna(year_extracted.iloc[:, 0], inplace=True)

Now, let's do the same with Months except that this time, we'll use the fillna from left to right, and use the first and the last columns (as we require the beginning and end of the event):

month_extracted = battles['Date'].str.extractall(month_pattern).unstack()

for i in range(2, month_extracted.shape[1]+1):
month_extracted.iloc[:, -1].fillna(month_extracted.iloc[:, -i], inplace=True)

month_extracted = month_extracted.iloc[:, [0, -1]]

Finally, we need to combine the two. Let's rename the columns so that we can use .loc, and then just loop over them:

year_extracted.columns = month_extracted.columns = ['start', 'end']
I = battles.index

for col in 'start', 'end':
combined = month_extracted.loc[I, col] + ' ' + year_extracted.loc[I, i]
battles[col] = pd.to_datetime(combined)

Yay! We're done with our second column – time. It wasn't easy, but we were able to convert the text into datetime values so that we can analyze them in the future. Next in line is belligerents.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset