Aggregating the data to calculate summary statistics 

To aggregate values over some grouping, pandas has the groupby operationone of the library's killer features. This function creates a GroupBy object, which can behave as an iterable of (name, group) tuples, or similar to a dataframe, you can select one or many columns the same way you'd do for a dataframe.

Most importantly, those objects have two special methods:

  • agg, which will perform the given aggregation function (say, calculate averages) for each group, and return them as a dataframe with one row per each group.
  •  transform does all of the same—except that it will return the corresponding group's aggregate values for each row in the original dataframe.

The great part of both of those functions is their flexibility—they both accept a handful of options as their arguments, including the following:

  • Strings with operation names (count, median, and more).
  • Custom functions.
  • A dictionary with specific operations (either strings or functions) for specific column names; we can even pass a dictionary with a list of multiple operations for one column.

For example, let's see how the casualties statistics change for each operation, which we store in the parent column:

aggregate = data[~mask].groupby('parent').agg({'axis killed': ['sum', 'median', 'count'],
'allies killed': ['sum', 'median'],
'killed total': ['sum', 'median']}).astype(int)

Here are the first three rows of the outcome:

Axis killed Allies killed Killed total
Parent Sum Median Count Sum Median Sum Median
Axis invasion of the Soviet Union 440,560 20,364 12 2,811,366 103,166 3,251,926 111,681
Battle for Narva Bridgehead 200 200 1 3,000 3,000 3,200 3,200
Battle of Berlin 60,000 60,000 1 20,000 20,000 80,000 80,000

Note how the result has multilevel columns, the first level being the original column names and the second, the specific operations we performed. In many ways, this multilevel index is useful but can also make things more complex. In particular, now we can't select a column by nameinstead, we need to use so-called pandas.IndexSlice. For example, to get a column, 'axis killed', 'sum', we need to use this code:

>>> idx = pd.IndexSlice
>>> aggr[idx['axis killed', 'sum']].head(3)

parent
Axis invasion of the Soviet Union 440560
Battle for Narva Bridgehead 200
Battle of Berlin 60000
Name: (axis killed, sum), dtype: int6

The nice part is that IndexSlice, similar to normal pandas slicing, supports semicolons and non-specified levels. For example, we can pull 'sum' values for both sides like this (for some reason, it requires a loc method):

>>> aggr.loc[:, idx[:, 'sum']].head(3)
axis killed allies killed killed total
sum sum sum
parent
Axis invasion of the Soviet Union 440560 2811366 3251926
Battle for Narva Bridgehead 200 3000 3200
Battle of Berlin 60000 20000 80000

Now, in continuation of our visualization spree, let's plot all given operations as a scatterplot, using casualties for both sides as x and y coordinates. For that, pandas has a dedicated interface, as well. Consider the following example:

idx = pd.IndexSlice

aggr.plot(kind='scatter',
x=idx['allies killed', 'sum'],
y=idx['axis killed', 'sum'],
figsize=(7,7),
title='Deaths on both sides')

plt.axis('equal');
plt.tight_layout();

Here, all we need is to execute the plot command, specifying the kind of a plot, the columns to be used for the x and y coordinates, and a few other parameters. Here, we also have to use IndexSlice to specify columns. The plt.axis('equal') method ensures that the x and y coordinates preserve the same scale for comparison. The following is the resulting image, showing the trend of axis/allies casualties by battle. Note that, in this case, we didn't have to specify axis labels—they were generated automatically from the column names:

The scatterplot we made definitely tells a lot, but we have to guess which battle is represented by which point. Don't worrywe'll get to interactive charts very soon. 

Let's now finish with aggregation methods by covering another technique: time-based resampling.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset