Pivots and reshaping data

This section deals with how you can reshape data. Sometimes, data is stored in what is known as a stacked format. Here is an example of stacked data using the PlantGrowth dataset:

    In [344]: plantGrowthRawDF=pd.read_csv('./PlantGrowth.csv')
           plantGrowthRawDF
    Out[344]:   observation   weight  group
            0   1      4.17  ctrl
            1   2      5.58  ctrl
            2   3      5.18  ctrl
            ...
            10       1        4.81    trt1
            11       2        4.17    trt1
            12       3        4.41    trt1
           ... 
            20       1        6.31    trt2
            21       2        5.12    trt2
            22       3        5.54    trt2

This data consists of results from an experiment that compared the dried weight yields of plants that were obtained under a control (ctrl) and two different treatment conditions (trt1 and trt2). Suppose we wanted to do some analysis on this data by group value. One way to do this would be to use a logical filter on the DataFrame:

    In [346]: plantGrowthRawDF[plantGrowthRawDF['group']=='ctrl']
    Out[346]:   observation   weight  group
            0     1      4.17   ctrl
            1   2      5.58   ctrl
            2   3      5.18   ctrl
            3   4      6.11   ctrl
           ...

This can be tedious, so we would instead like to pivot/unstack this data and display it in a form that is more conducive to analysis. We can do this using the DataFrame.pivot function as follows:

    In [345]: plantGrowthRawDF.pivot(index='observation',columns='group',values='weight')
    Out[345]: weight    
              group   ctrl  trt1  trt2
          observation
            1   4.17   4.81   6.31
            2   5.58   4.17   5.12
            3   5.18   4.41   5.54
            4   6.11   3.59   5.50
            5   4.50   5.87   5.37
            6   4.61   3.83   5.29
            7   5.17   6.03   4.92
            8    4.53   4.89   6.15
            9   5.33   4.32   5.80
           10   5.14   4.69   5.26

Here, a DataFrame is created with columns corresponding to the different values of a group, or, in statistical parlance, levels of the factor.

Some more examples of pivoting on salesdata.csv are as follows:

datastr=pd.read_csv('salesdata.csv')
table=pd.pivot_table(datastr,index=['Customer Segment'])# the aggregate values are average by default

The following will be the output. This gives the results for all the columns:

If we specify a columns parameter with a variable name, all the categories in that variable become separate columns:

table2=pd.pivot_table(datastr,values='Sales',index=['Customer Segment'],columns=['Region'])

For example, the output of the preceding code would be as shown following:

Multi-indexed pivots are also possible, as shown:

table4=pd.pivot_table(datastr,values='Sales',index=['Customer Segment','Ship Mode'],columns=['Region'])

The following will be the output:

A different aggregate function, other than default average, or a custom function can be applied for aggregation as shown in the example following:

table5=pd.pivot_table(datastr,values='Sales',index=['Customer Segment','Ship Mode'],columns=['Region'],aggfunc=sum)

The following will be the output:

Some more important tips and tricks to keep in mind while using pivot_tables are listed following:

If you expect missing values in your pivot table, then use fill.values=0:

table4=pd.pivot_table(datastr,values='Sales',index=['Customer Segment','Ship Mode'],columns=['Region'],fill_values=0)

If you want totals at the end, use margins=TRUE:

table4=pd.pivot_table(datastr,values='Sales',index=['Customer Segment','Ship Mode'],columns=['Region'],fill_values=0,margins=TRUE)

You can pass different aggregate functions to different value columns:

table6=pd.pivot_table(datastr,values=['Sales','Unit Price'],index=['Customer Segment','Ship Mode'],columns=['Region'],aggfunc={"Sales":sum,"Unit Price":len})

Table of Contents for Pivots and reshaping data

Create new playlist

Sign In

Sign Up

Table of Contents for
Pivots and reshaping data