The essential basic functionality

pandas provides many essential functions for manipulating its data structures. In this module, we focus on the most important features for data exploration and analysis.

Reindexing and altering labels

reindex is a key method of the pandas data structures. It conforms new or modified data to a given set of labels along a particular axis of a pandas object.

First, let's view a reindex example on a Series object:

>>> s2.reindex([0, 2, 'b', 3])
0    0.6913
2    0.8627
b    NaN
3    0.7286
dtype: float64

When reindexed labels do not exist in the data object, a default value of NaN will be automatically assigned to the position; this holds true for the DataFrame case as well:

>>> df1.reindex(index=[0, 2, 'b', 3],
        columns=['Density', 'Year', 'Median_Age','C'])
   Density  Year  Median_Age        C
0      244  2000        24.2      NaN
2      268  2010        28.5      NaN
b      NaN   NaN         NaN      NaN
3      279  2014        30.3      NaN

We can change the NaN value in the missing index case to a custom value by setting the fill_value parameter. Let us take a look at the arguments that the reindex function supports, as shown in the following table:

Argument      Description
index         The new labels/index to conform to.
method        The method used to fill holes in the reindexed object. By default, gaps are left unfilled.
              pad/ffill: fill values forward
              backfill/bfill: fill values backward
              nearest: use the nearest value to fill the gap
copy          Returns a new object. The default setting is True.
level         Matches index values on the passed MultiIndex level.
fill_value    The value to use for missing values. The default setting is NaN.
limit         The maximum size of the gap to fill when using the forward or backward method.
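As a minimal sketch of the fill_value and method arguments, the following uses a small hypothetical Series (the name s and its values are assumptions for illustration, not objects from the examples above):

```python
import pandas as pd

# Hypothetical Series with a sorted integer index (not from the text above).
s = pd.Series([10.0, 20.0, 30.0], index=[0, 2, 4])

# Labels 1 and 3 are missing; fill_value replaces the default NaN.
filled = s.reindex([0, 1, 2, 3, 4], fill_value=0.0)

# method='ffill' carries the last valid value forward instead.
# It requires a monotonically increasing or decreasing index.
forward = s.reindex([0, 1, 2, 3, 4], method='ffill')

print(filled.tolist())   # [10.0, 0.0, 20.0, 0.0, 30.0]
print(forward.tolist())  # [10.0, 10.0, 20.0, 20.0, 30.0]
```

fill_value suits cases where a missing label has a natural default, while method is more appropriate for ordered data in which neighboring values are meaningful.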

Head and tail

In common data analysis situations, our data structure objects contain many columns and a large number of rows, so it is impractical to view all of their contents at once. pandas provides the head and tail functions to inspect a small sample. By default, they return five elements, but we can pass a custom number as well. The following example shows how to display the first five and the last three rows of a longer Series:

>>> s7 = pd.Series(np.random.rand(10000))
>>> s7.head()
0    0.631059
1    0.766085
2    0.066891
3    0.867591
4    0.339678
dtype: float64
>>> s7.tail(3)
9997    0.412178
9998    0.800711
9999    0.438344
dtype: float64

We can also use these functions for DataFrame objects in the same way.
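For instance, a short sketch on a hypothetical 100-row DataFrame (the name df and its contents are assumptions made for this illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with 100 rows and 3 columns.
df = pd.DataFrame(np.arange(300).reshape(100, 3), columns=['a', 'b', 'c'])

first_rows = df.head()   # first five rows by default
last_rows = df.tail(2)   # last two rows

print(first_rows.shape)          # (5, 3)
print(last_rows.index.tolist())  # [98, 99]
```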

Binary operations

First, we consider arithmetic operations between objects. When the objects have different indexes, the result is indexed by the union of the two indexes. We will not explain this again because the previous section showed an example (s5 + s6). This time, we show another example with a DataFrame:

>>> df5 = pd.DataFrame(np.arange(9).reshape(3,3),
                       columns=['a','b','c'])
>>> df5
   a  b  c
0  0  1  2
1  3  4  5
2  6  7  8
>>> df6 = pd.DataFrame(np.arange(8).reshape(2,4), 
                      columns=['a','b','c','d'])
>>> df6
   a  b  c  d
0  0  1  2  3
1  4  5  6  7
>>> df5 + df6
    a   b   c   d
0   0   2   4 NaN
1   7   9  11 NaN
2 NaN NaN NaN NaN

The mechanism for computing the result is the same for both kinds of data structure. One problem we need to consider is data that is missing from one of the objects. If we want to fill such gaps with a fixed value, such as 0, we can use the arithmetic methods add, sub, div, and mul, which accept a fill_value parameter:

>>> df7 = df5.add(df6, fill_value=0)
>>> df7
   a  b   c    d
0  0  2   4    3
1  7  9  11    7
2  6  7   8  NaN
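The same parameter works for the other arithmetic methods. The sketch below rebuilds df5 and df6 as defined above and multiplies them; the choice of 1 as the fill value is an assumption made here because it is the neutral element for multiplication:

```python
import numpy as np
import pandas as pd

# Same shapes and values as df5 and df6 in the text.
df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
df6 = pd.DataFrame(np.arange(8).reshape(2, 4), columns=['a', 'b', 'c', 'd'])

# A value missing on one side is treated as 1 before multiplying,
# so the value present on the other side passes through unchanged.
product = df5.mul(df6, fill_value=1)

print(product.loc[0, 'd'])           # 3.0 (only df6 has column 'd')
print(product.loc[2, 'a'])           # 6.0 (only df5 has row 2)
print(pd.isna(product.loc[2, 'd']))  # True (missing in both inputs)
```

A cell that is missing in both inputs stays NaN, which is why row 2 of column d is still empty in df7 as well.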

Next, we discuss comparison operations between data objects. The supported methods are equal (eq), not equal (ne), greater than (gt), less than (lt), less than or equal to (le), and greater than or equal to (ge). Here is an example:

>>> df5.eq(df6)
       a      b      c      d
0   True   True   True  False
1  False  False  False  False
2  False  False  False  False
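The comparison methods also accept a scalar. As a small sketch (the threshold 4 and the name mask are arbitrary choices for illustration), the boolean result can be summarized per column:

```python
import numpy as np
import pandas as pd

# Same values as df5 in the text.
df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

# Compare every element with a scalar; the result is a boolean DataFrame.
mask = df5.gt(4)

# Summing booleans counts the True values in each column.
print(mask.sum().tolist())     # [1, 1, 2]

# The method form gives the same result as the > operator.
print(mask.equals(df5 > 4))    # True
```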

Functional statistics

The statistics methods that a library supports are really important in data analysis. To gain insight into a large data object, we need summary information such as the mean, sum, or quantiles. pandas supports a large number of methods to compute them. Let's consider a simple example that calculates the column sums of df5, which is a DataFrame object:

>>> df5.sum()
a     9
b    12
c    15
dtype: int64

When we do not specify the axis along which to calculate the sum, the function works along the index axis, which is axis 0, by default:

  • Series: We do not need to specify the axis.
  • DataFrame: Columns (axis = 1) or index (axis = 0). The default setting is axis 0.
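To make the difference between the two axes concrete, here is a brief sketch that rebuilds df5 with the same values as above:

```python
import numpy as np
import pandas as pd

# Same values as df5 in the text.
df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

col_totals = df5.sum()        # axis=0: one value per column
row_totals = df5.sum(axis=1)  # axis=1: one value per row

print(col_totals.tolist())  # [9, 12, 15]
print(row_totals.tolist())  # [3, 12, 21]
```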

We also have the skipna parameter, which lets us decide whether to exclude missing data. By default, it is set to True:

>>> df7.sum(skipna=False)
a    13
b    18
c    23
d   NaN
dtype: float64

Another function that we want to consider is describe(). It conveniently summarizes most of the statistical information of a data structure, whether a Series or a DataFrame:

>>> df5.describe()
         a    b    c
count  3.0  3.0  3.0
mean   3.0  4.0  5.0
std    3.0  3.0  3.0
min    0.0  1.0  2.0
25%    1.5  2.5  3.5
50%    3.0  4.0  5.0
75%    4.5  5.5  6.5
max    6.0  7.0  8.0

We can specify which percentiles to include in the output by using the percentiles parameter; for example, consider the following:

>>> df5.describe(percentiles=[0.5, 0.8])
         a    b    c
count  3.0  3.0  3.0
mean   3.0  4.0  5.0
std    3.0  3.0  3.0
min    0.0  1.0  2.0
50%    3.0  4.0  5.0
80%    4.8  5.8  6.8
max    6.0  7.0  8.0

Here, we have a summary table for common supported statistics functions in pandas:

Function                        Description
idxmin(axis), idxmax(axis)      Compute the index labels of the minimum or maximum values.
value_counts()                  Computes the frequency of unique values.
count()                         Returns the number of non-null values in a data object.
mean(), median(), min(), max()  Return the mean, median, minimum, and maximum values along an axis of a data object.
std(), var(), sem()             Return the standard deviation, variance, and standard error of the mean.
abs()                           Returns the absolute values of a data object.
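A few of these functions can be tried on df5, rebuilt here with the same values as above. The Series s is a hypothetical example introduced only to show how missing values are counted:

```python
import numpy as np
import pandas as pd

# Same values as df5 in the text.
df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])
# Hypothetical Series with one missing value.
s = pd.Series(['x', 'y', 'x', None])

print(df5.idxmax().tolist())        # [2, 2, 2]: row label of each column's maximum
print(df5.idxmin(axis=1).tolist())  # ['a', 'a', 'a']: column label of each row's minimum
print(s.count())                    # 3: the missing value is not counted
print(s.value_counts().to_dict())   # {'x': 2, 'y': 1}
print(df5.median().tolist())        # [3.0, 4.0, 5.0]
print((df5 - 4).abs().loc[0].tolist())  # [4, 3, 2]
```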

Function application

pandas supports function application, which allows us to apply functions from other packages such as NumPy, or our own functions, to data structure objects. Here, we illustrate two examples of these cases. First, we use apply to execute std(), the standard deviation function of the NumPy package:

>>> df5.apply(np.std, axis=1)    # default: axis=0
0    0.816497
1    0.816497
2    0.816497
dtype: float64

Second, if we want to apply our own formula to a data object, we can also use the apply function by following these steps:

  1. Define the function or formula that you want to apply on a data object.
  2. Call the defined function or formula via apply. In this step, we also need to figure out the axis that we want to apply the calculation to:
    >>> f = lambda x: x.max() - x.min()    # step 1
    >>> df5.apply(f, axis=1)               # step 2
    0    2
    1    2
    2    2
    dtype: int64
    >>> def sigmoid(x):
    ...     return 1/(1 + np.exp(-x))
    ...
    >>> df5.apply(sigmoid)
              a         b         c
    0  0.500000  0.731059  0.880797
    1  0.952574  0.982014  0.993307
    2  0.997527  0.999089  0.999665
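The function passed to apply may also return a Series, in which case the results are assembled into a DataFrame. The following is a minimal sketch; the helper name min_max is hypothetical and df5 is rebuilt with the same values as above:

```python
import numpy as np
import pandas as pd

# Same values as df5 in the text.
df5 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'c'])

def min_max(col):
    # Hypothetical helper: returns two labeled statistics for one column.
    return pd.Series({'min': col.min(), 'max': col.max()})

summary = df5.apply(min_max)   # axis=0: the function receives each column

print(summary.loc['min'].tolist())  # [0, 1, 2]
print(summary.loc['max'].tolist())  # [6, 7, 8]
```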
    

Sorting

There are two kinds of sorting method that we are interested in: sorting by row or column index and sorting by data value.

First, we consider sorting by row and column index with the sort_index() function. Its axis parameter sets whether the function sorts by row or column labels. The ascending option, which takes True or False, sorts the data in ascending or descending order. The default setting for this option is True:

>>> df7 = pd.DataFrame(np.arange(12).reshape(3,4),  
                       columns=['b', 'd', 'a', 'c'],
                       index=['x', 'y', 'z'])
>>> df7
   b  d   a   c
x  0  1   2   3
y  4  5   6   7
z  8  9  10  11
>>> df7.sort_index(axis=1)
    a  b   c  d
x   2  0   3  1
y   6  4   7  5
z  10  8  11  9
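As a short sketch of the ascending option, the following rebuilds df7 as defined above and sorts both axes in descending order (the names by_rows and by_cols are chosen here for illustration):

```python
import numpy as np
import pandas as pd

# Same values and labels as df7 in the text.
df7 = pd.DataFrame(np.arange(12).reshape(3, 4),
                   columns=['b', 'd', 'a', 'c'],
                   index=['x', 'y', 'z'])

# Sort the row labels in descending order (axis=0 is the default).
by_rows = df7.sort_index(ascending=False)

# Sort the column labels in descending order.
by_cols = df7.sort_index(axis=1, ascending=False)

print(by_rows.index.tolist())    # ['z', 'y', 'x']
print(by_cols.columns.tolist())  # ['d', 'c', 'b', 'a']
```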

Series has a sort_values method (named order in pandas versions before 0.17) that sorts by value. For NaN values in the object, we can also apply special treatment via the na_position option:

>>> s4.sort_values(na_position='first')
024     NaN
065     NaN
002    Mary
001     Nam
dtype: object
>>> s4
002    Mary
001     Nam
024     NaN
065     NaN
dtype: object

Besides that, sort_values can sort the data in place when inplace=True is passed (older pandas versions offered Series.sort() for this). In that case, the function does not return a copy of the sorted data:

>>> s4.sort_values(na_position='first', inplace=True)
>>> s4
024     NaN
065     NaN
002    Mary
001     Nam
dtype: object

If we want to sort a DataFrame object by value, we need to specify which columns or rows to sort by:

>>> df7.sort_values(by=['b', 'd'], ascending=False)
   b  d   a   c
z  8  9  10  11
y  4  5   6   7
x  0  1   2   3

The inplace parameter controls whether the sorting result is saved to the current data object: with inplace=False, a new sorted object is returned and the original is left unchanged, while inplace=True modifies the object itself.
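In current pandas releases, sorting by value is done with sort_values. The sketch below assumes a recent pandas version and uses a hypothetical DataFrame with a tie and a missing value to show multi-column sorting, na_position, and inplace together:

```python
import pandas as pd

# Hypothetical data with a tie in column 'b' and a missing value in 'd'.
df = pd.DataFrame({'b': [2, 1, 2], 'd': [5.0, None, 3.0]},
                  index=['x', 'y', 'z'])

# Sort by 'b' first, then break ties with 'd'; returns a new DataFrame.
ordered = df.sort_values(by=['b', 'd'])
print(ordered.index.tolist())  # ['y', 'z', 'x']

# inplace=True modifies df itself and returns None.
result = df.sort_values(by='d', ascending=False,
                        na_position='first', inplace=True)
print(result)                  # None
print(df.index.tolist())       # ['y', 'x', 'z']
```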
