Selection – by columns, indices, or both

Now, let's learn how to access and edit specific values in pandas data structures. We'll start with a toy example—here, I will generate a dataframe from a dictionary of lists:

import pandas as pd

data = {'x':[1,2,3], 'y':['a', 'b', 'c'], 'z': [False, True, False]}
df = pd.DataFrame(data)

Now, we can take a look at the data we just stored:

>>> df
   x  y      z
0  1  a  False
1  2  b   True
2  3  c  False

As you can see, this frame has three rows and three columns. Let's see how selection works:

  1. First, let's start by selecting columns. Any column can be selected by indexing with square brackets and the column name. As we're asking for one column, it will be returned as a pandas Series object:
>>> df['x']
0    1
1    2
2    3
Name: x, dtype: int64
  2. A similar approach can be taken to select multiple columns. To do this, we just pass a list of column names instead of one name:
>>> df[['x', 'y']]
   x  y
0  1  a
1  2  b
2  3  c

In this case, a dataframe will be returned. Even if we have only one column name in the list, df[['x']], it will return a dataframe.
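The difference between the two return types can be checked directly. The following is a minimal sketch that rebuilds the toy frame from above; the variable names as_series and as_frame are introduced here for illustration only:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c'], 'z': [False, True, False]})

as_series = df['x']    # a single column name returns a Series
as_frame = df[['x']]   # a list with one column name returns a one-column DataFrame

print(type(as_series).__name__)          # Series
print(type(as_frame).__name__)           # DataFrame
print(as_series.shape, as_frame.shape)   # (3,) (3, 1)
```

The one-column dataframe keeps its two-dimensional shape, which matters when later code expects a dataframe rather than a Series.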

If we need to get a specific row or rows, we need to use the loc indexer. You can use loc in a similar way to column selection: df.loc[0] will select one row as a pandas Series. To select multiple rows, pass a list of index labels, as in df.loc[[0,2]]. In both cases, the numbers are the index labels of the rows, not their positions; labels do not have to be ordered, numeric, or unique.
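To illustrate that row labels need not be ordered, numeric, or unique, here is a small sketch. The frame named labeled and its string labels are hypothetical, chosen only to show how loc behaves with such an index:

```python
import pandas as pd

# Hypothetical labels for illustration: strings, unsorted, with one duplicate.
labeled = pd.DataFrame(
    {'x': [1, 2, 3], 'y': ['a', 'b', 'c']},
    index=['k2', 'k1', 'k2'],
)

one_row = labeled.loc['k1']          # label occurs once: returned as a Series
dup_rows = labeled.loc['k2']         # label occurs twice: both rows, as a DataFrame
picked = labeled.loc[['k1', 'k2']]   # list of labels: always a DataFrame

print(one_row['x'])          # 2
print(dup_rows.shape)        # (2, 2)
print(list(picked.index))    # ['k1', 'k2', 'k2']
```

Note that the return type for a single label depends on how many rows carry that label, which is one reason to keep indices unique when you can.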

In fact, .loc can be used to select both rows and columns; in other words, any arbitrary subset of a dataframe. To do so, just pass any definition of rows first, and then the definition of columns as a second argument:

>>> df.loc[[0,1], 'z']  # first two rows, column z
0    False
1     True
Name: z, dtype: bool

If one index label and one column name are passed, loc will return the raw scalar value, not a Series!
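A short sketch of this behavior, again on the toy frame; the names cell, one_item, and row are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c'], 'z': [False, True, False]})

cell = df.loc[1, 'z']            # one label and one column name: a scalar
one_item = df.loc[[1], 'z']      # a list with one label: a Series of length 1
row = df.loc[1, ['x', 'z']]      # one label and a list of columns: a Series

print(cell)             # True
print(len(one_item))    # 1
print(list(row.index))  # ['x', 'z']
```

Wrapping the label in a list is a simple way to keep the result a Series even when only one row is requested.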

Selection can be used for both reading and writing data, or even for creating new columns. Assigning to a column name that does not exist yet creates the column; assigning to an existing name overwrites it:

>>> df['new_column'] = -1
>>> df['new_column']
0   -1
1   -1
2   -1
Name: new_column, dtype: int64

You can pass either a collection of the same length (same number of rows) or a scalar (single) value. You can also write to a specific cell or a specific subset of cells. If needed, you can even pass multiple columns at once as a dataframe.
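The different kinds of assignment described above might look as follows. This is a sketch on the toy frame; the column name doubled and all assigned values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c'], 'z': [False, True, False]})

df['new_column'] = -1               # scalar: repeated for every row
df['doubled'] = [2, 4, 6]           # collection with one value per row
df.loc[0, 'x'] = 100                # a single cell
df.loc[[1, 2], 'new_column'] = 0    # a subset of cells in one column

# Several columns at once, passed as a dataframe with a matching index.
df[['x', 'doubled']] = pd.DataFrame({'x': [10, 20, 30], 'doubled': [20, 40, 60]})

print(df['x'].tolist())            # [10, 20, 30]
print(df['new_column'].tolist())   # [-1, 0, 0]
print(df['doubled'].tolist())      # [20, 40, 60]
```

The last assignment overwrites the earlier write to the single cell in x, which shows that later writes to the same cells simply replace earlier ones.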

In some cases, you might need to select according to the position of columns or rows, for example, the first five rows, no matter what their index labels say. For that, there is another indexer, .iloc. It works similarly to list slicing and supports, among other things, negative indices:

>>> df.iloc[-2:, 1:]
   y      z  new_column
1  b   True          -1
2  c  False          -1
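The difference between position-based and label-based selection is easiest to see when the index is not the default 0, 1, 2. The sketch below uses a hypothetical frame named shuffled with deliberately unordered integer labels:

```python
import pandas as pd

# Hypothetical index labels, deliberately out of order.
shuffled = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c']}, index=[20, 0, 10])

first_two = shuffled.iloc[:2]        # rows at positions 0 and 1, whatever their labels
by_label = shuffled.loc[0]           # the row labelled 0, which sits at position 1
last_cell = shuffled.iloc[-1, -1]    # last row, last column

print(list(first_two.index))   # [20, 0]
print(by_label['x'])           # 2
print(last_cell)               # c
```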

Also, if you just want to get the first, last, or random N rows, you can use the .head, .tail, or .sample methods, respectively. Each takes an integer as an argument, defining how many rows to return.
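A quick sketch of the three methods on the toy frame. The random_state argument is added here only so that the random sample is repeatable between runs:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': ['a', 'b', 'c'], 'z': [False, True, False]})

top = df.head(2)                       # first two rows
bottom = df.tail(1)                    # last row
picked = df.sample(2, random_state=0)  # two random rows; seed fixed for repeatability

print(list(top.index))     # [0, 1]
print(list(bottom.index))  # [2]
print(len(picked))         # 2
```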
