Chapter 4. Handling Data with pandas

In this chapter, we will introduce pandas, a powerful and versatile Python library that provides tools for data handling and analysis. We will consider the two main pandas structures for storing data, the Series and DataFrame objects, in detail. You will learn how to create these structures and how to access and insert data into them. We also cover the important topic of slicing, that is, how to access portions of data using the different indexing methods provided by pandas. Next, we'll discuss the computational and graphics tools offered by pandas, and finish the chapter by demonstrating how to work with a realistic dataset.

pandas is an extensive package for data-oriented manipulation, and it is beyond the scope of this book to realistically cover all aspects of the package. We will cover only some of the most useful data structures and functionalities. In particular, we will not cover the Panel data structure and multi-indexes. However, we will provide a solid foundation for readers who wish to expand their knowledge by consulting the official package documentation. Throughout this chapter, we assume the following imports:

%pylab inline
from pandas import Series, DataFrame
import pandas as pd

The Series class

A Series object represents a one-dimensional, indexed series of data. It can be thought of as a dictionary, with one main difference: the entries of a Series object are kept in a definite order. The following example constructs a Series object and displays it:

grades1 = Series([76, 82, 78, 100],
                 index = ['Alex', 'Robert', 'Minnie', 'Alice'],
                 name = 'Assignment 1', dtype=float64)
grades1

This produces the following output:

Alex       76
Robert     82
Minnie     78
Alice     100
Name: Assignment 1, dtype: float64

Notice the format of the constructor call:

Series(<data>, index=<indexes>, name=<name>, dtype=<type>)

Both data and indexes are usually lists or NumPy arrays, but can be any Python iterable. They must have the same length. The name argument is a string that describes the data in the series, and the type argument is a NumPy data type. The indexes and the name are optional; if the indexes are omitted, they are set to integers starting at 0. The data type is also optional; if it is omitted, it is inferred from the data.
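To make the optional arguments concrete, here is a minimal sketch, assuming a recent pandas version and Python 3 syntax. The names scores and by_name are illustrative only and are not used elsewhere in the chapter.

```python
import pandas as pd

# No index given: pandas assigns the integer labels 0, 1, 2, ...
scores = pd.Series([76, 82, 78, 100])
print(list(scores.index))
print(scores.dtype)   # an integer type, inferred from the data

# A dictionary supplies the data and the index labels at the same time.
by_name = pd.Series({'Alex': 76, 'Robert': 82}, name='Assignment 1')
print(by_name['Robert'])
```

Passing a dictionary is often the most convenient option when the labels and the values are already paired.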

A Series object supports the standard dictionary interface. As an example, run the following code in a cell:

print grades1['Minnie']
grades1['Minnie'] = 80
grades1['Theo'] = 92
grades1

The output of the preceding command lines is as follows:

78.0
Alex       76
Robert     82
Minnie     80
Alice     100
Theo       92
Name: Assignment 1, dtype: float64
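Beyond item access and assignment, a few other dictionary-style operations may be useful. The following is a minimal sketch, assuming a recent pandas version and Python 3 syntax; the series grades is a small stand-in for grades1, defined here so that the example is self-contained.

```python
import pandas as pd

grades = pd.Series([76.0, 82.0, 80.0], index=['Alex', 'Robert', 'Minnie'])

# Membership tests look at the index labels, not at the values.
has_alex = 'Alex' in grades
has_lucy = 'Lucy' in grades

# get() returns a default instead of raising KeyError for a missing label.
lucy_grade = grades.get('Lucy', 0.0)

# items() yields (label, value) pairs in index order.
pairs = list(grades.items())

print(has_alex, has_lucy, lucy_grade)
print(pairs)
```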

Here is another interesting example:

for student in grades1.keys():
    print '{} got {} points in {}'.format(student, grades1[student], grades1.name)

The preceding command lines produce the following output:

Alex got 76.0 points in Assignment 1
Robert got 82.0 points in Assignment 1
Minnie got 80.0 points in Assignment 1
Alice got 100.0 points in Assignment 1
Theo got 92.0 points in Assignment 1

Note that the order of the output is exactly the order in which the elements were inserted in the series. Unlike a standard Python dictionary (which guarantees insertion order only from Python 3.7 onward), the Series object keeps track of the order of its elements. In fact, elements can be accessed through an integer index, as shown in the following example:

grades1[2]

The preceding command returns the following output:

80.0
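Newer pandas releases may warn when a plain integer in brackets is used as a position on a series whose labels are strings. As a hedged alternative, the iloc and loc accessors state the intent explicitly. The sketch below assumes a recent pandas version, and grades is again a stand-in for grades1.

```python
import pandas as pd

grades = pd.Series([76.0, 82.0, 80.0, 100.0],
                   index=['Alex', 'Robert', 'Minnie', 'Alice'])

by_position = grades.iloc[2]      # third element, counting from 0
by_label = grades.loc['Minnie']   # element whose label is 'Minnie'

print(by_position, by_label)
```

Both expressions select the same element here, but iloc always counts positions and loc always looks up labels, which avoids ambiguity when the labels are themselves integers.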

Actually, all of Python's list-access interface is supported. For instance, we can use slices, which return Series objects:

grades1[1:-1]

The preceding command gives the following output:

Robert     82
Minnie     80
Alice     100
Name: Assignment 1, dtype: float64

The indexing capabilities are even more flexible; this is illustrated in the following example:

grades1[['Theo', 'Alice']]

The preceding command returns the following output:

Theo      92
Alice    100
dtype: float64

It is also possible to append new data to the series, by using the following command:

grades1a = grades1.append(Series([79, 81], index=['Theo', 'Joe']))
grades1a

The output of the preceding command is as follows:

Alex       76
Robert     82
Minnie     80
Alice     100
Theo       92
Theo       79
Joe        81
dtype: float64

Note that the series now contains two entries corresponding to the key Theo. This makes sense, since in real-life data there could be more than one data value associated with the same index. In our example, a student might be able to hand in more than one version of the assignment. What happens when we try to access this data? pandas conveniently returns a Series object so that no data is lost:

grades1a['Theo']

The output of the preceding command is as follows:

Theo    92
Theo    79
dtype: float64

Note

Note that the append() method does not append the values to the existing Series object. Instead, it creates a new object that consists of the original Series object with the appended elements. This behavior is not the same as what happens when elements are appended to a Python list. Quite a few methods of the Series class display behavior that is different from their corresponding list counterparts. A little experimentation (or reading the documentation) may be required to understand the conventions that pandas uses.
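In pandas 2.0 and later, the Series.append() method is no longer available, and pd.concat() is the usual replacement. The following minimal sketch, written in Python 3 syntax with shortened stand-in data, shows the same two behaviors: the original series is left untouched, and a duplicated label selects all matching entries.

```python
import pandas as pd

grades = pd.Series([76.0, 92.0], index=['Alex', 'Theo'])
extra = pd.Series([79.0, 81.0], index=['Theo', 'Joe'])

# concat builds a new Series; neither input is modified.
combined = pd.concat([grades, extra])
print(len(grades), len(combined))

# A duplicated label selects every matching entry.
print(combined['Theo'])
```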

Let's define a new series with the following command lines:

grades2 = Series([87, 76, 76, 94, 88],
               index = ['Alex', 'Lucy', 'Robert', 'Minnie', 'Alice'],
               name='Assignment 2',
               dtype=float64)
grades2

The preceding command lines give the following output:

Alex      87
Lucy      76
Robert    76
Minnie    94
Alice     88
Name: Assignment 2, dtype: float64

If we want to compute each student's average in the two assignments, we can use the following command:

average = 0.5 * (grades1 + grades2)
average

On running the preceding code, we get the following output:

Alex      81.5
Alice     94.0
Lucy       NaN
Minnie    87.0
Robert    79.0
Theo       NaN
dtype: float64

The value NaN stands for Not a Number, a special floating-point value that is used to indicate the result of an invalid operation, such as zero divided by zero. In pandas, it is used to represent a missing data value. Here, Lucy and Theo each appear in only one of the two series, so their averages are missing. We can locate the missing values in a Series using the isnull() method. For example, run the following code in a cell:

average.isnull()

Running the preceding command line produces the following output:

Alex      False
Alice     False
Lucy       True
Minnie    False
Robert    False
Theo       True
dtype: bool

If we decide that the missing data can be safely removed from the series, we can use the dropna() method:

average.dropna()

The preceding command line produces the following output:

Alex      81.5
Alice     94.0
Minnie    87.0
Robert    79.0
dtype: float64

Notice that this is another case in which the original series is not modified.
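As a small illustration of this point, the following sketch (assuming a recent pandas version, with a shortened stand-in for the average series) confirms that dropna() returns a new object. It also shows fillna(), an alternative that replaces missing values instead of removing them.

```python
import numpy as np
import pandas as pd

average = pd.Series([81.5, np.nan, 87.0], index=['Alex', 'Lucy', 'Minnie'])

cleaned = average.dropna()                # new Series without the NaN entry
filled = average.fillna(average.mean())   # new Series, NaN replaced by the mean

print(len(average), len(cleaned))
print(filled['Lucy'])
```

Note that mean() skips missing values by default, so the replacement value is the mean of the two grades that are present.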

The Series class provides a number of useful methods for its instances. For example, we can sort both the values and the indexes. To sort the values in-place, we use the sort() method:

grades1.sort()
grades1

This generates the following output:

Alex       76
Minnie     80
Robert     82
Theo       92
Alice     100
Name: Assignment 1, dtype: float64

To sort the indexes of a series, use the sort_index() method. For example, consider the following command:

grades1.sort_index()

This produces the following output:

Alex       76
Alice     100
Minnie     80
Robert     82
Theo       92
Name: Assignment 1, dtype: float64

Note

Note that the sorting is not in-place this time; a new Series object is returned.
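Later pandas releases replaced the in-place sort() method with sort_values(), which returns a new series, just as sort_index() does. The following is a minimal sketch under that assumption, using a stand-in series named grades.

```python
import pandas as pd

grades = pd.Series([76.0, 82.0, 80.0, 100.0, 92.0],
                   index=['Alex', 'Robert', 'Minnie', 'Alice', 'Theo'])

by_value = grades.sort_values()                   # ascending by grade
by_label = grades.sort_index()                    # alphabetical by name
top_first = grades.sort_values(ascending=False)   # highest grade first

print(list(by_value.index))
print(list(by_label.index))
print(list(top_first.index))
```

Since all three calls return new objects, grades itself keeps its original order.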

For the next examples, we will use data on maximum daily temperatures for the month of June from a weather station near the author's location. The following command lines generate the series of temperatures for the days from June 6 to June 15:

temps = Series([71,76,69,67,74,80,82,70,66,80],
               index=range(6,16), 
               name='Temperatures', dtype=float64)
temps

The preceding command produces the following output:

6     71
7     76
8     69
9     67
10    74
11    80
12    82
13    70
14    66
15    80
Name: Temperatures, dtype: float64

Let's first compute the mean and standard deviation of the temperatures using the following command:

print temps.mean(), temps.std()

The result of the preceding computation is as follows:

73.5 5.77831194112

If we want a quick overview of the data in the series, we can use the describe() method:

temps.describe()

The preceding command produces the following output:

count    10.000000
mean     73.500000
std       5.778312
min      66.000000
25%      69.250000
50%      72.500000
75%      79.000000
max      82.000000
Name: Temperatures, dtype: float64

Note that the information is returned as a Series object, so it can be stored in case it is needed in further computations.
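For instance, individual statistics can be read from the summary by label and combined in further computations. The following sketch, assuming a recent pandas version and Python 3 syntax, computes the interquartile range and the total range of the temperatures. The names summary, iqr, and spread are illustrative.

```python
import pandas as pd

temps = pd.Series([71, 76, 69, 67, 74, 80, 82, 70, 66, 80],
                  index=range(6, 16), name='Temperatures', dtype=float)

summary = temps.describe()

# Entries of the summary are addressed by label, like any other Series.
iqr = summary['75%'] - summary['25%']      # interquartile range
spread = summary['max'] - summary['min']   # total range

print(iqr, spread)
```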

To draw a plot of the series, we use the plot() method. If we just need a quick graphical overview of the data, we can just run the following command:

temps.plot()

However, it's also possible to produce nicely formatted, production-quality plots of the data, since all matplotlib features are supported in pandas. The following code illustrates how some of the graph formatting options discussed in Chapter 3, Graphics with matplotlib, can be used:

temps.plot(style='-s', lw=2, color='green')
axis((6,15,65, 85))
xlabel('Day')
ylabel('Temperature')
title('Maximum daily temperatures in June')
None # prevent text output

The preceding command lines produce the following plot:

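[Figure: line plot of the temps series, titled "Maximum daily temperatures in June", with Day on the horizontal axis and Temperature on the vertical axis]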

Suppose we want to find the days on which the maximum temperature was above 75 degrees. This can be achieved with the following expression:

temps[temps > 75]

The preceding command returns the following series:

7     76
11    80
12    82
15    80
Name: Temperatures, dtype: float64
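The expression temps > 75 is itself a Series of Boolean values, and it is this mask that selects the entries. Masks can be combined with the & (and) and | (or) operators, as in the following sketch, which assumes a recent pandas version. The names hot and mild are illustrative.

```python
import pandas as pd

temps = pd.Series([71, 76, 69, 67, 74, 80, 82, 70, 66, 80],
                  index=range(6, 16), name='Temperatures', dtype=float)

hot = temps > 75                        # Boolean Series, one entry per day
mild = (temps >= 70) & (temps <= 76)    # each comparison needs parentheses

print(int(hot.sum()))             # number of days above 75
print(list(temps[mild].index))    # days with a maximum between 70 and 76
```

The parentheses are needed because & binds more tightly than the comparison operators in Python.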

There are many more useful methods provided by the Series class. Remember that in order to see all the available methods, we can use the code completion feature of IPython. Start typing temps. and then press the Tab key. A window with a list of all available methods will pop up. You can then explore what is available.
