In this chapter, we will introduce pandas, a powerful and versatile Python library that provides tools for data handling and analysis. We will consider the two main pandas structures for storing data, the Series and DataFrame objects, in detail. You will learn how to create these structures and how to access and insert data into them. We also cover the important topic of slicing, that is, how to access portions of data using the different indexing methods provided by pandas. Next, we'll discuss the computational and graphics tools offered by pandas, and finish the chapter by demonstrating how to work with a realistic dataset.
pandas is an extensive package for data-oriented manipulation, and it is beyond the scope of this book to realistically cover all aspects of the package. We will cover only some of the most useful data structures and functionalities. In particular, we will not cover the Panel data structure and multi-indexes. However, we will provide a solid foundation for readers who wish to expand their knowledge by consulting the official package documentation. Throughout this chapter, we assume the following imports:
%pylab inline
from pandas import Series, DataFrame
import pandas as pd
A Series object represents a one-dimensional, indexed series of data. It can be thought of as a dictionary, with one main difference: the indexes in a Series object are ordered. The following example constructs a Series object and displays it:
grades1 = Series([76, 82, 78, 100],
                 index=['Alex', 'Robert', 'Minnie', 'Alice'],
                 name='Assignment 1', dtype=float64)
grades1
This produces the following output:
Alex       76
Robert     82
Minnie     78
Alice     100
Name: Assignment 1, dtype: float64
Notice the format of the constructor call:
Series(<data>, index=<indexes>, name=<name>, dtype=<type>)
Both data and indexes are usually lists or NumPy arrays, but can be any Python iterable. The two must have the same length. The name argument is a string that describes the data in the series. The type argument is a NumPy data type. The indexes and name arguments are optional (if indexes are omitted, they are set to integers starting at 0). The data type is also optional; if it is omitted, it is inferred from the data.
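As a minimal sketch of these defaults, the following cell builds a series from the data alone and then with an explicit data type. It assumes a Python 3 environment with a recent pandas release and refers to the class as pd.Series; the variable names are illustrative only.

```python
import pandas as pd

# Only the data is given: the index defaults to 0, 1, 2, ...
# and the data type is inferred from the values.
plain = pd.Series([76, 82, 78, 100])
print(plain.index.tolist())   # [0, 1, 2, 3]
print(plain.name)             # None
print(plain.dtype.kind)       # i  (an integer type was inferred)

# Requesting a data type explicitly converts the values.
as_float = pd.Series([76, 82, 78, 100], dtype=float)
print(as_float.dtype)         # float64
```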
A Series object supports the standard dictionary interface. As an example, run the following code in a cell:
print(grades1['Minnie'])
grades1['Minnie'] = 80
grades1['Theo'] = 92
grades1
The output of the preceding command lines is as follows:
78.0
Alex       76
Robert     82
Minnie     80
Alice     100
Theo       92
Name: Assignment 1, dtype: float64
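The dictionary analogy extends to membership tests and to lookups with a default value. The following sketch assumes a recent pandas release under Python 3; the label 'Zoe' and the default of -1.0 are hypothetical values chosen only for illustration.

```python
import pandas as pd

grades = pd.Series([76.0, 82.0, 80.0], index=['Alex', 'Robert', 'Minnie'])

# Membership tests look at the index labels, not at the values.
print('Minnie' in grades)        # True
print(80.0 in grades)            # False

# get() returns a default instead of raising KeyError for a missing label.
print(grades.get('Zoe', -1.0))   # -1.0

# Assigning to a new label enlarges the series in place.
grades['Theo'] = 92
print(list(grades.keys()))       # ['Alex', 'Robert', 'Minnie', 'Theo']
```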
Here is another interesting example:
for student in grades1.keys():
    print('{} got {} points in {}'.format(student, grades1[student], grades1.name))
The preceding command lines produce the following output:
Alex got 76.0 points in Assignment 1
Robert got 82.0 points in Assignment 1
Minnie got 80.0 points in Assignment 1
Alice got 100.0 points in Assignment 1
Theo got 92.0 points in Assignment 1
Note that the order of the output is exactly the same as the order in which the elements were inserted in the series. Contrary to a standard Python dictionary, the Series object keeps track of the order of the elements. In fact, elements can be accessed through an integer index, as shown in the following example:
grades1[2]
The preceding command returns the following output:
80.0
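Plain brackets accept both labels and positions, which can become ambiguous when the labels are themselves integers. As a sketch of a more explicit style, assuming a recent pandas release, the loc accessor looks up by label and the iloc accessor looks up by position:

```python
import pandas as pd

grades = pd.Series([76.0, 82.0, 80.0, 100.0, 92.0],
                   index=['Alex', 'Robert', 'Minnie', 'Alice', 'Theo'])

by_label = grades.loc['Minnie']    # lookup by index label
by_position = grades.iloc[2]       # lookup by integer position (third entry)
print(by_label, by_position)       # 80.0 80.0
```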
Actually, all of Python's list-access interface is supported. For instance, we can use slices, which return Series objects:
grades1[1:-1]
The preceding command gives the following output:
Robert     82
Minnie     80
Alice     100
Name: Assignment 1, dtype: float64
The indexing capabilities are even more flexible; this is illustrated in the following example:
grades1[['Theo', 'Alice']]
The preceding command returns the following output:
Theo      92
Alice    100
dtype: float64
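The same selections can be written with the explicit accessors. This sketch, which assumes a recent pandas release, also shows one difference worth remembering: a positional slice excludes its end position, while a label slice includes both endpoints.

```python
import pandas as pd

grades = pd.Series([76.0, 82.0, 80.0, 100.0, 92.0],
                   index=['Alex', 'Robert', 'Minnie', 'Alice', 'Theo'])

# Positional slice: the end position is excluded, as with Python lists.
middle = grades.iloc[1:-1]
print(middle.index.tolist())        # ['Robert', 'Minnie', 'Alice']

# Label slice: both endpoints are included.
same = grades.loc['Robert':'Alice']
print(same.index.tolist())          # ['Robert', 'Minnie', 'Alice']

# A list of labels selects entries in the order requested.
picked = grades.loc[['Theo', 'Alice']]
print(picked.tolist())              # [92.0, 100.0]
```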
It is also possible to append new data to the series, by using the following command:
grades1a = grades1.append(Series([79, 81], index=['Theo', 'Joe']))
grades1a
The output of the preceding command is as follows:
Alex       76
Robert     82
Minnie     80
Alice     100
Theo       92
Theo       79
Joe        81
dtype: float64
Note that the series now contains two entries corresponding to the key Theo. This makes sense, since in real-life data there could be more than one data value associated with the same index. In our example, a student might be able to hand in more than one version of the assignment. What happens when we try to access this data? pandas conveniently returns a Series object so that no data is lost:
grades1a['Theo']
The output of the preceding command is as follows:
Theo    92
Theo    79
dtype: float64
Note that the append() method does not append the values to the existing Series object. Instead, it creates a new object that consists of the original Series object with the appended elements. This behavior is not the same as what happens when elements are appended to a Python list. Quite a few methods of the Series class display behavior that is different from their corresponding list counterparts. A little experimentation (or reading the documentation) may be required to understand the conventions that pandas uses.
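Readers on newer pandas releases may find that the append() method is no longer available on Series objects. As a sketch under that assumption, the pd.concat() function gives the same result, and it likewise returns a new object and keeps duplicated labels. The shortened series below are illustrative, not the chapter's full data.

```python
import pandas as pd

grades = pd.Series([76.0, 92.0], index=['Alex', 'Theo'])
extra = pd.Series([79.0, 81.0], index=['Theo', 'Joe'])

# concat() returns a new series; neither input is modified.
combined = pd.concat([grades, extra])
print(len(grades), len(combined))      # 2 4

# A duplicated label yields a Series holding every matching entry.
print(combined.loc['Theo'].tolist())   # [92.0, 79.0]
```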
Let's define a new series with the following command lines:
grades2 = Series([87, 76, 76, 94, 88],
                 index=['Alex', 'Lucy', 'Robert', 'Minnie', 'Alice'],
                 name='Assignment 2', dtype=float64)
grades2
The preceding command lines give the following output:
Alex      87
Lucy      76
Robert    76
Minnie    94
Alice     88
Name: Assignment 2, dtype: float64
If we want to compute each student's average in the two assignments, we can use the following command:
average = 0.5 * (grades1 + grades2)
average
On running the preceding code, we get the following output:
Alex      81.5
Alice     94.0
Lucy       NaN
Minnie    87.0
Robert    79.0
Theo       NaN
dtype: float64
The value NaN stands for Not a Number, which is a special floating-point value that is used to indicate the result of an invalid operation, such as zero divided by zero. In pandas, it is used to represent a missing data value. Here, Lucy and Theo each appear in only one of the two series, so their averages cannot be computed. We can locate the missing values in a Series object using the isnull() method. For example, run the following code in a cell:
average.isnull()
Running the preceding command line produces the following output:
Alex      False
Alice     False
Lucy       True
Minnie    False
Robert    False
Theo       True
dtype: bool
If we decide that the missing data can be safely removed from the series, we can use the dropna() method:
average.dropna()
The preceding command line produces the following output:
Alex      81.5
Alice     94.0
Minnie    87.0
Robert    79.0
dtype: float64
Notice that this is another case in which the original series is not modified.
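The following sketch condenses the missing-data workflow into one cell, using two short illustrative series rather than the full grade data. It assumes a recent pandas release. The last step, which treats a missing grade as zero through the fill_value argument of add(), is an assumption made for illustration and not a recommendation.

```python
import pandas as pd

a = pd.Series([76.0, 92.0], index=['Alex', 'Theo'])
b = pd.Series([87.0, 76.0], index=['Alex', 'Lucy'])

# Arithmetic aligns on labels; a label missing from either side gives NaN.
mean_ab = 0.5 * (a + b)
print(int(mean_ab.isnull().sum()))   # 2 missing values (Lucy and Theo)

# dropna() returns a new series and leaves mean_ab untouched.
clean = mean_ab.dropna()
print(len(mean_ab), len(clean))      # 3 1

# Illustrative assumption: count a missing grade as zero before averaging.
filled = 0.5 * a.add(b, fill_value=0)
print(filled.loc['Lucy'])            # 38.0
```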
The Series class provides a number of useful methods for its instances. For example, we can sort both the values and the indexes. To sort the values in-place, we use the sort() method:
grades1.sort()
grades1
This generates the following output:
Alex       76
Minnie     80
Robert     82
Theo       92
Alice     100
Name: Assignment 1, dtype: float64
To sort the indexes of a series, use the sort_index() method. For example, consider the following command:
grades1.sort_index()
This produces the following output:
Alex       76
Alice     100
Minnie     80
Robert     82
Theo       92
Name: Assignment 1, dtype: float64
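In newer pandas releases, sorting by value is done with the sort_values() method, which returns a new series instead of sorting in place. The sketch below assumes such a release and contrasts the two orderings on the grade data:

```python
import pandas as pd

grades = pd.Series([76.0, 82.0, 80.0, 100.0, 92.0],
                   index=['Alex', 'Robert', 'Minnie', 'Alice', 'Theo'])

# sort_values() returns a new series ordered by value.
by_value = grades.sort_values()
print(by_value.index.tolist())   # ['Alex', 'Minnie', 'Robert', 'Theo', 'Alice']

# sort_index() returns a new series ordered by label.
by_label = grades.sort_index()
print(by_label.index.tolist())   # ['Alex', 'Alice', 'Minnie', 'Robert', 'Theo']

# The original keeps its order unless we rebind the name.
print(grades.index.tolist())     # ['Alex', 'Robert', 'Minnie', 'Alice', 'Theo']
```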
For the next examples, we will use data on maximum daily temperatures for the month of June from a weather station near the author's location. The following command lines generate the series of temperatures for the days from June 6 to June 15:
temps = Series([71, 76, 69, 67, 74, 80, 82, 70, 66, 80],
               index=range(6, 16),
               name='Temperatures', dtype=float64)
temps
The preceding command produces the following output:
6     71
7     76
8     69
9     67
10    74
11    80
12    82
13    70
14    66
15    80
Name: Temperatures, dtype: float64
Let's first compute the mean and standard deviation of the temperatures using the following command:
print('{} {}'.format(temps.mean(), temps.std()))
The result of the preceding computation is as follows:
73.5 5.77831194112
If we want a quick overview of the data in the series, we can use the describe() method:
temps.describe()
The preceding command produces the following output:
count    10.000000
mean     73.500000
std       5.778312
min      66.000000
25%      69.250000
50%      72.500000
75%      79.000000
max      82.000000
Name: Temperatures, dtype: float64
Note that the information is returned as a Series object, so it can be stored in case it is needed in further computations.
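As a small sketch of such a further computation, the following cell stores the summary and uses two of its entries to obtain the interquartile range. It assumes a recent pandas release; the names summary and iqr are illustrative.

```python
import pandas as pd

temps = pd.Series([71, 76, 69, 67, 74, 80, 82, 70, 66, 80],
                  index=range(6, 16), name='Temperatures', dtype=float)

summary = temps.describe()

# Entries of the summary are addressed by label, like any other series.
iqr = summary['75%'] - summary['25%']   # interquartile range
print(iqr)                              # 9.75

# std() computes the sample standard deviation (divides by n - 1).
print(round(temps.std(), 6))            # 5.778312

# Passing ddof=0 gives the population version, for comparison.
print(temps.std(ddof=0))
```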
To draw a plot of the series, we use the plot() method. If we just need a quick graphical overview of the data, we can run the following command:
temps.plot()
However, it's also possible to produce nicely formatted, production-quality plots of the data, since all matplotlib features are supported in pandas. The following code illustrates how some of the graph formatting options discussed in Chapter 3, Graphics with matplotlib, can be used:
temps.plot(style='-s', lw=2, color='green')
axis((6, 15, 65, 85))
xlabel('Day')
ylabel('Temperature')
title('Maximum daily temperatures in June')
None # prevent text output
The preceding command lines produce the following plot:
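The same figure can also be built without the pylab namespace. The following sketch assumes that matplotlib is installed alongside a recent pandas release, and it uses an explicit Axes object in place of the axis(), xlabel(), ylabel(), and title() functions. In a notebook the figure is shown when the cell finishes; in a script, a call to plt.show() would be needed.

```python
import matplotlib.pyplot as plt
import pandas as pd

temps = pd.Series([71, 76, 69, 67, 74, 80, 82, 70, 66, 80],
                  index=range(6, 16), name='Temperatures', dtype=float)

# Create the figure and axes explicitly, then let pandas draw on them.
fig, ax = plt.subplots()
temps.plot(ax=ax, style='-s', lw=2, color='green')

# Same limits, labels, and title as in the pylab version.
ax.set_xlim(6, 15)
ax.set_ylim(65, 85)
ax.set_xlabel('Day')
ax.set_ylabel('Temperature')
ax.set_title('Maximum daily temperatures in June')
```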
Suppose we want to find the days in which the maximum temperature was above 75 degrees. This can be achieved with the following expression:
temps[temps > 75]
The preceding command returns the following series:
7     76
11    80
12    82
15    80
Name: Temperatures, dtype: float64
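This works because the comparison itself produces a series of Boolean values, which is then used to select entries. As a brief sketch, assuming a recent pandas release, the mask can be stored on its own, and several conditions can be combined with the & and | operators:

```python
import pandas as pd

temps = pd.Series([71, 76, 69, 67, 74, 80, 82, 70, 66, 80],
                  index=range(6, 16), name='Temperatures', dtype=float)

# The comparison yields a boolean series with the same index as temps.
mask = temps > 75
print(int(mask.sum()))              # 4 days above 75 degrees

# Combined conditions: each comparison needs its own parentheses.
mild = temps[(temps > 70) & (temps < 80)]
print(mild.index.tolist())          # [6, 7, 10]
```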
There are many more useful methods provided by the Series class. Remember that in order to see all the available methods, we can use the code completion feature of IPython. Start typing temps. and then press the Tab key. A window with a list of all available methods will pop up. You can then explore what is available.