
The Pandas library builds on NumPy by introducing several useful data structures and functionalities to read and process data. Pandas is a great tool for general data munging. It easily handles common tasks such as dealing with missing data, manipulating shapes and sizes, converting between data formats and structures, and importing data from different sources.

The main data structures introduced by Pandas are:

  • Series
  • DataFrame
  • Panel (deprecated and later removed in newer Pandas releases)

The DataFrame is probably the most widely used. It is a two-dimensional structure, effectively a table, that can be created from a NumPy array, lists, dicts, or Series objects. You can also create a DataFrame by reading from a file.
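
As a minimal sketch of these construction routes, the following builds the same small table four ways. The column names year and maxtemp and all of the values are made up for illustration and are not taken from the weather data used later.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not real observations.
years = [2001, 2002, 2003]
temps = [17.2, 16.9, 17.5]

# From a dict of lists: the keys become the column labels.
df_from_dict = pd.DataFrame({"year": years, "maxtemp": temps})

# From a list of rows, with the column labels supplied separately.
rows = [[2001, 17.2], [2002, 16.9], [2003, 17.5]]
df_from_lists = pd.DataFrame(rows, columns=["year", "maxtemp"])

# From a two-dimensional NumPy array (all values share one dtype, float64 here).
df_from_array = pd.DataFrame(np.array(rows), columns=["year", "maxtemp"])

# From Series objects, which are aligned on their index.
df_from_series = pd.DataFrame({"year": pd.Series(years), "maxtemp": pd.Series(temps)})

print(df_from_dict)
```

Note that the NumPy route stores the years as floats, because a single array holds only one data type.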

Probably the best way to get a feel for Pandas is to go through a typical use case. Let's say that we are given the task of discovering how the daily maximum temperature has changed over time. For this example, we will be working with historical weather observations from the Hobart weather station in Tasmania. Download the following ZIP file and extract its contents into a folder called data in your Python working directory:

The first thing we do is create a DataFrame from it:

import pandas as pd
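
The name of the extracted file is not given here, so the sketch below reads from an in-memory string that stands in for it. The rows, the product code and station number values, and the Month, Day, and Quality column labels are all assumptions made for illustration. With the real data, you would pass the path of the extracted CSV file to pd.read_csv instead of the buffer.

```python
import io
import pandas as pd

# Stand-in for the extracted CSV file. All values are placeholders, and the
# Month, Day and Quality column labels are assumptions.
csv_text = """Product code,Bureau of Meteorology station number,Year,Month,Day,Maximum temperature (Degree C),Days of accumulation of maximum temperature,Quality
X0001,99999,2000,1,1,21.5,1,Y
X0001,99999,2000,1,2,19.8,1,Y
X0001,99999,2000,1,3,,,
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)
```

Empty fields, such as those in the last row, are read as missing values (NaN).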

Check the first few rows in this data:
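
One common way to do this is the head method, sketched here on a small stand-in table with made-up values. With the real data, df would be the DataFrame loaded from the weather file.

```python
import pandas as pd

# Stand-in table with made-up values.
df = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000, 2000, 2000],
    "Maximum temperature (Degree C)": [21.5, 19.8, 18.1, 20.6, 22.3, 17.9],
})

first_rows = df.head()    # the first five rows by default
first_three = df.head(3)  # or ask for a specific number of rows

print(first_rows)
```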


We can see that the product code and the station number are the same for each row and that this information is superfluous. Also, the days of accumulated maximum temperature are not needed for our purpose, so we will delete them as well:

del df['Bureau of Meteorology station number']
del df['Product code']
del df['Days of accumulation of maximum temperature']
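
As an alternative sketch, the same three columns can be removed in a single call to drop. The stand-in table below uses placeholder values; only the column labels come from the text above.

```python
import pandas as pd

# Stand-in table with placeholder values.
df = pd.DataFrame({
    "Product code": ["X0001", "X0001"],
    "Bureau of Meteorology station number": [99999, 99999],
    "Year": [2000, 2000],
    "Maximum temperature (Degree C)": [21.5, 19.8],
    "Days of accumulation of maximum temperature": [1, 1],
})

unwanted = [
    "Bureau of Meteorology station number",
    "Product code",
    "Days of accumulation of maximum temperature",
]

# drop returns a new DataFrame, so we assign the result back to df.
df = df.drop(columns=unwanted)

print(list(df.columns))
```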

Let's make our data a little easier to read by shortening the column labels:

df = df.rename(columns={'Maximum temperature (Degree C)': 'maxtemp'})

We are only interested in data that is of high quality, so we include only records that have a Y in the quality column:
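
A minimal sketch of this filter uses a Boolean mask. The label of the quality column is not shown here, so Quality is an assumed name, and the rows are made up.

```python
import pandas as pd

# Made-up rows; "Quality" is an assumed label for the quality column.
df = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000],
    "maxtemp": [21.5, 19.8, 35.0, 18.1],
    "Quality": ["Y", "Y", "N", "Y"],
})

# The comparison yields a Boolean Series; indexing with it keeps only the
# rows where the mask is True.
df = df[df["Quality"] == "Y"]

print(len(df))
```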


We can get a statistical summary of our data:
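
The describe method is one way to produce such a summary. It is sketched here on a single made-up column. It reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

```python
import pandas as pd

# Made-up temperatures for illustration.
df = pd.DataFrame({"maxtemp": [21.5, 19.8, 18.1, 20.6]})

# describe returns a new DataFrame indexed by statistic name.
summary = df.describe()

print(summary)
```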


If we import the matplotlib.pyplot package, we can graph the data:

import matplotlib.pyplot as plt
plt.plot(df.Year, df.maxtemp)

Notice that pyplot correctly formats the date axis and deals with the missing data by connecting the two known points on either side of each gap.

We can convert a DataFrame into a NumPy array using the following:

ndarray = df.values

If the DataFrame contains a mixture of data types, the resulting array uses the lowest common denominator type, that is, the type that can accommodate all of the values. For example, if the DataFrame consists of a mix of float16 and float32 columns, the values are converted to float32.
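
The following sketch illustrates this conversion with two small made-up DataFrames, one mixing float16 and float32 columns and one mixing numbers with strings.

```python
import numpy as np
import pandas as pd

# Two float columns of different widths.
mixed_floats = pd.DataFrame({
    "a": np.array([1.0, 2.0], dtype=np.float16),
    "b": np.array([3.0, 4.0], dtype=np.float32),
})
print(mixed_floats.values.dtype)  # float32

# A numeric column next to a string column: the only type that can hold
# both is the generic Python object type.
mixed_kinds = pd.DataFrame({"n": [1.5, 2.5], "s": ["x", "y"]})
print(mixed_kinds.values.dtype)
```

An object array gives up most of the speed advantages of NumPy, so it is usually worth selecting only the numeric columns before converting.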

The Pandas DataFrame is a great object for viewing and manipulating simple text and numerical data. However, Pandas is probably not the right tool for more sophisticated numerical processing, such as calculating dot products or solving linear systems. For numerical applications, we generally use the NumPy classes.
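
As a brief sketch of the kind of numerical work meant here, the following computes a dot product and solves a small linear system with NumPy. The vectors and the matrix are arbitrary illustrative values.

```python
import numpy as np

# Dot product of two vectors: 1*4 + 2*5 + 3*6 = 32.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
dot = np.dot(a, b)

# Solve the linear system A x = y, that is:
#   2*x0 + 1*x1 = 5
#   1*x0 + 3*x1 = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
y = np.array([5.0, 10.0])
x = np.linalg.solve(A, y)

print(dot)
print(x)
```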

