Computational and graphics tools

The objects of pandas have a rich set of built-in computational tools. To illustrate some of this functionality, we will use the random data stored in the dframe object defined in the previous section (this assumes that the imports from that section, such as DataFrame from pandas and normal from numpy.random, are still in effect). If you discarded the object, here is how to construct it again:

means = [0, 0, 1, 1, -1, -1, -2, -2]
sdevs = [1, 2, 1, 2,  1,  2,  1,  2]
random_data = {}
nrows = 30
for mean, sdev in zip(means, sdevs):
    label = 'Mean={}, sd={}'.format(mean, sdev)
    random_data[label] = normal(mean, sdev, nrows)
row_labels = ['Row {}'.format(i) for i in range(nrows)]
dframe = DataFrame(random_data, index=row_labels)

Let's explore some of these built-in computational tools.

  • To get a list of the methods available for the object, start typing the following command in a cell:
    dframe.
    
  • Then, press the Tab key. The completion popup allows us to select a method by double clicking on it. For example, double click on mean. The cell text changes to the following:
    dframe.mean
    
  • Now, add a question mark to the preceding command line and run the cell:
    dframe.mean?
    

    This will display information about the mean method (which, not surprisingly, computes the mean of the data).

Using tab-completion and IPython's help features is an excellent way to learn about pandas' features. I recommend that you always display the documentation this way, at least the first few times a method is used. Learning about the features that pandas offers can be a real time-saver.

Now, let's continue exploring this functionality:

  • Let's say that we want to compute the column means for our random data. This can be done by evaluating the following command:
    dframe.mean()
    
  • The standard deviation values can be computed with the following command:
    dframe.std()
    

Note that the results of the immediately preceding command lines are returned as Series objects, which is the default object type that pandas uses for one-dimensional data. In particular, the column labels become the index of the resulting objects. Now, let's say we want to create a DataFrame object containing the means and standard deviations in two rows. pandas makes this an easy task, using built-in conversions and constructors.

mean_series = dframe.mean()
std_series = dframe.std()
mean_std = DataFrame([dict(mean_series),
                      dict(std_series)],
                     index=['mean', 'std'])
mean_std

In this code, we first compute the means and standard deviations and assign them to variables for clarity. Then, we call the DataFrame constructor with a list of Python dictionaries. This is made easy because pandas supports convenient conversion from a Series object to a dictionary: dict(mean_series) returns the representation of mean_series as a dictionary, using the index labels of the Series object as the dictionary keys.
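As a quick illustration of this conversion, here is a small made-up Series (the labels and values are ours, chosen only for the example); note how the index labels become the dictionary keys:

```python
from pandas import Series

# a tiny Series with string index labels (illustrative values only)
s = Series([1.5, 2.5], index=['a', 'b'])
d = dict(s)
# keys are the index labels 'a' and 'b'; values are 1.5 and 2.5
print(d)
```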

Let's say we want to standardize the data in all columns so that they all have a common mean value 100 and standard deviation value 20. This can be achieved using the following command lines:

dframe_stnd = 100 + 20 * (dframe - mean_std.iloc[0,:]) / mean_std.iloc[1,:] 
dframe_stnd

The preceding command lines simply implement the definition of standardization: we subtract the means from the data, divide by the standard deviation, scale by the desired value of the deviation, and add the desired mean. To check that we get the expected results, run the following command lines:

print(dframe_stnd.mean())
print(dframe_stnd.std())
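Here is a self-contained version of this check, using a small made-up table (the column names and values are illustrative only):

```python
import numpy as np
from pandas import DataFrame

# toy data, just to verify the standardization formula
df = DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                'b': [10.0, 20.0, 30.0, 40.0]})
stnd = 100 + 20 * (df - df.mean()) / df.std()
print(stnd.mean())  # each column mean is (up to rounding) 100
print(stnd.std())   # each column standard deviation is (up to rounding) 20
```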

To illustrate the possibilities, let's do a two-sided test of the hypothesis that the mean of each column is 0. We first compute the Z-scores for the columns. The Z-score of each column is just the deviation from the column mean to the model mean (0 in this case), properly scaled by the standard deviation:

zscores = mean_std.iloc[0,:] / (mean_std.iloc[1,:] / sqrt(len(dframe)))
zscores 

The scaling factor, sqrt(len(dframe)), is the square root of the number of data points, which is given by the number of rows in the table. The last step is to compute the p-values for each column. The p-value is the probability, under the assumed distribution, of observing a deviation from the model mean at least as extreme as the one measured by the corresponding Z-score. These values are obtained from a normal distribution (technically, we should use a t-distribution, since we are using the sample standard deviation, but in this example this makes no real difference: the data is normally generated and the sample size is large enough). The following command lines use the normal distribution object, norm, from SciPy to compute the p-values as percentages:

from scipy.stats import norm
pvalues = 2 * norm.cdf(-abs(zscores)) * 100
pvalues_series = Series(pvalues, index = zscores.index)
pvalues_series

The line that computes the p-values is as follows:

pvalues = 2 * norm.cdf(-abs(zscores)) * 100

We use the cdf() method of the norm object, which computes the cumulative distribution function of the standard normal curve. We then multiply by 2, since this is a two-sided test, and by 100 to express the result as a percentage.
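As a sanity check of this formula, a Z-score of 1.96 should give a two-sided p-value of about 5 percent (the familiar 5 percent significance threshold):

```python
from scipy.stats import norm

z = 1.96
p = 2 * norm.cdf(-abs(z)) * 100   # two-sided p-value, as a percentage
print(round(p, 2))                # approximately 5.0
```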

The next line converts the p-values into a Series object. This is not necessary, but makes the results easier to visualize.

The following are the results obtained:

Mean=-1, sd=1    1.374183e-02
Mean=-1, sd=2    1.541008e-01
Mean=-2, sd=1    2.812333e-26
Mean=-2, sd=2    1.323917e-04
Mean=0, sd=1     2.840077e+01
Mean=0, sd=2     6.402502e+01
Mean=1, sd=1     2.182986e-06
Mean=1, sd=2     5.678316e-01
dtype: float64

Note

Please note that when you run the preceding example, you will get different numbers, since the data is randomly generated.

The results are what we expect, given the way the data was generated: the p-values are all very small, except for the columns that have mean 0.

Now, let's explore some of the graphical capabilities provided by pandas. The pandas plots are produced using matplotlib, so the basic interface has already been discussed in Chapter 3, Graphics with matplotlib. In the examples that follow, we will assume that we are using the %pylab magic. Run the following command in a cell:

%pylab inline

Most of the plotting capabilities of pandas are implemented as methods of Series or DataFrame objects.

Let's redefine the data in our table to include more data points:

means = [0, 0, 1, 1, -1, -1, -2, -2]
sdevs = [1, 2, 1, 2,  1,  2,  1,  2]
random_data = {}
nrows = 300
for mean, sdev in zip(means, sdevs):
    label = 'Mean={}, sd={}'.format(mean, sdev)
    random_data[label] = normal(mean, sdev, nrows)
row_labels = ['Row {}'.format(i) for i in range(nrows)]
dframe = DataFrame(random_data, index=row_labels)

To display a grid of histograms of the data, we can use the following command:

dframe.hist(color='DarkCyan')
subplots_adjust(left=0.5, right=2, top=2.5, bottom=1.0)

We use the hist() method to generate the histograms, passing the color option, which is forwarded to the matplotlib function calls that actually do the drawing. The second line of code adds space between the subplots so that the axis labels do not overlap. You may find that some of the histograms do not look normal. To fix their appearance, you can adjust the bins and range options of the hist() method, as shown in the following example:

dframe.loc[:,'Mean=0, sd=2'].hist(bins=40, range=(-10,10), color='LightYellow')
title('Normal variates, mean 0, standard deviation 2')

This will draw a histogram of the data in the column for a mean of 0 and standard deviation of 2, with 40 bins in the range from -10 to 10. In other words, each bin has a width of 0.5. Note that the plot may not cover the whole range from -10 to 10, since pandas restricts the drawing to ranges that actually contain data.

As a more extended example, let's generate data according to Geometric Brownian Motion (GBM), which is a model used in mathematical finance to represent the evolution of stock prices. (For details, see http://en.wikipedia.org/wiki/Geometric_Brownian_motion.) The model is defined in terms of two parameters, representing the percentage drift and percentage volatility of the stock. We start by defining these two values, as well as the initial value of the stock:

mu = 0.15
sigma = 0.33
S0 = 150

The simulation should run from time 0.0 to the maximum time 20.0, and we want to generate 200 data points. The following command lines define these parameters:

nsteps = 200
tmax = 20.
dt = tmax/nsteps
times = arange(0, tmax, dt)

The stock model would naturally be represented by a time series (a Series object). However, to make the simulation simpler, we will use a DataFrame object and build the simulation column by column. We start with a very simple table containing only integer indexes and the simulation times:

gbm_data = DataFrame(times, columns=['t'], index=range(nsteps))

To see the first few rows of the table, we can use the following command line:

gbm_data.loc[:5,:]

You might want to run this command after each column is added in order to get a better idea of how the simulation progresses.

The basis for the GBM model is (unsurprisingly) a stochastic process called Brownian Motion (BM). This process has two parts. A deterministic component, called drift, is computed as follows:

gbm_data['drift'] = (mu - sigma**2/2) * gbm_data.loc[:,'t']

The next component adds randomness. It is defined in terms of increments, which are normally distributed with mean zero and standard deviation given by the percentage volatility multiplied by the square root of the time step (the variance of a Brownian increment is proportional to the length of the time step):

gbm_data['dW'] = normal(0.0, sigma * sqrt(dt), nsteps)
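Since the variance of a Brownian increment grows linearly with the time step, the standard deviation of each increment is sigma multiplied by the square root of dt. A quick numerical check (with a fixed seed and our own sample size, so the numbers are reproducible):

```python
import numpy as np

sigma, dt = 0.33, 0.1
rng = np.random.default_rng(0)   # fixed seed for reproducibility
dW = rng.normal(0.0, sigma * np.sqrt(dt), 100_000)
# the sample standard deviation should be close to sigma * sqrt(dt) ~ 0.1044
print(dW.std())
```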

The BM component is then defined as the cumulative sum of the increments, as shown in the following command lines:

gbm_data['W'] = gbm_data.loc[:,'dW'].cumsum()
gbm_data.loc[0, 'W'] = 0.0

In the preceding command lines, we add the second line because we want the process to start at 0; the cumsum() method instead makes the first entry equal to the first increment.
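The cumsum() convention can be seen on a tiny made-up Series of increments (the values here are ours, chosen only for illustration):

```python
from pandas import Series

dW = Series([0.5, -0.2, 0.1])   # toy increments
W = dW.cumsum()
# the first entry of W equals the first increment (0.5), not 0
print(W.iloc[0])
W.iloc[0] = 0.0                 # force the process to start at 0, as in the text
print(W.iloc[0])
```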

We are now ready to compute the stock simulation. It is calculated by taking the drift component, adding the BM component to it, taking the exponential of the result, and finally, multiplying it by the initial value of the stock. This is all done with the following command:

gbm_data['S'] = S0 * exp(gbm_data.loc[:,'drift'] + gbm_data.loc[:,'W'])

We are now ready to plot the result of the simulation using the following command lines:

gbm_data.plot(x='t', y='S', lw=2, color='green',
              title='Geometric Brownian Motion')

The preceding command lines produce the following graph. Obviously, the graph you will get will be different due to randomness.
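For reference, the whole simulation can be condensed into a short standalone NumPy script. This is a sketch, not the book's exact code: it uses NumPy's newer random generator, the square-root-of-dt scaling for the increments, and a seed of our own choosing.

```python
import numpy as np

mu, sigma, S0 = 0.15, 0.33, 150     # drift, volatility, initial stock value
nsteps, tmax = 200, 20.0
dt = tmax / nsteps
t = np.arange(0, tmax, dt)

rng = np.random.default_rng(42)
dW = rng.normal(0.0, sigma * np.sqrt(dt), nsteps)
dW[0] = 0.0                         # start the Brownian motion at 0
W = dW.cumsum()
S = S0 * np.exp((mu - sigma**2 / 2) * t + W)
print(S[0])                         # 150.0 -- the path starts at S0
```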
