Multivariate distributions

So far in this chapter, we have considered only random experiments that have a single numeric outcome. Within this framework, we can model only a single variable. In many data analysis problems, however, we are interested in relationships between variables. For example, we might want to understand the relation between the height and weight of a person, or between income and educational level. In other situations, we observe the same variable repeatedly. As an example, we might be interested in the daily snowfall in a region during the winter months.

To handle these situations, we need models described by multivariate distributions. We have analogues of the cdf and pdf (or pmf for discrete distributions), but they are now functions of several variables. The univariate distributions that we discussed in the previous sections are used as building blocks, but we have the extra complication of having to specify how the different variables interact with each other.

A typical example is the bivariate Normal distribution. In this model, we observe two random variables that, individually, are normally distributed. Each of the two variables is characterized by its own mean and standard deviation. However, we must also say how the two variables interact.

In the simplest case, the outcome of one of the variables has no influence whatsoever on the outcome of the other. Consider, for example, the relationship between the snowfall in London and the score of a soccer match in Sydney. Unless we believe in some kind of supernatural link, we don't expect these two variables to be related in any way. In this case, we say that the variables are independent.
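For the bivariate Normal model, independence has a concrete meaning: when the covariance matrix is diagonal, the joint pdf is the product of the two univariate Normal pdfs. The following is a minimal sketch of that check. The evaluation point, means, and standard deviations are arbitrary values chosen for illustration, not taken from the text.

```python
import numpy as np
import scipy.stats as st

# Illustrative values (assumptions): a point (x, y) and the parameters
# of two independent Normal variables.
x, y = 0.5, -1.2
mu = [1.0, 2.0]        # means of the two variables
sigma = [2.0, 0.5]     # standard deviations of the two variables

# Diagonal covariance matrix: variances on the diagonal, zero covariance.
cov = np.diag(np.square(sigma))

# Joint density of the bivariate Normal at (x, y).
joint = st.multivariate_normal.pdf([x, y], mean=mu, cov=cov)

# Product of the two univariate Normal densities.
product = (st.norm.pdf(x, loc=mu[0], scale=sigma[0]) *
           st.norm.pdf(y, loc=mu[1], scale=sigma[1]))

print(joint, product)  # the two values agree up to rounding error
```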

On the other hand, and more interestingly, the variables may be correlated, in the sense that the result of one of the observations affects the probabilities for the other. For example, we expect the weight and height of a person to be correlated. In this case, we will be interested in knowing how strong the correlation is, and perhaps in using one of the variables to make predictions about the other.
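As a sketch of what correlation looks like in practice, we can draw from a bivariate Normal distribution with a non-diagonal covariance matrix and then estimate the strength of the relationship from the sample. The correlation value 0.8, the sample size, and the random_state seed below are assumptions made for illustration. The straight-line fit is just one simple way to use one variable to predict the other.

```python
import numpy as np
import scipy.stats as st

rho = 0.8  # assumed correlation between the two variables

# Unit variances on the diagonal, so the off-diagonal entry is the correlation.
cov = [[1.0, rho],
       [rho, 1.0]]

# Draw a reproducible sample; each row is one (Z1, Z2) observation.
sample = st.multivariate_normal.rvs(mean=[0, 0], cov=cov,
                                    size=5000, random_state=42)

# Sample correlation coefficient between the two columns.
sample_corr = np.corrcoef(sample, rowvar=False)[0, 1]

# Least-squares line predicting Z2 from Z1.
slope, intercept = np.polyfit(sample[:, 0], sample[:, 1], deg=1)

print(sample_corr)        # close to rho
print(slope, intercept)   # slope close to rho, intercept close to 0
```

With unit variances, the slope of the fitted line is an estimate of the correlation itself, which is why both printed values land near 0.8.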

The multivariate Normal distribution is also part of the scipy.stats package. For now, we will just see how to generate random variates according to a bivariate Normal distribution. We will look at multivariate distributions in detail later in the book. Let's run the following code to generate random variates:

import scipy.stats as st
from pandas import DataFrame

binorm_variates = st.multivariate_normal.rvs(mean=[0,0], size=300)
df = DataFrame(binorm_variates, columns=['Z1', 'Z2'])
df.head(10)

In this code, we use the multivariate_normal.rvs() function to generate a sample of size 300 from a bivariate Normal distribution with mean vector [0, 0]. Since we do not pass a covariance, the default is used, which is the identity matrix: both variables have unit variance and are uncorrelated. We then convert the resulting NumPy array to a pandas DataFrame and display its first 10 rows. We can now create a scatterplot of the variates using the following lines of code:

import matplotlib.pyplot as plt

df.plot(kind='scatter', x='Z1', y='Z2')
plt.title('Bivariate Normal Distribution')
plt.axis([-4,4,-4,4]);

Here, we are using the plot() method of the df object. The kind='scatter' option produces a scatterplot, and the x and y options specify which columns are plotted on each axis. After that, we set the title of the plot and adjust the axis ranges so that both axes run from -4 to 4.
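Because the sample is drawn with the default covariance, the point cloud should look roughly circular, with no visible trend. The same fact can be checked numerically, as in the sketch below. The larger sample size and the random_state seed are assumptions added here so that the estimate is stable and reproducible.

```python
import numpy as np
import scipy.stats as st

# Same call as in the text, but with a larger sample and a fixed seed
# (both are assumptions made for this check).
sample = st.multivariate_normal.rvs(mean=[0, 0], size=5000, random_state=0)

# Each row is one observation; each column is one of the two variables.
print(sample.shape)

# Sample covariance matrix, treating columns as variables.
sample_cov = np.cov(sample, rowvar=False)
print(sample_cov)  # approximately the 2x2 identity matrix
```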
