Where do models come from?

In this section, we will consider what is perhaps the most important practical point about models. How are models conceived and how do we know what is the right model to use in a given situation?

These are not simple questions, and the process of designing and choosing appropriate models is as much an art as a science. At the risk of oversimplifying, we could say that probabilistic models can come from two sources:

  • In a priori models, the researcher considers the relevant factors, identifies important quantities and relationships, and creates a description that fits the problem being considered.
  • In limit models, the researcher attempts to find an approximation to a model that is too complex, either conceptually or computationally.

In both cases, the resulting model may take several different forms. It can be, for example, a mathematical formula, a simulation, or an algorithm. In every case, the model must be validated against real data once the experiments or observations have been carried out.

The most important point to emphasize is that all models have assumptions. These establish the boundaries within which the model is valid. Identifying the assumptions that are made is probably the first important step in successful model selection.

As examples, we will consider two models with both historical and practical importance: the Binomial distribution and the de Moivre approximation.

The Binomial distribution describes a discrete random variable. It was originally conceived in the context of gambling but applies to many other situations. Here are the assumptions that underlie the model:

  • A series of N trials is conducted, where each trial can only have one of two results. We will call the two possible outcomes 0 and 1 (failure and success).
  • Each trial has a known probability p of having the outcome 1 and a corresponding probability 1-p of having the outcome 0.
  • The trials are independent, in the sense that the result of each trial is not affected by the results of the other trials.
  • The variable being observed is the number of outcomes equal to 1 in the N trials.
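The four assumptions above can be turned directly into a small simulation. The following is only a sketch using the Python standard library: the function name simulate_binomial, the seed, and the values of 20 trials and p = 0.5 are illustrative choices, not part of the model itself.

```python
import random

def simulate_binomial(n_trials, p_success, rng):
    """Count the outcomes equal to 1 in n_trials independent two-outcome trials."""
    # Each trial yields 1 with probability p_success and 0 otherwise.
    return sum(1 for _ in range(n_trials) if rng.random() < p_success)

rng = random.Random(12345)  # arbitrary fixed seed, used only for reproducibility
samples = [simulate_binomial(20, 0.5, rng) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)
print(min(samples), max(samples), sample_mean)
```

Every simulated value is an integer between 0 and the number of trials, and the average over many repetitions should land close to the number of trials multiplied by p.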

It is easy to see why this model appeals to gamblers: in a game of chance, one can either win, represented by 1, or lose, represented by 0. The probability of each outcome is known. The gambler plays the game repeatedly and is interested in knowing how much money will be made, which is determined by the number of wins.

In a more practical application, we can think of a quality control system. In this case, 1 represents a good item and 0 represents a defective item. We know the probability p of an item being good and want to know the variability of the number of good items when N items are produced.

To be concrete, let's say that we are playing a fair game in which the probabilities of winning and losing are both 0.5. We assume that 20 games are played. The Binomial distribution is also part of the scipy.stats module, and we can create an object representing the distribution with the following code:

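import scipy.stats as st  # not needed if already imported earlier in the chapter

N = 20   # number of games played
p = 0.5  # probability of winning a single game
rv_binom = st.binom(N, p)  # object representing the Binomial(N, p) distribution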

As the game is played 20 times, the number of wins in the series is an integer between 0 and 20. To compute the probability that we win 12 out of 20 games, for example, we use the probability mass function, as indicated by the following expression:

rv_binom.pmf(12) 

Evaluating this code, we conclude that the probability of winning exactly 12 out of 20 games is about 0.12 or 12%.
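As a cross-check, the same value can be computed from the standard Binomial formula, P(k) = C(N, k) p^k (1-p)^(N-k). This is a minimal sketch using only the standard library; the helper name binomial_pmf is ours and is not part of scipy.stats.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k outcomes equal to 1 in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

prob_12_wins = binomial_pmf(12, 20, 0.5)
print(round(prob_12_wins, 4))  # 0.1201
```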

In many cases, we are not interested in the probability of an exact number of wins but in the probability of a range. For example, let's suppose that on a particular day, we win only 7 out of 20 games and wonder if we have been cheated. One way to assess this is to compute the probability of winning seven or fewer games. If this probability is small, we have reason to doubt the assumption that the game is fair. To compute this probability, we need the cumulative distribution function, which can be computed with the following line of code:

rv_binom.cdf(7) 

The result shows that this event happens with a probability of approximately 0.13, that is, 13% of the time. This is not such a small number, so we would expect that, from time to time, we would actually win only 7 of the 20 games, even in a fair game. So, there does not seem to be reason to suspect cheating, at least in this isolated case.
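The cumulative probability is simply the sum of the pmf over all outcomes up to and including 7. The following sketch checks this with the standard library alone; the helper name binomial_cdf is an illustrative choice.

```python
from math import comb

def binomial_cdf(k, n, p):
    """Probability of at most k outcomes equal to 1 in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

prob_at_most_7 = binomial_cdf(7, 20, 0.5)
print(round(prob_at_most_7, 4))  # 0.1316
```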

To get an idea of how the distribution behaves, let's make plots of the cdf and pmf. This can be done with the following code:

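import numpy as np
import matplotlib.pyplot as plt

xx = np.arange(N+1)  # possible numbers of wins: 0, 1, ..., N
cdf = rv_binom.cdf(xx)
pmf = rv_binom.pmf(xx)
xvalues = np.arange(N+1)
plt.figure(figsize=(9,3.5))
plt.subplot(1,2,1)
# where='post' holds each cdf value until the next integer is reached
plt.step(xvalues, cdf, where='post', lw=2, color='brown')
plt.grid(lw=1, ls='dashed')
plt.title('Binomial cdf, $N=20$, $p=0.5$', fontsize=16)
plt.subplot(1,2,2)
left = xx - 0.5  # left edge of each bar, so that bars are centered on the integers
# align='edge' makes bar() treat the first argument as left edges
plt.bar(left, pmf, 1.0, align='edge', color='CornflowerBlue')
plt.title('Binomial pmf, $N=20$, $p=0.5$', fontsize=16)
plt.axis([-0.5, 20.5, 0, .18]);  # include the full bars at 0 and 20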

This code is similar to the methods that we used in the graphs previously displayed in this chapter, but we repeat it here as you might find it useful to have a template for plotting discrete distributions. For the cdf, we simply plot the xvalues and cdf arrays; the latter contains the stairs of the cdf, taken from the cdf method of the rv_binom object. As we want it to be displayed as such (that is, a staircase), we use the step function to plot it. For the pmf, we use a slightly different approach, the bar() function, which is a matplotlib function that draws a generic bar chart. The arguments for this function are two arrays containing the left coordinate and height of each bar and a number specifying the width of the bars. The plots are displayed in the following figure:

[Figure: cdf (left) and pmf (right) of the Binomial distribution with N=20 and p=0.5]

There is an important point to notice here: the pmf is a function that is defined only for the integer values from 0 to 20. However, for visualization purposes, instead of just plotting the discrete points, we plot bars of width one. Each bar is centered at the integer value that corresponds to the number of wins. With these choices, the probabilities in the pmf correspond to the areas of the bars, which is the same interpretation as for the pdf of a continuous distribution.

You probably noticed that there is a striking similarity between the preceding plots and the Normal distribution. The French mathematician de Moivre was the first to notice that the plot of the pmf for the Binomial distribution approximates a smooth curve if the number of trials N is large. He realized that, if he were able to find a formula for this curve, he would have a simpler way to calculate binomial probabilities for a large number of trials. He was able to find the formula for the curve and, using this formula, he computed binomial probabilities for 3,600 trials, a remarkable feat at the time. This was the birth of the Normal distribution.

To understand the de Moivre approximation, we must first compute the mean and standard deviation of the Binomial distribution. We will do this in two ways. First, let's use the functions provided in scipy.stats, as indicated in the following code:

mean = rv_binom.mean() 
std = rv_binom.std() 
print(mean, std) 

Here, we are simply calling the mean() and std() methods to compute the mean and standard deviation. An alternative approach is to use the theoretical formulas, as shown in the following code:

mean = N * p 
std = np.sqrt(N * p * (1 - p)) 
print(mean, std) 

Either way, we get 10.0 for the mean and approximately 2.236 for the standard deviation. The de Moivre approximation theorem can be informally stated as follows:

For large N, a Binomial distribution is approximated by a Normal distribution with the same mean and standard deviation.

The following figure shows the pmf of the Binomial distribution superimposed with a plot of the Normal distribution with the same mean and standard deviation. It can be seen that the agreement is remarkable:

[Figure: pmf of the Binomial distribution with the Normal distribution of the same mean and standard deviation superimposed]
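The agreement can also be checked numerically. The following sketch compares the Binomial pmf for 20 trials and p = 0.5 with the Normal density of the same mean and standard deviation at every integer from 0 to 20. It uses only the standard library; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n_trials, p_win = 20, 0.5
mean = n_trials * p_win
std = sqrt(n_trials * p_win * (1 - p_win))
normal = NormalDist(mu=mean, sigma=std)  # Normal with the same mean and std

# Largest absolute difference between the pmf and the Normal density
max_gap = 0.0
for k in range(n_trials + 1):
    pmf_k = comb(n_trials, k) * p_win**k * (1 - p_win)**(n_trials - k)
    max_gap = max(max_gap, abs(pmf_k - normal.pdf(k)))
print(max_gap)
```

Even for only 20 trials, the largest difference is a small fraction of the peak value of the pmf, which is consistent with the visual agreement in the figure.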
